I wrote a method to read information from an S3 object.

It looks fine to me [1].

There are multiple records in the S3 object; what's the best way to read all the lines?

Your code should read all of the lines.

Does it only read the first line of the object?

No, it should read all of the lines [2]. The while loop reads until readLine() returns null, and that only happens when you reach the end of the stream.

How can I make sure all the lines are read?

If you are getting fewer lines than you expect, EITHER the S3 object contains fewer lines than you think, OR something is causing the object stream to close prematurely.

For the former, count the lines as you read them and compare that with the expected line count.
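That counting can be sketched with a small helper (the LineCount class name and the UTF-8 charset are my assumptions, not from the answer) that counts lines from any InputStream, such as the one returned by getObjectContent():

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LineCount {
    /**
     * Counts lines in the given stream. Try-with-resources guarantees the
     * stream is closed even if reading fails partway through.
     */
    public static long countLines(InputStream in) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```

Compare the returned count against the line count you expect for the object; a mismatch points at the "stream closed prematurely" case below.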

The latter could be due to a timeout when reading a very large file. See "How to read file chunk by chunk from S3 using aws-java-sdk" for some ideas on how to deal with that problem.


[1] Actually, it would be better if you used try-with-resources to ensure that the S3 stream is always closed. But that won't cause you to "lose" lines.
[2] This assumes that the S3 service doesn't time out the connection, and that you are not requesting a part (chunk) or a range in the URI request parameters; see https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html.

Answer from Stephen C on Stack Overflow
Top answer (1 of 6, score 10):

My usual approach (InputStream -> BufferedReader.lines() -> batches of lines -> CompletableFuture) won't work here because the underlying S3ObjectInputStream times out eventually for huge files.

So I created a new class, S3InputStream, which doesn't care how long it stays open and reads byte blocks on demand using short-lived AWS SDK calls. You provide a byte[] that will be reused; new byte[1 << 24] (16 MiB) appears to work well.

package org.harrison;

import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

/**
 * An {@link InputStream} for S3 files that does not care how big the file is.
 *
 * @author stephen harrison
 */
public class S3InputStream extends InputStream {
    private static class LazyHolder {
        private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    }

    private final String bucket;
    private final String file;
    private final byte[] buffer;
    private long lastByteOffset;

    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public S3InputStream(final String bucket, final String file, final byte[] buffer) {
        this.bucket = bucket;
        this.file = file;
        this.buffer = buffer;
        this.lastByteOffset = LazyHolder.S3.getObjectMetadata(bucket, file).getContentLength() - 1;
    }

    @Override
    public int read() throws IOException {
        if (next >= length) {
            fill();

            if (length <= 0) {
                return -1;
            }

            next = 0;
        }

        if (next >= length) {
            return -1;
        }

        // Mask to 0..255 per the InputStream contract: returning the raw
        // (signed) byte would make any 0xFF byte look like end-of-stream.
        return buffer[next++] & 0xFF;
    }

    private void fill() throws IOException {
        // '>' rather than '>=': offset == lastByteOffset means exactly one
        // byte remains, which must still be fetched.
        if (offset > lastByteOffset) {
            length = -1;
        } else {
            try (final InputStream inputStream = s3Object()) {
                length = 0;
                int b;

                while ((b = inputStream.read()) != -1) {
                    buffer[length++] = (byte) b;
                }

                if (length > 0) {
                    offset += length;
                }
            }
        }
    }

    private InputStream s3Object() {
        final GetObjectRequest request = new GetObjectRequest(bucket, file).withRange(offset,
                offset + buffer.length - 1);

        return LazyHolder.S3.getObject(request).getObjectContent();
    }
}
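To consume such a stream line by line you can wrap it in a BufferedReader. The ReadAllLines helper below is a sketch (the class name and UTF-8 charset are my assumptions); in practice the InputStream argument would be something like new S3InputStream(bucket, key, new byte[1 << 24]):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReadAllLines {
    /**
     * Reads every line from the stream into a list; the reader (and the
     * wrapped stream) are closed by try-with-resources when done.
     */
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Collecting into a List only makes sense if the object fits in memory; for huge files, process each line inside the loop instead.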
Answer 2 of 6 (score 3):

The aws-java-sdk already provides streaming functionality for your S3 objects: call getObject, then getObjectContent() on the result to obtain an InputStream.

1) AmazonS3Client.getObject(GetObjectRequest getObjectRequest) -> S3Object

2) S3Object.getObjectContent()

Note: The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.

(from the AWS Java SDK docs)
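The "close this input stream as soon as possible" advice can be sketched with try-with-resources. The DrainAndClose and TrackingStream classes below are illustrative stand-ins of my own (not SDK types); with the real SDK, the S3ObjectInputStream from getObjectContent() would take the place of the wrapped stream:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainAndClose {
    /** Wrapper that records whether close() was called, for verification. */
    public static class TrackingStream extends FilterInputStream {
        public boolean closed = false;

        public TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /**
     * Reads the stream to the end, returning the byte count. The
     * try-with-resources block closes the stream even if reading throws,
     * so the underlying HTTP connection is always released.
     */
    public static int drain(InputStream in) throws IOException {
        int total = 0;
        try (InputStream s = in) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = s.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```

Draining fully before closing also avoids the SDK's "Not all bytes were read from the S3ObjectInputStream" warning; if you only need part of the object, request just that range instead.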
