FullStack Me

Curiosity driven journal of perfecting a comprehensive and mindful living

Reading Lined JSON files with Java 8

7 March 2018

I recently came across a task that seemed trivial at first glance: parsing a relatively well-formatted data feed. Sure, you may say, what could be easier than parsing JSON, given that there are plenty of tools for that, especially in Java? Well, sorry, not exactly JSON... Unlike the other unstructured data sources I had previously worked with, this feed used a lined JSON format (i.e. IJSON), where every line is a standalone JSON object. Example:

{"id": "us-cia-world-leaders.bc0...", "type": "individual", ...}
{"id": "us-cia-world-leaders.924...", "type": "entity", ...}
{...}

Even though this format is widely used, a mainstream JSON parser such as Jackson cannot read the whole file as a single document, since as a whole it is not valid JSON. Looks like we have a little problem here?

Tackling IJSON with Java

A quick solution is to simply read the lined JSON file line by line and transform each line to a POJO entry. Combined with streamed input readers, the lined JSON format turned out to be more efficient to process than “classic” JSON, simply because we no longer need to preload the entire structure into memory and then transform it. With files of 30 MB and more, the performance benefit is noticeable.

The code snippet below illustrates how this can be achieved:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.stream.Stream;

/**
 * Simple streamed reader to go through Lined JSON files, convert each line to a POJO entry
 * and perform a specified action on every row.
 * @author Vladimir Salin
 */
public class LineBasedJsonReader {

    private static final Logger log = LoggerFactory.getLogger(LineBasedJsonReader.class);
    private final ObjectMapper objectMapper;

    public LineBasedJsonReader(ObjectMapper objectMapper) {
        this.objectMapper = objectMapper;
    }

    /**
     * Parses a provided input in a streamed way. Converts each line in it
     * (which is supposed to be a JSON object) to a specified POJO class
     * and performs an action provided as a Java 8 Consumer.
     *
     * @param stream lined JSON input, assumed to be UTF-8 encoded; not closed by this method
     * @param entryClass POJO class to convert JSON to
     * @param consumer action to perform on each entry
     * @param <T> type of the POJO entries
     * @return number of rows successfully parsed
     */
    public <T> int parseAsStream(final InputStream stream, final Class<T> entryClass, final Consumer<? super T> consumer) {
        long start = System.currentTimeMillis();

        final AtomicInteger total = new AtomicInteger(0);
        final AtomicInteger failed = new AtomicInteger(0);

        // Decode explicitly as UTF-8 (an assumption about the feed) instead of using the platform default.
        // Closing the Stream returned by lines() does not close the underlying InputStream.
        try (Stream<String> lines = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8)).lines()) {
            lines
                    .map(line -> {
                        // 1-based number of the line being parsed, used for error reporting
                        final int lineNumber = total.incrementAndGet();
                        try {
                            return objectMapper.readValue(line, entryClass);
                        } catch (IOException e) {
                            log.error("Failed to parse line {}. Reason: {}", lineNumber, e.getMessage());
                            log.debug("Stacktrace: ", e);
                            failed.incrementAndGet();
                            return null;
                        }
                    })
                    // Drop the lines that failed to parse
                    .filter(Objects::nonNull)
                    .forEach(consumer);
        }
        long took = System.currentTimeMillis() - start;
        log.info("Parsed {} lines with {} failures. Took {}ms", total.get(), failed.get(), took);

        return total.get() - failed.get();
    }
}

As you can see, we simply need to pass a source as an InputStream, the POJO class each JSON line should be mapped to, and a Java 8 Consumer to act on each parsed row, and that’s it. The above is just a simple snippet for illustrative purposes. In a production environment, one should add more robust error handling.
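
The snippet above only logs the lines that fail to parse. One possible step towards more robust error handling is to hand the failures back to the caller, so it can decide what to do with them. Here is a rough sketch of that idea. The names TolerantLineReader, LineParser and Failure are hypothetical, and LineParser merely stands in for the actual Jackson call so that the example needs nothing but the JDK. Skipping blank lines is also an assumption about the feed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical variant of the reader that returns parse failures instead of only logging them. */
class TolerantLineReader {

    /** Stand-in for a real JSON parser call (e.g. Jackson); may throw on malformed input. */
    interface LineParser<T> {
        T parse(String line) throws Exception;
    }

    /** One failed line: its 1-based number and the reason reported by the parser. */
    static final class Failure {
        final int lineNumber;
        final String reason;

        Failure(int lineNumber, String reason) {
            this.lineNumber = lineNumber;
            this.reason = reason;
        }
    }

    /** Feeds every successfully parsed line to the consumer and returns the list of failures. */
    static <T> List<Failure> readAll(InputStream in, LineParser<T> parser, Consumer<? super T> consumer) {
        List<Failure> failures = new ArrayList<>();
        // UTF-8 is an assumption about the feed
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        int lineNumber = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.trim().isEmpty()) {
                    continue; // blank lines are skipped rather than counted as failures
                }
                T entry;
                try {
                    entry = parser.parse(line);
                } catch (Exception e) {
                    failures.add(new Failure(lineNumber, String.valueOf(e.getMessage())));
                    continue;
                }
                // Exceptions thrown by the consumer propagate: they are not parse failures
                consumer.accept(entry);
            }
        } catch (IOException e) {
            // A broken input stream is fatal, unlike a single malformed line
            throw new UncheckedIOException("Reading failed after line " + lineNumber, e);
        }
        return failures;
    }
}
```

With Jackson, the parser argument could be something like line -> objectMapper.readValue(line, Entry.class), where Entry is your own POJO class.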

So why Lined JSON?

Indeed, with all these JSON parsing tools around, why the heck would someone decide to go with Lined JSON? Is there anything fancy about writing every single line as a JSON-y object?

Actually, yes, it is fancy. Just think of it for a second -- you read the line and get a valid JSON object. Let me put it this way: you load just one line into memory and get a valid JSON object you can work with in your code. Another line -- another object. Worked with it, released from memory, going next. And this is how you proceed through the entire file, no matter how long it is.

Just imagine a huge JSON array weighing a good couple of hundred MB. Reading it in full the straightforward way would take quite a chunk of memory. The lined JSON approach lets you iterate through the file line by line while spending just a little of your precious memory. For sure, in some cases we need the whole thing loaded, but in others it's just fine to go one by one. So, lesson learned: another convenient data format to use and to handle!
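
To make the one-line-at-a-time idea a bit more concrete, here is a small sketch that counts entries by their "type" field while keeping only the running totals in memory. TypeCounter is a made-up name, and the regular expression is just a naive stand-in for a real JSON parser, used here so that the example stays JDK-only.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example: aggregates a lined JSON source without ever holding the whole file. */
class TypeCounter {

    // Naive stand-in for a JSON parser: captures the value of a "type" string field
    private static final Pattern TYPE = Pattern.compile("\"type\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns how many lines carry each "type" value; lines without the field are ignored. */
    static Map<String, Integer> countTypes(Reader source) {
        Map<String, Integer> counts = new TreeMap<>();
        // lines() is lazy: a line is read, handled by the lambda, and then becomes garbage
        new BufferedReader(source).lines().forEach(line -> {
            Matcher matcher = TYPE.matcher(line);
            if (matcher.find()) {
                counts.merge(matcher.group(1), 1, Integer::sum);
            }
        });
        return counts;
    }
}
```

Whatever the size of the input, the only state that grows is the map of counters, one entry per distinct type.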
