Compare two large JSON files in Python

I am trying to compare two files where each line is in JSON format. I need to compare each line between the two files and return the differences. Since the files are very large, I am unable to read and compare every line at once. Please suggest an optimized way of doing this.


wonea


asked Jul 23, 2013 at 5:26


Two possible ways:

  1. Using the technique mentioned in the comment posted by Josh.
  2. Using the technique mentioned here: how to compare 2 json in python.

Given that you have a large file, you are better off using the difflib technique described in point 1.
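
For example, since each line is a JSON document, one option (a minimal sketch, not taken from the linked comment; file names are hypothetical) is to walk both files in parallel and compare the parsed objects line by line, so that neither file is ever fully loaded into memory:

import json
from itertools import zip_longest

def diff_json_lines(path_a, path_b):
    # Assumes every non-missing line in each file is a complete JSON document.
    with open(path_a) as fa, open(path_b) as fb:
        for lineno, (a, b) in enumerate(zip_longest(fa, fb), start=1):
            if a is None or b is None:
                print(f"line {lineno}: present in only one file")
            elif json.loads(a) != json.loads(b):
                print(f"line {lineno} differs")

diff_json_lines("file1.jsonl", "file2.jsonl")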

Edit, based on a response to my answer below:

After some research, it appears that the best way to deal with a large data payload is to process it in a streamed manner. This way we ensure speedy processing of the data while keeping memory usage, and the performance of the software in general, under control.

Refer to this link, which talks about streaming JSON data objects using Python. Similarly, take a look at ijson, an iterator-based JSON parsing/processing library for Python.

Hopefully, this helps you identify a library that is a good fit for your use case.

answered Jul 23, 2013 at 5:58


This seems to be a pretty solid start: https://github.com/ZoomerAnalytics/jsondiff

$ pip install jsondiff

>>> from jsondiff import diff
>>> diff({'a': 1, 'b': 2}, {'b': 3, 'c': 4}, syntax='symmetric')
{insert: {'c': 4}, 'b': [2, 3], delete: {'a': 1}}
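
If the two documents live on disk, a minimal sketch might look like this (file names are hypothetical; note that this parses both documents fully into memory, so it suits moderately sized files):

import json
from jsondiff import diff

with open("old.json") as f1, open("new.json") as f2:
    print(diff(json.load(f1), json.load(f2), syntax='symmetric'))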

I'm also going to try it out for a current project; I'll try to post updates and edits as I go along.

answered Oct 24, 2018 at 15:05

mjfred


If you need to process a large JSON file in Python, it’s very easy to run out of memory. Even if the raw data fits in memory, the Python representation can increase memory usage even more.

And that means either slow processing, as your program swaps to disk, or crashing when you run out of memory.

One common solution is streaming parsing, aka lazy parsing, iterative parsing, or chunked processing. Let’s see how you can apply this technique to JSON processing.

The problem: Python’s memory-inefficient JSON loading

For illustrative purposes, we’ll be using this JSON file, large enough at 24MB that it has a noticeable memory impact when loaded. It encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:

[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]

Our goal is to figure out which repositories a given user interacted with. Here’s a simple Python program that does so:

import json

with open("large-file.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)

The result is a dictionary mapping usernames to sets of repository names.

When we run this with the Fil memory profiler, here’s what we get:

Looking at peak memory usage, we see two main sources of allocation:

  1. Reading the file.
  2. Decoding the resulting bytes into Unicode strings.

And if we look at the implementation of the json module in Python, we can see that json.load() just loads the whole file into memory before parsing!

def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    a JSON document) to a Python object.
    ...
    """
    return loads(fp.read(), ...)

So that’s one problem: just loading the file will take a lot of memory. In addition, there should be some memory usage from creating the Python objects. However, in this case that usage doesn’t show up at all, probably because peak memory is dominated by loading the file and decoding it from bytes to Unicode. That’s why actual profiling is so helpful in reducing memory usage and speeding up your software: the real bottlenecks might not be obvious.

Even if loading the file is the bottleneck, that still raises some questions. The original file we loaded is 24MB. Once we load it into memory and decode it into a text (Unicode) Python string, it takes far more than 24MB. Why is that?

A brief digression: Python’s string memory representation

Python’s string representation is optimized to use less memory, depending on what the string contents are. First, every string has a fixed overhead. Then, if the string can be represented as ASCII, only one byte of memory is used per character. If the string uses more extended characters, it might end up using as many as 4 bytes per character. We can see how much memory an object needs using sys.getsizeof():

>>> import sys
>>> s = "a" * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
1049

>>> s2 = "❄" + "a" * 999
>>> len(s2)
1000
>>> sys.getsizeof(s2)
2074

>>> s3 = "💵" + "a" * 999
>>> len(s3)
1000
>>> sys.getsizeof(s3)
4076

Notice how all 3 strings are 1000 characters long, but they use different amounts of memory depending on which characters they contain.

If you look at our large JSON file, it contains characters that don’t fit in ASCII. Because it’s loaded as one giant string, that whole giant string uses a less efficient memory representation.

A streaming solution

It’s clear that loading the whole JSON file into memory is a waste of memory. With a larger file, it would be impossible to load at all.

Given a JSON file that’s structured as a list of objects, we could in theory parse it one chunk at a time instead of all at once. The resulting API would probably allow processing the objects one at a time. And if we look at the algorithm we want to run, that’s just fine; the algorithm does not require all the data be loaded into memory at once. We can process the records one at a time.

Whatever term you want to describe this approach—streaming, iterative parsing, chunking, or reading on-demand—it means we can reduce memory usage to:

  1. The in-progress data, which should typically be fixed in size.
  2. The result data structure, which in our case shouldn’t be too large.

There are a number of Python libraries that support this style of JSON parsing; in the following example, I used the ijson library.

import ijson

user_to_repos = {}

with open("large-file.json", "rb") as f:
    for record in ijson.items(f, "item"):
        user = record["actor"]["login"]
        repo = record["repo"]["name"]
        if user not in user_to_repos:
            user_to_repos[user] = set()
        user_to_repos[user].add(repo)

In the previous version, using the standard library, once the data is loaded we no longer need to keep the file open. With this API the file has to stay open, because the JSON parser is reading from it on demand as we iterate over the records.

The items() API takes a query string that tells it which parts of the document to return. In this case, "item" just means “each item in the top-level list we’re iterating over”; see the ijson documentation for more details.
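
For instance, if we only needed the usernames, a narrower prefix should (as I read ijson’s dotted-prefix syntax) yield just those values instead of whole records:

import ijson

with open("large-file.json", "rb") as f:
    # "item.actor.login" walks into each top-level object and yields only
    # the login string, rather than the whole record.
    for login in ijson.items(f, "item.actor.login"):
        print(login)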

Here’s what memory usage looks like with this approach:

When it comes to memory usage, problem solved! And as far as runtime performance goes, the streaming/chunked solution with ijson actually runs slightly faster, though this won’t necessarily be the case for other datasets or algorithms.

Other approaches

As always, there are other solutions you can try:

  • Pandas: Pandas has the ability to read JSON and, in theory, it could do so in a more memory-efficient way for certain JSON layouts. In practice, for this example at least, peak memory was much worse, at 287MB, not including the overhead of importing Pandas.
  • SQLite: The SQLite database can parse JSON, store JSON in columns, and query JSON (see the documentation). One could therefore load the JSON into a disk-backed database file and run queries against it to extract only the relevant subset of the data; see the sketch after this list. I haven’t measured this approach, but if you need to run multiple queries against the same JSON file, it might be a good path going forward; you can add indexes, too.
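
As a rough sketch of the SQLite route (the table and column names are my own, and it assumes your SQLite build includes the JSON1 functions such as json_extract()):

import json
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (data TEXT)")

# One-time load; for a very large file this pass could itself use a
# streaming parser instead of json.load().
with open("large-file.json", "r") as f:
    records = json.load(f)
conn.executemany(
    "INSERT INTO events (data) VALUES (?)",
    ((json.dumps(r),) for r in records),
)
conn.commit()

# Let SQLite do the JSON parsing and return only the fields we care about.
for user, repo in conn.execute(
    "SELECT json_extract(data, '$.actor.login'), "
    "json_extract(data, '$.repo.name') FROM events"
):
    print(user, repo)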

Finally, if you have control over the output format, there are ways to reduce the memory usage of JSON processing by switching to a more efficient representation. For example, you can switch from a single giant JSON list of objects to one JSON record per line, which means every decoded JSON record will only use a small amount of memory.
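
For instance, with a newline-delimited file (hypothetically named large-file.jsonl here), the standard library alone keeps only one decoded record in memory at a time:

import json

user_to_repos = {}
with open("large-file.jsonl", "r") as f:
    for line in f:
        record = json.loads(line)  # only this record is decoded right now
        user_to_repos.setdefault(record["actor"]["login"], set()).add(
            record["repo"]["name"]
        )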

How do I compare large JSON files?

Online JSON compare tools let you do this interactively: paste the original JSON data into the block on the left and the new/modified data into the block on the right, then click the Compare button to view a table of all the differences found, node by node.

How do I compare two JSON files in Python?

Comparing parsed JSON in Python is quite simple: we can use the == operator. Note that == and is are not the same: the == operator checks equality of values, whereas the is operator checks reference (identity) equality. Use the == operator here; the is operator will not give the expected result.
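
A minimal sketch (file names are hypothetical; note that this parses both files fully into memory, so it only suits files that fit in RAM):

import json

with open("a.json") as f1, open("b.json") as f2:
    print(json.load(f1) == json.load(f2))  # True if the parsed values are equal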

How do I compare two files in Python?

Approach (sketched in code below):

  1. Open both files in read mode.
  2. Store each file's lines as a list of strings.
  3. Find the common lines using the intersection() method.
  4. Compare both files for differences using a loop.
  5. Close both files.
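
A sketch of that approach (file names are hypothetical; it holds both files' lines in memory, so it suits smaller files):

with open("file1.txt") as f1, open("file2.txt") as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

common = set(lines1).intersection(lines2)  # lines present in both files
print("Common lines:", len(common))

# Lines present in one file but not the other
for line in lines1:
    if line not in common:
        print("Only in file1:", line.rstrip())
for line in lines2:
    if line not in common:
        print("Only in file2:", line.rstrip())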

How do I handle a large JSON file?

There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular GSON library (for Java). It gives the effect of parsing the file both as a stream and as an object model: it handles each record as it passes through the stream, then discards it, keeping memory usage low.