I am trying to compare two files where each line is in JSON format. I need to compare the two files line by line and return the differences. Since the files are too big, I am unable to read and compare every line in memory. Please suggest an optimised way of doing this.
wonea
asked Jul 23, 2013 at 5:26
Two possible ways :
- Using the technique mentioned in the comment posted by Josh.
- Using the technique mentioned here : how to compare 2 json in python.
Given that you have a large file, you are better off using the difflib technique described in point 1.
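As a concrete sketch of a memory-friendly line-by-line comparison (this streams both files pairwise with itertools.zip_longest rather than difflib, and the diff_json_lines name is just illustrative):

```python
import json
from itertools import zip_longest

def diff_json_lines(path_a, path_b):
    """Compare two files with one JSON object per line, streaming both
    so memory stays flat. Yields (line_number, obj_a, obj_b) for lines
    whose parsed JSON differs."""
    with open(path_a) as fa, open(path_b) as fb:
        for i, (la, lb) in enumerate(zip_longest(fa, fb), start=1):
            # A None line means one file is shorter than the other.
            obj_a = json.loads(la) if la is not None else None
            obj_b = json.loads(lb) if lb is not None else None
            if obj_a != obj_b:
                yield i, obj_a, obj_b
```

Because each line is parsed before comparing, two lines that encode the same object with different key order still count as equal.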
Edit, based on the response to my answer below:
After some research, it appears that the best way to deal with large data payloads is to process the payload as a stream. This keeps processing fast while keeping memory usage, and the performance of the software in general, under control.
Refer to this link that talks about Streaming JSON data objects using Python. Similarly take a look at ijson - this is an iterator based JSON parsing/processing library in python.
Hopefully, this helps you identify a library that is a good fit for your use case.
answered Jul 23, 2013 at 5:58
This seems to be a pretty solid start: https://github.com/ZoomerAnalytics/jsondiff
$ pip install jsondiff
>>> from jsondiff import diff
>>> diff({'a': 1, 'b': 2}, {'b': 3, 'c': 4}, syntax='symmetric')
{insert: {'c': 4}, 'b': [2, 3], delete: {'a': 1}}
I'm also going to try it out for a current project, I'll try to maintain updates and edits as I go along.
answered Oct 24, 2018 at 15:05
mjfred
If you need to process a large JSON file in Python, it’s very easy to run out of memory. Even if the raw data fits in memory, the Python representation can increase memory usage even more.
And that means either slow processing, as your program swaps to disk, or crashing when you run out of memory.
One common solution is streaming parsing, aka lazy parsing, iterative parsing, or chunked processing. Let’s see how you can apply this technique to JSON processing.
The problem: Python’s memory-inefficient JSON loading
For illustrative purposes, we’ll be using this JSON file, large enough at 24MB that it has a noticeable memory impact when loaded. It encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:
[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]
Our goal is to figure out which repositories a given user interacted with. Here’s a simple Python program that does so:
import json

with open("large-file.json", "r") as f:
    data = json.load(f)

user_to_repos = {}

for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
The result is a dictionary mapping usernames to sets of repository names.
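The same grouping can be written a little more compactly with collections.defaultdict, which removes the membership check; a sketch, using made-up records shaped like the events above:

```python
from collections import defaultdict

# Hypothetical records shaped like the GitHub events above.
data = [
    {"actor": {"login": "alice"}, "repo": {"name": "alice/tools"}},
    {"actor": {"login": "alice"}, "repo": {"name": "alice/blog"}},
]

# defaultdict(set) creates the empty set on first access to a new user.
user_to_repos = defaultdict(set)
for record in data:
    user_to_repos[record["actor"]["login"]].add(record["repo"]["name"])
```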
When we run this with the Fil memory profiler, here’s what we get:
Looking at peak memory usage, we see two main sources of allocation:
- Reading the file.
- Decoding the resulting bytes into Unicode strings.
And if we look at the implementation of the json module in Python, we can see that json.load() just loads the whole file into memory before parsing!
def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    a JSON document) to a Python object.
    ...
    """
    return loads(fp.read(), ...)
So that’s one problem: just loading the file will take a lot of memory. In addition, there should be some usage from creating the Python objects. However, in this case they don’t show up at all, probably because peak memory is dominated by loading the file and decoding it from bytes to Unicode. That’s why actual profiling is so helpful in reducing memory usage and speeding up your software: the real bottlenecks might not be obvious.
Even if loading the file is the bottleneck, that still raises some questions. The original file we loaded is 24MB. Once we load it into memory and decode it into a text [Unicode] Python string, it takes far more than 24MB. Why is that?
A brief digression: Python’s string memory representation
Python’s string representation is optimized to use less memory, depending on what the string contents are. First, every string has a fixed overhead. Then, if the string can be represented as ASCII, only one byte of memory is used per character. If the string uses more extended characters, it might end up using as many as 4 bytes per character. We can see how much memory an object needs using sys.getsizeof():
>>> import sys
>>> s = "a" * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
1049
>>> s2 = "❄" + "a" * 999
>>> len(s2)
1000
>>> sys.getsizeof(s2)
2074
>>> s3 = "💵" + "a" * 999
>>> len(s3)
1000
>>> sys.getsizeof(s3)
4076
Notice how all 3 strings are 1000 characters long, but they use different amounts of memory depending on which characters they contain.
If you look at our large JSON file, it contains characters that don’t fit in ASCII. Because it’s loaded as one giant string, that whole giant string uses a less efficient memory representation.
A streaming solution
It’s clear that loading the whole JSON file into memory is a waste of memory. With a large enough file, it would be impossible to load it at all.
Given a JSON file that’s structured as a list of objects, we could in theory parse it one chunk at a time instead of all at once. The resulting API would probably allow processing the objects one at a time. And if we look at the algorithm we want to run, that’s just fine; the algorithm does not require all the data be loaded into memory at once. We can process the records one at a time.
Whatever term you want to describe this approach—streaming, iterative parsing, chunking, or reading on-demand—it means we can reduce memory usage to:
- The in-progress data, which should typically be fixed.
- The result data structure, which in our case shouldn’t be too large.
There are a number of Python libraries that support this style of JSON parsing; in the following example, I used the ijson library.
import ijson

user_to_repos = {}

with open("large-file.json", "rb") as f:
    for record in ijson.items(f, "item"):
        user = record["actor"]["login"]
        repo = record["repo"]["name"]
        if user not in user_to_repos:
            user_to_repos[user] = set()
        user_to_repos[user].add(repo)
In the previous version, using the standard library, once the data was loaded we no longer needed to keep the file open. With this API, the file has to stay open because the JSON parser reads from it on demand, as we iterate over the records.
The items() API takes a query string that tells it which part of the record to return. In this case, "item" just means “each item in the top-level list we’re iterating over”; see the ijson documentation for more details.
Here’s what memory usage looks like with this approach:
When it comes to memory usage, problem solved! And as far as runtime performance goes, the streaming/chunked solution with ijson actually runs slightly faster, though this won’t necessarily be the case for other datasets or algorithms.
Other approaches
As always, there are other solutions you can try:
- Pandas: Pandas has the ability to read JSON, and, in theory, it could do it in a more memory-efficient way for certain JSON layouts. In practice, for this example at least, peak memory was much worse at 287MB, not including the overhead of importing Pandas.
- SQLite: The SQLite database can parse JSON, store JSON in columns, and query JSON (see the documentation). One could therefore load the JSON into a disk-backed database file, and run queries against it to extract only the relevant subset of the data. I haven’t measured this approach, but if you need to run multiple queries against the same JSON file, this might be a good path going forward; you can add indexes, too.
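As a rough sketch of the SQLite route using the standard library’s sqlite3 module (the table name and records here are made up, and this assumes a SQLite build with the JSON1 functions, which modern builds include):

```python
import json
import sqlite3

# An in-memory database for the sketch; a file path gives a disk-backed one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (data TEXT)")

# Hypothetical records shaped like the GitHub events above.
records = [
    {"actor": {"login": "alice"}, "repo": {"name": "alice/tools"}},
    {"actor": {"login": "bob"}, "repo": {"name": "bob/app"}},
]
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(r),) for r in records])

# Extract only the relevant subset with SQLite's JSON functions.
rows = conn.execute(
    "SELECT json_extract(data, '$.repo.name') FROM events "
    "WHERE json_extract(data, '$.actor.login') = ?",
    ("alice",),
).fetchall()
```

An expression index on `json_extract(data, '$.actor.login')` would speed up repeated queries like this one.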
Finally, if you have control over the output format, you can reduce the memory usage of JSON processing by switching to a more efficient representation: for example, from a single giant JSON list of objects to one JSON record per line, so that each decoded JSON record only uses a small amount of memory.
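A minimal sketch of that record-per-line (“JSON Lines”) format, using only the standard library (the events.jsonl filename and the records are made up):

```python
import json

# Hypothetical event records; real ones would come from the producer.
records = [{"id": 1, "user": "alice"}, {"id": 2, "user": "bob"}]

# Write one JSON document per line instead of one big list.
with open("events.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading back needs only one decoded record in memory at a time.
loaded = []
with open("events.jsonl") as f:
    for line in f:
        loaded.append(json.loads(line))
```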