Compare two large JSON files in Python
I am trying to compare two files where each line is in JSON format. I need to compare each line between the two files and return the differences. Since the files are very large, I am unable to read and compare every line at once. Please suggest an optimized way of doing this.
wonea, asked Jul 23, 2013 at 5:26
Two possible ways:
Given that you have a large file, you are better off using the difflib technique described in point 1. Edit, based on responses to my answer below: after some research, it appears that the best way to deal with large data payloads is to process the payload in a streamed manner. This way we ensure speedy processing while keeping memory usage, and the performance of the software in general, under control. Refer to this link about streaming JSON data objects using Python, and similarly take a look at ijson, an iterator-based JSON parsing/processing library for Python. Hopefully this helps you identify a library that is a good fit for your use case.
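Since each line of the two files in the question is a standalone JSON document, one way to apply the streaming idea with only the standard library is to walk both files in parallel and compare one parsed record at a time, so memory use stays constant regardless of file size. A minimal sketch (the function name and file paths are placeholders):

```python
import json
from itertools import zip_longest

def diff_json_lines(path_a, path_b):
    """Yield (line_number, obj_a, obj_b) for every line that differs.

    Streams both files in parallel, holding at most one parsed
    record per file in memory at any time. If one file is longer,
    the missing side is reported as None.
    """
    with open(path_a) as fa, open(path_b) as fb:
        for lineno, (la, lb) in enumerate(zip_longest(fa, fb), start=1):
            obj_a = json.loads(la) if la is not None else None
            obj_b = json.loads(lb) if lb is not None else None
            if obj_a != obj_b:
                yield lineno, obj_a, obj_b
```

Because the comparison happens on the parsed objects with ==, cosmetic differences such as key order or whitespace within a line are not reported as differences.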
answered Jul 23, 2013 at 5:58
This seems to be a pretty solid start: https://github.com/ZoomerAnalytics/jsondiff
I'm also going to try it out for a current project; I'll try to post updates and edits as I go along. answered Oct 24, 2018 at 15:05
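The jsondiff library linked above exposes a diff(a, b) function over parsed documents. If you would rather avoid a dependency, a small recursive diff over parsed JSON can serve the same purpose; this is a sketch with its own output format (path/old/new tuples), not an imitation of jsondiff's:

```python
import json

def json_diff(old, new, path=""):
    """Return a list of (path, old_value, new_value) tuples describing
    how `new` differs from `old`. A key missing on one side is
    reported with None as its value. Lists and scalars are compared
    wholesale rather than element by element, to keep the sketch small.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(set(old) | set(new)):
            changes.extend(json_diff(old.get(key), new.get(key), f"{path}.{key}"))
        return changes
    if old != new:
        return [(path or ".", old, new)]
    return []
```

Usage: json_diff(json.loads(text_a), json.loads(text_b)) gives a flat list of changed paths, which is easy to print or log.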
mjfred

If you need to process a large JSON file in Python, it's very easy to run out of memory. Even if the raw data fits in memory, the Python representation can increase memory usage even more. And that means either slow processing, as your program swaps to disk, or crashing when you run out of memory. One common solution is streaming parsing, also known as lazy parsing, iterative parsing, or chunked processing. Let's see how you can apply this technique to JSON processing.

The problem: Python's memory-inefficient JSON loading

For illustrative purposes, we'll be using this JSON file, large enough at 24MB that it has a noticeable memory impact when loaded. It encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:
Our goal is to figure out which repositories a given user interacted with. Here’s a simple Python program that does so:
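The program itself is not reproduced here, but based on the description it can be sketched as follows. The "actor"/"login" and "repo"/"name" field names are assumptions about the GitHub-event layout, and the file path is a placeholder:

```python
import json

def users_to_repos(path):
    """Map each username to the set of repository names it interacted
    with. Note: json.load() parses the entire file into memory at once,
    which is exactly the memory problem discussed below."""
    with open(path) as f:
        events = json.load(f)  # whole file read and decoded here
    user_to_repos = {}
    for event in events:
        user = event["actor"]["login"]   # assumed event field names
        repo = event["repo"]["name"]
        user_to_repos.setdefault(user, set()).add(repo)
    return user_to_repos
```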
The result is a dictionary mapping usernames to sets of repository names. When we run this with the Fil memory profiler and look at peak memory usage, we see two main sources of allocation:
And if we look at the implementation of the standard library's json.load(), we can see why: it reads the entire file into memory as a single string, then passes that string to json.loads().
So that's one problem: just loading the file takes a lot of memory. In addition, there should be some usage from creating the Python objects. However, in this case they don't show up at all, probably because peak memory is dominated by loading the file and decoding it from bytes to Unicode. That's why actual profiling is so helpful in reducing memory usage and speeding up your software: the real bottlenecks might not be obvious. Even if loading the file is the bottleneck, that still raises some questions. The original file we loaded is 24MB. Once we load it into memory and decode it into a text (Unicode) Python string, it takes far more than 24MB. Why is that?

A brief digression: Python's string memory representation

Python's string representation is optimized to use less memory, depending on what the string contents are. First, every string has a fixed overhead. Then, if the string can be represented as ASCII, only one byte of memory is used per character. If the string uses more extended characters, it might end up using as many as 4 bytes per character. We can see how much memory an object needs using sys.getsizeof():
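We can demonstrate this with sys.getsizeof() on three strings of identical length whose widest characters differ (exact byte counts vary across CPython versions, so only the ordering matters here):

```python
import sys

# All three strings are 1000 characters long, but CPython's compact
# string representation (PEP 393) stores 1, 2, or 4 bytes per
# character depending on the widest code point present.
ascii_s = "a" * 1000            # pure ASCII: 1 byte per character
bmp_s = "\u2020" * 1000         # dagger, above U+00FF: 2 bytes per character
emoji_s = "\U0001F40D" * 1000   # snake emoji, above U+FFFF: 4 bytes per character

for s in (ascii_s, bmp_s, emoji_s):
    print(len(s), sys.getsizeof(s))
```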
Notice how all 3 strings are 1000 characters long, but they use different amounts of memory depending on which characters they contain. If you look at our large JSON file, it contains characters that don't fit in ASCII. Because it's loaded as one giant string, that whole giant string uses a less efficient memory representation.

A streaming solution

It's clear that loading the whole JSON file into memory is a waste of memory. With a larger file, it would be impossible to load at all. Given a JSON file that's structured as a list of objects, we could in theory parse it one chunk at a time instead of all at once. The resulting API would probably allow processing the objects one at a time. And if we look at the algorithm we want to run, that's just fine; the algorithm does not require all the data to be loaded into memory at once. We can process the records one at a time. Whatever term you use for this approach (streaming, iterative parsing, chunking, or reading on demand), it means we can reduce memory usage to little more than the current record being processed, plus whatever results we accumulate.
There are a number of Python libraries that support this style of JSON parsing; in the following example, I used the ijson library:
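The example itself is not reproduced here, but a streaming version using ijson can be sketched roughly as follows, carrying over the same assumed event fields and top-level-array layout as before:

```python
import ijson

def users_to_repos_streaming(path):
    """Same result as the json.load() version, but ijson yields one
    top-level array element at a time, so only the current record
    is parsed into memory."""
    user_to_repos = {}
    with open(path, "rb") as f:
        # The "item" prefix selects each element of the top-level array.
        for event in ijson.items(f, "item"):
            user = event["actor"]["login"]   # assumed event field names
            repo = event["repo"]["name"]
            user_to_repos.setdefault(user, set()).add(repo)
    return user_to_repos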
In the previous version, using the standard library, once the data is loaded we no longer need to keep the file open. With this API the file has to stay open, because the JSON parser is reading from the file on demand as we iterate over the records. Here's what memory usage looks like with this approach: when it comes to memory usage, problem solved! As far as runtime performance goes, measure the streaming solution against the original on your own workload; in general a streaming parser trades some CPU overhead for a large reduction in peak memory.

Other approaches

As always, there are other solutions you can try:
Finally, if you have control over the output format, there are ways to reduce the memory usage of JSON processing by switching to a more efficient representation. For example, you can switch from a single giant JSON list of objects to one JSON record per line, which means every decoded JSON record will only use a small amount of memory.

How do I compare large JSON files?

Use an online JSON compare tool: copy the original JSON data into the block on the left and the new/modified data into the right block, then click the Compare and View button to see a table of all the differences found, node by node.
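The record-per-line format mentioned above (often called JSON Lines or NDJSON) can be sketched with the standard library alone: write each record with json.dumps() on its own line, and read records back one at a time so only one decoded record is ever in memory:

```python
import json

def write_jsonl(path, records):
    """Write one JSON document per line (JSON Lines / NDJSON)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Yield records one at a time; only the current line is decoded,
    so memory use does not grow with the file size."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)
```

Because read_jsonl() is a generator, a consumer can process arbitrarily large files with constant memory, much like the ijson approach but without the dependency.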
How do I compare two JSON files in Python?

Comparing parsed JSON is quite simple: we can use the == operator. Note that == and is are not the same: == checks equality of values, whereas is checks reference (identity) equality. One should use the == operator here; is will not give the expected result.
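For example, two JSON documents whose keys appear in different order still compare equal once parsed, because == compares dict contents, while is compares object identity:

```python
import json

a = json.loads('{"x": 1, "y": [1, 2]}')
b = json.loads('{"y": [1, 2], "x": 1}')  # same data, different key order

print(a == b)  # True: value equality; dict key order does not matter
print(a is b)  # False: two distinct objects in memory
```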
How do I compare two files in Python?

Approach:
- Open both files in read mode.
- Store each file's lines as a list of strings.
- Find the common lines with the intersection() method.
- Compare both files for differences using a while loop.
- Close both files.

How do I handle a large JSON file?

There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular GSON library (for Java). It gets the effect of parsing the file as both a stream and an object: it handles each record as it arrives and then discards it, keeping memory usage low.
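The file-comparison steps above can be sketched as follows (this compares lines as raw strings; for JSON content you would parse each line first, as shown earlier):

```python
def compare_files(path_a, path_b):
    """Follow the approach above: read both files, collect the lines
    they have in common via set intersection, then walk both line
    lists with a while loop to find positional differences."""
    with open(path_a) as fa, open(path_b) as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()

    # Lines present in both files, regardless of position.
    common = set(lines_a).intersection(lines_b)

    # Lines that differ at the same position, as (line_number, a, b).
    differing = []
    i = 0
    while i < len(lines_a) and i < len(lines_b):
        if lines_a[i] != lines_b[i]:
            differing.append((i + 1, lines_a[i], lines_b[i]))
        i += 1
    return common, differing
```

Note that readlines() loads both files fully into memory, so this approach fits small files; for the large files in the original question, prefer the streamed line-by-line comparison shown earlier.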