Using Dask Bag to load and read a large JSON file

Hello, I would like to use dask.bag to load a JSON file that is too big to fit in my computer's memory. From my searching, it seems that if I do things correctly, I should be able to read a small portion of the file at a time. Here is what I did:

import dask.bag as db
import json

b = db.read_text("data.json").map(json.loads)

b.take(1)

After I run this code, my computer's memory usage climbs all the way to its limit, so I guess this is not how things work. Can anyone help me with this? Many thanks!!

Hi @yusuw, welcome to Dask forum!

That's correct: if things are set up properly, you should be able to read a small portion of the file at a time.

Well, it might depend on how your data is formatted. Could you post a representative sample of your JSON file? read_text takes several options, such as blocksize, that could be useful here.
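For illustration, a blocksize-based read might look like the sketch below. This assumes the file contains one JSON document per line, which may not match your data, and the file name and blocksize value are just placeholders:

import dask.bag as db
import json

# Sketch only: blocksize splits the file into ~64 MiB partitions, but this
# only helps if the file is newline-delimited (one JSON record per line).
b = db.read_text("data.json", blocksize=2**26).map(json.loads)
b.take(1)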

Hi @guillaumeeb, thanks for your reply! My JSON file is a list of dictionaries, and each dictionary is a profile of a material structure with the same keys. I think the structure is very similar to the example available here: Dask Bags — Dask Examples documentation, but much larger. Below is a simple illustration of the data format.

Example:
JSON file before saving: [{struct_1}, {struct_2}, {struct_3}, …]

Each dictionary:
{"structure name": "string",
"structure_id": "string",
"space_group": "string",
"peaks": [a list of float numbers],
"intensities": [a list of float numbers]
}

I tried setting blocksize to 2**28, but it seems to make no difference.

The question is: does your file have only one line, or is there one dict per line?

Looks like the file is only one line.

Sorry, I don't think I loaded my data the correct way, so now I'm not sure how to answer this question. Here are my code and output:

import dask.bag as db
import json

b = db.read_text("test_data.json").map(json.loads)
b.take(1)

([{'peak_position': [1.093, 1.44, 1.642, 1.786, 1.898, 1.989],
   'intensity': [100.0, 98.25460278228483, 42.91088528032965, 23.72132133082235, 74.61009853261571, 61.12144575420072],
   'd_hkls': [2.106, 1.4891668811788688, 1.2158996669133517, 1.053, 0.9418318321229112, 0.8597708997168954]},
  {'peak_position': [1.09, 1.437, 1.639, 1.783, 1.895, 1.986],
   'intensity': [100.0, 98.26514882265482, 42.92003498059408, 23.728848991358397, 74.64138270682857, 61.15315087770432],
   'd_hkls': [2.1125, 1.4937630752565816, 1.2196524436630845, 1.05625, 0.9447387204936611, 0.8624245136049107]},
  {'peak_position': [1.086, 1.433, 1.635, 1.779, 1.891, 1.982],
   'intensity': [100.0, 98.27879721598336, 42.93187854236903, 23.73859497503004, 74.68189512400386, 61.194218469903554],
   'd_hkls': [2.1209999999999996, 1.4997734828966671, 1.2245599209511961, 1.0604999999999998, 0.9485400360554106, 0.8658946240738535]},
  {'peak_position': [1.082, 1.429, 1.631, 1.775, 1.887, 1.978],
   'intensity': [100.0, 98.29228631329433, 42.943586388365965, 23.74823151758385, 74.72196269895193, 61.23484623406181],
   'd_hkls': [2.1295, 1.505783890536753, 1.229467398239308, 1.06475, 0.9523413516171605, 0.8693647345427964]},
  {'peak_position': [1.078, 1.424, 1.627, 1.771, 1.882, 1.973],
   'intensity': [100.0, 98.30717686318627, 42.956513521321746, 23.758874213606628, 74.76622534708027, 61.27974050454893],
   'd_hkls': [2.139, 1.512501404958025, 1.2349522257966092, 1.0695, 0.95658988077441, 0.873243093302203]}],)

I was hoping to get one dictionary, but I got all 5.

Dask DataFrame's read_json has a blocksize keyword argument that is intended for exactly this situation. However, it relies on newline characters in your file, so if everything is on one line you may need to preprocess your file to add them.
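Once the file is newline-delimited, the call might look like the sketch below (the file name data_lines.json and the blocksize value are just placeholders):

import dask.dataframe as dd

# Sketch only: expects one JSON record per line; blocksize controls roughly
# how many bytes of the file each partition reads.
df = dd.read_json("data_lines.json", lines=True, orient="records", blocksize=2**26)
df.head()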

For example, you could use a tool like sed to stream through the file and replace all instances of }, with },\n.
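If you would rather stay in Python, a streaming rewrite along the same lines might look like this sketch (file names are made up, and note that any }, occurring inside a string value would also get a newline, just as with the sed approach):

CHUNK = 2**20  # read ~1 MiB at a time so the whole file never sits in memory

with open("data.json", "r") as src, open("data_lines.json", "w") as dst:
    leftover = ""
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            dst.write(leftover)
            break
        buf = leftover + chunk
        # "}," could straddle a chunk boundary; if the buffer ends with "}",
        # hold that character back and prepend it to the next chunk.
        if buf.endswith("}"):
            leftover = "}"
            buf = buf[:-1]
        else:
            leftover = ""
        dst.write(buf.replace("},", "},\n"))

After that, both db.read_text with blocksize and dd.read_json with blocksize should be able to split the file into manageable partitions.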