Using Dask Bag to load and read a large JSON file

Hello, I would like to use dask.bag to load a JSON file that is too big to fit in my computer's memory. From my searching, it seems that if I do things correctly, I should be able to read a small portion of the file at a time. Here is what I did:

import dask.bag as db
import json

b = db.read_text("data.json").map(json.loads)

b.take(1)

After I run this code, my computer's memory usage climbs all the way to its limit, so I guess this is not how things work. Can anyone help me with this? Many thanks!!

Hi @yusuw, welcome to Dask forum!

That's correct: if things are set up properly, you should be able to read a small portion of the file at a time.

Well, it might depend on how your data is formatted. Could you post a representative sample of your JSON file? read_text takes several options, such as blocksize, that could be useful here.
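For illustration, a blocksize-based read might look like the sketch below. This assumes the file contains one JSON document per line, which may not match your data, and the file name and blocksize value are just placeholders:

import dask.bag as db
import json

# Sketch only: blocksize splits the file into ~64 MiB partitions, but this
# only helps if the file is newline-delimited (one JSON record per line).
b = db.read_text("data.json", blocksize=2**26).map(json.loads)
b.take(1)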

Hi @guillaumeeb, thanks for your reply! My JSON file is a list of dictionaries, and each dictionary is a profile of a material structure with the same keys. I think the structure is very similar to the example available here: Dask Bags — Dask Examples documentation, but much larger. Below is a simple illustration of the data format.

Example:
JSON file before saving: [{struct_1}, {struct_2}, {struct_3}, …]

Each dictionary:
{"structure name": "string",
"structure_id": "string",
"space_group": "string",
"peaks": [a list of float numbers],
"intensities": [a list of float numbers]
}

I tried setting blocksize to 2**28, but it seems to make no difference.

The question is: does your file have only one line, or is there one dict per line?

Looks like the file is only one line.

Sorry, I don't think I loaded my data the correct way, so now I'm not sure how to answer this question. Here are my code and output:

import dask.bag as db
import json

b = db.read_text("test_data.json").map(json.loads)
b.take(1)

([{'peak_position': [1.093, 1.44, 1.642, 1.786, 1.898, 1.989],
   'intensity': [100.0, 98.25460278228483, 42.91088528032965, 23.72132133082235, 74.61009853261571, 61.12144575420072],
   'd_hkls': [2.106, 1.4891668811788688, 1.2158996669133517, 1.053, 0.9418318321229112, 0.8597708997168954]},
  {'peak_position': [1.09, 1.437, 1.639, 1.783, 1.895, 1.986],
   'intensity': [100.0, 98.26514882265482, 42.92003498059408, 23.728848991358397, 74.64138270682857, 61.15315087770432],
   'd_hkls': [2.1125, 1.4937630752565816, 1.2196524436630845, 1.05625, 0.9447387204936611, 0.8624245136049107]},
  {'peak_position': [1.086, 1.433, 1.635, 1.779, 1.891, 1.982],
   'intensity': [100.0, 98.27879721598336, 42.93187854236903, 23.73859497503004, 74.68189512400386, 61.194218469903554],
   'd_hkls': [2.1209999999999996, 1.4997734828966671, 1.2245599209511961, 1.0604999999999998, 0.9485400360554106, 0.8658946240738535]},
  {'peak_position': [1.082, 1.429, 1.631, 1.775, 1.887, 1.978],
   'intensity': [100.0, 98.29228631329433, 42.943586388365965, 23.74823151758385, 74.72196269895193, 61.23484623406181],
   'd_hkls': [2.1295, 1.505783890536753, 1.229467398239308, 1.06475, 0.9523413516171605, 0.8693647345427964]},
  {'peak_position': [1.078, 1.424, 1.627, 1.771, 1.882, 1.973],
   'intensity': [100.0, 98.30717686318627, 42.956513521321746, 23.758874213606628, 74.76622534708027, 61.27974050454893],
   'd_hkls': [2.139, 1.512501404958025, 1.2349522257966092, 1.0695, 0.95658988077441, 0.873243093302203]}],)

I was hoping to get one dictionary, but I got all 5.

Dask DataFrame's read_json has a blocksize keyword argument that is intended for exactly this situation. However, it relies on newline characters in your file, so if everything is on one line you may need to preprocess your file to add them.
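Once the file is newline-delimited, the call might look like the sketch below (the file name data_lines.json and the blocksize value are just placeholders):

import dask.dataframe as dd

# Sketch only: expects one JSON record per line; blocksize controls roughly
# how many bytes of the file each partition reads.
df = dd.read_json("data_lines.json", lines=True, orient="records", blocksize=2**26)
df.head()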

For example, you could use a tool like sed to stream through the file and replace all instances of }, with },\n.
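If you would rather stay in Python, a streaming rewrite along the same lines might look like this sketch (file names are made up, and note that any }, occurring inside a string value would also get a newline, just as with the sed approach):

CHUNK = 2**20  # read ~1 MiB at a time so the whole file never sits in memory

with open("data.json", "r") as src, open("data_lines.json", "w") as dst:
    leftover = ""
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            dst.write(leftover)
            break
        buf = leftover + chunk
        # "}," could straddle a chunk boundary; if the buffer ends with "}",
        # hold that character back and prepend it to the next chunk.
        if buf.endswith("}"):
            leftover = "}"
            buf = buf[:-1]
        else:
            leftover = ""
        dst.write(buf.replace("},", "},\n"))

After that, both db.read_text with blocksize and dd.read_json with blocksize should be able to split the file into manageable partitions.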