Creating a Dask DataFrame with read_json or converting from Pandas

I have been trying to build a simple log-mining application. A typical log entry looks like:

{"repoType": 29, "repo": "Jason_hbase", "reqUser": "hbase", "evtTime": "2023-02-08 20:44:08", "access": "INSERT", "resource": "default/tas2", "resType": "@column", "action": "select", "result": 1, "agent": "William", "policy": 15, "enforcer": "ranger-acl", "sess": "1d5e2d26-177c-4ab0-907b-a1b2c2e1f14c", "cliType": "WILLIAM", "cliIP": "128.74.107.129", "logType": "RangerAudit", "id": "bffe8d84-5b9d-4cda-9256-1dff82a9f4ec", "seq_num": 1, "event_count": 1, "event_dur_ms": 1, "tags": [], "cluster_name": "cdp-dc-profilers-fc820", "policy_version": 1}

My script works perfectly if I create a dataframe with pandas:

import pandas as pd
import dask.dataframe as dd
df = pd.read_json('./log.txt')

ddf = dd.from_pandas(df, npartitions=5)

print('\nTop 10 users')
# print(ddf['reqUser'].value_counts().nlargest(10).to_frame().compute())
print(ddf.reqUser.value_counts().nlargest(10).to_frame().compute())

and the output is:

Top 10 users
         reqUser
hbase       1089
impala      1041
iceberg     1005
hive         961

However, if I try to read this JSON directly with dask.dataframe.read_json, referencing the column raises an error:

import pandas as pd
import dask.dataframe as dd

ddf = dd.read_json('./log.txt', encoding='utf-8')
print('\nTop 10 users')
print(ddf.reqUser.value_counts().nlargest(10).to_frame().compute())

The error is:

File "/Users/ap/.local/share/virtualenvs/python-demo-xaGE8vZ_/lib/python3.11/site-packages/dask/dataframe/core.py", line 4806, in __getattr__
    raise AttributeError("'DataFrame' object has no attribute %r" % key)
AttributeError: 'DataFrame' object has no attribute 'reqUser'

Ideally, there should not be any difference once a dataframe has been formed. I am not sure what I am missing.
Moreover, the dataframe is not formed properly:

print(ddf.dtypes)

0       object
1       object
2       object
         ...  
4094    object
4095    object

Am I missing something quite obvious?

Hi @matrixbegins, welcome to the Dask community!

I’m not sure of the reason behind it, but the default value of the orient kwarg is not the same in Dask DataFrame and Pandas. I was able to make your code work with:

ddf = dd.read_json('./log.txt', orient='columns')