Some background: I am a novice in Python, pandas and Dask, and I would appreciate your input.
I have a large CSV export with data from various countries (column Source_Country).
I want to achieve two things:
- get the percentage of filled (non-empty) values in each column, per country (focusing on this issue first)
- get value counts for a selection of important columns in my CSV file.
As mentioned, the file is very large, so plain pandas didn't work.
The code I have so far is:
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
client = Client(n_workers=10, threads_per_worker=8, processes=True, memory_limit='10GB')
client
ddf = dd.read_csv('eloqua_export (2).csv',blocksize="50MB",on_bad_lines='skip',engine='python',sep=';')
DE_=ddf['Source_Country']== 'DE'
DE_count=(100-(DE_.isna().sum())/(len(DE_))*100).compute
Here I run into all sorts of calculation issues related to dtype=float64.
In an ideal world I would like a table (CSV export) like this:
Columnname;Country1;Country2
column1;100;100
column2;25.3;60
How can I achieve this? Any help is appreciated.