Calculations in a column

J3ss0n · October 26, 2022, 1:20pm

Just as background: I am a novice in Python, pandas and Dask. I would appreciate your input.

I have a large CSV export with data from various countries (column Source_Country).
I want to achieve two things:

Get the percentage of each column that is filled, per country (focusing on this issue first).
Get value counts for a selection of important columns in my CSV file.

As said, the file is very large, so pandas alone didn't work.

The code I have so far is:

import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
client = Client(n_workers=10, threads_per_worker=8, processes=True, memory_limit='10GB')
client

ddf = dd.read_csv('eloqua_export (2).csv',blocksize="50MB",on_bad_lines='skip',engine='python',sep=';')

DE_=ddf['Source_Country']== 'DE'
DE_count=(100-(DE_.isna().sum())/(len(DE_))*100).compute

Here I get all sorts of calculation issues related to dtype=float64.
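Independently of the dtype messages, two things in the snippet above look suspect: `ddf['Source_Country'] == 'DE'` yields a boolean mask (True/False, never missing), so `isna()` on it always counts zero, and `.compute` without parentheses returns the method rather than a result. A minimal sketch of the single-country calculation follows, shown on a small made-up pandas frame so it runs anywhere; on a Dask DataFrame the same calls should apply, followed by `.compute()`. The sample data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up sample standing in for the real export (assumption).
sample = pd.DataFrame({
    "Source_Country": ["DE", "DE", "FR"],
    "column1": ["x", None, "y"],
})

# The comparison gives a boolean Series; it contains no missing values.
mask = sample["Source_Country"] == "DE"

# Use the mask to select the DE rows, then measure how full each column is.
de_rows = sample[mask]

# notna() marks filled cells; the mean of that is the filled fraction.
de_fill_pct = de_rows.notna().mean() * 100

print(de_fill_pct)
```

With Dask, the last step would be something like `(ddf[mask].notna().mean() * 100).compute()`, with the parentheses after `compute`.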

In an ideal world I would like a table (CSV export) like this:

Columnname;Country1;Country2
column1;100;100
column2;25.3;60

How can I achieve this? Any help is appreciated.
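One possible way to get a table of that shape, as a sketch under stated assumptions: group by `Source_Country`, count the non-missing cells per column, and divide by the number of rows per country. The helper name `fill_percentage` and the sample data are hypothetical. The example runs on a small pandas frame; Dask DataFrames offer the same `groupby(...).count()` and `groupby(...).size()` calls, and the helper calls `.compute()` when the results are lazy.

```python
import pandas as pd


def fill_percentage(df, country_col="Source_Country"):
    """Percentage of non-missing values per column, per country.

    Hypothetical helper: accepts a pandas DataFrame, and should also
    accept a Dask DataFrame, whose lazy results are computed below.
    """
    grouped = df.groupby(country_col)
    counts = grouped.count()  # non-missing cells per column, per country
    sizes = grouped.size()    # total rows per country

    # Dask results are lazy and expose .compute(); pandas results do not.
    if hasattr(counts, "compute"):
        counts, sizes = counts.compute(), sizes.compute()

    pct = counts.div(sizes, axis=0) * 100
    # Transpose so rows are column names and columns are countries.
    return pct.T.round(1)


# Tiny made-up sample standing in for the real export (assumption).
sample = pd.DataFrame({
    "Source_Country": ["DE", "DE", "FR", "FR"],
    "column1": ["a", "b", "c", "d"],
    "column2": [1.0, None, None, None],
})

result = fill_percentage(sample)

# Semicolon-separated text, matching the layout asked for above.
print(result.to_csv(sep=";", index_label="Columnname"))
```

Only the two small aggregated tables are brought into memory, so the size of the export should matter much less than with a full pandas load. Passing a file path to `to_csv` writes the export to disk instead of printing it.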
