Why does dd.DataFrame say do not use this directly?

So I should use dd.from_pandas()? But what if the whole point of using Dask was that my .csv is too big for Pandas?

My code in Pandas

def get_data(symbols, dates, addSPY=True, colname="Adj Close"):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df = dd.DataFrame(index=dates)  # here it says __init__ got unexpected keyword 'index'
    if addSPY and "SPY" not in symbols:  # add SPY for reference, if absent
        symbols = ["SPY"] + list(
            symbols
        )  # handles the case where symbols is np array of 'object'
    for symbol in symbols:
        df_temp = dd.read_csv(
            f"ML4T_2023Sum/data/{symbol}.csv",
            usecols=["Date", colname],
        )
        df_temp = df_temp.rename(columns={colname: symbol})
        df = df.join(df_temp)
        if symbol == "SPY":  # drop dates SPY did not trade
            df = df.dropna(subset=["SPY"])
    return df

I saw something about divisions, so I tried passing (dates) as a tuple argument, but that doesn’t work.
Overall, would it be possible to add some example code, like the Pandas documentation has?
For those of us coming from Pandas, a short explanation of the differences would also help.
Also, as you can see, in Pandas I create an empty dataframe and then loop over the symbols, joining each stock’s one-column dataframe in turn. I end up with a 4-column dataframe with dates as the index and the daily closing prices for 4 stocks in each row.
How would I do this in Dask? Say I’m looking at some gargantuan date range like 1980-01-01 to 2023-01-01 at hourly or minute frequency. I’m probably not going to be able to use Pandas, since my machine only has 24 GB of RAM, and I skimmed those rules about Pandas dataset size vs. available RAM, right? That might have been on one of your sites somewhere, but can you address issues like this in the documentation?
I’m really lost as to how to replicate this code using Dask rather than Pandas.
Here is the notebook: ml-for-nom-2023/assess_portfolio_2023Sum/assess_portfolio/dask_analysis.ipynb at master · nyck33/ml-for-nom-2023 · GitHub
You can find analysis.ipynb in the same repo that uses Pandas only.
Then util.py and dask_util.py, former is Pandas, latter is Dask.
The Pandas version works, but you need to download the dataset from this site:

Then change the path util.py or dask_util.py reads from, depending on where you put the data.

Also, is the below a VSCode issue? I deleted the index kwarg but it’s still picking it up.

/mnt/d/fintech/ml4t/assess_portfolio_2023Sum/assess_portfolio/dask_util.py in get_data(symbols, dates, addSPY, colname)
     27 def get_data(symbols, dates, addSPY=True, colname="Adj Close"):
     28     """Read stock data (adjusted close) for given symbols from CSV files."""
---> 29     df = dd.DataFrame()
     31     if addSPY and "SPY" not in symbols:  # add SPY for reference, if absent

TypeError: __init__() got an unexpected keyword argument 'index'

Hi @nyck33,

Why does dd.DataFrame say do not use this directly?

Because a Dask DataFrame is a complex lazy structure, made of a task graph and other metadata describing how to build the collection and its chunks, and it is really hard to build one on your own. Instead, you should use one of the methods described here.

No, I would recommend using read_csv directly. Since Dask DataFrames are lazy and distributed, there is nothing to gain, and probably the opposite, in building an “empty” DataFrame and filling it up.

Example code is at the link given above, but perhaps it lacks some explanation?

Based on your code above, I’ve come up with the following example:

stock_df = None
colname = 'Adj Close'
for stock in ['SPY', 'CTX', 'DELL', 'GLD', 'GOOG']:
    # Read only the Date and price columns, and use the (sorted) Date
    # column as the index so the joins align on it
    temp_df = dd.read_csv(f'ML4T_2023Sum/data/{stock}.csv', 
                          usecols=['Date', colname], 
                          na_values=['nan']).set_index('Date', sorted=True)
    temp_df = temp_df.rename(columns={colname: stock})
    if stock_df is None:
        stock_df = temp_df
    else:
        stock_df = stock_df.join(temp_df)
# drop dates on which SPY did not trade
stock_df = stock_df.dropna(subset=["SPY"])

Yes, you are right: Pandas cannot be used if the dataset’s in-memory size exceeds, or even approaches, your available RAM.

I’m not sure which issue or documentation you are referring to.

Probably some cache somewhere. When I remove the index kwarg, I’m getting:

TypeError: DataFrame.__init__() missing 4 required positional arguments: 'dsk', 'name', 'meta', and 'divisions'