So should I use dd.from_pandas()? But what if the whole point of using Dask is that my .csv is too big for Pandas to load in the first place?
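I do understand the basic wrapping case; a toy like this (the dates and partition count are just made up) runs fine when the frame already fits in memory:

import pandas as pd
import dask.dataframe as dd

dates = pd.date_range("2010-01-01", "2010-12-31")
pdf = pd.DataFrame(index=dates)           # ordinary Pandas frame, already in RAM
ddf = dd.from_pandas(pdf, npartitions=4)  # wrapped as a 4-partition Dask frame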
My code (the Pandas version, with pd naively swapped for dd as a first attempt):
def get_data(symbols, dates, addSPY=True, colname="Adj Close"):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df = dd.DataFrame(index=dates)  # here it says __init__ got unexpected keyword 'index'
    if addSPY and "SPY" not in symbols:  # add SPY for reference, if absent
        symbols = ["SPY"] + list(symbols)  # handles the case where symbols is np array of 'object'
    for symbol in symbols:
        df_temp = dd.read_csv(
            symbol_to_path(symbol),
            index_col="Date",
            parse_dates=True,
            usecols=["Date", colname],
            na_values=["nan"],
        )
        df_temp = df_temp.rename(columns={colname: symbol})
        df = df.join(df_temp)
        if symbol == "SPY":  # drop dates SPY did not trade
            df = df.dropna(subset=["SPY"])
    return df
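Based on the error, I gather that dd.DataFrame isn't meant to be constructed directly the way pd.DataFrame is. Here is my best guess at a rewrite, as a rough sketch: symbol_to_path is my own helper from util.py, and I'm assuming dd.read_csv doesn't accept index_col and that calling set_index afterwards is the substitute. Is this the right direction?

import dask.dataframe as dd

def get_data(symbols, dates, addSPY=True, colname="Adj Close"):
    """Attempted Dask port: start from the first symbol instead of an empty frame."""
    if addSPY and "SPY" not in symbols:  # add SPY for reference, if absent
        symbols = ["SPY"] + list(symbols)
    df = None
    for symbol in symbols:
        df_temp = dd.read_csv(
            symbol_to_path(symbol),   # my helper from util.py
            usecols=["Date", colname],
            parse_dates=["Date"],     # dd.read_csv forwards this to pandas
            na_values=["nan"],
        )
        # dd.read_csv has no index_col, so set the index afterwards
        df_temp = df_temp.set_index("Date").rename(columns={colname: symbol})
        df = df_temp if df is None else df.join(df_temp)
        if symbol == "SPY":  # drop dates SPY did not trade
            df = df.dropna(subset=["SPY"])
    # divisions are known after set_index, so slicing the index should be cheap
    return df.loc[dates[0]:dates[-1]]

My worry is that set_index triggers a shuffle for every symbol, which sounds expensive for big files, so I'd love to know the idiomatic way.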
I saw something about divisions, so I tried passing dates as a tuple argument, but that doesn't work either.
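In case it matters, here is what I think divisions means after poking around (please correct me if I'm misreading the docs): they are the index boundary values between partitions, and they get passed to methods like set_index or repartition on an existing frame, not to a constructor. Something like:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"Date": pd.date_range("2010-01-01", "2010-12-31"),
                    "SPY": range(365)})
ddf = dd.from_pandas(pdf, npartitions=4)
# divisions live on the frame as per-partition index boundaries,
# something like (0, 92, 184, 276, 364) for the default RangeIndex:
print(ddf.divisions)
# to index by Date with known boundaries, I think you pass them here:
quarters = (pd.Timestamp("2010-01-01"), pd.Timestamp("2010-04-01"),
            pd.Timestamp("2010-07-01"), pd.Timestamp("2010-10-01"),
            pd.Timestamp("2010-12-31"))
ddf = ddf.set_index("Date", sorted=True, divisions=quarters)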
Overall, would it be possible to add some example code to the documentation, like the Pandas docs have? For those of us coming from Pandas, a short explanation of the key differences would also help.
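For instance, the single biggest difference I've pieced together so far (please correct me if I'm wrong) is laziness: nothing is actually read until .compute() is called. Even a tiny example like this in the docs would have helped me (the data/*.csv path is just a placeholder):

import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv", parse_dates=["Date"])  # lazy: builds a task graph, reads nothing
recent = ddf[ddf["Date"] >= "2020-01-01"]              # still lazy
pdf = recent.compute()                                 # only now is data read, into a Pandas frame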
Also, as you can see, in Pandas I create an empty dataframe and then loop over the symbols, joining each stock's one-column dataframe in turn; I end up with a 4-column dataframe with dates as the index and the daily closing prices of the 4 stocks in each row.
How would I do this in Dask? Say I'm looking at some gargantuan date range like 1980-01-01 to 2023-01-01, at hourly or even minute frequency. I'm probably not going to be able to use Pandas, since my machine only has 24 GB of RAM, and I skimmed those rules of thumb about Pandas dataset size vs. available RAM, right? That might have been on one of your sites somewhere, but could the documentation address cases like this?
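For that out-of-core case, my (possibly wrong) understanding from skimming the docs is that dd.read_csv's blocksize argument is what keeps memory bounded, because each chunk is loaded and processed separately. The file name here is just a placeholder:

import dask.dataframe as dd

# one huge minute-level CSV, read in ~64 MB chunks rather than all at once
ddf = dd.read_csv("data/SPY_minute.csv", blocksize="64MB",
                  parse_dates=["Date"], na_values=["nan"])
# the aggregation streams over the chunks; only the small result lands in memory
yearly_mean = ddf.groupby(ddf["Date"].dt.year)["Adj Close"].mean().compute()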
I’m really lost here as to how to replicate this code using Dask rather than Pandas.
Here is the notebook: ml-for-nom-2023/assess_portfolio_2023Sum/assess_portfolio/dask_analysis.ipynb in the nyck33/ml-for-nom-2023 repo on GitHub.
You can find analysis.ipynb, which uses Pandas only, in the same repo. There are also util.py and dask_util.py; the former is the Pandas version, the latter the Dask version.
The Pandas version works, but you need to download the dataset from this site:
https://lucylabs.gatech.edu/ml4t/summer2023/software-setup/
Then change the path that util.py or dask_util.py reads from, depending on where you put the data.
Also, is the below a VS Code issue? I deleted the index argument, but the traceback still shows it.
/mnt/d/fintech/ml4t/assess_portfolio_2023Sum/assess_portfolio/dask_util.py in get_data(symbols, dates, addSPY, colname)
27 def get_data(symbols, dates, addSPY=True, colname="Adj Close"):
28 """Read stock data (adjusted close) for given symbols from CSV files."""
---> 29 df = dd.DataFrame()
30
31 if addSPY and "SPY" not in symbols: # add SPY for reference, if absent
TypeError: __init__() got an unexpected keyword argument 'index'
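My own guess, writing this up, is that it's not VS Code but the notebook kernel still holding the old dask_util module in memory. I'm going to try forcing a reload, something like:

# in the notebook: either restart the kernel, or reload the edited module
import importlib
import dask_util
importlib.reload(dask_util)

# or enable auto-reloading once per session:
# %load_ext autoreload
# %autoreload 2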