Hi there!
I found a very strange problem and prepared a reproducer for it:
import pandas
import numpy as np
import dask.dataframe as dd
from distributed import Client

size = 10
TEST_DATA = {i: [i * size + j for j in range(size)] for i in range(size)}

def dask_pipeline():
    with Client() as client:
        df = pandas.DataFrame(TEST_DATA)
        df[1][:4] = np.nan
        df[3][-4:] = np.nan
        assert np.isnan(df[1][3])
        assert np.isnan(df[3][7])
        # dask_df = dd.from_pandas(df)
        df.fillna(value=0, inplace=True)
        df[1][:4] = np.nan
        df[3][-4:] = np.nan
        assert np.isnan(df[1][3])
        assert np.isnan(df[3][7])

if __name__ == "__main__":
    dask_pipeline()
If I run this code without importing dask.dataframe, it completes successfully. If I import it, the behavior of the pandas DataFrame changes unexpectedly.
Please tell me, is it a bug or a very strange feature?
Hi @KSuvorov,
What difference do you see in the result with or without importing dask.dataframe? I just tried the code and did not see anything special. Did you change anything other than importing dask.dataframe?
Hi @guillaumeeb,
Thanks for your fast reply.
The second assertion block fails and raises an AssertionError:
Traceback (most recent call last):
  File "/localdisk/ksuvorov/git/modin/test.py", line 31, in <module>
    dask_pipeline()
  File "/localdisk/ksuvorov/git/modin/test.py", line 16, in dask_pipeline
    assert np.isnan(df[1][3])
AssertionError
I use Python 3.9.18, and my environment includes the following libraries:
# Name        Version    Build            Channel
dask          2024.3.0   pyhd8ed1ab_1     conda-forge
dask-core     2024.3.0   pyhd8ed1ab_0     conda-forge
dask-expr     1.0.1      pyhd8ed1ab_0     conda-forge
distributed   2024.3.0   pyhd8ed1ab_0     conda-forge
pandas        2.2.1      py39hddac248_0   conda-forge
Okay, I can reproduce your problem with your environment. This is probably due to the issue "importing dask.dataframe changes pandas behaviour in 2024.3.0" (Issue #10996 in dask/dask on GitHub).
This has been fixed in the next dask-expr release, but be aware, as stated in the comments on the above issue, that you will encounter this problem again with pandas 3.0. It already triggers a lot of warning messages!
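In case it helps, here is a minimal sketch of the same steps written so that they should not depend on how pandas treats chained assignment. It assumes the change in behaviour comes from copy-on-write semantics, which is my reading of the linked issue and of the pandas 3.0 remark, not something verified against your setup. The helper name `set_nans` is made up for illustration, and the frame is created as float64 (my choice, not in the original reproducer) so that writing NaN needs no implicit upcast from int.

```python
import numpy as np
import pandas as pd

size = 10
TEST_DATA = {i: [i * size + j for j in range(size)] for i in range(size)}


def set_nans(df):
    """Hypothetical helper: blank the first four rows of column 1
    and the last four rows of column 3 with one .loc call each."""
    # A single .loc[rows, column] assignment writes into df itself,
    # instead of into an intermediate Series as df[1][:4] = ... does.
    df.loc[df.index[:4], 1] = np.nan
    df.loc[df.index[-4:], 3] = np.nan


# Assumption: float64 from the start, so NaN fits without a dtype change.
df = pd.DataFrame(TEST_DATA, dtype="float64")

set_nans(df)
first_pass = (bool(np.isnan(df.loc[3, 1])), bool(np.isnan(df.loc[7, 3])))

df.fillna(value=0, inplace=True)

set_nans(df)
second_pass = (bool(np.isnan(df.loc[3, 1])), bool(np.isnan(df.loc[7, 3])))
```

With this form both checks should behave the same before and after `fillna`, whether or not `dask.dataframe` has been imported, because nothing relies on `df[1]` returning a view of the frame.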
Thanks, it resolves my problem!