Dtype Changing from int to float unexpectedly

ek2718 · March 31, 2023, 1:32am

posNeighborhoods = posNeighborhoods.astype(dtype=np.uint8)
print(posNeighborhoods)
posNeighborhoods = posNeighborhoods.compute()
print(posNeighborhoods)

posNeighborhoods transforms from a dask array with dtype uint8 to a numpy array with dtype float64.

dask.array<rechunk-merge, shape=(5507144, 5), dtype=uint8, chunksize=(5507144, 1), chunktype=numpy.ndarray>
[[ 0.  0.  6. 25.  6.]
 [ 0.  0.  6. 25.  7.]
 [ 0.  0.  6. 25.  9.]
 ...
 [31. 24. 20.  3. 29.]
 [31. 24. 20.  3. 30.]
 [31. 24. 20.  3. 31.]]
float64

This strange glitch doesn’t occur frequently, but it always happens in the same spot. In case it was a problem with overflow, I checked the same program with uint64, and the same glitch occurred in the same spot.
I was wondering if there was anything that could trigger a change in dtype like this, and how to prevent it?

ek2718 · March 31, 2023, 1:45am

It is also possible that this is more of a numpy problem than a dask problem, but any help would be appreciated!

guillaumeeb · March 31, 2023, 2:06pm

Hi @ek2718, welcome to Dask Discourse!

I’ve edited your first post with code cells to make it more readable.

Could you provide a minimal working example of this behavior?

ek2718 · March 31, 2023, 7:42pm

Actually, my astype function was formatted wrong, it should just be arr.astype(np.uint8)! I think the program was originally changing dtype in pandas, because one of the columns was int64 and the index was uint8, so I think sometimes pandas decides to make everything float!

guillaumeeb · April 1, 2023, 7:18am

So I guess your problem is solved? Could you describe the solution a little more?

ek2718 · April 2, 2023, 3:57am

Actually, I was wrong and I didn’t figure it out, but now I did.
Here is some minimal working code:

import numpy as np
import dask.array as da
import dask.dataframe as dd

# Left frame: uint8 data with a sorted random int64 index
df = dd.from_dask_array(da.random.randint(0, 1 << 8, (10000, 5), dtype=np.uint8))
df['index'] = dd.from_array(np.sort(np.random.randint(0, 10000, 10000, dtype=np.int64)))
df = df.set_index('index', sorted=True)

# Right frame: built the same way, so its index values only partly overlap
df2 = dd.from_dask_array(da.random.randint(0, 1 << 8, (10000, 5), dtype=np.uint8))
df2['index'] = dd.from_array(np.sort(np.random.randint(0, 10000, 10000, dtype=np.int64)))
df2 = df2.set_index('index', sorted=True)

# Left join on the index, then back to a dask array
x = df.merge(df2, how='left', left_index=True, right_index=True).to_dask_array(lengths=True)
x = da.rechunk(x, chunks='auto')
x.compute()

The code wasn’t working because some rows on the left had no matching row on the right, so the merge filled them with NaN, which is a float and forces the affected columns to float64. Whether the merge returns int or float varies from run to run, depending on the random indexes and on how high you set the high value when generating them.
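The same effect can be seen in a much smaller pandas-only sketch. The frame names and values below are made up for illustration, and it assumes that the dask merge behaves like pandas on each partition:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: index 2 exists on the left but not on the right
left = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.uint8)}, index=[0, 1, 2])
right = pd.DataFrame({"b": np.array([10, 20], dtype=np.uint8)}, index=[0, 1])

# Left join: the unmatched row gets NaN in column "b"
merged = left.merge(right, how="left", left_index=True, right_index=True)
print(merged.dtypes)

# Converting to a single array picks a dtype that can hold the NaN
values = merged.to_numpy()
print(values.dtype)
```

If every left index happens to have a match, no NaN is inserted and the integer dtype survives, which would explain why the problem only shows up some of the time.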

Is there a way to merge without including these rows?

guillaumeeb · April 3, 2023, 9:49am

You are asking for a left join, so every time you’ve got an index on the left that doesn’t exist on the right, you’ll have NaN. Did you try with an inner join if you want to avoid the NaN?
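As a rough pandas-only sketch of the difference (toy frames with made-up names and values, and assuming the dask merge follows the same rules per partition), an inner join drops the unmatched rows so no NaN is introduced:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: index 2 exists on the left but not on the right
left = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.uint8)}, index=[0, 1, 2])
right = pd.DataFrame({"b": np.array([10, 20], dtype=np.uint8)}, index=[0, 1])

# Inner join: only index values present on both sides are kept
inner = left.merge(right, how="inner", left_index=True, right_index=True)
print(len(inner))

# No NaN was needed, so the array keeps the integer dtype
inner_values = inner.to_numpy()
print(inner_values.dtype)
```

The trade-off is that rows from the left frame without a match are silently discarded, so this only fits when those rows are not needed downstream.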

ek2718 · April 3, 2023, 1:29pm

That was what I was looking for, perfect!
