Dtype Changing from int to float unexpectedly

posNeighborhoods = posNeighborhoods.astype(dtype=np.uint8)
print(posNeighborhoods)
posNeighborhoods = posNeighborhoods.compute()
print(posNeighborhoods)

posNeighborhoods transforms from a dask array with dtype uint8 to a numpy array with dtype float64.

dask.array<rechunk-merge, shape=(5507144, 5), dtype=uint8, chunksize=(5507144, 1), chunktype=numpy.ndarray>
[[ 0.  0.  6. 25.  6.]
 [ 0.  0.  6. 25.  7.]
 [ 0.  0.  6. 25.  9.]
 ...
 [31. 24. 20.  3. 29.]
 [31. 24. 20.  3. 30.]
 [31. 24. 20.  3. 31.]]
float64

This strange glitch doesn't occur on every run, but when it does, it always happens in the same spot. In case it was an overflow problem, I reran the same program with uint64, and the same glitch occurred in the same spot.
I was wondering what could trigger a change in dtype like this, and how to prevent it?

It is also possible that this is more of a NumPy problem than a Dask problem, but any help would be appreciated!

Hi @ek2718, welcome to Dask Discourse!

I’ve edited your first post with code cells to make it more readable.

Could you provide a Minimal Working Example of this behavior?

Actually, my astype call was formatted wrong; it should just be arr.astype(np.uint8). I think the program was originally changing dtype in pandas, because one of the columns was int64 and the index was uint8, so I think sometimes pandas decides to make everything float!

So I guess your problem is solved? Could you describe the solution a little more?

Actually, I was wrong and hadn't figured it out then, but now I have.
Here is some minimal working code:

import numpy as np
import dask.array as da
import dask.dataframe as dd

# Two uint8 DataFrames with random (and therefore possibly non-overlapping) int64 indexes
df = dd.from_dask_array(da.random.randint(0, 1 << 8, (10000, 5), dtype=np.uint8))
df['index'] = dd.from_array(np.sort(np.random.randint(0, 10000, 10000, dtype=np.int64)))
df = df.set_index('index', sorted=True)

df2 = dd.from_dask_array(da.random.randint(0, 1 << 8, (10000, 5), dtype=np.uint8))
df2['index'] = dd.from_array(np.sort(np.random.randint(0, 10000, 10000, dtype=np.int64)))
df2 = df2.set_index('index', sorted=True)

# Left merge on the index, then convert back to a Dask array
x = df.merge(df2, how='left', left_index=True, right_index=True).to_dask_array(lengths=True)
x = da.rechunk(x, chunks='auto')
x.compute()

The reason the code wasn't working is that some rows in the left frame had no match on the right, so the merge filled them with NaN, which is a float. Depending on chance, and on how high you set the upper bound when generating the random indexes, the merge sometimes returns int and sometimes float.
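A minimal sketch of this mechanism in plain pandas (Dask DataFrames follow the same pandas casting rules); the frames and column names here are illustrative, not from the original program:

```python
import numpy as np
import pandas as pd

# A left merge where some left-index values have no match on the right:
# the unmatched rows are filled with NaN, which forces an upcast to float64.
left = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.uint8)}, index=[0, 1, 2])
right = pd.DataFrame({"b": np.array([10, 20], dtype=np.uint8)}, index=[0, 2])

merged = left.merge(right, how="left", left_index=True, right_index=True)
print(merged.dtypes)
# "a" stays uint8 (it has no missing values), but "b" becomes float64
# because index 1 has no match on the right and is filled with NaN.
```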

Is there a way to do the merge without including these rows?

You are asking for a left join, so any time you have an index on the left that doesn't exist on the right, you'll get NaN. Did you try an inner join, if you want to avoid the NaN?
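For illustration, a hedged pandas sketch (the same behavior carries over to Dask's merge) showing that an inner join drops the unmatched rows instead of NaN-filling them, so the integer dtype survives; the frames here are made up for the example:

```python
import numpy as np
import pandas as pd

# Inner join: rows whose index has no match on the other side are dropped
# rather than filled with NaN, so no upcast to float occurs.
left = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.uint8)}, index=[0, 1, 2])
right = pd.DataFrame({"b": np.array([10, 20], dtype=np.uint8)}, index=[0, 2])

merged = left.merge(right, how="inner", left_index=True, right_index=True)
print(merged.dtypes)  # both columns stay uint8
```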

That was what I was looking for, perfect!