Dask Tutorial dask_delayed: what are they asking here?

What happens when you call it on sums and counts? What happens if you wait and call it on mean?

- Experiment with delaying the call to sum. What does the graph look like if sum is delayed? What does the graph look like if it isn’t?

But it looks like sums and counts are arrays, not functions.


# This is just one possible solution, there are
# several ways to do this using `dask.delayed`

import dask
import pandas as pd

@dask.delayed
def read_file(filename):
    # Read in file
    return pd.read_csv(filename)

sums = []
counts = []
for fn in filenames:
    # Delayed read in file
    df = read_file(fn)

    # Groupby origin airport
    by_origin = df.groupby("Origin")

    # Sum of all departure delays by origin
    total = by_origin.DepDelay.sum()

    # Number of flights by origin
    count = by_origin.DepDelay.count()

    # Save the intermediates
    sums.append(total)
    counts.append(count)

# Combine intermediates to get total mean-delay-per-origin
total_delays = sum(sums)
n_flights = sum(counts)
mean, *_ = dask.compute(total_delays / n_flights)

And mean is a result of dask.compute()

It would be nice to have the notebooks formatted with solutions for these, as you do for every other question earlier in the notebook, where you have a blank cell, “your code here”, and then the actual answer, because I’m totally lost as to what you want. Btw, although I withdrew from the course due to a bad grade, I was taking an intro to HPC course at Georgia Tech, and I just earned my basic CUDA certificate from Nvidia. So theory-wise I am so-so regarding transfer of memory blocks, computing in different streams or on concurrent threads or processes, etc., and I’m still lost as to what you want here. I cannot imagine how someone who is new to concurrency and parallelism would feel.

Please improve these at your next convenience.

Hi @nyck33, welcome to Dask Discourse!

First of all, thanks for the feedback; it’s always nice to have some advice on Dask-related content. Be careful though, the end of the message might feel a bit rude, even though I understand that not grasping what a question is asking can be frustrating.

Actually, there are several places in the notebook with “questions to consider”. These are more open questions that don’t really need a written answer and are there to encourage digging deeper. The tutorial, even if it can be followed on your own, is meant more to be delivered by a Dask expert who can explain things if needed.

You call dask.compute on Delayed objects, not functions, and you can also call it on lists of Delayed objects. So calling it on sums and counts means trying something like this (I have not actually tried it):

total_delays = sum(dask.compute(*sums))
n_flights = sum(dask.compute(*counts))
mean = total_delays / n_flights

Waiting and calling it on mean would instead be:

total_delays = sum(sums)
n_flights = sum(counts)
mean = total_delays / n_flights
mean, *_ = dask.compute(mean)
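
To make the difference between these two variants concrete, here is a minimal sketch that swaps the tutorial's CSV files for small in-memory DataFrames. The `make_frame` helper and its data are made up for illustration and are not part of the tutorial.

```python
import dask
import pandas as pd


@dask.delayed
def make_frame(seed):
    # Hypothetical stand-in for reading one CSV file
    return pd.DataFrame({
        "Origin": ["EWR", "JFK", "EWR", "LGA"],
        "DepDelay": [seed, 2 * seed, 3 * seed, 4 * seed],
    })


sums, counts = [], []
for seed in (1, 2, 3):
    by_origin = make_frame(seed).groupby("Origin")
    sums.append(by_origin.DepDelay.sum())
    counts.append(by_origin.DepDelay.count())

# Variant 1: compute the intermediates, then combine with plain pandas.
# Each compute call runs its own graph, so the frames are built twice.
early_sums = dask.compute(*sums)      # tuple of pandas Series
early_counts = dask.compute(*counts)  # tuple of pandas Series
mean_early = sum(early_sums) / sum(early_counts)

# Variant 2: keep everything lazy and compute once at the end
lazy_mean = sum(sums) / sum(counts)   # still a Delayed object
(mean_late,) = dask.compute(lazy_mean)

print(mean_late)
```

Both variants give the same numbers. The second hands Dask the whole computation in one graph, so the shared "read" steps only run once.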

For the experiment with delaying the call to sum, it means trying:

total_delays = dask.delayed(sum)(sums)
n_flights = dask.delayed(sum)(counts)
mean, *_ = dask.compute(total_delays / n_flights)
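
As a rough sketch of what changes in the graph, the example below compares the built-in sum with a delayed sum on a few Delayed integers. The `load` helper is a hypothetical placeholder for a per-file partial result.

```python
import dask


@dask.delayed
def load(x):
    # Hypothetical stand-in for a per-file partial result
    return x


parts = [load(i) for i in (1, 2, 3)]

# Built-in sum on Delayed objects: Python evaluates 0 + parts[0] + parts[1] + ...
# so each "+" becomes its own lazy add step, forming a chain
chained = sum(parts)

# Delayed sum: the whole list is handed to a single delayed call to sum
single = dask.delayed(sum)(parts)

# Both are lazy; the results agree, only the graph shape differs
chained_result, single_result = dask.compute(chained, single)
print(chained_result, single_result)

# To look at the graphs (needs the optional graphviz dependency):
# chained.visualize(filename="chained.png")
# single.visualize(filename="single.png")
```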

Hope that helps.

Thanks, but I want to add that not everyone has a Dask expert nearby to ask. Your company also mentions that you have about 10% of the market while others have more, so at this stage wouldn’t it be wiser to try to foster more Dask experts? I.e., I’m not saying I can become a super-expert, but I’m willing to bet I can get good enough to contribute on the forum if you provide solutions for all tutorials, rather than relying on learners to seek out one of the few Dask experts in the world, who are probably in super-high demand and wouldn’t be able to answer every question here or on GitHub. Technically I think you guys are very competent, but at this stage in your company’s journey I would suggest keeping everything as simple as possible for learners. Who knows, you might start to see more and more Dask semi-experts who can at least sing your praises and bring more users into the ecosystem. Do you know what I mean? I don’t think I was being rude, only pointing out an error in how you are trying to increase the uptake of Dask, but your use of the imperative is actually quite rude and your supervisor needs to read this.

Also, for this one, I think a comment would help, something like: "We are delaying pd.read_csv() because it is an I/O operation, and while it runs no compute-bound tasks are being executed, which wastes CPU cycles. That is why it is better to delay it, so one worker can do I/O while other workers perform the CPU operations of sum, count, and groupby (which is an O(n) operation that goes through the entire dataframe row by row and bins by origin)." Otherwise someone with zero background has no idea why you are delaying the reading of the CSVs.

This is very much something that Dask contributors consider. We’re not talking about a company here, but rather an open source community that is trying to foster the use of Dask. At one point, tutorial sessions were held regularly, allowing anyone to train with a Dask contributor or maintainer. There are still some videos, like the one by @jacobtomlinson. Coiled also offers free tutorials very regularly! Tutorials are also held at conferences such as the PyData ones or JupyterCon.

I’m sorry to hear that, I was trying to be polite and measured, my apologies if I didn’t achieve it.

Actually, I think it would be more accurate to say that we are delaying the call so that the IO operations on each file can be executed concurrently. At this point, though, I think you could discuss this directly with the Dask maintainers on the GitHub issue tracker of the tutorial if you’d like, or even propose a pull request to improve it; that would be very welcome.
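
To illustrate the concurrency point, here is a small sketch that uses time.sleep as a stand-in for waiting on a file read. The `fake_read` function and the file names are made up, and the thread count is an arbitrary choice for the example.

```python
import time

import dask


def fake_read(name):
    # Hypothetical stand-in for pd.read_csv: sleeping mimics waiting on disk
    time.sleep(0.25)
    return len(name)


names = ["a.csv", "bb.csv", "ccc.csv", "dddd.csv"]  # made-up file names

# Eager: each "read" finishes before the next one starts
start = time.perf_counter()
eager = [fake_read(n) for n in names]
eager_elapsed = time.perf_counter() - start

# Delayed: build all the tasks first, then let the threaded scheduler
# run the independent reads at the same time
lazy = [dask.delayed(fake_read)(n) for n in names]
start = time.perf_counter()
parallel = dask.compute(*lazy, scheduler="threads", num_workers=4)
parallel_elapsed = time.perf_counter() - start

print(f"eager: {eager_elapsed:.2f}s, delayed: {parallel_elapsed:.2f}s")
```

Because the four reads do not depend on each other, the scheduler is free to overlap their waiting time. The eager loop has to wait for each one in turn.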