[Best Practice] Set index on a DataFrame prior Join operations

Pirognoe · March 16, 2022, 5:24pm

Hello team.

I am very pleased by Coiled blog with step-by-step instructions on how to better perform commonly seen operations like indexing the DF or joining 2 DFs like in the articles below:
here and there

I reckon dict.fromkeys() is used to remove duplicates from the divisions list, however afaik using set() is more explicit and thus more pythonic.

A brief investigation on performance gains/losses is below:

import timeit
from pprint import pprint

mylist = ["nowplaying", "PBS", "PBS",  "nowplaying", "job",
          "debate", "thenandnow", "nowplaying", "PBS", "PBS",
          "nowplaying", "job", "debate", "thenandnow",]

coiled_div = list(dict.fromkeys(mylist))
uniq_div_via_set = list(set(mylist))

setup = "mylist = [u'nowplaying', u'PBS', u'PBS', u'nowplaying',\
                   u'job', u'debate', u'thenandnow',\
                   u'nowplaying', u'PBS', u'PBS', u'nowplaying',\
                   u'job', u'debate', u'thenandnow']"

fromk = timeit.timeit("list(dict.fromkeys(mylist))", setup=setup)
un_set = timeit.timeit("list(set(mylist))", setup=setup)

print( f"{fromk=:0.3f} sec, \n{un_set=:0.3f} sec"
       , coiled_div
       , uniq_div_via_set
       , sep="\n" )

pprint(dict.fromkeys(mylist),)

Produces the following result

fromk=0.468 sec,
un_set=0.331 sec

So the questions is - which option is considered better when extracting unique divisions for a stand-alone index setting?

Thanks a lot in advance for your time and attention.

scharlottej13 · March 16, 2022, 6:58pm

Hi @Pirognoe! Ah I see, you’re referring to the snippets like this (from this post):

# use set index to get divisions
dask_divisions = large.set_index("id").divisions
unique_divisions = list(dict.fromkeys(list(dask_divisions)))
 
# apply set index to both dataframes
large_sorted = large.set_index("id", divisions=unique_divisions)
also_large_sorted = also_large.set_index("id", divisions=unique_divisions)
large_join = large_sorted.merge(
    also_large_sorted, 
    how="left", 
    left_index=True, 
    right_index=True
)

In this context, I agree using set() is more explicit and also appears to be faster. Since this is a general Python question, I’d be curious to hear from others on the forum .

gjoseph92 · March 18, 2022, 12:31am

dict.fromkeys will preserve the order of the original input, just with duplicates removed. set won’t.

Since divisions need to be in sorted order—and you already know that dask_divisions (in the snippet @scharlottej13 shared) are in sorted order—dict.fromkeys(...) is more efficient than sorted(set(...)).

If the list of divisions is not already in sorted order (indeed, like in the example you gave), then by all means, use sorted(set(...)) instead of sorted(dict.fromkeys(...)).

Pirognoe · March 18, 2022, 12:19pm

Thanks Gabe, @gjoseph92

You have mentioned a valid point I have not considered before. I would imagine the dict keys order is preserved for Python version 3.6 and higher → other way you need to refrain to OrderedDict.

Meanwhile I have modified my test snippet to take into account the sorting pre-condition and still it looks like set beats the dict.

import timeit
from pprint import pprint

mylist = [
    "nowplaying",
    "PBS",
    "PBS",
    "nowplaying",
    "job",
    "debate",
    "thenandnow",
    "nowplaying",
    "PBS",
    "PBS",
    "nowplaying",
    "job",
    "debate",
    "thenandnow",
]
for _ in range(5):
    mylist.extend(mylist)

mylist.sort()

coiled_div = list(dict.fromkeys(mylist))
uniq_div_via_set = sorted(list(set(mylist)))
setup = """\
mylist = [u'nowplaying', u'PBS', u'PBS', u'nowplaying', u'job', u'debate', u'thenandnow',\
          u'nowplaying', u'PBS', u'PBS', u'nowplaying', u'job', u'debate', u'thenandnow']
          
for _ in range(5):
    mylist.extend(mylist)

mylist.sort()          
          
"""

fromk = timeit.timeit("list(dict.fromkeys(mylist))", setup=setup)
un_set = timeit.timeit("sorted(list(set(mylist)))", setup=setup)

print(
    f"{fromk=:0.3f} sec, \n{un_set=:0.3f} sec", coiled_div, uniq_div_via_set, sep="\n"
)

pprint(
    dict.fromkeys(mylist),
)

Produces the following results:

fromk=5.834 sec, 
un_set=3.206 sec
['PBS', 'debate', 'job', 'nowplaying', 'thenandnow']
['PBS', 'debate', 'job', 'nowplaying', 'thenandnow']
{'PBS': None,
 'debate': None,
 'job': None,
 'nowplaying': None,
 'thenandnow': None}

Topic		Replies	Views
Filter df based on indices of other dfs Dask DataFrame	1	385	August 24, 2023
Align a secondary DataFrame to use the same workers and index structure as a primary DataFrame Dask DataFrame	6	50	January 30, 2025
Please explain sorting Dask DataFrame delayed , distributed	5	1682	May 17, 2023
Dataframe indexes Dask DataFrame	3	860	June 16, 2022
Dask very slow with simple processing of large parquet file Dask DataFrame	2	1731	August 29, 2022

[Best Practice] Set index on a DataFrame prior Join operations

Related topics