[Best Practice] Set index on a DataFrame prior Join operations

Hello team.

I am very pleased by Coiled blog with step-by-step instructions on how to better perform commonly seen operations like indexing the DF or joining 2 DFs like in the articles below:
here and there

I reckon dict.fromkeys() is used to remove duplicates from the divisions list, however afaik using set() is more explicit and thus more pythonic.

A brief investigation on performance gains/losses is below:

import timeit
from pprint import pprint

mylist = ["nowplaying", "PBS", "PBS",  "nowplaying", "job",
          "debate", "thenandnow", "nowplaying", "PBS", "PBS",
          "nowplaying", "job", "debate", "thenandnow",]

coiled_div = list(dict.fromkeys(mylist))
uniq_div_via_set = list(set(mylist))

setup = "mylist = [u'nowplaying', u'PBS', u'PBS', u'nowplaying',\
                   u'job', u'debate', u'thenandnow',\
                   u'nowplaying', u'PBS', u'PBS', u'nowplaying',\
                   u'job', u'debate', u'thenandnow']"

fromk = timeit.timeit("list(dict.fromkeys(mylist))", setup=setup)
un_set = timeit.timeit("list(set(mylist))", setup=setup)

print( f"{fromk=:0.3f} sec, \n{un_set=:0.3f} sec"
       , coiled_div
       , uniq_div_via_set
       , sep="\n" )

pprint(dict.fromkeys(mylist),)

Produces the following result

fromk=0.468 sec,
un_set=0.331 sec

So the questions is - which option is considered better when extracting unique divisions for a stand-alone index setting?

Thanks a lot in advance for your time and attention.

Hi @Pirognoe! Ah I see, you’re referring to the snippets like this (from this post):

# use set index to get divisions
dask_divisions = large.set_index("id").divisions
unique_divisions = list(dict.fromkeys(list(dask_divisions)))
 
# apply set index to both dataframes
large_sorted = large.set_index("id", divisions=unique_divisions)
also_large_sorted = also_large.set_index("id", divisions=unique_divisions)
large_join = large_sorted.merge(
    also_large_sorted, 
    how="left", 
    left_index=True, 
    right_index=True
)

In this context, I agree using set() is more explicit and also appears to be faster. Since this is a general Python question, I’d be curious to hear from others on the forum :slight_smile:.

dict.fromkeys will preserve the order of the original input, just with duplicates removed. set won’t.

Since divisions need to be in sorted order—and you already know that dask_divisions (in the snippet @scharlottej13 shared) are in sorted order—dict.fromkeys(...) is more efficient than sorted(set(...)).

If the list of divisions is not already in sorted order (indeed, like in the example you gave), then by all means, use sorted(set(...)) instead of sorted(dict.fromkeys(...)).

3 Likes

Thanks Gabe, @gjoseph92

You have mentioned a valid point I have not considered before. I would imagine the dict keys order is preserved for Python version 3.6 and higher → other way you need to refrain to OrderedDict.

Meanwhile I have modified my test snippet to take into account the sorting pre-condition and still it looks like set beats the dict.

import timeit
from pprint import pprint

mylist = [
    "nowplaying",
    "PBS",
    "PBS",
    "nowplaying",
    "job",
    "debate",
    "thenandnow",
    "nowplaying",
    "PBS",
    "PBS",
    "nowplaying",
    "job",
    "debate",
    "thenandnow",
]
for _ in range(5):
    mylist.extend(mylist)

mylist.sort()

coiled_div = list(dict.fromkeys(mylist))
uniq_div_via_set = sorted(list(set(mylist)))
setup = """\
mylist = [u'nowplaying', u'PBS', u'PBS', u'nowplaying', u'job', u'debate', u'thenandnow',\
          u'nowplaying', u'PBS', u'PBS', u'nowplaying', u'job', u'debate', u'thenandnow']
          
for _ in range(5):
    mylist.extend(mylist)

mylist.sort()          
          
"""

fromk = timeit.timeit("list(dict.fromkeys(mylist))", setup=setup)
un_set = timeit.timeit("sorted(list(set(mylist)))", setup=setup)

print(
    f"{fromk=:0.3f} sec, \n{un_set=:0.3f} sec", coiled_div, uniq_div_via_set, sep="\n"
)

pprint(
    dict.fromkeys(mylist),
)

Produces the following results:

fromk=5.834 sec, 
un_set=3.206 sec
['PBS', 'debate', 'job', 'nowplaying', 'thenandnow']
['PBS', 'debate', 'job', 'nowplaying', 'thenandnow']
{'PBS': None,
 'debate': None,
 'job': None,
 'nowplaying': None,
 'thenandnow': None}