
Tuesday, November 1, 2022

[FIXED] How to create subindex efficiently?

 November 01, 2022     pandas, performance, python, running-count

Issue

I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:

      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01
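
For anyone following along, the example frame can be reconstructed like this (a minimal sketch; keeping Date as plain strings is an assumption, since the question never shows the dtypes):

import pandas as pd

df = pd.DataFrame(
    {'Content': ['birthday', 'school', 'shopping', 'work', 'holiday'],
     'Date': ['2010.03.01', '2010.04.01', '2010.02.01', '2010.09.01', '2010.10.01']},
    index=pd.Index(['Bob', 'Bob', 'Tom', 'Tom', 'Tom'], name='ID'),
)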

I'd like to create a subindex within each ID, so that the resulting dataframe looks like this:

               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01

To do this I first need to create my subindex list. I searched the help documentation, and it seems the neatest way is to use transform:

import numpy as np

subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))

However, it is really slow. I looked around and found that apply can do the job too:

subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
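
Since apply returns one array per group here (as the question describes below), the pieces have to be joined back into one flat sequence; a minimal sketch of that step (np.concatenate is my choice, not something the question specifies):

subindex = np.concatenate(
    df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1)).values
)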

Of course, the subindex needs to be flattened, as sketched above, since the result here is a list of arrays. This works much faster than the transform method. Then I tested a for loop of my own:

subindex_size = df.groupby(df.index, sort=False).size()
subindex = []
for size in subindex_size:  # one iteration per ID group, in order of appearance
    subindex.extend(np.arange(1, size + 1))

It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 secs on my computer, apply takes ~2 secs, and the for loop takes only ~1 sec. I need to work on much larger datasets, so even the time difference between apply and the for loop matters to me. However, the for loop looks ugly and may not be easy to adapt if I need to create other group-based variables.
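
For reference, a comparison like the one above could be reproduced along these lines (a sketch using the standard time module; the numbers quoted in the question are the author's own measurements, not regenerated here):

import time

start = time.perf_counter()
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
print(f'transform: {time.perf_counter() - start:.3f} s')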

So my question is: why are the built-in functions that are supposed to do this slower? Am I missing something here, or is there a reason for this? Is there any other way to improve this process?


Solution

You can use cumcount to do this:

In [11]: df.groupby(level=0).cumcount()
Out[11]: 
ID
Bob    0
Bob    1
Tom    0
Tom    1
Tom    2
dtype: int64

In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.

In [13]: df.set_index('subindex', append=True)
Out[13]: 
               Content        Date
ID  subindex                      
Bob 0         birthday  2010.03.01
    1           school  2010.04.01
Tom 0         shopping  2010.02.01
    1             work  2010.09.01
    2          holiday  2010.10.01

To start at 1 (rather than 0) just add 1 to the result of cumcount.
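
Putting the answer together, the full pattern (with the optional +1 so numbering starts at 1) would be:

# number rows within each ID group, starting at 1
df['subindex'] = df.groupby(level=0).cumcount() + 1

# move the new column into the index alongside ID
df = df.set_index('subindex', append=True)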



Answered By - Andy Hayden
Answer Checked By - Mary Flores (PHPFixing Volunteer)