PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Tuesday, November 1, 2022

[FIXED] How to improve performance of dataframe slices matching?

 November 01, 2022     match, numpy, pandas, performance, python     No comments   

Issue

I need to improve the performance of the following dataframe slices matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.

My 2 dataframes:

>>>df1
  trips sequence
0   11  a
1   11  d
2   21  d
3   21  a
4   31  a
5   31  b
6   31  c

>>>df2
  trips sequence
0   12  a
1   12  d
2   22  c
3   22  b
4   22  a
5   32  a
6   32  d

Expected output:

['11 match 12']

This is the following code I' m using:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})

df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values,route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue

Despite working, this is very time costly and unefficient as my real dataframes are longer. Any suggestions?


Solution

You can aggregate each trip as tuple with groupby.agg, then merge the two outputs to identify the identical routes:

out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence'
              )

output:

   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32

If you only want the first match, drop_duplicates the output of df2 aggregation to prevent unnecessary merging:

out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence'
              )

output:

   trips_x sequence  trips_y
0       11   (a, d)       12


Answered By - mozway
Answer Checked By - Candace Johnson (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing