PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Sunday, August 28, 2022

[FIXED] How to query a dataframe optimally?

 August 28, 2022     csv, dataframe, pandas, python     No comments   

Issue

I have a csv with large number of rows.

Sample:

User_id , Marks
12    3
13    2
.     .

Marks can be in range (1,5) I want to count for every user_id The count of marks received in various ranges (1-2), (2,3), (3,4), (4,5).

I used df.query() for this but it is taking a lot of time as rows are large. Please suggest optimal way to achieve this.


Solution

I think this should work for you (Edit, I restructured the code to make it more readable):

df = pd.DataFrame({'User_id':[12, 13, 12, 13, 12, 13] , 'Marks': [3.0, 2.2, 4.9, 1.0, 3.1, 2.9]})

mark_ranges = pd.cut(
    df['Marks'],
    bins=[0,2,3,4,5],
    labels=['1-2', '2-3', '3-4', '4-5'],    
).rename('Mark Range')

result = (
    df
    .join(mark_ranges)
    .groupby(['User_id', 'Mark Range'])
    .size()
)
print(result)

output:

User_id  Mark Range
12       1-2           0
         2-3           1
         3-4           1
         4-5           1
13       1-2           1
         2-3           2
         3-4           0
         4-5           0
dtype: int64

In this example 3 (e.g. 3.0) is put in the 2-3 group, whereas anything over 3 (e.g. 3.1) is put in 3-4. In other words the upper end of the range is inclusive.



Answered By - Zach Flanders
Answer Checked By - Cary Denson (PHPFixing Admin)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing