PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0
Showing posts with label pandas-groupby. Show all posts
Showing posts with label pandas-groupby. Show all posts

Tuesday, November 1, 2022

[FIXED] How to keep original index of a DataFrame after groupby 2 columns?

 November 01, 2022     dataframe, indexing, pandas, pandas-groupby, python     No comments   

Issue

Is there any way I can retain the original index of my large dataframe after I perform a groupby? The reason I need to this is because I need to do an inner merge back to my original df (after my groupby) to regain those lost columns. And the index value is the only 'unique' column to perform the merge back into. Does anyone know how I can achieve this?

My DataFrame is quite large. My groupby looks like this:

df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()

This drops my original indexes from my original dataframe, which I want to keep.


Solution

I think you are are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')


Answered By - Scott Boston
Answer Checked By - Robin (PHPFixing Admin)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

[FIXED] How can I pivot a dataframe?

 November 01, 2022     group-by, pandas, pandas-groupby, pivot, python     No comments   

Issue

  • What is pivot?
  • How do I pivot?
  • Is this a pivot?
  • Long format to wide format?

I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

... But I'm going to give it a go.


The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)

Look a few examples from my Google Search

  1. How to pivot a dataframe in Pandas?
  • Good question and answer. But the answer only answers the specific question with little explanation.
  1. pandas pivot table to data frame
  • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
  1. pandas pivoting a dataframe, duplicate rows
  • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


Setup

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Question(s)

  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape

  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

     col   col0   col1   col2   col3  col4
     row
     row0  0.77  0.605    NaN  0.860  0.65
     row2  0.13    NaN  0.395  0.500  0.25
     row3   NaN  0.310    NaN  0.545   NaN
     row4   NaN  0.100  0.395  0.760  0.24
    
  3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

     col   col0   col1   col2   col3  col4
     row
     row0  0.77  0.605  0.000  0.860  0.65
     row2  0.13  0.000  0.395  0.500  0.25
     row3  0.00  0.310  0.000  0.545  0.00
     row4  0.00  0.100  0.395  0.760  0.24
    
  4. Can I get something other than mean, like maybe sum?

     col   col0  col1  col2  col3  col4
     row
     row0  0.77  1.21  0.00  0.86  0.65
     row2  0.13  0.00  0.79  0.50  0.50
     row3  0.00  0.31  0.00  1.09  0.00
     row4  0.00  0.10  0.79  1.52  0.24
    
  5. Can I do more that one aggregation at a time?

            sum                          mean
     col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
     row
     row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
     row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
     row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
     row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  6. Can I aggregate over multiple value columns?

           val0                             val1
     col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
     row
     row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
     row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
     row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
     row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  7. Can Subdivide by multiple columns?

     item item0             item1                         item2
     col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
     row
     row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
     row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
     row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  8. Or

     item      item0             item1                         item2
     col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
     key  row
     key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
          row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
          row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
          row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
     key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
          row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
          row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
          row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
     key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
          row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
          row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  9. Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

     col   col0  col1  col2  col3  col4
     row
     row0     1     2     0     1     1
     row2     1     0     2     1     2
     row3     0     1     0     2     0
     row4     0     1     2     2     1
    
  10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
    df2
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected should look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  11. How do I flatten the multiple index to single index after pivot?

    From

       1  2
       1  1  2
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To

       1|1  2|1  2|2
    a    2    1    1
    b    2    1    0
    c    1    0    0
    

Solution

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

  1. pd.DataFrame.groupby + pd.DataFrame.unstack

    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table

    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack

    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot

    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab

    • This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount

    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot

    • I use this for cleverly performing cross tabulation.

Examples

What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value

    • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

          df.pivot_table(
              values='val0', index='row', columns='col',
              fill_value=0, aggfunc='mean')
      
          col   col0   col1   col2   col3  col4
          row
          row0  0.77  0.605  0.000  0.860  0.65
          row2  0.13  0.000  0.395  0.500  0.25
          row3  0.00  0.310  0.000  0.545  0.00
          row4  0.00  0.100  0.395  0.760  0.24
      
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc='mean').fillna(0)
    

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='sum')
    
      col   col0  col1  col2  col3  col4
      row
      row0  0.77  1.21  0.00  0.86  0.65
      row2  0.13  0.00  0.79  0.50  0.50
      row3  0.00  0.31  0.00  1.09  0.00
      row4  0.00  0.10  0.79  1.52  0.24
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc='sum').fillna(0)
    

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc=[np.size, np.mean])
    
           size                      mean
      col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
      row
      row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
      row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
      row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
      row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
    

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

      df.pivot_table(
          values=['val0', 'val1'], index='row', columns='col',
          fill_value=0, aggfunc='mean')
    
            val0                             val1
      col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
      row
      row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
      row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
      row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
      row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
    

Question 7

Can Subdivide by multiple columns?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns=['item', 'col'],
          fill_value=0, aggfunc='mean')
    
      item item0             item1                         item2
      col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
      row
      row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
      row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
      row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  • pd.DataFrame.groupby

      df.groupby(
          ['row', 'item', 'col']
      )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 8

Can Subdivide by multiple columns?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index=['key', 'row'], columns=['item', 'col'],
          fill_value=0, aggfunc='mean')
    
      item      item0             item1                         item2
      col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
      key  row
      key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
           row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
           row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
           row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
      key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
           row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
           row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
           row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
      key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
           row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
           row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  • pd.DataFrame.groupby

      df.groupby(
          ['key', 'row', 'item', 'col']
      )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    
  • pd.DataFrame.set_index because the set of keys are unique for both rows and columns

      df.set_index(
          ['key', 'row', 'item', 'col']
      )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

  • pd.DataFrame.pivot_table

      df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
          col   col0  col1  col2  col3  col4
      row
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(df['row'], df['col'])
    
  • pd.factorize + np.bincount

      # get integer factorization `i` and unique values `r`
      # for column `'row'`
      i, r = pd.factorize(df['row'].values)
      # get integer factorization `j` and unique values `c`
      # for column `'col'`
      j, c = pd.factorize(df['col'].values)
      # `n` will be the number of rows
      # `m` will be the number of columns
      n, m = r.size, c.size
      # `i * m + j` is a clever way of counting the
      # factorization bins assuming a flat array of length
      # `n * m`.  Which is why we subsequently reshape as `(n, m)`
      b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
      # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
      pd.DataFrame(b, r, c)
    
            col3  col2  col0  col1  col4
      row3     2     0     0     1     0
      row2     1     2     1     0     2
      row0     1     0     1     2     1
      row4     2     2     0     1     1
    
  • pd.get_dummies

      pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
            col0  col1  col2  col3  col4
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1
    

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

  • DataFrame.pivot

    The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

      df2.insert(0, 'count', df2.groupby('A').cumcount())
      df2
    
         count  A   B
      0      0  a   0
      1      1  a  11
      2      2  a   2
      3      3  a  11
      4      0  b  10
      5      1  b  10
      6      2  b  14
      7      0  c   7
    

    The second step is to use the newly created column as the index to call DataFrame.pivot.

      df2.pivot(*df2)
      # df2.pivot(index='count', columns='A', values='B')
    
      A         a     b    c
      count
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN
    
  • DataFrame.pivot_table

    Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

      df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
    
      A         a     b    c
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN
    

Question 11

How do I flatten the multiple index to single index after pivot

If columns type object with string join

df.columns = df.columns.map('|'.join)

else format

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)


Answered By - piRSquared
Answer Checked By - Gilberto Lyons (PHPFixing Admin)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

[FIXED] How can I pivot a dataframe?

 November 01, 2022     group-by, pandas, pandas-groupby, pivot, python     No comments   

Issue

  • What is pivot?
  • How do I pivot?
  • Is this a pivot?
  • Long format to wide format?

I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

... But I'm going to give it a go.


The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)

Look a few examples from my Google Search

  1. How to pivot a dataframe in Pandas?
  • Good question and answer. But the answer only answers the specific question with little explanation.
  1. pandas pivot table to data frame
  • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
  1. pandas pivoting a dataframe, duplicate rows
  • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


Setup

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Question(s)

  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape

  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

     col   col0   col1   col2   col3  col4
     row
     row0  0.77  0.605    NaN  0.860  0.65
     row2  0.13    NaN  0.395  0.500  0.25
     row3   NaN  0.310    NaN  0.545   NaN
     row4   NaN  0.100  0.395  0.760  0.24
    
  3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

     col   col0   col1   col2   col3  col4
     row
     row0  0.77  0.605  0.000  0.860  0.65
     row2  0.13  0.000  0.395  0.500  0.25
     row3  0.00  0.310  0.000  0.545  0.00
     row4  0.00  0.100  0.395  0.760  0.24
    
  4. Can I get something other than mean, like maybe sum?

     col   col0  col1  col2  col3  col4
     row
     row0  0.77  1.21  0.00  0.86  0.65
     row2  0.13  0.00  0.79  0.50  0.50
     row3  0.00  0.31  0.00  1.09  0.00
     row4  0.00  0.10  0.79  1.52  0.24
    
  5. Can I do more that one aggregation at a time?

            sum                          mean
     col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
     row
     row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
     row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
     row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
     row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  6. Can I aggregate over multiple value columns?

           val0                             val1
     col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
     row
     row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
     row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
     row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
     row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  7. Can Subdivide by multiple columns?

     item item0             item1                         item2
     col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
     row
     row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
     row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
     row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  8. Or

     item      item0             item1                         item2
     col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
     key  row
     key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
          row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
          row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
          row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
     key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
          row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
          row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
          row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
     key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
          row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
          row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  9. Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

     col   col0  col1  col2  col3  col4
     row
     row0     1     2     0     1     1
     row2     1     0     2     1     2
     row3     0     1     0     2     0
     row4     0     1     2     2     1
    
  10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
    df2
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected should look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  11. How do I flatten the multiple index to single index after pivot?

    From

       1  2
       1  1  2
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To

       1|1  2|1  2|2
    a    2    1    1
    b    2    1    0
    c    1    0    0
    

Solution

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

  1. pd.DataFrame.groupby + pd.DataFrame.unstack

    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table

    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack

    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot

    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab

    • This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount

    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot

    • I use this for cleverly performing cross tabulation.

Examples

What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value

    • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

          df.pivot_table(
              values='val0', index='row', columns='col',
              fill_value=0, aggfunc='mean')
      
          col   col0   col1   col2   col3  col4
          row
          row0  0.77  0.605  0.000  0.860  0.65
          row2  0.13  0.000  0.395  0.500  0.25
          row3  0.00  0.310  0.000  0.545  0.00
          row4  0.00  0.100  0.395  0.760  0.24
      
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc='mean').fillna(0)
    

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='sum')
    
      col   col0  col1  col2  col3  col4
      row
      row0  0.77  1.21  0.00  0.86  0.65
      row2  0.13  0.00  0.79  0.50  0.50
      row3  0.00  0.31  0.00  1.09  0.00
      row4  0.00  0.10  0.79  1.52  0.24
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc='sum').fillna(0)
    

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc=[np.size, np.mean])
    
           size                      mean
      col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
      row
      row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
      row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
      row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
      row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(
          index=df['row'], columns=df['col'],
          values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
    

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

      df.pivot_table(
          values=['val0', 'val1'], index='row', columns='col',
          fill_value=0, aggfunc='mean')
    
            val0                             val1
      col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
      row
      row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
      row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
      row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
      row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
    

Question 7

Can Subdivide by multiple columns?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index='row', columns=['item', 'col'],
          fill_value=0, aggfunc='mean')
    
      item item0             item1                         item2
      col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
      row
      row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
      row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
      row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  • pd.DataFrame.groupby

      df.groupby(
          ['row', 'item', 'col']
      )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 8

Can Subdivide by multiple columns?

  • pd.DataFrame.pivot_table

      df.pivot_table(
          values='val0', index=['key', 'row'], columns=['item', 'col'],
          fill_value=0, aggfunc='mean')
    
      item      item0             item1                         item2
      col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
      key  row
      key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
           row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
           row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
           row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
      key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
           row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
           row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
           row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
      key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
           row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
           row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  • pd.DataFrame.groupby

      df.groupby(
          ['key', 'row', 'item', 'col']
      )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    
  • pd.DataFrame.set_index because the set of keys are unique for both rows and columns

      df.set_index(
          ['key', 'row', 'item', 'col']
      )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

  • pd.DataFrame.pivot_table

      df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
          col   col0  col1  col2  col3  col4
      row
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1
    
  • pd.DataFrame.groupby

      df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
    
  • pd.crosstab

      pd.crosstab(df['row'], df['col'])
    
  • pd.factorize + np.bincount

      # get integer factorization `i` and unique values `r`
      # for column `'row'`
      i, r = pd.factorize(df['row'].values)
      # get integer factorization `j` and unique values `c`
      # for column `'col'`
      j, c = pd.factorize(df['col'].values)
      # `n` will be the number of rows
      # `m` will be the number of columns
      n, m = r.size, c.size
      # `i * m + j` is a clever way of counting the
      # factorization bins assuming a flat array of length
      # `n * m`.  Which is why we subsequently reshape as `(n, m)`
      b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
      # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
      pd.DataFrame(b, r, c)
    
            col3  col2  col0  col1  col4
      row3     2     0     0     1     0
      row2     1     2     1     0     2
      row0     1     0     1     2     1
      row4     2     2     0     1     1
    
  • pd.get_dummies

      pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
            col0  col1  col2  col3  col4
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1
    

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

  • DataFrame.pivot

    The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

      df2.insert(0, 'count', df2.groupby('A').cumcount())
      df2
    
         count  A   B
      0      0  a   0
      1      1  a  11
      2      2  a   2
      3      3  a  11
      4      0  b  10
      5      1  b  10
      6      2  b  14
      7      0  c   7
    

    The second step is to use the newly created column as the index to call DataFrame.pivot.

      df2.pivot(*df2)
      # df2.pivot(index='count', columns='A', values='B')
    
      A         a     b    c
      count
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN
    
  • DataFrame.pivot_table

    Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

      df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
    
      A         a     b    c
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN
    

Question 11

How do I flatten the multiple index to single index after pivot

If columns type object with string join

df.columns = df.columns.map('|'.join)

else format

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)


Answered By - piRSquared
Answer Checked By - Mary Flores (PHPFixing Volunteer)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

Friday, October 7, 2022

[FIXED] How to deal with statsmodel.api OLS efficiency

 October 07, 2022     dataframe, pandas, pandas-groupby, statistics, statsmodels     No comments   

Issue

I have a df consist of columns: industry, comp_id, year, y and x1, x2, x3, ... The data looks like this:

   indus  comp_id  year      y   x1    x2     x3  other_var
0      A      100  2000   90.0    9  44.0   95.0         66
1      A      100  2001   59.0   65   4.0   63.0         52
2      A      100  2002   75.0   49  29.0   64.0         27
3      A      100  2003   83.0   29  86.0   48.0         57
4      A      100  2004   82.0   64  43.0   67.0         58
5      A      100  2005   88.0   30  54.0   75.0         29
6      A      100  2006   91.0    3  46.0  100.0         68
7      A      101  2005    NaN   65  53.0  100.0         57
8      A      101  2006   83.0   56  67.0   60.0         40
9      A      101  2006  112.0  100  97.0   59.0         54
10     A      101  2007   53.0   95   NaN    3.0         55
11     A      101  2007  113.0   28  77.0   96.0         30
12     A      101  2008   84.0    2  79.0   57.0         54
13     A      101  2008   53.0   19  12.0    NaN         62
14     B      102  2002   83.0   92  88.0   31.0         31
15     B      102  2003  102.0    7  67.0   96.0         28
16     B      103  2003    NaN   49   NaN   72.0         22
17     B      103  2004    NaN   60   NaN   52.0         69
18     B      103  2005    NaN   74   NaN   44.0         49
19     B      103  2006    NaN   21   NaN   31.0         38
20     B      103  2007    NaN   69   NaN   36.0         28
21     B      103  2008    NaN    5   NaN    6.0         32
22     B      103  2009    NaN   22   NaN   10.0         61

I need to run a regression by group of company over year and return the new column with residual. My code is:

def func_reg_err(df, yvar, xvar, alpha=True):
    y = df[yvar].copy()
    x = pd.DataFrame(df[xvar].copy())
    if alpha == True:
        x['intercept'] = 1.
    mod = sm.OLS(y,x, missing='drop')
    res = mod.fit()
    err = y - mod.predict(res.params, x)
    return err

To run regression by group, I use the following code:

df['residual'] = df.groupby('comp_id', group_keys=False).apply(func_reg_err, 'y', ['x1', 'x2', 'x3'], False)

If the data is nice then nothing happens, the code work smoothly. However, in my df, there are many companies with Missing data (entire of a column) or companies with too few observation (only 1 or 2). In the sample, they are companies with comp_id = 102, 1003. One has only 2 observations, and the other has too many missing value. Those group of companies make my code keep returning error.

I could work around by create a copy of df, then manually filter those disqualify companies, run func_reg_err and then merge back with df again. However, I see this method is just a work around solution. I am looking for an efficiency way to deal with this.

What I want is for those groups that are not qualified, the column residual will return with np.Nan


Solution

You could define a minimum acceptable sample length (either as a total or for one/all regressor columns) at your helper function, for instance:

def func_reg_err(df, yvar, xvar, alpha=True, min_samples=30):
    # Return NaNs only
    if len(df) < min_samples or (df.notnull().sum(1) < min_samples).any():
        return pd.Series(index=df.index)

    # Carry on with your regression
    y = df[yvar].copy()
    x = pd.DataFrame(df[xvar].copy())
    if alpha == True:
        x['intercept'] = 1.
    mod = sm.OLS(y,x, missing='drop')
    res = mod.fit()
    err = y - mod.predict(res.params, x)

    return err


Answered By - fsl
Answer Checked By - Terry (PHPFixing Volunteer)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

Wednesday, August 17, 2022

[FIXED] How can I customize the output of this .groupby operation done on this DataFrame in Python?

 August 17, 2022     dataframe, output, pandas, pandas-groupby, python     No comments   

Issue

I am working with a DataFrame to create a frequency distribution by counting the three types of values in one column. In this example, I'm counting and displaying each person's "personal status". When I execute the code, all of the other columns are displayed with the count repeated in each column. I'd like the count of each value to be displayed once without a column heading. What do I need to do to accomplish this?

creditData.groupby(['Personal_Status']).count()

Here's an image of my output: Current Output

Edit: Here's what I'd like the output to look like: Desired Output


Solution

What's recommended in the documentation is to use Named aggregation

import pandas as pd
animals = pd.DataFrame(
     {
         "kind": ["cat", "dog", "cat", "dog"],
         "height": [9.1, 6.0, 9.5, 34.0],
         "weight": [7.9, 7.5, 9.9, 198.0],
     }
 )

animals.groupby('kind').agg(**{
    '':('height','count')
})

This will get you

kind    
cat 2
dog 2

For reference https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html (search for named aggregation)



Answered By - Bertrand
Answer Checked By - Senaida (PHPFixing Volunteer)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

Tuesday, August 2, 2022

[FIXED] How to modify code to scrape data off of 2nd table on this webpage

 August 02, 2022     beautifulsoup, dataframe, html-table, pandas-groupby, python-3.8     No comments   

Issue

I am trying to scrape data from a table on the following website: https://www.eliteprospects.com/league/nhl/stats/2021-2022

This is the code I found to successfully scrape off data from the first table for skater stats:

import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1,10):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)

But I am having difficulty scraping off the goalie stats from the bottom table. Any idea how to modify the code to get the stats from the bottom table? I tried changing line 13 to "(".goalie-stats")" but it returned an error when I tried to run the code.

Thank you!!


Solution

I found a way to get the data, but it isn't perfect. When I get it, it makes a lot of unnamed columns. Still, it gets the data, so I hope it's helpful

import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1,3):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort-goalie-stats=svp&page-goalie={page}#goalies"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".goalie-stats")).replace('%', ''))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)


Answered By - Jet_Mouse
Answer Checked By - Candace Johnson (PHPFixing Volunteer)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Older Posts Home

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
All Comments
Atom
All Comments

Copyright © PHPFixing