PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Tuesday, November 1, 2022

[FIXED] How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

 November 01, 2022     dataframe, pivot, pyspark     No comments   

Issue

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however has to be populated with the sum of the "cost" column.

Data frame before group-by and pivot:

+----+-----------+-----------+-----------+-----------+
| id | quantity  | cost      | activity  | category  | 
+----+-----------+-----------+-----------+-----------+
|  1 |         2 |         2 |    skiing |   outdoor |
|  2 |         0 |         2 |  swimming |  outdoor  |
+----+-----------+-----------+-----------+-----------+

pivot code:

pivotDF = df.groupBy("category").pivot("activity").sum("quantity")

result:

    +----+-----------+-----------+-----------+
    | id | category  | skiing    | swimming  | 
    +----+-----------+-----------+-----------+
    |  1 |   outdoor |    2      |   5       |
    |  2 |   outdoor |    4      |   7       |
    +----+-----------+-----------+-----------+

The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.

Any help would be much appreciated.


Solution

You can provide more than 1 aggregation after the pivot.

Let's say the input dataframe looks like the following

# +---+---+----+--------+-------+
# | id|qty|cost|     act|    cat|
# +---+---+----+--------+-------+
# |  1|  2|   2|  skiing|outdoor|
# |  2|  0|   2|swimming|outdoor|
# |  3|  1|   2|  skiing|outdoor|
# |  4|  2|   4|swimming|outdoor|
# +---+---+----+--------+-------+

Do a pivot and use agg() to provide more than 1 aggregation.

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'), 
        func.sum('qty').alias('qty')
        ). \
    show()

# +---+-------+-----------+----------+-------------+------------+
# | id|    cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# |  2|outdoor|       null|      null|            2|           0|
# |  1|outdoor|          2|         2|         null|        null|
# |  3|outdoor|          2|         1|         null|        null|
# |  4|outdoor|       null|      null|            4|           2|
# +---+-------+-----------+----------+-------------+------------+

Notice the field names. Pyspark automatically assigned the suffix based on the alias provided in the aggregations. Use a drop or select to retain the columns required and rename them per your choice.


Removing id from the groupBy makes the result much better.

data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'), 
        func.sum('qty').alias('qty')
        ). \
    show()

# +-------+-----------+----------+-------------+------------+
# |    cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor|          4|         3|            6|           2|
# +-------+-----------+----------+-------------+------------+


Answered By - samkart
Answer Checked By - Clifford M. (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing