Issue
I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the pivoted activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity  | cost      | activity  | category  |
+----+-----------+-----------+-----------+-----------+
| 1  | 2         | 2         | skiing    | outdoor   |
| 2  | 0         | 2         | swimming  | outdoor   |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+-----------+-----------+-----------+
| category  | skiing    | swimming  |
+-----------+-----------+-----------+
| outdoor   | 2         | 0         |
+-----------+-----------+-----------+
The problem is that for one of these activities, I need the pivoted column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I could just swap the values in the quantity column for those in the cost column wherever the activity column matches the relevant activity (something like the sketch below). However, I can't find an example of how to do this on a PySpark dataframe.
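This is roughly what I have in mind, assuming for illustration that swimming is the activity that should take cost (untested):

from pyspark.sql import functions as F

# swap in cost for quantity wherever the activity is "swimming" (illustrative)
df2 = df.withColumn(
    "quantity",
    F.when(F.col("activity") == "swimming", F.col("cost")).otherwise(F.col("quantity"))
)
pivotDF = df2.groupBy("category").pivot("activity").sum("quantity")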
Any help would be much appreciated.
Solution
You can provide more than one aggregation after the pivot.
Let's say the input dataframe looks like the following:
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than one aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names. PySpark automatically assigns the suffix based on the alias provided in the aggregations. Use a drop or select to retain only the columns you need and rename them as you like, as shown in the sketch below.
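For instance, binding the pivot above to a variable (pivoted_sdf is a name introduced here for illustration), you could drop the two aggregations that aren't needed and rename the rest; per the question, skiing keeps the quantity sum and swimming keeps the cost sum:

# bind the pivot above to a variable instead of calling show() on it directly
pivoted_sdf = data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        )

# drop the aggregations that aren't needed and rename the survivors
pivoted_sdf. \
    drop('skiing_cost', 'swimming_qty'). \
    withColumnRenamed('skiing_qty', 'skiing'). \
    withColumnRenamed('swimming_cost', 'swimming'). \
    show()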
Removing id from the groupBy collapses the result to one row per category, which is much closer to what you want.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+
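Chaining a select onto that gives exactly the shape asked for in the question (again assuming, as an illustration, that skiing should carry the quantity sum and swimming the cost sum):

data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    select('cat',
           func.col('skiing_qty').alias('skiing'),
           func.col('swimming_cost').alias('swimming')
           ). \
    show()

# +-------+------+--------+
# |    cat|skiing|swimming|
# +-------+------+--------+
# |outdoor|     3|       6|
# +-------+------+--------+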
Answered By - samkart