Issue
This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+--------+--------+
| C1     | C2     |
+--------+--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+--------+--------+
to
+--------+--------+--------+--------+--------+
| C1     | www    | xxx    | yyy    | zzz    |
+--------+--------+--------+--------+--------+
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
+--------+--------+--------+--------+--------+
How does one achieve this in PySpark? Presence is marked with 1 and absence with 0.
Solution
Yes, you will need pivot. For the aggregation, it is simplest to use F.first(F.lit(1)), which marks every (C1, C2) combination that occurs with a 1; the combinations that never occur come out as null, so you then replace those with 0 using df.fillna(0).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

# Pivot C2 into columns; F.first(F.lit(1)) puts a 1 wherever a (C1, C2) pair occurs.
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
# Pairs that never occurred are null after the pivot; replace them with 0.
df = df.fillna(0)
df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# |  B|  1|  1|  1|  0|
# |  A|  0|  1|  1|  1|
# +---+---+---+---+---+
Answered By: ZygD. Answer Checked By: Robin (PHPFixing Admin).