Issue
This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+--------+--------+
| C1     | C2     |
+--------+--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+--------+--------+
to
+--------+--------+--------+--------+--------+
| C1     | www    | xxx    | yyy    | zzz    |
+--------+--------+--------+--------+--------+
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
+--------+--------+--------+--------+--------+
How does one achieve this in PySpark? Presence is marked with 1 and absence with 0.
Solution
Yes, you will need pivot. For the aggregation, it is simplest to use F.first(F.lit(1)), which marks every (C1, C2) combination that occurs with a 1; the combinations that never occur come out as null, so you then replace those with 0 using df.fillna(0).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

# Pivot C2 into columns; F.first(F.lit(1)) puts a 1 wherever a (C1, C2) pair occurs.
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
# Pairs that never occurred are null after the pivot; replace them with 0.
df = df.fillna(0)
df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# |  B|  1|  1|  1|  0|
# |  A|  0|  1|  1|  1|
# +---+---+---+---+---+
Answered By: ZygD. Answer Checked By: Robin (PHPFixing Admin).