Wednesday, November 2, 2022

[FIXED] How to remove double quotes and ';' from header in PySpark

apache-spark, csv, file, pyspark, python

Issue

I am trying to remove the double quotes (") and the ; delimiter from the header of my CSV file in PySpark. The data in the CSV looks like this:

age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

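The code I am using is:

from pyspark.sql import functions as F

# Read with ';' as the delimiter, then strip double quotes from every column's values
df = spark.read.options(delimiter=';').csv("C:/Project_bankdata.csv", header=True)
df1 = df.select([F.regexp_replace(c, '"', '').alias(c) for c in df.columns])
df1.show(10, truncate=False)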

Output:

|"age;""job""   |""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""|""y"""|
+---------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|58;management  |married    |tertiary     |no         |2143       |yes        |no      |unknown    |5      |may      |261         |1           |-1       |0           |unknown     |no    |

I am able to get rid of the quotes in the data, but not in the header. How can I remove the double quotes from the header as well?


Solution

I was only able to reproduce your output if I used this input CSV:

"age;""job"";""marital"";""education"";""default"";""balance"";""housing"";""loan"";""contact"";""day"";""month"";""duration"";""campaign"";""pdays"";""previous"";""poutcome"";""y"""
"58;"management"";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

You can read the CSV as a text file, remove every double quote (") from each line, and then build a DataFrame from the result.

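# Read the file as plain text, one RDD element per line
rdd = spark.sparkContext.textFile(r"C:\temp\temp.csv")
# Strip every double quote, then split each line on the ';' delimiter
rdd = rdd.map(lambda line: line.replace('"', '').split(';'))

# The first row holds the column names: use it as the schema and drop it from the data
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)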

df.show()
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# |age|       job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# | 58|management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
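To see what the map and filter steps do without a Spark session, the same per-line transformation can be sketched in plain Python. The two sample lines are copied from the question; the helper name split_clean is only an illustration and is not part of the answer's code.

```python
# Plain-Python sketch of the per-line cleanup (no Spark required).
# Sample lines copied from the question; split_clean is a hypothetical helper.
lines = [
    'age;"job";"marital";"education";"default";"balance";"housing";"loan";'
    '"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
    '58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;'
    '"may";261;1;-1;0;"unknown";"no"',
]


def split_clean(line, sep=";"):
    """Drop every double quote, then split the line on the delimiter."""
    return line.replace('"', "").split(sep)


rows = [split_clean(line) for line in lines]

# The first row holds the column names; every other row is data.
header = rows[0]
data = [row for row in rows if row != header]

print(header)
print(data[0])
```

In this sketch every value stays a string (for example "2143"), so numeric columns such as balance would still need an explicit cast afterwards.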

Note: This strips the string quoting from the CSV file entirely, so it only works well if none of the values contain a ; inside them.
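The caveat can be illustrated with a made-up line whose quoted value contains a ; character. The sample line below is hypothetical and does not come from the question's data. For contrast, the sketch also parses the same line with Python's standard csv module, which respects quoting; that is a different technique from the one used in the answer, shown only for comparison.

```python
import csv

# Hypothetical line: the second field contains the delimiter inside quotes.
line = '1;"Smith; John";"no"'

# Approach from the answer: strip the quotes, then split on ';'.
naive = line.replace('"', "").split(";")

# Quote-aware parsing with the standard library, for comparison.
aware = next(csv.reader([line], delimiter=";", quotechar='"'))

print(naive)  # ['1', 'Smith', ' John', 'no']  (one field too many)
print(aware)  # ['1', 'Smith; John', 'no']
```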



Answered By - ZygD
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)