PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Monday, August 29, 2022

[FIXED] How to correct the pandas script to edit csv file - removing single quotes, adding double ones, deleting the unnecessary columns

 August 29, 2022     csv, data-science, pandas, python     No comments   

Issue

Started learning pandas and maybe got lost with it so just need some assistance.

I am trying to automate a process for editing a csv file. I am receiving unsorted ones and trying to get them ready to go for some bulk updates.

So, what I need to do is: 1.Rename the needed columns (which I did okay) 2.Replace existing double quotes for two single ones in title column 3.Add double quotes to all the titles (that's first column) 4.drop columns (not sure what's best - del or df.drop) 5.Save changes to a new file somewhere defined

This is the code I have tried now and it worked with:

import pandas as pd

df = pd.read_csv('Sheet1.csv')
droped = df.drop(df.columns[[2,3,4,5,6,7,8,9,10]],axis = 1,inplace=True)
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
swapped = renamed[[0]].str.replace(r'\"', r"\''")
updated = swapped.update('"' + df.columns[[0]].astype(str) + '"')

print(renamed.head())

These quotes are headache to me now - some titles have double quotes inside - I need to remove them and then add double quotes around every title.

First three lines work fine, I am able to drop the unnecessary columns and rename the remaining ones, but last two operations are not well written. I am trying to figure out googling around but no luck so far.

Example for this quote swap:

Now:Banini Movie" Teather Desired:"Banini Movie'' Theater"

Error:

Traceback (most recent call last):
  File "Desktop/testing_scripts/bulk_script.py", line 6, in <module>
    swapped = renamed[[0]].str.replace(r'\"', r"\''")
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"

Solution

To answer your question, you just have a typo.

# Here, you rename your columns:
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
# Here, you try to access the column by its old name...
renamed[[0]]
> None of [Int64Index([0], dtype='int64')] are in the [columns]
# This error is correct, only 'title' and 'product_id' are in 'columns'.

# What it could be:
renamed['title']

# Or to access by col #:
renamed.iloc[:, 0]

  1. Reading in Specific Columns:
from io import StringIO 
# StringIO just allows me to treat a string like a file, 
# you'll use your file name in read_csv instead.
import pandas as pd

file = '0,1,2,3,4\nsr,ddpi,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,np.log(pop15),sr,ddpi,np.log(dpi)\nnp.log(dpi),pop75,np.log(pop15),sr,ddpi\nddpi,np.log(dpi),pop75,np.log(pop15),sr\n'

print(pd.read_csv(StringIO(file)))
print(pd.read_csv(StringIO(file), usecols=[0,1]))

Output, we can see that by using the keyword usecols we can filter for the columns we want when reading the file, instead of having to drop them later:

               0              1              2              3              4
0             sr           ddpi    np.log(dpi)          pop75  np.log(pop15)
1  np.log(pop15)             sr           ddpi    np.log(dpi)          pop75
2          pop75  np.log(pop15)             sr           ddpi    np.log(dpi)
3    np.log(dpi)          pop75  np.log(pop15)             sr           ddpi
4           ddpi    np.log(dpi)          pop75  np.log(pop15)             sr

               0              1
0             sr           ddpi
1  np.log(pop15)             sr
2          pop75  np.log(pop15)
3    np.log(dpi)          pop75
4           ddpi    np.log(dpi)

Also, an alternative method of 'dropping' columns, when what your really want is to just keep a couple columns is doing:

df = pd.read_csv(StringIO(file))
df = df[['0', '1']]
print(df)
               0              1
0             sr           ddpi
1  np.log(pop15)             sr
2          pop75  np.log(pop15)
3    np.log(dpi)          pop75
4           ddpi    np.log(dpi)

  1. Renaming columns when you know their order:
# Personally, I find these methods to be simpler for the case you've presented~

df = pd.read_csv(StringIO(file), usecols=[0,1])
df.columns = ['title', 'product_id']

# OR
# If your file doesn't have a header, then remove 'header=0'
df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])

print(df)
           title     product_id
0             sr           ddpi
1  np.log(pop15)             sr
2          pop75  np.log(pop15)
3    np.log(dpi)          pop75
4           ddpi    np.log(dpi)

  1. Changing up the quotes, and putting it all together:
file = '0,1,2,3,4\nsr,Banini Movie" Teather,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,Banini Movie" Teather,sr,ddpi,np.log(dpi)\nBanini Movie" Teather,pop75,Banini Movie" Teather,sr,ddpi\nddpi,np.log(dpi),pop75,Banini Movie" Teather,sr\n'

df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])
print(df)
                   title             product_id
0                     sr  Banini Movie" Teather
1          np.log(pop15)                     sr
2                  pop75  Banini Movie" Teather
3  Banini Movie" Teather                  pop75
4                   ddpi            np.log(dpi)
# To do it to one column:
df['title'] = df['title'].str.replace('"', "''")
df['title'] = '"' + df['title'] + '"'
print(df)
                      title             product_id
0                      "sr"  Banini Movie" Teather
1           "np.log(pop15)"                     sr
2                   "pop75"  Banini Movie" Teather
3  "Banini Movie'' Teather"                  pop75
4                    "ddpi"            np.log(dpi)
# To do it to all columns:
for col in df:
    df[col] = df[col].str.replace('"', "''")
    df[col] = '"' + df[col] + '"'

print(df)
                      title                product_id
0                      "sr"  "Banini Movie'' Teather"
1           "np.log(pop15)"                      "sr"
2                   "pop75"  "Banini Movie'' Teather"
3  "Banini Movie'' Teather"                   "pop75"
4                    "ddpi"             "np.log(dpi)"


Answered By - BeRT2me
Answer Checked By - Timothy Miller (PHPFixing Admin)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing