Issue
Started learning pandas and maybe got lost with it so just need some assistance.
I am trying to automate a process for editing a csv file. I am receiving unsorted ones and trying to get them ready to go for some bulk updates.
So, what I need to do is: 1.Rename the needed columns (which I did okay) 2.Replace existing double quotes for two single ones in title column 3.Add double quotes to all the titles (that's first column) 4.drop columns (not sure what's best - del or df.drop) 5.Save changes to a new file somewhere defined
This is the code I have tried now and it worked with:
import pandas as pd
df = pd.read_csv('Sheet1.csv')
droped = df.drop(df.columns[[2,3,4,5,6,7,8,9,10]],axis = 1,inplace=True)
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
swapped = renamed[[0]].str.replace(r'\"', r"\''")
updated = swapped.update('"' + df.columns[[0]].astype(str) + '"')
print(renamed.head())
These quotes are headache to me now - some titles have double quotes inside - I need to remove them and then add double quotes around every title.
First three lines work fine, I am able to drop the unnecessary columns and rename the remaining ones, but last two operations are not well written. I am trying to figure out googling around but no luck so far.
Example for this quote swap:
Now:Banini Movie" Teather Desired:"Banini Movie'' Theater"
Error:
Traceback (most recent call last):
File "Desktop/testing_scripts/bulk_script.py", line 6, in <module>
swapped = renamed[[0]].str.replace(r'\"', r"\''")
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
Solution
To answer your question, you just have a typo.
# Here, you rename your columns:
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
# Here, you try to access the column by its old name...
renamed[[0]]
> None of [Int64Index([0], dtype='int64')] are in the [columns]
# This error is correct, only 'title' and 'product_id' are in 'columns'.
# What it could be:
renamed['title']
# Or to access by col #:
renamed.iloc[:, 0]
- Reading in Specific Columns:
from io import StringIO
# StringIO just allows me to treat a string like a file,
# you'll use your file name in read_csv instead.
import pandas as pd
file = '0,1,2,3,4\nsr,ddpi,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,np.log(pop15),sr,ddpi,np.log(dpi)\nnp.log(dpi),pop75,np.log(pop15),sr,ddpi\nddpi,np.log(dpi),pop75,np.log(pop15),sr\n'
print(pd.read_csv(StringIO(file)))
print(pd.read_csv(StringIO(file), usecols=[0,1]))
Output, we can see that by using the keyword usecols
we can filter for the columns we want when reading the file, instead of having to drop them later:
0 1 2 3 4
0 sr ddpi np.log(dpi) pop75 np.log(pop15)
1 np.log(pop15) sr ddpi np.log(dpi) pop75
2 pop75 np.log(pop15) sr ddpi np.log(dpi)
3 np.log(dpi) pop75 np.log(pop15) sr ddpi
4 ddpi np.log(dpi) pop75 np.log(pop15) sr
0 1
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
Also, an alternative method of 'dropping' columns, when what your really want is to just keep a couple columns is doing:
df = pd.read_csv(StringIO(file))
df = df[['0', '1']]
print(df)
0 1
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
- Renaming columns when you know their order:
# Personally, I find these methods to be simpler for the case you've presented~
df = pd.read_csv(StringIO(file), usecols=[0,1])
df.columns = ['title', 'product_id']
# OR
# If your file doesn't have a header, then remove 'header=0'
df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])
print(df)
title product_id
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
- Changing up the quotes, and putting it all together:
file = '0,1,2,3,4\nsr,Banini Movie" Teather,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,Banini Movie" Teather,sr,ddpi,np.log(dpi)\nBanini Movie" Teather,pop75,Banini Movie" Teather,sr,ddpi\nddpi,np.log(dpi),pop75,Banini Movie" Teather,sr\n'
df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])
print(df)
title product_id
0 sr Banini Movie" Teather
1 np.log(pop15) sr
2 pop75 Banini Movie" Teather
3 Banini Movie" Teather pop75
4 ddpi np.log(dpi)
# To do it to one column:
df['title'] = df['title'].str.replace('"', "''")
df['title'] = '"' + df['title'] + '"'
print(df)
title product_id
0 "sr" Banini Movie" Teather
1 "np.log(pop15)" sr
2 "pop75" Banini Movie" Teather
3 "Banini Movie'' Teather" pop75
4 "ddpi" np.log(dpi)
# To do it to all columns:
for col in df:
df[col] = df[col].str.replace('"', "''")
df[col] = '"' + df[col] + '"'
print(df)
title product_id
0 "sr" "Banini Movie'' Teather"
1 "np.log(pop15)" "sr"
2 "pop75" "Banini Movie'' Teather"
3 "Banini Movie'' Teather" "pop75"
4 "ddpi" "np.log(dpi)"
Answered By - BeRT2me Answer Checked By - Timothy Miller (PHPFixing Admin)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.