Issue
Explanation: I need to convert a csv file to a json file. The csv holds the following information: each account has 1 to n documents to pay, one due on the current day and the others due on earlier dates. The csv file has the following structure:
account | customer_name | due_date | id_type | id_document | to_pay |
---|---|---|---|---|---|
2001 | John Doe | 05082022 | FC | 987654123 | 113,25 |
2002 | Karl Steven | 05082022 | FC | 987654124 | 113,25 |
2003 | Sarah Miller | 15022022 | FC | 987654125 | 199,75 |
2004 | Marty Mcfly | 15032022 | FC | 987654126 | 113,25 |
2001 | John Doe | 06032022 | FC | 987654999 | 150,00 |
2001 | John Doe | 15042022 | FC | 987654777 | 15,00 |
2003 | Sarah Miller | 05082022 | FC | 987444123 | 113,25 |
2001 | John Doe | 15032022 | FC | 987654123 | 113,25 |
Desired output: The output must be a multi-line json file that represents the logic mentioned above. Taking just one line of the json, pretty-printed, it must look like this:
{
    "account": "2001",
    "customer_name": "John Doe",
    "due_date": "05-08-2022",
    "id_type": "FC",
    "id_document": "987654123",
    "to_pay": "113,25",
    "pending_documents": [
        {
            "id_type": "FC",
            "id_document": "987654999",
            "to_pay": "150,00",
            "due_date": "06-03-2022"
        },
        {
            "id_type": "FC",
            "id_document": "987654777",
            "to_pay": "15,00",
            "due_date": "15-04-2022"
        },
        {
            "id_type": "FC",
            "id_document": "987654123",
            "to_pay": "113,25",
            "due_date": "15-03-2022"
        }
    ]
}
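Note that due_date is stored as ddmmyyyy in the csv but shown as dd-mm-yyyy in the desired json; a minimal conversion (just a sketch using the standard datetime module) would be:

from datetime import datetime

# "05082022" (ddmmyyyy in the csv) -> "05-08-2022" (dd-mm-yyyy in the json)
due_date = datetime.strptime("05082022", "%d%m%Y").strftime("%d-%m-%Y")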
What I have tried: Using just Python (this piece of code is not functional as-is, but it shows the concrete approach):
import codecs
import json
import re
from datetime import datetime

# input_file, pattern, line_object, cuentas_cto and output are defined
# earlier (omitted here); pattern matches the "main" line of an account
with codecs.open(input_file, 'r', encoding='utf-8', errors='ignore') as in_file:
    for line in in_file:
        if pattern.search(line):
            line_object['account'] = line[2:10]
            cuentas_cto.append(line_object['account'])
            line_object['due_date'] = datetime.strptime(line[90:98], '%d%m%Y').strftime('%Y-%m-%d')
            line_object['id_document'] = line[105:113]
            pattern_cuenta = re.compile(line_object['account'])
            # re-read the whole file to collect the other documents of this account
            file_reopen = codecs.open(input_file, 'r', encoding='utf-8', errors='ignore')
            lines_file_reopen = file_reopen.readlines()
            for inner_line in lines_file_reopen:
                if pattern_cuenta.search(inner_line) and not pattern.search(inner_line):
                    inner_object = {
                        "due_date": datetime.strptime(inner_line[90:98], '%d%m%Y').strftime('%Y-%m-%d'),
                        "id_document": inner_line[105:113]
                    }
                    line_object['pending_documents'].append(inner_object)
            file_reopen.close()
            output.append(json.dumps(line_object))
This approach is far from elegant, and although it works, it takes a massive amount of time to complete, because the whole file is re-read for every main line. That makes it unusable here: each csv file is around 150 MB with roughly 1,000,000 lines.
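For reference, a single pass over the file avoids the re-reading entirely. This is only a sketch: it assumes the file really is pipe-delimited as shown above (the slices like line[2:10] suggest the real file may actually be fixed-width) and, like the solution below, it treats the first row seen for each account as the main document.

import csv
import json

accounts = {}  # account number -> main document with its pending_documents

with open("inCSV.csv", encoding="utf-8", newline="") as in_file:
    reader = csv.DictReader(in_file, delimiter="|")
    for row in reader:
        # strip the padding spaces around names/values and drop the empty
        # trailing column produced by the final "|"
        row = {k.strip(): (v or "").strip() for k, v in row.items() if k and k.strip()}
        acc = row["account"]
        if acc not in accounts:
            # first row seen for an account becomes the main document
            accounts[acc] = dict(row, pending_documents=[])
        else:
            # every later row for the same account becomes a pending document
            accounts[acc]["pending_documents"].append(
                {f: row[f] for f in ("id_type", "id_document", "to_pay", "due_date")}
            )

with open("data-out.json", "w", encoding="utf-8") as out_file:
    for account in accounts.values():
        out_file.write(json.dumps(account) + "\n")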
Why pandas: it's really fast at loading and processing, so it looks promising. Here is a really small POC:
>>> import pandas as pd
>>> csv_file = pd.read_csv("inCSV.csv", sep="|", header=0, dtype=str)
>>> csv_file['pending_documents'] = csv_file[['id_type', 'id_document', 'due_date', 'to_pay']].apply(lambda s: s.to_dict(), axis=1)
>>> csv_file[['account', 'pending_documents']].to_json("data-out-pd.json", orient="records", lines=True)
>>>
The problem here is that I know very little about pandas and its documentation is huge, and this kind of data-wrangling task doesn't come up often in my work. It's clear that I need to study pandas. Thanks for reading. If any further information is needed, just let me know.
Solution
You can use .groupby('account') to work with every account separately. Once you have one account's group, you can take its first row ([0]) as the main document and the remaining rows ([1:]) as its pending_documents. Using .to_dict() you can build a dict for the account, and then append it to a list to collect all accounts.
Minimal working code with example data. I use io to create the file in memory, so everyone can simply copy and test it, but you can use your own filename. I use engine='python' so I can use a regex as separator, sep=r'\s+\|\s*', which also removes the spaces around the values.
data = '''account | customer_name | due_date | id_type | id_document | to_pay |
2001 | John Doe | 05082022 | FC | 987654123 | 113,25 |
2002 | Karl Steven | 05082022 | FC | 987654124 | 113,25 |
2003 | Sarah Miller | 15022022 | FC | 987654125 | 199,75 |
2004 | Marty Mcfly | 15032022 | FC | 987654126 | 113,25 |
2001 | John Doe | 06032022 | FC | 987654999 | 150,00 |
2001 | John Doe | 15042022 | FC | 987654777 | 15,00 |
2003 | Sarah Miller | 05082022 | FC | 987444123 | 113,25 |
2001 | John Doe | 15032022 | FC | 987654123 | 113,25 |'''
import io
import json
import pandas as pd

#df = pd.read_csv('filename.csv')
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python')
#print(df)

df = df.drop('Unnamed: 6', axis=1)  # remove column after last `|`
#print(df)

all_accounts = []

for selected, group in df.groupby(['account']):
    #print('---', selected, '---')
    account = group.iloc[0].to_dict()
    docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
    account["pending_documents"] = docs.to_dict(orient='records')
    #print(account)
    all_accounts.append(account)

print(json.dumps(all_accounts, indent=2))
Result:
[
  {
    "account": 2001,
    "customer_name": "John Doe",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654123,
    "to_pay": "113,25",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987654999,
        "to_pay": "150,00",
        "due_date": 6032022
      },
      {
        "id_type": "FC",
        "id_document": 987654777,
        "to_pay": "15,00",
        "due_date": 15042022
      },
      {
        "id_type": "FC",
        "id_document": 987654123,
        "to_pay": "113,25",
        "due_date": 15032022
      }
    ]
  },
  {
    "account": 2002,
    "customer_name": "Karl Steven",
    "due_date": 5082022,
    "id_type": "FC",
    "id_document": 987654124,
    "to_pay": "113,25",
    "pending_documents": []
  },
  {
    "account": 2003,
    "customer_name": "Sarah Miller",
    "due_date": 15022022,
    "id_type": "FC",
    "id_document": 987654125,
    "to_pay": "199,75",
    "pending_documents": [
      {
        "id_type": "FC",
        "id_document": 987444123,
        "to_pay": "113,25",
        "due_date": 5082022
      }
    ]
  },
  {
    "account": 2004,
    "customer_name": "Marty Mcfly",
    "due_date": 15032022,
    "id_type": "FC",
    "id_document": 987654126,
    "to_pay": "113,25",
    "pending_documents": []
  }
]
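Note that this result is close to, but not exactly, the desired output from the question: numeric-looking columns were parsed as integers and due_date is still ddmmyyyy. The following is only a sketch of the extra steps, assuming dtype=str on read, pandas' to_datetime for the date conversion, and a newline-delimited output file:

# read every column as text so account / id_document / to_pay keep their
# exact digits, then convert due_date from ddmmyyyy to dd-mm-yyyy
df = pd.read_csv(io.StringIO(data), sep=r'\s+\|\s*', engine='python', dtype=str)
df = df.drop('Unnamed: 6', axis=1)
df['due_date'] = pd.to_datetime(df['due_date'], format='%d%m%Y').dt.strftime('%d-%m-%Y')

all_accounts = []
for selected, group in df.groupby(['account']):
    account = group.iloc[0].to_dict()
    docs = group.iloc[1:][['id_type', 'id_document', 'to_pay', 'due_date']]
    account['pending_documents'] = docs.to_dict(orient='records')
    all_accounts.append(account)

# one JSON object per line, as in to_json(..., orient='records', lines=True)
with open('data-out.json', 'w', encoding='utf-8') as f:
    for account in all_accounts:
        f.write(json.dumps(account, ensure_ascii=False) + '\n')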
Answered By - furas