PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Saturday, November 5, 2022

[FIXED] How can I filter pandas rows and build new rows from them?

 November 05, 2022     lambda, pandas, python     No comments   

Issue

I have a table here https://deepnote.com/workspace/juan-luis-chulilla-4ba37b7e-8a41-4d3a-9001-c575bfd7f57f/project/Playing-with-Oryx-data-and-timeline-f288aa01-871d-4a84-8efa-dc27c37a12f3/%2Fnotebook.ipynb

which integrates all the equipment losses in the Ukraine war 2022.

Table format is something like this

country,equipment_type,destroyed,abandoned,captured,damaged,type_total,row_id,Date
Ukraine,Aircraft,2,0,0,0,2,,2022-02-24
Russia,Tanks,1,0,0,0,1,,2022-02-24
Russia,Armoured Fighting Vehicles,3,0,2,0,5,,2022-02-24
Russia,Infantry Fighting Vehicles,0,0,1,0,1,,2022-02-24
Russia,Mine-resistant ambush protected,0,0,1,0,1,,2022-02-24
Russia,Infantry Mobility Vehicles,1,0,0,1,2,,2022-02-24
Russia,Engineering Vehicles,1,0,2,0,3,,2022-02-24
Russia,Surface-to-air missile systems,1,0,0,0,1,,2022-02-24
Russia,Aircraft,2,0,0,0,2,,2022-02-24
Russia,Helicopters,1,0,0,1,2,,2022-02-24
Russia,All Types,0,0,0,0,0,,2022-02-24
Russia,"Trucks, Vehicles and Jeeps",1,0,6,2,9,,2022-02-24
Ukraine,Tanks,1,0,0,0,1,,2022-02-24
Ukraine,Armoured Fighting Vehicles,0,1,1,0,2,,2022-02-24
Ukraine,,0,0,0,0,0,,2022-02-24
Ukraine,Infantry Fighting Vehicles,0,3,2,0,5,,2022-02-24
Ukraine,Armoured Personnel Carriers,0,1,1,0,2,,2022-02-24
Ukraine,Infantry Mobility Vehicles,1,0,0,0,1,,2022-02-24
Ukraine,Self-Propelled Artillery,0,0,2,0,2,,2022-02-24
Ukraine,Surface-to-air missile systems,1,0,0,0,1,,2022-02-24
Ukraine,Radars,3,0,0,2,5,,2022-02-24
Ukraine,All Types,0,0,0,0,0,,2022-02-24
Ukraine,,0,0,0,0,0,,2022-02-24
Ukraine,"Trucks, Vehicles and Jeeps",5,0,11,0,16,,2022-02-24
Ukraine,Surface-To-Air Missile Systems,3,0,0,0,3,,2022-02-25

As you can see, there is a column for country and other for equipment type. What I want to do is to add new rows for each date like this

frame.append(["Rus/Uk","Aircraft", Rus Aircraft Destroyed / Uk aircraft destroyed, Rus Aircraft Abandoned / Uk aircraft Abandoned, Rus Aircraft Captured / Uk aircraft captured, Rus Aircraft Damaged / Uk aircraft Damaged, Rus Aircraft type_total / Uk aircraft type_total, ,"2022-02-24"]

one for each equipment_type. I tested a shitty block as a proof of concept (yeah, I know, nested loops for adding Panda Rows is guilty of dismemberment), but I couldn't figure out how to make it work

listtype = frame["equipment_type"].unique().tolist()
listdates = frame["Date"].unique().tolist()
df2 = frame.copy(deep=True)
for date in listdates:
    for tipo in listtype:
        provisionalrow = {'Country' : 'Russia/Ukraine',
                    'equipment_type' : tipo,
                    'destroyed': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].destroyed / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].destroyed, 
                    'abandoned': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].abandoned / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].abandoned,
                    'captured': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].captured / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].captured,
                    'abandoned': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].damaged / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].damaged,
                    'type_total': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].type_total / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].type_total,
                    'Date': date,
                    'country_and_type': "Russia/Ukraine's " + tipo} 
                    # , 
                    # ignore_index=True
        print(provisionalrow)   
        #call stackoverflow

I'm totally stuck and appreciate any hint in order to define a not-so-arcane lambda operation which could filter and then produce new rows dividing russian losses by ukrainian losses per each equipment type

Thanks in advance!


Solution

I would attempt this like this

import pandas as pd
import io

data = pd.read_csv(io.StringIO(csv))  # using your input csv from the question

(data
 .dropna(subset=["equipment_type"])
 .drop(columns=["row_id"])
 .pivot(index=["Date", "country"], columns="equipment_type")
 .style.format(na_rep="-", precision=0)
)

It creates a quite wide table - with two levels of columns - destroyed,abandoned,captured,damaged,type_total is the first level and the equipment type is the second level. The combined effect is that you have all column combinations of loss type and equipment.

This worked with the small sample of data you've shown in the question.

Maybe I'd personally look at the data like this, with loss type as an index level - to create a less wide table. I'd use this view while looking at the data while deciding how to clean it up - should some equipment types be merged together or removed?

(data
 .dropna(subset=["equipment_type"])
 .drop(columns=["row_id"])
 .pivot(index=["Date", "country"], columns="equipment_type")
 .rename_axis(columns=['loss_type', 'equipment_type'])
 .stack('loss_type')
 .style.format(na_rep="-", precision=0)
)

And it's your call how to proceed - maybe parse the date column so that it's a proper date, and decide if date, country etc should be a regular column or an index.


I fetched the full dataset and made some incomplete data cleaning like this:

import pandas as pd

dates = pd.date_range("2022-02-24", "2022-06-03")
base_url = "https://raw.githubusercontent.com/leedrake5/Russia-Ukraine/main/data/byType/{}.csv"

datas = []
for _, date in zip(range(1000), dates):
    url = base_url.format(date.strftime('%Y-%m-%d'))
    print("fetch", url)
    datas.append(pd.read_csv(url, index_col=0).reset_index(drop=True))
    
data = pd.concat(datas, axis=0, ignore_index=True)


(data
 # You'd better look closer at these duplicates and why they exist
 .drop_duplicates(subset=["Date", "country", "equipment_type"])
 .dropna(subset=["equipment_type"])
 .loc[lambda df: df.equipment_type != "\n"]
 #.set_index(["Date", "country"])
 # At this point - after set_index I'd investigate the data if there are duplicates
 #
 .drop(columns=["row_id"])
 .assign(equipment_type = lambda df: df.equipment_type.str.title().str.strip())
 .pivot(index=["Date", "country"], columns="equipment_type")
 .rename_axis(columns=['loss_type', 'equipment_type'])
 .stack('loss_type')
 .style.format(na_rep="-", precision=0)
)

The resulting displayed dataframe is too large to show in this answer, but it looks something like this:

enter image description here

Regarding the division or ratio, your question is a bit unclear on the goal. But with these well made indexes we can just use broadcasting to do the division (assign the previous result as wide_data)

wide_data.xs("Ukraine", level="country").fillna(0) / wide_data.xs("Russia", level="country")

And we get a table with the ratio of losses for one country over the other. How to handle missing data - you'll have to decide.



Answered By - creanion
Answer Checked By - Terry (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing