Issue
which integrates all the equipment losses in the Ukraine war 2022.
Table format is something like this
country,equipment_type,destroyed,abandoned,captured,damaged,type_total,row_id,Date
Ukraine,Aircraft,2,0,0,0,2,,2022-02-24
Russia,Tanks,1,0,0,0,1,,2022-02-24
Russia,Armoured Fighting Vehicles,3,0,2,0,5,,2022-02-24
Russia,Infantry Fighting Vehicles,0,0,1,0,1,,2022-02-24
Russia,Mine-resistant ambush protected,0,0,1,0,1,,2022-02-24
Russia,Infantry Mobility Vehicles,1,0,0,1,2,,2022-02-24
Russia,Engineering Vehicles,1,0,2,0,3,,2022-02-24
Russia,Surface-to-air missile systems,1,0,0,0,1,,2022-02-24
Russia,Aircraft,2,0,0,0,2,,2022-02-24
Russia,Helicopters,1,0,0,1,2,,2022-02-24
Russia,All Types,0,0,0,0,0,,2022-02-24
Russia,"Trucks, Vehicles and Jeeps",1,0,6,2,9,,2022-02-24
Ukraine,Tanks,1,0,0,0,1,,2022-02-24
Ukraine,Armoured Fighting Vehicles,0,1,1,0,2,,2022-02-24
Ukraine,,0,0,0,0,0,,2022-02-24
Ukraine,Infantry Fighting Vehicles,0,3,2,0,5,,2022-02-24
Ukraine,Armoured Personnel Carriers,0,1,1,0,2,,2022-02-24
Ukraine,Infantry Mobility Vehicles,1,0,0,0,1,,2022-02-24
Ukraine,Self-Propelled Artillery,0,0,2,0,2,,2022-02-24
Ukraine,Surface-to-air missile systems,1,0,0,0,1,,2022-02-24
Ukraine,Radars,3,0,0,2,5,,2022-02-24
Ukraine,All Types,0,0,0,0,0,,2022-02-24
Ukraine,,0,0,0,0,0,,2022-02-24
Ukraine,"Trucks, Vehicles and Jeeps",5,0,11,0,16,,2022-02-24
Ukraine,Surface-To-Air Missile Systems,3,0,0,0,3,,2022-02-25
As you can see, there is a column for country and other for equipment type. What I want to do is to add new rows for each date like this
frame.append(["Rus/Uk","Aircraft", Rus Aircraft Destroyed / Uk aircraft destroyed, Rus Aircraft Abandoned / Uk aircraft Abandoned, Rus Aircraft Captured / Uk aircraft captured, Rus Aircraft Damaged / Uk aircraft Damaged, Rus Aircraft type_total / Uk aircraft type_total, ,"2022-02-24"]
one for each equipment_type
. I tested a shitty block as a proof of concept (yeah, I know, nested loops for adding Panda Rows is guilty of dismemberment), but I couldn't figure out how to make it work
listtype = frame["equipment_type"].unique().tolist()
listdates = frame["Date"].unique().tolist()
df2 = frame.copy(deep=True)
for date in listdates:
for tipo in listtype:
provisionalrow = {'Country' : 'Russia/Ukraine',
'equipment_type' : tipo,
'destroyed': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].destroyed / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].destroyed,
'abandoned': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].abandoned / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].abandoned,
'captured': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].captured / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].captured,
'abandoned': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].damaged / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].damaged,
'type_total': frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].type_total / frame[(frame["Date"] == date) & (frame["country"] == "Russia") & (frame["equipment_type"] == tipo)].type_total,
'Date': date,
'country_and_type': "Russia/Ukraine's " + tipo}
# ,
# ignore_index=True
print(provisionalrow)
#call stackoverflow
I'm totally stuck and appreciate any hint in order to define a not-so-arcane lambda operation which could filter and then produce new rows dividing russian losses by ukrainian losses per each equipment type
Thanks in advance!
Solution
I would attempt this like this
import pandas as pd
import io
data = pd.read_csv(io.StringIO(csv)) # using your input csv from the question
(data
.dropna(subset=["equipment_type"])
.drop(columns=["row_id"])
.pivot(index=["Date", "country"], columns="equipment_type")
.style.format(na_rep="-", precision=0)
)
It creates a quite wide table - with two levels of columns - destroyed,abandoned,captured,damaged,type_total
is the first level and the equipment type is the second level. The combined effect is that you have all column combinations of loss type and equipment.
This worked with the small sample of data you've shown in the question.
Maybe I'd personally look at the data like this, with loss type as an index level - to create a less wide table. I'd use this view while looking at the data while deciding how to clean it up - should some equipment types be merged together or removed?
(data
.dropna(subset=["equipment_type"])
.drop(columns=["row_id"])
.pivot(index=["Date", "country"], columns="equipment_type")
.rename_axis(columns=['loss_type', 'equipment_type'])
.stack('loss_type')
.style.format(na_rep="-", precision=0)
)
And it's your call how to proceed - maybe parse the date column so that it's a proper date, and decide if date, country etc should be a regular column or an index.
I fetched the full dataset and made some incomplete data cleaning like this:
import pandas as pd
dates = pd.date_range("2022-02-24", "2022-06-03")
base_url = "https://raw.githubusercontent.com/leedrake5/Russia-Ukraine/main/data/byType/{}.csv"
datas = []
for _, date in zip(range(1000), dates):
url = base_url.format(date.strftime('%Y-%m-%d'))
print("fetch", url)
datas.append(pd.read_csv(url, index_col=0).reset_index(drop=True))
data = pd.concat(datas, axis=0, ignore_index=True)
(data
# You'd better look closer at these duplicates and why they exist
.drop_duplicates(subset=["Date", "country", "equipment_type"])
.dropna(subset=["equipment_type"])
.loc[lambda df: df.equipment_type != "\n"]
#.set_index(["Date", "country"])
# At this point - after set_index I'd investigate the data if there are duplicates
#
.drop(columns=["row_id"])
.assign(equipment_type = lambda df: df.equipment_type.str.title().str.strip())
.pivot(index=["Date", "country"], columns="equipment_type")
.rename_axis(columns=['loss_type', 'equipment_type'])
.stack('loss_type')
.style.format(na_rep="-", precision=0)
)
The resulting displayed dataframe is too large to show in this answer, but it looks something like this:
Regarding the division or ratio, your question is a bit unclear on the goal.
But with these well made indexes we can just use broadcasting to do the division (assign the previous result as wide_data
)
wide_data.xs("Ukraine", level="country").fillna(0) / wide_data.xs("Russia", level="country")
And we get a table with the ratio of losses for one country over the other. How to handle missing data - you'll have to decide.
Answered By - creanion Answer Checked By - Terry (PHPFixing Volunteer)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.