Issue

I am trying to take the input and keyword in the two tables from the database. So am using pandas to read both the tables and using the respective columns for splitting up of data and then write back the output in the same table in DB.

My input:

Original_Input

LARIDENT SRL
MIZUHO Corporation Gosen Factory
ZIMMER MANUFACTURING BV
GALT MEDICAL CORP
MIZUHO AMERICA INC
AVENT S de RL de CV
LUV N CARE LTD
STERIS ISOMEDIX PUERTO RICO INC
MEDISTIM INC
Cadence Science Inc
TECHNOLOGIES SA
AMG Mdicale Co Inc

My keyword table:

**Name_Extension**                 **Company_Type**        **Priority**
  co llc                             Company LLC                2
  Pvt ltd                            Private Limited            8
  Corp                               Corporation                4
  CO Ltd                             Company Limited            3
  inc                                Incorporated               5
  CO                                 Company                    1
  ltd                                Limited                    7
  llc                                LLC                        6
  Corporation                        Corporation                4
  & Co                               Company                    1
  Company Limited                    Company Limited            3
  Limited                            Limited                    7
  Co inc                             Company Incorporated       9
  AB                                  AB                        10
  SA                                  SA                        11
  S A                                 SA                        11
  GmbH                                GmbH                      12
  Sdn Bhd                             Sdn Bhd                   13
  llp                                 LLP                       14
  co llp                              LLP                       14
  SA DE CV                           SA DE CV                   19
  Company                            Company                    1 
  Coinc                              Company Incorporated       9
  Coltd                              Company Limited            3

So if the input(in table 1) has any of the name extension(this is in table 2) then it has to be split and put in as Core_input and Type_input columns where core input will contain the company names and type_input will contain the company type(from table 2 column 2) and it has to be checked with the priority.

My output will be:

Core_Input                                         Type_input
    NULL                                               NULL
    NULL                                               NULL
    NULL                                               NULL
   GALT MEDICAL                                    Corporation
   MIZUHO AMERICA                                   Incorporated
     NULL                                               NULL
   LUV N CARE                                         Limited
 STERIS ISOMEDIX PUERTO RICO                         Incorporated
    MEDISTIM                                         Incorporated
   Cadence Science                                   Incorporated

My Code:

k1=[]
k2=[]

df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)

for inp1 in df1['Name_Extension']:
    k1.append(inp1.strip())

for inp2 in df1['Company_Type']:
    k2.append(inp2.strip())


p=1
p1=max(df1['Priority'])

for k1 in df1['Name_Extension']:
    for k2 in df1['Company_Type']:
      #for pr in df1['Priority']:
         for i in df['Cleansed_Input']:
            while p<=p1:
                if re.search(r'[^>]*?\s'+str(k1).strip(),str(i).strip(),re.I) and (p == (pr for pr in 
                                                                               df1['Priority'])):
                    splits = i.str.split(str(k1),re.I)

                    df['Core_Input'] = splits[0] #df['Cleansed_Input'].str.replace(str(k1),'',re.I) 

                    df['Type_input'] = str(k2)
                 p=p+1
data.to_sql('Testtable', con=engine, if_exists='replace',index= False)

Any help is appreciated.

Edit:

df=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)

df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)

ext_list = df1['Name_Extension']
type_list =df1['Company_Type']

for i, j in df.iterrows():
    comp_name = df['Original_Input'][i]
    for idx, ex in enumerate(ext_list):
        if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
            df['Core_Input'] = type_list[idx]
            df['Type_input'].iloc[i] = comp_type

print(df)
df.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)


Edit:
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']

for i, j in enumerate(df['Cleansed_Input']):
    comp_name = df['Cleansed_Input'][i]

    for idx, ex in enumerate(ext_list):
        comp_name.replace('.,','')
        if re.search(rf'(\b{ex}\b)', comp_name, re.I):
            comp_type = type_list[idx]
            df['Type_input'].iloc[i]= comp_type
            # Delete the extension name from company name
            updated_comp_name = 
            re.sub(rf'(\b{str(ex).upper()}\b)','',str(comp_name).upper())
            # Above regex is leaving space post word removal adding space 
            from next word becomes 2 spaces
            updated_comp_name = str(updated_comp_name).replace('  ',' ')
            # Update the company name
            df['Core_Input'].iloc[i] = updated_comp_name

Solution

Hi Hope below lines help you to get the solution...i am not using SQL due to some reason, but taken your data in 2 different excels...you need to add a column Type in input table before run the code...

import pandas as pd
import numpy
import re

input_df = pd.read_excel('input.xlsx',sheet_name='Sheet1')
exts_df = pd.read_excel('exts.xlsx', sheet_name='Sheet1')

# Check if correct data is loaded
print(input_df.head())

ext_list = exts_df['Name_Extension']
type_list =exts_df['Company_Type']

for i, j in input_df.iterrows():
    comp_name = input_df['Company Names'][i]
    for idx, ex in enumerate(ext_list):
        if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
            comp_type = type_list[idx]
            input_df['Type'].iloc[i] = comp_type
            # Delete teh extension name from company name
            updated_comp_name = re.sub(rf'\b{str(ex).upper()}\b','',str(comp_name).upper())
            # Above regex is leaving space post word removal adding space from next word becomes 2 spaces
            updated_comp_name = str(updated_comp_name).replace('  ',' ')
            # Update the company name
            input_df['Company Names'].iloc[i] = updated_comp_name

print(input_df)
input_df.to_excel('output.xlsx', index=False)

output post removal extension from input Company Name Column mapping Company_Type ...

...

Answered By - Hietsh Kumar

Answer Checked By - Mildred Charles (PHPFixing Admin)

Saturday, July 9, 2022

[FIXED] How to split the input based by comparing two dataframes in pandas

Issue

Solution

0 Comments:

Post a Comment

Total Pageviews

Featured Post

Why Learn PHP Programming

Saturday, July 9, 2022

Issue

Solution

0 Comments:

Post a Comment

Total Pageviews

Featured Post

Why Learn PHP Programming

Subscribe To