
Sunday, October 23, 2022

[FIXED] How to update two columns in PySpark satisfying the same condition?

October 23, 2022 apache-spark, apache-spark-sql, azure-databricks, pyspark, sql-update

Issue

I have a table with four columns: "ID", "FLAG_A", "FLAG_B", "FLAG_C". I want to translate the SQL query below into PySpark. Two conditions must hold, and when they do, both "FLAG_A" and "FLAG_B" should be updated. How can I do this in PySpark?

UPDATE STATUS_TABLE SET STATUS_TABLE.[FLAG_A] = "JAVA", 
STATUS_TABLE.FLAG_B = "PYTHON"
WHERE (((STATUS_TABLE.[FLAG_A])="PROFESSIONAL_CODERS") AND 
((STATUS_TABLE.FLAG_C) Is Null));

Is it possible to write this as a single PySpark statement that checks both conditions and updates both the "FLAG_A" and "FLAG_B" columns?

Solution

I can't think of a way to rewrite this as the single statement you had in mind. I tried running the UPDATE query directly in Spark, but UPDATE does not seem to be supported:

: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.

The following does exactly the same as your UPDATE query:

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'PROFESSIONAL_CODERS', 'X', None),
     (2, 'KEEP', 'KEEP', 'KEEP')],
    ['ID', 'FLAG_A', 'FLAG_B', 'FLAG_C'])

Script:

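# Both updates share one condition: FLAG_A is 'PROFESSIONAL_CODERS' and FLAG_C is null.
cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
# Update FLAG_B first: cond reads FLAG_A, so FLAG_A must still hold its original value here.
df = df.withColumn('FLAG_B', F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')))
df = df.withColumn('FLAG_A', F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')))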

df.show()
# +---+------+------+------+
# | ID|FLAG_A|FLAG_B|FLAG_C|
# +---+------+------+------+
# |  1|  JAVA|PYTHON|  null|
# |  2|  KEEP|  KEEP|  KEEP|
# +---+------+------+------+

Answered By - ZygD

Answer Checked By - Mary Flores (PHPFixing Volunteer)

[FIXED] How to translate MS Access UPDATE query which uses inner join into PySpark?

October 23, 2022 apache-spark, azure-databricks, join, pyspark, sql-update

Issue

I have two MS Access SQL queries which I want to convert into PySpark. The queries look like this (we have two tables Employee and Department):

UPDATE EMPLOYEE INNER JOIN [DEPARTMENT] ON
EMPLOYEE.STATEPROVINCE = [DEPARTMENT].[STATE_LEVEL] 
SET EMPLOYEE.STATEPROVINCE = [DEPARTMENT]![STATE_ABBREVIATION];

UPDATE EMPLOYEE INNER JOIN [DEPARTMENT] ON
EMPLOYEE.STATEPROVINCE = [DEPARTMENT].[STATE_LEVEL] 
SET EMPLOYEE.MARKET = [DEPARTMENT]![MARKET];

Solution

Test dataframes:

from pyspark.sql import functions as F

df_emp = spark.createDataFrame([(1, 'a'), (2, 'bb')], ['EMPLOYEE', 'STATEPROVINCE'])
df_emp.show()
# +--------+-------------+
# |EMPLOYEE|STATEPROVINCE|
# +--------+-------------+
# |       1|            a|
# |       2|           bb|
# +--------+-------------+

df_dept = spark.createDataFrame([('bb', 'b')], ['STATE_LEVEL', 'STATE_ABBREVIATION'])
df_dept.show()
# +-----------+------------------+
# |STATE_LEVEL|STATE_ABBREVIATION|
# +-----------+------------------+
# |         bb|                 b|
# +-----------+------------------+

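Running your SQL query in Microsoft Access replaces STATEPROVINCE with the matching STATE_ABBREVIATION wherever the join finds a match, and leaves all other rows unchanged.

In PySpark, you can get the same result like this: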

df = (df_emp.alias('a')
    .join(df_dept.alias('b'), df_emp.STATEPROVINCE == df_dept.STATE_LEVEL, 'left')
    .select(
        *[c for c in df_emp.columns if c != 'STATEPROVINCE'],
        F.coalesce('b.STATE_ABBREVIATION', 'a.STATEPROVINCE').alias('STATEPROVINCE')
    )
)
df.show()
# +--------+-------------+
# |EMPLOYEE|STATEPROVINCE|
# +--------+-------------+
# |       1|            a|
# |       2|            b|
# +--------+-------------+

First, do a left join, which keeps every row of df_emp whether or not it has a match in df_dept. Then apply select.

The select has two parts:

First, select every column from df_emp except "STATEPROVINCE".
Then, for the new "STATEPROVINCE", take "STATE_ABBREVIATION" from df_dept; where it is null (i.e. the row has no match in df_dept), fall back to "STATEPROVINCE" from df_emp.

For your second query, you only need to change the column names in the select statement:

df = (df_emp.alias('a')
    .join(df_dept.alias('b'), df_emp.STATEPROVINCE == df_dept.STATE_LEVEL, 'left')
    .select(
        *[c for c in df_emp.columns if c != 'MARKET'],
        F.coalesce('b.MARKET', 'a.MARKET').alias('MARKET')
    )
)

Answered By - ZygD

Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

Monday, August 1, 2022

[FIXED] How to connect and read files from Azure FTP folder using Python in Azure Databricks?

August 01, 2022 azure-databricks, databricks, ftp, python

Issue

I need to use Python in Azure Databricks to do the following:

Merge multiple text files stored in an Azure FTP folder (\VMAZR1\ABCDFiles). Here, 'VMAZR1' is the server name and 'ABCDFiles' is the folder name.
Store the merged file in the same location under a new name.

I can write the code for the merging, but I need help connecting to the Azure FTP folder and reading only the names of the text files. Can someone please assist?

Solution

You can rely on this answer. Just change the method from storing to retrieving, e.g., retrbinary or retrlines, and use mlsd to get a list of file names.
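
A minimal sketch of that approach with Python's standard ftplib is shown below. It assumes the server speaks plain FTP (an FTPS endpoint would need ftplib.FTP_TLS instead). The helper names, the .txt filter, the credentials, and the output file name are all hypothetical.

```python
import io
from ftplib import FTP, error_perm


def list_text_files(ftp, folder):
    """Return the sorted names of .txt files in `folder` on an open connection."""
    ftp.cwd(folder)
    try:
        # MLSD yields (name, facts) pairs; facts['type'] is 'file' for regular files.
        names = [name for name, facts in ftp.mlsd() if facts.get('type') == 'file']
    except error_perm:
        # Servers that reject MLSD can still return bare names via NLST.
        names = ftp.nlst()
    return sorted(n for n in names if n.lower().endswith('.txt'))


def read_text_file(ftp, name, encoding='utf-8'):
    """Download one file into memory with RETR and decode it."""
    chunks = []
    ftp.retrbinary(f'RETR {name}', chunks.append)
    return b''.join(chunks).decode(encoding)


def merge_text_files(ftp, folder):
    """Concatenate all .txt files in `folder`, separated by newlines."""
    return '\n'.join(read_text_file(ftp, n) for n in list_text_files(ftp, folder))


# Hypothetical usage; host, credentials and output name are placeholders:
# with FTP('VMAZR1') as ftp:
#     ftp.login(user='<user>', passwd='<password>')
#     merged = merge_text_files(ftp, 'ABCDFiles')
#     ftp.storbinary('STOR merged.txt', io.BytesIO(merged.encode('utf-8')))
```

The functions take an already opened connection so that the listing and download logic can be exercised separately from the connection details.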

Answered By - Phuri Chalermkiatsakul

Answer Checked By - Willingham (PHPFixing Volunteer)
