PHPFixing
Showing posts with label databricks.

Sunday, September 18, 2022

[FIXED] How to convert a println output to a dataframe in Scala

 September 18, 2022     apache-spark, databricks, printing, python, scala     No comments   

Issue

I have this code, which generates a list by means of a for loop. I want to take the output of the println and pass it to a DataFrame so that I can manipulate the resulting data, in Scala.

for (l <- ListArchive) {
  val LastModified: (String, String) = (l, getLastModifiedLCO(l))
  println(LastModified)
}

println output:

(LCO_2014-12-09_3.XML.gz,Tue Dec 09 07:48:30 UTC 2014)
(LCO_2014-12-09_1.XML.gz,Tue Dec 09 07:48:30 UTC 2014)


Solution

Rewrite it to generate a list/sequence, and then turn that into a DataFrame. Something like this:

import spark.implicits._
val df = ListArchive.map(l => (l, getLastModifiedLCO(l)))
  .toDF("col1Name", "col2Name")

If the list is very big, you can try turning it into an RDD via parallelize and then applying a similar map to it; that way the transformation runs in a distributed manner.
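
For reference, here is a minimal sketch of that RDD-based variant, shown in PySpark for brevity (the Scala version uses sc.parallelize in the same way). The file list and the get_last_modified_lco helper are stand-ins for the ones in the question, and spark is assumed to be an existing SparkSession, as it is in a Databricks notebook:

# Minimal sketch, not the original code: distribute the file list as an RDD,
# map each entry to a (name, last-modified) pair, then convert to a DataFrame.
def get_last_modified_lco(name):
    # Stand-in for the real lookup helper from the question.
    return "Tue Dec 09 07:48:30 UTC 2014"

list_archive = ["LCO_2014-12-09_1.XML.gz", "LCO_2014-12-09_3.XML.gz"]

rdd = spark.sparkContext.parallelize(list_archive)
df = rdd.map(lambda name: (name, get_last_modified_lco(name))).toDF(["file_name", "last_modified"])
df.show()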



Answered By - Alex Ott
Answer Checked By - David Marino (PHPFixing Volunteer)

Wednesday, August 24, 2022

[FIXED] How to import a module into another module in databricks notebook?

 August 24, 2022     databricks, module, python     No comments   

Issue

This is my config.py in Databricks

DATA_S3_LOCATION='s3://server-data/data1'
DATA_S3_FILE_TYPE='orc'
DATA2_S3_LOCATION='s3://server-data/data2'
DATA2_S3_FILE_TYPE='orc'

I have an __init__.py in this folder as well.

I am trying to access these variables in another file:

import sys
sys.path.insert(1,'/Users/file')
from file import config

I am facing an error: no module named 'file'.


Solution

There are several aspects here.

  • If these files are notebooks, then you need to use %run ./config to include a notebook from the current directory (doc).
  • If you're using Databricks Repos and arbitrary files support is enabled, then your code needs to be a Python file, not a notebook, and it needs the correct directory layout with __init__.py, etc. In this case you can use normal Python imports; your repository directory is automatically added to sys.path, so you don't need to modify it (see the sketch after this list).
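
For illustration, here is a minimal sketch of the second (Repos with arbitrary files) approach. The directory and module names below are hypothetical, not taken from the question:

# Hypothetical repo layout (names are illustrative only):
#
#   my_repo/
#   ├── utils/
#   │   ├── __init__.py
#   │   └── config.py       # DATA_S3_LOCATION, DATA_S3_FILE_TYPE, ... live here
#   └── main_notebook       # notebook or .py file in the same repo
#
# With Databricks Repos and arbitrary-files support enabled, the repo root is
# already on sys.path, so a plain package import works from main_notebook:

from utils import config

print(config.DATA_S3_LOCATION)   # -> s3://server-data/data1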

P.S. I have an example repository with both the notebooks and Python files approaches.



Answered By - Alex Ott
Answer Checked By - Senaida (PHPFixing Volunteer)

Sunday, August 21, 2022

[FIXED] How to get username inside spark submit task in databricks?

 August 21, 2022     apache-spark, databricks, environment-variables, scala     No comments   

Issue

I'm trying to retrieve the user name inside a spark-submit task in Databricks, so I can write additional information to the table about the user who changed the data. Unfortunately, I'm not able to find the correct way. For now, I have tried two things:

spark.sparkContext.sparkUser

and

System.getProperty("user.name")

but they both return root. Do you have any idea how to accomplish that?


Solution

If you're using Delta Lake tables, then information about performed operations is captured in the history of the Delta Lake table - see an example in the documentation.

Databricks exposes a lot of information via spark.conf - the relevant configuration properties start with spark.databricks.clusterUsageTags., so you can filter the configuration and search for the information you need.

But you need to take into account that all operations in the job are performed under the identity of the job owner, even if the job is triggered by someone else.

There is a spark.databricks.clusterUsageTags.clusterAllTags configuration property that contains a JSON string with a list of cluster tags; these also include an Owner field with the email of the user who owns that Databricks job.
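
As a rough sketch of both steps (filtering the clusterUsageTags properties and extracting the Owner tag), shown here in Python; the exact set of tags depends on how the cluster and job are configured, so treat the Owner lookup as an assumption to verify on your own cluster:

import json

# List every property Databricks exposes under spark.databricks.clusterUsageTags.
usage_tags = {
    key: value
    for key, value in spark.sparkContext.getConf().getAll()
    if key.startswith("spark.databricks.clusterUsageTags.")
}
for key in sorted(usage_tags):
    print(key, "=", usage_tags[key])

# clusterAllTags is a JSON string, typically an array of {"key": ..., "value": ...}
# entries; turn it into a dict and look up the Owner tag mentioned above.
all_tags = json.loads(spark.conf.get("spark.databricks.clusterUsageTags.clusterAllTags"))
tags = {entry["key"]: entry["value"] for entry in all_tags}
owner = tags.get("Owner")   # email of the job owner, per the answer above
print(owner)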



Answered By - Alex Ott
Answer Checked By - Marilyn (PHPFixing Volunteer)

Monday, August 1, 2022

[FIXED] How to connect and read files from Azure FTP folder using Python in Azure Databricks?

 August 01, 2022     azure-databricks, databricks, ftp, python     No comments   

Issue

I need to use Python in Azure Databricks to do the following:

  1. Merge multiple text files stored in an Azure FTP folder (\VMAZR1\ABCDFiles). Here, 'VMAZR1' is the server name and 'ABCDFiles' is the folder name.
  2. Store the merged file in the same location with new name

I can write the code to do the merging but I need assistance with connecting to Azure FTP folder and reading text file names only. Can someone please assist?


Solution

You can rely on this answer. Just change the method from storing to retrieving, e.g., retrbinary or retrlines, and use mlsd to get the list of file names.
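
A minimal ftplib sketch of that approach; the host, folder, and credentials below are placeholders based on the question (adjust them, and the protocol, e.g. FTP_TLS, to match your Azure setup):

from ftplib import FTP
from io import BytesIO

# Placeholder connection details from the question; replace with real values.
ftp = FTP("VMAZR1")
ftp.login(user="your-user", passwd="your-password")
ftp.cwd("ABCDFiles")

# mlsd() lists directory entries with metadata; keep only plain files.
file_names = [name for name, facts in ftp.mlsd() if facts.get("type") == "file"]

# retrlines() streams each text file line by line; collect everything for the merge.
merged_lines = []
for name in file_names:
    ftp.retrlines(f"RETR {name}", merged_lines.append)

# Store the merged content back into the same folder under a new name.
ftp.storbinary("STOR merged.txt", BytesIO("\n".join(merged_lines).encode("utf-8")))
ftp.quit()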



Answered By - Phuri Chalermkiatsakul
Answer Checked By - Willingham (PHPFixing Volunteer)
Copyright © PHPFixing