PHPFixing
Showing posts with label databricks.

Sunday, September 18, 2022

[FIXED] How to convert a println output to a dataframe in Scala

 September 18, 2022     apache-spark, databricks, printing, python, scala     No comments   

Issue

I have this code, which generates a list by means of a for loop. I want to take the output of the println and pass it to a DataFrame so that I can manipulate the resulting data, in Scala.

for (l <- ListArchive) {
  val LastModified: (String, String) = (l, getLastModifiedLCO(l))
  println(LastModified)
}

println output:

(LCO_2014-12-09_3.XML.gz,Tue Dec 09 07:48:30 UTC 2014)
(LCO_2014-12-09_1.XML.gz,Tue Dec 09 07:48:30 UTC 2014)


Solution

Rewrite it to generate a list/sequence, and then turn that into a DataFrame. Something like this:

import spark.implicits._
val df = ListArchive.map(l => (l, getLastModifiedLCO(l)))
  .toDF("col1Name", "col2Name")

If the list is very big, you can try turning it into an RDD via parallelize and then applying a similar map to it; that way the transformation runs in a distributed manner.
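
For reference, here is a minimal sketch of that RDD-based variant, shown in PySpark for brevity (the Scala version uses sc.parallelize in the same way). The file list and the get_last_modified_lco helper are stand-ins for the ones in the question, and spark is assumed to be an existing SparkSession, as it is in a Databricks notebook:

# Minimal sketch, not the original code: distribute the file list as an RDD,
# map each entry to a (name, last-modified) pair, then convert to a DataFrame.
def get_last_modified_lco(name):
    # Stand-in for the real lookup helper from the question.
    return "Tue Dec 09 07:48:30 UTC 2014"

list_archive = ["LCO_2014-12-09_1.XML.gz", "LCO_2014-12-09_3.XML.gz"]

rdd = spark.sparkContext.parallelize(list_archive)
df = rdd.map(lambda name: (name, get_last_modified_lco(name))).toDF(["file_name", "last_modified"])
df.show()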



Answered By - Alex Ott
Answer Checked By - David Marino (PHPFixing Volunteer)

Wednesday, August 24, 2022

[FIXED] How to import a module into another module in databricks notebook?

 August 24, 2022     databricks, module, python     No comments   

Issue

This is my config.py in Databricks

DATA_S3_LOCATION='s3://server-data/data1'
DATA_S3_FILE_TYPE='orc'
DATA2_S3_LOCATION='s3://server-data/data2'
DATA2_S3_FILE_TYPE='orc'

I have an __init__.py in this folder as well.

I am trying to access these variables in another file:

import sys
sys.path.insert(1,'/Users/file')
from file import config

I am facing an error: no module named 'file'.


Solution

There are several aspects here.

  • If these files are notebooks, then you need to use %run ./config to include a notebook from the current directory (doc).
  • If you're using Databricks Repos and arbitrary files support is enabled, then your code needs to be a Python file, not a notebook, and it needs the correct directory layout with __init__.py, etc. In this case you can use normal Python imports; your repository directory is automatically added to sys.path, so you don't need to modify it (see the sketch after this list).
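
For illustration, here is a minimal sketch of the second (Repos with arbitrary files) approach. The directory and module names below are hypothetical, not taken from the question:

# Hypothetical repo layout (names are illustrative only):
#
#   my_repo/
#   ├── utils/
#   │   ├── __init__.py
#   │   └── config.py       # DATA_S3_LOCATION, DATA_S3_FILE_TYPE, ... live here
#   └── main_notebook       # notebook or .py file in the same repo
#
# With Databricks Repos and arbitrary-files support enabled, the repo root is
# already on sys.path, so a plain package import works from main_notebook:

from utils import config

print(config.DATA_S3_LOCATION)   # -> s3://server-data/data1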

P.S. I have an example repository with both the notebooks and Python files approaches.



Answered By - Alex Ott
Answer Checked By - Senaida (PHPFixing Volunteer)

Sunday, August 21, 2022

[FIXED] How to get username inside spark submit task in databricks?

 August 21, 2022     apache-spark, databricks, environment-variables, scala     No comments   

Issue

I'm trying to retrieve the user name inside a spark-submit task in Databricks, so I can write additional information to the table about the user who changed the data. Unfortunately, I'm not able to find the correct way. For now, I have tried two things:

spark.sparkContext.sparkUser

and

System.getProperty("user.name")

but they both return root. Do you have any idea how to accomplish that?


Solution

If you're using Delta Lake tables, then information about performed operations is captured in the history of the Delta Lake table - see an example in the documentation.

Databricks exposes a lot of information via spark.conf - the relevant configuration properties start with spark.databricks.clusterUsageTags., so you can filter the configuration and search for the information you need.

But you need to take into account that all operations in the job are performed under the identity of the job owner, even if the job is triggered by someone else.

There is a spark.databricks.clusterUsageTags.clusterAllTags configuration property that contains a JSON string with a list of cluster tags; these also include an Owner field with the email of the user who owns that Databricks job.
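
As a rough sketch of both steps (filtering the clusterUsageTags properties and extracting the Owner tag), shown here in Python; the exact set of tags depends on how the cluster and job are configured, so treat the Owner lookup as an assumption to verify on your own cluster:

import json

# List every property Databricks exposes under spark.databricks.clusterUsageTags.
usage_tags = {
    key: value
    for key, value in spark.sparkContext.getConf().getAll()
    if key.startswith("spark.databricks.clusterUsageTags.")
}
for key in sorted(usage_tags):
    print(key, "=", usage_tags[key])

# clusterAllTags is a JSON string, typically an array of {"key": ..., "value": ...}
# entries; turn it into a dict and look up the Owner tag mentioned above.
all_tags = json.loads(spark.conf.get("spark.databricks.clusterUsageTags.clusterAllTags"))
tags = {entry["key"]: entry["value"] for entry in all_tags}
owner = tags.get("Owner")   # email of the job owner, per the answer above
print(owner)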



Answered By - Alex Ott
Answer Checked By - Marilyn (PHPFixing Volunteer)

Monday, August 1, 2022

[FIXED] How to connect and read files from Azure FTP folder using Python in Azure Databricks?

 August 01, 2022     azure-databricks, databricks, ftp, python     No comments   

Issue

I need to use Python in Azure Databricks to do the following:

  1. Merge multiple text files stored in an Azure FTP folder (\VMAZR1\ABCDFiles). Here, 'VMAZR1' is the server name and 'ABCDFiles' is the folder name.
  2. Store the merged file in the same location with new name

I can write the code to do the merging but I need assistance with connecting to Azure FTP folder and reading text file names only. Can someone please assist?


Solution

You can rely on this answer. Just change the method from storing to retrieving, e.g., retrbinary or retrlines, and use mlsd to get the list of file names.
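
A minimal ftplib sketch of that approach; the host, folder, and credentials below are placeholders based on the question (adjust them, and the protocol, e.g. FTP_TLS, to match your Azure setup):

from ftplib import FTP
from io import BytesIO

# Placeholder connection details from the question; replace with real values.
ftp = FTP("VMAZR1")
ftp.login(user="your-user", passwd="your-password")
ftp.cwd("ABCDFiles")

# mlsd() lists directory entries with metadata; keep only plain files.
file_names = [name for name, facts in ftp.mlsd() if facts.get("type") == "file"]

# retrlines() streams each text file line by line; collect everything for the merge.
merged_lines = []
for name in file_names:
    ftp.retrlines(f"RETR {name}", merged_lines.append)

# Store the merged content back into the same folder under a new name.
ftp.storbinary("STOR merged.txt", BytesIO("\n".join(merged_lines).encode("utf-8")))
ftp.quit()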



Answered By - Phuri Chalermkiatsakul
Answer Checked By - Willingham (PHPFixing Volunteer)
Copyright © PHPFixing