PHPFixing
Showing posts with label bigdata. Show all posts

Monday, October 31, 2022

[FIXED] How to make python for loops faster

 October 31, 2022     arrays, bigdata, for-loop, performance, python     No comments   

Issue

I have a list of dictionaries, like this:

[{'user': '123456', 'db': 'db1', 'size': '8628'}
{'user': '123456', 'db': 'db1', 'size': '7168'}
{'user': '123456', 'db': 'db1', 'size': '38160'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db5', 'size': '840'}
{'user': '34521', 'db': 'db6', 'size': '12288'}
{'user': '34521', 'db': 'db6', 'size': '476'}
{'user': '2345156', 'db': 'db7', 'size': '5120'}.....]

This list contains millions of entries. Each user can appear in multiple dbs, and each user can have multiple entries in the same db. I want to sum up the size occupied by each user, per db. I don't want to use pandas. At the moment I do it this way:

  • I create 2 lists of unique users and unique dbs
  • Use those lists to iterate through the big list and sum up where user and db are the same
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = 0
        for i in big_list:
            if (i['user'] == user and i['db'] == db):
                total_size += float(i['size'])
        if total_size > 0:
            row = {}
            row['user'] = user
            row['db'] = db
            row['size'] = total_size
            result.append(row)

The problem is that this triple for loop adds up to a huge number of iterations (hundreds of billions), which takes forever to produce the result. If big_list is small, this works very well.

How should I approach this in order to keep it fast and simple? Thanks a lot!


Solution

There are two main issues with the current approach: an inefficient algorithm and an inefficient data structure.

The first is that the algorithm is clearly inefficient, as it iterates over the big list many times. There is no need to traverse the whole list for every unique user and db. You can iterate over the big list once and aggregate the data in a dictionary whose key is simply a (user, db) tuple and whose value is the total size. Here is an untested example:

# Aggregation part
# Note: a default dict can be used instead to make the code possibly simpler
aggregate_dict = dict()
for i in big_list:
    key = (i['user'], i['db'])
    value = float(i['size'])
    if key in aggregate_dict:
        aggregate_dict[key] += value
    else:
        aggregate_dict[key] = value

# Fast creation of `result`
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = aggregate_dict.get((user, db))
        if total_size is not None and total_size > 0:
            result.append({'user': user, 'db': db, 'size': total_size})

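As the comment in the aggregation part notes, collections.defaultdict can shorten the code. Here is a minimal sketch of that variant (not from the original answer; it assumes the same big_list as above):

from collections import defaultdict

# Same single pass over big_list, but without the explicit key check
aggregate_dict = defaultdict(float)
for i in big_list:
    aggregate_dict[(i['user'], i['db'])] += float(i['size'])

# Build `result` directly from the aggregated (user, db) -> size mapping
result = [{'user': user, 'db': db, 'size': total_size}
          for (user, db), total_size in aggregate_dict.items()
          if total_size > 0]
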
The other issue is the inefficient data structure: in each row the keys are replicated, whereas tuples could be used instead. In fact, a better data structure is a dictionary mapping each column name to the list of that column's items. This way of storing data is called a dataframe, and it is roughly what Pandas uses internally (except that it stores NumPy arrays, which are even better, as they are more compact and generally more efficient than lists for most operations). Using this data structure for both the input and the output should result in a significant speed-up (if combined with NumPy) and a lower memory footprint.
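
As a rough illustration of this column-oriented layout (this sketch is not part of the original answer; it assumes NumPy is available and that neither user nor db contains the '|' character used here to build a composite key):

import numpy as np

# Column-oriented ("dataframe"-like) storage: one array per column
# instead of one small dict per row.
users = np.array([i['user'] for i in big_list])
dbs   = np.array([i['db'] for i in big_list])
sizes = np.array([float(i['size']) for i in big_list])

# Aggregate with NumPy: build a "user|db" composite key, find the unique
# pairs, and sum the sizes that map to each pair.
keys = np.char.add(np.char.add(users, '|'), dbs)
unique_keys, inverse = np.unique(keys, return_inverse=True)
totals = np.bincount(inverse, weights=sizes)

result = [{'user': k.split('|')[0], 'db': k.split('|')[1], 'size': float(t)}
          for k, t in zip(unique_keys, totals) if t > 0]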



Answered By - Jérôme Richard
Answer Checked By - Terry (PHPFixing Volunteer)

Monday, October 24, 2022

[FIXED] How to find which data have the longest decimal places in impala

 October 24, 2022     bigdata, decimal, impala, rounding     No comments   

Issue

I have a column stored as decimal(28, 7). How do I find which values have the most decimal places in Impala?

For example:
0.0000001 -> 7 decimal places
0.0100000 -> 2 decimal places
0.1000000 -> 1 decimal place

Solution

Use the SQL below:

length(
  substr(
    rtrim(cast(0.0100000 as string), '0'),
    instr(rtrim(cast(0.0100000 as string), '0'), '.') + 1
  )
) len

rtrim is used to remove the trailing 0s from the value.
instr is used to locate the position of '.'.
substr is used to cut the string after the position of '.'.

Sample SQL:

SELECT length(
  substr(
    rtrim(cast(0.0100000 as string), '0'),
    instr(rtrim(cast(0.0100000 as string), '0'), '.') + 1
  )
) len
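
The same trailing-zero logic can be sanity-checked outside Impala. Here is a small Python sketch (not part of the original answer) that reproduces the expression on the example values:

def decimal_places(value: str) -> int:
    # Strip trailing zeros, then count the characters after the '.'
    trimmed = value.rstrip('0')
    return len(trimmed) - trimmed.index('.') - 1

for s in ['0.0000001', '0.0100000', '0.1000000']:
    print(s, '->', decimal_places(s))   # prints 7, 2 and 1 respectively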


Answered By - Koushik Roy
Answer Checked By - Cary Denson (PHPFixing Admin)

Friday, September 30, 2022

[FIXED] How to run 2 APIs simultaneously and one is dependent on the other

 September 30, 2022     api, bigdata, concurrency, multithreading, python     No comments   

Issue

I am trying to run the two loops simultaneously. The second loop depends on the output of the first one: it fetches its input from the ids list, so it should not have to wait for the first loop to finish completely. I tried multiple libraries and methods but failed to find the optimal structure for this.

import time 
import pandas as pd
import requests
import json
from matplotlib import pyplot
import seaborn as sns
import numpy as np



API_KEY = ''

df = pd.read_csv('lat_long file')


# get name and information of each place
id  = df['id']
lat = df['latitude']
lon = df['longitude']
ids=[]
loc=[]
unit=[]

print('First API now running')




def get_details(lat, lon):
    try:
        
        url = "https://maps.googleapis.com/maps/api/geocode/json?latlng="+ str(lat) + ',' + str(lon)+'&key='+ API_KEY
        response = requests.get(url)
        data = json.loads(response.text)
        ids.append(data['results'][0]['place_id'])
    except Exception as e:
        print('This code NOT be running because of', e)
    return data

def get_deta(ids):
        
        url1 = "https://maps.googleapis.com/maps/api/place/details/json?language=en-US&placeid="+str(ids)+"&key=" + API_KEY
        responsedata = requests.get(url1)
        data2 = json.loads(responsedata.text)
        if 'business_status' in data2['result'].keys():
            loc.append((data2['result']['business_status']))
        else:
            loc.append('0')
        flag = False
        if data2['result']:
            for level in data2['result']['address_components']:
                #if len(level['types']) > 1:
                    if level['types'][0] == 'premise':
                        
                        flag = True
                        unit.append(level['long_name'][4:])
        else:
          print(data2)
        if not flag: 
          unit.append('0')
        
        return data2

def loop1():
    
    for i in range(len(id)):
          
          get_details(lat[i], lon[i])
    return

print('Second API now running')



def loop2():
    #printing and appending addresses to use them with the next API
    for i in range(50):
          
          get_deta(ids[i])
        
          
    return 



loop1()
loop2()
   

Solution

It is not very clear what you are trying to achieve here. How exactly does the second API depend on the first?

To achieve concurrency you could use the asyncio library, which is designed to perform concurrent network requests efficiently. However, the requests library you are using is synchronous; you must switch to an asynchronous one such as aiohttp.

Given that, you can communicate between two concurrent tasks using asyncio.Queue. Here is a draft of what your program could look like:

import asyncio
import aiohttp

async def get_details(lat, lon, session: aiohttp.ClientSession, id_queue: asyncio.Queue):
    url: str = f"https://maps.googleapis.com/maps/api/geocode/json?latlng={lat},{lon}&key={API_KEY}"
    async with session.get(url) as response:
        data = await response.json()
    await id_queue.put(data['results'][0]['place_id'])

async def get_data(id, session: aiohttp.ClientSession, loc_queue: asyncio.Queue):
    # Network request and JSON decoding for the place details API
    ...
    await loc_queue.put(data['result']['business_status'])

async def loop_1(coords, session: aiohttp.ClientSession, id_queue: asyncio.Queue):
    await asyncio.gather(
        *[get_details(lat, lon, session, id_queue) for lat, lon in coords]
    )

async def loop_2(session: aiohttp.ClientSession, id_queue: asyncio.Queue, loc_queue: asyncio.Queue):
    while True:
        id = await id_queue.get()
        await get_data(id, session, loc_queue)

async def main():
    coords = []  # (lat, lon) pairs read from the CSV
    id_queue = asyncio.Queue(maxsize=100)
    loc_queue = asyncio.Queue(maxsize=100)

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            loop_1(coords, session, id_queue),
            loop_2(session, id_queue, loc_queue)
        )

if __name__ == "__main__":
    asyncio.run(main())

I simplified your code for the purpose of the example. If you take a look at the main() function, the two loops are executed concurrently with asyncio.gather(). The first loop gets the details of all places concurrently (again with asyncio.gather) and feeds a shared queue, id_queue. The second loop waits for new ids to come up in the queue and processes them with the second API as soon as they are available. It then enqueues the results in a last queue, loc_queue.

You could extend this program by adding a third API plugged into this last queue and continuing the processing, as sketched below.
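
For example, a third consumer task could look roughly like this (loop_3 and its body are hypothetical and only illustrate the pattern; it would be added to the asyncio.gather() call in main()):

async def loop_3(loc_queue: asyncio.Queue):
    # Hypothetical third stage: consume results as soon as loop_2 produces them
    while True:
        business_status = await loc_queue.get()
        ...  # call a third API or persist the result here
        loc_queue.task_done()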



Answered By - Louis Lac
Answer Checked By - Senaida (PHPFixing Volunteer)

Monday, August 29, 2022

[FIXED] How to split large csv files into 125MB-1000MB small csv files dynamically using split command in UNIX

 August 29, 2022     bigdata, csv, python, shell, unix     No comments   

Issue

  • I am trying to split large CSV files into smaller CSV files of 125MB to 1GB. The split command works if we give it a number of records per file, but I want to compute that row count dynamically based on the file size. Loading a whole 20GB file into a Redshift table with the COPY command takes a lot of time, so if we chunk the 20GB file into files of the mentioned size I will get good results.

  • For example, a 20GB file could be split at 6_000_000 records per file so that each chunk is around 125MB; I want that row count to be determined dynamically, depending on the file size.


Solution

You can get the file size in MB and divide by some ideal size that you need to predetermine (for my example I picked your minimum of 125MB), and that will give you the number of chunks.

You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide that by the number of chunks to give your rows per chunk.

Rows per chunk is your "lines per chunk" count that you can finally pass to split.

Because we are doing division, which will most likely leave a remainder, you'll probably get an extra file with relatively few remainder rows (as you can see in the example).

Here's how I coded this up. I'm using shellcheck, so I think this is pretty POSIX compliant:

csvFile=$1

maxSizeMB=125

rm -f chunked_*

fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is $fSizeMB, max size per new file is $maxSizeMB"

nChunks=$(( fSizeMB / maxSizeMB ))
echo "Want $nChunks chunks"

nRows=$(wc -l "$csvFile" | cut -d' ' -f2)
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"


split -d -a 4 -l $nRowsPerChunk "$csvFile" "chunked_"


echo "Row (line) counts per file:"
wc -l chunked_00*

echo
echo "Size (MB) per file:"
du -ms chunked_00*

I created a mock CSV with 60_000_000 rows that is about 5GB:

ll -h gen_60000000x11.csv
-rw-r--r--  1 zyoung  staff   4.7G Jun 24 15:21 gen_60000000x11.csv

When I ran that script I got this output:

./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
 1578947 chunked_0000
 1578947 chunked_0001
 1578947 chunked_0002
 ...
 1578947 chunked_0036
 1578947 chunked_0037
      14 chunked_0038
 60000000 total

Size (MB) per file:
129     chunked_0000
129     chunked_0001
129     chunked_0002
...
129     chunked_0036
129     chunked_0037
1       chunked_0038
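
Since the question is also tagged python, roughly the same calculation can be done in Python without the split command. The sketch below is only one possible approach and is not part of the original answer (the file name and the 125MB target are illustrative):

import os

csv_file = 'gen_60000000x11.csv'   # illustrative input file
max_size_mb = 125                  # target chunk size, as in the shell script

# Mirror the shell script: size in MB / target size -> number of chunks,
# then total rows / number of chunks -> rows per chunk.
size_mb = os.path.getsize(csv_file) // (1024 * 1024)
with open(csv_file) as f:
    n_rows = sum(1 for _ in f)

n_chunks = max(1, size_mb // max_size_mb)
rows_per_chunk = n_rows // n_chunks

# Write the chunks; the remainder rows end up in one extra, smaller file.
with open(csv_file) as f:
    out, chunk_idx = None, 0
    for row_idx, line in enumerate(f):
        if row_idx % rows_per_chunk == 0:
            if out:
                out.close()
            out = open(f'chunked_{chunk_idx:04d}', 'w')
            chunk_idx += 1
        out.write(line)
    if out:
        out.close()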


Answered By - Zach Young
Answer Checked By - Marie Seifert (PHPFixing Admin)

[FIXED] How to insert an empty string in Amazon Redshift without it getting converted to NULL?

 August 29, 2022     amazon-redshift, amazon-s3, amazon-web-services, bigdata, csv     No comments   

Issue

I am trying to load data into Redshift with the COPY command from a pipe-delimited CSV file. While loading, || (empty) gets converted to NULL as I wanted, but |""| also gets converted to NULL. How can I handle this situation? Please respond.


Solution

For exporting:

\copy (select * from schema.table) to 'path/filename.csv' NULL 'NULL' DELIMITER '|' CSV HEADER;

In Redshift, I used the NULL AS 'NULL' option in the COPY command and it worked.



Answered By - Surya Appana
Answer Checked By - Robin (PHPFixing Admin)

Saturday, February 19, 2022

[FIXED] How to insert big data in Laravel?

 February 19, 2022     bigdata, insert, laravel, laravel-5, laravel-5.6     No comments   

Issue

I am using Laravel 5.6.

My script to insert big data is like this:

...
$insert_data = [];
foreach ($json['value'] as $value) {
    $posting_date = Carbon::parse($value['Posting_Date']);
    $posting_date = $posting_date->format('Y-m-d');
    $data = [
        'item_no'                   => $value['Item_No'],
        'entry_no'                  => $value['Entry_No'], 
        'document_no'               => $value['Document_No'],
        'posting_date'              => $posting_date,
        ....
    ];
    $insert_data[] = $data;
}
\DB::table('items_details')->insert($insert_data);

I have tried to insert 100 records with the script, and it works. It successfully inserts the data.

But if I try to insert 50000 records with the script, it becomes very slow. I've waited about 10 minutes and it did not finish. There is an error like this:

504 Gateway Time-out

How can I solve this problem?


Solution

As was stated, chunks won't really help you in this case if it is an execution-time problem. I think the bulk insert you are trying to use cannot handle that amount of data, so I see 2 options:

1 - Reorganise your code to properly use chunks; it will look something like this:

$insert_data = [];

foreach ($json['value'] as $value) {
    $posting_date = Carbon::parse($value['Posting_Date']);

    $posting_date = $posting_date->format('Y-m-d');

    $data = [
        'item_no'                   => $value['Item_No'],
        'entry_no'                  => $value['Entry_No'], 
        'document_no'               => $value['Document_No'],
        'posting_date'              => $posting_date,
        ....
    ];

    $insert_data[] = $data;
}

$insert_data = collect($insert_data); // Make a collection to use the chunk method

// it will chunk the dataset in smaller collections containing 500 values each. 
// Play with the value to get best result
$chunks = $insert_data->chunk(500);

foreach ($chunks as $chunk)
{
   \DB::table('items_details')->insert($chunk->toArray());
}

This way each bulk insert will contain less data and be processed rather quickly.

2 - In case your host supports runtime overrides, you can add a directive right before the code starts to execute:

ini_set('max_execution_time', 120 ) ; // time in seconds

$insert_data = [];

foreach ($json['value'] as $value)
{
   ...
}

To read more, go to the official docs.



Answered By - Vit Kos
Copyright © PHPFixing