PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Monday, October 31, 2022

[FIXED] How to make python for loops faster

 October 31, 2022     arrays, bigdata, for-loop, performance, python     No comments   

Issue

I have a list of dictionaries, like this:

[{'user': '123456', 'db': 'db1', 'size': '8628'}
{'user': '123456', 'db': 'db1', 'size': '7168'}
{'user': '123456', 'db': 'db1', 'size': '38160'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db5', 'size': '840'}
{'user': '34521', 'db': 'db6', 'size': '12288'}
{'user': '34521', 'db': 'db6', 'size': '476'}
{'user': '2345156', 'db': 'db7', 'size': '5120'}.....]

This list contains millions of entries. Each user can be found in multiple dbs, each user can have multiple entires in the same db. I want to sum up how much is the size occupied by each user, per each db. I don't want to use pandas. At the moment I do it this way:

  • I create 2 lists of unique users and unique dbs
  • Use those lists to iterate through the big list and sum up where user and db are the same
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = 0
        for i in big_list:
            if (i['user'] == user and i['db'] == db):
                total_size += float(i['size'])
        if(total_size) > 0:
            row = {}
            row['user'] = user
            row['db'] = db
            row['size'] = total_size
            result.append(row)

The problem is that this triple for loop develops into something very large (hundreds of billions of iterations) which takes forever to sum up the result. If the big_list is small, this works very well.

How should I approach this in order to keep it fast and simple? Thanks a lot!


Solution

There are two main issue with the current approach: the inefficient algorithm and the inefficient data structure.

The first is that the algorithm used is clearly inefficient as it iterates many times over the big list. There is not need to iterate over the whole list to filter a unique user and db. You can iterate over the big list once and aggregate data using a dictionary. The key of the target dictionary is simply a (user, db) tuple. The value of the dictionary is total_size. Here is an untested example:

# Aggregation part
# Note: a default dict can be used instead to make the code possibly simpler
aggregate_dict = dict()
for i in big_list:
    key = (i['user'], i['db'])
    value = float(i['size'])
    if key in aggregate_dict:
        aggregate_dict[key] += value
    else:
        aggregate_dict[key] = value

# Fast creation of `result`
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = aggregate_dict.get((user, key))
        if total_size is not None and total_size > 0:
            result.append({'user': user, 'db': db, 'size': total_size})

The other issue is the inefficient data structure: for each row, the keys are replicated while tuples can be used instead. In fact, a better data structure is to store a dictionary of (column, items) key-values where items is a list of items for the target column. This way of storing data is called a dataframe. This is roughly what Pandas uses internally (except it is a Numpy array which is even better as it is more compact and generally more efficient than a list for most operations). Using this data structure for both the input and the output should result in a significant speed up (if combined with Numpy) and a lower memory footprint.



Answered By - Jérôme Richard
Answer Checked By - Terry (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing