PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Monday, October 31, 2022

[FIXED] How to divide data into groups in a fast way

 October 31, 2022     dataframe, for-loop, performance, python, r     No comments   

Issue

I have a large matrix with 12 columns and approximately 1.000.000 rows. Each column represents the money spent by a client in a given month, so with the 12 columns I have information for 1 full year. Each row represents one client.

I need to divide the people into groups based on how much money they spent each month, and I consider the following intervals:

  • money=0
  • 0<money<=25
  • 25<money<=50
  • 50<money<=75

So for example group1 would be formed by clients that spent 0$ each month for the whole year, group2 would be clients who spent between 0 and 25$ the first month, and 0$ the rest of the months, and so on. In the end I have 12 months, and 4 intervals, so I need to divide data into 4^12=16.777.216 groups (I know this yields to more groups than observations, and that many of the groups will be empty or with very few clients, but that is another problem, so far I am interested in doing this division into groups)

I am currently working in R although I could also switch to Python if required (those are the programming languages I control best), and so far my only idea has been to use nested for loops, one for loop for each month. But this is very, very slow.

So my question is: is there a faster way to do this?

Here I provide a small example with fake data, 10 observations (instead of the 1.000.000), 5 columns (instead of 12) and a simplified version of my current code for doing the grouping.

set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50), nrow=10, ncol=5))

intervals = c(-4, -1, 0, 1, 4)

id_list = c()
group_list = c()

group_idx = 0

for(idx1 in 1:(length(intervals)-1))
{
  data1 = data[(data[, 2] >= intervals[idx1]) & (data[, 2] < intervals[idx1+1]),]
  for(idx2 in 1:(length(intervals))-1)
  {
    data2 = data1[(data1[, 3] >= intervals[idx2]) & (data1[, 3] < intervals[idx2+1]),]
    for(idx3 in 1:(length(intervals)-1))
    {
      data3 = data2[(data2[, 4] >= intervals[idx3]) & (data2[, 4] < intervals[idx3+1]),]
      for(idx4 in 1:(length(intervals)-1))
      {
        data4 = data3[(data3[, 5] >= intervals[idx4]) & (data3[, 5] < intervals[idx4+1]),]
        for(idx5 in 1:(length(intervals)-1))
        {
          data5 = data4[(data4[, 6] >= intervals[idx5]) & (data4[, 6] < intervals[idx5+1]),]
          group_idx = group_idx + 1
          id_list = c(id_list, data5$id)
          group_list = c(group_list, rep(group_idx, nrow(data5)))
        }
      }
    }
  }
}

Solution

If you do need to do this--which I certainly have my doubts about--I would suggest creating a matrix with the classification for each cell of the original data, and then pasting them together to make a group label.

Doing this we can set the group labels to be human readable, which might be nice.

I would recommend simply adding this grouping column to the original data and then using dplyr or data.table to do grouped operations for your next steps, but if you really want separate data frames for each you can then split the original data based on these group labels.

## I redid your sample data to put it on the same general scale as 
## your actual data
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50, mean = 50, sd = 20), nrow=10, ncol=5))

my_breaks = c(0, 25 * 1:3, Inf)
## you could use default labels, but this seems nicer
my_labs = c("Low", "Med", "High", "Extreme")

## classify each value from the data
grouping = vapply(
  data[-1], \(x) as.character(cut(x, breaks = my_breaks)),
  FUN.VALUE = character(nrow(data))
)

## create labels for the groups
group_labels = apply(grouping, 2, \(x) paste(1:(ncol(data) - 1), x, sep = ":", collapse = " | "))

## either add the grouping value to the original data or split the data based on groups
data$group = group_labels
result = split(data, group_labels)

result
# $`1:(25,50] | 2:(75,Inf] | 3:(0,25] | 4:(50,75] | 5:(75,Inf] | 1:(25,50] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(50,75]`
#   id       X1       X2       X3       X4       X5
# 1  1 33.18289 74.55261 68.01024 56.31830 81.00121
# 6  6 37.94184 47.22028 44.13036 69.03148 61.24447
# 
# $`1:(50,75] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(25,50] | 1:(25,50] | 2:(25,50] | 3:(0,25] | 4:(50,75] | 5:(25,50]`
#   id       X1       X2       X3       X4       X5
# 2  2 77.68719 33.96441 68.83739 72.19388 33.95154
# 7  7 40.55667 38.05374 78.37178 29.80935 32.25983
# 
# $`1:(50,75] | 2:(50,75] | 3:(75,Inf] | 4:(50,75] | 5:(50,75] | 1:(25,50] | 2:(75,Inf] | 3:(75,Inf] | 4:(25,50] | 5:(25,50]`
#   id       X1        X2       X3        X4       X5
# 3  3 24.89016 28.392148 79.35924 94.309211 48.50842
# 8  8 37.29257  6.320665 79.97548  9.990545 40.79511
# 
# $`1:(50,75] | 2:(50,75] | 3:(75,Inf] | 4:(50,75] | 5:(75,Inf] | 1:(50,75] | 2:(25,50] | 3:(0,25] | 4:(0,25] | 5:(25,50]`
#   id       X1       X2       X3       X4       X5
# 4  4 51.40286 46.84931 64.13522 74.34207 87.91336
# 9  9 44.28453 54.81635 36.85836 14.75628 35.51343
# 
# $`1:(75,Inf] | 2:(25,50] | 3:(25,50] | 4:(75,Inf] | 5:(25,50] | 1:(50,75] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(25,50]`
#    id       X1       X2       X3       X4       X5
# 5   5 84.22882 28.56480 66.38018 79.58444 40.86862
# 10 10 52.76216 44.81289 32.94409 47.14784 48.61578


Answered By - Gregor Thomas
Answer Checked By - Cary Denson (PHPFixing Admin)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing