PHPFixing
Showing posts with label numpy. Show all posts

Friday, November 4, 2022

[FIXED] what does myarray[0][:,0] mean

 November 04, 2022     lambda, numpy, numpy-ndarray     No comments   

Issue

This is an excerpt from the documentation.

lambda ind, r: 1.0 + any(np.array(points_2d)[ind][:,0] == 0.0)

But I don't understand np.array(points_2d)[ind][:,0].

It seems equivalent to myarray[0][:,0], which doesn't make sense to me.

Can anyone help to explain?


Solution

With points_2d from earlier in the doc:

In [38]: points_2d = [(0., 0.), (0., 1.), (1., 1.), (1., 0.),
    ...:           (0.5, 0.25), (0.5, 0.75), (0.25, 0.5), (0.75, 0.5)]

In [39]: np.array(points_2d)
Out[39]: 
array([[0.  , 0.  ],
       [0.  , 1.  ],
       [1.  , 1.  ],
       [1.  , 0.  ],
       [0.5 , 0.25],
       [0.5 , 0.75],
       [0.25, 0.5 ],
       [0.75, 0.5 ]])

Indexing with a scalar gives a 1d array, which can't be further indexed with [:,0].

In [40]: np.array(points_2d)[0]
Out[40]: array([0., 0.])

But with a list or slice:

In [41]: np.array(points_2d)[[0,1,2]]
Out[41]: 
array([[0., 0.],
       [0., 1.],
       [1., 1.]])

In [42]: np.array(points_2d)[[0,1,2]][:,0]
Out[42]: array([0., 0., 1.])

So this selects the first column of a subset of rows.

In [43]: np.array(points_2d)[[0,1,2]][:,0]==0.0
Out[43]: array([ True,  True, False])

In [44]: any(np.array(points_2d)[[0,1,2]][:,0]==0.0)
Out[44]: True

I think they could have used:

In [45]: np.array(points_2d)[[0,1,2],0]
Out[45]: array([0., 0., 1.])


Answered By - hpaulj
Answer Checked By - Mildred Charles (PHPFixing Admin)

Thursday, November 3, 2022

[FIXED] How to apply formula to a dataframe in pandas

 November 03, 2022     dataframe, lambda, numpy, pandas, python     No comments   

Issue

       tr   Atr
0   0.00276 0.00276
1   0.01455 NaN
2   0.00895 NaN
3   0.00816 NaN
4   0.00596 NaN
5   0.00816 NaN
6   0.00844 NaN
7   0.01150 NaN
8   0.00473 NaN
9   0.00502 NaN

How do I apply this formula to each tr?

Atr = (prev_Atr * (14 - 1) + tr) / 14

What I want to do is

df["Atr"] = lambda x, y: (x * (14 -1) + y)/14

but I don't know how to assign

x = prev_Atr and y = tr


Solution

It seems you are looking for a rolling computation, but it's not a simple sum() or similar. You can achieve what you want with a plain for loop:

for i in df.index[1:]:
    # assign with .loc; chained df['Atr'].iloc[i] = ... may not write back to the frame
    df.loc[i, 'Atr'] = (df.loc[i - 1, 'Atr'] * 13 + df.loc[i, 'tr']) / 14

print(df):

        tr       Atr
0  0.00276  0.002760
1  0.01455  0.003602
2  0.00895  0.003984
3  0.00816  0.004282
4  0.00596  0.004402
5  0.00816  0.004671
6  0.00844  0.004940
7  0.01150  0.005408
8  0.00473  0.005360
9  0.00502  0.005336
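
As a side note, the recurrence Atr[i] = (Atr[i-1] * 13 + tr[i]) / 14 is an exponential moving average with alpha = 1/14, so a vectorized alternative is possible. A minimal sketch, assuming (as in the sample data) that the first Atr value equals the first tr value:

# ewm(adjust=False) computes y[i] = (1 - alpha) * y[i-1] + alpha * x[i], starting from y[0] = x[0]
df['Atr'] = df['tr'].ewm(alpha=1/14, adjust=False).mean()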


Answered By - SomeDude
Answer Checked By - Mary Flores (PHPFixing Volunteer)

[FIXED] How do I calculate average number of ride per week from Total Count in Pandas/Python

 November 03, 2022     lambda, numpy, pandas, python     No comments   

Issue

My dataframe (df) holds 12 months of data and consists of 5m rows. One of the columns is day_of_week, with values Monday to Sunday. The df also has a unique key, the ride_id column. I want to calculate the average number of rides per day_of_week. I have calculated the number of rides per day_of_week using

copydf.groupby(['day_of_week']).agg(number_of_rides=('day_of_week', 'count'))

However, I find it hard to calculate the mean/average for each day of week. I have tried:

copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count')).mean()

and

avg_days = copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
avg_days.groupby(['day_of_week']).agg('number_of_rides', 'mean')

They didn't work. I want the output to have three columns (day_of_week, number_of_rides, and avg_num_of_rides), or two columns (day_of_week or weekday_num, and avg_num_of_rides).

This is my df. Kindly note that the code block wraps some column lines due to the long column names.

    ride_id rideable_type   started_at  ended_at    start_station_name  start_station_id    end_station_name    end_station_id  start_lat   start_lng   end_lat end_lng member_or_casual    ride_length year    month   day_of_week hour    weekday_num
0   9DC7B962304CBFD8    electric_bike   2021-09-28 16:07:10 2021-09-28 16:09:54 Streeter Dr & Grand Ave 13022   Streeter Dr & Grand Ave 13022   41.89   -87.68  41.89   -87.67  casual  2   2021    September   Tuesday 16  1
1   F930E2C6872D6B32    electric_bike   2021-09-28 14:24:51 2021-09-28 14:40:05 Streeter Dr & Grand Ave 13022   Streeter Dr & Grand Ave 13022   41.94   -87.64  41.98   -87.67  casual  15  2021    September   Tuesday 14  1
2   6EF72137900BB910    electric_bike   2021-09-28 00:20:16 2021-09-28 00:23:57 Streeter Dr & Grand Ave 13022   Streeter Dr & Grand Ave 13022   41.81   -87.72  41.80   -87.72  casual  3   2021    September   Tuesday 0   1

This is the output I desire

             number_of_rides  average_number_of_rides
day_of_week
Saturday              964079                     50.4
Sunday                841919                     70.9
Wednesday             840272                     90.2
Thursday              836973                     77.2
Friday                818205                     34.4
Tuesday               814496                     34.4
Monday                767002                    200.3

Again, I have already calculated the number of rides per day_of_week; what I want is just to add the third column, or better still, to have the average rides per weekday (Monday or 0, Tuesday or 1, Wednesday or 2) in its own output df.

Thanks


Solution

To get the average number of rides per weekday, you need the total rides on that weekday and the number of weeks.

You can compute the week number from date:

df["week_number"] = df["started_at"].dt.isocalendar().week

>>    ride_id started_at day_of_week  week_number
>> 0        1 2021-09-20      Monday           38
>> 1        2 2021-09-21     Tuesday           38
>> 2        3 2021-09-20      Monday           38
>> 3        4 2021-09-21     Tuesday           38
>> 4        5 2021-09-27      Monday           39
>> 5        6 2021-09-28     Tuesday           39

Then group by day_of_week and week_number to compute an aggregate dataframe:

week_number_group_df = df.groupby(["day_of_week", "week_number"]).agg(number_of_rides_on_day=("ride_id", "count"))

>>                             number_of_rides_on_day
>> day_of_week   week_number                          
>> Monday        38                                  2
>>               39                                  1
>> Tuesday       38                                  2
>>               39                                  1

Use the aggregated dataframe to get the final results:

week_number_group_df.groupby("day_of_week").agg(number_of_rides=("number_of_rides_on_day", "sum"), average_number_of_rides=("number_of_rides_on_day", "mean"))

>>              number_of_rides  average_number_of_rides
>> day_of_week                                          
>> Monday                     3                   1.5000
>> Tuesday                    3                   1.5000
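
Note that .dt.isocalendar() assumes started_at is already a datetime column; if it was read from a CSV as plain strings, convert it first (a minimal sketch):

df["started_at"] = pd.to_datetime(df["started_at"])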


Answered By - Azhar Khan
Answer Checked By - Marilyn (PHPFixing Volunteer)

Wednesday, November 2, 2022

[FIXED] How to find last occurrence of value meeting condition in column in python

 November 02, 2022     dataframe, indexing, numpy, pandas, python     No comments   

Issue

I have the following dataframe:

df = pd.DataFrame({"A":['a','b','c','d','e','f','g','h','i','j','k'],
                    "B":[1,3,4,5,6,7,6,5,8,5,5]})
df

displayed as:

    A   B
0   a   1
1   b   3
2   c   4
3   d   5
4   e   6
5   f   7
6   g   6
7   h   5
8   i   8
9   j   5
10  k   5

I first want to find the letter in column "A" that corresponds to the first occurrence of a value in column "B" that is >= 6. Looking at this, we see that this would be row index 4, corresponding to a value of 6 and "e" in column "A".

I can identify the column "A" value we just got with this code:

#Find first occurrence >= threshold
threshold = 6
array = df.values
array[np.where(array[:,1] >= threshold)][0,0]

This code returns 'e', which is what I want.

This code is referenced from this Stack Overflow source: Python find first occurrence in Pandas dataframe column 2 below threshold and return column 1 value same row using NumPy

What I am having trouble figuring out is how to modify this code to find the last occurrence meeting my criteria of being >= the threshold of 6. And so looking at my code above, I want to produce 'i', because looking at the above data frame, the row containing "i" in column "A" correspond to a value of 8 in column "B", which is the last occurrence of a value >= the threshold of 6. I want to preserve the order of the rows as alphabetical referencing column "A". I am guessing this might have to do with somehow modifying the indexing in my code, specifically the array[:,1] component or the [0,0] component, but I am not sure how to specifically call for the last occurrence meeting my criteria. How can I modify my code to find the value in column "A" corresponding to the last occurrence of a value >= the threshold of 6 in column "B"?


Solution

To get the first occurrence, You can use idxmax:

df.loc[df['B'].ge(6).idxmax()]

output:

A    e
B    6
Name: 4, dtype: object

For just the value in 'A':

df.loc[df['B'].ge(6).idxmax(), 'A']

output: 'e'

For the last, do the same on the reversed Series:

df.loc[df.loc[::-1,'B'].ge(6).idxmax()]

output:

A    i
B    8
Name: 8, dtype: object

For just the value in 'A':

df.loc[df.loc[::-1, 'B'].ge(6).idxmax(), 'A']

output: 'i'
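
As a side note, the asker's original NumPy approach can also be adapted by taking the last matching row instead of the first; a minimal sketch using the same df and threshold:

import numpy as np

threshold = 6
arr = df.values                               # object array with columns A and B
matches = arr[np.where(arr[:, 1] >= threshold)]
print(matches[-1, 0])                         # 'i' -> last row meeting the condition, column 'A'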



Answered By - mozway
Answer Checked By - Katrina (PHPFixing Volunteer)

Tuesday, November 1, 2022

[FIXED] How to mask indices smaller than or bigger than certain index?

 November 01, 2022     indexing, indices, numpy, python     No comments   

Issue

Is there a way, using the np.ma module, to mask all indices in a specific array smaller or bigger than a given number? For example, if I have an array of 365 elements and I want to mask all of the ones between 170 and 200 and only take [0:170] and [201:] into account, can I do it?

I tried researching the answer, but nothing I found seems like the right solution (it's not an issue for me to mask the indices using, for example, a list comprehension, but I specifically need to use the np.ma module).


Solution

You could make a mask along the lines of

mymask = np.array([0 if x < 170 or x >=200 else 1 for x in range(365)])

and then use

x = np.ma.masked_array(myarray, mask = mymask)
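
Since the mask is just a boolean array, the same thing can be done without a Python-level loop. A minimal sketch, assuming (as above) that indices 170 up to, but not including, 200 should be masked:

import numpy as np

myarray = np.arange(365, dtype=float)        # hypothetical data array

idx = np.arange(myarray.size)
mymask = (idx >= 170) & (idx < 200)          # True means masked
x = np.ma.masked_array(myarray, mask=mymask)
print(x.count())                             # 335 unmasked elements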


Answered By - user19077881
Answer Checked By - David Goodson (PHPFixing Volunteer)

[FIXED] How to improve performance of dataframe slices matching?

 November 01, 2022     match, numpy, pandas, performance, python     No comments   

Issue

I need to improve the performance of the following dataframe-slice matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values, with order preserved.

My 2 dataframes:

>>>df1
  trips sequence
0   11  a
1   11  d
2   21  d
3   21  a
4   31  a
5   31  b
6   31  c

>>>df2
  trips sequence
0   12  a
1   12  d
2   22  c
3   22  b
4   22  a
5   32  a
6   32  d

Expected output:

['11 match 12']

This is the code I'm using:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})

df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values,route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue

Despite working, this is very time-costly and inefficient, as my real dataframes are longer. Any suggestions?


Solution

You can aggregate each trip as a tuple with groupby.agg, then merge the two outputs to identify the identical routes:

out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence'
              )

output:

   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32

If you only want the first match, drop_duplicates the output of df2 aggregation to prevent unnecessary merging:

out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence'
              )

output:

   trips_x sequence  trips_y
0       11   (a, d)       12
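
To reproduce the asker's '11 match 12' output format from the merged frame, a small follow-up sketch:

route_match = (out['trips_x'].astype(str) + ' match ' + out['trips_y'].astype(str)).tolist()
print(route_match)   # ['11 match 12']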


Answered By - mozway
Answer Checked By - Candace Johnson (PHPFixing Volunteer)

[FIXED] What is the fastest way to sum 2 matrices using Numba?

 November 01, 2022     multithreading, numba, numpy, performance, time     No comments   

Issue

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy. Here is my code:

import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time

# function 1: 
def sum_numpy(A,B):
    return A+B

# function 2: 
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)

# function 3: 
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
    return A+B

# function 4: 
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
    n=A.shape[0]
    m=A.shape[1]
    C = np.empty((n, m), A.dtype)

    for i in prange(n):
        for j in prange(m):
            C[i,j]=A[i,j]+B[i,j]
  
    return C

#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3)) 

t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))

print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")

Here are the results:

function 1 (sum_numpy): t1= 0.1655790419999903

function 2 (sum_numba_simple): t2= 0.3019776669998464

function 3 (sum_numba_vectorized): t3= 0.16486266700030683

function 4 (sum_numba_loop): t4= 0.1862256660001549

As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation?


Solution

Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the array is newly allocated. A solution is to preallocate it and then write into it, so that pages do not need to be remapped in physical memory. np.add(A, B, out=C) does this, as indicated by @August in the comments. Another solution could be to adapt the standard allocator so that it does not give the memory back to the OS, at the expense of a significant memory-usage overhead (AFAIK TC-Malloc can do that, for example).
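
A minimal sketch of the preallocated variant mentioned above (the output array is allocated once, outside any timed region):

C = np.empty_like(A)     # allocate the output buffer once
np.add(A, B, out=C)      # write into already-mapped memory instead of allocating a new array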

There is another issue on most platforms (especially x86 ones): the cache-line write allocation of write-back caches is expensive during writes. The typical solution to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba is able to do that yet. For Numba, I filed an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can be detrimental to performance when arrays fit in cache, and compilers do not know the size of the array at compile time (they could generate specific code when they can evaluate the amount of data computed, but this is not easy and can slow down some other code). AFAIK, the only possible way to fix this is to write C code and use low-level instructions or compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.

Using multiple threads does not always make memory-bound code faster. In fact, it generally barely scales, because using more cores does not speed up the execution when the RAM is already saturated. Few cores are generally required to saturate the RAM on most platforms. Page faults can benefit from using multiple cores depending on the target system (Linux handles them in parallel quite well, Windows generally does not scale well, I don't know about macOS).

Finally, there is another issue: the code is not vectorized (at least not on my machine, while it can be). One solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified for the compiler to generate fast SIMD code. Here is the resulting Numba code:

@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i]=A_flat[i]+B_flat[i]
    return C

Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:

sum_numpy:                       0.642 s    13.9 GiB/s
sum_numba_simple:                0.851 s    10.5 GiB/s
sum_numba_vectorized:            0.639 s    14.0 GiB/s
sum_numba_loop serial:           0.759 s    11.8 GiB/s
sum_numba_loop parallel:         0.472 s    18.9 GiB/s
Numpy "np.add(A, B, out=C)":     0.281 s    31.8 GiB/s  <----
Numba fast:                      0.288 s    31.0 GiB/s  <----
Optimal time:                    0.209 s    32.0 GiB/s

The Numba code and the Numpy one saturate my RAM. Using more cores does not help (in fact it is a bit slower, certainly due to contention on the memory controller). Both are sub-optimal since they do not use non-temporal store instructions, which can prevent cache-line write allocations (these cause data to be fetched from the RAM before being written back). The optimal time is the one expected using such instructions. Note that it is expected to reach only 65-80% of the RAM bandwidth because of mixed RAM reads/writes. Indeed, interleaving reads and writes causes low-level overheads preventing the RAM from being saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).



Answered By - Jérôme Richard
Answer Checked By - David Marino (PHPFixing Volunteer)

Monday, October 31, 2022

[FIXED] Why is numpy.dot as fast as these GPU implementations of matrix multiplication?

 October 31, 2022     numpy, performance     No comments   

Issue

According to the following table (from this paper), numpy's np.dot performance is comparable to a CUDA implementation of matrix multiplication, in experiments with 320x320 matrices. I did replicate this speedup on my machine for np.dot with reasonable precision. Their code for CUDA with Numba ran much slower though, with a speedup of about 1200 instead of the 49258 reported.

Why is numpy's implementation so fast?

https://link.springer.com/article/10.1007/s11227-017-2213-5

Edit: here's the code taken from the paper. I just added the timeit calls. I ran it on the laptop described below.

CUDA

import numpy as np
from numba import cuda
@cuda.jit('void(float64[:,:], float64[:,:], float64[:,:], int32)')
def cu_matmul(a, b, c, n):
    x, y = cuda.grid(2)
    if (x >= n) or (y >= n):
        return
    c[x, y] = 0
    for i in range(n):
        c[x, y] += a[x, i] * b[i, y]

device = cuda.get_current_device()
tpb = device.WARP_SIZE
n = 320
bpg = (n + tpb - 1) // tpb
grid_dim = (bpg, bpg)
block_dim = (tpb, tpb)
A = np.random.random((n, n)).astype(np.float64)
B = np.random.random((n, n)).astype(np.float64)
C = np.empty((n, n), dtype=np.float64)
dev_A = cuda.to_device(A)
dev_B = cuda.to_device(B)
dev_C = cuda.to_device(C, copy=False)
result_cuda = cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)
dev_C.copy_to_host(C)
assert np.allclose(np.dot(A, B), C)

Numpy

np.dot(A, B)

System specs

(screenshot of the laptop's system specs omitted)


Solution

Aside from what @norok2 links to, there is the large overhead of transferring the data to the GPU. This becomes significant in several cases:

  • the work you do on the GPU is comparable in cost to the data-transfer overhead, i.e. you only do one operation on less than a MB of data.
  • The size of your problem doesn't scale extremely well. This is the case if your data size or your underlying problem don't allow the GPU to use its parallel processing sufficiently.
  • There are too many branches in your parallel code. This usually means a large set of parallel processors needs to wait on each branch (branching hardware is usually grouped per X number of arithmetic processors on a GPU), slowing down the whole computation.

The first two points apply here: 320x320 is not extremely large, and a matrix multiplication is the only thing you're doing. CPUs are far from being made obsolete by GPUs, and cases like this prove exactly that.
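
To make the transfer-overhead point concrete, here is a rough sketch (an assumption: a CUDA-capable GPU and the same 320x320 arrays as in the question) that times just the host-to-device copies against the whole np.dot call:

import timeit
import numpy as np
from numba import cuda

n = 320
A = np.random.random((n, n))
B = np.random.random((n, n))

# Time only the host-to-device copies (no kernel launch, no computation).
t_copy = min(timeit.repeat(lambda: (cuda.to_device(A), cuda.to_device(B)), repeat=3, number=100))

# Time the full CPU matrix product for comparison.
t_dot = min(timeit.repeat(lambda: np.dot(A, B), repeat=3, number=100))

print("host->device copies:", t_copy, "s;  np.dot:", t_dot, "s")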



Answered By - rubenvb
Answer Checked By - David Goodson (PHPFixing Volunteer)

[FIXED] How to map colors in an image to a palette fast with numpy?

 October 31, 2022     color-mapping, numpy, performance, python, python-3.x     No comments   

Issue

I have two arrays. One is an image array and the other is a palette array. Both have elements containing 8-bit RGB channels. I need to replace every color in the image with the closest color in the palette.

Currently I'm measuring distance in the RGB-space, which is not ideal, but easy to implement.

This is my implementation:

image_array = np.array(image)  # converts PIL image, values are uint8
image_height, image_width = image_array.shape[:2]
# palette values are also 8-bit, but plain int avoids having to cast types (np.int is deprecated)
palette_array = np.array(palette, dtype=[('red', int), ('green', int), ('blue', int)])
mapped_image = np.empty((image_height, image_width, 3), dtype=np.uint8)
for x in range(image_width):
    for y in range(image_height):
        r, g, b = image_array[y, x]
        distances_squared = (r - palette_array['red'])**2 + (g - palette_array['green'])**2 + (b - palette_array['blue'])**2
        closest_index = np.argmin(distances_squared)
        closest_color = palette_array[closest_index]
        mapped_image[y, x] = tuple(closest_color)  # unpack the structured row into the RGB triple

The palette has 4096 random colors (simple conversion is not possible). When mapping a 600x448 sized image this takes roughly a minute even on my core i5 machine. I plan to use this on lower-end devices like a raspberry pi, where it takes roughly 3 minutes to map a small image.

This is way too slow. I believe this can be sped up significantly when the full loop is implemented with numpy syntax, but I can't figure out how to do this.

How do I get from the original image to the mapped one all implemented with numpy syntax?


Solution

You can try using the cKDTree class from scipy.spatial.

import numpy as np
from scipy.spatial import cKDTree

palette = np.random.randint(0, 255, size=(4096, 3), dtype=np.uint8)       # random palette
image_in = np.random.randint(0, 255, size=(800, 600, 3), dtype=np.uint8)  # random image
size = image_in.shape
tree = cKDTree(palette)                      # KD-tree over the palette colors
test_points = np.reshape(image_in, (-1, 3))  # flatten pixels to an (N, 3) array
_, test_point_regions = tree.query(test_points, k=1)
image_out = np.reshape(palette[test_point_regions], size)  # keep the reshaped result

This program runs for approximately 0.8 seconds.
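
If staying within plain NumPy is a hard requirement, a chunked broadcasting version is also possible; a sketch (slower than the KD-tree, but loop-free within each chunk and bounded in memory):

# Brute-force squared distances, processed in chunks so the
# (chunk x palette_size) distance matrix stays small.
pixels = image_in.reshape(-1, 3).astype(np.int32)
pal = palette.astype(np.int32)
nearest = np.empty(len(pixels), dtype=np.intp)
chunk = 4096
for start in range(0, len(pixels), chunk):
    block = pixels[start:start + chunk]
    d = ((block[:, None, :] - pal[None, :, :]) ** 2).sum(axis=2)
    nearest[start:start + chunk] = d.argmin(axis=1)
image_out = palette[nearest].reshape(image_in.shape)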



Answered By - Alex Alex
Answer Checked By - Timothy Miller (PHPFixing Admin)

Friday, October 28, 2022

[FIXED] How make np.argsort place empty strings at the END of an array instead of at the beginning

 October 28, 2022     is-empty, numpy, python, sorting     No comments   

Issue

I'm honestly surprised that this question hasn't come up on the forums (at least from what I have seen) earlier. Anyway, I am currently attempting to sort a list of strings, many of which are empty, alphabetically using np.argsort, like so:

list = [ "Carrot", "Star", "Beta", "Zoro" , ""]

Right now, any call of np.argsort(list) will return the following array of indices:

[4,2,0,1,3] # => ["", "Beta", "Carrot", "Star", "Zoro"]

Is there a way to specify the order of the argsort function so that the empty strings are placed at the end of the array like so:

[2,0,1,3,4] # => ["Beta", "Carrot", "Star", "Zoro", ""]

Any input will be greatly appreciated!


Solution

One simple way of getting the order you want would be using np.roll:

lst = [ "Carrot", "Star", "Beta", "Zoro" , ""]
arr = np.array(lst)
idx = np.roll(arr.argsort(),np.count_nonzero(arr))
arr[idx]
# array(['Beta', 'Carrot', 'Star', 'Zoro', ''], dtype='<U6')
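
Another option (a sketch) is np.lexsort, using "is the string empty?" as the primary key and the strings themselves as the secondary key, which avoids computing the roll offset:

idx = np.lexsort((arr, arr == ''))   # primary key: emptiness (False sorts first), secondary: alphabetical
arr[idx]
# array(['Beta', 'Carrot', 'Star', 'Zoro', ''], dtype='<U6')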


Answered By - Paul Panzer
Answer Checked By - Senaida (PHPFixing Volunteer)

Wednesday, October 19, 2022

[FIXED] How do I check if two vectors are equal using a function?

 October 19, 2022     equals, function, numpy, python, vector     No comments   

Issue

I am attempting to check if two vectors are equal using a function. I don't know if I am using the correct function because I am not getting true or false as a return. Here is my code:

import numpy as np

x=np.array([1,2,3,4])

y=np.array([1,2,3,4])

def check(x,y):

    if x == y:
        print("They are equal")

When I run the code, it does not return anything so I am assuming it is not running the if statement. Am I writing the function correctly or what should I adjust?


Solution

To check whether NumPy arrays are equal, you can use np.array_equal. It's also better practice to return a result from the function instead of printing it.

def check(x,y):
    if np.array_equal(x,y):
        return "They are equal"
    return "Not equal"

Execution:

print(check(x,y))
# They are equal
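
As a side note, for floating-point vectors exact equality is often too strict; np.allclose compares within a tolerance (a minimal sketch):

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])
print(np.array_equal(a, b))  # False: 0.1 + 0.2 is not exactly 0.3
print(np.allclose(a, b))     # True: equal within the default tolerance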


Answered By - Rahul K P
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Sunday, October 9, 2022

[FIXED] How to combine/union two separate numpy Gaussian sets?

 October 09, 2022     numpy, python, statistics     No comments   

Issue

I want to combine two separate random Gaussian data sets, one with its own mean and std and the other with an outlier mean and std. The code that I have is this:

import random
import numpy as np
import numpy.random as ra
from numpy.random import seed 

#This makes the random numbers generated not change when rerunning the code
np.random.seed(0)

#Creating two Gaussian sets, one with mean 0 and std 1, the second is outlier with mean 3 and std 1
#Each set contains 1,000 trials, first set contains 99 points while outlier set contains 1 point for each trial (for 1% outlier)

data = np.random.normal(loc=0, scale=1, size=(1000, 99))
dataoutlier = np.random.normal(loc=3, scale=1, size=(1000, 1))

Now how can I combine these so the outlier numbers are with the first set for each trial? I thought using np.union1d would work, but that combines all the trials into one giant array. Any help would be much appreciated!


Solution

In order to combine two numpy arrays by column, you can use np.append with axis=1.

np.append(data, dataoutlier, axis=1)
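
For reference, np.concatenate gives the same result, and the combined array keeps one row per trial; a quick sketch:

combined = np.append(data, dataoutlier, axis=1)
# equivalently: combined = np.concatenate([data, dataoutlier], axis=1)
print(combined.shape)  # (1000, 100): 99 regular points plus 1 outlier per trial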


Answered By - Grzegorz
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] How to count percentile of distribution in python

 October 09, 2022     math, numpy, percentile, python, statistics     No comments   

Issue

Is there any python/numpy function that calculates the n-th percentile of a given probability distribution?

# Like This
distr = [.2, .6, .2]
do_some_magic(distr, 50)  # 1
distr = [.1, .1, .6, .2]
do_some_magic(distr, 50)  # 2

Solution

Yes, you can use scipy's percentileofscore.

from scipy.stats import percentileofscore

distr = [.2, .6, .2]

print(percentileofscore(distr,50)/100)
1.0
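
If the goal is instead the index at which the cumulative probability reaches the requested percentile (which is what both examples in the question suggest), a cumulative-sum sketch would be:

import numpy as np

def percentile_index(distr, q):
    # return the first index where the cumulative probability reaches q percent
    return int(np.searchsorted(np.cumsum(distr), q / 100.0))

print(percentile_index([.2, .6, .2], 50))      # 1
print(percentile_index([.1, .1, .6, .2], 50))  # 2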


Answered By - Machetes0602
Answer Checked By - Willingham (PHPFixing Volunteer)

[FIXED] How to interpret scipy.stats.probplot results?

 October 09, 2022     matplotlib, numpy, plot, python, statistics     No comments   

Issue

I wanted to use scipy.stats.probplot() to perform a Gaussianity test on mydata.

from scipy import stats
_,fit=stats.probplot(mydata, dist=stats.norm,plot=ax)
goodness_fit="%.2f" %fit[2]

The documentation says:

Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function. probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

But if you google "probability plot", it appears to be a common name for a P-P plot, while the documentation says not to confuse the two.

Now I am confused, what is this function doing?


Solution

I looked for hours for an answer to this question, and it can be found in the Scipy/Statsmodels code comments.

In Scipy, comment at https://github.com/scipy/scipy/blob/abdab61d65dda1591f9d742230f0d1459fd7c0fa/scipy/stats/morestats.py#L523 says:

probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

So, now, let's look at Statsmodels, where comment at https://github.com/statsmodels/statsmodels/blob/66fc298c51dc323ce8ab8564b07b1b3797108dad/statsmodels/graphics/gofplots.py#L58 says:

ppplot : Probability-Probability plot Compares the sample and theoretical probabilities (percentiles).

qqplot : Quantile-Quantile plot Compares the sample and theoretical quantiles

probplot : Probability plot Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.

So, the difference between a Q-Q plot and a probability plot, in these modules, is related to the scales.



Answered By - mike123
Answer Checked By - Robin (PHPFixing Admin)

[FIXED] How to calculate moving / running / rolling arbitrary function (e.g. kurtosis & skewness) using NumPy / SciPy

 October 09, 2022     numpy, python, scipy, statistics     No comments   

Issue

I am working on time-series data. To get features from the data I have to calculate the moving mean, median, mode, slope, kurtosis, skewness, etc. I am familiar with scipy.stats, which provides an easy way to calculate these quantities as straight (non-moving) calculations. But for the moving/running part, I have explored the whole internet and got nothing.

Surprisingly, the moving mean, median and mode are very easy to calculate with numpy. Unfortunately, there is no built-in function for calculating moving kurtosis and skewness. If someone can help: how do I calculate moving kurtosis and skewness with scipy? Many thanks.


Solution

Pandas offers a DataFrame.rolling() method which can be used, in combination with its Rolling.apply() method (i.e. df.rolling().apply()) to apply an arbitrary function to the specified rolling window.
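
For example, a minimal pandas sketch of that idea (window of 4, applying scipy.stats.kurtosis and scipy.stats.skew):

import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series(np.arange(30, dtype=float))
rolling_kurt = s.rolling(window=4).apply(stats.kurtosis, raw=True)
rolling_skew = s.rolling(window=4).apply(stats.skew, raw=True)

(pandas also ships Rolling.kurt() and Rolling.skew() built in, which avoid the Python-level apply, though they use different bias corrections than scipy's defaults.)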


If you are looking for a NumPy-based solution, you could use FlyingCircus Numeric (disclaimer: I am the main author of it).

There, you could find the following:

  1. flyingcircus_numeric.running_apply(): can apply any function to a 1D array and supports weights, but it is slow;
  2. flyingcircus_numeric.moving_apply(): can apply any function supporting an axis: int parameter to a 1D array and supports weights, and it is fast (but memory-hungry);
  3. flyingcircus_numeric.rolling_apply_nd(): can apply any function supporting an axis: int|Sequence[int] parameter to any ND array and it is fast (and memory-efficient), but it does not support weights.

Based on your requirements, I would suggest using rolling_apply_nd(), e.g.:

import numpy as np
import scipy as sp
import flyingcircus_numeric as fcn

import scipy.stats


NUM = 30
arr = np.arange(NUM)

window = 4
new_arr = fcn.rolling_apply_nd(arr, window, func=sp.stats.kurtosis)
print(new_arr)
# [-1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36]

Of course, feel free to inspect the source code, it is open source (GPL).


EDIT

Just to get a feeling of the kind of speed we are talking about, these are the benchmarks for the solutions implemented in FlyingCircus:

(Benchmark plots: runtime vs. input size, full range and zoomed view.)

The general approach flyingcircus_numeric.running_apply() is a couple of orders of magnitude slower than either flyingcircus_numeric.rolling_apply_nd() or flyingcircus_numeric.moving_apply(), with the first being approx. one order of magnitude faster than the second. This shows the speed price for generality or support for weighting.

The above plots were obtained using the scripts from here and the following code:

import numpy as np
import scipy as sp
import flyingcircus_numeric as fcn

import scipy.stats


WINDOW = 4
FUNC = sp.stats.kurtosis


def my_rolling_apply_nd(arr, window=WINDOW, func=FUNC):
    return fcn.rolling_apply_nd(arr, window, func=FUNC)


def my_moving_apply(arr, window=WINDOW, func=FUNC):
    return fcn.moving_apply(arr, window, func)


def my_running_apply(arr, window=WINDOW, func=FUNC):
    return fcn.running_apply(arr, window, func)


def equal_output(a, b):
    return np.all(np.isclose(a, b))


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000)
funcs = my_rolling_apply_nd, my_moving_apply, my_running_apply

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=np.random.random, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='s')
plot_benchmarks(runtimes, input_sizes, labels, units='ms', zoom_fastest=8)

(EDITED to reflect some refactoring of FlyingCircus)



Answered By - norok2
Answer Checked By - Katrina (PHPFixing Volunteer)

Saturday, October 8, 2022

[FIXED] What is the most efficient way to bootstrap the mean of a list of numbers?

 October 08, 2022     numpy, python, statistics, statistics-bootstrap     No comments   

Issue

I have a list of numbers (floats) and I would like to estimate the mean. I also need to estimate the variation of that mean. My goal is to resample the list 100 times, and my output would be an array of length 100, each element corresponding to the mean of a resampled list.

Here is a simple workable example for what I would like to achieve:

import numpy as np
data = np.linspace(0, 4, 5)
ndata, boot = len(data), 100
output = np.mean(np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata)), axis=1)

This is, however, quite slow when I have to repeat it for many lists with large numbers of elements. The method also seems very clunky and un-Pythonic. What would be a better way to achieve my goal?

P.S. I am aware of scipy.stats.bootstrap, but I have problems upgrading scipy to 1.7.1 in Anaconda to import it.


Solution

Use np.random.choice:

import numpy as np

data = np.linspace(0, 4, 5)
ndata, boot = len(data), 100
output = np.mean(
    np.random.choice(data, size=(boot, ndata)),
    axis=1)

If I understood correctly, this expression (in your question's code):

np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata))

is doing a sampling with replacement and that is exactly what np.random.choice does.

Here are some timings for reference:

%timeit np.mean(np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata)), axis=1)
133 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(np.random.choice(data, size=(boot, ndata)),axis=1)
41.1 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As can be seen, np.random.choice yields a 3x improvement.



Answered By - Dani Mesejo
Answer Checked By - Marilyn (PHPFixing Volunteer)

[FIXED] How do I calculate r-squared using Python and Numpy?

 October 08, 2022     curve-fitting, math, numpy, python, statistics     No comments   

Issue

I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared (coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
     # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0,1]

     # r
    results['correlation'] = correlation
     # r-squared
    results['determination'] = correlation**2

    return results

Solution

From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x**(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

where I use 'y_bar' for the mean of the y's, and 'y_ihat' for the fitted value at each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

     # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results


Answered By - leif
Answer Checked By - Senaida (PHPFixing Volunteer)

[FIXED] How to prevent overflow in MLE method for large data

 October 08, 2022     numpy, python, scipy, scipy-optimize, statistics     No comments   

Issue

I am trying to do manual maximum likelihood estimation (MLE) using scipy. My dataset is not that large, so it surprises me that my values get very large very fast and scipy.optimize.minimize seems to run into NaNs for my density extremely quickly. I've tried using the sum of logarithms instead of the product of the densities, but that did not make things better at all.

This is my data:

[
        1.000, 1.000, 1.000, 1.004, 1.005, 1.008, 1.014, 1.015, 1.023, 1.035, 1.038,
        1.046, 1.048, 1.050, 1.050, 1.052, 1.052, 1.057, 1.063, 1.070, 1.070, 1.076,
        1.087, 1.090, 1.091, 1.096, 1.101, 1.102, 1.113, 1.114, 1.120, 1.130, 1.131,
        1.150, 1.152, 1.154, 1.155, 1.162, 1.170, 1.177, 1.189, 1.191, 1.193, 1.200,
        1.200, 1.200, 1.200, 1.205, 1.210, 1.218, 1.238, 1.238, 1.241, 1.250, 1.250,
        1.256, 1.257, 1.272, 1.278, 1.289, 1.299, 1.300, 1.316, 1.331, 1.349, 1.374,
        1.378, 1.382, 1.396, 1.426, 1.429, 1.439, 1.443, 1.446, 1.473, 1.475, 1.478,
        1.499, 1.506, 1.559, 1.568, 1.594, 1.609, 1.626, 1.649, 1.650, 1.669, 1.675,
        1.687, 1.715, 1.720, 1.735, 1.750, 1.755, 1.787, 1.797, 1.805, 1.898, 1.908,
        1.940, 1.989, 2.010, 2.012, 2.024, 2.047, 2.081, 2.085, 2.097, 2.136, 2.178,
        2.181, 2.193, 2.200, 2.220, 2.301, 2.354, 2.359, 2.382, 2.409, 2.418, 2.430,
        2.477, 2.500, 2.534, 2.572, 2.588, 2.591, 2.599, 2.660, 2.700, 2.700, 2.744,
        2.845, 2.911, 2.952, 3.006, 3.021, 3.048, 3.059, 3.092, 3.152, 3.276, 3.289,
        3.440, 3.447, 3.498, 3.705, 3.870, 3.896, 3.969, 4.000, 4.009, 4.196, 4.202,
        4.311, 4.467, 4.490, 4.601, 4.697, 5.100, 5.120, 5.136, 5.141, 5.165, 5.260,
        5.329, 5.778, 5.794, 6.285, 6.460, 6.917, 7.295, 7.701, 8.032, 8.142, 8.864,
        9.263, 9.359, 10.801, 11.037, 11.504, 11.933, 11.998, 12.000, 14.153, 15.000,
        15.398, 19.793, 23.150, 27.769, 28.288, 34.325, 42.691, 62.037, 77.839
]

How can I perform the MLE without running into overflow issues?

In case you are wondering, I am trying to fit the function f(x) = alpha * 500000^(alpha) / x^(alpha+1). So what I have is

import numpy as np
from scipy.optimize import minimize

# data = the given dataset
log_pareto_pdf = lambda alpha, x: np.log(alpha * (5e5 ** alpha) / (x ** (alpha + 1)))
ret = minimize(lambda alpha: -np.sum([log_pareto_pdf(alpha, x) for x in data]), x0=np.array([1]))

Solution

You need to avoid the giant exponentiations. One way to do this is to actually simplify your function:

log_pareto_pdf = lambda alpha, x: np.log(alpha) + alpha*np.log(5e5) - (alpha + 1)*np.log(x)

Without simplifying, your program still needs to try to calculate the 5e5**alpha term, which gets really big really fast (it overflows once alpha reaches about 55).

You'll also need to supply the bounds argument to minimize to prevent any negatives inside the logs.
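
For completeness, a sketch of how the call might look with the simplified log-density and a positivity bound on alpha (when bounds are given, minimize automatically picks a bound-aware method such as L-BFGS-B):

data = np.asarray(data)
neg_log_likelihood = lambda alpha: -np.sum(log_pareto_pdf(alpha, data))
ret = minimize(neg_log_likelihood, x0=np.array([1.0]), bounds=[(1e-9, None)])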



Answered By - AJ Biffl
Answer Checked By - Terry (PHPFixing Volunteer)

[FIXED] How to force zero interception in linear regression?

 October 08, 2022     linear-regression, numpy, python, scipy, statistics     No comments   

Issue

I have some more or less linear data of the form:

x = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 20.0, 40.0, 60.0, 80.0]
y = [0.50505332505407008, 1.1207373784533172, 2.1981844719020001, 3.1746209003398689, 4.2905482471260044, 6.2816226678076958, 11.073788414382639, 23.248479770546009, 32.120462301367183, 44.036117671229206, 54.009003143831116, 102.7077685684846, 185.72880217806673, 256.12183145545811, 301.97120103079675]

I am using scipy.optimize.leastsq to fit a linear regression to this:

def lin_fit(x, y):
    '''Fits a linear fit of the form mx+b to the data'''
    fitfunc = lambda params, x: params[0] * x + params[1]    #create fitting function of form mx+b
    errfunc = lambda p, x, y: fitfunc(p, x) - y              #create error function for least squares fit

    init_a = 0.5                            #find initial value for a (gradient)
    init_b = min(y)                         #find initial value for b (y axis intersection)
    init_p = numpy.array((init_a, init_b))  #bundle initial values in initial parameters

    #calculate best fitting parameters (i.e. m and b) using the error function
    p1, success = scipy.optimize.leastsq(errfunc, init_p.copy(), args = (x, y))
    f = fitfunc(p1, x)          #create a fit with those parameters
    return p1, f    

And it works beautifully (although I am not sure if scipy.optimize is the right thing to use here, it might be a bit over the top?).

However, due to the way the data points lie, it does not give me a y-axis intercept of 0. I do know, though, that it has to be zero in this case: if x = 0 then y = 0.

Is there any way I can force this?


Solution

I am not adept at these modules, but I have some experience in statistics, so here is what I see. You need to change your fit function from

fitfunc = lambda params, x: params[0] * x + params[1]  

to:

fitfunc = lambda params, x: params[0] * x 

Also remove the line:

init_b = min(y) 

And change the next line to:

init_p = numpy.array((init_a))

This should get rid of the second parameter that is producing the y-intercept and force the fitted line through the origin. There might be a couple more minor alterations you have to make in the rest of your code for this.

But yes, I'm not sure if this module will work if you just pluck the second parameter away like this. It depends on the internal workings of the module as to whether it can accept this modification. For example, I don't know where params, the list of parameters, is being initialized, so I don't know if doing just this will change its length.

And as an aside, since you mentioned it, I actually think this is a bit of an over-the-top way to optimize just a slope. You could read up on linear regression a little and write small code to do it yourself after some back-of-the-envelope calculus. It's pretty simple and straightforward, really. In fact, I just did the calculation, and the optimized slope is just <xy>/<x^2>, i.e. the mean of the x*y products divided by the mean of the x^2's.
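
A minimal sketch of that closed-form through-origin slope, using the x and y from the question as NumPy arrays:

x = numpy.asarray(x, dtype=float)
y = numpy.asarray(y, dtype=float)

# least-squares slope with the intercept forced to zero: m = sum(x*y) / sum(x*x)
m = numpy.dot(x, y) / numpy.dot(x, x)
fit = m * x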



Answered By - Abhranil Das
Answer Checked By - Cary Denson (PHPFixing Admin)

Friday, October 7, 2022

[FIXED] How to calculate mean/variance/standard deviation per index of array?

 October 07, 2022     arrays, mean, numpy, python, statistics     No comments   

Issue

I have some data like [[0, 1, 2], [0.5, 1.5, 2.5], [0.3, 1.3, 2.3]].

I am using numpy and python and I wish to calculate the mean and standard deviation for my data, per index. So I wish to calculate the mean/std for (0, 0.5, 0.3) (e.g. index 0 of each subarray), (1, 1.5, 1.3) (e.g. index 1 of each subarray), and so on.

Any suggestions? (including how I can store the result and visualize it, maybe using graphing or matplotlib?)

Thank you so much, in advance. Any introduction to packages that might solve this problem would be really helpful, as well.


Solution

The various statistics functions all take an axis argument that will allow you to calculate the statistic over a column:

import numpy as np

a = np.array([[0, 1, 2], [0.5, 1.5, 2.5], [0.3, 1.3, 2.3]])

np.mean(a, axis=0)
# array([0.26666667, 1.26666667, 2.26666667])

np.std(a, axis=0)
# array([0.20548047, 0.20548047, 0.20548047])

np.var(a, axis=0)
# array([0.04222222, 0.04222222, 0.04222222])
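
The question also mentions visualization; a minimal matplotlib sketch (an assumption: error bars showing one standard deviation around the per-index mean):

import matplotlib.pyplot as plt

means = np.mean(a, axis=0)
stds = np.std(a, axis=0)
plt.errorbar(np.arange(a.shape[1]), means, yerr=stds, fmt='o-', capsize=4)
plt.xlabel('index within subarray')
plt.ylabel('mean +/- 1 std')
plt.show()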


Answered By - Mark
Answer Checked By - Senaida (PHPFixing Volunteer)