PHPFixing
Showing posts with label scipy. Show all posts

Friday, November 4, 2022

[FIXED] How to pass args() to integrate.quad in scipy?

 November 04, 2022     lambda, parameter-passing, python, scipy     No comments   

Issue

I was doing a numerical integration where instead of using

scipy.integrate.dblquad

I tried to build the double integral directly using

scipy.integrate.quad

From camz's answer: https://stackoverflow.com/a/30786163/11141816

import numpy as np
from scipy import integrate

a = 1
b = 2

def integrand(theta, t):
    return np.sin(t) * np.sin(theta)

def theta_integral(t):
    # Note scipy will pass args as the second argument
    return integrate.quad(integrand, b - a, t, args=(t))[0]

integrate.quad(theta_integral, b - a, b + a)

However, the part that confused me was how args gets passed to the function. For example, in that post args=(t) was automatically passed to the second argument t of integrand, not the first argument theta, while in the scipy documentation https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quad.html the only reference is

args: tuple, optional
    Extra arguments to pass to func.

and in fact their example writes the argument with a trailing comma, args=(1,), rather than args=(1).

What if I decide to pass several variables, e.g.

def integrand(theta, t, a, b):
    return np.sin(t) * np.sin(theta) * a * b

how would I know whether args=(t,a,b) or args=(t,a,b,) is the correct form of args for the function?

How to pass args() to integrate.quad in scipy?


Solution

In [214]: from scipy.integrate import quad

A simple example. I'm not using the args, just showing them:

In [215]: def foo(x,a,b,c):
     ...:     print(x,a,b,c)
     ...:     return x**2
     ...:     

In [216]: quad(foo, 0, 1, (1,(1,2,3),np.arange(2)))
0.5 1 (1, 2, 3) [0 1]
0.013046735741414128 1 (1, 2, 3) [0 1]
0.9869532642585859 1 (1, 2, 3) [0 1]
0.06746831665550773 1 (1, 2, 3) [0 1]
0.9325316833444923 1 (1, 2, 3) [0 1]
0.16029521585048778 1 (1, 2, 3) [0 1]
0.8397047841495122 1 (1, 2, 3) [0 1]
 ...
0.35280356864926987 1 (1, 2, 3) [0 1]
0.6471964313507301 1 (1, 2, 3) [0 1]
Out[216]: (0.33333333333333337, 3.700743415417189e-15)

That's all there is to args.

Looking at the code for quad, the first line is

if not isinstance(args, tuple):
    args = (args,)

So if you use args=(t), that will be changed to args=(t,). Remember, in Python (2) is the same as 2; the parentheses just group operations in complex expressions and are otherwise ignored. But (1,) is a tuple, as is (1,2). It's really the comma that defines the tuple.
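
As a quick illustration of that point (a minimal sketch, not from the original answer):

print(type((2)))   # <class 'int'>   -- plain parentheses are ignored
print(type((2,)))  # <class 'tuple'> -- the trailing comma makes the tuple
print((1) == 1)    # True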

With your inner function, and an added print:

In [232]: def integrand(theta, t): 
     ...:     print(theta, t)
     ...:     return np.sin(t)*np.sin(theta)
     ...:     

In [233]: integrand(1,2)     # pass 2 numbers 
1 2
Out[233]: 0.7651474012342926

In [234]: integrand(1,np.arange(3))   # it still works if the 2nd is an array:

1 [0 1 2]
Out[234]: array([0.        , 0.70807342, 0.7651474 ]) 

t is an array, and so is np.sin(t), and also the return. But you aren't passing arrays. Calling it via quad, the theta varies, the t does not:

In [235]: quad(integrand, 0, 1, args=(1,))
0.5 1
0.013046735741414128 1
0.9869532642585859 1
...
0.6471964313507301 1
Out[235]: (0.38682227139505565, 4.2945899214059184e-15)

In [236]: quad(integrand, 0, 1, args=(1,2))
---------------------------------------------------------------------------
...
TypeError: integrand() takes 2 positional arguments but 3 were given

The actual error is raised in compiled code, so it isn't quite as transparent, but the effect is the same as if I tried to pass 3 arguments:

In [237]: integrand(1,1,2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [237], in <cell line: 1>()
----> 1 integrand(1,1,2)

TypeError: integrand() takes 2 positional arguments but 3 were given


Answered By - hpaulj
Answer Checked By - Katrina (PHPFixing Volunteer)

Friday, October 28, 2022

[FIXED] How to check if a SciPy CSR matrix is empty (i.e. contains only zeroes)?

 October 28, 2022     is-empty, python, scipy, sparse-matrix     No comments   

Issue

What is the canonical way to check if a SciPy CSR matrix is empty (i.e. contains only zeroes)?

I use nonzero():

def is_csr_matrix_only_zeroes(my_csr_matrix):
    return(len(my_csr_matrix.nonzero()[0]) == 0)

from scipy.sparse import csr_matrix
print(is_csr_matrix_only_zeroes(csr_matrix([[1,2,0],[0,0,3],[4,0,5]])))
print(is_csr_matrix_only_zeroes(csr_matrix([[0,0,0],[0,0,0],[0,0,0]])))
print(is_csr_matrix_only_zeroes(csr_matrix((2,3))))
print(is_csr_matrix_only_zeroes(csr_matrix([[0,0,0],[0,1,0],[0,0,0]])))

outputs

False
True
True
False

but I wonder whether there exist more direct or efficient ways.

(Related but different: Check if scipy sparse matrix entry exists)


Solution

my_csr_matrix.nnz == 0

The nnz attribute records the Number of NonZero entries... unless your CSR matrix is in a weird, non-normalized form, for example if it has duplicate entries or explicitly stored zeros.

If you have to deal with duplicate entries or explicit zeros, you can use the much more expensive csr_matrix.count_nonzero method:

my_csr_matrix.count_nonzero() == 0
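
To see the difference, here is a minimal sketch (the data/indices/indptr values are made up for illustration) that builds a CSR matrix with one explicitly stored zero:

import numpy as np
from scipy.sparse import csr_matrix

# one explicitly stored entry whose value happens to be 0
data = np.array([0.0])
indices = np.array([1])
indptr = np.array([0, 1, 1])
m = csr_matrix((data, indices, indptr), shape=(2, 3))

print(m.nnz)              # 1 -- counts the stored (explicit) zero
print(m.count_nonzero())  # 0 -- ignores it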


Answered By - user2357112
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Sunday, October 9, 2022

[FIXED] What does "inf" mean for F value in scipy.stats.f_oneway?

 October 09, 2022     anova, pandas, python, scipy, statistics     No comments   

Issue

I ran a one-way anova, and some groups had "inf" for the F value and "0.000000e+00" for the p value. Does this mean that the difference is significant?

I separated the dataframe using groupby and looped through, example code:

from scipy import stats

c_jobs_anova = []

for name_group in c_jobs.groupby(['Name']):
    samples = [condition[1] for condition in name_group[1].groupby('Condition')['Value']]
    f_value, p_value = stats.f_oneway(*samples)
    print('Group: {}, F value: {:.3f}, p value: {:.3f}'.format(name_group[0], f_value, p_value))
    c_jobs_anova.append((name_group[0], f_value, p_value))

The result:

(screenshot of the printed ANOVA results omitted)


Solution

Yes, very large values of the F-statistic indicate high significance, as witnessed by p being reported as 0. Mathematically, F comes out as infinity if there is no in-group variability, e.g.,

>>> stats.f_oneway([2, 2, 2], [1, 1, 1])
F_onewayResult(statistic=inf, pvalue=0.0)

This result is also possible if there is very little in-group variability compared to between-group variability, resulting in numerical overflow.

>>> stats.f_oneway([2, 2, 2], [1, 1, 1.00000001])
F_onewayResult(statistic=inf, pvalue=0.0)


Answered By - user6655984
Answer Checked By - Mildred Charles (PHPFixing Admin)

[FIXED] How to calculate moving / running / rolling arbitrary function (e.g. kurtosis & skewness) using NumPy / SciPy

 October 09, 2022     numpy, python, scipy, statistics     No comments   

Issue

I am working on time-series data. To get features from the data I have to calculate the moving mean, median, mode, slope, kurtosis, skewness etc. I am familiar with scipy.stats, which provides an easy way to calculate these quantities directly. But for the moving/running part, I have explored the whole internet and got nothing.

Surprisingly, the moving mean, median and mode are very easy to calculate with numpy. Unfortunately, there is no built-in function for calculating kurtosis and skewness. If someone can help: how do I calculate moving kurtosis and skewness with scipy? Many thanks.


Solution

Pandas offers a DataFrame.rolling() method which can be used, in combination with its Rolling.apply() method (i.e. df.rolling().apply()), to apply an arbitrary function to the specified rolling window.
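
For example, a minimal sketch of that pandas route (the window size and the random data are arbitrary):

import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series(np.random.default_rng(0).normal(size=100))

# rolling kurtosis and skewness over a 10-sample window;
# raw=True hands a plain NumPy array to the function, which is faster
rolling_kurt = s.rolling(10).apply(stats.kurtosis, raw=True)
rolling_skew = s.rolling(10).apply(stats.skew, raw=True)
print(rolling_kurt.tail())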


If you are looking for a NumPy-based solution, you could use FlyingCircus Numeric (disclaimer: I am the main author of it).

There, you could find the following:

  1. flyingcircus_numeric.running_apply(): can apply any function to a 1D array and supports weights, but it is slow;
  2. flyingcircus_numeric.moving_apply(): can apply any function supporting an axis: int parameter to a 1D array and supports weights, and it is fast (but memory-hungry);
  3. flyingcircus_numeric.rolling_apply_nd(): can apply any function supporting an axis: int|Sequence[int] parameter to any ND array and is fast (and memory-efficient), but it does not support weights.

Based on your requirements, I would suggest using rolling_apply_nd(), e.g.:

import numpy as np
import scipy as sp
import flyingcircus_numeric as fcn

import scipy.stats


NUM = 30
arr = np.arange(NUM)

window = 4
new_arr = fcn.rolling_apply_nd(arr, window, func=sp.stats.kurtosis)
print(new_arr)
# [-1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36]

Of course, feel free to inspect the source code, it is open source (GPL).


EDIT

Just to get a feeling of the kind of speed we are talking about, these are the benchmarks for the solutions implemented in FlyingCircus:

(benchmark plots omitted)

The general approach flyingcircus_numeric.running_apply() is a couple of orders of magnitude slower than either flyingcircus_numeric.rolling_apply_nd() or flyingcircus_numeric.moving_apply(), with the first of those being approx. one order of magnitude faster than the second. This shows the speed price paid for generality or for supporting weights.

The above plots were obtained using the scripts from here and the following code:

import numpy as np
import scipy as sp
import scipy.stats

import flyingcircus_numeric as fcn


WINDOW = 4
FUNC = sp.stats.kurtosis


def my_rolling_apply_nd(arr, window=WINDOW, func=FUNC):
    return fcn.rolling_apply_nd(arr, window, func=FUNC)


def my_moving_apply(arr, window=WINDOW, func=FUNC):
    return fcn.moving_apply(arr, window, func)


def my_running_apply(arr, window=WINDOW, func=FUNC):
    return fcn.running_apply(arr, window, func)


def equal_output(a, b):
    return np.all(np.isclose(a, b))


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000)
funcs = my_rolling_apply_nd, my_moving_apply, my_running_apply

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=np.random.random, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='s')
plot_benchmarks(runtimes, input_sizes, labels, units='ms', zoom_fastest=8)

(EDITED to reflect some refactoring of FlyingCircus)



Answered By - norok2
Answer Checked By - Katrina (PHPFixing Volunteer)

Saturday, October 8, 2022

[FIXED] How to calculate probability in a normal distribution given mean & standard deviation?

 October 08, 2022     probability, python, scipy, statistics     No comments   

Issue

How to calculate probability in normal distribution given mean, std in Python? I can always explicitly code my own function according to the definition like the OP in this question did: Calculating Probability of a Random Variable in a Distribution in Python

Just wondering if there is a library function call that will allow you to do this. I imagine it would look something like this:

nd = NormalDistribution(mu=100, std=12)
p = nd.prob(98)

There is a similar question in Perl: How can I compute the probability at a point given a normal distribution in Perl?. But I didn't see one in Python.

Numpy has a random.normal function, but it's like sampling, not exactly what I want.


Solution

There's one in scipy.stats:

>>> import scipy.stats
>>> scipy.stats.norm(0, 1)
<scipy.stats.distributions.rv_frozen object at 0x928352c>
>>> scipy.stats.norm(0, 1).pdf(0)
0.3989422804014327
>>> scipy.stats.norm(0, 1).cdf(0)
0.5
>>> scipy.stats.norm(100, 12)
<scipy.stats.distributions.rv_frozen object at 0x928352c>
>>> scipy.stats.norm(100, 12).pdf(98)
0.032786643008494994
>>> scipy.stats.norm(100, 12).cdf(98)
0.43381616738909634
>>> scipy.stats.norm(100, 12).cdf(100)
0.5

[One thing to beware of -- just a tip -- is that the parameter passing is a little broad. Because of the way the code is set up, if you accidentally write scipy.stats.norm(mean=100, std=12) instead of scipy.stats.norm(100, 12) or scipy.stats.norm(loc=100, scale=12), then it'll accept it, but silently discard those extra keyword arguments and give you the default (0,1).]



Answered By - DSM
Answer Checked By - Katrina (PHPFixing Volunteer)

[FIXED] How to prevent overflow in MLE method for large data

 October 08, 2022     numpy, python, scipy, scipy-optimize, statistics     No comments   

Issue

I am trying to do a manual MLE estimation using scipy. My dataset is not that large so it surprises me that my values get very large very fast and scipy.optimize.minimize seems to get into NaNs for my density extremely quickly. I've tried to use the sum of logarithms instead of the product of the densities but that did not make things better at all.

This is my data:

[
        1.000, 1.000, 1.000, 1.004, 1.005, 1.008, 1.014, 1.015, 1.023, 1.035, 1.038,
        1.046, 1.048, 1.050, 1.050, 1.052, 1.052, 1.057, 1.063, 1.070, 1.070, 1.076,
        1.087, 1.090, 1.091, 1.096, 1.101, 1.102, 1.113, 1.114, 1.120, 1.130, 1.131,
        1.150, 1.152, 1.154, 1.155, 1.162, 1.170, 1.177, 1.189, 1.191, 1.193, 1.200,
        1.200, 1.200, 1.200, 1.205, 1.210, 1.218, 1.238, 1.238, 1.241, 1.250, 1.250,
        1.256, 1.257, 1.272, 1.278, 1.289, 1.299, 1.300, 1.316, 1.331, 1.349, 1.374,
        1.378, 1.382, 1.396, 1.426, 1.429, 1.439, 1.443, 1.446, 1.473, 1.475, 1.478,
        1.499, 1.506, 1.559, 1.568, 1.594, 1.609, 1.626, 1.649, 1.650, 1.669, 1.675,
        1.687, 1.715, 1.720, 1.735, 1.750, 1.755, 1.787, 1.797, 1.805, 1.898, 1.908,
        1.940, 1.989, 2.010, 2.012, 2.024, 2.047, 2.081, 2.085, 2.097, 2.136, 2.178,
        2.181, 2.193, 2.200, 2.220, 2.301, 2.354, 2.359, 2.382, 2.409, 2.418, 2.430,
        2.477, 2.500, 2.534, 2.572, 2.588, 2.591, 2.599, 2.660, 2.700, 2.700, 2.744,
        2.845, 2.911, 2.952, 3.006, 3.021, 3.048, 3.059, 3.092, 3.152, 3.276, 3.289,
        3.440, 3.447, 3.498, 3.705, 3.870, 3.896, 3.969, 4.000, 4.009, 4.196, 4.202,
        4.311, 4.467, 4.490, 4.601, 4.697, 5.100, 5.120, 5.136, 5.141, 5.165, 5.260,
        5.329, 5.778, 5.794, 6.285, 6.460, 6.917, 7.295, 7.701, 8.032, 8.142, 8.864,
        9.263, 9.359, 10.801, 11.037, 11.504, 11.933, 11.998, 12.000, 14.153, 15.000,
        15.398, 19.793, 23.150, 27.769, 28.288, 34.325, 42.691, 62.037, 77.839
]

How can I perform the MLE algorithm without running into overflow issues?

In case you are wondering, I am trying to fit the function f(x) = alpha * 500000^(alpha) / x^(alpha+1). So what I have is

import numpy as np
from scipy.optimize import minimize

# data = the given dataset
log_pareto_pdf = lambda alpha, x: np.log(alpha * (5e5 ** alpha) / (x ** (alpha + 1)))
ret = minimize(lambda alpha: -np.sum([log_pareto_pdf(alpha, x) for x in data]), x0=np.array([1]))

Solution

You need to avoid the giant exponentiations. One way to do this is to actually simplify your function:

log_pareto_pdf = lambda alpha, x: np.log(alpha) + alpha*np.log(5e5) - (alpha + 1)*np.log(x)

Without simplifying, your program still needs to calculate the 5e5**alpha term, which gets really big really fast (it overflows around alpha = 55).

You'll also need to supply the bounds argument to minimize to prevent any negative values inside the logs.
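
As a quick numerical illustration of that overflow (the alpha and x values here are arbitrary):

import numpy as np

alpha = np.float64(60.0)   # past the ~55 threshold mentioned above
x = np.float64(2.5)

direct = alpha * (5e5 ** alpha) / (x ** (alpha + 1))                     # overflows
simplified = np.log(alpha) + alpha*np.log(5e5) - (alpha + 1)*np.log(x)   # stays finite

print(direct)       # inf (with an overflow RuntimeWarning)
print(simplified)   # about 735.6, perfectly representable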



Answered By - AJ Biffl
Answer Checked By - Terry (PHPFixing Volunteer)

[FIXED] How to force zero interception in linear regression?

 October 08, 2022     linear-regression, numpy, python, scipy, statistics     No comments   

Issue

I have some more or less linear data of the form:

x = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 20.0, 40.0, 60.0, 80.0]
y = [0.50505332505407008, 1.1207373784533172, 2.1981844719020001, 3.1746209003398689, 4.2905482471260044, 6.2816226678076958, 11.073788414382639, 23.248479770546009, 32.120462301367183, 44.036117671229206, 54.009003143831116, 102.7077685684846, 185.72880217806673, 256.12183145545811, 301.97120103079675]

I am using scipy.optimize.leastsq to fit a linear regression to this:

def lin_fit(x, y):
    '''Fits a linear fit of the form mx+b to the data'''
    fitfunc = lambda params, x: params[0] * x + params[1]    #create fitting function of form mx+b
    errfunc = lambda p, x, y: fitfunc(p, x) - y              #create error function for least squares fit

    init_a = 0.5                            #find initial value for a (gradient)
    init_b = min(y)                         #find initial value for b (y axis intersection)
    init_p = numpy.array((init_a, init_b))  #bundle initial values in initial parameters

    #calculate best fitting parameters (i.e. m and b) using the error function
    p1, success = scipy.optimize.leastsq(errfunc, init_p.copy(), args = (x, y))
    f = fitfunc(p1, x)          #create a fit with those parameters
    return p1, f    

And it works beautifully (although I am not sure if scipy.optimize is the right thing to use here, it might be a bit over the top?).

However, due to the way the data points lie, it does not give me a y-axis intercept at 0. I do know, though, that it has to be zero in this case: if x = 0 then y = 0.

Is there any way I can force this?


Solution

I am not adept at these modules, but I have some experience in statistics, so here is what I see. You need to change your fit function from

fitfunc = lambda params, x: params[0] * x + params[1]  

to:

fitfunc = lambda params, x: params[0] * x 

Also remove the line:

init_b = min(y) 

And change the next line to:

init_p = numpy.array((init_a))

This should get rid of the second parameter that is producing the y-intercept and force the fitted line through the origin. There might be a couple more minor alterations you have to make in the rest of your code.

But yes, I'm not sure if this module will work if you just pluck the second parameter away like this. It depends on the internal workings of the module as to whether it can accept this modification. For example, I don't know where params, the list of parameters, is being initialized, so I don't know if doing just this will change its length.

And as an aside, since you mentioned it, I actually think this is a bit of an over-the-top way to optimize just a slope. You could read up on linear regression a little and write small code to do it yourself after some back-of-the-envelope calculus. It's pretty simple and straightforward, really. In fact, I just did some calculations, and I guess the optimized slope will just be <xy>/<x^2>, i.e. the mean of the x*y products divided by the mean of the x^2's.
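
As a minimal sketch of that back-of-the-envelope formula applied to the data in the question:

import numpy as np

x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0,
              10.0, 20.0, 40.0, 60.0, 80.0])
y = np.array([0.50505332505407008, 1.1207373784533172, 2.1981844719020001,
              3.1746209003398689, 4.2905482471260044, 6.2816226678076958,
              11.073788414382639, 23.248479770546009, 32.120462301367183,
              44.036117671229206, 54.009003143831116, 102.7077685684846,
              185.72880217806673, 256.12183145545811, 301.97120103079675])

# slope of the least-squares line forced through the origin, y = m*x:
# minimizing sum((m*x - y)**2) over m gives m = sum(x*y) / sum(x*x)
m = np.dot(x, y) / np.dot(x, x)
print(m)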



Answered By - Abhranil Das
Answer Checked By - Cary Denson (PHPFixing Admin)

[FIXED] Why p-value / 2 for one-tailed ttest is correct?

 October 08, 2022     p-value, python, scipy, statistics     No comments   

Issue

I have a specific question about the case of a one-tailed t-test in Python.

In many articles I read statements like this:

In scipy there is no direct way to indicate that we want to run a one-tailed variant of the test. However, to obtain the desired results we adjust the output ourselves. In the case of this setting, we simply need to divide the p-value by 2 (the test statistic stays the same).

For example, here: https://towardsdatascience.com/one-tailed-or-two-tailed-test-that-is-the-question-1283387f631c

And I totally don't understand why dividing by 2 works correctly.

Let's look at the histogram for the one-tailed vs. two-tailed case (image omitted):

There is the same area under the curve, but on one side only. And after the z-transform the critical distance is not the same (it is 1.645, not 1.96).

So, finally, the question is: if we check the same area under the curve but the critical distance is not the same, why is dividing by 2 correct?

P.S. If you have a mathematical proof, I would be very thankful!


Solution

By definition, p-value is some area under the pdf of the test statistic under the null hypothesis.

Suppose we got a t statistic of 1.96.

For two-sided test, the p-value is by definition the area further than this statistic in both directions, i.e., the area of "less than -1.96 or more than 1.96", which happens to be 0.05; this is the p-value when we do a two-sided test.

On the other hand, for one-sided test, the p-value is by definition the area further than this statistic, in the direction of the statistic, i.e., the area of "more than 1.96". Since the distribution of the test statistics under the null is symmetric (around 0), this is exactly half of the p-value of the corresponding two-sided test.
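
As a small sketch of that halving in practice (the sample data are arbitrary; recent SciPy versions also accept an alternative= keyword on the t-test functions, which does this for you):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.3, 1.0, size=50)
b = rng.normal(0.0, 1.0, size=50)

t, p_two_sided = stats.ttest_ind(a, b)

# one-sided p-value for H1: mean(a) > mean(b); by symmetry of the null
# distribution it is half the two-sided p-value when t points the right way
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(t, p_two_sided, p_one_sided)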



Answered By - j1-lee
Answer Checked By - David Marino (PHPFixing Volunteer)

[FIXED] How to properly use Kolmogorov Smirnoff test in SciPy?

 October 08, 2022     python, scipy, statistics     No comments   

Issue

I have a distribution

(histogram of the distribution omitted)

This one looks pretty gaussian, and we also can't reject the idea with such a high p-value from the KS test.

BUT, the test distribution is actually also a generated one with a finite sample size and not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth gaussian function.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test

data = np.concatenate((d1,d2))

xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))

# lets try the normal distribution first
m, s = stats.norm.fit(data)         # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color = "lightgrey", normed = True, bins = 50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it


# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))

plt.show()

But I can't figure out how to use the default KS test for this, that is

KS_D, KS_p = stats.kstest(data, "norm"),

as it always returns a p-value of 0, i.e. my gaussian data must be in the wrong format.

How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?


Solution

"norm" uses a normal distribution that defaults to be zero-mean, with standard deviation 1 [ref]. Your data have values m and s for that, which are quite different. It is telling you they are very different from this standard reference distribution.

You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:

data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")


Answered By - surtur
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

[FIXED] How to operate a function over multiple columns (Pandas/Python)?

 October 08, 2022     pandas, python, scipy, statistics     No comments   

Issue

Let's consider the IBM HR Attrition Dataset from Kaggle (https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset). How do I rapidly get the variable with the highest Shapiro p-value?

In other words, I can apply the function shapiro() to a column as shapiro(df['column']), and I would like to calculate this for all the numeric columns.

I tried this:

import pandas as pd
from scipy.stats import shapiro

hr = pd.read_csv('path')

# here I was expecting the output to be sequential prints with the name of each
# column and its respective p-value from shapiro()
for col in hr:
    print(col, " : ", shapiro(hr[col])[0])

Anyone that could help on this?

Thanks in advance.


Solution

I hope this helps! I'm sure there are a lot better ways, but it was fun trying :)

import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv('path.csv')
# make a new dataframe newdf with only the columns containing numeric data

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = df.select_dtypes (include=numerics)
#check to see that the columns are only numeric
print(newdf.head())
# new dataframe with rows "W" and "P"
shapiro_wilks = (newdf).apply(lambda x: pd.Series(shapiro(x), index=['W','P'])).reset_index()
shapiro_wilks = shapiro_wilks.set_index('index') #ugh


print(shapiro_wilks)


Answered By - LynneKLR
Answer Checked By - Pedro (PHPFixing Volunteer)

Friday, October 7, 2022

[FIXED] How to generate random values for a predefined function?

 October 07, 2022     matplotlib, numpy, python, scipy, statistics     No comments   

Issue

I have a predefined function, for example this:

my_func = lambda x: (9 * math.exp((-0.5 * y) / 60))/1000

How can I generate random values against it so I can plot the results of the function using matplotlib?


Solution

If you want to plot, don't use random x values but rather a range.

Also, you should use numpy.exp, which can take a vector as input, and the y in the lambda should be x (y is undefined).

This gives us:

import numpy as np
import matplotlib.pyplot as plt

my_func = lambda x: (9 * np.exp((-0.5 * x) / 60))/1000

xs = np.arange(-1000,10)

plt.plot(xs, my_func(xs))

output:

(plot of the resulting exponential curve omitted)



Answered By - mozway
Answer Checked By - Senaida (PHPFixing Volunteer)

[FIXED] Why does the Gamma distribution in SciPy have three parameters?

 October 07, 2022     scipy, statistics     No comments   

Issue

Usually, the Gamma distribution has two parameters: shape and scale (or alternatively shape and rate). However, it seems that in SciPy the Gamma distribution has three parameters: two shape parameters and a location parameter.

Does anyone know the mapping between the SciPy parameters of Gamma, and, e.g., the parameters in the definition given on wikipedia:

http://en.wikipedia.org/wiki/Gamma_distribution

Thanks!


Solution

All the continuous distributions in scipy.stats have location and scale parameters, even those for which the location is not generally used. For the gamma distribution, just leave the location at its default value 0. If you are using the fit method, use the argument floc=0 to ensure that it does not treat the location as a free parameter.

The shape and scale parameters in the scipy gamma distribution correspond to k and θ, respectively, in the wikipedia page.
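
A small sketch of that mapping (the k and theta values are arbitrary):

from scipy import stats

k, theta = 2.5, 1.3    # wikipedia's shape k and scale theta

# scipy's "a" is k and "scale" is theta; leave loc at its default of 0
sample = stats.gamma.rvs(a=k, scale=theta, size=100000, random_state=0)

# fit with the location pinned at 0
a_hat, loc_hat, scale_hat = stats.gamma.fit(sample, floc=0)
print(a_hat, loc_hat, scale_hat)   # close to (2.5, 0.0, 1.3)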



Answered By - Warren Weckesser
Answer Checked By - Katrina (PHPFixing Volunteer)

[FIXED] How do you compute the confidence interval for Pearson's r in Python?

 October 07, 2022     pearson, python, scipy, statistics     No comments   

Issue

In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)


Solution

According to [1], calculation of confidence interval directly with Pearson r is complicated due to the fact that it is not normally distributed. The following steps are needed:

  1. Convert r to z',
  2. Calculate the z' confidence interval. The sampling distribution of z' is approximately normal and has a standard error of 1/sqrt(n-3).
  3. Convert the confidence interval back to r.

Here is some sample code:

import math
from scipy import stats

def r_to_z(r):
    return math.log((1 + r) / (1 - r)) / 2.0

def z_to_r(z):
    e = math.exp(2 * z)
    return((e - 1) / (e + 1))

def r_confidence_interval(r, alpha, n):
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha/2)  # 2-tailed z critical value

    lo = z - z_crit * se
    hi = z + z_crit * se

    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
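
A usage sketch (the r, alpha and n values below are arbitrary):

# 95% confidence interval for r = 0.66 observed on n = 50 pairs
lo, hi = r_confidence_interval(0.66, 0.05, 50)
print(lo, hi)   # roughly (0.47, 0.79)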

Reference:

  1. http://onlinestatbook.com/2/estimation/correlation_ci.html


Answered By - bennylp
Answer Checked By - Mildred Charles (PHPFixing Admin)

[FIXED] How to run scipy.stats.gmean for rows contanining values less than 1 and zeros?

 October 07, 2022     geometric-mean, numpy, python-3.x, scipy, statistics     No comments   

Issue

We have the following dataframe (df)

print(df)

 #Gene  GSM772  GSM773  GSM774  GSM775  GSM776
0610007P14Rik    0.003485    0.003415    0.005431    0.003667    0.007146
0610009B22Rik    0.001220    0.001351    0.001762    0.001404    0.002177
0610009L18Rik    0.000055    0.000009    0.000152    0.000082    0.000179
0610009O20Rik    0.000000    0.006830    0.000000    0.006653    0.006907
0610010F05Rik    0.008310    0.008329    0.007091    0.006919    0.006915

We want to calculate Geometric Mean for every row.

  • And append the result as the last column with the column name GeometricMean.

For some rows there are "zero" values, which need to be ignored, so that the geometric mean for that row is regarded as zero.

We wrote the following Python script:

import scipy
import numpy
import numpy as np
from scipy.stats.mstats import gmean
from scipy import stats

numpy.seterr(divide = 'ignore') 
scipy.stats.gmean(df.iloc[:,1:5],axis=1)

gmean = scipy.stats.gmean(df.iloc[:,1:5],axis=1)

df.assign(GeometricMean=gmean)
results = df.assign(GeometricMean=gmean)

print(results)
  • The following error is encountered:

    AttributeError: 'str' object has no attribute 'log'
    
     The above exception was the direct cause of the following exception:
    
     Traceback (most recent call last):
       File "calculate_gmean.py", line 99, in <module>
         scipy.stats.gmean(df.iloc[:,1:5],axis=1) #calculates gmean rowwise, axis=1 for rowwise
       File "/home/.local/lib/python3.6/site-packages/scipy/stats/stats.py", line 402, in gmean
         log_a = np.log(np.array(a, dtype=dtype))
     TypeError: loop of ufunc does not support argument 0 of type str which has no callable log method
    

Can anyone please suggest the best way to resolve this issue?

Thanks !!


Solution

Problem solved. Actually, the above script works without any issue. Sorry, this question was posted without enough forethought. We cannot delete the question, so it will stay here. Hope the script is useful for someone.

Note that this script will not work if the dataframe contains any columns with strings. After removing those columns, the script works without any problem, generating a last column with the geometric mean for every row.

print(df.shape)

(5, 6)



print(df)


           #Gene  GSM772  GSM773  GSM774  GSM775  GSM776
0  0610007P14Rik    0.003485    0.003415    0.005431    0.003667    0.007146
1  0610009B22Rik    0.001220    0.001351    0.001762    0.001404    0.002177
2  0610009L18Rik    0.000055    0.000009    0.000152    0.000082    0.000179
3  0610009O20Rik    0.006369    0.006830    0.007176    0.006653    0.006907
4  0610010F05Rik    0.008310    0.008329    0.007091    0.006919    0.006915


print(results)

           #Gene  GSM772  GSM773  GSM774  GSM775  GSM776  GeometricMean
0  0610007P14Rik    0.003485    0.003415    0.005431    0.003667    0.007146       0.004424
1  0610009B22Rik    0.001220    0.001351    0.001762    0.001404    0.002177       0.001548
2  0610009L18Rik    0.000055    0.000009    0.000152    0.000082    0.000179       0.000064
3  0610009O20Rik    0.006369    0.006830    0.007176    0.006653    0.006907       0.006782
4  0610010F05Rik    0.008310    0.008329    0.007091    0.006919    0.006915       0.007484
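
For instance, a minimal, self-contained sketch of the numeric-column filtering described above (the two rows are copied from the question's table):

import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    '#Gene':  ['0610007P14Rik', '0610009B22Rik'],
    'GSM772': [0.003485, 0.001220],
    'GSM773': [0.003415, 0.001351],
    'GSM774': [0.005431, 0.001762],
})

# keep only the numeric columns before taking the row-wise geometric mean
numeric = toy.select_dtypes(include='number')
toy['GeometricMean'] = stats.gmean(numeric, axis=1)
print(toy)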


Answered By - TraPS-VarI
Answer Checked By - Mary Flores (PHPFixing Volunteer)

Thursday, October 6, 2022

[FIXED] How do you create a logit-normal distribution in Python?

 October 06, 2022     distribution, python, scipy, statistics     No comments   

Issue

Following this post, I tried to create a logit-normal distribution by creating the LogitNormal class:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit
from scipy.stats import norm, rv_continuous


class LogitNormal(rv_continuous):
    def _pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))


class OtherLogitNormal:
    def pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))


fig, ax = plt.subplots()
values = np.linspace(10e-10, 1-10e-10, 1000)
sigma, mu = 1.78, 0
ax.plot(
    values, LogitNormal().pdf(values, loc=mu, scale=sigma), label='subclassed'
)
ax.plot(
    values, OtherLogitNormal().pdf(values, loc=mu, scale=sigma),
    label='not subclassed'
)
ax.legend()
fig.show()

However, the LogitNormal class does not produce the desired results. When I don't subclass rv_continuous it works. Why is that? I need the subclassing to work because I also need the other methods that come with it like rvs.

Btw, the only reason I am creating my own logit-normal distribution in Python is that the only implementations of that distribution I could find were in the PyMC3 package and the TensorFlow package, both of which are pretty heavy / overkill if you only need them for that one function. I already tried PyMC3, but apparently it doesn't play well with scipy; it always crashed for me. But that's a whole different story.


Solution

Forewords

I came across this problem this week and the only relevant issue I found about it is this post. I have almost the same requirements as the OP:

  • Having a random variable for Logit Normal distribution.

But I also need:

  • To be able to perform statistical tests as well;
  • While being compliant with the scipy random variable interface.

As @Jacques Gaudin pointed out, the interface for rv_continuous (see the distribution architecture for details) does not pass the loc and scale parameters through when you inherit from this class, which is somewhat misleading and unfortunate.

Implementing the __init__ method of course allows creating the missing binding, but the trade-off is that it breaks the pattern scipy currently uses to implement random variables (see this example of a lognormal implementation).

So I took the time to dig into the scipy code and created an MCVE for this distribution. Although it is not totally complete (it mainly lacks moment overrides), it fits the bill for both the OP's and my purposes, while having satisfactory accuracy and performance.

MCVE

An interface compliant implementation of this random variable could be:

import numpy as np
from scipy import stats, special


class logitnorm_gen(stats.rv_continuous):

    def _argcheck(self, m, s):
        return (s > 0.) & (m > -np.inf)
    
    def _pdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).pdf(special.logit(x))/(x*(1-x))
    
    def _cdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).cdf(special.logit(x))
    
    def _rvs(self, m, s, size=None, random_state=None):
        return special.expit(m + s*random_state.standard_normal(size))
    
    def fit(self, data, **kwargs):
        return stats.norm.fit(special.logit(data), **kwargs)

logitnorm = logitnorm_gen(a=0.0, b=1.0, name="logitnorm")

This implementation unlocks most of the potential of scipy random variables.

N = 1000
law = logitnorm(0.24, 1.31)             # Defining a RV
sample = law.rvs(size=N)                # Sampling from RV
params = logitnorm.fit(sample)          # Infer parameters w/ MLE
check = stats.kstest(sample, law.cdf)   # Hypothesis testing
bins = np.arange(0.0, 1.1, 0.1)         # Bin boundaries
expected = np.diff(law.cdf(bins))       # Expected bin counts

As it relies on the scipy normal distribution, we may assume the underlying functions have the same accuracy and performance as the normal random variable object. But it might indeed be subject to floating-point inaccuracy, especially when dealing with highly skewed distributions at the support boundary.

Tests

To check how it performs, we draw some distributions of interest and check them. Let's create some fixtures:

import itertools

def generate_fixtures(
    locs=[-2.0, -1.0, 0.0, 0.5, 1.0, 2.0],
    scales=[0.32, 0.56, 1.00, 1.78, 3.16],
    sizes=[100, 1000, 10000],
    seeds=[789, 123456, 999999]
):
    for (loc, scale, size, seed) in itertools.product(locs, scales, sizes, seeds):
        yield {"parameters": {"loc": loc, "scale": scale}, "size": size, "random_state": seed}

And perform checks on related distributions and samples:

eps = 1e-8
x = np.linspace(0. + eps, 1. - eps, 10000)

for fixture in generate_fixtures():
    
    # Reference:
    parameters = fixture.pop("parameters")
    normal = stats.norm(**parameters)
    sample = special.expit(normal.rvs(**fixture))
    
    # Logit Normal Law:
    law = logitnorm(m=parameters["loc"], s=parameters["scale"])
    check = law.rvs(**fixture)
    
    # Fit:
    p = logitnorm.fit(sample)
    trial = logitnorm(*p)
    resample = trial.rvs(**fixture)
    
    # Hypothetis Tests:
    ks = stats.kstest(check, trial.cdf)
    bins = np.histogram(resample)[1]
    obs = np.diff(trial.cdf(bins))*fixture["size"]
    ref = np.diff(law.cdf(bins))*fixture["size"]
    chi2 = stats.chisquare(obs, ref, ddof=2)

Some adjustments with n=1000, seed=789 (this sample is quite normal) are shown below:

(comparison plots for this fixture omitted)



Answered By - jlandercy
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] How to set significance level for scipy.stats test

 October 06, 2022     data-analysis, data-science, python, scipy, statistics     No comments   

Issue

I'm using scipy.stats.ttest_ind and scipy.stats.wilcoxon to perform the t-test and the Wilcoxon test, but I have no idea how I can set the significance level for them. In the official documentation nothing is mentioned about it, and it seems that none of the arguments of these two methods relate to it. Does anyone know how to set it?

The official documents:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html


Solution

The tests also report the p-value to the user. This p-value can be used to test a hypothesis at the significance level of your choice. We reject the null hypothesis if the p-value is less than the significance level. For example, with a p-value of 0.073 we reject at the significance level of 10% but do not reject at a significance level of 5%.
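
A minimal sketch of that comparison (the sample data and the 5% level are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=30)

alpha = 0.05                              # significance level chosen by you
t_stat, p_value = stats.ttest_ind(a, b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")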



Answered By - Hanno Reuvers
Answer Checked By - Timothy Miller (PHPFixing Admin)

[FIXED] How to plot empirical cdf (ecdf)

 October 06, 2022     matplotlib, numpy, python, scipy, statistics     No comments   

Issue

How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.

One thing I can think of is:

from scipy.stats import cumfreq
a = array([...]) # my array of numbers
num_bins =  20
b = cumfreq(a, num_bins)
plt.plot(b)

Solution

That looks to be (almost) exactly what you want. Two things:

First, the results are a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)

Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.

Here's what it does under the hood:

def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h,l,b,e = histogram(a,numbins,defaultreallimits)
    cumhist = np.cumsum(h*1, axis=0)
    return cumhist,l,b,e

It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the maximum of the ith bin. So the final value is just the size of the initial array.

Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.

Another option is to use numpy.histogram, which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself.

a = array([...]) # your array of numbers
num_bins = 20
# note: newer numpy versions use density=True instead of normed=True
counts, bin_edges = numpy.histogram(a, bins=num_bins, normed=True)
cdf = numpy.cumsum(counts)
pylab.plot(bin_edges[1:], cdf)

(bin_edges[1:] is the upper edge of each bin.)



Answered By - AFoglia
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] how to calculate percentile value of number in dataframe column grouped by index

 October 06, 2022     pandas, python-3.x, scipy, statistics     No comments   

Issue

I have a dataframe like this:

df:
        Score
group
  A      100
  A      34
  A      40
  A      30
  C      24
  C      60
  C      35

For every group in the data, I want to find the percentile value of the score 35 (i.e. the percentile at which 35 falls within the grouped data).

I tried different tricks but none of them worked.

scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
 --> This works, but it doesn't give me the percentile grouped by index

df.groupby('group')['Score'].percentileofscore()
 --> 'SeriesGroupBy' object has no attribute 'percentileofscore'

scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
 --> TypeError: '<' not supported between instances of 'str' and 'int'

My ideal output looks like this:

df:
        Score Percentile
group 
  A       50
  C       33

Can anyone suggest to me what works well here?


Solution

Inverse quantile function for a sequence at point X is the proportion of values less than X in the sequence, right? So:

In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A    50.000000
C    33.333333
Name: Score, dtype: float64
  • get a True/False Series of whether < 35 or not on "Score"
  • group this Series over "group"
  • take the mean
    • since True == 1 and False == 0, it will effectively give the proportion!
  • multiply by 100 to get percentages


Answered By - Mustafa Aydın
Answer Checked By - Robin (PHPFixing Admin)

[FIXED] How to estimate confidence-intervals beyond the current simulated step, based on existing data for 1,000,000 monte-carlo simulations?

 October 06, 2022     curve-fitting, numpy, python, scipy, statistics     No comments   

Issue

Situation:

  1. I have a program which generates 600 random numbers per "step".
  2. These 600 numbers are fed into a complicated algorithm, which then outputs a single value (which can be positive or negative) for that "step"; let's call this Value-X for that step.
  3. This Value-X is then added to a Global-Value-Y, making the latter a running sum of each step in the series.
  4. I have essentially run this simulation 1,000,000 times, recording the values of Global-Value-Y at each step in those simulations.
  5. I have "calculated confidence intervals" from those one-million simulations, by sorting the simulations by (the absolute value of) their Global-Value-Y at each column, and finding the 90th percentile, the 99th percentile, etc.

What I want to do:

  • Using the pool of simulation results, "extrapolate" from that to find some equation that will estimate the confidence intervals for results from the used algorithm, many "steps" into the future, without having to extend the runs of those one-million simulations further. (it would take too long to keep running those one-million simulations indefinitely)

Note that the results do not have to be terribly precise at this point; the results are mainly used atm as a visual indicator on the graph, for the user to get an idea of how "normal" the current simulation's results are relative to the confidence-intervals extrapolated from the historical data of the one-million simulations.

Anyway, I've already made some attempts at finding an "estimated curve-fit" of the confidence-intervals from the historical data (ie. those based on the one-million simulations), but the results are not quite precise enough.

Here are the key parts from the curve-fitting Python code I've tried: (link to full code here)

# curve fit functions

def func_linear(t, a, b):
    return a*t +b

def func_quadratic(t, a, b, c):
    return a*pow(t,2) + b*t +c

def func_cubic(t, a, b, c, d):
    return a*pow(t,3) + b*pow(t,2) + c*t + d

def func_biquadratic(t, a, b, c, d, e):
    return a*pow(t,4) + b*pow(t,3) + c*pow(t,2) + d*t + e

[...]

import numpy as np 
import matplotlib.pyplot as plt 
from scipy.optimize import curve_fit

# calling the read function on the recorded percentile/confidence-intervals data
xVals,yVals = read_file()

# using inbuilt function for linear fit
popt, pcov = curve_fit(func_linear, xVals, yVals)
fit_linear = func_linear(np.array(xVals), *popt)

[same here for the other curve-fit functions]

[...]

plt.rcParams["figure.figsize"] = (40,20)

# plotting the respective curve fits 
plt.plot(xVals, yVals, color="blue", linewidth=3)
plt.plot(xVals, fit_linear, color="red", linewidth=2)
plt.plot(xVals, fit_quadratic, color="green", linewidth=2)
plt.plot(xVals, fit_cubic, color="orange", linewidth=2)
#plt.plot(xVals, fit_biquadratic, color="black", linewidth=2) # extremely off
plt.legend(['Actual data','linear','quadratic','cubic','biquadratic'])
plt.xlabel('Session Column')
plt.ylabel('y-value for CI')
plt.title('Curve fitting')
plt.show()  

And here are the results: (with the purple "actual data" being the 99.9th percentile of the Global-Value-Ys at each step, from the one-million recorded simulations)

While it seems to be attempting to estimate a curve-fit (over the graphed ~630 steps), the results are not quite accurate enough for my purposes. The imprecision is particularly noticeable at the first ~5% of the graph, where the estimate-curve is far too high. (though practically, my issues are more on the other end, as the CI curve-fit keeps getting less accurate the farther out from the historical data it goes)

EDIT: As requested by a commenter, here is a GitHub gist of the Python code I'm using to attempt the curve-fit (and it includes the data used): https://gist.github.com/Venryx/74a44ed25d5c4dc7768d13e22b36c8a4

So my questions:

  1. Is there something wrong with my usage of scypy's curve_fit function?
  2. Or is the curve_fit function too basic of a tool to get meaningful estimates/extrapolations of confidence-intervals for random-walk data like this?
  3. If so, is there some alternative that works better for estimating/extrapolating confidence-intervals from random-walk data of this sort?

Solution

If I understand your question correctly, you have a lot of Monte Carlo simulations gathered as (x, y) points and you want to find a model that reasonably fits them.

Importing your data to create a MCVE:

import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

data = np.array([
    [0,77],[1,101],[2,121],[3,138],[4,151],[5,165],[6,178],[7,189],[8,200],[9,210],[10,221],[11,229],[12,238],[13,247],[14,254],[15,264],[16,271],[17,278],[18,285],[19,291],[20,299],[21,305],[22,312],[23,318],[24,326],[25,331],[26,338],[27,344],[28,350],[29,356],[30,362],[31,365],[32,371],[33,376],[34,383],[35,387],[36,393],[37,399],[38,404],[39,409],[40,414],[41,419],[42,425],[43,430],[44,435],[45,439],[46,444],[47,447],[48,453],[49,457],[50,461],[51,467],[52,472],[53,476],[54,480],[55,483],[56,488],[57,491],[58,495],[59,499],[60,504],[61,508],[62,512],[63,516],[64,521],[65,525],[66,528],[67,532],[68,536],[69,540],[70,544],[71,547],[72,551],[73,554],[74,560],[75,563],[76,567],[77,571],[78,574],[79,577],[80,582],[81,585],[82,588],[83,591],[84,595],[85,600],[86,603],[87,605],[88,610],[89,613],[90,617],[91,621],[92,624],[93,627],[94,630],[95,632],[96,636],[97,638],[98,642],[99,645],[100,649],[101,653],[102,656],[103,660],[104,664],[105,667],[106,670],[107,673],[108,674],[109,679],[110,681],[111,684],[112,687],[113,689],[114,692],[115,697],[116,698],[117,701],[118,705],[119,708],[120,710],[121,712],[122,716],[123,718],[124,722],[125,725],[126,728],[127,730],[128,732],[129,735],[130,739],[131,742],[132,743],[133,747],[134,751],[135,753],[136,754],[137,757],[138,760],[139,762],[140,765],[141,768],[142,769],[143,774],[144,775],[145,778],[146,782],[147,784],[148,788],[149,790],[150,793],[151,795],[152,799],[153,801],[154,804],[155,808],[156,811],[157,812],[158,814],[159,816],[160,819],[161,820],[162,824],[163,825],[164,828],[165,830],[166,832],[167,834],[168,836],[169,839],[170,843],[171,845],[172,847],[173,850],[174,853],[175,856],[176,858],[177,859],[178,863],[179,865],[180,869],[181,871],[182,873],[183,875],[184,878],[185,880],[186,883],[187,884],[188,886],[189,887],[190,892],[191,894],[192,895],[193,898],[194,900],[195,903],[196,903],[197,905],[198,907],[199,910],[200,911],[201,914],[202,919],[203,921],[204,922],[205,926],[206,927],[207,928],[208,931],[209,933],[210,935],[211,940],[212,942],[213,944],[214,943],[215,948],[216,950],[217,954],[218,955],[219,957],[220,959],[221,963],[222,965],[223,967],[224,969],[225,970],[226,971],[227,973],[228,975],[229,979],[230,980],[231,982],[232,983],[233,986],[234,988],[235,990],[236,992],[237,993],[238,996],[239,998],[240,1001],[241,1003],[242,1007],[243,1007],[244,1011],[245,1012],[246,1013],[247,1016],[248,1019],[249,1019],[250,1020],[251,1024],[252,1027],[253,1029],[254,1028],[255,1031],[256,1033],[257,1035],[258,1038],[259,1040],[260,1041],[261,1046],[262,1046],[263,1046],[264,1048],[265,1052],[266,1053],[267,1055],[268,1056],[269,1057],[270,1059],[271,1061],[272,1064],[273,1067],[274,1065],[275,1068],[276,1071],[277,1073],[278,1074],[279,1075],[280,1080],[281,1081],[282,1083],[283,1084],[284,1085],[285,1086],[286,1088],[287,1090],[288,1092],[289,1095],[290,1097],[291,1100],[292,1100],[293,1102],[294,1104],[295,1107],[296,1109],[297,1110],[298,1113],[299,1114],[300,1112],[301,1116],[302,1118],[303,1120],[304,1121],[305,1124],[306,1124],[307,1126],[308,1126],[309,1130],[310,1131],[311,1131],[312,1135],[313,1137],[314,1138],[315,1141],[316,1145],[317,1147],[318,1147],[319,1148],[320,1152],[321,1151],[322,1152],[323,1155],[324,1157],[325,1158],[326,1161],[327,1161],[328,1163],[329,1164],[330,1167],[331,1169],[332,1172],[333,1175],[334,1177],[335,1177],[336,1179],[337,1181],[338,1180],[339,1184],[340,1186],[341,1186],[342,1188],[343,1190],[344,1193],[345,1195],[346,1197],[347,1198],[348,1198],[349,1200],[350,1203],[351,1204],[352,1206],[353,1207],[354,120
9],[355,1210],[356,1210],[357,1214],[358,1215],[359,1215],[360,1219],[361,1221],[362,1222],[363,1223],[364,1224],[365,1225],[366,1228],[367,1232],[368,1233],[369,1236],[370,1237],[371,1239],[372,1239],[373,1243],[374,1244],[375,1244],[376,1244],[377,1247],[378,1249],[379,1251],[380,1251],[381,1254],[382,1256],[383,1260],[384,1259],[385,1260],[386,1263],[387,1264],[388,1265],[389,1267],[390,1271],[391,1271],[392,1273],[393,1274],[394,1277],[395,1278],[396,1279],[397,1281],[398,1285],[399,1286],[400,1289],[401,1288],[402,1290],[403,1290],[404,1291],[405,1292],[406,1295],[407,1297],[408,1298],[409,1301],[410,1300],[411,1301],[412,1303],[413,1305],[414,1307],[415,1311],[416,1312],[417,1313],[418,1314],[419,1316],[420,1317],[421,1316],[422,1319],[423,1319],[424,1321],[425,1322],[426,1323],[427,1325],[428,1326],[429,1329],[430,1328],[431,1330],[432,1334],[433,1335],[434,1338],[435,1340],[436,1342],[437,1342],[438,1344],[439,1346],[440,1347],[441,1347],[442,1349],[443,1351],[444,1352],[445,1355],[446,1358],[447,1356],[448,1359],[449,1362],[450,1362],[451,1366],[452,1365],[453,1367],[454,1368],[455,1368],[456,1368],[457,1371],[458,1371],[459,1374],[460,1374],[461,1377],[462,1379],[463,1382],[464,1384],[465,1387],[466,1388],[467,1386],[468,1390],[469,1391],[470,1396],[471,1395],[472,1396],[473,1399],[474,1400],[475,1403],[476,1403],[477,1406],[478,1406],[479,1412],[480,1409],[481,1410],[482,1413],[483,1413],[484,1418],[485,1418],[486,1422],[487,1422],[488,1423],[489,1424],[490,1426],[491,1426],[492,1430],[493,1430],[494,1431],[495,1433],[496,1435],[497,1437],[498,1439],[499,1440],[500,1442],[501,1443],[502,1442],[503,1444],[504,1447],[505,1448],[506,1448],[507,1450],[508,1454],[509,1455],[510,1456],[511,1460],[512,1459],[513,1460],[514,1464],[515,1464],[516,1466],[517,1467],[518,1469],[519,1470],[520,1471],[521,1475],[522,1477],[523,1476],[524,1478],[525,1480],[526,1480],[527,1479],[528,1480],[529,1483],[530,1484],[531,1485],[532,1486],[533,1487],[534,1489],[535,1489],[536,1489],[537,1492],[538,1492],[539,1494],[540,1493],[541,1494],[542,1496],[543,1497],[544,1499],[545,1500],[546,1501],[547,1504],[548,1506],[549,1508],[550,1507],[551,1509],[552,1510],[553,1510],[554,1512],[555,1513],[556,1516],[557,1519],[558,1520],[559,1520],[560,1522],[561,1525],[562,1527],[563,1530],[564,1531],[565,1533],[566,1533],[567,1534],[568,1538],[569,1539],[570,1538],[571,1540],[572,1541],[573,1543],[574,1545],[575,1545],[576,1547],[577,1549],[578,1550],[579,1554],[580,1554],[581,1557],[582,1559],[583,1564],[584,1565],[585,1567],[586,1567],[587,1568],[588,1570],[589,1571],[590,1569],[591,1572],[592,1572],[593,1574],[594,1574],[595,1574],[596,1579],[597,1578],[598,1579],[599,1582],[600,1583],[601,1583],[602,1586],[603,1585],[604,1588],[605,1590],[606,1592],[607,1596],[608,1595],[609,1596],[610,1598],[611,1601],[612,1603],[613,1604],[614,1605],[615,1605],[616,1608],[617,1611],[618,1613],[619,1615],[620,1613],[621,1616],[622,1617],[623,1617],[624,1619],[625,1617],[626,1618],[627,1619],[628,1618],[629,1621],[630,1624],[631,1626],[632,1626],[633,1631]
])

Plotting them shows a curve that seems to have square-root and linear behavior at the beginning and the end of the dataset, respectively. So let's try this simple model first:

def model(x, a, b, c):
    return a*np.sqrt(x) + b*x + c

Notice that, formulated like this, it is a linear least-squares (LLS) problem, which is a good starting point for solving it. The optimization with curve_fit works as expected:

popt, pcov = optimize.curve_fit(model, data[:,0],  data[:,1])

# [61.20233162  0.08897784 27.76102519]

# [[ 1.51146696e-02 -4.81216428e-04 -1.01108383e-01]
#  [-4.81216428e-04  1.59734405e-05  3.01250722e-03]
#  [-1.01108383e-01  3.01250722e-03  7.63590271e-01]]

It returns a pretty decent fit. Graphically it looks like this:

(plot of the data with the fitted model omitted)

Of course this is just a fit to an arbitrary model chosen based on personal experience (to me it looks like a specific heterogeneous reaction kinetics).

If you have a theoretical reason to accept or reject this model then you should use it to discriminate. I would also investigate units of parameters to check if they have significant meanings.

But in any case this is out of the scope of the Stack Overflow community, which is oriented toward solving programming issues, not scientific validation of models (see Cross Validated or Math Overflow if you want to dig deeper). If you do, draw my attention to it; I would be glad to follow this question in more detail.



Answered By - jlandercy
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

[FIXED] How do I get a lognormal distribution in Python with Mu and Sigma?

 October 06, 2022     python, scipy, statistics     No comments   

Issue

I have been trying to get the result of a lognormal distribution using SciPy. I already have the mu and sigma, so I don't need to do any other prep work. If I need to be more specific (and I am trying to be with my limited knowledge of stats), I would say that I am looking for the cumulative function (cdf under SciPy). The problem is that I can't figure out how to do this with just the mean and standard deviation on a scale of 0-1 (i.e. the answer returned should be something from 0-1). I'm also not sure which method from dist I should be using to get the answer. I've tried reading the documentation and looking through SO, but the relevant questions (like this and this) didn't seem to provide the answers I was looking for.

Here is a code sample of what I am working with. Thanks.

from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
total = 37
dist = lognorm.cdf(total,mean,stddev)

UPDATE:

So after a bit of work and a little research, I got a little further. But I am still getting the wrong answer. The new code is below. According to R and Excel, the result should be 0.7434, but that's clearly not what is happening. Is there a logic flaw I am missing?

dist = lognorm([1.744],loc=2.0785)
dist.cdf(25)  # yields=0.96374596, expected=0.7434

UPDATE 2: Working lognorm implementation which yields the correct 0.7434 result.

import math

def lognorm(x, mu=0, sigma=1):
    a = (math.log(x) - mu) / math.sqrt(2 * sigma**2)
    p = 0.5 + 0.5 * math.erf(a)
    return p

lognorm(25, 1.744, 2.0785)
> 0.7434

Solution

It sounds like you want to instantiate a "frozen" distribution from known parameters. In your example, you could do something like:

from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
dist=lognorm([stddev],loc=mean)

which will give you a lognorm distribution object with the mean and standard deviation you specify. You can then get the pdf or cdf like this:

import numpy as np
import pylab as pl
x=np.linspace(0,6,200)
pl.plot(x,dist.pdf(x))
pl.plot(x,dist.cdf(x))

(plot of the lognorm pdf and cdf omitted)

Is this what you had in mind?



Answered By - talonmies
Answer Checked By - Marie Seifert (PHPFixing Admin)