PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Saturday, October 8, 2022

[FIXED] How to properly use Kolmogorov Smirnoff test in SciPy?

 October 08, 2022     python, scipy, statistics     No comments   

Issue

I have a distribution

enter image description here

This one looks pretty gaussian, and we also can't reject the idea with such a high p-value from the KS test.

BUT, the test distribution is actually also a generated one with a finite sample size and not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth gaussian function.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test

data = np.concatenate((d1,d2))

xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))

# lets try the normal distribution first
m, s = stats.norm.fit(data)         # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color = "lightgrey", normed = True, bins = 50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it


# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))

plt.show()

But I can't figure out how to use the default KS test for, that is

KS_D, KS_p = stats.kstest(data, "norm"),

as it always returns a p-value of 0, i.e. my gaussian data must be in the wrong format.

How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?


Solution

"norm" uses a normal distribution that defaults to be zero-mean, with standard deviation 1 [ref]. Your data have values m and s for that, which are quite different. It is telling you they are very different from this standard reference distribution.

You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:

data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")


Answered By - surtur
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing