PHPFixing
Showing posts with label statistics.

Sunday, October 9, 2022

[FIXED] How to specify a hierarchical GAM (HGAM) model with two categorical & a continuous predictor using s(y1, by=y2:y3) in mgcv?

 October 09, 2022     gam, mgcv, r, statistics     No comments   

Issue

Starting with dummy data for the question:

zone <- c(rep("Z1",1000),rep("Z2",1000),rep("Z3",1000),rep("Z4",1000))
scenario <- rep(c(rep("S1",250),rep("S2",250),rep("S3",250),rep("S4",250)),4)
model <- rep(rep(c(rep("m1",50),rep("m2",50),rep("m3",50),rep("m4",50),rep("m5",50)),4),4)
time <- rep(seq(2021,2070), 80)
response <- rep(c(rep(rnbinom(50, mu =exp(4), size = 50), 5), 
        rep(rnbinom(50, mu =exp(4.5), size = 50), 5), 
        rep(rnbinom(50, mu =exp(5), size = 50), 5), 
        rep(rnbinom(50, mu =exp(6), size = 50), 5)),4)

df <- cbind(zone, scenario, model, time, response)
df <- as.data.frame(df)
df$zone <- as.factor(df$zone)
df$scenario <- as.factor(df$scenario)
df$model <- as.factor(df$model)
df$time <- as.numeric(df$time)
df$response <- as.numeric(df$response)

I'm trying to fit a fairly simple GAM model in mgcv in R. I'm interested in how a response variable might vary through time (predictions between 2021 - 2070). I expect that the response will be non-linear, so I first fit a smoother over time:

library(mgcv)

m1 <- response ~ s(time)

In this case, there are four models (m1, m2, m3, m4) that each predict how the response may vary into the future under four different "scenarios" of increasing intensity (S1, S2, S3, S4). I'm not explicitly interested in how the models vary (I expect them to), but instead I'm interested in how the scenarios (from all four model predictions) may vary into the future. To account for this structure of the data (i.e. that each scenario contains four model predictions), I've included "model" as a random effect:

m2 <- gam(response~s(time, k=-1) + scenario + s(model, bs='re'), data=df)

Given that the non-linear response between "response" and "time" is likely to vary among scenarios, I've fitted a model with separate smoothers for each level of scenario using the "by" function:

m3 <- gam(response~s(time, by=scenario, k=-1) + scenario + s(model, bs='re'), data=df)

To add to the complexity, the area that the predictions are modeled from is split into four separate zones, so an additive term of "zone" is included in the model:

m4 <- gam(response~s(time, by=scenario, k=-1) + scenario + zone + s(model, bs='re'), data=df)

My question is: I've accounted for the expectation that the "wigglyness" of the response through time may vary across scenarios in m3 by including s(time, by=scenario, k=-1), but... how do I expand that model to account for the potential for the response to vary not only by scenario but also by zone? I envisioned this as something like:

m5 <- gam(response~s(time, by=scenario:zone, k=-1) + scenario + zone + s(model, bs='re'), data=df)

But this isn't the correct syntax. After reading the honestly excellent paper "Hierarchical generalized additive models in ecology: an introduction with mgcv" in PeerJ several times now, I'm still struggling to understand the leap between the nonlinear functional relationship in a simple x ~ s(y) GAM and how the shape of that function varies across not one, but two different grouping levels.

As far as I can tell I'd need the "A single common smoother plus group-level smoothers with differing wiggliness (Model GI)" approach here, to allow each group-specific smoother (here "scenario" and "zone") to have its own smoothing parameter and hence its own level of wiggliness, but I'm struggling to see how this model would fit into the HGAM approach. How can I adapt m5 to fit within the HGAM framework?


Solution

The factor by variable really just codes for the rows in the input data that should be kept together and get their own smooth.

So your intuition is right, that you need the interaction of scenario and zone, but in practice you can't do it via a formula operation inside s() or te() etc.

What you need to do is create the interaction first in your data and then pass that to the by argument:

df <- transform(df, zone_in_scenario = interaction(zone, scenario, drop = TRUE))

Then you can use by = zone_in_scenario in your smooth.

You also need to adjust the factor fixed effects to zone * scenario, as you need group means for each combination of these two factors; this will give you a group mean for each of the smooths that will now be fitted. A sketch of the full model call follows below.
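
Putting the pieces together, a minimal sketch of the adjusted model call, assuming the dummy data df from the question (the transform() step simply repeats the one above so the snippet is self-contained):

library(mgcv)

# interaction factor coding each zone-by-scenario combination
df <- transform(df, zone_in_scenario = interaction(zone, scenario, drop = TRUE))

# one smooth of time per zone-by-scenario combination, group means for each
# combination via zone * scenario, and model kept as a random effect
m5 <- gam(response ~ s(time, by = zone_in_scenario, k = -1) +
            zone * scenario +
            s(model, bs = "re"),
          data = df)
summary(m5)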



Answered By - Gavin Simpson
Answer Checked By - Cary Denson (PHPFixing Admin)

[FIXED] How to implement Poisson Regression?

 October 09, 2022     python, regression, statistics     No comments   

Issue

There are 2 types of Generalized Linear Models:
1. Log-Linear Regression, also known as Poisson Regression
2. Logistic Regression

How to implement the Poisson Regression in Python for Price Elasticity prediction?


Solution

Have a look at the statsmodels package in Python.

Here is an example

A bit more input, to avoid a link-only answer.

Assuming you know Python, here is an extract of the example I mentioned earlier.

import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import (Exchangeable,
    Independence,Autoregressive)
from statsmodels.genmod.families import Poisson

pandas will hold the data frame with the data you want to feed your Poisson model. The statsmodels package contains a large family of statistical models such as linear, probit, Poisson, etc.; from here you import the Poisson family model (hint: see the last import above).

The way you fit your model is as follows (assuming your dependent variable is called y and your independent variables are age, trt and base):

fam = Poisson()
ind = Independence()
model1 = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
result1 = model1.fit()
print(result1.summary())
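
If your data do not have the grouped/repeated-measures structure that GEE is designed for, a plain Poisson GLM is an even simpler starting point. A minimal sketch with hypothetical column names (price as the predictor, sales as the count response; the data here are made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical demand data: counts of units sold at different price points
data = pd.DataFrame({"price": [1.0, 1.5, 2.0, 2.5, 3.0],
                     "sales": [20, 14, 9, 7, 4]})

# Poisson GLM with the default log link; using log(price) as the regressor makes
# the fitted coefficient directly interpretable as a price elasticity
poisson_fit = smf.glm("sales ~ np.log(price)", data=data,
                      family=sm.families.Poisson()).fit()
print(poisson_fit.summary())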

As I am not familiar with the nature of your problem, I would suggest having a look at negative binomial regression if your count data is strongly overdispersed; with high overdispersion your Poisson assumptions may not hold.

There is a plethora of info on Poisson regression in R as well - just google it.

Hope this answer helps.



Answered By - Altons
Answer Checked By - Terry (PHPFixing Volunteer)

[FIXED] How can I reproduce the statsmodels ARIMA filter?

 October 09, 2022     arima, python, statistics, statsmodels, time-series     No comments   

Issue

I am attempting to reproduce the filters used in an ARIMA model using statsmodels' recursive_filter and convolution_filter. (My end objective is to use these filters for prewhitening an exogenous series.)

I begin by working with just an AR model and recursive filter. Here is the simplified experimental setup:

import numpy as np
import statsmodels.api as sm

np.random.seed(42)

# sample data
series = sm.tsa.arima_process.arma_generate_sample(ar=(1,-0.2,-0.5), ma=(1,), nsample=100)

model = sm.tsa.arima.model.ARIMA(series, order=(2,0,0)).fit()
print(model.summary())

Which gracefully produces the following, which seems fair enough:

                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                 ARIMA(2, 0, 0)   Log Likelihood                -131.991
Date:                Wed, 07 Apr 2021   AIC                            271.982
Time:                        12:58:39   BIC                            282.403
Sample:                             0   HQIC                           276.200
                                - 100                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3136      0.266     -1.179      0.238      -0.835       0.208
ar.L1          0.2135      0.084      2.550      0.011       0.049       0.378
ar.L2          0.4467      0.101      4.427      0.000       0.249       0.645
sigma2         0.8154      0.126      6.482      0.000       0.569       1.062
===================================================================================
Ljung-Box (L1) (Q):                   0.10   Jarque-Bera (JB):                 0.53
Prob(Q):                              0.75   Prob(JB):                         0.77
Heteroskedasticity (H):               0.98   Skew:                            -0.16
Prob(H) (two-sided):                  0.96   Kurtosis:                         2.85
===================================================================================

I fit an AR(2) and obtain coefficients for lag 1 and 2 based on the SARIMAX results. My intuition for reproducing this model using statsmodels.tsa.filters.filtertools.recursive_filter is like so:

filtered = sm.tsa.filters.filtertools.recursive_filter(series, ar_coeff=(-0.2135, -0.4467))

(And maybe adding in the constant from the regression results as well). However, a direct comparison of the results shows the recursive filter doesn't replicate the AR model:

import matplotlib.pyplot as plt

# ARIMA residuals
plt.plot(model.resid)

# Calculated residuals using recursive filter outcome
plt.plot(filtered)

Am I approaching this incorrectly? Should I be using a different filter function? The next step for me is to perform the same task but on an MA model so that I can add (?) the results together to obtain a full ARMA filter for prewhitening.

Note: this question may be valuable to somebody searching for "how can I prewhiten timeseries data?" particularly in Python using statsmodels.


Solution

I guess you should use convolution_filter for the AR part and recursive_filter for the MA part. Combining these sequentially will work for ARMA models. Alternatively, you can use arma_innovations for an exact approach that works with both AR and MA parts simultaneously. Here are some examples:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.innovations import arma_innovations

AR(2)

np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1, -0.2, -0.5), ma=(1,), nsample=100)

res = sm.tsa.arima.ARIMA(series, order=(2, 0, 0), trend='n').fit()
print(pd.DataFrame({
    'ARIMA resid': res.resid,
    'arma_innovations': arma_innovations.arma_innovations(
        series, ar_params=res.params[:-1])[0],
    'convolution filter': sm.tsa.filters.convolution_filter(
        series, np.r_[1, -res.params[:-1]], nsides=1)}))

gives:

    ARIMA resid  arma_innovations  convolution filter
0      0.496714          0.496714                 NaN
1     -0.254235         -0.254235                 NaN
2      0.666326          0.666326            0.666326
3      1.493315          1.493315            1.493315
4     -0.256708         -0.256708           -0.256708
..          ...               ...                 ...
95    -1.438670         -1.438670           -1.438670
96     0.323470          0.323470            0.323470
97     0.218243          0.218243            0.218243
98     0.012264          0.012264            0.012264
99    -0.245584         -0.245584           -0.245584

MA(1)

np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1,), ma=(1, 0.2), nsample=100)

res = sm.tsa.arima.ARIMA(series, order=(0, 0, 1), trend='n').fit()
print(pd.DataFrame({
    'ARIMA resid': res.resid,
    'arma_innovations': arma_innovations.arma_innovations(
        series, ma_params=res.params[:-1])[0],
    'recursive filter': sm.tsa.filters.recursive_filter(series, -res.params[:-1])}))

gives:

    ARIMA resid  arma_innovations    recursive filter
0      0.496714          0.496714            0.496714
1     -0.132893         -0.132893           -0.136521
2      0.646110          0.646110            0.646861
3      1.525620          1.525620            1.525466
4     -0.229316         -0.229316           -0.229286
..          ...               ...                 ...
95    -1.464786         -1.464786           -1.464786
96     0.291233          0.291233            0.291233
97     0.263055          0.263055            0.263055
98     0.005637          0.005637            0.005637
99    -0.234672         -0.234672           -0.234672

ARMA(1, 1)

np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1, 0.5), ma=(1, 0.2), nsample=100)

res = sm.tsa.arima.ARIMA(series, order=(1, 0, 1), trend='n').fit()
a = res.resid

# Apply the recursive then convolution filter
tmp = sm.tsa.filters.recursive_filter(series, -res.params[1:2])
filtered = sm.tsa.filters.convolution_filter(tmp, np.r_[1, -res.params[:1]], nsides=1)

print(pd.DataFrame({
    'ARIMA resid': res.resid,
    'arma_innovations': arma_innovations.arma_innovations(
        series, ar_params=res.params[:1], ma_params=res.params[1:2])[0],
    'combined filters': filtered}))

gives:

    ARIMA resid  arma_innovations    combined filters
0      0.496714          0.496714                 NaN
1     -0.134253         -0.134253           -0.136915
2      0.668094          0.668094            0.668246
3      1.507288          1.507288            1.507279
4     -0.193560         -0.193560           -0.193559
..          ...               ...                 ...
95    -1.448784         -1.448784           -1.448784
96     0.268421          0.268421            0.268421
97     0.212966          0.212966            0.212966
98     0.046281          0.046281            0.046281
99    -0.244725         -0.244725           -0.244725

SARIMA(1, 0, 1)x(1, 0, 0, 3)

The seasonal model is a little more complicated, because it requires multiplying lag polynomials. For additional details, see this example notebook from the Statsmodels documentation.

np.random.seed(42)
ar_poly = [1, -0.5]
sar_poly = [1, 0, 0, -0.1]
ar = np.polymul(ar_poly, sar_poly)
series = sm.tsa.arma_generate_sample(ar=ar, ma=(1, 0.2), nsample=100)

res = sm.tsa.arima.ARIMA(series, order=(1, 0, 1), seasonal_order=(1, 0, 0, 3), trend='n').fit()
a = res.resid

# Apply the recursive then convolution filter
tmp = sm.tsa.filters.recursive_filter(series, -res.polynomial_reduced_ma[1:])
filtered = sm.tsa.filters.convolution_filter(tmp, res.polynomial_reduced_ar, nsides=1)

print(pd.DataFrame({
    'ARIMA resid': res.resid,
    'arma_innovations': arma_innovations.arma_innovations(
        series, ar_params=-res.polynomial_reduced_ar[1:],
        ma_params=res.polynomial_reduced_ma[1:])[0],
    'combined filters': filtered}))

gives:

    ARIMA resid  arma_innovations combined filters
0      0.496714          0.496714              NaN
1     -0.100303         -0.100303              NaN
2      0.625066          0.625066              NaN
3      1.557418          1.557418              NaN
4     -0.209256         -0.209256        -0.205201
..          ...               ...              ...
95    -1.476702         -1.476702        -1.476702
96     0.269118          0.269118         0.269118
97     0.230697          0.230697         0.230697
98    -0.004561         -0.004561        -0.004561
99    -0.233466         -0.233466        -0.233466


Answered By - cfulton
Answer Checked By - Clifford M. (PHPFixing Volunteer)

[FIXED] How to get a list of probabilities after creating a histogram for continuous data (Python)?

 October 09, 2022     histogram, probability, python, statistics     No comments   

Issue

I have the data set below (Data) and I create a histogram using the code below to extract n (number of points in each bin or frequency). Then I calculate the probability of each of the bins by dividing frequency by total number of points to get the respective probability of each bin (bin_probability).

Now I want to get the probability for each point in a list. For example, say point 1 is in bin 1; its probability is therefore the first value in the array, 0.65. Point 2 is in bin 5, so its probability is 0.05, etc. How do I map each point to its respective bin_probability so that I have a list of probabilities for each point (in this case 20 probabilities)?

Data = [4.33, 4.11, 6.33, 5.67, 3.24, 6.74, 24.6, 6.43, 4.122, 9.67, 9.99, 3.44, 5.66, 3.54, 5.34, 6.55, 5.78, 3.56, 1.55, 5.45]

n, bin_edges = np.histogram(Data, bins = 10)
totalcount = np.sum(n)
bin_probability = n / totalcount
print(bin_probability)
>> array([0.65, 0.3 , 0.  , 0.  , 0.05])

Many thanks for your help!


Solution

Based on @kcsquared's link above, a list can be made with the respective bin locations for each point. The variable 'bins_per_point' is an array of 20 elements; each element corresponds to the bin that the data point falls in. Next, the 'probability_perpoint' variable looks up the corresponding bin_probability for each point.

bins_per_point = np.fmin(np.digitize(Data, bin_edges), len(bin_edges)-1)
probability_perpoint = [bin_probability[bins_per_point[i]-1] for i in range(len(Data))] 
>> array([0.1 , 0.1 , 0.15, 0.1 , 0.05, 0.15, 0.55, 0.15, 0.1 , 0.2 , 0.2 ,
       0.05, 0.1 , 0.05, 0.1 , 0.15, 0.1 , 0.05, 0.05, 0.1 ])

To verify, the bin probabilities sum to 1.

 np.sum(bin_probability) 
>> 1


Answered By - sarah
Answer Checked By - Marie Seifert (PHPFixing Admin)

[FIXED] What is the solution for this ggplot2 error in the ggiNEXT function: "Error in levels<-..."?

 October 09, 2022     ggplot2, github, graphics, r, statistics     No comments   

Issue

I'm running some examples from the iNEXT package in RStudio and having trouble with the graphical example. To help possible suggestions, below are some initial steps that run without any major problems, followed by the command line that generates the error:

#packages and dependencies

>install.packages("iNEXT")
>install.packages('devtools')
>library(devtools)
>install_github('JohnsonHsieh/iNEXT')
>library(iNEXT)
>library(ggplot2)

#example dataset

>data(spider)

#last line that runs properly before the error

>out <- iNEXT(spider, q=c(0, 1, 2), datatype="abundance", endpoint=500)

#here comes to the problematic line and its error

>ggiNEXT(out, type=1, facet.var="site")

Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  : factor level [2] is duplicated

Any idea how to solve this?


Solution

I solved it by adding the unique function inside the ggiNEXT.iNEXT function. Below are the original command line and the same line altered:

Original command line:

z$lty <- factor(z$method, c("interpolated", "observed", "extrapolated"),
c("interpolation", "interpolation", "extrapolation"))

Changed command line:

z$lty <- factor(z$method, levels=unique(c("interpolated", "observed", "extrapolated"),
c("interpolation", "interpolation", "extrapolation")))

After the change, I could make the graph exactly as the example from https://cran.r-project.org/web/packages/iNEXT/vignettes/Introduction.html

Just for the record, I used traceback() to find that the problem was inside ggiNEXT.iNEXT. Then I found the command indicated by the error message, and used the unique function based on a similar problem from this post: ggplot: order of factors with duplicate levels

To run properly and construct the graphs, paste the entire altered function into the R console:

ggiNEXT.iNEXT<-function (x, type = 1, se = TRUE, facet.var = "none", color.var = "site", 
    grey = FALSE) 
{
    TYPE <- c(1, 2, 3)
    SPLIT <- c("none", "order", "site", "both")
    if (is.na(pmatch(type, TYPE)) | pmatch(type, TYPE) == -1) 
        stop("invalid plot type")
    if (is.na(pmatch(facet.var, SPLIT)) | pmatch(facet.var, SPLIT) == 
        -1) 
        stop("invalid facet variable")
    if (is.na(pmatch(color.var, SPLIT)) | pmatch(color.var, SPLIT) == 
        -1) 
        stop("invalid color variable")
    type <- pmatch(type, 1:3)
    facet.var <- match.arg(facet.var, SPLIT)
    color.var <- match.arg(color.var, SPLIT)
    if (facet.var == "order") 
        color.var <- "site"
    if (facet.var == "site") 
        color.var <- "order"
    options(warn = -1)
    z <- fortify(x, type = type)
    options(warn = 0)
    if (ncol(z) == 7) {
        se <- FALSE
    }
    datatype <- unique(z$datatype)
    if (color.var == "none") {
        if (levels(factor(z$order)) > 1 & "site" %in% names(z)) {
            warning("invalid color.var setting, the iNEXT object consists multiple sites and orders, change setting as both")
            color.var <- "both"
            z$col <- z$shape <- paste(z$site, z$order, sep = "-")
        }
        else if ("site" %in% names(z)) {
            warning("invalid color.var setting, the iNEXT object consists multiple orders, change setting as order")
            color.var <- "site"
            z$col <- z$shape <- z$site
        }
        else if (levels(factor(z$order)) > 1) {
            warning("invalid color.var setting, the iNEXT object consists multiple sites, change setting as site")
            color.var <- "order"
            z$col <- z$shape <- factor(z$order)
        }
        else {
            z$col <- z$shape <- rep(1, nrow(z))
        }
    }
    else if (color.var == "order") {
        z$col <- z$shape <- factor(z$order)
    }
    else if (color.var == "site") {
        if (!"site" %in% names(z)) {
            warning("invalid color.var setting, the iNEXT object do not consist multiple sites, change setting as order")
            z$col <- z$shape <- factor(z$order)
        }
        z$col <- z$shape <- z$site
    }
    else if (color.var == "both") {
        if (!"site" %in% names(z)) {
            warning("invalid color.var setting, the iNEXT object do not consist multiple sites, change setting as order")
            z$col <- z$shape <- factor(z$order)
        }
        z$col <- z$shape <- paste(z$site, z$order, sep = "-")
    }
    z$lty <- factor(z$method, levels=unique(c("interpolated", "observed", "extrapolated"), 
        c("interpolation", "interpolation", "extrapolation")))
    z$col <- factor(z$col)
    data.sub <- z[which(z$method == "observed"), ]
    g <- ggplot(z, aes_string(x = "x", y = "y", colour = "col")) + 
        geom_point(aes_string(shape = "shape"), size = 5, data = data.sub)
    g <- g + geom_line(aes_string(linetype = "lty"), lwd = 1.5) + 
        guides(linetype = guide_legend(title = "Method"), colour = guide_legend(title = "Guides"), 
            fill = guide_legend(title = "Guides"), shape = guide_legend(title = "Guides")) + 
        theme(legend.position = "bottom", legend.title = element_blank(), 
            text = element_text(size = 18))
    if (type == 2L) {
        g <- g + labs(x = "Number of sampling units", y = "Sample coverage")
        if (datatype == "abundance") 
            g <- g + labs(x = "Number of individuals", y = "Sample coverage")
    }
    else if (type == 3L) {
        g <- g + labs(x = "Sample coverage", y = "Species diversity")
    }
    else {
        g <- g + labs(x = "Number of sampling units", y = "Species diversity")
        if (datatype == "abundance") 
            g <- g + labs(x = "Number of individuals", y = "Species diversity")
    }
    if (se) 
        g <- g + geom_ribbon(aes_string(ymin = "y.lwr", ymax = "y.upr", 
            fill = "factor(col)", colour = "NULL"), alpha = 0.2)
    if (facet.var == "order") {
        if (length(levels(factor(z$order))) == 1 & type != 2) {
            warning("invalid facet.var setting, the iNEXT object do not consist multiple orders.")
        }
        else {
            g <- g + facet_wrap(~order, nrow = 1)
            if (color.var == "both") {
                g <- g + guides(colour = guide_legend(title = "Guides", 
                  ncol = length(levels(factor(z$order))), byrow = TRUE), 
                  fill = guide_legend(title = "Guides"))
            }
        }
    }
    if (facet.var == "site") {
        if (!"site" %in% names(z)) {
            warning("invalid facet.var setting, the iNEXT object do not consist multiple sites.")
        }
        else {
            g <- g + facet_wrap(~site, nrow = 1)
            if (color.var == "both") {
                g <- g + guides(colour = guide_legend(title = "Guides", 
                  nrow = length(levels(factor(z$order)))), fill = guide_legend(title = "Guides"))
            }
        }
    }
    if (facet.var == "both") {
        if (length(levels(factor(z$order))) == 1 | !"site" %in% 
            names(z)) {
            warning("invalid facet.var setting, the iNEXT object do not consist multiple sites or orders.")
        }
        else {
            g <- g + facet_wrap(site ~ order)
            if (color.var == "both") {
                g <- g + guides(colour = guide_legend(title = "Guides", 
                  nrow = length(levels(factor(z$site))), byrow = TRUE), 
                  fill = guide_legend(title = "Guides"))
            }
        }
    }
    if (grey) {
        g <- g + theme_bw(base_size = 18) + scale_fill_grey(start = 0, 
            end = 0.4) + scale_colour_grey(start = 0.2, end = 0.2) + 
            guides(linetype = guide_legend(title = "Method"), 
                colour = guide_legend(title = "Guides"), fill = guide_legend(title = "Guides"), 
                shape = guide_legend(title = "Guides")) + theme(legend.position = "bottom", 
            legend.title = element_blank())
    }
    g <- g + theme(legend.box = "vertical")
    return(g)
}


Answered By - Emanuel
Answer Checked By - Marie Seifert (PHPFixing Admin)

[FIXED] How could I optimize this following program in R to boost performance? (Monte Carlo simulation involving compute-intensive permutation test)

 October 09, 2022     for-loop, optimization, performance, r, statistics     No comments   

Issue

Hi, I'm trying to optimize the following program for a Monte Carlo power study. I first use an algorithm taken from Efron and Tibshirani (1993) to find the p-value of a permutation test (upper-tailed) for equality of means. I wrote a function called perm_test(), whose output is a single p-value. Then I call this function in another program, power_p(), which simulates 1000 permutation tests (returning 1000 p-values). My power estimate is the proportion of these 1000 p-values that are statistically significant, i.e., < 0.05. The entire process takes about 8 minutes to run (2020 MacBook Pro). I'm wondering if anyone has suggestions for optimizing this process to make it run quicker. Many thanks.

library(dplyr)  # needed for arrange() below

perm_test <- function(delta) {
  
  # Permutation test as described in Efron and Tibshirani (1993), (Algorithm 15.1, p208)
  
  # Draw random samples from a normal distribution
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  # observed diff in means, denoted as D_obs 
  D_obs <- mean(x) - mean(y)
  
  # Create a data frame "N" with  n_x + n_y obs (20 rows in our case)
  N <- data.frame("v" = c(x, y))
  # create a group variable "g" indicating which group each observation belongs to
  N$g <- as.factor(c(rep("x", 10), rep("y", 10)))
  # arrange column "v" in ascending order 
  # notice that column "g" is also re-ordered
  N <- arrange(N, v)
  
  ###############################################################################################
  # There are 20 choose 10 (184756) possibilities for the ordering of "g"                       #       
  # corresponding to all possible ways of partitioning 20 elements into two subsets of size 10  #
  # we take only a random sample of 5000 from those 184756 possibilities                        #
  ###############################################################################################
  
  # Initialize variables
  B <- 5000
  x_mean <- 0
  y_mean <- 0
  D_perm <- rep(0, 5000)
  
  # Loop to randomly generate 5000 different orderings of "g"
  for (i in 1:B) {
    
    # Shuffle the ordering of "g"
    N$g <- sample(N$g)
    # Permuted means of x and y
    x_mean <- tapply(N$v, N$g, mean)[1]
    y_mean <- tapply(N$v, N$g, mean)[2]
    # Find permuted diff in means, denoted as D_perm
    D_perm[i] <- x_mean - y_mean 
    }
  
  # Find p-value 
  P_perm <- sum(D_perm >= D_obs)/ B
  
  # Output
  return(round(P_perm, digits = 5))
  
}

Here's the program that simulates 1000 permutation tests:

power_p <- function(numTrial, delta) {
  
  # Initilize variables
  P_p <- rep(0, numTrial) 
  pwr_p <- 0
  
  # Simulation
  P_p <- replicate(n = numTrial, expr = perm_test(delta))
  
  # Power estimates are the proportions of p-values that are significant (i.e. less than 0.05)
  pwr_p <- sum(P_p < 0.05) / numTrial
  
  # Output 
  return(round(pwr_p, digits = 5))

}

Solution

perm_test2 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  g <- rep(1:2, each = 10)
  B <- 5000
  y_mean <- x_mean <- 0
  D_perm <- rep(0, B)
  for (i in 1:B) {
    ii <- sample(g) == 1L
    D_perm[i] <- (sum(v[ii]) - sum(v[!ii]))/10 
  }
  P_perm <- sum(D_perm >= D_obs)/ B
  return(round(P_perm, digits = 5))
}

In comparison to more complicated approaches I propose the above one, where I have made some simple improvements to your existing code.

  1. we do not need to use a data.frame; that only adds unnecessary time for subsetting the vector each time
  2. instead of a factor group vector we can use a simple integer vector
  3. the ordering of the data is not needed
  4. we can reduce the mean calculations. In your code the 2 calls of tapply(N$v, N$g, mean) were the slowest part.
  5. mean(x) is slower than sum(x)/n because it does additional checks etc., so in this situation we can use the faster approach, as the inner loop will be executed 1000 x 5000 times (sims x B).

bench::mark(perm_test_org(0.5), perm_test2(0.5), iterations = 5, check = F,
            relative = T)[, 1:8]
# A tibble: 2 x 8
#   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>         <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>
# 1 perm_test_org(0.5)  15.5   15.3       1        1.00     1        5    16
# 2 perm_test2(0.5)      1      1        13.5      1        1.69     5     2

Approximately 15x faster on my system. 1000 iterations took 33.89 seconds.

Update 2.

We can improve speed even more by:

  1. replacing sample with sample.int
  2. then we see that the g vector isn't needed at all to select two random groups
  3. inside the loop we do not need to sum both parts of vector v: we can compute sum(v) before the loop, do one sum less per iteration, and calculate the result values at the end.

perm_test3 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    D_perm2[i] <- sum(v[sample.int(10) < 6])
  }
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm/10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}

Runs in +/- 20 seconds for 1000 iterations. Now the slowest part is the repeated call of sample.int. You can look into faster functions: https://www.r-bloggers.com/2019/04/fast-sampling-support-in-dqrng/



Answered By - minem
Answer Checked By - Cary Denson (PHPFixing Admin)

[FIXED] How do I put multiple separate charts onto one PDF in R

 October 09, 2022     bar-chart, gplots, r, statistics     No comments   

Issue

I am having trouble figuring out how to put 6 separate bar plots onto one PDF. This is one of the plots for example (they are all very similar):

    barplot(druhy, ylim = c(0,50), main = "Míváš často chuť na svíčkovou?", col = c("red", "blue", "green", "black", "yellow"))

   legend("topleft", legend = c("ano", "spíše ano", "nevím", "spíše ne", "ne"), fill = c("red", "black", "green", "yellow", "blue"))

And here are the values the chart uses:

  print(druhy)

  ano\r\n        ne\r\n     nevím\r\n spíše ano\r\n  spíše ne\r\n 
        4             1             2             5             3 

I tried this code to put them onto one PDF together, but it doesn't work. I am sure there is something wrong in this code, but I just cannot seem to find what exactly.

    pdf('plot.pdf')

  m <- rbind(c(1,2), c(3,4), c(5,6))
  layout(m)
  sapply(seq_along(sp),
   function(x) {
     dd <- sp[[x]]
     m <- t(`rownames<-`(as.matrix(dd[, -(1:2)]), dd[, 1]))
     bp <- barplot(m,ylim=c(0, 0.4),beside=TRUE,col=barcols)
     title(main=names(sp[x]))
     # abline(h=0)
   }
plot(NA,xlim=c(0,1),ylim=c(0,1),ann=FALSE,axes=FALSE)

Is there any better way, or how do I fix this code so the outcome is a PDF with six side-by-side barplots? I am a beginner, so I apologize if my question is downright stupid.


Solution

The end of plotting must be indicated using dev.off(). Here is a simple working example:

pdf("myOut.pdf")
for (i in 1:10){
  barplot(iris$Sepal.Length)
}
dev.off()
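
If the goal is six barplots arranged on a single page rather than one plot per page, a hedged sketch along the same lines, assuming the six frequency tables are stored in a (hypothetical) list called plot_data:

pdf("plot.pdf", width = 8, height = 11)
par(mfrow = c(3, 2))                       # 3 rows x 2 columns on one page
for (druhy in plot_data) {
  barplot(druhy, ylim = c(0, 50),
          col = c("red", "blue", "green", "black", "yellow"))
}
dev.off()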


Answered By - danlooo
Answer Checked By - Willingham (PHPFixing Volunteer)

[FIXED] How to combine/union two separate numpy Gaussian sets?

 October 09, 2022     numpy, python, statistics     No comments   

Issue

I want to combine two separate random Gaussian data sets, one with its own mean and std and the other with an outlier mean and std. My code that I have is this:

import random
import numpy as np
import numpy.random as ra
from numpy.random import seed 

#This makes the random numbers generated not change when rerunning the code
np.random.seed(0)

#Creating two Gaussian sets, one with mean 0 and std 1, the second is outlier with mean 3 and std 1
#Each set contains 1,000 trials, first set contains 99 points while outlier set contains 1 point for each trial (for 1% outlier)

data = np.random.normal(loc=0, scale=1, size=(1000, 99))
dataoutlier = np.random.normal(loc=3, scale=1, size=(1000, 1))

Now how can I combine this so the outlier numbers are with the first set for each trial? I thought using np.union1d would work, but that combines all the trials into one giant array. Any help would be very appreciated!


Solution

In order to combine two numpy arrays by column you might use the append method.

np.append(data, dataoutlier, axis=1)
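
As a quick sanity check, using the arrays from the question, the combined array should have one extra column per trial:

combined = np.append(data, dataoutlier, axis=1)
print(combined.shape)   # (1000, 100): 99 regular points plus 1 outlier per trial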


Answered By - Grzegorz
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] How to average two (or multiple) histograms with R

 October 09, 2022     histogram, mean, plot, r, statistics     No comments   

Issue

Could someone tell me how to average two histograms with R?

I came across the HistogramTools package and the AddHistograms function:

h1<-hist(na.omit(a[,1]),breaks = 100,plot=F)
h2<-hist(na.omit(a[,2]),breaks = 100,plot=F)
> AddHistograms(h1,h2)
Error in .AddManyHistograms(x, main = main) : 
  Histograms must have identical breakpoints for this operation.

but I always get the same error, "Histograms must have identical breakpoints for this operation". Can someone explain why? My guess is that a[,1] and a[,2] are not the same length, and the same goes for the outputs of h1 and h2 (i.e. "breaks", "mids" and "counts" do not have the same lengths in h1 and h2).

Could you tell me how to average my two histograms using this function or anything else with R?


Solution

Follow the steps below:

  1. Create h1 and h2,
  2. combine and sort the breaks vectors,
  3. keep the unique values
  4. and add the histograms.

With the (not reproducible) example in the question,

h1 <- hist(na.omit(a[,1]), plot = FALSE)
h2 <- hist(na.omit(a[,2]), plot = FALSE)

brks <- sort(c(h1$breaks, h2$breaks))
brks <- unique(brks)

h1 <- hist(na.omit(a[,1]), breaks = brks, plot = FALSE)
h2 <- hist(na.omit(a[,2]), breaks = brks, plot = FALSE)

h12 <- AddHistograms(h1, h2)
plot(h12)

Note also that na.omit is not really needed, hist will discard them anyhow.
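
Note that AddHistograms sums the counts; if you literally want the average of the two histograms, one small sketch on top of the code above is to divide the summed counts by the number of histograms before plotting:

h_avg <- h12
h_avg$counts <- h_avg$counts / 2   # two histograms were added, so divide by 2
plot(h_avg)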



Answered By - Rui Barradas
Answer Checked By - Willingham (PHPFixing Volunteer)

[FIXED] How to count percentile of distribution in python

 October 09, 2022     math, numpy, percentile, python, statistics     No comments   

Issue

Is there any Python/NumPy function that calculates the n-th percentile of a given probability distribution?

# Like This
distr = [.2, .6, .2]
do_some_magic(distr, 50)  # 1
distr = [.1, .1, .6, .2]
do_some_magic(distr, 50)  # 2

Solution

Yes, you can use scipy's percentileofscore.

from scipy.stats import percentileofscore

distr = [.2, .6, .2]

print(percentileofscore(distr,50)/100)
1.0


Answered By - Machetes0602
Answer Checked By - Willingham (PHPFixing Volunteer)

[FIXED] What does "inf" mean for F value in scipy.stats.f_oneway?

 October 09, 2022     anova, pandas, python, scipy, statistics     No comments   

Issue

I ran a one-way anova, and some groups had "inf" for the F value and "0.000000e+00" for the p value. Does this mean that the difference is significant?

I separated the dataframe using groupby and looped through, example code:

from scipy import stats

c_jobs_anova = []

for name_group in c_jobs.groupby(['Name']):
    samples = [condition[1] for condition in name_group[1].groupby('Condition')['Value']]
    f_value, p_value = stats.f_oneway(*samples)
    print('Group: {}, F value: {:.3f}, p value: {:.3f}'.format(name_group[0], f_value, p_value))
    c_jobs_anova.append((name_group[0], f_value, p_value))

The result:

(screenshot of the printed results omitted: some groups show an F value of inf and a p value of 0.000000e+00)


Solution

Yes, very large values of F-statistic indicate high significance, as witnessed by p being reported as 0. Mathematically, F comes up as infinity if there is no in-group variability, e.g.,

>>> stats.f_oneway([2, 2, 2], [1, 1, 1])
F_onewayResult(statistic=inf, pvalue=0.0)

This result is also possible if there is very little in-group variability compared to between-group variability, resulting in numerical overflow.

>>> stats.f_oneway([2, 2, 2], [1, 1, 1.00000001])
F_onewayResult(statistic=inf, pvalue=0.0)


Answered By - user6655984
Answer Checked By - Mildred Charles (PHPFixing Admin)

[FIXED] How to get sum of probabilities of rolling w dice where one is fair and the other one is unfair?

 October 09, 2022     math, python-3.x, statistics     No comments   

Issue

I am writing a little program and wanted to ask how I can add the logic of having an unfair die in the game. Right now, my code produces the sum of probabilities of rolling 2 dice with 6 faces each, i times. However, it treats both dice as having a 1/6 probability of rolling a given number. How do I tweak it so that the unfair die ONLY shows up in the range 2-5 and never as 1 or 6? The output should be the sum of probabilities for all numbers in the range 2-12, given the fair and unfair dice.

import random
from collections import defaultdict

def main():
    dice = 2
    sides = 6
    rolls = int(input("Enter the number of rolls to simulate: "))
    result = roll(dice, sides, rolls)
    maxH = 0
    for i in range(dice, dice * sides + 1):
        if result[i] / rolls > maxH: maxH = result[i] / rolls
    for i in range(dice, dice * sides + 1):
        print('{:2d}{:10d}{:8.2%} {}'.format(i, result[i], result[i] / rolls, '#' * int(result[i] / rolls / maxH * 40)))


def roll(dice, sides, rolls):
    d = defaultdict(int)
    for _ in range(rolls):
        d[sum(random.randint(1, sides) for _ in range(dice))] += 1
    return d

main()

Output

Enter the number of rolls to simulate: 10000
 2       265   2.65% ######
 3       567   5.67% #############
 4       846   8.46% ####################
 5      1166  11.66% ############################
 6      1346  13.46% ################################
 7      1635  16.35% ########################################
 8      1397  13.97% ##################################
 9      1130  11.30% ###########################
10       849   8.49% ####################
11       520   5.20% ############
12       279   2.79% ######

Solution

Given that the logic of which results are possible is currently being controlled by the line

random.randint(1, sides)

that's the line to change if you want to roll with different bounds. For example, to get 2-5, you could generalize the function:

def main():
    dice = 2
    sides = 6
    unfair_min = 2
    unfair_max = 5
    rolls = int(input("Enter the number of rolls to simulate: "))
    result_unfair = roll_biased(dice, sides, rolls, min_roll=unfair_min, max_roll=unfair_max)
    maxH = max(result_unfair.values()) / rolls

    for i in range(dice, dice * sides + 1):
        print('{:2d}{:10d}{:8.2%} {}'.format(i, result_unfair[i], result_unfair[i] / rolls,
                                             '#' * int(result_unfair[i] / rolls / maxH * 40)))


def roll_biased(dice, sides, rolls, min_roll=1, max_roll=None):
    if max_roll is None:
        max_roll = sides
    d = defaultdict(int)
    for _ in range(rolls):
        d[sum(random.randint(min_roll, max_roll) for _ in range(dice))] += 1
    return d

Which could print:

Enter the number of rolls to simulate: 10000
 2         0   0.00% 
 3         0   0.00% 
 4       632   6.32% ##########
 5      1231  12.31% ###################
 6      1851  18.51% #############################
 7      2480  24.80% ########################################
 8      1873  18.73% ##############################
 9      1296  12.96% ####################
10       637   6.37% ##########
11         0   0.00% 
12         0   0.00% 

You could also generalize this to arbitrary choices (or arbitrary weights) using random.choices() as such:

def roll_from_choices(dice, sides, rolls, allowed_rolls=None):
    if allowed_rolls is None:
        allowed_rolls = list(range(1, sides+1))
    d = defaultdict(int)

    for _ in range(rolls):
        d[sum(random.choices(allowed_rolls, k=dice))] += 1
    return d

which you can call as:

result_unfair = roll_from_choices(dice, sides, rolls, allowed_rolls=[2, 3, 4, 5])


Answered By - kcsquared
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] What is the best method to check for what value the elements of a list are approaching?

 October 09, 2022     calculus, math, python, statistics     No comments   

Issue

I'm working on a method for calculating limits of a function in Python, but certain functions give an incorrect output. My main goal is to calculate derivatives, but for that, I must calculate the limit of a function.

This is for a personal project, I would not like a module to solve the limit, or search an algorithm on Wikipedia. I tried using the following method:

  • I generate a list with the values of f(x) very close to the value which x tends
  • The code analyzes which value is most repeated in this list, and it's concluded that this value is the limit of the function f(x) tending to x.

This method is obviously antimathematical, but I could not think of a better solution. I tried to apply measures of central tendency, but I don't think that works for a value with a periodic decimal expansion, like 2/7.

Here is my code:

# this function resolves a numerical expression
# using eval function from Python

def Solve(Str):
    if '^' in Str:
        Str = Str.replace('^', '**')
    return eval(Str)


# this function solves a mathematical function by substituting x
# for a value passed by parameter and returning its result

def SolveF(f, x, var = 'x'):
    f = f.replace(' ', '')

    # inserts a multiplication sign between numbers
    # example: 5x --> 5*x

    f = list(f)
    i = 0
    while i < len(f)-1:
        L = f[i:i+2]
        if L[0] in '0123456789' and L[1] == var:
            f.insert(i+1, '*')
        i += 1
    f = ''.join(f)

    f = f.replace(var, '(' + var + ')')
    f = f.replace(var, str(x))

    return Solve(f)


# this function returns f(x) for a value very close
# to the value at which x tends. for example, if x
# tends to 5, it returns f(5.0000000000001). the tiny
# amount that is added to x is 10^(-13) (arbitrary value)

def Lim(f, x, c = 13):
    return SolveF(f, x + (10**(-c)))


# this function returns several f(x) in a list to values
# very close to the value at which x tends. for example,
# if x tends to 0, it will add the list f(0.001), f(0.000001),
# f(0.0000001), ..., f(0.0000000001). then returns the value
# that most repeats in that list, which is supposed to be the
# value whose function is approaching.

def LimM(f, x):
    i = 0
    L = []
    for i in range(5, 20):
        try:
            L.append("{:.10f}".format(Lim(f, x, i)))
        except ZeroDivisionError:
            i += 1
            continue

    print(L)
    List2 = [L.count(i) for i in set(L)]
    if List2 == [1]*len(List2):
        return 'inf'
    else:
        return list(set(L))[List2.index(max(List2))]

from fractions import Fraction
while True:
    F = input('Function: ')
    X = float(input('x --> '))
    Res = LimM(F, X)
    if Res != 'inf':
            print(Fraction(Res).limit_denominator())
    else:
        print(Res)

Example 1: the function (x^2 - 4)/(x - 2) approaching x = 2.

The list generated by the LimM function is equal to ['4.0000100000', '4.0000010001', '4.0000000977', '4.0000000000', '4.0000000000', '4.0000000000', '4.0000000000', '4.0000000000', '4.0000000000', '4.0000000000', '4.0000000000'].

Notice that the value that repeats the most in the list is '4.0000000000', so the limit is equal to 4.

Example 2: the function ((x + 1)/(x - 1) approaching x = 2.

The list generated by the LimM function is equal to ['2.9999800002', '2.9999980000', '2.9999998000', '2.9999999800', '2.9999999980', '2.9999999998', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000', '3.0000000000'].

Notice that the value that repeats the most in the list is '3.0000000000', so the limit is equal to 3.

I tested 28 different limits (you can check the questions here), and only 6 were incorrect. Among them are:

  • Input 1: Exercise 1, item m)
Function: (1/(1-x)) - (3/(1-x^3))
x --> 1

    Right answer: -1
    Code output: 0

List generated: ['-0.9999930434', '-1.0000138859', '-0.9992006216', '-0.7401486933', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000']

  • Input 2: Exercise 1, item p)
Function: (3x^4 - 4x^3 + 1)/(x - 1)^2
x --> 1

    Right answer: 6
    Code output: 0

List generated: ['6.0000848733', '6.0000893153', '5.9952043260', '8.8817843050', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000']

  • Input 3: Exercise 1, item u)
Function: (x^2 + 7x - 44)/(x^2 - 6x + 8)
x --> 4

    Right answer: 15/2
    Code output: 4222/563

List generated: ['7.4999675007', '7.4999967484', '7.4999995648', '7.4999992895', '7.4999982236', '7.4999911182', '7.4991119005', '7.4991119005', '7.5714285714', '6.6666666667']

  • Input 4: Exercise 1, item z)
Function: (1/(x^2 - 1)) - (2/(x^4 - 1))
x --> 1

    Right answer: 1/2
    Code output: 0

List generated: ['0.4999950374', '0.4999879392', '0.4996002605', '0.8326672688', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000']

  • Input 5: Exercise 3, item a)
Function: ((1 + 2x)^(0.5) - 1)/(3x)
x --> 0

    Right answer: 1/3
    Code output: 0

List generated: ['0.3333316667', '0.3333331667', '0.3333333165', '0.3333333313', '0.3333332869', '0.3333333609', '0.3333333609', '0.3332889520', '0.3330669074', '0.3330669074', '0.2960594732', '0.0000000000', '0.0000000000', '0.0000000000', '0.0000000000']

Therefore, what is the best method to check for what value the elements of a list are approaching?


Solution

Just take the last value. It's the best approximation you're going to get.

Higher-level algorithmic improvement: don't even bother computing the other elements of the list. Just take a difference quotient with a small difference and use that as your computed derivative.

This is of course not going to be fully reliable, but that's an unavoidable problem of the approach you've chosen. It is simply not possible to compute limits by looking at a finite number of points near where you want to take the limit.
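
For the derivative itself, a minimal sketch of that idea, here using a central difference quotient (one common choice, not necessarily the only one):

def derivative(f, x, h=1e-6):
    # approximate f'(x) with a single central difference quotient
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(lambda x: x**2, 3.0))   # approximately 6.0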



Answered By - user2357112
Answer Checked By - Marie Seifert (PHPFixing Admin)

[FIXED] How to interpret scipy.stats.probplot results?

 October 09, 2022     matplotlib, numpy, plot, python, statistics     No comments   

Issue

I wanted to use scipy.stats.probplot() to perform some gaussianity test on mydata.

from scipy import stats
_,fit=stats.probplot(mydata, dist=stats.norm,plot=ax)
goodness_fit="%.2f" %fit[2]

The documentation says:

Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function. probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

But if I google "probability plot", it appears to be a common name for a P-P plot, while the documentation says not to confuse the two.

Now I am confused, what is this function doing?


Solution

I looked for hours for an answer to this question, and it can be found in the SciPy/statsmodels code comments.

In Scipy, comment at https://github.com/scipy/scipy/blob/abdab61d65dda1591f9d742230f0d1459fd7c0fa/scipy/stats/morestats.py#L523 says:

probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

So, now, let's look at Statsmodels, where comment at https://github.com/statsmodels/statsmodels/blob/66fc298c51dc323ce8ab8564b07b1b3797108dad/statsmodels/graphics/gofplots.py#L58 says:

ppplot : Probability-Probability plot Compares the sample and theoretical probabilities (percentiles).

qqplot : Quantile-Quantile plot Compares the sample and theoretical quantiles

probplot : Probability plot Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.

So, difference between QQ plot and Probability plot, in these modules, is related to the scales.
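
A short sketch of both APIs side by side, assuming mydata is a 1-D sample (the statsmodels calls are the ones the comments above point to; exact plot styling is up to you):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

mydata = np.random.normal(size=200)

# scipy "probability plot": theoretical quantiles on x, ordered sample values on y
stats.probplot(mydata, dist=stats.norm, plot=plt)
plt.show()

# statsmodels: explicit P-P, Q-Q and probability plots from the same object
pp = sm.ProbPlot(mydata, dist=stats.norm, fit=True)
pp.ppplot(line="45")
pp.qqplot(line="s")
pp.probplot()
plt.show()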



Answered By - mike123
Answer Checked By - Robin (PHPFixing Admin)

[FIXED] What will be the mean of a column if there are missing values in the column?

 October 09, 2022     data-science, dataframe, pandas, python, statistics     No comments   

Issue

Example Data

ID Name Phone
1 x +212
2 y NaN
3 xy NaN

df is the name of the dataset. The code below gave the names of the columns with no missing values.

no_nulls = set(df.columns[df.isnull().mean()==0])

isnull() will convert the dataset into something like this

ID Name Phone
False False False
False False True
False False True

Can someone explain how mean works on non-integers?

I used this and it worked, but I am curious about mean:

no_nulls = set(df.columns[df.notnull().all()]) 

Solution

In your case, .mean() is processing a DataFrame of boolean values with True and False values only. In this case, .mean() treats False as 0 and True as 1. Hence, if you look at the result of df.isnull().mean(), you will see:

df.isnull().mean()

ID       0.000000
Name     0.000000
Phone    0.666667
dtype: float64

Here, as columns ID and Name have all False values, .mean() will treat all as zeros and get a mean of zero. For column Phone, you have one False and 2 True, hence, the mean is equivalent to taking mean of 0, 1, 1, i.e. 0.666667.

As a result, when you check for df.isnull().mean()==0, only the first 2 columns will be True and hence, you get {'ID', 'Name'} for the result of no_nulls.
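
The same coercion can be seen on a bare Series of booleans, as a tiny illustration:

import pandas as pd

pd.Series([False, True, True]).mean()   # 0.666..., since True -> 1 and False -> 0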

Referring to the official documentation of DataFrame.mean, you will get a hint from the parameter numeric_only= and its default behavior:

Parameters

numeric_only bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.



Answered By - SeaBean
Answer Checked By - Katrina (PHPFixing Volunteer)

[FIXED] How to calculate moving / running / rolling arbitrary function (e.g. kurtosis & skewness) using NumPy / SciPy

 October 09, 2022     numpy, python, scipy, statistics     No comments   

Issue

I am working on time-series data. To get features from the data I have to calculate the moving mean, median, mode, slope, kurtosis, skewness, etc. I am familiar with scipy.stats, which provides an easy way to calculate these quantities in the plain (non-moving) case. But for the moving/running part, I have searched the whole internet and found nothing.

Surprisingly, the moving mean, median and mode are very easy to calculate with NumPy. Unfortunately, there is no built-in function for calculating moving kurtosis and skewness. If someone can help, how do I calculate moving kurtosis and skewness with SciPy? Many thanks.


Solution

Pandas offers a DataFrame.rolling() method which can be used, in combination with its Rolling.apply() method (i.e. df.rolling().apply()) to apply an arbitrary function to the specified rolling window.
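
For example, a minimal sketch of that route for rolling kurtosis and skewness, assuming a 1-D series x and a window of 20:

import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.randn(1000))
window = 20

# apply the scipy functions over each rolling window
rolling_kurt = x.rolling(window).apply(stats.kurtosis, raw=True)
rolling_skew = x.rolling(window).apply(stats.skew, raw=True)

# pandas also has built-in (bias-corrected sample) versions:
# x.rolling(window).kurt() and x.rolling(window).skew()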


If you are looking for NumPy-based solution, you could use FlyingCircus Numeric (disclaimer: I am the main author of it).

There, you could find the following:

  1. flyingcircus_numeric.running_apply(): can apply any function to a 1D array and supports weights, but it is slow;
  2. flyingcircus_numeric.moving_apply(): can apply any function supporting an axis: int parameter to a 1D array and supports weights, and it is fast (but memory-hungry);
  3. flyingcircus_numeric.rolling_apply_nd(): can apply any function supporting an axis: int|Sequence[int] parameter to any ND array and it is fast (and memory-efficient), but it does not support weights.

Based on your requirements, I would suggest to use rolling_apply_nd(), e.g.:

import numpy as np
import scipy as sp
import flyingcircus_numeric as fcn

import scipy.stats


NUM = 30
arr = np.arange(NUM)

window = 4
new_arr = fcn.rolling_apply_nd(arr, window, func=sp.stats.kurtosis)
print(new_arr)
# [-1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36 -1.36
#  -1.36 -1.36 -1.36]

Of course, feel free to inspect the source code, it is open source (GPL).


EDIT

Just to get a feeling of the kind of speed we are talking about, these are the benchmarks for the solutions implemented in FlyingCircus:

(benchmark plot and zoomed benchmark plot omitted)

The general approach flyingcircus_numeric.running_apply() is a couple of orders of magnitude slower than either flyingcircus_numeric.rolling_apply_nd() or flyingcircus_numeric.moving_apply(), with the first being approx. one order of magnitude faster than the second. This shows the speed price for generality or support for weighting.

The above plots were obtained using the scripts from here and the following code:

import scipy as sp
import flyingcircus_numeric as fcn 
import scipy.stats


WINDOW = 4
FUNC = sp.stats.kurtosis


def my_rolling_apply_nd(arr, window=WINDOW, func=FUNC):
    return fcn.rolling_apply_nd(arr, window, func=FUNC)


def my_moving_apply(arr, window=WINDOW, func=FUNC):
    return fcn.moving_apply(arr, window, func)


def my_running_apply(arr, window=WINDOW, func=FUNC):
    return fcn.running_apply(arr, window, func)


def equal_output(a, b):
    return np.all(np.isclose(a, b))


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000)
funcs = my_rolling_apply_nd, my_moving_apply, my_running_apply

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=np.random.random, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='s')
plot_benchmarks(runtimes, input_sizes, labels, units='ms', zoom_fastest=8)

(EDITED to reflect some refactoring of FlyingCircus)



Answered By - norok2
Answer Checked By - Katrina (PHPFixing Volunteer)

[FIXED] How to use C libraries in Vlang for basic statistics

 October 09, 2022     c, libraries, statistics, vlang     No comments   

Issue

I want to do basic statistics with Vlang.

Can I use C libraries? For example, Apophenia: http://apophenia.info/

Or IMSL C stat library: https://docs.roguewave.com/en/imsl/c/8.6/pdf/C_Stat_library.pdf

Thanks for your help.


Solution

Yes, you can call C libraries from V.

You need to make the C library's structs, typedefs and functions known to V by declaring them in V first, before calling or using them.

For structs, you conveniently only need to declare the fields you actually use.
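
As a rough illustration of the pattern (using the C standard library's abs() purely as a stand-in for a statistics routine; functions from Apophenia or IMSL would be declared the same way, and you may additionally need #flag directives to link their libraries):

#include <stdlib.h>

// Declare the C function's signature so V knows how to call it
fn C.abs(n int) int

fn main() {
	println(C.abs(-7)) // prints 7
}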

Here are some examples:

  • via 2D game framework wrapping several C libs
  • sokol in vlib
  • v-miniaudio wrapper (disclaimer: my own module)

Generally you can find a lot of C wrapper code in vlib itself. (We're working on replacing C with pure V)



Answered By - Larpon
Answer Checked By - Timothy Miller (PHPFixing Admin)

[FIXED] How to re-analyze complete history of a project using Sonar?

 October 09, 2022     history, sonarqube, statistics, svn, version-control     No comments   

Issue

I would like to load the entire project history since its inception into Sonar.

I would basically want to execute code like this:

0) checkout version 1 from Subversion
1) checkout next version from Subversion
2) if the commit date is from the same day as the previous one - goto 1
3) run mvn sonar:sonar, overriding the build time with the time of the commit
4) if not on last commit - goto 1

Is there a tool that does this already? Is there a way of convincing Sonar to use a different date than the current one?


Solution

This is from the Mailing Lists:

Indeed, to import historical data you must use the "sonar.projectDate" property (Format is yyyy-MM-dd, for example 2010-12-25) [1] and launch a Sonar analysis on each tag/branch that you'd like to see in your project history.

http://sonarqube.15.x6.nabble.com/re-ordering-historical-data-td3191565.html

There is an additional blog post that explains this further.
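
As a rough sketch of the replay loop described in the question (not an official tool; it assumes svn and mvn are on the PATH, a Maven project already configured for Sonar, and relies on the sonar.projectDate property mentioned above):

import subprocess
import xml.etree.ElementTree as ET

def svn_commits(repo_url):
    """Yield (revision, yyyy-MM-dd) pairs for every commit, oldest first."""
    xml_log = subprocess.check_output(["svn", "log", "--xml", "-r", "1:HEAD", repo_url])
    for entry in ET.fromstring(xml_log).iter("logentry"):
        yield entry.get("revision"), entry.findtext("date")[:10]

def replay_history(repo_url, working_copy):
    last_day = None
    for revision, day in svn_commits(repo_url):
        if day == last_day:   # at most one analysis per day, as in the pseudo-code above
            continue
        last_day = day
        subprocess.check_call(["svn", "update", "-r", revision], cwd=working_copy)
        subprocess.check_call(
            ["mvn", "sonar:sonar", "-Dsonar.projectDate=" + day], cwd=working_copy)

# e.g. replay_history("http://svn.example.org/project/trunk", "/path/to/working/copy")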



Answered By - oers
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

[FIXED] How to perform column-wise wilcox.test and fisher exact on groups in R

 October 09, 2022     dataframe, pairwise.wilcox.test, r, statistics     No comments   

Issue

I have data.frame df1:

df1 <- data.frame(
  En_ID = c("KNT00000000003", "KNT00000000005", "KNT00000000419", 
                 "KNT00000000457", "KNT00000000460", "KNT00000000938", 
                 "KNT00000000971", "KNT00000001036", "KNT00000001084", 
                 "KNT00000001167" ), 
  `Nor1` = c(-0.834165161710272, 1.02199443531549, 
                -0.558658947885705, -0.390114219973209, -1.23551839713296, 
                3.11429434221998, 0.283932163407262, -1.16908518620064, 
                -0.597054772455507, -0.593624543273255), 
  `Nor2` = c(-1.18531035488942, 0.423719727339646, -1.23261719368372, 
                0.0855281133529292, -1.52366830232278, 3.36692586561211, 
                1.00323690950956, -0.000211248816114964, -4.74738483548391, 
                -0.318176231083024), 
  `Nor3` = c(-0.262659255267546, 1.3962481061442, -0.548673555705647, 
                -0.0149651083306594, -1.45458689193089, 2.54126941463459, 
                1.17711308509307, -1.19425284921181, 1.17788731755683, 
                -0.367897054652365 ), 
  `Nor4` = c(-0.840752912305256, 0.536548846040064, -0.277409459604357, 
                -0.241073614962264, -0.875313153342293, 1.61789645804321, 
                0.412287101096504, -1.11846661523232, -2.6274528854429, 
                -0.760452698231182), 
  `Tor1` = c(-0.968784779247286, -0.502809694119192, -0.231526399163731, 
                -0.530038395734114, -0.706006018337411, 3.58264357077653, 
                -0.127521010699219, 0.270523387217103, 1.68335644352003, 
                -0.314902131571829), 
  `Tor2` = c(-0.481754175843152, -0.440784040523259, -0.532975340622715, 
                -0.182089795101371, -0.564807490336052, 1.74119896504534, 
                -0.96169805631325, -0.721782763145306, -0.433459827401695, 
                -0.727495835245995 ), 
  `Tor3` = c(-0.889343429110847, 1.07937149728343, -0.215144871523998, 
                -0.92234350748557, -0.832108253417702, 2.02456082994848, 
                -0.0434322861759954, -0.523126561938426, -0.556984056084809, 
                -0.740331742513503), 
  `Tor4` = c(-0.858141567384178, 1.87728717064375, -0.381047638414538, 
                -0.613568289061259, -1.92838339196505, 2.23393705735665, 
                0.635389543483408, -0.466053620529111, -1.50483745357134, 
                -1.33400859143521), 
  `Tor5` = c(-0.486388736112514, 0.789390852922639, -0.869434195504952, 
                -0.70405854858187, -1.16488184095428, 2.91497178849082, 
                -2.10331904053714, -0.571130459068143, -0.219526004620518, 
                -0.301435496557957)
)

How can I run the wilcox.test and Fisher exact test column-wise, comparing the Nor1, Nor2, Nor3 and Nor4 columns with the Tor1, Tor2, Tor3, Tor4 and Tor5 columns of each row? Then I would like to add the p-value output of both tests as columns at the end, resulting in df2:

df2 <- data.frame(
  En_ID = c("KNT00000000003", "KNT00000000005", "KNT00000000419", "KNT00000000457", "KNT00000000460", "KNT00000000938", "KNT00000000971", "KNT00000001036", "KNT00000001084", "KNT00000001167"),
  `Nor1` = c(-0.834165161710272, 1.02199443531549, -0.558658947885705, -0.390114219973209, -1.23551839713296, 3.11429434221998, 0.283932163407262, -1.16908518620064, -0.597054772455507, -0.593624543273255),
  `Nor2` = c(-1.18531035488942, 0.423719727339646, -1.23261719368372, 0.0855281133529292, -1.52366830232278, 3.36692586561211, 1.00323690950956, -0.000211248816114964, -4.74738483548391, -0.318176231083024),
  `Nor3` = c(-0.262659255267546, 1.3962481061442, -0.548673555705647, -0.0149651083306594, -1.45458689193089, 2.54126941463459, 1.17711308509307, -1.19425284921181, 1.17788731755683, -0.367897054652365),
  `Nor4` = c(-0.840752912305256, 0.536548846040064, -0.277409459604357, -0.241073614962264, -0.875313153342293, 1.61789645804321, 0.412287101096504, -1.11846661523232, -2.6274528854429, -0.760452698231182),
  `Tor1` = c(-0.968784779247286, -0.502809694119192, -0.231526399163731, -0.530038395734114, -0.706006018337411, 3.58264357077653, -0.127521010699219, 0.270523387217103, 1.68335644352003, -0.314902131571829),
  `Tor2` = c(-0.481754175843152, -0.440784040523259, -0.532975340622715, -0.182089795101371, -0.564807490336052, 1.74119896504534, -0.96169805631325, -0.721782763145306, -0.433459827401695, -0.727495835245995),
  `Tor3` = c(-0.889343429110847, 1.07937149728343, -0.215144871523998, -0.92234350748557, -0.832108253417702, 2.02456082994848, -0.0434322861759954, -0.523126561938426, -0.556984056084809, -0.740331742513503),
  `Tor4` = c(-0.858141567384178, 1.87728717064375, -0.381047638414538, -0.613568289061259, -1.92838339196505, 2.23393705735665, 0.635389543483408, -0.466053620529111, -1.50483745357134, -1.33400859143521),
  `Tor5` = c(-0.486388736112514, 0.789390852922639, -0.869434195504952, -0.70405854858187, -1.16488184095428, 2.91497178849082, -2.10331904053714, -0.571130459068143, -0.219526004620518, -0.301435496557957),
  `p-value-wilcox` = c(0.8, 0.3, 0.7, 0.8, 0.9, 0.8, 0.7, -0.5, -0.7, -0.9),
  `p-value-fisher` = c(0.1, 0.7, 0.3, 0.1, 0.5, 0.3, 0.9, -0.2, -0.9, -0.4)
)

Here I am putting dummy p-values to provide an outline of the desired output. The real data have >200 columns, and the two groups (Nor and Tor) have unequal numbers of samples.

I found some examples on Stack Overflow, mentioned below, and tried to replicate them, but failed miserably.

wilcox.test example

Fisher exact test example

Please help me.


Solution

We may use dapply from collapse, which should be faster:

library(collapse)
# drop En_ID (column 1) and Tor5 (column 10) so each group contributes four values per row
cbind(df1, dapply(slt(df1, -c(1, 10)), MARGIN = 1,
  FUN = function(x) c(wilcox = wilcox.test(x[1:4], x[5:8])$p.value,
                      fisher = fisher.test(x[1:4], x[5:8])$p.value)))
            En_ID       Nor1          Nor2        Nor3       Nor4       Tor1       Tor2        Tor3       Tor4       Tor5    wilcox fisher
1  KNT00000000003 -0.8341652 -1.1853103549 -0.26265926 -0.8407529 -0.9687848 -0.4817542 -0.88934343 -0.8581416 -0.4863887 0.6857143      1
2  KNT00000000005  1.0219944  0.4237197273  1.39624811  0.5365488 -0.5028097 -0.4407840  1.07937150  1.8772872  0.7893909 0.8857143      1
3  KNT00000000419 -0.5586589 -1.2326171937 -0.54867356 -0.2774095 -0.2315264 -0.5329753 -0.21514487 -0.3810476 -0.8694342 0.1142857      1
4  KNT00000000457 -0.3901142  0.0855281134 -0.01496511 -0.2410736 -0.5300384 -0.1820898 -0.92234351 -0.6135683 -0.7040585 0.1142857      1
5  KNT00000000460 -1.2355184 -1.5236683023 -1.45458689 -0.8753132 -0.7060060 -0.5648075 -0.83210825 -1.9283834 -1.1648818 0.3428571      1
6  KNT00000000938  3.1142943  3.3669258656  2.54126941  1.6178965  3.5826436  1.7411990  2.02456083  2.2339371  2.9149718 0.8857143      1
7  KNT00000000971  0.2839322  1.0032369095  1.17711309  0.4122871 -0.1275210 -0.9616981 -0.04343229  0.6353895 -2.1033190 0.1142857      1
8  KNT00000001036 -1.1690852 -0.0002112488 -1.19425285 -1.1184666  0.2705234 -0.7217828 -0.52312656 -0.4660536 -0.5711305 0.2000000      1
9  KNT00000001084 -0.5970548 -4.7473848355  1.17788732 -2.6274529  1.6833564 -0.4334598 -0.55698406 -1.5048375 -0.2195260 0.3428571      1
10 KNT00000001167 -0.5936245 -0.3181762311 -0.36789705 -0.7604527 -0.3149021 -0.7274958 -0.74033174 -1.3340086 -0.3014355 0.6857143      1


Answered By - akrun
Answer Checked By - Marie Seifert (PHPFixing Admin)

[FIXED] How to replace the previous value with the current value based on a condition?

 October 09, 2022     conditional-statements, list, python, statistics     No comments   

Issue

My task is to detect outliers using the Z score and replace their value with the previous valid value.

signal = ['229.84', '227.8', '221.16', '220.6', '217.52', '225.2', '221.68', '221.68', '225.24', '218.6', '218.6', '222.08', '219.96', '219.52', '223.8', '223.72', '222.6', '222.68', '228.2', '221.84', '229.36', '227.48', '227.48', '226.56', '226.24', '215.32', '220.76', '222.44', '234.12', '226.56', '228.04', '236.64', '228.32', '236.72', '236.84', '237.64', '213.92', '235.52', '238.0', '239.12', '237.12', '217.24', '229.4', '229.4', '239.56', '236.2', '236.2', '220.04', '232.24', '223.92', '220.6', '242.96', '220.4', '242.2', '243.28', '241.72', '241.12', '241.8', '236.6', '234.24', '233.84', '234.8', '236.88', '244.8', '236.0', '230.84', '229.6', '229.84', '214.8', '231.48', '239.6', '239.56', '222.88', '238.24', '238.92', '235.36', '217.48', '217.2', '217.12', '218.08', '222.04', '89.48', '88.8', '223.2', '213.6', '239.6', '214.52', '95.8', '210.8', '209.92', '210.4', '215.76', '210.28', '211.76', '210.64', '211.36', '210.84', '201.84', '211.16', '242.16', '233.28', '212.8', '207.44', '209.0', '208.52', '207.44', '212.08', '210.96', '203.12', '207.76', '202.8', '203.16', '208.36', '209.76', '211.24', '211.24', '211.24', '206.04', '209.76', '210.2', '195.96', '195.84', '207.2', '201.92', '203.8', '199.96', '206.24', '204.12', '233.92', '230.68', '226.4', '221.6', '226.68', '226.56', '225.6', '223.72', '220.44', '223.64', '225.52', '223.96', '228.0', '227.44', '224.4', '223.32', '220.08', '220.2', '221.8', '218.08', '218.08', '216.96']

import numpy as np
results = [ float(s) for s in signal]
mean = np.mean(results)
std = np.std(results)


threshold = -1.5
outlier = []
new_list = []

for i in results:
    z = (i - mean) / std
    if z < threshold:
        outlier.append(i)

The outliers in the dataset are [89.48, 88.8, 95.8].

The final list should have these values replaced with the previous one (only if the previous value's z-score does not itself satisfy the condition z < threshold).

Edit:

Now, when I try to expand this to a whole file with similar elements, it gives an error. File

with open("File.txt") as f:
    img_intensity_list = f.readlines()
    for count, value in enumerate(img_intensity_list):
        img_intensity_list[count] = value.split("[")[1].split("]")[0].split(", ")
        # print(img_intensity_list)
        for elem, val in enumerate(img_intensity_list):

            results = [float(elem) for elem in img_intensity_list]
            mean = np.mean(results)
            std = np.std(results)

            threshold = -1.5
            outlier = []
            # new_list = [0 for k in range(len(results))]

            for i, value in enumerate(results):
                z = (value - mean) / std

                if float(z) < threshold:
                    outlier.append(value)
                    results[i] = results[i-1]
                else:
                    results[i] = value

Error: float() argument must be a string or a number, not 'list'


Solution

i in your code is the value in the list, but you are using it both as a value (when computing z) and as an index (when assigning the previous result).

Use enumerate to get both the index and the value of each element of your list, like this:

for i, value in enumerate(results):
    z = (value - mean) / std
    if z < threshold:
        outlier.append(value)
        results[i] = results[i-1]

If I understood your code correctly, this version should give you the expected result.

import numpy as np


signal = ['229.84', '227.8', '221.16', '220.6', '217.52', '225.2', '221.68', '221.68', '225.24', '218.6', '218.6', '222.08', '219.96', '219.52', '223.8', '223.72', '222.6', '222.68', '228.2', '221.84', '229.36', '227.48', '227.48', '226.56', '226.24', '215.32', '220.76', '222.44', '234.12', '226.56', '228.04', '236.64', '228.32', '236.72', '236.84', '237.64', '213.92', '235.52', '238.0', '239.12', '237.12', '217.24', '229.4', '229.4', '239.56', '236.2', '236.2', '220.04', '232.24', '223.92', '220.6', '242.96', '220.4', '242.2', '243.28', '241.72', '241.12', '241.8', '236.6', '234.24', '233.84', '234.8', '236.88', '244.8', '236.0', '230.84', '229.6', '229.84', '214.8', '231.48', '239.6', '239.56', '222.88', '238.24', '238.92', '235.36', '217.48', '217.2', '217.12', '218.08', '222.04', '89.48', '88.8', '223.2', '213.6', '239.6', '214.52', '95.8', '210.8', '209.92', '210.4', '215.76', '210.28', '211.76', '210.64', '211.36', '210.84', '201.84', '211.16', '242.16', '233.28', '212.8', '207.44', '209.0', '208.52', '207.44', '212.08', '210.96', '203.12', '207.76', '202.8', '203.16', '208.36', '209.76', '211.24', '211.24', '211.24', '206.04', '209.76', '210.2', '195.96', '195.84', '207.2', '201.92', '203.8', '199.96', '206.24', '204.12', '233.92', '230.68', '226.4', '221.6', '226.68', '226.56', '225.6', '223.72', '220.44', '223.64', '225.52', '223.96', '228.0', '227.44', '224.4', '223.32', '220.08', '220.2', '221.8', '218.08', '218.08', '216.96']

# Converting the strings to floats
results = [ float(s) for s in signal]
mean = np.mean(results)
std = np.std(results)


threshold = -1.5 
outlier = []
new_list = [0 for k in range(len(results))] 

for i, value in enumerate(results):
    z = (value - mean) / std

    if float(z) < threshold:
        outlier.append(value)
        new_list[i] = new_list[i-1]
    else:
        new_list[i] = value
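
For what it's worth, the same replacement can also be written without an explicit loop; here is a small sketch using pandas (assuming results and threshold as defined above), where the forward fill also handles consecutive outliers:

import pandas as pd

s = pd.Series(results)
z = (s - s.mean()) / s.std(ddof=0)                    # ddof=0 matches np.std above
new_list = s.mask(z < threshold).ffill().tolist()     # outliers become NaN, then take the previous value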


Answered By - Gaspard Blanchet
Answer Checked By - Willingham (PHPFixing Volunteer)
Copyright © PHPFixing