PHPFixing
Showing posts with label prediction. Show all posts

Friday, October 7, 2022

[FIXED] How to check for the increasing and decreasing trend in a non-time series data

October 07, 2022     prediction, r, statistics, tidyverse

Issue

I have the following data frame:

dat <- structure(list(peptide_name = c("P2", "P2", "P2", "P2", "P2", 
"P2", "P4", "P4", "P4", "P4", "P4", "P4", "P1", "P1", "P1", "P1", 
"P1", "P1", "P3", "P3", "P3", "P3", "P3", "P3"), dose = c("0mM", 
"0.3mM", "1mM", "3mM", "10mM", "20mM", "0mM", "0.3mM", "1mM", 
"3mM", "10mM", "20mM", "0mM", "0.3mM", "1mM", "3mM", "10mM", 
"20mM", "0mM", "0.3mM", "1mM", "3mM", "10mM", "20mM"), prolif_score = c(1, 
1.174927114, 1.279883382, 1.752186589, 1.994169096, 2.358600583, 
1, 1.046454768, 1.339853301, 1.293398533, 1.026894866, 1.17264410515205, 
1, 0.928020566, 0.920308483, 1.071979434, 1.195372751, 1.524421594, 
1, 1.293233083, 1.483709273, 1.468671679, 1.192982456, 0.463659148
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))

The plot looks like this:

[Plot: prolif_score vs. dose, faceted by peptide, with the adjusted R² in each panel title]

What I want is an indicator that can differentiate between the upward trends (P2, P1) and the non-upward ones (P4, P3). As you can see, the R² of the linear model is not useful for this: the R² for P3 is positive just like for P2 and P1.

How can I do that in R?

This is the code I have to create that plot:

library(tidyverse)
library(broom)
library(ggpubr)

lm_rsq_dat <- dat %>% 
    mutate(dose = as.numeric(gsub("mM", "", dose))) %>% 
    group_by(peptide_name) %>% 
    do(model = glance(lm(prolif_score ~ dose, data = .))) %>% 
    unnest(model) %>% 
    arrange(desc(adj.r.squared)) %>%
    dplyr::select(peptide_name, adj.r.squared) %>% 
    print(n = 100)  
  

# Plot --------------------------------------------------------------------
plot_dat <- dat %>% 
  left_join(lm_rsq_dat, by = "peptide_name") %>%
  mutate(r.squared = formatC(adj.r.squared, format = "e", digits = 2)) %>% 
  mutate(npeptide_name = paste0(peptide_name, " (R=", r.squared, ")"))

nspn <- plot_dat %>% 
  dplyr::select(peptide_name, npeptide_name, adj.r.squared) %>% 
  arrange(match(peptide_name, lm_rsq_dat$peptide_name)) %>% 
  unique() %>% 
  pull(npeptide_name)


plot_dat <- plot_dat %>% 
  mutate(npeptide_name = factor(npeptide_name, levels =   nspn))


end_dat <- plot_dat %>% 
  filter(dose == "20mM")

  ggline(plot_dat,
         y = "prolif_score", x = "dose",
         color = "npeptide_name", 
         size = 1, 
         facet.by = "npeptide_name", scales = "free_y",
         palette = get_palette("npg", length(unique(dat$peptide_name)))
  ) +
  xlab("Dose") +
  ylab("Prolif. Score") +
  grids(linetype = "dashed")  +
  rremove("legend") + 
theme(axis.text.x=element_text(angle = 60, hjust = 0.5, vjust = 0.5, size = 12)) 

Solution

Both the Kendall and Spearman correlation tests assess rank-based correlation and therefore measure the degree to which one value changes monotonically with another. The coefficient can be obtained by simply running cor(x, y, method = "kendall").

library(tidyverse)
  
dat <- structure(list(peptide_name = c("P2", "P2", "P2", "P2", "P2", "P2", "P4", "P4", "P4", "P4", "P4", "P4", "P1", "P1", "P1", "P1", "P1", "P1", "P3", "P3", "P3", "P3", "P3", "P3"), dose = c("0mM", "0.3mM", "1mM", "3mM", "10mM", "20mM", "0mM", "0.3mM", "1mM", "3mM", "10mM", "20mM", "0mM", "0.3mM", "1mM", "3mM", "10mM", "20mM", "0mM", "0.3mM", "1mM", "3mM", "10mM", "20mM"), prolif_score = c(1, 1.174927114, 1.279883382, 1.752186589, 1.994169096, 2.358600583, 1, 1.046454768, 1.339853301, 1.293398533, 1.026894866, 1.17264410515205, 1, 0.928020566, 0.920308483, 1.071979434, 1.195372751, 1.524421594, 1, 1.293233083, 1.483709273, 1.468671679, 1.192982456, 0.463659148)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))

dat_proc <- dat %>% 
  mutate(dose = parse_number(dose),
         unit = "mM",
         peptide = parse_number(peptide_name)) 

dat_proc %>% 
  group_split(peptide_name) %>% 
  map(~cor(.x$prolif_score, .x$dose, method = "kendall")) %>%
  map(data.frame) %>% 
  map(rename, kendall = 1) %>% 
  bind_rows(.id = "peptide") %>% 
  mutate(peptide = as.numeric(peptide)) %>% 
  left_join(dat_proc, .) %>% 
  ggplot(aes(factor(dose), prolif_score, color = kendall)) +
  geom_point() +
  geom_text(aes(x = 0, y = 2, label = paste0("kendall correlation \n coefficient = ", kendall)), hjust = -0.2) +
  geom_line(aes(group = peptide)) +
  facet_wrap(~peptide, ncol = 2)
#> Joining, by = "peptide"

Created on 2022-03-10 by the reprex package (v2.0.1)
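
Not part of the original answer, but for just the per-peptide coefficients (without the plot) a more compact dplyr variant is possible. This sketch assumes the tidyverse is loaded and dat is defined as in the reprex above; the 0.5 cutoff for calling a trend "upward" is an illustrative choice, not a statistical rule.

dat %>% 
  mutate(dose_num = parse_number(dose)) %>%   # "0.3mM" -> 0.3
  group_by(peptide_name) %>% 
  summarise(tau = cor(dose_num, prolif_score, method = "kendall"),
            trend = ifelse(tau > 0.5, "upward", "non-upward"))

With this data, P2 and P1 come out at tau = 1.0 and 0.6 while P4 and P3 give 0.2 and -0.2, which matches the upward/non-upward split described in the question.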



Answered By - Dan Adams
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

Thursday, October 6, 2022

[FIXED] What is the best way to perform value estimation on a dataset with discrete, continuous, and categorical variables?

October 06, 2022     feature-engineering, prediction, python, regression, statistics

Issue

What is the best approach to this regression problem, in terms of performance as well as accuracy? Would feature importance be helpful in this scenario? And how do I process this large range of data?

Please note that I am not an expert on any of this, so I may have bad information or theories about why things/methods don't work.


The Data: Each item has an id and various attributes. Most items share the same attributes; however, there are a few special items with item-specific attributes. An example would look something like this:

item = {
  "item_id": "AMETHYST_SWORD",
  "tier_upgrades": 1,  # (0-1)
  "damage_upgrades": 15,  # (0-15)
     ...
  "stat_upgrades": 5  # (0-5)
}

The relationship between any attribute and the value of the item is linear: if the level of an attribute increases, so does the value, and vice versa. However, an upgrade at level 1 is not necessarily worth half of an upgrade at level 2; the value added by each level increase differs. The value of each upgrade is not constant between items, nor is the price of an item without upgrades. Every attribute is capped at some integer, but the cap is not the same for all attributes.

As an item gets higher levels of upgrades, it is also more likely to have other high-level upgrades, which is why the price curve starts to have a steeper slope at upgrade level 10+.

[Figure: roughly linear relationship between upgrade level and price]

Collected Data: I've collected a bunch of data on the prices of these items with various combinations of these upgrades. Note that there will never be data for every single combination of upgrades, which is why I must bring some sort of prediction into this problem.

As far as the economy & pricing goes, high-tier, low-drop-chance items that cannot be bought outright from a shop are priced by pure supply and demand. Middle-tier items that have a fixed cost to unlock/buy, however, will usually settle at a bit over the cost to acquire them.

Some upgrades are binary (0 or 1). As shown below, almost all points where tier_upgrades == 0 overlap with the bottom half of tier_upgrades == 1, which I think may cause problems for any type of regression.

[Figure: tier_upgrades vs. price, showing overlapping data]


Attempts made so far: I've tried linear regression, K-Nearest Neighbor search, and a custom algorithm (more on that below).


Regression: It works, but with a high amount of error. Due to the nature of the data I'm working with, many of the features are either 1 or 0 and/or overlap a lot. From my understanding, this creates a lot of noise in the model and degrades its accuracy. I'm also unsure how well it would scale to multiple items, since each is valued independently of the others. Aside from that, regression should in theory work, because the different attributes affect the value of an item linearly.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import numpy as np

x = df.drop(["id", "adj_price"], axis=1)
y = df["adj_price"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=69)

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)  # fit on the training split only; fitting on all of x would leak the test set

y_pred = regr.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = np.mean(np.absolute(y_pred - y_test))
print(f"RMSE: {rmse} MAE: {mae}")

K-Nearest Neighbors: This has also worked, but not all the time. Sometimes I don't have enough data for one item, which forces the search to pick a very different item and throws off the value completely. There are also performance concerns: it is quite slow to generate an outcome. This example is written in JS, using the nearest-neighbor package. Note: the price is not included in the item object, but I add it when I collect data, since it is the price paid for the item. The price is only used to read off the value after the fact; it is not part of the KNN search itself, which is why it is not in fields.

const nn = require("nearest-neighbor");

var items = [
  {
    item_id: "AMETHYST_SWORD",
    tier_upgrades: 1,
    damage_upgrades: 15,
    stat_upgrades: 5,
    price: 1800000
  },
  {
    item_id: "AMETHYST_SWORD",
    tier_upgrades: 0,
    damage_upgrades: 0,
    stat_upgrades: 0,
    price: 1000000
  },
  {
    item_id: "AMETHYST_SWORD",
    tier_upgrades: 0,
    damage_upgrades: 8,
    stat_upgrades: 2,
    price: 1400000
  },
];
 
var query = {
  item_id: "AMETHYST_SWORD",
  tier_upgrades: 1,
  damage_upgrades: 10,
  stat_upgrades: 3
};

var fields = [
  { name: "item_id", measure: nn.comparisonMethods.word },
  { name: "tier_upgrades", measure: nn.comparisonMethods.number },
  { name: "damage_upgrades", measure: nn.comparisonMethods.number },
  { name: "stat_upgrades", measure: nn.comparisonMethods.number },
];
 
nn.findMostSimilar(query, items, fields, function(nearestNeighbor, probability) {
  console.log(query);
  console.log(nearestNeighbor);
  console.log(probability);
});

Averaged distributions: Below is a box chart showing the distribution of prices for each level of damage_upgrades. This algorithm finds, for each attribute, the average price of all items where attribute == item[attribute], and then takes the mean of those averages. It is a relatively fast way to calculate the value, much faster than a KNN search. However, the spread within a given distribution is often too big, which increases the error, and an unequal(ish) distribution of items across sets increases it further. The main problem is that an item with max upgrades in all but a few attributes is placed in the same set as genuinely low-value items, further disrupting the average. An example:

low_value = {
  item_id: "AMETHYST_SWORD",
  tier_upgrades: 0,
  damage_upgrades: 1,
  stat_upgrades: 0,
  price: 1_100_000
}
# May be placed in the same set as a high value item:
high_value = {
  item_id: "AMETHYST_SWORD",
  tier_upgrades: 0,
  damage_upgrades: 15,
  stat_upgrades: 5,
  price: 1_700_000
}
# This spread in each set is responsible for any inaccuracies in the prediction, because the algorithm does not take into account any other attributes/upgrades.

[Figure: box chart of the price distribution per damage_upgrades level]

Here is the Python code for this algorithm. df is a regular dataframe with the item_id, price, and the attributes.

# For each attribute, average the price of all items that share this
# attribute level, then take the mean of those averages.
total = 0
features = {
    'tier_upgrades': 1,
    'damage_upgrades': 15,
    'stat_upgrades': 5,
}
for f in features:
    subset = df[df[f] == features[f]]
    total += np.mean(subset["adj_price"])

print("Estimated value:", total / len(features))

If anyone has any ideas, please, let me know!


Solution

  1. For modeling right-skewed targets such as prices, I'd try distributions other than Gaussian, like gamma or log-normal.

  2. The algorithm can be made less restrictive. GBDTs offer the best accuracy trade-off for such tabular data and should be able to capture some non-linearities. They even accept categorical variables encoded as numerical vectors (label encoding). XGBoost has more APIs, but LightGBM is more accurate and faster. (See the sketch after this list.)

  3. You may use a submodel to predict the binary feature ("probability of tier upgrades"): predictions from a classifier can improve the main model compared to using the binary feature as it is (a smooth predictor with no missing values vs. a discrete one with missing values).

  4. You can improve model accuracy on small datasets by using cross-validation with a relatively large number of folds (20 or more), which leaves more data for training.

  5. Try to stay within Python for all ML tasks; it is by far the most appropriate language (and yes, you can later easily host Python models in production).
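
Below is a minimal sketch of points 1, 2, and 4 combined: a LightGBM regressor with a gamma objective, scored with many-fold cross-validation. This is not the answerer's code; the dataframe is synthetic stand-in data shaped like the item example from the question.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the collected item data (hypothetical values).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "tier_upgrades": rng.integers(0, 2, n),     # binary (0-1)
    "damage_upgrades": rng.integers(0, 16, n),  # (0-15)
    "stat_upgrades": rng.integers(0, 6, n),     # (0-5)
})
# Price grows with upgrade levels, with multiplicative (right-skewed) noise.
df["adj_price"] = (1_000_000
                   + 200_000 * df["tier_upgrades"]
                   + 30_000 * df["damage_upgrades"]
                   + 50_000 * df["stat_upgrades"]) * rng.lognormal(0, 0.1, n)

X = df.drop(columns=["adj_price"])
y = df["adj_price"]

# Gamma objective for a strictly positive, right-skewed target (point 1),
# GBDT model (point 2), 20-fold CV to save more data for training (point 4).
model = lgb.LGBMRegressor(objective="gamma", n_estimators=300, learning_rate=0.05)
scores = cross_val_score(model, X, y, cv=20, scoring="neg_mean_absolute_error")
print("20-fold CV MAE:", -scores.mean())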



Answered By - mirekphd
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

Wednesday, July 13, 2022

[FIXED] How do I predict the near future value correctly in python?

July 13, 2022     flask, lstm, prediction, python, web-deployment

Issue

I need help. I am currently deploying my LSTM model with Flask in Python. I'm trying to write the prediction results to a new CSV file, but the file ends up full of repeated values, and I can't tell which line of code is wrong. Please point me in the right direction and give me some tips, thanks a lot!

model.py

import numpy as np
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from pickle import dump




def create_dataset(dataset, look_back=1):
    # Build supervised pairs: X is a window of look_back values,
    # Y is the value that immediately follows the window.
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
    
# load dataset
np.random.seed(7)
# load the dataset
dataframe = read_csv('Sales.csv', usecols=[1], engine='python', skipfooter=3)
dataset = dataframe.values
dataset = dataset.astype('float32')
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
# reshape into X=t and Y=t+1
look_back = 1
train_X, train_Y = create_dataset(train, look_back)
test_X, test_Y = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
train_X = np.reshape(train_X, (train_X.shape[0], 1, train_X.shape[1]))
test_X = np.reshape(test_X, (test_X.shape[0], 1, test_X.shape[1]))



model = Sequential()
model.add(LSTM(128, return_sequences=True ,input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(64))


model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
history = model.fit(train_X, train_Y, epochs=100, batch_size=128, validation_data=(test_X, test_Y), verbose=2, shuffle=False)


#save the model
model.save('model.h5')

app.py

from flask import Flask, make_response, request, render_template
from pandas import DataFrame
import io
from pandas import datetime
from io import StringIO
import csv
import pandas as pd
import numpy as np
import pickle
import os
from keras.models import load_model
from sklearn.preprocessing import MinMaxScaler
import datetime
from datetime import timedelta, datetime
from dateutil.relativedelta import relativedelta

app = Flask(__name__)

@app.route('/')
def form():
    return """
        <html>
            <body>
                <h1>Let's TRY to Predict..</h1>
                </br>
                </br>
                <p> Insert your CSV file and then download the Result
                <form action="/transform" method="post" enctype="multipart/form-data">
                    <input type="file" name="data_file" class="btn btn-block"/>
                    </br>
                    </br>
                    <button type="submit" class="btn btn-primary btn-block btn-large">Predict</button>
                </form>

                 <div class="ct-chart ct-perfect-fourth"></div>

            </body>
        </html>
    """

@app.route('/transform', methods=["POST"])
def transform_view():
 if request.method == 'POST':
    f = request.files['data_file']
    if not f:
        return "No file"

    
    stream = io.StringIO(f.stream.read().decode("UTF8"), newline=None)
    csv_input = csv.reader(stream)
    stream.seek(0)
    result = stream.read()
    df = pd.read_csv(StringIO(result), usecols=[1])
    
    #extract month value
    df2 = pd.read_csv(StringIO(result))
    matrix2 = df2[df2.columns[0]].to_numpy()
    list1 = matrix2.tolist()
     
    # load the model from disk
    model = load_model('model.h5')
    dataset = df.values
    dataset = dataset.astype('float32')
    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(dataset)
    dataset = np.reshape(dataset, (dataset.shape[0], 1, dataset.shape[1]))
    predict = model.predict(dataset)
    transform = scaler.inverse_transform(predict)

    X_FUTURE = 100
    transform = np.array([])
    last = dataset[-1]
    for i in range(X_FUTURE):
        curr_prediction = model.predict(np.array([last]))
        last = np.concatenate([last[1:], curr_prediction])
        transform = np.concatenate([transform, curr_prediction[0]])
        
    transform = scaler.inverse_transform([transform])[0]

    dicts = []
    curr_date = pd.to_datetime(list1[-1])
    for i in range(X_FUTURE):
        curr_date = curr_date +  relativedelta(month=1)
        dicts.append({'Predictions':transform[i], "Month": curr_date})


    new_data = pd.DataFrame(dicts).set_index("Month")
    ##df_predict = pd.DataFrame(transform, columns=["predicted value"])
          

    response = make_response(new_data.to_csv(index = True, encoding='utf8'))
    response.headers["Content-Disposition"] = "attachment; filename=result.csv"
    return response

if __name__ == "__main__":
    app.run(debug=True, port = 9000, host = "localhost")

This is the result that loaded to the new csv file

[Screenshot: result.csv with repeated prediction and month values]


Solution

Your LSTM results themselves may well be correct: the model is trained properly (though perhaps with low accuracy), so near-identical predicted values are not necessarily a mistake.

The duplicated values in the Month column, however, are a real bug. relativedelta(month=1) uses the singular month argument, which sets the month to an absolute value (January) rather than adding one month, so every generated date lands in the same month. Use the plural form, or equivalently curr_date = curr_date + pd.DateOffset(months = 1); either will produce correctly incrementing dates in your Month column.
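
A quick demonstration of the difference (a standalone snippet, not from the original answer):

import pandas as pd
from dateutil.relativedelta import relativedelta

d = pd.to_datetime("2022-07-13")
print(d + relativedelta(month=1))    # 2022-01-13: month *set* to January
print(d + relativedelta(months=1))   # 2022-08-13: one month *added*
print(d + pd.DateOffset(months=1))   # 2022-08-13: equivalent pandas offset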



Answered By - Arty
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

Friday, May 13, 2022

[FIXED] How to add predicted values in a dataframe?

May 13, 2022     append, jupyter-notebook, prediction, regression

Issue

I extended the predictions to five values from this link. Now I want to add the five new predicted values (New_Interest_Rate and New_Unemployment_Rate) so I can plot them in a new figure together with the original time series.

import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm

Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
                }

df = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

X = df[['Interest_Rate','Unemployment_Rate']] # here we have 2 variables for multiple regression. If you just want to use one variable for simple linear regression, then use X = df['Interest_Rate'] for example.Alternatively, you may add additional variables within the brackets
Y = df['Stock_Index_Price']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# prediction with sklearn
New_Interest_Rate = [2.75, 3, 4, 1, 2]
New_Unemployment_Rate = [5.3, 4, 3, 2, 1]
for i in range(len(New_Interest_Rate)):
    print (str(i+1) + ' - Predicted Stock Index Price: \n', 
           regr.predict([[New_Interest_Rate[i] ,New_Unemployment_Rate[i]]]))

# with statsmodels
X = sm.add_constant(X) # adding a constant

model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

I cannot figure out how to append them, because when I try, I get an error:

Interest_Rate=Interest_Rate.append(New_Interest_Rate)

TypeError: cannot concatenate object of type "<class 'float'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

My goal is to plot the extended predicted values. I use Jupyter Notebook. The original code comes from this link. Thank you!


Solution

Running the code you provided seems to work on my computer, though with some warning messages. The versions I'm using are Python 3.9.7, pandas 1.3.3, sklearn-pandas 2.2.0, and statsmodels 0.13.0. I saved it to a file and ran it in a terminal with "python copypastedcode.py", and got this output:

Intercept:
 1798.4039776258544
Coefficients:
 [ 345.54008701 -250.14657137]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
1 - Predicted Stock Index Price:
 [1422.86238865]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
2 - Predicted Stock Index Price:
 [1834.43795318]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
3 - Predicted Stock Index Price:
 [2430.12461156]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
4 - Predicted Stock Index Price:
 [1643.6509219]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
5 - Predicted Stock Index Price:
 [2239.33758028]
                            OLS Regression Results
==============================================================================
Dep. Variable:      Stock_Index_Price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Wed, 20 Oct 2021   Prob (F-statistic):           4.04e-11
Time:                        09:07:19   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
Interest_Rate       345.5401    111.367      3.103      0.005     113.940     577.140
Unemployment_Rate  -250.1466    117.950     -2.121      0.046    -495.437      -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

the "X does not have valid feature names..." warnings can be fixed by changing

regr.fit(X,Y)

to

regr.fit(X.values, Y.values) 

If you want to use New_Interest_rate and New_Unemployment_Rate to create the regression, then you would need Y to have 5 more corresponding stock prices. I don't think that's what you want to do if you're trying to predict stock prices from interest and unemployment rates. Here's how you would do that though:

New_Interest_Rate = [2.75, 3, 4, 1, 2]
New_Unemployment_Rate = [5.3, 4, 3, 2, 1]
New_Stock_Prices = [1,2,3,4,5]
X_new = pd.DataFrame(data={'Interest_Rate': New_Interest_Rate,'Unemployment_Rate': New_Unemployment_Rate})
Y_new = pd.DataFrame(data={'Stock_Index_Price': New_Stock_Prices})
regr = linear_model.LinearRegression()
X = X.append(X_new)
Y = Y.append(Y_new)
regr.fit(X.values, Y.values)
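
One caveat not in the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer pandas versions the equivalent append step would be:

X = pd.concat([X, X_new], ignore_index=True)
Y = pd.concat([Y, Y_new], ignore_index=True)
regr.fit(X.values, Y.values)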

And if you want to make plots, you can make a small function to get stock predictions from input arrays with something like this:

def predict_stock_price(future_interest_rate, future_unemployment_rate):
    return [regr.predict([[i ,j]])[0,0] for i,j in zip(future_interest_rate,future_unemployment_rate)]

prices = predict_stock_price(New_Interest_Rate,New_Unemployment_Rate)
print("list of predicted stock prices:",prices)

import matplotlib.pyplot as plt

predicted_stock_market = {'Month': range(13,13+len(prices)), #just to have a time axis to plot with
                         'Interest_Rate': New_Interest_Rate,
                         'Unemployment_Rate': New_Unemployment_Rate,
                         'Stock_Index_Price': prices}
predicted_df = pd.DataFrame(predicted_stock_market)
predicted_df.plot( x="Month",y="Stock_Index_Price",kind='scatter')
plt.show()


Answered By - Max Behling
Answer Checked By - Cary Denson (PHPFixing Admin)