This is a brief introduction to how R can be useful in the manufacturing sector to calculate the frequency of faults and then developing the model so that preventive maintenance can be done
This is a brief introduction to how R can be useful in the manufacturing sector to calculate the frequency of faults and then developing the model so that preventive maintenance can be done
Introduction to Pandas and Time Series Analysis [PyCon DE]Alexander Hendorf
Most data is allocated to a period or to some point in time. We can gain a lot of insight by analyzing what happened when. The better the quality and accuracy of our data, the better our predictions can become.
Unfortunately the data we have to deal with is often aggregated for example on a monthly basis, but not all months are the same, they may have 28 days, 31 days, have four or five weekends,…. It’s made fit to our calendar that was made fit to deal with the earth surrounding the sun, not to please Data Scientists.
Dealing with periodical data can be a challenge. This talk will show to how you can deal with it with Pandas.
Inspired by Runar Bjarnason's scala exchange 2017 keynote https://skillsmatter.com/skillscasts/10746-keynote-composing-programs, I made a slide deck on Functor laws. Download for better quality.
[open source] hamilton, a micro framework for creating dataframes, and its ap...Stefan Krawczyk
At Stitch Fix, we have 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand team, was in a bind. Their data generation process was causing them iteration & operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open source Python micro framework, that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler paradigm for Data Science & Data Engineering teams to create, maintain, and execute code for generating dataframes, especially when there are lots of inter-column dependencies. Hamilton does this by building a DAG of dependencies directly from Python functions defined in a special manner, which also makes unit testing and documentation easy; tune into the talk to find out how. I’ll also cover our experience migrating to it, our best practices in using it in production for over two years, along with planned extensions to make it a general purpose framework.
Company segmentation - an approach with RCasper Crause
We classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors and have similar attributes).
Best Data Science Ppt using Python
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Data Exploration with Apache Drill: Day 2Charles Givre
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course material for this presentation are available at https://github.com/cgivre/data-exploration-with-apache-drill
Introduction to Pandas and Time Series Analysis [PyCon DE]Alexander Hendorf
Most data is allocated to a period or to some point in time. We can gain a lot of insight by analyzing what happened when. The better the quality and accuracy of our data, the better our predictions can become.
Unfortunately the data we have to deal with is often aggregated for example on a monthly basis, but not all months are the same, they may have 28 days, 31 days, have four or five weekends,…. It’s made fit to our calendar that was made fit to deal with the earth surrounding the sun, not to please Data Scientists.
Dealing with periodical data can be a challenge. This talk will show to how you can deal with it with Pandas.
Inspired by Runar Bjarnason's scala exchange 2017 keynote https://skillsmatter.com/skillscasts/10746-keynote-composing-programs, I made a slide deck on Functor laws. Download for better quality.
[open source] hamilton, a micro framework for creating dataframes, and its ap...Stefan Krawczyk
At Stitch Fix, we have 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand team, was in a bind. Their data generation process was causing them iteration & operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open source Python micro framework, that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler paradigm for Data Science & Data Engineering teams to create, maintain, and execute code for generating dataframes, especially when there are lots of inter-column dependencies. Hamilton does this by building a DAG of dependencies directly from Python functions defined in a special manner, which also makes unit testing and documentation easy; tune into the talk to find out how. I’ll also cover our experience migrating to it, our best practices in using it in production for over two years, along with planned extensions to make it a general purpose framework.
Company segmentation - an approach with RCasper Crause
We classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors and have similar attributes).
Best Data Science Ppt using Python
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Data Exploration with Apache Drill: Day 2Charles Givre
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course material for this presentation are available at https://github.com/cgivre/data-exploration-with-apache-drill
Matúš Cimerman: Building AI data pipelines using PySpark, PyData Bratislava M...GapData Institute
Event description:
Exponea is full-stack Omni-channel real-time marketing cloud. In Exponea, we are extensively building practical AI applications varying from predictions or recommendations to simple simulated annealing. Regardless of application we are building, each one needs data. A lot of data that Exponea can efficiently provide.
Major issue, when building any AI application or ML model, is data preprocessing. This problem arises when you need to process vast volume datasets or high velocity data streams. We build such data pipelines mostly using Spark respectively PySpark and Python, but also many other tools are adopted.
In this talk we will go through the steps we implemented to build such pipelines. We will show you how to get Spark running easily, basic data wrangling with PySpark and Spark Streaming. In the end, we will use our data pipeline for real application and finish talk about resource managing joys and sorrows.
About speaker:
Matus Cimerman
1+y Data science @Exponea, before BI intern and other stuff @Orange.
Finishing masters FIIT STU, thesis: Data stream analysis
https://github.com/cimox
https://www.linkedin.com/in/mat%C3%BA%C5%A1-cimerman-4b08b352/
https://twitter.com/MatusCimerman
https://www.facebook.com/matus.cimerman
Registration:
@Eventbrite registration here & @Meetup.com group's event here (if you use both your seat is guarateed). +our event you can find also @Facebook here.
[Disclaimer: If you just mark "going" @Facebook event we can't guarantee your seat]
Language of the event: Python & English
------------------------------------
PyData Bratislava [Python Data Enthusiasts and Users, Data Scientists & Statisticians of all levels from Slovakia]
------------------------------------
This meetup group is for Data Scientists, Statisticians, Economists and Data Enthusiasts using Python for data analysis and data visualization. The goals are to provide Python enthusiasts a place to share ideas and learn from each other about how best to apply the language and tools to ever-evolving challenges in the vast realm of data management, processing, analytics, and visualization.
--
PyData is a group for users and developers of data analysis tools to share ideas and learn from each other. We gather to discuss how best to apply Python tools, as well as those using R and Julia, to meet the evolving challenges in data management, processing, analytics, and visualization. PyData groups, events, and conferences aim to provide a venue for users acrossall the various domains of data analysis to share their experiences and their techniques. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.
As the amount of metrics, software that produce and process them, and people involved in them continue to increase, we need better ways to organize them, to make them self-describing, and do so in a way that is consistent. Leveraging this, we can then automatically build graphs and dashboards, given a query that represents an information need, even for complicated cases. We can build richer visualizations, alerting and fault detection. This talk will introduce the concepts and related tools, demonstrate possibilities using the Graph-Explorer interface, and lay the groundwork for future work.
[DSC Europe 22] Smart approach in development and deployment process for vari...DataScienceConferenc1
During development of machine learning model about 80% of time is used for data preparation and due to data quality issues, especially when there is a need to combine data from structured and unstructured data sources. Development of smart generic data mart can reduce go to production time for new ML models. We will share creative solutions for challenges we encountered during data transfer between DWH and Data Lake, furthermore data preprocessing, development, deployment/orchestration of ML models, using python/pyspark scripts.
how to swap pi coins to foreign currency withdrawable.DOT TECH
As of my last update, Pi is still in the testing phase and is not tradable on any exchanges.
However, Pi Network has announced plans to launch its Testnet and Mainnet in the future, which may include listing Pi on exchanges.
The current method for selling pi coins involves exchanging them with a pi vendor who purchases pi coins for investment reasons.
If you want to sell your pi coins, reach out to a pi vendor and sell them to anyone looking to sell pi coins from any country around the globe.
Below is the contact information for my personal pi vendor.
Telegram: @Pi_vendor_247
What price will pi network be listed on exchangesDOT TECH
The rate at which pi will be listed is practically unknown. But due to speculations surrounding it the predicted rate is tends to be from 30$ — 50$.
So if you are interested in selling your pi network coins at a high rate tho. Or you can't wait till the mainnet launch in 2026. You can easily trade your pi coins with a merchant.
A merchant is someone who buys pi coins from miners and resell them to Investors looking forward to hold massive quantities till mainnet launch.
I will leave the telegram contact of my personal pi vendor to trade with.
@Pi_vendor_247
The secret way to sell pi coins effortlessly.DOT TECH
Well as we all know pi isn't launched yet. But you can still sell your pi coins effortlessly because some whales in China are interested in holding massive pi coins. And they are willing to pay good money for it. If you are interested in selling I will leave a contact for you. Just telegram this number below. I sold about 3000 pi coins to him and he paid me immediately.
Telegram: @Pi_vendor_247
The European Unemployment Puzzle: implications from population agingGRAPE
We study the link between the evolving age structure of the working population and unemployment. We build a large new Keynesian OLG model with a realistic age structure, labor market frictions, sticky prices, and aggregate shocks. Once calibrated to the European economy, we quantify the extent to which demographic changes over the last three decades have contributed to the decline of the unemployment rate. Our findings yield important implications for the future evolution of unemployment given the anticipated further aging of the working population in Europe. We also quantify the implications for optimal monetary policy: lowering inflation volatility becomes less costly in terms of GDP and unemployment volatility, which hints that optimal monetary policy may be more hawkish in an aging society. Finally, our results also propose a partial reversal of the European-US unemployment puzzle due to the fact that the share of young workers is expected to remain robust in the US.
how can I sell my pi coins for cash in a pi APPDOT TECH
You can't sell your pi coins in the pi network app. because it is not listed yet on any exchange.
The only way you can sell is by trading your pi coins with an investor (a person looking forward to hold massive amounts of pi coins before mainnet launch) .
You don't need to meet the investor directly all the trades are done with a pi vendor/merchant (a person that buys the pi coins from miners and resell it to investors)
I Will leave The telegram contact of my personal pi vendor, if you are finding a legitimate one.
@Pi_vendor_247
#pi network
#pi coins
#money
US Economic Outlook - Being Decided - M Capital Group August 2021.pdfpchutichetpong
The U.S. economy is continuing its impressive recovery from the COVID-19 pandemic and not slowing down despite re-occurring bumps. The U.S. savings rate reached its highest ever recorded level at 34% in April 2020 and Americans seem ready to spend. The sectors that had been hurt the most by the pandemic specifically reduced consumer spending, like retail, leisure, hospitality, and travel, are now experiencing massive growth in revenue and job openings.
Could this growth lead to a “Roaring Twenties”? As quickly as the U.S. economy contracted, experiencing a 9.1% drop in economic output relative to the business cycle in Q2 2020, the largest in recorded history, it has rebounded beyond expectations. This surprising growth seems to be fueled by the U.S. government’s aggressive fiscal and monetary policies, and an increase in consumer spending as mobility restrictions are lifted. Unemployment rates between June 2020 and June 2021 decreased by 5.2%, while the demand for labor is increasing, coupled with increasing wages to incentivize Americans to rejoin the labor force. Schools and businesses are expected to fully reopen soon. In parallel, vaccination rates across the country and the world continue to rise, with full vaccination rates of 50% and 14.8% respectively.
However, it is not completely smooth sailing from here. According to M Capital Group, the main risks that threaten the continued growth of the U.S. economy are inflation, unsettled trade relations, and another wave of Covid-19 mutations that could shut down the world again. Have we learned from the past year of COVID-19 and adapted our economy accordingly?
“In order for the U.S. economy to continue growing, whether there is another wave or not, the U.S. needs to focus on diversifying supply chains, supporting business investment, and maintaining consumer spending,” says Grace Feeley, a research analyst at M Capital Group.
While the economic indicators are positive, the risks are coming closer to manifesting and threatening such growth. The new variants spreading throughout the world, Delta, Lambda, and Gamma, are vaccine-resistant and muddy the predictions made about the economy and health of the country. These variants bring back the feeling of uncertainty that has wreaked havoc not only on the stock market but the mindset of people around the world. MCG provides unique insight on how to mitigate these risks to possibly ensure a bright economic future.
Poonawalla Fincorp and IndusInd Bank Introduce New Co-Branded Credit Cardnickysharmasucks
The unveiling of the IndusInd Bank Poonawalla Fincorp eLITE RuPay Platinum Credit Card marks a notable milestone in the Indian financial landscape, showcasing a successful partnership between two leading institutions, Poonawalla Fincorp and IndusInd Bank. This co-branded credit card not only offers users a plethora of benefits but also reflects a commitment to innovation and adaptation. With a focus on providing value-driven and customer-centric solutions, this launch represents more than just a new product—it signifies a step towards redefining the banking experience for millions. Promising convenience, rewards, and a touch of luxury in everyday financial transactions, this collaboration aims to cater to the evolving needs of customers and set new standards in the industry.
What website can I sell pi coins securely.DOT TECH
Currently there are no website or exchange that allow buying or selling of pi coins..
But you can still easily sell pi coins, by reselling it to exchanges/crypto whales interested in holding thousands of pi coins before the mainnet launch.
Who is a pi merchant?
A pi merchant is someone who buys pi coins from miners and resell to these crypto whales and holders of pi..
This is because pi network is not doing any pre-sale. The only way exchanges can get pi is by buying from miners and pi merchants stands in between the miners and the exchanges.
How can I sell my pi coins?
Selling pi coins is really easy, but first you need to migrate to mainnet wallet before you can do that. I will leave the telegram contact of my personal pi merchant to trade with.
Tele-gram.
@Pi_vendor_247
how can I sell pi coins after successfully completing KYCDOT TECH
Pi coins is not launched yet in any exchange 💱 this means it's not swappable, the current pi displaying on coin market cap is the iou version of pi. And you can learn all about that on my previous post.
RIGHT NOW THE ONLY WAY you can sell pi coins is through verified pi merchants. A pi merchant is someone who buys pi coins and resell them to exchanges and crypto whales. Looking forward to hold massive quantities of pi coins before the mainnet launch.
This is because pi network is not doing any pre-sale or ico offerings, the only way to get my coins is from buying from miners. So a merchant facilitates the transactions between the miners and these exchanges holding pi.
I and my friends has sold more than 6000 pi coins successfully with this method. I will be happy to share the contact of my personal pi merchant. The one i trade with, if you have your own merchant you can trade with them. For those who are new.
Message: @Pi_vendor_247 on telegram.
I wouldn't advise you selling all percentage of the pi coins. Leave at least a before so its a win win during open mainnet. Have a nice day pioneers ♥️
#kyc #mainnet #picoins #pi #sellpi #piwallet
#pinetwork
what is the best method to sell pi coins in 2024DOT TECH
The best way to sell your pi coins safely is trading with an exchange..but since pi is not launched in any exchange, and second option is through a VERIFIED pi merchant.
Who is a pi merchant?
A pi merchant is someone who buys pi coins from miners and pioneers and resell them to Investors looking forward to hold massive amounts before mainnet launch in 2026.
I will leave the telegram contact of my personal pi merchant to trade pi coins with.
@Pi_vendor_247
Turin Startup Ecosystem 2024 - Ricerca sulle Startup e il Sistema dell'Innov...Quotidiano Piemontese
Turin Startup Ecosystem 2024
Una ricerca de il Club degli Investitori, in collaborazione con ToTeM Torino Tech Map e con il supporto della ESCP Business School e di Growth Capital
Falcon stands out as a top-tier P2P Invoice Discounting platform in India, bridging esteemed blue-chip companies and eager investors. Our goal is to transform the investment landscape in India by establishing a comprehensive destination for borrowers and investors with diverse profiles and needs, all while minimizing risk. What sets Falcon apart is the elimination of intermediaries such as commercial banks and depository institutions, allowing investors to enjoy higher yields.
1. Final Project - %Change in Stock Price
(Technology Service Industry) Analysis
Name: Natsarankorn Kijtorntham
Packages
In [1]:
%matplotlib inline
from datetime import datetime
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup as bs
from scipy.stats import pearsonr
from patsy import dmatrix
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflat
ion_factor
import itertools
import random
import warnings
warnings.filterwarnings('ignore')
2. 1. Introduction
What is the importance of your data set?
This project scraped the data from finance.yahoo.com. The main data set was the
companies in the Technology Services sector. This analysis aims to predict the
%Change (dependent variable) of the stock price, with the OLS model. The features
are the statistical key of each stock. For example, Market Cap, P/E Ratio, Price/Sales,
Enterprise Value, ROE, ROA, etc. Since there're a lot of indices to the stock price (high
dimensionality), the significant level of independent variables as well as the subset
selection method would preliminarily filter out unnecessary variables.
Which question(s) can it help us understand?
Which indices (variables) are statistically important to the Change in percentage
of stock price in the Technology industry?
What is the magnitude of each variable against Change of stock price in the
Technology Services industry?
====================================================================
2. Data Scraping
Where and how are you getting the data?
The data set is tocks from the Technology Services Sector of Yahoo Finance
(https://finance.yahoo.com/screener/predefined/ms_technology). There're
approximately 390 companies in this data set.
Scraping Steps
Part 1:
Scrape the main dataframe df_1 of all companies containing 'Symbol',
'Name of the company', 'Price', 'Change', '% Change', 'Volume', 'Avg
Vol', 'Market Cap', 'PE Ratio'.
Part 2:
3. Get the 'href' link for each company.
Get the tables from the 'Statistical Keys' page using the modified 'href'
links, e.g. AAPL (https://finance.yahoo.com/quote/AAPL/key-statistics?
p=AAPL), MSFT (https://finance.yahoo.com/quote/MSFT/key-statistics?
p=MSFT), TSM (https://finance.yahoo.com/quote/TSM/key-statistics?
p=TSM).
Run comprehensive for loops to build another data frame, df_2 , from
the second part.
Joining two data frames ( df_1 , and df_2 ) using an index, 'Symbol'.
What data are available?
The whole data contains approximately 190 observations after omitting NAs, and
20 variables to run the full OLS model.
The dependent variable is Change which is the percentage of the stock price of
that day (Change (USD)/Price (Intraday)).
The independent variables are Price, Mkt_Cap, PEG_Ratio, PE_Ratio, PriceSales,
PriceBook, EV, EVRevenue, Payout_Ratio, Profit_Margin, Operating_Margin, ROA,
ROE, Revenue, RevenueShare, Gross_Profit, EBITDA, NItoCommon, Diluted_EPS,
Earnings_Growth_Q, Health.
EV/EBITDA is transformed into Health , a binary variable
For a company who has EV/EBITDA > industrial average, it will be indicated
as 1 = 'healthy' company
For a company who has EV/EBITDA < industrial average, it will be indicated
as 0 = 'unhealthy' company
What relationships do you expect to see in the data?
The expected relationships are both positives and negatives as shown below:
Independent Variables E(Relationships)
Market Cap +
PEG Ratio -
P/E Ratio +
Price/Sales -
Price/Book +
EV +
4. EV/Revenue -
Payout Ratio(%) +
Profit Margin(%) +
Operating Margin(%) +
ROA +
ROE +
Revenue +
Revenue/Share +
Gross Profit(%) +
EBITDA +
Net Income to Common +
Diluted EPS +
Quarterly Earnings Growth (yoy) +
Health (healthy) +
Steps:
In [2]:
# Time of data scraping
now = datetime.now()
dt = now.strftime("%d/%m/%Y %H:%M:%S")
print("This data was scraped on", dt)
# This data was scraped on 23/11/2019 22:25:07
PART 1
Getting a main dataframe (df_1)
This data was scraped on 23/11/2019 22:25:07
5. In [3]:
# Scraping the main dataframe (df_1) by setting the parameters f
or 100 counts each page with approximately 390 rows.
url = 'https://finance.yahoo.com/screener/predefined/ms_technolo
gy'
rows = np.arange(0,301,100).tolist()
# rows = [100,200,300]
url_list = []
tech_df = []
for i in rows:
r = requests.get(url, params = {'count' : '100', 'offset' :
i})
link = r.url
url_list.append(link)
for link in url_list:
df = pd.read_html(link)
tb = df[0]
tech_df.append(tb)
df_1 = pd.concat(tech_df)
In [4]:
# Setting 'Symbol' as an index
df = df_1.set_index('Symbol')
df.to_csv('df_1.csv')
In [5]:
url_list
Out[5]:
['https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=0',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=100',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=200',
'https://finance.yahoo.com/screener/predefined/ms_t
echnology?count=100&offset=300']
6. In [6]:
df.head()
In [7]:
# The dimension for the main dataframe
df.shape
PART 2
Scraping from Key-Statistics page
Out[6]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
Out[7]:
(393, 9)
7. In [8]:
# get thae 'a' tag from web element
table = []
tag = []
# get text from the main page
for url in url_list:
txt = requests.get(url).text
soup = bs(txt)
t = soup.find('div', {'id':'scr-res-table'})
table.append(t)
for i in range(0,4):
t = table[i].find_all('a')
tag.append(t)
In [9]:
# Get the href link for key statistics page of each ticker to ex
tract table
link = []
for e in range(0,4):
for i in tag[e]:
l = 'https://finance.yahoo.com'+i.get('href')
l_kstat = l.split('?')[0]+'/key-statistics?'+l.split('?'
)[1]
link.append(l_kstat)
Note
Some HTML links don't work (404) when running the code for some specific time
(around closing time of the stock market). So, the code chunk below will prevent the
error when scraping.
8. In [10]:
connection = []
for l in link:
if requests.get(l).status_code == 200:
status = ['good', l]
else: status = ['404', l]
connection.append(status)
# Get the responded links, '200', and company tickers
links = []
tickers = []
for status in range(0,len(connection)):
if connection[status][0] == 'good':
good_link = connection[status][1]
else: bad_link = connection[status][1]
links.append(good_link)
tickers.append(good_link.split('=')[1])
In [11]:
print('There are',len(links),'links that responded (200)')
There are 393 links that responded (200)
9. In [12]:
tic = time.time()
data = []
tables = (0,3,5,6,7) # This indicates the specific table in stat
istical key page used in the analysis
for url in links[:len(links)]:
for table in tables:
d = pd.read_html(url)[table]
data.append(d)
matrix = pd.concat(data)
matrix.shape
m = matrix.set_index(0)
toc = time.time()
print("Total scraping time:", (toc-tic)/60, "minutes.")
# Total scraping time: 35.854390549659726 minutes.
In [13]:
# Build a second dataframe from concatenated matrix
df_2 = pd.DataFrame()
for i in range(0,len(m),31):
m_m = m.iloc[i:i+31]
n = (i+31)/31-1
m_m.columns = [tickers[int(n)]]
df_2[tickers[int(n)]] = m_m[tickers[int(n)]]
df_2 = df_2.transpose()
In [14]:
df_2.to_csv('df_2.csv')
Total scraping time: 35.854390549659726 minutes.
10. Joining Data frames
In [15]:
df = df.join(df_2)
In [16]:
df = df.iloc[:len(df_2)]
df.shape
In [17]:
df.head()
Out[16]:
(393, 40)
11. In [18]:
df.to_csv('tech_390.csv')
====================================================================
3. Data Cleaning
Rename Variables in The Data Frame, df
Out[17]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
5 rows × 40 columns
12. In [19]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 400)
In [20]:
df = pd.read_csv('tech_390.csv')
df = df.set_index('Symbol')
# df.head()
Out[20]:
Name
Price
(Intraday)
Change
%
Change
Volume
Avg Vol
(3
month)
Symbol
AAPL Apple Inc. 261.78 -0.23 -0.09% 16.331M 25.857M
MSFT
Microsoft
Corporation
149.59 0.11 +0.07% 15.842M 22.825M
TSM
Taiwan
Semiconductor
Manufacturing
Company
Lim...
52.79 -0.19 -0.36% 4.103M 6.848M
INTC
Intel
Corporation
57.61 -0.61 -1.05% 15.69M 18.498M
CSCO
Cisco
Systems, Inc.
44.85 0.01 +0.02% 16.516M 19.124M
13. In [21]:
data = df[['% Change', 'Price (Intraday)', 'Market Cap', 'PE Rat
io (TTM)',
'PEG Ratio (5 yr expected) 1', 'Price/Sales (ttm)', '
Price/Book (mrq)',
'Enterprise Value 3', 'Enterprise Value/Revenue 3',
'Enterprise Value/EBITDA 6', 'Payout Ratio 4', 'Profi
t Margin',
'Operating Margin (ttm)', 'Return on Assets (ttm)', '
Return on Equity (ttm)',
'Revenue (ttm)', 'Revenue Per Share (ttm)', 'Gross Pr
ofit (ttm)', 'EBITDA',
'Net Income Avi to Common (ttm)', 'Diluted EPS (ttm)'
,
'Quarterly Earnings Growth (yoy)']]
data.columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
'PEG_Ratio', 'PriceSales', 'PriceBook',
'EV', 'EVRevenue',
'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
'Operating_Margin', 'ROA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBIT
DA',
'NItoCommon', 'Diluted_EPS' ,
'Earnings_Growth_Q']
data.head()
Out[21]:
Change Price Mkt_Cap PE_Ratio PEG_Ratio PriceSales PriceBook
Symbol
AAPL -0.09% 261.78 1.183T 22.02 2.04 4.55
MSFT +0.07% 149.59 1.141T 28.22 1.91 8.79
TSM -0.36% 52.79 263.261B 23.67 2.39 NaN
INTC -1.05% 57.61 250.604B 13.49 1.79 3.56
CSCO +0.02% 44.85 190.265B 17.85 1.97 3.66
14. Data Cleaning & Transformation
Converting str to float by using
replace() to replace abbreviations (T, B, M ,and K) to Scientific Notation
(e) for Mkt_Cap, EV, Revenue, Gross_Profit, EBITDA, and NItoCommon.
strip() to strip the unnecessary symbols which are ',', and '%'.
astype() to change string to float
Creating categorical variable (Binary) for company health based on industrial
average of EV/EBITDA .
In [22]:
# Check the type of variables
columns = ['Change', 'Price', 'Mkt_Cap', 'PE_Ratio',
'PEG_Ratio', 'PriceSales', 'PriceBook',
'EV', 'EVRevenue',
'EV/EBITDA', 'Payout_Ratio', 'Profit_Margin',
'Operating_Margin', 'ROA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA',
'NItoCommon', 'Diluted_EPS',
'Earnings_Growth_Q']
for i in columns:
print(type(data[i].values[0]), i)
17. In [25]:
# Creating binary variable based on 'data['EV/EBITDA'].mean()'
health = []
print('The industrial average of EV/EBITDA is' ,data['EV/EBITDA'
].mean())
for i in data['EV/EBITDA']:
if i > data['EV/EBITDA'].mean():
h = 1
else:
h = 0
health.append(h)
data['Health'] = health
# Since the new categorical data was created from 'EV/EBITDA', i
t will be taken out from the dataframe.
del data['EV/EBITDA']
In [26]:
# Drop NAs
data = data.dropna()
print('After steps of data cleaning, and manipulating, the dataf
rame used in the model has', data.shape[0],
'observations (companies) with', data.shape[1]-1, 'feature
s.')
In [27]:
a = data['Health'] == 1
print('The number of observations defined as healthy are',a.sum(
))
The industrial average of EV/EBITDA is 7.90497282608
6954
After steps of data cleaning, and manipulating, the
dataframe used in the model has 186 observations (co
mpanies) with 21 features.
The number of observations defined as healthy are 17
0
20. Table Summary
The correlation table for numeric variables indicates that positive relationships are
PE_Ratio, Earnings_Growth_Q, PriceSales, EVRevenue, and Profit_Margin,
respectively. For negative relationships variables are, Operating_Margin, Gross_Profit,
EBITDA, EV, Mkt_Cap, NItoCommon, PriceBook, PEG_Ratio, Revenue, Price,
Diluted_EPS, ROE, RevenueShare, Payout_Ratio, and ROA, respectively.
In [5]:
plt.figure(figsize=(15,15))
plt.imshow(correl) # show as image
plt.colorbar()
# To set the label on the axes
plt.xticks(range(21), variables, rotation='vertical') # Require
list of variables list creation
plt.yticks(range(21), variables)
Out[5]:
([<matplotlib.axis.YTick at 0x1c227f7c18>,
<matplotlib.axis.YTick at 0x1c227f7550>,
<matplotlib.axis.YTick at 0x1c22721550>,
<matplotlib.axis.YTick at 0x1c2275c588>,
<matplotlib.axis.YTick at 0x1c22755be0>,
<matplotlib.axis.YTick at 0x1c22765c88>,
<matplotlib.axis.YTick at 0x1c227654e0>,
<matplotlib.axis.YTick at 0x1c22aeb550>,
<matplotlib.axis.YTick at 0x1c22aebac8>,
<matplotlib.axis.YTick at 0x1c22af00f0>,
<matplotlib.axis.YTick at 0x1c22af05f8>,
<matplotlib.axis.YTick at 0x1c22af0b70>,
<matplotlib.axis.YTick at 0x1c22af8160>,
<matplotlib.axis.YTick at 0x1c22af09e8>,
<matplotlib.axis.YTick at 0x1c227650b8>,
<matplotlib.axis.YTick at 0x1c22af85c0>,
<matplotlib.axis.YTick at 0x1c22af8b38>,
<matplotlib.axis.YTick at 0x1c22afe160>,
<matplotlib.axis.YTick at 0x1c22afe668>,
<matplotlib.axis.YTick at 0x1c22afebe0>,
<matplotlib.axis.YTick at 0x1c22b06198>],
<a list of 21 Text yticklabel objects>)
22. In [6]:
plt.hist(data['Change'], bins=20)
plt.title('Histogram of % Change in Stock Price')
plt.xlabel('% Change in Stock Price')
The histogram of Change indicates that the independent variable (Y) is approximately
normally distributed.
Example for Plots of independent variables against Y
Out[6]:
Text(0.5, 0, '% Change in Stock Price')
23. In [7]:
# Enterprise Value/Revenue
plt.scatter(data['EVRevenue'], data['Change'])
plt.xlabel('Enterprise Value/Revenue')
plt.ylabel('% Change in Stock Price')
plt.title('Enterprise Value/Revenue vs Change')
From the plot above, there's no obvious direction whether it's an upward or downward
slope. However, there's a slightly non-linear relationship between this feature and
response. Hence, in further analysis, if this variable is statistically significant in the
model, the polynomial term of EV/Revenue will be generated in order to improve the
model.
Out[7]:
Text(0.5, 1.0, 'Enterprise Value/Revenue vs Change')
24. In [8]:
# Payout Ratio (%)
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter(data['Payout_Ratio'], data['Change'])
plt.xlabel('Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Payout_Ratio vs Change')
plt.subplot(122)
plt.scatter(np.log(data['Payout_Ratio']), data['Change'])
plt.xlabel('Log of Payout Ratio (%)')
plt.ylabel('% Change in Stock Price')
plt.title('log(Payout_Ratio) vs Change')
From the left panel, the distribution is dense when the payout ratio (%) is less than
200. After taking logarithm function to the observation, the scatter plot (right panel)
shows a slightly negative relationship but no non-linear relationship is detected. Taking
a log() function into the model might improve the model. Unfortunately, there's some
infinite value after taking log() in the observations. Dropping more observations (infinite
values) from the approximately existing 190 observations would potentially reduce the
model accuracy. Hence, this variable will be kept as it is.
Out[8]:
Text(0.5, 1.0, 'log(Payout_Ratio) vs Change')
25. In [9]:
# Operating Margin (%)
plt.scatter(data['Operating_Margin'], data['Change'])
plt.xlabel('Operating Margin (%)')
plt.ylabel('% Change in Stock Price')
plt.title('Operating_Margin vs Change')
The plot above shows a slight positive relationship between Operating Margin (%) and
% change in stock price.
Out[9]:
Text(0.5, 1.0, 'Operating_Margin vs Change')
26. In [10]:
# ROA (%)
plt.scatter(data['ROA'], data['Change'])
plt.xlabel('Return on Asset (%)')
plt.ylabel('% Change in Stock Price')
plt.title('ROA (%) vs Change')
The scatter plot of ROA (%) and % Change in stock price shows a negative
relationship with no evidence of non-linearity.
The Boxplot (for Binary Variable)
Out[10]:
Text(0.5, 1.0, 'ROA (%) vs Change')
27. In [11]:
plt.figure(figsize=(5,5))
sns.boxplot(data['Health'],data['Change'])
The boxplot of this binary variable Health shows a slight difference between the
mean and 50% of observations for each group where the 'healthy' companies (Health
=1) are slightly higher and have a wider range of distribution (whiskers). Hence, this
variable is potentially statistical significant in the model.
====================================================================
4. Predictive Modeling
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2410a4
38>
28. The Multicollinearity Among Variables (VIF)
Calculating the variance inflation factor
*source
(https://etav.github.io/python/vif_factor_python.html)
In [37]:
y, X_vif = dmatrices('Change ~' + 'Price+Mkt_Cap+PEG_Ratio+PE_Ra
tio+PriceSales+PriceBook+EV+EVRevenue+Payout_Ratio+Profit_Margin
+Operating_Margin+ROA+ROE+Revenue+RevenueShare+Gross_Profit+EBIT
DA+NItoCommon+Earnings_Growth_Q+Health', data = data, return_typ
e='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
30. From the result above, there're high multicollinearities (VIF > 10) in the features
among Mkt_Cap , Price/Sales , EV , EV/Revenue , Revenue ,
Gross_Profit , EBITDA , and NItoCommon . In which, these varaibles are not
independent. In the next step, the subset selection method will help filter out
highly correlated and unnecessary variables from the model.
Best Subset Selection Method
Since there're many variables (high dimension) with multicollinearity in the data
set. Including all the variables may lead to high varaince in the model. To reduce
the model variance, this method could help select the best set of variables that
yields a higher Adjusted R squared by minimizing RSS.
* source
(http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-
py.html)
31. In [13]:
y = data.Change
X = data[['Price','Mkt_Cap','PEG_Ratio', 'PE_Ratio', 'PriceSales
', 'PriceBook', 'EV', 'EVRevenue',
'Payout_Ratio', 'Profit_Margin', 'Operating_Margin', 'R
OA', 'ROE',
'Revenue', 'RevenueShare', 'Gross_Profit', 'EBITDA', 'N
ItoCommon',
'Earnings_Growth_Q', 'Health']]
X = pd.concat([X], axis=1)
X.head()
In [14]:
def processSubset(feature_set):
# Fit model on feature_set and calculate RSS
model = sm.OLS(y,X[list(feature_set)])
regr = model.fit()
RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
return {"model":regr, "RSS":RSS}
Out[13]:
Price Mkt_Cap PEG_Ratio PE_Ratio PriceSales PriceBook
Symbol
AAPL 261.78 1.183000e+12 2.04 22.02 4.55 12.85
MSFT 149.59 1.141000e+12 1.91 28.22 8.79 10.77
INTC 57.61 2.506040e+11 1.79 13.49 3.56 3.38
CSCO 44.85 1.902650e+11 1.97 17.85 3.66 5.53
ORCL 56.39 1.851010e+11 1.45 18.46 4.68 10.08
32. In [15]:
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(X.columns, k):
results.append(processSubset(combo))
models = pd.DataFrame(results)
# Choose model with the lowest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predict
ors in", (toc-tic), "seconds.")
return best_model
In [16]:
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(X.columns, k):
results.append(processSubset(combo))
models = pd.DataFrame(results)
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predict
ors in", (toc-tic), "seconds.")
return best_model
33. In [17]:
models_best = pd.DataFrame(columns=['RSS', 'model'])
tic = time.time()
for i in range(1,10):
models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic)/60, "minutes.")
Processed 20 models on 1 predictors in 0.06646728515
625 seconds.
Processed 190 models on 2 predictors in 0.4089689254
760742 seconds.
Processed 1140 models on 3 predictors in 2.546672344
2077637 seconds.
Processed 4845 models on 4 predictors in 11.26060104
3701172 seconds.
Processed 15504 models on 5 predictors in 34.6961269
3786621 seconds.
Processed 38760 models on 6 predictors in 88.0667259
6931458 seconds.
Processed 77520 models on 7 predictors in 613.451443
9105988 seconds.
Processed 125970 models on 8 predictors in 403.86032
41443634 seconds.
Processed 167960 models on 9 predictors in 419.71901
79824829 seconds.
Total elapsed time: 26.632335432370503 minutes.
36. In [20]:
print('The model has the highest Adjusted R squared at', '{0:.4f
}'.format(models_best.loc[7, "model"].rsquared_adj), 'when it ha
s', rsquared_adj.argmax() , 'variables' )
print('The model has the lowest AIC at', '{0:.4f}'.format(models
_best.loc[7, "model"].aic), 'when it has', aic.argmin() , 'varia
bles' )
print('The model has the lowest BIC at', '{0:.4f}'.format(models
_best.loc[7, "model"].bic), 'when it has', bic.argmin() , 'varia
bles' )
Out[19]:
Text(0, 0.5, 'BIC')
The model has the highest Adjusted R squared at 0.06
37 when it has 7 variables
The model has the lowest AIC at 602.5338 when it has
4 variables
The model has the lowest BIC at 625.1140 when it has
2 variables
38. Criteria
The subset selection computes or models were computed by minimizing
the RSS.
Adjusted R squared:
According to the formula above the smaller RSS would yield a higher
Adjusted R squared. As the result of the subset selection method, the 7
variable-model yields the highest adjusted R squared at 0.067, where the
variables are Price/Sales, Price/Book, EV/Revenue, Payout Ratio, Operating
Margin, ROA, and Health. There're 5 variables that are statistically significant
at p-value < 0.05, where R squared is 0.099. (The result shown below)
AIC
From the result, AIC criterion yields the model with 4 variables which are
Payout Ratio, Operating Margin, ROA, and Health. However, there's only one
variable (ROA) that is statistically significant at 95% confident interval, with R
squared equal to 0.071.
BIC
Since BIC criterion is more restricted (higher penalty term; ), it
yields a smaller model with two significant variables which are
Operating_Margin, and ROA. The R squared is 0.045.
Criteria # of Optimal Variables
7 0.099 0.064
AIC 4 0.071 0.051
BIC 2 0.045 0.035
To conclude, based on Adjusted R squared criterion, the optimal non-linear
model is OLS model with 7 variables, shown below.
( )
𝑝
𝑘
𝑝!
𝑘!(𝑝−𝑘)!
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 = 1 −𝑅2
𝑅𝑆𝑆
𝑛−𝑝−1
𝑇𝑆𝑆
𝑛−1
log(𝑛) ∗ 𝑝 ∗ σ̂2
𝑅2
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑𝑅2
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑𝑅2
40. The Optimal Linear Model (7 Variables)
In [3]:
y = data.Change
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_R
atio + Operating_Margin + ROA + Health', data = data)
m = sm.OLS(y, X_7)
m.data.xnames = X_7.design_info.column_names
m = m.fit()
print(m.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.098
Model: OLS Adj. R-squar
ed: 0.063
Method: Least Squares F-statistic:
2.774
Date: Thu, 28 Nov 2019 Prob (F-stat
istic): 0.00924
Time: 21:50:30 Log-Likeliho
od: -294.25
No. Observations: 186 AIC:
604.5
Df Residuals: 178 BIC:
630.3
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
41. VIF After The Subset Selection
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept 0.0664 0.326 0.204
0.839 -0.576 0.709
PriceSales 0.2852 0.153 1.863
0.064 -0.017 0.587
PriceBook 0.0310 0.021 1.481
0.140 -0.010 0.072
EVRevenue -0.3609 0.170 -2.124
0.035 -0.696 -0.026
Payout_Ratio -0.0020 0.001 -2.041
0.043 -0.004 -6.67e-05
Operating_Margin 0.0474 0.017 2.713
0.007 0.013 0.082
ROA -0.1509 0.039 -3.909
0.000 -0.227 -0.075
Health 0.4269 0.332 1.286
0.200 -0.228 1.082
====================================================
==========================
Omnibus: 8.080 Durbin-Watso
n: 1.953
Prob(Omnibus): 0.018 Jarque-Bera
(JB): 14.644
Skew: -0.096 Prob(JB):
0.000661
Kurtosis: 4.361 Cond. No.
493.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
42. In [36]:
y, X_vif = dmatrices('Change ~' + 'PriceSales + PriceBook + EVRe
venue + Payout_Ratio + Operating_Margin + ROA + Health', data =
data, return_type='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif.round(2).set_index('features')
As the result, the high VIF variables are eliminated. Even though the model still has
some multicollinearity beween variables, Price/Sales and EV/Revenue, it is
moderately acceptable.
Out[36]:
VIF Factor
features
Intercept 13.63
PriceSales 44.86
PriceBook 2.12
EVRevenue 51.57
Payout_Ratio 1.07
Operating_Margin 3.46
ROA 2.95
Health 1.11
43. The Non-Linear Model with Polynomial Term
Based on the data visualization above, there's an evidence that EV/Revenue
could have non-linearity.
Below, a model with polynomial term is performed, along with other selected
variables.
In [25]:
y = data.Change
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVR
evenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', da
ta = data)
m_new = sm.OLS(y, X_new)
m_new.data.xnames = X_new.design_info.column_names
m_new = m_new.fit()
print(m_new.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.115
Model: OLS Adj. R-squar
ed: 0.075
Method: Least Squares F-statistic:
2.872
Date: Wed, 27 Nov 2019 Prob (F-stat
istic): 0.00499
Time: 00:10:37 Log-Likeliho
od: -292.52
No. Observations: 186 AIC:
603.0
Df Residuals: 177 BIC:
632.1
Df Model: 8
Covariance Type: nonrobust
====================================================
=================================
coef std err t
45. Compared with linear model, this non-linear model with the polynomial term has a
higher Adjusted R squared at 0.075 ( > 0.063 ), and R squared at 0.115. This means
the explanation of variation for the dependent variable (%Change in stock price) has
been improved by the polynomial term of EV/Revenue. However, the model will be
evaluated by the cross-validation to compare the predictive performance between the
linear and non-linear model.
Prediction Accuracy Between Models
Cross-Validation: Using the random sample cross-validition with 80:20 partitioning,
and random.seed(1) to validate the model predictive power.
In [26]:
# Create a training and testing set
random.seed(1)
train = random.sample(range(0,len(data)), round(len(data)*0.8))
test = []
for n in range(0,len(data)):
if n not in train:
test.append(n)
y_training = data['Change'].iloc[train]
x_training = data[['PriceSales', 'PriceBook', 'EVRevenue', 'Payo
ut_Ratio', 'Operating_Margin', 'ROA', 'Health']].iloc[train]
y_testing = data['Change'].iloc[test]
46. In [27]:
# Build a model with training set from the best subset model (7
Variables)
y = y_training
X_7 = dmatrix('1 + PriceSales + PriceBook + EVRevenue + Payout_R
atio + Operating_Margin + ROA + Health', data = x_training)
m_7_cv = sm.OLS(y, X_7)
m_7_cv.data.xnames = X_7.design_info.column_names
m_7_cv = m_7_cv.fit()
print(m_7_cv.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.122
Model: OLS Adj. R-squar
ed: 0.079
Method: Least Squares F-statistic:
2.802
Date: Wed, 27 Nov 2019 Prob (F-stat
istic): 0.00922
Time: 00:10:37 Log-Likeliho
od: -227.28
No. Observations: 149 AIC:
470.6
Df Residuals: 141 BIC:
494.6
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept -0.0203 0.329 -0.061
0.951 -0.672 0.631
PriceSales 0.1883 0.155 1.213
0.227 -0.119 0.495
PriceBook 0.0267 0.023 1.167
47. In [28]:
# Build a model with training set from a model with polynomial t
erm (EV/Revenue^2)
y = y_training
X_new = dmatrix('1 + PriceSales + PriceBook + EVRevenue + I(EVR
evenue**2) + Payout_Ratio + Operating_Margin + ROA + Health', da
ta = x_training)
m_new_cv = sm.OLS(y, X_new)
m_new_cv.data.xnames = X_new.design_info.column_names
m_new_cv = m_new_cv.fit()
print(m_new_cv.summary())
0.245 -0.018 0.072
EVRevenue -0.2841 0.170 -1.672
0.097 -0.620 0.052
Payout_Ratio -0.0023 0.001 -2.375
0.019 -0.004 -0.000
Operating_Margin 0.0370 0.018 2.008
0.047 0.001 0.073
ROA -0.1325 0.040 -3.283
0.001 -0.212 -0.053
Health 0.7233 0.344 2.102
0.037 0.043 1.403
====================================================
==========================
Omnibus: 8.706 Durbin-Watso
n: 1.839
Prob(Omnibus): 0.013 Jarque-Bera
(JB): 15.574
Skew: 0.193 Prob(JB):
0.000415
Kurtosis: 4.536 Cond. No.
522.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
OLS Regression Results
====================================================
==========================
49. Mean Squared Error
𝑀𝑆𝐸 =
∑
𝑖=1
𝑛
( −𝑦̂ 𝑦𝑖)2
𝑛
In [29]:
# Calculate the test MSEs
x_testing = dmatrix('1+PriceSales+PriceBook+EVRevenue+Payout_Rat
io+Operating_Margin+ROA+Health',data = data.iloc[test])
predicted_7 = m_7_cv.predict(x_testing)
x_testing = dmatrix('1+PriceSales+PriceBook+EVRevenue+I(EVRevenu
e**2)+Payout_Ratio+Operating_Margin+ROA+Health',data = data.iloc
[test])
predicted_new = m_new_cv.predict(x_testing)
mse = pd.DataFrame()
mse['Actual Value'] = y_testing
mse['Predicted Value (m_7)'] = predicted_7
mse['Predicted Value (m_new)'] = predicted_new
mse['Squared Error (m_7)'] = (mse['Predicted Value (m_7)'] - mse
['Actual Value'])**2
mse['Squared Error (m_new)'] = (mse['Predicted Value (m_new)'] -
mse['Actual Value'])**2
MSE_7 = mse['Squared Error (m_7)'].sum()/len(mse)
MSE_new = mse['Squared Error (m_new)'].sum()/len(mse)
(JB): 18.892
Skew: 0.166 Prob(JB):
7.90e-05
Kurtosis: 4.713 Cond. No.
524.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
50. In [31]:
mse.T
In [32]:
print('The model test MSE for the linear model with 7 variable i
s', MSE_7)
print('The model test MSE for the model with polynomial term is'
, MSE_new)
Out[31]:
Symbol ACN AVGO IBM NOW MU AMD
Actual
Value
-0.060000 -0.100000 0.370000 0.310000 0.700000 -0.940000
Predicted
Value
(m_7)
-0.780854 -0.253737 0.194448 0.138732 -0.359416 0.184337
Predicted
Value
(m_new)
-0.790585 -0.126589 0.075529 -0.300076 -0.351424 0.357285
Squared
Error
(m_7)
0.519630 0.023635 0.030819 0.029333 1.122362 1.264133
Squared
Error
(m_new)
0.533755 0.000707 0.086713 0.372193 1.105493 1.682949
5 rows × 37 columns
The model test MSE for the linear model with 7 varia
ble is 2.1063194289258402
The model test MSE for the model with polynomial ter
m is 2.154634174218139
51. The Optimal Model Recall
According to the MSEs value above, the model with a lower error, a non-linear model
with 7 selected variables, has been recalled below.
In [4]:
print(m.summary())
OLS Regression Results
====================================================
==========================
Dep. Variable: Change R-squared:
0.098
Model: OLS Adj. R-squar
ed: 0.063
Method: Least Squares F-statistic:
2.774
Date: Thu, 28 Nov 2019 Prob (F-stat
istic): 0.00924
Time: 21:51:37 Log-Likeliho
od: -294.25
No. Observations: 186 AIC:
604.5
Df Residuals: 178 BIC:
630.3
Df Model: 7
Covariance Type: nonrobust
====================================================
================================
coef std err t
P>|t| [0.025 0.975]
----------------------------------------------------
--------------------------------
Intercept 0.0664 0.326 0.204
0.839 -0.576 0.709
PriceSales 0.2852 0.153 1.863
0.064 -0.017 0.587
PriceBook 0.0310 0.021 1.481
0.140 -0.010 0.072
EVRevenue -0.3609 0.170 -2.124
0.035 -0.696 -0.026
52. Regression Diagnosis
*Source (https://robert-alvarez.github.io/2018-06-04-diagnostic_plots/)
Payout_Ratio -0.0020 0.001 -2.041
0.043 -0.004 -6.67e-05
Operating_Margin 0.0474 0.017 2.713
0.007 0.013 0.082
ROA -0.1509 0.039 -3.909
0.000 -0.227 -0.075
Health 0.4269 0.332 1.286
0.200 -0.228 1.082
====================================================
==========================
Omnibus: 8.080 Durbin-Watso
n: 1.953
Prob(Omnibus): 0.018 Jarque-Bera
(JB): 14.644
Skew: -0.096 Prob(JB):
0.000661
Kurtosis: 4.361 Cond. No.
493.
====================================================
==========================
Warnings:
[1] Standard Errors assume that the covariance matri
x of the errors is correctly specified.
53. In [5]:
# Residual Plot
sns.residplot(m.fittedvalues, 'Change', data=data, lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw'
: 1, 'alpha': 0.8})
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
Residuals and fited value plot shows there's some nonlinearity that this linear model
couldn't capture.
Out[5]:
Text(0, 0.5, 'Residuals')
54. In [6]:
# Normal Q-Q plot
sm.qqplot(m.resid, line='45', color='cornflowerblue', alpha=0.6)
plt.title('Normal Q-Q')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Standardized Residuals')
The Q-Q plot indicates that approimately more than 85% of the residuals align along
the line, which means the errors are being normally distributed.
Out[6]:
Text(0, 0.5, 'Standardized Residuals')
55. In [7]:
# Scale-Location Plot
norm_res_abs_sqrt = np.sqrt(np.abs(m.get_influence().resid_stude
ntized_internal))
plt.scatter(m.fittedvalues, norm_res_abs_sqrt, alpha=0.5);
sns.regplot(m.fittedvalues, norm_res_abs_sqrt, scatter=False, ci
=False, lowess=True,
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plt.xlabel('Fitted values')
plt.ylabel('$sqrt{|Standardized Residuals|}$')
The residual plot shows a slightliness of uneven cloud of the residual. This model might
suffer from heteroscedasticiy.
Out[7]:
Text(0, 0.5, '$sqrt{|Standardized Residuals|}$')
56. In [8]:
# Residual and Leverage
leverage = m.get_influence().hat_matrix_diag
norm_res = m.get_influence().resid_studentized_internal
plt.scatter(leverage, norm_res, alpha=0.5);
sns.regplot(leverage, norm_res, scatter=False, ci=False, lowess=
True,
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.xlim(0, max(leverage)+0.01)
plt.ylim(-3, 5)
plt.title('Residuals vs Leverage')
plt.xlabel('Leverage')
plt.ylabel('Standardized Residuals');
The residual and leverage plot shows that there's no outlier.
57. Model Conclusion
From the training set, the model with polynomial term seems to perform better
than the linear model, due to a higher Adjusted R squared as well as R squared,
which means that the variation of %Change on stock price is better explained by
the additional polynomial term. However, the training error tends to underestimate
the testing error.
According to the test MSEs for both models, the model without polynomial term
yields a slightly lower MSE ( 2.1063 < 2.1546 ). This indicates that the model with
7 variables (non-linear) has a stronger predictive power.
The non-linear model:
𝐶ℎ𝑎𝑛𝑔𝑒 = 0.0664 + 0.2852(𝑃 𝑟𝑖𝑐𝑒𝑆𝑎𝑙𝑒𝑠) + 0.0310(𝑃 𝑟𝑖𝑐𝑒𝐵𝑜𝑜𝑘) − 0.3609(𝐸𝑉 𝑅𝑒
− 0.1509(𝑅𝑂𝐴) + 0.4269(
The Optimal Model Interpretation
Independent Variables Relationships Coefficient P-Value
Intercept + 0.0664 0.839
Price/Sales + 0.2852 0.064 (.)
Price/Book + 0.0310 0.140
EV/Revenue - 0.3609 0.035 (*)
Payout Ratio (%) - 0.0020 0.043 (*)
Operating Margin (%) + 0.0474 0.007 (**)
Return on Asset (ttm) - 0.1509 0.000 (***)
Health + 0.4269 0.200
R squared
The 9.8% variation of the percentage change in stock price is explained by
independent variables in this ordinary least squares model.
In order to improve the R sqaured value, the model might need other vairable
that is more correlated to the response. Because the stock data has high
variation as well as high randomness, besides, numeric data we might need
58. other data such as, daily news, financial report, 10-K, index companies
performance, and so on to improve the change in stock price evaluation.
Coefficients (Significant at 95% confident interval)
EVRevenue : The coefficient indicates that, on average, when the Enterprise
Value/Revenue increases by 1 unit, the stock price will decline by 0.3609%,
holding other variables constant, at p-value 0.035 < 0.05.
Payout_Ratio : On average, when the Payout Ratio increases by 1%, the
stock price will decrease by 0.002%, while holding others constant, at p-value
0.045 < 0.05.
Operating_Margin : The coefficient indicates that, on average, when the
Operating Margin (ttm) increases by 1%, the stock price will also increase by
0.0474%, holding others constant, at p-value 0.007 (< 0.05).
ROA : ROA is highly significant with p-value at 0.000. On average, when the
Return on Asset (ttm) increases by 1% while holding others constant, the
stock price will decreases by 0.1509%.
PriceSales has a p-value at 0.064 which is statistically significant at 99%
confidence level, which means it isn't highly correlated with the dependent
variable.
PriceBook , and Health are not statistically significant.
In [33]:
# # Use this code in order to predict a specific scenerio
# PriceSales =
# PriceBook =
# EVRevenue =
# Payout_Ratio =
# Operating_Margin =
# ROA =
# Health =
# data_new = [1, PriceSales, PriceBook, EVRevenue, Payout_Ratio,
Operating_Margin, ROA, Health]
# predicted = m_7_cv.predict(data_new)[0]
# predicted
====================================================================
59. 5. Conclusions
What have we seen based on the data?
From joining the two data frames (one from yahoo finance by the technology
services sector, and another from the key-statistic page), the data set had
approximately 35 numeric variables with 390 companies. After cleaning the data
the observations were reduced to approximately 190 companies.
Building a correlation table and plot, as well as taking those variables into the
scatter plot, the results showed that most of them had vague relationships (low
magnitude of correlation) between the response (% Change in stock price). Also,
there's a sign of non-linearity between the response and EV/Revenue . Hence
the polynomial term of this variable was performed in further progress.
Since there's a high dimensionality in the model, the best subset selection method
was performed. Then, some variables (21 variables including binary variable) are
selected to run the best subset selection model. According to the lowest RSS and
highest Adjusted R squared, 7 variables were selected which are Price/Sales,
Price/Book, EV/Revenue, Payout Ratio, Operating Margin, ROA, and Health
(Binary variable created based on the industrial average of EV/EBITDA)
Due to the non-linearity of EV/Revenue , the model with an additional polynomial
term was performed. The result turns out that the model's Adjusted R squared has
improved. As a result, the % Change of stock price variation is better explained by
dependent variables and the additional polynomial term, .
However, the predictive accuracy was further investigated
Model predictive accuracy:
To validate the model accuracy between these two models, the cross-validation is
performed. The data set was randomly divided into 80% of the training set and
20% of the test set. (Set the seed equal to 1). From the result, the test MSE of the
linear model is slightly lower than the model with a non-linear term. Even though
the non-linear model has a higher Adjusted R squared indicating a better
describing of the relationship between predictors and response, the linear model
has a slightly stronger predictive power.
(The model result comparision shown in the table below)
(𝐸𝑉 /𝑅𝑒𝑣𝑒𝑛𝑢𝑒)2
61. How has our understanding of the original question changed?
Recall the question(s):
Which indices (variables) are statistically important to the Change in percentage
of stock price in Technology Services industry?
The statistically significant indices are Price/Share (+), EV/Revenue (-), Payout
Ratio (%) (-), Operating Margin (%) (+), and Return on Asset (%) (-). Besides
these significant variables in the model, to determine the change in stock
price, some additional factors need to be considered. In the stock market,
there're many types of information the stock analyst could use for decision
making. For instance, reading an annual report like 10-K as well as news and
integrate with numeric data would help them gain more advantage over a
person who only relies on less source.
What is the magnitude of each variable against Change of stock price in the
Technology Services industry?
On the basis, I expected that the market capitalization would pay a significant
role with a positive magnitude as a predictor since most of the companies
that are mostly paid attention, like S&P500 have a high market capitalization.
However, this variable is not statistically significant in the model where the
dependent variable is percentage change in stock price. Also, the result of
ROA is not the same as expected. The more return on asset the more profit a
company can generate from its source. Surprisingly, this variable has negative
relationship in the model.
However, the actual relationship of EV/Revenue is the same as expected
(negative relationship). Since the EV/Revenue is used to compare the
company's revenue with the enterprise value. The lower of the multiple would
mean it's undervalued, the more attraction is drawing to the company. Also,
other variables such as Operating Margin(%), Payout Ratio(%), and
Price/Book are the same as expectation because these variables are the
indices that can draw the attention from investor (the higher the value, the
more attractive stock would gain).