SPARK MACHINE LEARNING
Certification Course Academic Year (2017-2018)
Done by:
K Teja Sreenivas
INTRODUCTION:
– Machine learning is a type of artificial intelligence (AI) that
allows software applications to become more accurate in
predicting outcomes without being explicitly programmed.
The basic premise of machine learning is to build
algorithms that receive input data and use statistical
learning to predict an output value within an acceptable
range.
– Machine learning algorithms are often categorized as
supervised or unsupervised.
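As a minimal sketch of the supervised case, the snippet below fits a straight line to a few labelled points by ordinary least squares and then predicts an output for an unseen input. The data values and the function names `fit_line` and `predict` are invented for illustration and do not come from the course data set.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical labelled examples: each input has a known output (here y = 2x + 1)
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)
print(predict(model, 4.0))
```

An unsupervised algorithm, by contrast, would receive only `train_x` and look for structure in it without any known outputs.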
MACHINE LEARNING TYPES:
LIFE CYCLE IN DESIGNING A
MACHINE LEARNING MODEL
1. Data collection
2. Data processing
3. Feature Engineering
4. Model Building
5. Model Evaluation
6. Model Deployment
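The stages up to evaluation can be sketched as a chain of small functions. This is an illustrative toy in plain Python, not the Spark code used in this deck: the CSV text, the column names `x1`, `x2`, `y`, and the mean-label baseline model are all assumptions made to keep the example short.

```python
import csv
import io

# Hypothetical raw data standing in for a collected CSV file
RAW_CSV = """x1,x2,y
0.2,0.8,0.50
0.5,0.5,0.40
,0.3,0.10
1.0,0.0,0.30
"""


def collect(text):
    """Data collection: read rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Data processing: drop incomplete rows and convert strings to floats."""
    complete = [r for r in rows if all(v != "" for v in r.values())]
    return [{k: float(v) for k, v in r.items()} for r in complete]


def engineer(rows):
    """Feature engineering: split each row into a feature vector and a label."""
    return [([r["x1"], r["x2"]], r["y"]) for r in rows]


def build(examples):
    """Model building: a baseline that always predicts the mean training label."""
    mean_label = sum(label for _, label in examples) / len(examples)
    return lambda features: mean_label


def evaluate(model, examples):
    """Model evaluation: mean absolute error of the predictions."""
    return sum(abs(model(f) - label) for f, label in examples) / len(examples)


examples = engineer(process(collect(RAW_CSV)))
model = build(examples)
# Evaluated on the training examples only to keep the sketch short;
# a real project would hold out separate test data.
print(len(examples), evaluate(model, examples))
```

Deployment is left out because it depends on the target environment.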
SPARK FOR MACHINE LEARNING:
• Spark is a distributed data processing engine that is often used in place of Hadoop MapReduce. It is not a
file system; it typically reads its data from storage such as HDFS. Big Data workloads run across clusters of
networked machines and are essential in several industries. The broad use of Hadoop and MapReduce shows
how this technology is constantly evolving, and the growing use of Apache Spark is testament to this fact.
• Apache Spark offers stronger capabilities for Big Data applications than older Big Data
technologies such as Hadoop MapReduce. The Apache
Spark features are as follows:
1. Holistic framework
2. Speed
3. Easy to use
4. Enhanced support
PROBLEM STATEMENT:
Predict the annual return of
stock-picking weight sets whose
performance was simulated on
US stock market historical data.
DATA SET ATTRIBUTE INFORMATION:
• The inputs are the weights of the stock-picking concepts, as follows:
X1=the weight of the Large B/P concept
X2=the weight of the Large ROE concept
X3=the weight of the Large S/P concept
X4=the weight of the Large Return Rate in the last quarter concept
X5=the weight of the Large Market Value concept
X6=the weight of the Small Systematic Risk concept
• The outputs are the investment performance indicators (normalized), as follows:
Y1=Annual Return
Y2=Excess Return
Y3=Systematic Risk
Y4=Total Risk
Y5=Abs. Win Rate
Y6=Rel. Win Rate
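To connect these attributes to the code in this deck, the sketch below maps X1 to X6 and Y1 onto the column names that the PySpark code selects, and splits one record into a feature list and a label. The helper `split_row` and the record values are hypothetical. The deck's code also feeds `systematic_risks` to the model; it is left out here because the slide lists Systematic Risk as an output (Y3).

```python
# Column names as they appear in the PySpark code of this deck
INPUT_COLUMNS = {
    "X1": "Large_BnP",
    "X2": "Large_ROE",
    "X3": "Large_SnP",
    "X4": "Large_Return_Rate_last_quarter",
    "X5": "Large_Market_Value",
    "X6": "Small_systematic_Risk",
}
LABEL_COLUMN = "Annual_Return"  # Y1


def split_row(row):
    """Separate one record (a dict keyed by column name) into features and label."""
    features = [float(row[name]) for name in INPUT_COLUMNS.values()]
    return features, float(row[LABEL_COLUMN])


# Hypothetical record; the values are invented, and strings mimic raw CSV fields
row = {
    "Large_BnP": "0.5",
    "Large_ROE": "0.5",
    "Large_SnP": "0",
    "Large_Return_Rate_last_quarter": "0",
    "Large_Market_Value": "0",
    "Small_systematic_Risk": "0",
    "Annual_Return": "0.12",
}

features, label = split_row(row)
print(features, label)
```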
TERMINOLOGY:
• P/B ratio: The price-to-book ratio, or P/B ratio, is a financial ratio used to compare a company's current market price to its
book value. It is also sometimes known as the market-to-book ratio. The B/P concept in the data set is the inverse: book value relative to price.
• ROE: Return on equity (ROE) is the amount of net income returned as a percentage of shareholder equity. Return on
equity measures a corporation's profitability by revealing how much profit a company generates with the money
shareholders have invested.
• S&P 500: The S&P 500 index tracks the stocks of 500 of the largest corporations by market capitalization listed on the New York
Stock Exchange or Nasdaq. Standard & Poor's intention is to provide a single figure that gives a quick look at the stock
market and economy. The S/P concept in the data set most likely denotes a stock's sales-to-price ratio rather than this index.
• Return Rate: A rate of return is the gain or loss on an investment over a specified time period, expressed as a percentage
of the investment's cost. Gains on investments are defined as income received plus any capital gains realized on the sale of
the investment.
• Market value: The amount for which something can be sold on a given market.
• Systematic Risk: Systematic risk is the risk inherent to the entire market or market segment. Systematic risk, also known
as “undiversifiable risk,” “volatility,” or “market risk,” affects the overall market, not just a particular stock or industry. This
type of risk is both unpredictable and impossible to completely avoid.
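The three ratio definitions above translate directly into arithmetic. The following sketch uses hypothetical function names and made-up figures; it only restates the definitions given in the bullets.

```python
def price_to_book(market_price, book_value):
    """P/B ratio: current market price relative to book value (same basis for both)."""
    return market_price / book_value


def return_on_equity(net_income, shareholder_equity):
    """ROE: net income as a percentage of shareholder equity."""
    return 100.0 * net_income / shareholder_equity


def rate_of_return(cost, sale_value, income=0.0):
    """Rate of return: income plus capital gain, as a percentage of the cost."""
    return 100.0 * (income + sale_value - cost) / cost


# Invented figures for illustration only
print(price_to_book(50.0, 25.0))            # price is twice the book value
print(return_on_equity(15.0, 100.0))        # percent
print(rate_of_return(200.0, 220.0, 10.0))   # percent
```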
SOFTWARE TOOLS USED:
• SPARK
• SPYDER
• ANACONDA
• JUPYTER
• PYTHON
• VIRTUAL MACHINE
• HDFS
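from pyspark import SparkContext, SQLContext

# Reuse the running SparkContext (the pyspark shell exposes it as sc) or start one
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Data collection: read the training CSV, taking column names from the header row
data = sqlContext.read.csv('/home/tej/Documents/ML with spark/train.csv', header=True, sep=',')
data.show(n=5)

# Keep the input columns and the label column (Annual_Return)
X_train = data.select('Large_BnP', 'Large_ROE', 'Large_SnP',
                      'Large_Return_Rate_last_quarter', 'Large_Market_Value',
                      'Small_systematic_Risk', 'systematic_risks', 'Annual_Return')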
# Data processing: the CSV reader loads every column as a string, so cast each one to float
X_train = X_train.select(
    X_train.Large_ROE.cast('float'), X_train.Large_Return_Rate_last_quarter.cast('float'),
    X_train.Large_Market_Value.cast('float'), X_train.Small_systematic_Risk.cast('float'),
    X_train.systematic_risks.cast('float'), X_train.Large_BnP.cast('float'),
    X_train.Large_SnP.cast('float'), X_train.Annual_Return.cast('float'))

# Feature engineering: pack the input columns into a single vector column
from pyspark.ml.feature import VectorAssembler, VectorIndexer
assembler = VectorAssembler(
    inputCols=['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
               'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks'],
    outputCol='features')
X_train = assembler.transform(X_train)

# Treat features with at most 7 distinct values as categorical and index them
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=7).fit(X_train)
X_train = featureIndexer.transform(X_train)
# Model building: fit a linear regression that predicts Annual_Return from the indexed features
from pyspark.ml.regression import LinearRegression
linear_reg = LinearRegression(labelCol='Annual_Return', featuresCol='indexedFeatures')
linear_reg_model = linear_reg.fit(X_train)
# Load the test set and prepare it the same way as the training set
test_data = sqlContext.read.csv('/home/tej/Documents/ML with spark/test.csv', header=True, sep=',')
X_test = test_data.select('Large_BnP', 'Large_ROE', 'Large_SnP',
                          'Large_Return_Rate_last_quarter', 'Large_Market_Value',
                          'Small_systematic_Risk', 'systematic_risks', 'Annual_Return')
X_test = X_test.select(
    X_test.Large_ROE.cast('float'), X_test.Large_Return_Rate_last_quarter.cast('float'),
    X_test.Large_Market_Value.cast('float'), X_test.Small_systematic_Risk.cast('float'),
    X_test.systematic_risks.cast('float'), X_test.Large_BnP.cast('float'),
    X_test.Large_SnP.cast('float'), X_test.Annual_Return.cast('float'))

# Reuse the assembler and the VectorIndexer fitted on the training data, so the
# test features are encoded exactly as the model saw them during training
X_test = assembler.transform(X_test)
X_test = featureIndexer.transform(X_test)
# Model evaluation: predict on the test set and compare predictions with the actual labels
linear_predictions = linear_reg_model.transform(X_test)
linear_predictions.show()
linear_predictions.select('Annual_Return', 'prediction').show()
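The VectorIndexer step in the code above can be pictured with a plain-Python sketch. It is a simplification written for this explanation, not Spark's implementation: features with at most `max_categories` distinct values are treated as categorical and replaced by an index, here assigned in ascending value order (an assumption). The sketch also shows why a mapping learned on training data should be reused for test data, since a mapping fitted separately on the test rows can disagree with it.

```python
def fit_indexer(rows, max_categories):
    """Learn, per feature position, a value -> index map for low-cardinality features."""
    n_features = len(rows[0])
    maps = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {value: index for index, value in enumerate(distinct)}
    return maps


def transform(rows, maps):
    """Replace categorical values by their learned index; keep other features as-is."""
    return [
        [float(maps[j][v]) if j in maps else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical feature rows: column 0 takes few values, column 1 is continuous
train = [[0.0, 0.13], [0.5, 0.27], [1.0, 0.31], [0.5, 0.48]]
test = [[0.5, 0.22], [1.0, 0.35]]

train_maps = fit_indexer(train, max_categories=3)
test_maps = fit_indexer(test, max_categories=3)

print(transform(test, train_maps))  # encoded consistently with training
print(transform(test, test_maps))   # different encoding for the same rows
```

With the training mapping, only column 0 is indexed. Fitted on the two test rows alone, both columns look categorical and the same values receive different codes.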
CONCLUSION:
• The final output shows that a linear regression model trained on this data set
predicts annual returns with less than 0.1 unit
error on average.
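An "error on average" figure of this kind is usually a mean absolute error (MAE) or a root mean squared error (RMSE) over the test set. The sketch below computes both in plain Python. The label and prediction values are invented and are not the results of this project.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between labels and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical Annual_Return labels and model predictions
actual = [0.10, 0.25, 0.40]
predicted = [0.15, 0.20, 0.50]

print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

In PySpark the same metrics can be obtained from `pyspark.ml.evaluation.RegressionEvaluator` by setting `metricName` to `"mae"` or `"rmse"` and passing the label and prediction column names.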
KEY LEARNING:
• We have learnt the basic uses of machine learning and the use of Spark
in implementing a machine learning model.
• The various phases involved in designing a machine learning model were
understood and implemented using a linear regression model.
THANK YOU!
