AlphaPy
A Data Science Pipeline in Python
The Framework
• The Model Pipeline is the common code that will generate
a model for any classification or regression problem.
• The Domain Pipeline is the code required to generate the
training and test data; it transforms raw data from a feed
or database into canonical form.
• Both pipelines have configuration files using the YAML
format.
• The output of the whole process is a Model Object, which
is persistent; it can be saved and loaded for analysis.
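The persistence idea above can be sketched with the standard-library pickle module. The dictionary layout, key names, and metric value below are hypothetical placeholders, not AlphaPy's actual model class or results.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model: dict, path: Path) -> None:
    """Serialize the model object to disk."""
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path: Path) -> dict:
    """Restore a previously saved model object."""
    with path.open("rb") as f:
        return pickle.load(f)


# Hypothetical contents: settings from the YAML configuration plus
# placeholder evaluation results (not real measurements).
model = {
    "specs": {"model_type": "classification", "algorithms": ["RF", "XGB"]},
    "metrics": {"RF": {"roc_auc": 0.0}},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "model.pkl"
    save_model(model, target)
    restored = load_model(target)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.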
Model Pipeline
Components
• scikit-learn: machine learning in Python
• pandas: Python data analysis library
• NumPy: numerical computing package for Python
• spark-sklearn: scikit-learn integration package for Spark
• skflow: scikit-learn wrapper for Google TensorFlow
• SciPy: scientific computing library for Python
• imbalanced-learn: resampling techniques for imbalanced data sets
• xgboost: scalable and distributed gradient boosting
Predictive Models
• Binary Classification: Classify elements of a given
set into two groups.
• Multiclass Classification: Classify elements of a
given set into more than two groups.
• Regression: Predict real values of a given set.
Classification
• AdaBoost
• Extra Trees
• Google TensorFlow
• Gradient Boosting
• K-Nearest Neighbors
• Logistic Regression
• Support Vector Machine (including Linear)
• Naive Bayes (including Multinomial)
• Radial Basis Functions
• Random Forests
• XGBoost Binary and Multiclass
Regression
• Extra Trees
• Gradient Boosting
• K-Nearest Neighbors
• Linear Regression
• Random Forests
• XGBoost
Data Sampling
• Different techniques to handle unbalanced classes
• Undersampling
• Oversampling
• Combined Sampling (SMOTE)
• Ensemble Sampling
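As a minimal sketch of the two simplest techniques listed above, random oversampling and random undersampling can be written in plain Python. The function name `resample` and its parameters are hypothetical; this is not the imbalanced-learn API, and it does not cover SMOTE or ensemble sampling.

```python
import random
from collections import Counter, defaultdict


def resample(rows, labels, strategy="over", seed=0):
    """Balance classes by random over- or undersampling.

    strategy="over":  sample smaller classes with replacement up to the largest class size.
    strategy="under": sample larger classes without replacement down to the smallest class size.
    """
    if strategy not in ("over", "under"):
        raise ValueError("strategy must be 'over' or 'under'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) > target:
            chosen = rng.sample(members, target)  # undersample
        else:
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra              # oversample (no-op when already at target)
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
over_rows, over_labels = resample(rows, labels, "over")
under_rows, under_labels = resample(rows, labels, "under")
```

Resampling is normally applied to the training split only, so that the test data keep their original class balance.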
Feature Engineering
• Imputation
• Row Statistics and Distributions
• Clustering and PCA
• Standard Scaling (e.g., mean-centering)
• Interactions (n-way)
• Treatments (see below)
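Three of the steps listed above (imputation, standard scaling, and n-way interactions) can be illustrated with the standard library alone. The helper names are hypothetical, and the choices of median imputation and product interactions are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    fill = median(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def standard_scale(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def interactions(columns, n=2):
    """Build n-way product features from a dict of name -> column."""
    result = {}
    for names in combinations(sorted(columns), n):
        product = [1.0] * len(columns[names[0]])
        for name in names:
            product = [p * v for p, v in zip(product, columns[name])]
        result["*".join(names)] = product
    return result


raw = {"a": [1.0, None, 3.0], "b": [2.0, 4.0, 6.0]}
clean = {name: impute_median(col) for name, col in raw.items()}
scaled = {name: standard_scale(col) for name, col in clean.items()}
crossed = interactions(clean, n=2)
```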
Treatments
• Some features require special treatment, for example, a date column
that is split into separate columns for month, day, and year.
• Treatments are specified in the configuration file with the feature
name, the treatment function, and its parameters.
• The slide's example (a YAML listing, not reproduced in this transcript)
applies a runs test to six features.
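A minimal sketch of the treatment mechanism, assuming a configuration that maps a feature name to a treatment function name and its parameters. The names `split_date`, `TREATMENTS`, `apply_treatments`, and the `trade_date` column are hypothetical, and the dictionary stands in for the YAML entry; this is not AlphaPy's configuration syntax.

```python
from datetime import date


def split_date(values, parts=("year", "month", "day")):
    """Treatment: split ISO date strings into one column per requested part."""
    parsed = [date.fromisoformat(v) for v in values]
    return {part: [getattr(d, part) for d in parsed] for part in parts}


# Registry of available treatment functions (hypothetical).
TREATMENTS = {"split_date": split_date}

# Hypothetical configuration: feature name -> (treatment function name, parameters).
config = {"trade_date": ("split_date", {"parts": ("year", "month", "day")})}


def apply_treatments(frame, config):
    """Return a copy of frame (dict of columns) with the treated columns appended."""
    out = dict(frame)
    for feature, (func_name, params) in config.items():
        new_columns = TREATMENTS[func_name](frame[feature], **params)
        for suffix, column in new_columns.items():
            out[f"{feature}_{suffix}"] = column
    return out


frame = {"trade_date": ["2016-01-04", "2016-02-15"], "close": [10.0, 11.0]}
treated = apply_treatments(frame, config)
```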
Feature Transformation
Feature Processing
Encoders
• Factorization
• One-Hot
• Ordinal
• Binary
• Helmert Contrast
• Sum Contrast
• Polynomial Contrast
• Backward Difference Contrast
• Simple Hashing
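To make three of the encoders listed above concrete, here is a small pure-Python sketch of factorization, one-hot, and binary encoding. The function names are hypothetical, and encoder libraries may number categories or order bits differently.

```python
def factorize(values):
    """Map each distinct value to an integer code in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, list(codes)


def one_hot(values):
    """One 0/1 column per distinct value."""
    encoded, categories = factorize(values)
    return {cat: [1 if c == i else 0 for c in encoded]
            for i, cat in enumerate(categories)}


def binary_encode(values):
    """Write each integer code in base 2, one column per bit (most significant first)."""
    encoded, categories = factorize(values)
    width = max(1, (len(categories) - 1).bit_length())
    return {f"bit_{b}": [(c >> (width - 1 - b)) & 1 for c in encoded]
            for b in range(width)}


colors = ["red", "green", "red", "blue"]
codes, categories = factorize(colors)
hot = one_hot(colors)
bits = binary_encode(colors)
```

One-hot encoding adds one column per category, while binary encoding needs only about log2 of the category count, which matters for high-cardinality features.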
Feature Selection
• Univariate selection based on the percentile of
highest feature scores
• Scoring functions for both classification and
regression, e.g., ANOVA F-value or chi-squared
statistic
• Recursive Feature Elimination (RFE) with Cross-Validation (CV),
with a configurable scoring function and step size
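The univariate approach above can be sketched by scoring each feature with a one-way ANOVA F-statistic and keeping the top share of features. This is a simplified stand-in written with the standard library: the function names are hypothetical, the "top share" rule only approximates a percentile cutoff, and RFE is not covered.

```python
import math
from statistics import mean


def anova_f(feature, labels):
    """One-way ANOVA F-statistic of a numeric feature grouped by class label."""
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)
    n, k = len(feature), len(groups)
    grand = mean(feature)
    between = within = 0.0
    for members in groups.values():
        center = mean(members)
        between += len(members) * (center - grand) ** 2
        within += sum((x - center) ** 2 for x in members)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_percent(columns, labels, percent):
    """Names of the highest-scoring features, keeping roughly `percent` of them."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    keep = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


labels = [0, 0, 0, 1, 1, 1]
columns = {
    "signal": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # class means differ strongly
    "weak": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],    # class means differ slightly
    "noise": [5.0, 1.0, 3.0, 4.0, 2.0, 3.0],   # identical class means
}
top_one = select_top_percent(columns, labels, percent=33)
top_two = select_top_percent(columns, labels, percent=66)
```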
Grid Search
• Full or Randomized Distributed Grid Search with
subsampling (Spark if available)
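The difference between a full and a randomized grid search can be shown in a few lines of plain Python. The parameter names and the `toy_score` function are hypothetical; in practice the score would come from cross-validation, and this sketch does not cover subsampling or distribution over Spark.

```python
import random
from itertools import product


def full_grid(param_grid):
    """Every combination of the listed parameter values."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[n] for n in names))]


def randomized_grid(param_grid, n_iter, seed=0):
    """A random subset of the full grid, drawn without replacement."""
    candidates = full_grid(param_grid)
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_iter, len(candidates)))


def search(candidates, score_fn):
    """Return the best-scoring parameter set and its score."""
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at max_depth=5, n_estimators=100."""
    return -abs(params["max_depth"] - 5) - abs(params["n_estimators"] - 100) / 100


param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
everything = full_grid(param_grid)
sampled = randomized_grid(param_grid, n_iter=4)
best_params, best_score = search(everything, toy_score)
```

The randomized variant evaluates a fixed budget of candidates, so its cost stays constant as the grid grows, at the price of possibly missing the best combination.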
Ensembles

Blending and Stacking
Best Model Selection
Model Evaluation
• Metrics
• Calibration Plot
• Confusion Matrix
• Learning Curve
• ROC Curve
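For reference, the quantities behind the confusion matrix and the ROC curve can be computed directly for a binary problem. The labels and scores below are made-up toy values for illustration only, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Counts for binary labels, returned as (tn, fp, fn, tp)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """Area under the ROC curve: the chance that a random positive is scored
    above a random negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not results from the presentation.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)
```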
Domains
• A domain encapsulates functionality that operates on specific
kinds of schema or subject matter.
• The purpose of the Domain Pipeline is to prepare data for the
Model Pipeline. For example, both market and sports data are
time series data but must be structured differently for prediction.
• Market data consist of standard primitives such as open, high,
low, and close; the latter three are postdictive and cause data
leakage. Leaders and laggards must be identified and possibly
column-shifted, which is handled by the Model Pipeline.
• Sports data are typically structured into a match or game format
after gathering team and player data.
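The leakage point about high, low, and close can be illustrated with a one-day column shift: when predicting at the open of day t, only the previous day's high, low, and close are known. The `shift` helper and the price values are hypothetical.

```python
def shift(column, periods=1):
    """Lag a column so that row t holds the value from row t - periods."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    pad = min(periods, len(column))
    return [None] * pad + column[: max(len(column) - periods, 0)]


# Made-up daily bars for illustration.
bars = {
    "open": [10.0, 10.5, 11.0],
    "high": [10.8, 11.2, 11.5],
    "low": [9.9, 10.4, 10.7],
    "close": [10.6, 11.0, 11.1],
}

# The open is known at prediction time; the other three are lagged by one day.
features = {"open": bars["open"]}
for name in ("high", "low", "close"):
    features[f"{name}_lag1"] = shift(bars[name], 1)
```

The first row has no prior day, so its lagged values are None and that row would typically be dropped before training.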
Market Domain
1. Suppose we have five years of history for a group
of stocks, each stock represented by rows of time
series data on a daily basis.
2. We want to create a model that predicts whether or
not a stock will generate a given return over the
next n days, where n = the forecast period.
3. The goal is to generate canonical training and test
data for the model pipeline, so we need to apply a
series of transformations to the raw stock data.
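Steps 2 and 3 above can be sketched as a forward-return label followed by a chronological train/test split. The function names, the return threshold, and the prices are assumptions for illustration; the transformations AlphaPy actually applies are not specified here.

```python
def forward_return_label(close, n, min_return):
    """1 if the close n days ahead is at least min_return above today's close, else 0.

    The last n rows have no future price yet and are labeled None.
    """
    labels = []
    for t, price in enumerate(close):
        if t + n >= len(close):
            labels.append(None)
        else:
            labels.append(1 if close[t + n] / price - 1.0 >= min_return else 0)
    return labels


def chronological_split(rows, n_test):
    """Keep time order: the most recent n_test rows become the test set."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return rows[:-n_test], rows[-n_test:]


# Made-up closing prices for one stock.
close = [100.0, 102.0, 101.0, 106.0, 104.0]
labels = forward_return_label(close, n=2, min_return=0.03)

# Drop rows without a label, then split without shuffling.
rows = [(price, label) for price, label in zip(close, labels) if label is not None]
train, test = chronological_split(rows, n_test=1)
```

Splitting by time rather than at random keeps future observations out of the training set.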
Market Pipeline
Feature Definition Language
• Along with treatments, we defined a Feature
Definition Language (FDL) that would make it easy
for data scientists to define formulas and functions.
• Features are applied to groups, so feature sets are
uniformly applied across multiple frames.
• The features are represented by variables, and
these variables map to functions with parameters.
FDL Example
• Suppose we want to use the 50-day moving average (MA)
in our model, as we believe that it has predictive power for
a stock’s direction.
• The moving average function ma has two parameters: a
feature (column) name and a time period.
• To apply the 50-day MA, we can simply join the function
name with its parameters, separated by “_”, or
ma_close_50.
• If we want to use an alias, then we can define cma to be
the equivalent of ma_close and get cma_50.
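The naming scheme described above can be sketched as a small resolver that expands aliases, splits a variable name on "_", and calls the mapped function. The `FUNCTIONS` and `ALIASES` tables and the `resolve` and `evaluate` helpers are hypothetical, and this parser handles only the three-part form function_column_period; AlphaPy's own FDL parser may work differently.

```python
def ma(column, period):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(column)):
        if i + 1 < period:
            out.append(None)
        else:
            window = column[i + 1 - period : i + 1]
            out.append(sum(window) / period)
    return out


FUNCTIONS = {"ma": ma}         # function name -> implementation
ALIASES = {"cma": "ma_close"}  # alias -> function name joined with a column


def resolve(variable):
    """Expand an alias prefix, then split into (function name, column, period)."""
    head, _, rest = variable.partition("_")
    if head in ALIASES:
        variable = f"{ALIASES[head]}_{rest}"
    func_name, column, period = variable.split("_")
    return func_name, column, int(period)


def evaluate(variable, frame):
    """Compute the feature named by `variable` from a dict of columns."""
    func_name, column, period = resolve(variable)
    return FUNCTIONS[func_name](frame[column], period)


# A 3-day window keeps the toy data short; ma_close_50 resolves the same way.
frame = {"close": [1.0, 2.0, 3.0, 4.0, 5.0]}
direct = evaluate("ma_close_3", frame)
aliased = evaluate("cma_3", frame)
```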
Stock Pipeline Example
• Develop a model to predict days with ranges that
are greater than average.
• We will use both random forests and gradient
boosting.
• Get daily data from Yahoo over the past few years
to train our model.
• Define technical analysis features with FDL.
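One way to define the target for this example is to compare each day's high-low range with the average range of the preceding days. The slides do not say which average is used, so the trailing window here is an assumption, and the function name and prices are hypothetical.

```python
def large_range_labels(high, low, period):
    """1 when today's high-low range exceeds the average range of the
    previous `period` days, else 0; None until enough history exists."""
    ranges = [h - l for h, l in zip(high, low)]
    labels = []
    for t, today in enumerate(ranges):
        if t < period:
            labels.append(None)
        else:
            average = sum(ranges[t - period : t]) / period
            labels.append(1 if today > average else 0)
    return labels


# Made-up prices for illustration.
high = [11.0, 12.0, 13.0, 15.0, 14.0]
low = [10.0, 10.0, 12.0, 11.0, 13.5]
target = large_range_labels(high, low, period=3)
```

Because the label uses the same day's high and low, the features paired with it should come from earlier days only.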
Stocks Configuration
Model Configuration
• Data Section
• Model Section
• Features Section
• Other Sections
Start Stocks Pipeline
Get Stock Prices
Apply Variables
Create Train/Test Files
Transform Features
Fit Model
Cross-Validate
Generate Metrics
Calibration [Train]
Calibration [Test]
Learning Curves
Feature Importances
Random Forest ROC
XGBoost ROC
Results
• We have identified some features that predict large-range
days. This is important for determining whether or not to
deploy an automated system on any given day.
• Results are consistent across training and test data.
• The learning curves show that results may improve
with more data.
Use Cases
• AlphaPy was created based on experimentation
with many Kaggle competitions, so it can generate
models for any classification or regression problem.
• AlphaPy has been designed to be a toolkit for all
data scientists.
• The AlphaPy framework has specific pipelines for
creating market and sports models.