SlideShare a Scribd company logo
1 of 16
Download to read offline
Airfare prediction using
Machine Learning with Apache Spark
on 1 billion observed airfares daily
AGIFORS RM 2016
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com
In business since 2000
150 Airlines Customers
11 Airports and
several OTAs Customers
7 offices worldwide
5
5000-6000 revenue managers
login to our platform every week
Leading provider of Airfare
Intelligence Solutions to
the Aviation Industry
Delivers actionable information
based on huge amount of freshly
collected and historical data
https://www.youtube.com/watch?v=h9cQTooY92E
Pharos: life analytics
Airfare Collection and Analytics
Online
Airfare Data
Collection
Data Processing
and Modelling Altus: historical
analytics
Data Feeds
Collecting 1 billion a day airfares
Reached 1bn/day airfares
on 7th of April 2016
Conservative projected
growth based on leads
-
500,000,000.00
1,000,000,000.00
1,500,000,000.00
2,000,000,000.00
2,500,000,000.00
3,000,000,000.00
3,500,000,000.00
Airfare observations daily
Observations Daily Extrapolated Observations Daily
Data collection doubling time ~7-12 months
Reached 1bn/day airfares
on 7th of April 2016
Conservative projected
growth based on leads
100,000.00
1,000,000.00
10,000,000.00
100,000,000.00
1,000,000,000.00
10,000,000,000.00
Airfare observations daily
Observations Daily Extrapolated Observations Daily
Infare technology stack
2015
2016+
Infare technology stack
2016+
Data processing: Apache Spark
Message streaming: Kafka/Kinesis
BigData storage: Hadoop/S3
Microservices: C#.Net/Akka Spray
Real time analytics:
MsSql/Cassandra
Machine Learning:
PySpark + Scikit Learn
Tested on 6-8bn airfares a day
Reaching soon a full market coverage:
how to utilize it?
Infare DataCenter
Altus: historicalData Feeds Granular Data
Access API
(life + historical
queries to DB)
Prediction and
Analytics API
(all models
presented later)
Pharos: life data
+ prediction
Researched prediction since 2012, however accuracy requires larger market coverage.
Estimated that at 5bn airfares/day is the required coverage for launch of the final product.
Prediction: minimum future price
+ API access
Prediction: price evolution
+ API access
Developing Prediction at Scale
• Tens to hundreds of millions of unique
trips observed daily
• Tens to hundreds observed prices per
trip
• Clustering price vectors
• Training model per cluster
• 10000-50000 models
• Training should take 2-3h to enable
daily or real time update
Prediction of highly multivariate time series
Drawing depicts trivial case in 2 dim and 3 models.
In reality there are tens of thousands clusters in > 300 dim space
Each point is representing
n-dim vector time series
Cluster the time series
(after dimensionality
reduction reducing sparsity)
Train ML models on the
data within respective cluster
Remarks regarding modelling
+
• Requires careful feature selection
• Dimensionality reduction of time series space done using
polynomial fitting or inverse exponential series fitting
• Transforms the price vectors into a parameters space
𝑓: 𝑃 ↦ Θ
• Clustering of time series projection Θ using k-means or
Gaussian Mixture Model
• ARIMA formulated as Linear Regression trained on P space:
𝐴𝑅𝐼𝑀𝐴 0, 1, 𝑛 ≡ 𝒚 = 𝑿𝛽 + 𝛼, 𝑤ℎ𝑒𝑟𝑒 dim 𝑿 = ∙, 𝑛
• For some clusters Support Vector Regression
with Radial Basis Function Kernel
• Quantize the continuous co-domain to finite states drawn from data
• Requires in-memory parallel processing, using Scikit Learn on PySpark
could be solved as Blind Source Separation or Machine Learning problem
Future research:
estimating competitors’ demand curves
Looking for a partner Airline to pilot this research project
Airline’s own
historical and
current demand
curves
Estimate of
competitor’s current
and future demand
curves
Infare’s historical
and current
market prices
Question to audience
What do you think is the
most important product?
1) Granular life
and historical data
access API
3) Estimating
competitors’
booking curves
2) Price Prediction in
Pharos + API
THANK YOU!
Please contact to us if you would
like to collaborate in research
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com

More Related Content

What's hot

Normalization PRESENTATION
Normalization PRESENTATIONNormalization PRESENTATION
Normalization PRESENTATIONbit allahabad
 
Introduction to DBMS(For College Seminars)
Introduction to DBMS(For College Seminars)Introduction to DBMS(For College Seminars)
Introduction to DBMS(For College Seminars)Naman Joshi
 
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...Animesh Chaturvedi
 
Nested Queries Lecture
Nested Queries LectureNested Queries Lecture
Nested Queries LectureFelipe Costa
 
BINARY TREE REPRESENTATION.ppt
BINARY TREE REPRESENTATION.pptBINARY TREE REPRESENTATION.ppt
BINARY TREE REPRESENTATION.pptSeethaDinesh
 
Inner join and outer join
Inner join and outer joinInner join and outer join
Inner join and outer joinNargis Ehsan
 
K- Nearest Neighbor Approach
K- Nearest Neighbor ApproachK- Nearest Neighbor Approach
K- Nearest Neighbor ApproachKumud Arora
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1Kumar
 
Aggregate functions
Aggregate functionsAggregate functions
Aggregate functionssinhacp
 
Data Structure - Elementary Data Organization
Data Structure - Elementary  Data Organization Data Structure - Elementary  Data Organization
Data Structure - Elementary Data Organization Uma mohan
 
Binary search algorithm
Binary search algorithmBinary search algorithm
Binary search algorithmmaamir farooq
 
Presentation on Elementary data structures
Presentation on Elementary data structuresPresentation on Elementary data structures
Presentation on Elementary data structuresKuber Chandra
 

What's hot (20)

SQL logical operators
SQL logical operatorsSQL logical operators
SQL logical operators
 
Normalization PRESENTATION
Normalization PRESENTATIONNormalization PRESENTATION
Normalization PRESENTATION
 
Introduction to DBMS(For College Seminars)
Introduction to DBMS(For College Seminars)Introduction to DBMS(For College Seminars)
Introduction to DBMS(For College Seminars)
 
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...
Minimum Spanning Tree (MST), Kruskal's algorithm and Prim's Algorithm, and th...
 
Nested Queries Lecture
Nested Queries LectureNested Queries Lecture
Nested Queries Lecture
 
BINARY TREE REPRESENTATION.ppt
BINARY TREE REPRESENTATION.pptBINARY TREE REPRESENTATION.ppt
BINARY TREE REPRESENTATION.ppt
 
Inner join and outer join
Inner join and outer joinInner join and outer join
Inner join and outer join
 
Pipelining in computer architecture
Pipelining in computer architecturePipelining in computer architecture
Pipelining in computer architecture
 
K- Nearest Neighbor Approach
K- Nearest Neighbor ApproachK- Nearest Neighbor Approach
K- Nearest Neighbor Approach
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1
 
Aggregate functions
Aggregate functionsAggregate functions
Aggregate functions
 
SQL Views
SQL ViewsSQL Views
SQL Views
 
Complexity of Algorithm
Complexity of AlgorithmComplexity of Algorithm
Complexity of Algorithm
 
Query processing
Query processingQuery processing
Query processing
 
Data Structure - Elementary Data Organization
Data Structure - Elementary  Data Organization Data Structure - Elementary  Data Organization
Data Structure - Elementary Data Organization
 
Machine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-offMachine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-off
 
Binary search algorithm
Binary search algorithmBinary search algorithm
Binary search algorithm
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler Design
 
Compiler Design Unit 3
Compiler Design Unit 3Compiler Design Unit 3
Compiler Design Unit 3
 
Presentation on Elementary data structures
Presentation on Elementary data structuresPresentation on Elementary data structures
Presentation on Elementary data structures
 

Viewers also liked

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapWithTheBest
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Building Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteBuilding Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteThe Hive
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data ScienceInfoFarm
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Spark Summit
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 

Viewers also liked (20)

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Geo Python16 keynote
Geo Python16 keynoteGeo Python16 keynote
Geo Python16 keynote
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara Prathap
 
Twitter sentiment analysis
Twitter sentiment analysisTwitter sentiment analysis
Twitter sentiment analysis
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Big Data Usecases
Big Data UsecasesBig Data Usecases
Big Data Usecases
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Building Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteBuilding Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom White
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data Science
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 

Similar to Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions

The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...confluent
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkWenrui Meng
 
Development Infographic
Development InfographicDevelopment Infographic
Development InfographicRealMassive
 
Webinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingCREATE-NET
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016Paolo Missier
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeSingleStore
 
Supply Chain Management
Supply Chain ManagementSupply Chain Management
Supply Chain ManagementKori Bori
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
THIERRYCV DATASCIENCE
THIERRYCV DATASCIENCETHIERRYCV DATASCIENCE
THIERRYCV DATASCIENCEthierry bema
 
THIERRYCV DATASCIENCE
THIERRYCV DATASCIENCETHIERRYCV DATASCIENCE
THIERRYCV DATASCIENCEthierry bema
 
BigDataFest_ Building Modern Data Streaming Apps
BigDataFest_  Building Modern Data Streaming AppsBigDataFest_  Building Modern Data Streaming Apps
BigDataFest_ Building Modern Data Streaming Appsssuser73434e
 
big data fest building modern data streaming apps
big data fest building modern data streaming appsbig data fest building modern data streaming apps
big data fest building modern data streaming appsTimothy Spann
 
The Wattminder Vision2009
The Wattminder Vision2009The Wattminder Vision2009
The Wattminder Vision2009solarMD
 
Exposing the Cost of Performance Hidden in the Cloud
Exposing the Cost of Performance Hidden in the CloudExposing the Cost of Performance Hidden in the Cloud
Exposing the Cost of Performance Hidden in the CloudNeil Gunther
 
Parallel Processing Technique for Time Efficient Matrix Multiplication
Parallel Processing Technique for Time Efficient Matrix MultiplicationParallel Processing Technique for Time Efficient Matrix Multiplication
Parallel Processing Technique for Time Efficient Matrix MultiplicationIJERA Editor
 

Similar to Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions (20)

The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Development Infographic
Development InfographicDevelopment Infographic
Development Infographic
 
Webinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computing
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil Dahlke
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Supply Chain Management
Supply Chain ManagementSupply Chain Management
Supply Chain Management
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
THIERRYCV DATASCIENCE
THIERRYCV DATASCIENCETHIERRYCV DATASCIENCE
THIERRYCV DATASCIENCE
 
THIERRYCV DATASCIENCE
THIERRYCV DATASCIENCETHIERRYCV DATASCIENCE
THIERRYCV DATASCIENCE
 
BigDataFest_ Building Modern Data Streaming Apps
BigDataFest_  Building Modern Data Streaming AppsBigDataFest_  Building Modern Data Streaming Apps
BigDataFest_ Building Modern Data Streaming Apps
 
big data fest building modern data streaming apps
big data fest building modern data streaming appsbig data fest building modern data streaming apps
big data fest building modern data streaming apps
 
The Wattminder Vision2009
The Wattminder Vision2009The Wattminder Vision2009
The Wattminder Vision2009
 
VINEYARD Overview - ARC 2016
VINEYARD Overview - ARC 2016VINEYARD Overview - ARC 2016
VINEYARD Overview - ARC 2016
 
Exposing the Cost of Performance Hidden in the Cloud
Exposing the Cost of Performance Hidden in the CloudExposing the Cost of Performance Hidden in the Cloud
Exposing the Cost of Performance Hidden in the Cloud
 
Parallel Processing Technique for Time Efficient Matrix Multiplication
Parallel Processing Technique for Time Efficient Matrix MultiplicationParallel Processing Technique for Time Efficient Matrix Multiplication
Parallel Processing Technique for Time Efficient Matrix Multiplication
 

Recently uploaded

GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisAreesha Ahmad
 
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...yogeshlabana357357
 
Fun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfFun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfhoangquan21999
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxGlendelCaroz
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfbyp19971001
 
MODERN PHYSICS_REPORTING_QUANTA_.....pdf
MODERN PHYSICS_REPORTING_QUANTA_.....pdfMODERN PHYSICS_REPORTING_QUANTA_.....pdf
MODERN PHYSICS_REPORTING_QUANTA_.....pdfRevenJadePalma
 
Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloChristian Robert
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxArunLakshmiMeenakshi
 
TEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdfTEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdfmarcuskenyatta275
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Fabiano Dalpiaz
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfSuchita Rawat
 
Electricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentsElectricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentslevieagacer
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfPharmatech-rx
 
Isolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxIsolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxGOWTHAMIM22
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeAreesha Ahmad
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpSérgio Sacani
 
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...Sérgio Sacani
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionAreesha Ahmad
 
NuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdfNuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdfpablovgd
 
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfFORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfSuchita Rawat
 

Recently uploaded (20)

GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of Asepsis
 
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
 
Fun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfFun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdf
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdf
 
MODERN PHYSICS_REPORTING_QUANTA_.....pdf
MODERN PHYSICS_REPORTING_QUANTA_.....pdfMODERN PHYSICS_REPORTING_QUANTA_.....pdf
MODERN PHYSICS_REPORTING_QUANTA_.....pdf
 
Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptx
 
TEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdfTEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdf
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdf
 
Electricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentsElectricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 students
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 
Isolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxIsolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptx
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
 
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interaction
 
NuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdfNuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdf
 
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfFORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
 

Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions

  • 1. Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily AGIFORS RM 2016 Josef Habdank 20th of May, 2016 Lead Data Scientist & Data Platform Architect jha@infare.com www.infare.com
  • 2. In business since 2000 150 Airlines Customers 11 Airports and several OTAs Customers 7 offices worldwide 5 5000-6000 revenue managers login to our platform every week Leading provider of Airfare Intelligence Solutions to the Aviation Industry Delivers actionable information based on huge amount of freshly collected and historical data https://www.youtube.com/watch?v=h9cQTooY92E
  • 3. Pharos: life analytics Airfare Collection and Analytics Online Airfare Data Collection Data Processing and Modelling Altus: historical analytics Data Feeds
  • 4. Collecting 1 billion a day airfares Reached 1bn/day airfares on 7th of April 2016 Conservative projected growth based on leads - 500,000,000.00 1,000,000,000.00 1,500,000,000.00 2,000,000,000.00 2,500,000,000.00 3,000,000,000.00 3,500,000,000.00 Airfare observations daily Observations Daily Extrapolated Observations Daily
  • 5. Data collection doubling time ~7-12 months Reached 1bn/day airfares on 7th of April 2016 Conservative projected growth based on leads 100,000.00 1,000,000.00 10,000,000.00 100,000,000.00 1,000,000,000.00 10,000,000,000.00 Airfare observations daily Observations Daily Extrapolated Observations Daily
  • 7. Infare technology stack 2016+ Data processing: Apache Spark Message streaming: Kafka/Kinesis BigData storage: Hadoop/S3 Microservices: C#.Net/Akka Spray Real time analytics: MsSql/Cassandra Machine Learning: PySpark + Scikit Learn Tested on 6-8bn airfares a day
  • 8. Reaching soon a full market coverage: how to utilize it? Infare DataCenter Altus: historicalData Feeds Granular Data Access API (life + historical queries to DB) Prediction and Analytics API (all models presented later) Pharos: life data + prediction Researched prediction since 2012, however accuracy requires larger market coverage. Estimated that at 5bn airfares/day is the required coverage for launch of the final product.
  • 9. Prediction: minimum future price + API access
  • 11. Developing Prediction at Scale • Tens to hundreds of millions of unique trips observed daily • Tens to hundreds observed prices per trip • Clustering price vectors • Training model per cluster • 10000-50000 models • Training should take 2-3h to enable daily or real time update
  • 12. Prediction of highly multivariate time series Drawing depicts trivial case in 2 dim and 3 models. In reality there are tens of thousands clusters in > 300 dim space Each point is representing n-dim vector time series Cluster the time series (after dimensionality reduction reducing sparsity) Train ML models on the data within respective cluster
  • 13. Remarks regarding modelling + • Requires careful feature selection • Dimensionality reduction of time series space done using polynomial fitting or inverse exponential series fitting • Transforms the price vectors into a parameters space 𝑓: 𝑃 ↦ Θ • Clustering of time series projection Θ using k-means or Gaussian Mixture Model • ARIMA formulated as Linear Regression trained on P space: 𝐴𝑅𝐼𝑀𝐴 0, 1, 𝑛 ≡ 𝒚 = 𝑿𝛽 + 𝛼, 𝑤ℎ𝑒𝑟𝑒 dim 𝑿 = ∙, 𝑛 • For some clusters Support Vector Regression with Radial Basis Function Kernel • Quantize the continuous co-domain to finite states drawn from data • Requires in-memory parallel processing, using Scikit Learn on PySpark
  • 14. could be solved as Blind Source Separation or Machine Learning problem Future research: estimating competitors’ demand curves Looking for a partner Airline to pilot this research project Airline’s own historical and current demand curves Estimate of competitor’s current and future demand curves Infare’s historical and current market prices
  • 15. Question to audience What do you think is the most important product? 1) Granular life and historical data access API 3) Estimating competitors’ booking curves 2) Price Prediction in Pharos + API
  • 16. THANK YOU! Please contact to us if you would like to collaborate in research Josef Habdank 20th of May, 2016 Lead Data Scientist & Data Platform Architect jha@infare.com www.infare.com