Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions

Prediction using Machine Learning (ML) techniques on Big Data is challenging both computationally and at the system level. When a system processes approximately 10^9 observations per day, scalability is the prime concern. To train models rapidly across the whole multivariate space, time series vectors that exhibit significant similarities are clustered into groups. The resulting vector clusters can then be modelled using ML tools capable of coefficient estimation at massive scale (Apache Spark with Scikit Learn). The presentation describes the application of Linear Regression and of Support Vector Regression with a Radial Basis Function kernel. This approach trains models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The resulting machine learning framework is used to predict airfares as a support tool for Revenue Management systems.
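The clustering step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenter's implementation: the synthetic price curves, the polynomial degree, the number of clusters and the helper names `to_params` and `kmeans` are all hypothetical, and plain NumPy stands in for the Spark and Scikit Learn stack named in the abstract.

```python
import numpy as np

def to_params(prices, degree=2):
    """Map each price vector (one row per trip) to polynomial coefficients."""
    t = np.linspace(0.0, 1.0, prices.shape[1])
    # np.polyfit fits every column of a 2-D y at once, so pass series as columns
    return np.polyfit(t, prices.T, degree).T

def kmeans(theta, k, iters=50):
    """Plain Lloyd's k-means with farthest-point initialisation."""
    centers = [theta[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(theta - c, axis=1) for c in centers], axis=0)
        centers.append(theta[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.linalg.norm(theta[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            theta[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic data (assumption): two families of 30-point price curves
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)
rising = 100 + 80 * t**2 + rng.normal(0, 2, size=(40, 30))
flat = 150 + rng.normal(0, 2, size=(40, 30))
prices = np.vstack([rising, flat])

theta = to_params(prices)             # 30-dim price vectors -> 3 parameters each
labels, centers = kmeans(theta, k=2)  # one model would then be trained per label
```

Clustering the three fitted coefficients instead of the raw 30-point vectors is what keeps the distance computation cheap; in a Scikit Learn setting `KMeans` or `GaussianMixture` would be the off-the-shelf equivalents of the hand-written `kmeans` helper.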

Published in: Science

  1. Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily. AGIFORS RM 2016, 20th of May, 2016. Josef Habdank, Lead Data Scientist & Data Platform Architect, jha@infare.com, www.infare.com
  2. Leading provider of Airfare Intelligence Solutions to the Aviation Industry. Delivers actionable information based on a huge amount of freshly collected and historical data. In business since 2000; 150 airline customers; 11 airports and several OTAs as customers; 7 offices worldwide; 5000-6000 revenue managers log in to our platform every week. https://www.youtube.com/watch?v=h9cQTooY92E
  3. Airfare Collection and Analytics: Online Airfare Data Collection; Data Processing and Modelling; Pharos: live analytics; Altus: historical analytics; Data Feeds
  4. Collecting 1 billion airfares a day. Reached 1bn/day airfares on 7th of April 2016. Conservative projected growth based on leads. [Chart: airfare observations daily, linear scale up to 3.5 billion; series: Observations Daily, Extrapolated Observations Daily]
  5. Data collection doubling time ~7-12 months. Reached 1bn/day airfares on 7th of April 2016. Conservative projected growth based on leads. [Chart: airfare observations daily, logarithmic scale from 100 thousand to 10 billion; series: Observations Daily, Extrapolated Observations Daily]
  6. Infare technology stack: 2015 and 2016+
  7. Infare technology stack 2016+. Data processing: Apache Spark. Message streaming: Kafka/Kinesis. BigData storage: Hadoop/S3. Microservices: C#.Net/Akka Spray. Real time analytics: MsSql/Cassandra. Machine Learning: PySpark + Scikit Learn. Tested on 6-8bn airfares a day.
  8. Reaching full market coverage soon: how to utilize it? Infare DataCenter; Altus: historical; Data Feeds; Pharos: live data + prediction; Granular Data Access API (live + historical queries to DB); Prediction and Analytics API (all models presented later). Prediction has been researched since 2012, but accuracy requires larger market coverage. Coverage of 5bn airfares/day is estimated to be required for launch of the final product.
  9. Prediction: minimum future price + API access
  10. Prediction: price evolution + API access
  11. Developing Prediction at Scale: • Tens to hundreds of millions of unique trips observed daily • Tens to hundreds of observed prices per trip • Clustering price vectors • Training a model per cluster • 10000-50000 models • Training should take 2-3h to enable daily or real-time updates
  12. Prediction of highly multivariate time series. Each point represents an n-dim vector time series. Cluster the time series (after dimensionality reduction, which reduces sparsity). Train ML models on the data within the respective cluster. The drawing depicts a trivial case with 2 dimensions and 3 models; in reality there are tens of thousands of clusters in a > 300-dimensional space.
  13. Remarks regarding modelling: • Requires careful feature selection • Dimensionality reduction of the time series space is done using polynomial fitting or inverse exponential series fitting • This transforms the price vectors into a parameter space, f: P ↦ Θ • Clustering of the time series projection Θ uses k-means or a Gaussian Mixture Model • ARIMA formulated as Linear Regression trained on P space: ARIMA(0, 1, n) ≡ y = Xβ + α, where dim X = (·, n) • For some clusters, Support Vector Regression with a Radial Basis Function kernel is used • Quantize the continuous co-domain to finite states drawn from data • Requires in-memory parallel processing, using Scikit Learn on PySpark
  14. Future research: estimating competitors' demand curves. From the airline's own historical and current demand curves and Infare's historical and current market prices, produce an estimate of competitors' current and future demand curves. Could be solved as a Blind Source Separation or Machine Learning problem. Looking for a partner airline to pilot this research project.
  15. Question to audience: What do you think is the most important product? 1) Granular live and historical data access API 2) Price Prediction in Pharos + API 3) Estimating competitors' booking curves
  16. THANK YOU! Please contact us if you would like to collaborate in research. Josef Habdank, 20th of May, 2016, Lead Data Scientist & Data Platform Architect, jha@infare.com, www.infare.com
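The regression formulation on slide 13, y = Xβ + α with n columns in X, can be illustrated with a small NumPy sketch. The slides do not say how X is built, so the construction here is an assumption: each row of X holds the previous n first differences of a price series and y is the next difference, which strictly speaking is an autoregression on the differenced series. The function names `lagged_design` and `fit_linear`, the synthetic series and n = 2 are hypothetical.

```python
import numpy as np

def lagged_design(series, n):
    """Build (X, y) from first differences; each X row holds n previous differences."""
    d = np.diff(series)                        # first-order differencing of prices
    X = np.array([d[i:i + n] for i in range(len(d) - n)])
    y = d[n:]
    return X, y

def fit_linear(X, y):
    """Ordinary least squares for y = X beta + alpha."""
    A = np.column_stack([X, np.ones(len(X))])  # extra column carries the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

# Synthetic series (assumption): price differences follow
# d[t] = 0.6 * d[t-1] - 0.3 * d[t-2] + 1 with no noise
d = [5.0, -3.0]
for _ in range(30):
    d.append(0.6 * d[-1] - 0.3 * d[-2] + 1.0)
prices = 100.0 + np.concatenate([[0.0], np.cumsum(d)])

X, y = lagged_design(prices, n=2)
beta, alpha = fit_linear(X, y)

# One-step-ahead forecast: last price plus the predicted next difference
next_price = prices[-1] + np.diff(prices)[-2:] @ beta + alpha
```

Because the synthetic differences obey the recurrence exactly, the fit recovers the generating coefficients; on real airfare data the same design matrix would be fitted once per cluster.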

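The growth figures on slides 4, 5 and 8 imply a rough timeline, worked through below. The calculation assumes the collection volume keeps growing exponentially at the quoted doubling time of 7 to 12 months, starting from 1bn airfares/day, until it reaches the 5bn airfares/day estimated for launch; the function name `months_to_reach` is hypothetical.

```python
import math

def months_to_reach(current, target, doubling_months):
    """Months until `current` grows to `target` at a constant doubling time."""
    return doubling_months * math.log2(target / current)

fast = months_to_reach(1e9, 5e9, 7)    # shortest doubling time quoted on slide 5
slow = months_to_reach(1e9, 5e9, 12)   # longest doubling time quoted on slide 5
print(f"{fast:.1f} to {slow:.1f} months")
```

Under these assumptions the 5bn/day mark would fall roughly 16 to 28 months after the 1bn/day milestone.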