14.05.12 Analysis and Prediction of Flight Prices using Historical Pricing Data with Hadoop (Jérémie Miserez, ETH Zürich)

  1. Analysis and Prediction of Flight Prices using historical pricing data
     1st Swiss Hadoop User Group meeting, May 14, 2012
     Jérémie Miserez (miserezj@student.ethz.ch)
  2. Overview
     • Project setup
     • Goals
     • Exploratory data analysis (Hadoop)
     • Classification & prediction methods
     • Processing pipeline (Hadoop)
     • Results
     This project was done as part of my Bachelor’s thesis at the Systems Group, ETH Zürich, in collaboration with Amadeus IT Group SA.
  3. Project setup
     • Airline tickets can be bought up to ~1 year in advance. Prices change from day to day.
     • Amadeus CRS is the largest global distribution system in the travel/tourism industry:
         - sells tickets for 435 airlines (also hotels, cruises, etc.)
         - processes ~850 million billable transactions per year
     • Amadeus provided us with a dataset containing buyable tickets for each day from May 2008 to Jan 2011.
  4. Goals
     1. Construct and train a general classifier so that it can distinguish between expensive and cheap tickets.
     2. Use this classifier to predict the prices of future tickets.
     3. Determine which factors have the greatest impact on price by analyzing the trained classifier.
     But first: we need to understand the dataset!
  5. Exploratory data analysis
     Extent of the dataset:
     • 27.2 billion records
     • 132.2 GiB (uncompressed)
     • 63 departure airports, 428 destinations, 4387 routes, 117 airlines
  6. Exploratory data analysis
     The majority of activity is concentrated in Europe.
     [Figure: not preserved in the transcript]
  7. Exploratory data analysis
     Lots of fields:
     • “Buy” date: When was this price current?
     • “Fly” date: When does the flight leave?
     • …
     • Price & currency
     • …
     • Cabin class: Economy/Business/First (98% economy tickets)
     • Booking class: A-Z
     • …
     • Airline: The airline selling the ticket.
     • …
     Not a time series: tickets are not linked over time.
  8. Exploratory data analysis
     • Visualizing small subsets of the data helps to understand the data.
     • Lots of simple Hadoop jobs were used to preprocess the data, followed by multiple visualizations in Matlab.
     • Can we see some patterns already?
  9. Exploratory data analysis
     For ZRH-BKK, plot the prices of the cheapest tickets available every day.
     [Figure: cheapest price by buy date and fly date; visible axis labels: December, July; price scale from 600 EUR to 2400 EUR]
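The data behind such a plot is a minimum over all tickets sharing the same buy date and fly date. A minimal sketch of that aggregation follows; the record layout `(buy_date, fly_date, price_eur)` and the sample values are hypothetical, and the original work used Hadoop jobs rather than an in-memory loop.

```python
def cheapest_per_day(records):
    """Map (buy_date, fly_date) -> lowest price seen for that pair.

    `records` is an iterable of (buy_date, fly_date, price_eur) tuples;
    this layout is an assumption made for the example.
    """
    cheapest = {}
    for buy_date, fly_date, price_eur in records:
        key = (buy_date, fly_date)
        if key not in cheapest or price_eur < cheapest[key]:
            cheapest[key] = price_eur
    return cheapest


# Hypothetical ZRH-BKK records, already converted to EUR.
records = [
    ("2010-07-01", "2010-12-20", 950.0),
    ("2010-07-01", "2010-12-20", 720.0),
    ("2010-07-02", "2010-12-20", 810.0),
]
grid = cheapest_per_day(records)
```

In a MapReduce setting the same logic would map each record to the key `(buy_date, fly_date)` and reduce with a minimum.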
  10. Classification & Prediction methods
      Implemented two different classifiers:
      • Support vector machine (SVM)
      • L1-regularized linear regression
      Both are convex minimization problems that can be solved online with stochastic gradient descent (SGD).
      • An online algorithm has constant memory usage that does not depend on the size of the dataset.
      • “Stochastic”: the order of training points is selected at random from the dataset.
      SGD can be parallelized (parallelized SGD)* with almost no overhead, and is very suitable for use with MapReduce.
      * M. Zinkevich, M. Weimer, A. Smola, and L. Li. “Parallelized stochastic gradient descent”, 24th Annual Conference on Neural Information Processing Systems, 2010.
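To make the online SGD idea concrete, here is a minimal sketch of SGD on a hinge-loss linear SVM without a bias term. The step size schedule `1 / (lam * t)`, the parameter names `lam` and `epochs`, and the toy data are assumptions for illustration, not taken from the talk.

```python
import random


def predict(w, x):
    """Label a point by the side of the hyperplane it lies on."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1


def sgd_svm(points, labels, lam=0.01, epochs=20, seed=0):
    """Train a linear SVM with SGD on the regularized hinge loss.

    Only the weight vector `w` is kept as state, so memory use does not
    grow with the number of training points visited.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    order = list(range(len(points)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)           # "stochastic": random visiting order
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)    # assumed step size schedule
            x, y = points[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Gradient step on the regularizer (lam / 2) * ||w||^2.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Gradient step on the hinge loss, active only inside the margin.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


# Toy, linearly separable data (hypothetical).
points = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w = sgd_svm(points, labels)
```

Because each update touches a single training point, the same loop can consume records streamed from disk instead of a list.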
  11. Classification & Prediction methods
      SVM: binary linear classifier
      • Goal: find the maximum-margin hyperplane that divides the points with label “+1” from those with label “-1”.
      • After training:
          - Hyperplane parameters: [formula not preserved]
          - Get the label for a data point as: [formula not preserved]
      • Training:
          - Generate the training label for the i-th data point: [formula not preserved]
          - Choose the hyperplane parameters so that the margin is maximal and the training data is still correctly classified: [formula not preserved]
  12. Classification & Prediction methods
      Implementation uses:
      • Hinge loss function: [formula not preserved]
          - Takes “outliers” into account.
      • Regularization parameter:
          - Bounds the length of the hyperplane parameter vector, i.e. a large value increases generalization.
      • Preprocess the data for zero mean and unit variance.
      For training points: margin [formula not preserved], with lower bound [formula not preserved].
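For orientation, the textbook form of a soft-margin linear SVM with hinge loss is given below. This is the standard formulation, not a recovery of the formulas on the slides; the symbols $w$ (hyperplane parameters), $\lambda$ (regularization parameter), $n$ (number of training points), and the omission of a bias term are assumptions.

```latex
% Prediction for a data point x (no bias term assumed):
\hat{y} = \operatorname{sign}\left(w^{\top} x\right)

% Hinge loss for a training pair (x_i, y_i) with y_i \in \{+1, -1\}:
\ell(w; x_i, y_i) = \max\left(0,\; 1 - y_i \, w^{\top} x_i\right)

% Regularized training objective:
\min_{w} \; \frac{\lambda}{2} \lVert w \rVert_2^2
  + \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\; 1 - y_i \, w^{\top} x_i\right)
```

In this form the hinge loss is zero for points classified correctly with margin at least 1 and grows linearly for points that violate the margin, which is how outliers are tolerated; a larger $\lambda$ penalizes long $w$ more strongly.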
  13. Hadoop: Preprocessing
      Generate training labels (y) from the dataset:
      • Convert currencies using historical exchange rates.
      • For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets.
      • Assign labels:
          - Label +: “Above mean price for this route”
          - Label -: “Below mean price for this route”
      • Only store the mean/std-dev; do not actually store labels in the HDFS.
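The labeling steps can be sketched as follows, assuming prices are already converted to a common currency. The function names, the `(route, price)` tuple layout, and the choice to label a price exactly equal to the mean as -1 are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt


def route_stats(tickets):
    """Return route -> (mean price, standard deviation).

    `tickets` is an iterable of (route, price) pairs. Only running sums
    are kept per route, so one pass over the data is enough.
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum, sum of squares
    for route, price in tickets:
        acc = sums[route]
        acc[0] += 1
        acc[1] += price
        acc[2] += price * price
    stats = {}
    for route, (n, total, total_sq) in sums.items():
        mean = total / n
        variance = max(total_sq / n - mean * mean, 0.0)  # guard rounding
        stats[route] = (mean, sqrt(variance))
    return stats


def label(route, price, stats):
    """Compute the label on the fly: +1 above the route mean, else -1."""
    mean, _std = stats[route]
    return 1 if price > mean else -1


# Hypothetical tickets.
tickets = [("ZRH-BKK", 600.0), ("ZRH-BKK", 1000.0), ("ZRH-JFK", 400.0)]
stats = route_stats(tickets)
```

Computing labels from the stored statistics at training time matches the point above that only the mean and standard deviation need to be stored.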
  14. Hadoop: Preprocessing
      Extract features from plaintext records (x).
      • Each plaintext record is transformed into a 930-dimensional vector.
      • Each dimension contains a numerical value corresponding to a feature such as:
          - Number of days between “Buy” and “Fly” dates
          - Day of week (for all dates)
          - Is the day on a weekend? (for all dates)
          - Is the currency CHF?
          - etc.
      • Each dimension is normalized to zero mean and unit variance (per route r).
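A minimal sketch of the named features and the normalization step is shown below. It covers only the four features listed above, not the full 930 dimensions; the one-hot encoding of the day of week, the function names, and the handling of constant columns are assumptions.

```python
from datetime import date
from math import sqrt


def extract_features(buy, fly, currency):
    """Turn one record into a small numeric feature vector (18 values)."""
    feats = [float((fly - buy).days)]        # days between buy and fly date
    for d in (buy, fly):
        one_hot = [0.0] * 7                  # day of week, Monday = index 0
        one_hot[d.weekday()] = 1.0
        feats.extend(one_hot)
        feats.append(1.0 if d.weekday() >= 5 else 0.0)  # weekend flag
    feats.append(1.0 if currency == "CHF" else 0.0)
    return feats


def standardize(rows):
    """Scale each column to zero mean and unit variance.

    Columns with zero variance are mapped to 0.0 (an assumption). The
    talk normalizes per route, so this would be applied to one route's
    rows at a time.
    """
    n = len(rows)
    dims = range(len(rows[0]))
    means = [sum(r[j] for r in rows) / n for j in dims]
    stds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in dims]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in dims]
        for r in rows
    ]


# Hypothetical record: bought on a Thursday, flying on a Saturday.
features = extract_features(date(2010, 7, 1), date(2010, 12, 25), "CHF")
```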
  15. Hadoop: Processing pipeline
      • Shuffle the data:
          - (P)SGD demands random selection of data points.
      • Partition the data into n (=1200) chunks.
      • Train using PSGD:
          - Parallel training on k (=40) chunks.
          - Average the hyperplane coefficients after all 1200 chunks have been processed (= after 30 iterations).
      • We can get intermediate results by calculating the accuracy every time 40 chunks have been processed.
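The pipeline can be sketched in a few lines. This is one reading of the steps above, in which each of the k workers keeps its own weight vector across iterations and the vectors are averaged at the end; the parallel workers are simulated sequentially, and the constant step size `eta`, the parameter `lam`, and the toy data are assumptions.

```python
import random


def sgd_pass(w, chunk, lam, eta):
    """Run sequential hinge-loss SGD over one chunk; return new weights."""
    for x, y in chunk:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [wj * (1.0 - eta * lam) for wj in w]
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def average(vectors):
    """Average weight vectors coordinate by coordinate."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def psgd(data, n_chunks, k, lam=0.01, eta=0.1, seed=0):
    """Shuffle, partition into n_chunks, train k chunks per iteration."""
    data = list(data)
    random.Random(seed).shuffle(data)
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    dim = len(data[0][0])
    workers = [[0.0] * dim for _ in range(k)]
    for start in range(0, n_chunks, k):      # one iteration = k chunks
        for j, chunk in enumerate(chunks[start:start + k]):
            # In a real cluster these k passes would run in parallel.
            workers[j] = sgd_pass(workers[j], chunk, lam, eta)
        # average(workers) at this point gives an intermediate model.
    return average(workers)


# Toy, linearly separable data (hypothetical): (features, label) pairs.
base = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w_avg = psgd(base * 10, n_chunks=8, k=4)
```

With n = 1200 and k = 40 the outer loop runs 1200 / 40 = 30 times, which matches the 30 iterations stated above.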
  16. Extensions done to the basic algorithms
      • Hierarchical classification:
          - Train 7 classifiers in parallel.
          - Increases runtime by a factor of 3.
      • Per-airline classification:
          - Train 1+21 classifiers in parallel.
          - Increases runtime by a factor of 2.
      Classifiers:
          General classifier
          1 – Airline A classifier (21%)
          2 – Airline B classifier (9%)
          3 – Airline C classifier (7%)
          4 – Airline D classifier (6%)
          …
          21 – “Other” airlines (15.4%)
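The per-airline split implies a routing step that sends each record either to a dedicated airline classifier or to the shared “Other” classifier. A minimal sketch of that routing follows; choosing the dedicated airlines by record count, the function names, and the sample counts are assumptions, and how the general classifier is combined with the per-airline ones is not covered here.

```python
def build_dedicated(airline_counts, n_dedicated=20):
    """Pick the airlines that get their own classifier.

    Selecting the n_dedicated most frequent airlines is an assumption;
    every remaining airline shares the "OTHER" classifier.
    """
    ranked = sorted(airline_counts, key=airline_counts.get, reverse=True)
    return set(ranked[:n_dedicated])


def bucket_for(airline, dedicated):
    """Return the name of the classifier responsible for this airline."""
    return airline if airline in dedicated else "OTHER"


# Hypothetical record counts per airline.
counts = {"A": 21, "B": 9, "C": 7, "D": 6, "E": 1}
dedicated = build_dedicated(counts, n_dedicated=4)
```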
  17. Results: Overall accuracy
      Dataset: 10% subsample of all records (class economy)
      [Figure: not preserved in the transcript]
  18. Results: Overall accuracy
      Dataset: All records ZRH -> * (economy)
      [Figure: not preserved in the transcript]
  19. Results: Overall accuracy
      Dataset: All records ZRH -> BKK (economy)
      [Figure: not preserved in the transcript]
  20. Results: Analyzing a single airline X
      SVM classifier 0, for airline X, dataset 10% full subsample
      [Figure: not preserved in the transcript]
  21. Results: Analyzing a single airline X
      SVM classifier 0, for airline X, dataset 10% full subsample
      [Figure: not preserved in the transcript]
  22. Questions!
