
How to Feed a Data Hungry Organization – by Traveloka Data Team

At Traveloka's inaugural Data Meetup, held in April 2017, Ainun Najib (Head of Data), Dr. Philip Thomas (Lead Data Scientist), and Rendy B. Junior (Lead Data Engineer) shared the journey that Traveloka's Data Team has taken so far, so that the audience could learn from the struggles and triumphs of managing Traveloka's burgeoning data.

You will learn more about:
1) Data culture in Traveloka
2) Data engineering in Traveloka
3) Data science in Traveloka

To follow our LinkedIn page, visit bit.ly/TravelokaLinkedInPage


Safe Harbor Statement

Our discussion may include predictions, estimates or other information that might be considered conclusive. While these conclusive statements represent our current judgment on the best practices, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on our statements, which reflect our opinions only as of the date of this presentation. Please keep in mind that we are not obligating ourselves to revise or publicly release the results of any revision to these presentation materials in light of new information or future events.

Published in: Data & Analytics

  1. Traveloka Data Meetup v1.0.0: How to Feed a Data Hungry Organization
  2. Part One: Traveloka Data Culture
  3. Part 1: Traveloka Data Culture. Five Characteristics of Data Hungry Organization Driven Decision Learn from Mistakes Better Understanding Uncertainty and Variation High Quality Data Data Hungry Organization
  4. Part 1: Traveloka Data Culture. Our responsibility is to turn data into consumable insights (diagram labels: Data Team, Better Business Decision).
  5. Part 1: Traveloka Data Culture. We need the brightest people to fill our needs and create the future. Skills: Mathematics, Business, Programming.
  6. Part 1: Traveloka Data Culture. Some of the skills in mathematics: Optimization, Decision Theory, Statistics, Differential Equations, Time Series.
  7. Part 1: Traveloka Data Culture. Some of the skills in business: Strategy, Finance, Economics.
  8. Part 1: Traveloka Data Culture. Some of the skills in programming: Data Wrangling, Modelling, Big Data.
  9. Part 1: Traveloka Data Culture. This is how we structure our team. Data Team: Data Governance, Machine Learning Engineering, Data Analysis, Data Science, Data Engineering.
  10. Part 1: Traveloka Data Culture. Houston, we have a problem. DW: tens of terabytes, hundreds of ETLs. Kafka: hundreds of topics, millions of messages per hour, hundreds of megabytes per second. S3: hundreds of terabytes. Redshift: tens of thousands of queries daily. DOMO: thousands of cards, hundreds of users. PeriscopeData: thousands of dashboards, hundreds of users.
  11. Part 1: Traveloka Data Culture. We need state of the art technology to feed data hungry people. Stages: Ingestion, Processing, Storage, Presentation. Source DB: Mongo, PostgreSQL. App / Services: Java App. Ingestion: Gobblin. Datahub: Pubsub, Kafka. Data Lake: AWS S3. Batch Processing: Spark, Airflow, Hadoop2, Python, Java App. Stream Processing: DataFlow, MemSQL Pipeline. Data Warehouse: Redshift, MongoDB, PostgreSQL. Near Real Time DW: GCP BigQuery, MemSQL. Real Time DB: AWS DynamoDB. Query Engine: Qubole, Presto, Hive. Analytics Tools: PeriscopeData, Spark, R, Domo, Dataiku, Holistics, Keboola. ML Tools, Library, and Services: Jupyter, Zeppelin, Caffe, DataDog, TensorFlow, Cloud Vision API.
  12. Part Two: Data Engineering
  13. Part 2: Data Engineering. Fast Food, Or…?
  14. Part 2: Data Engineering. Mindsets: managed service for focus, so we could focus more on the use cases.
  15. Part 2: Data Engineering. Mindsets: managed service for focus, so we could focus more on the use cases.
  16. Part 2: Data Engineering. Real Time Pipeline: 5 min data delivery SLA (real latency ~10 s); 100 ms query SLA (real latency ~10 ms at p95); key-value data, queried by service/app; autoscale; self service for each engineering team. We provide governance, guidance, building blocks, and consultation.
  17. Part 2: Data Engineering. Real Time Pipeline
  18. Part 2: Data Engineering. Near Real Time Pipeline: raw data, queried by BI tools; 5 min data delivery SLA (real latency ~5 s); YAML for schema definition (built and defined by ourselves); self service for data analysts, with guidance and governance.
  19. Part 2: Data Engineering. Near Real Time Pipeline
  20. Part 2: Data Engineering. Near Real Time Pipeline: But MemSQL is not a managed service; it runs on EC2. It is easy to scale, but does not autoscale yet. So we are moving to… v2! It is currently in usability testing by analysts. Self service, of course!
  21. Part 2: Data Engineering. Near Real Time Pipeline
  22. Part 2: Data Engineering. Analytical Pipeline: heavy data processing, queried by BI tools; 6 hour data delivery SLA.
  23. Part 2: Data Engineering. Analytical Pipeline, interesting features: custom dev/prod environment, for self service; custom framework on top of Spark; custom Airflow with a separate queue for backfill; EMR autoscale for backfill; Redshift microbatch bulk load; etc.
  24. Part 2: Data Engineering. Summary
  25. Part Three: Data Science in Traveloka
  26. Part 3: Data Science in Traveloka. Three Things to Discuss Today: Data Science Purpose; Tools of the Trade; Model Evaluations and Applications.
  27. Part 3: Data Science in Traveloka. Three Things to Discuss Today: Data Science Purpose; Tools of the Trade; Model Evaluations and Applications.
  28. Part 3: Data Science in Traveloka. Novia is 25 years old. She is single, outspoken, and mathematically gifted. As a student, she was deeply interested in calculus and statistics, and also participated in the International Mathematical Olympiad. (a) Novia is a data scientist. (b) Novia is a data scientist and is active as a Mathematical Olympiad tutor.
  29. Part 3: Data Science in Traveloka. Consider a regular six-sided die with four green faces and two red faces. The die will be rolled 20 times and the sequence of greens (G) and reds (R) will be recorded. Choose one sequence from a set of three. Which one is the more likely outcome? RGRRR, GRGRRR, GRRRRR
  30. Part 3: Data Science in Traveloka
  31. Part 3: Data Science in Traveloka
  32. Part 3: Data Science in Traveloka. Remember this: the goal of a data science exercise is to help us make a good business decision. Logic, Alternatives, Information, Preferences.
  33. Part 3: Data Science in Traveloka. “if they learn nothing else about decision analysis from their studies, distinction between outcome and decisions will have been worth the price of admission” (Ron Howard, Professor at Stanford University, Father of Decision Analysis). Decisions versus outcome: good decision, good outcome: took a taxi and arrived safely; bad decision, good outcome: drove home and arrived safely; good decision, bad outcome: took a taxi and was involved in an accident; bad decision, bad outcome: drove home and was involved in an accident.
  34. Part 3: Data Science in Traveloka. Three Things to Discuss Today: Data Science Purpose; Tools of the Trade; Model Evaluations and Applications.
  35. Part 3: Data Science in Traveloka. Data Science Framework: CRISP-DM. Business, Data, Data Prep, Model, Evaluation, Deployment, Common Sense.
  36. Part 3: Data Science in Traveloka. “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world” (Atul Butte, Stanford). We use open source libraries for data science. Wrangling: data.table, dplyr, sparkR, sparklyr, pandas, pyspark. Visualization: ggplot, matplotlib, seaborn, shiny. Statistics: R, JAGS, STAN, Python, Julia. Machine Learning: scikit-learn, caret, e1071, fbprophet.
  37. Part 3: Data Science in Traveloka. Are we using the algorithm? Or being used by it? Classification: Linear Models, Naïve Bayes Classifier, Support Vector Classifier, Vowpal Wabbit Classifier, Random Forest, Decision Trees, Neural Network, Extreme Gradient Boosted Trees, many more algos! Prediction: Linear Models, Nystroem Regressor, Support Vector Regressor, Vowpal Wabbit Regressor, Random Forest, Decision Trees, Neural Network, Extreme Gradient Boosted Trees, more algos! (Scikit-learn, Caret, TensorFlow, …)
  38. Part 3: Data Science in Traveloka. We need more than just off-the-shelf libraries to feed data hungry people: Bayesian Network, Markov Chain Monte Carlo.
  39. Part 3: Data Science in Traveloka. Three Things to Discuss Today: Data Science Purpose; Tools of the Trade; Model Evaluations and Applications.
  40. Part 3: Data Science in Traveloka. Model Evaluation: judging the usefulness of your model. Rule #1: Never ever peek at the test set during training/validation. Rule #2: You can never satisfy all the metrics; pick one or two metrics as your decision criteria beforehand. Rule #3: Always do comparative statics on the final model.
  41. Part 3: Data Science in Traveloka. Comparative statics, commonly used as feature importance analysis.
  42. Part 3: Data Science in Traveloka. Remember the end goal: decisions. What should we do? What might happen?
  43. Part 3: Data Science in Traveloka. “But in my view, obsessive customer focus is by far the most protective of Day 1 vitality” Our data is telling us: What do they want? Do we serve their needs? Are they trying to leave us? My name is Jeff
  44. Thank you!
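
The die question on slide 29 can be checked with a short calculation. The sketch below is not from the talk. It assumes the question asks which sequence is most likely to appear somewhere within the 20 recorded rolls, and the function names are illustrative. It computes both the probability of each sequence as an exact run of rolls and the exact probability that it shows up at least once in 20 rolls.

```python
from fractions import Fraction

# Four green faces and two red faces on a fair six-sided die.
P = {"G": Fraction(2, 3), "R": Fraction(1, 3)}


def exact_sequence_probability(seq):
    """Probability that len(seq) consecutive rolls produce exactly seq."""
    prob = Fraction(1)
    for face in seq:
        prob *= P[face]
    return prob


def appears_probability(pattern, rolls=20):
    """Exact probability that pattern occurs at least once in `rolls` rolls.

    Dynamic programming over the number of leading pattern characters
    currently matched; a full match is treated as absorbing.
    """
    m = len(pattern)

    def next_state(state, face):
        # Longest prefix of pattern that is a suffix of the matched text + face.
        text = pattern[:state] + face
        for k in range(len(text), 0, -1):
            if text.endswith(pattern[:k]):
                return k
        return 0

    dist = {0: Fraction(1)}   # probability of each partial-match state
    found = Fraction(0)       # probability the pattern has already appeared
    for _ in range(rolls):
        new_dist = {}
        for state, p in dist.items():
            for face, p_face in P.items():
                s = next_state(state, face)
                if s == m:
                    found += p * p_face
                else:
                    new_dist[s] = new_dist.get(s, Fraction(0)) + p * p_face
        dist = new_dist
    return found


for seq in ("RGRRR", "GRGRRR", "GRRRRR"):
    print(seq, exact_sequence_probability(seq), float(appears_probability(seq)))
```

As exact runs, the three sequences have probabilities 2/243, 4/729, and 2/729. Under the stated reading, RGRRR also has to be at least as likely to appear as GRGRRR, because every occurrence of GRGRRR contains RGRRR.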

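Rule #3 on slide 40 and slide 41 mention comparative statics as a form of feature importance analysis, without describing a procedure. One common reading, assumed here, is to shift one input at a time while holding the others at a baseline and to record how the prediction changes. The model, feature names, and numbers below are hypothetical stand-ins, not taken from the talk.

```python
def comparative_statics(predict, baseline, deltas):
    """Change in prediction when each feature is shifted on its own.

    predict  : callable taking a dict of feature values, returning a number
    baseline : dict of feature values at which the model is evaluated
    deltas   : dict mapping feature name -> shift applied to that feature
    """
    base_pred = predict(baseline)
    effects = {}
    for name, delta in deltas.items():
        shifted = dict(baseline)              # copy so other features stay fixed
        shifted[name] = baseline[name] + delta
        effects[name] = predict(shifted) - base_pred
    return effects


# Hypothetical stand-in for a fitted final model.
def toy_model(x):
    return 3.0 * x["price_discount"] + 0.5 * x["days_to_departure"] + 10.0


baseline = {"price_discount": 2.0, "days_to_departure": 14.0}
effects = comparative_statics(
    toy_model, baseline, {"price_discount": 1.0, "days_to_departure": 1.0}
)

# Rank features by the size of their one-at-a-time effect.
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
print(effects, ranked)
```

In line with Rule #1, an analysis like this would be run on the final model only, after training and validation are finished.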