Gartner Peer Forum, September 2011: Orbitz


Published in: Technology, Business

  1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure
     Raghu Kashyap and Jonathan Seidman
     Gartner Peer Forum, September 14, 2011
  2. Who We Are
     • Raghu Kashyap
         • Director, Web Analytics
         • [email_address]
         • @ragskashyap
     • Jonathan Seidman
         • Lead Engineer, Business Intelligence/Big Data Team
         • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
         • [email_address]
         • @jseidman
  3. Launched in 2001, Chicago, IL
     • Over 160 million bookings
  4. What is Hadoop?
     • Open source software that supports the storage and analysis of extremely large volumes of data, typically terabytes to petabytes.
     • Two primary components:
         • Hadoop Distributed File System (HDFS) provides economical, reliable, fault-tolerant and scalable storage of very large datasets across machines in a cluster.
         • MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel.
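
The MapReduce model described on slide 4 can be illustrated on a single machine. The following is a minimal sketch in plain Python, not Hadoop code and not taken from the slides: the record layout (a tab-separated line whose first field is a destination) and every function name are assumptions made for illustration. It only shows the three stages the model implies: map each record to key/value pairs, group the values by key, then reduce each group.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record and emit its key/value pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group values by key (in Hadoop this happens between map and reduce)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def search_mapper(record: str) -> Iterator[Pair]:
    # Hypothetical record layout: "destination<TAB>product".
    destination = record.split("\t")[0]
    yield destination, 1


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Count searches per destination using the three stages above."""
    pairs = map_phase(records, search_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)
```

Calling `run_job` on a handful of sample lines returns a dictionary of counts per destination. On a real cluster the map and reduce calls run in parallel on many machines, and the grouping step moves data between them.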
  5. Why Hadoop?
     • Hadoop allows us to store and process data that was previously impractical because of cost, technical issues, etc., and places no constraints on how that data is processed.
     [Chart: $ per TB]
  6. Why We Started Using Hadoop
     • Optimizing hotel search…
  7. Why We Started Using Hadoop
     • In 2009, the Machine Learning team was formed to improve site performance, for example by improving hotel search results.
     • This required access to large volumes of behavioral data for analysis.
  8. The Problem…
     • The only archive of the required data went back about two weeks.
     [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches)]
  9. Hadoop Was Selected as a Solution…
     [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches); Hadoop]
  10. Unfortunately…
      • We faced organizational resistance to deploying Hadoop.
          • Not from management, but from other technical teams.
      • It took persistence to convince them that we needed to introduce a new hardware spec to support Hadoop.
  11. Current Big Data Infrastructure
      [Diagram labels: Hadoop; MapReduce; HDFS; MapReduce Jobs (Java, Python, R/RHIPE); Analytic Tools (Hive, Pig); Data Warehouse (Greenplum); psql, gpload, Sqoop; External Analytical Jobs (Java, R, etc.); Aggregated Data]
  12. Hadoop Architecture Details
      • Production cluster
          • About 200TB of raw storage
          • 336 (physical) cores
          • 672GB RAM
          • 4 client nodes (Hive, ad-hoc jobs, scheduled jobs, etc.)
      • Development cluster for user testing
      • Test cluster for testing upgrades, new software, etc.
      • Cloudera CDH3
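
To put the production cluster figures on slide 12 in perspective, here is a small back-of-the-envelope sketch. The slides do not state the replication factor in use; the value 3 below is HDFS's default `dfs.replication` and is an assumption here, as are the function names. The estimate also ignores disk space reserved for intermediate job output and the operating system.

```python
def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Approximate data capacity once every block is stored `replication` times.

    replication=3 is the HDFS default, assumed here; the slides do not say
    which value the cluster actually used.
    """
    return raw_tb / replication


def ram_per_core_gb(total_ram_gb: float, physical_cores: int) -> float:
    """Average memory available per physical core across the cluster."""
    return total_ram_gb / physical_cores


# Figures from slide 12.
production_usable = usable_capacity_tb(200)    # roughly 66.7 TB
production_ram = ram_per_core_gb(672, 336)     # 2.0 GB per core
```

Under the same assumption, a cluster with about 500TB of raw storage would hold roughly 167TB of replicated data.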
  13. Deploying Hadoop Enabled Multiple Applications…
  14. But Brought New Challenges…
      • Most of these efforts are driven by development teams.
      • The challenge now is unlocking the value of this data for non-technical users.
  15. In Early 2011…
      • The Big Data team is formed under the Business Intelligence team at Orbitz Worldwide.
      • This reflects the importance of big data to the future of the company.
      • It allows the Big Data team to work more closely with the data warehouse and BI teams.
      • We’re also evaluating tools to facilitate analysis of Hadoop data by the wider organization.
  16. Karmasphere Analyst
  17. Karmasphere Analyst
  18. Datameer Analytics Solution
  19. Datameer Analytics Solution
  20. Not to Mention Other BI Vendors…
  21. One More Use Case – Click Data Processing
      • Still under development, but a good example of how Hadoop can be used to complement an existing data warehouse.
  22. Click Data Processing – Current Data Warehouse Processing
      [Diagram labels: Web Servers; Web Server Logs; ETL; DW; Data Cleansing (Stored procedure); DW; 3 hours; 2 hours; ~20% original data size]
  23. Click Data Processing – Proposed Hadoop Processing
      [Diagram labels: Web Servers; Web Server Logs; HDFS; Data Cleansing (MapReduce); DW]
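
The "Data Cleansing (MapReduce)" step on slide 23 could be written as a map-only job. The sketch below shows what such a cleansing mapper might look like in Python, in the style used with Hadoop Streaming (read lines from standard input, write cleaned lines to standard output). The slides do not describe the log layout or the cleansing rules, so the four-field tab-separated format, the bot markers and all names here are hypothetical.

```python
import sys
from typing import Iterable, Iterator, Optional

# Hypothetical rule: drop requests whose user agent looks like a crawler.
BOT_MARKERS = ("bot", "crawler", "spider")

# Hypothetical layout: timestamp, session id, URL, user agent (tab-separated).
EXPECTED_FIELDS = 4


def clean_line(line: str) -> Optional[str]:
    """Return the cleaned record, or None if the line should be discarded."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line
    timestamp, session_id, url, user_agent = (f.strip() for f in fields)
    if not session_id or not url:
        return None  # missing required values
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # automated traffic
    # Keep only the fields the warehouse needs, which also shrinks the data.
    return "\t".join((timestamp, session_id, url))


def clean_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned records from an iterable of raw log lines."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned is not None:
            yield cleaned


def main() -> None:
    """Entry point when the script is used as a streaming mapper."""
    for record in clean_stream(sys.stdin):
        sys.stdout.write(record + "\n")
```

Because each line is cleaned independently, a job like this needs no reduce step and scales with the number of map tasks, which is one reason this kind of work is a candidate for moving off the data warehouse.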
  24. Lessons Learned
      • Expect organizational resistance from unanticipated directions.
      • Advice for finding big data developers:
          • Don’t bother.
          • Instead, train smart and motivated internal resources or new hires.
      • But get help if you need it.
          • There are a number of experienced providers who can help you get started.
  25. Lessons Learned
      • The Hadoop market is still immature, but growing quickly. Better tools are on the way.
          • Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
      • Use the appropriate tool based on requirements. Treat Hadoop as a complement to traditional data stores, not a replacement for them.
  26. Lessons Learned
      • Work closely with your existing data management teams.
          • Your idea of what constitutes “big data” might quickly diverge from theirs.
      • The flip side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.
  27. In the Near Future…
      • Production cluster capacity increase:
          • ~500TB of raw storage.
      • Further integration with the data warehouse.
      • Deployment of analysis and reporting tools on top of Hadoop.
  28. Web Analytics and Big Data
  29. What is Web Analytics?
      • Understand the impact and economic value of the website
      • Rigorous outcome analysis
      • Passion for customer centricity by embracing voice-of-customer initiatives
      • Fail faster by leveraging the power of experimentation (MVT)
  30. Challenges
      • Site Analytics
          • Lack of multi-dimensional capabilities
          • Hard to find the right insight
          • Heavy investment in the tools
          • Precision vs. direction
  31. Challenges (continued)
      • Big Data
          • No data unification or uniform platform across organizations and business units
          • No easy data extraction capabilities
      • Business
          • Distinction between reporting and testing (MVT)
          • Minimal measurement of outcomes
  32. Data Categories
      • Traffic acquisition
      • Marketing optimization
      • User engagement
      • Ad optimization
      • User behaviour
  33. Web Analytics & Big Data
      • Orbitz Worldwide (OWW) generates a couple of million air and hotel searches every day.
      • This produces massive amounts of data: hundreds of GB of log data per day.
      • It is expensive and difficult to store and process this data using the existing data infrastructure.
  34. Processing of Web Analytics Data
  35. Aggregating Data into the Data Warehouse
  36. Data Analysis Jobs
      • Traffic source and campaign activities
      • Daily jobs, weekly analysis
      • MapReduce job
          • ~20 minutes to process one day of raw logs
          • ~3 minutes to load into Hive tables
          • Generates more than 25 million records per month
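
A traffic source and campaign job of the kind listed on slide 36 amounts to classifying each visit and counting per class. The sketch below is an assumption-laden illustration in plain Python rather than the job described in the talk: the input shape (a landing URL paired with a referrer), the `campaign` query parameter name and the function names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import parse_qs, urlparse

# Hypothetical name of the query parameter that carries the campaign code.
CAMPAIGN_PARAM = "campaign"


def classify(landing_url: str, referrer: str) -> Tuple[str, str]:
    """Map one visit to a (traffic source, campaign) key."""
    # An empty referrer has no host, so the visit counts as direct traffic.
    source = urlparse(referrer).netloc.lower() or "direct"
    params = parse_qs(urlparse(landing_url).query)
    campaign = params.get(CAMPAIGN_PARAM, ["(none)"])[0]
    return source, campaign


def daily_counts(visits: Iterable[Tuple[str, str]]) -> Counter:
    """Count visits per (source, campaign); the reduce step of such a job."""
    return Counter(classify(url, ref) for url, ref in visits)
```

In a MapReduce setting `classify` would run in the mapper and the counting in the reducer, with the resulting rows loaded into a Hive table for the weekly analysis.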
  37. Business Insights
  38. Centralized Decentralization
      • Web Analytics team + SEO team + Hotel optimization team
  39. Model for Success
      • Measure the performance of your feature and fail fast.
      • Experimentation and testing should be ingrained into every key feature.
      • Break data extraction down into smaller chunks.
  40. Should Everyone Do This?
      • Do you have the technology strength to invest in and use Big Data?
      • Analytics using Big Data comes with a price (resources, time)
      • Big Data mining != analysis
      • Key data warehouse challenges still exist (time, data validity)
  41. Questions?