Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011


  • Most people think of, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, CheapTickets, The Away Network, ebookers and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999; the Orbitz site launched in 2001.
  • A couple of years ago when I mentioned Hadoop I’d often get blank stares, even from developers. I think most folks now are at least aware of what Hadoop is.
  • This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the data warehouse (DW) vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical to keep before for economic and technical reasons. Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way, you’re putting constraints around how the data is accessed and processed. With Hadoop, each application can process the raw data in whatever way is required. If you decide you need to analyze different attributes, you just run a new query.
  • The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
  • Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
  • Management was supportive of anything that facilitated the ML team’s efforts. But when we presented a hardware spec for servers with local, non-RAID storage, etc., systems engineering (syseng) offered us blades with attached storage.
  • Hadoop is used to crunch data for input to a system to recommend products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources. Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices as opposed to other users. This is just a handful of examples of how Hadoop is driving business value.
  • Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
  • Prior to 2011, Hadoop responsibilities were split across technology teams. Moving under a single team centralized responsibility and resources for Hadoop.
  • This is the processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing is done to generate the final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could otherwise be used for reports, queries, etc.
  • The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the “heavy lifting” of processing the relatively large data sets to Hadoop, which will allow us to take advantage of Hadoop’s efficiencies and greatly speed up the processing.
  • The bad news is that we need to significantly increase the number of servers in our cluster; the good news is that this is because teams are using Hadoop and new projects are coming online.

    1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure. Raghu Kashyap and Jonathan Seidman, Gartner Peer Forum, September 14, 2011
    2. Who We Are
       • Raghu Kashyap
           • Director, Web Analytics
           • @ragskashyap
       • Jonathan Seidman
           • Lead Engineer, Business Intelligence/Big Data Team
           • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
           • @jseidman
    3. Launched in 2001, Chicago, IL. Over 160 million bookings.
    4. What is Hadoop?
       • Open source software that supports the storage and analysis of extremely large volumes of data – typically terabytes to petabytes.
       • Two primary components:
           • Hadoop Distributed File System (HDFS) provides economical, reliable, fault-tolerant and scalable storage of very large datasets across machines in a cluster.
           • MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel.
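To make the MapReduce programming model on the slide above concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce). It is an illustration only, not Hadoop's API: the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the `user_id<TAB>hotel_id` click record format are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Call the reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical input format: one "user_id<TAB>hotel_id" click per line.
def click_mapper(line):
    user_id, hotel_id = line.split("\t")
    return [(hotel_id, 1)]


def count_reducer(hotel_id, counts):
    return sum(counts)


sample_clicks = ["u1\th42", "u2\th42", "u1\th7"]
clicks_per_hotel = reduce_phase(
    shuffle(map_phase(sample_clicks, click_mapper)), count_reducer
)
print(clicks_per_hotel)
```

On a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data across the network; the sketch only shows the shape of the computation.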
    5. Why Hadoop?
       • Hadoop allows us to store and process data that was previously impractical because of cost, technical issues, etc., and places no constraints on how that data is processed.
       • Chart: $ per TB
    6. Why We Started Using Hadoop: Optimizing hotel search…
    7. Why We Started Using Hadoop
       • In 2009, the Machine Learning team was formed to improve site performance, for example by improving hotel search results.
       • This required access to large volumes of behavioral data for analysis.
    8. The Problem…
       • The only archive of the required data went back about two weeks.
       • Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches)
    9. Hadoop Was Selected as a Solution…
       • Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches); Hadoop
    10. Unfortunately…
       • We faced organizational resistance to deploying Hadoop.
           • Not from management, but from other technical teams.
       • It required persistence to convince them that we needed to introduce a new hardware spec to support Hadoop.
    11. Current Big Data Infrastructure
       • Diagram labels: Hadoop; MapReduce; HDFS; MapReduce Jobs (Java, Python, R/RHIPE); Analytic Tools (Hive, Pig); Data Warehouse (Greenplum); psql, gpload, Sqoop; External Analytical Jobs (Java, R, etc.); Aggregated Data; Aggregated Data
    12. Hadoop Architecture Details
       • Production cluster
           • About 200TB of raw storage
           • 336 (physical) cores
           • 672GB RAM
           • 4 client nodes (Hive, ad-hoc jobs, scheduled jobs, etc.)
       • Development cluster for user testing
       • Test cluster for testing upgrades, new software, etc.
       • Cloudera CDH3
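The cluster figures above are raw totals, so a little arithmetic helps put them in perspective. The sketch below assumes a replication factor of 3, which is the HDFS default; the slides do not say what the production cluster actually used, and the calculation ignores space reserved for intermediate job output.

```python
# Figures taken from the slide.
RAW_STORAGE_TB = 200
PHYSICAL_CORES = 336
RAM_GB = 672

# Assumption: HDFS default replication (dfs.replication = 3).
REPLICATION_FACTOR = 3


def usable_storage_tb(raw_tb, replication):
    """Each block is stored `replication` times, so usable space shrinks accordingly."""
    return raw_tb / replication


ram_per_core_gb = RAM_GB / PHYSICAL_CORES
usable_tb = usable_storage_tb(RAW_STORAGE_TB, REPLICATION_FACTOR)

print(f"{ram_per_core_gb:.1f} GB RAM per physical core")
print(f"roughly {usable_tb:.0f} TB of usable storage at {REPLICATION_FACTOR}x replication")
```

Under that assumption, about 200TB of raw disk holds roughly a third as much unique data, which is worth keeping in mind when comparing cost per TB against other storage.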
    13. Deploying Hadoop Enabled Multiple Applications…
    14. But Brought New Challenges…
       • Most of these efforts are driven by development teams.
       • The challenge now is unlocking the value of this data for non-technical users.
    15. In Early 2011…
       • Big Data team is formed under the Business Intelligence team at Orbitz Worldwide.
       • Reflects the importance of big data to the future of the company.
       • Allows the Big Data team to work more closely with the data warehouse and BI teams.
       • We’re also evaluating tools to facilitate analysis of Hadoop data by the wider organization.
    16. Karmasphere Analyst
    17. Karmasphere Analyst
    18. Datameer Analytics Solution
    19. Datameer Analytics Solution
    20. Not to Mention Other BI Vendors…
    21. One More Use Case – Click Data Processing
       • Still under development, but a good example of how Hadoop can be used to complement an existing data warehouse.
    22. Click Data Processing – Current Data Warehouse Processing
       • Diagram labels: Web Server Logs; ETL; DW; Data Cleansing (Stored procedure); DW; Web Server; Web Servers; 3 hours; 2 hours; ~20% original data size
    23. Click Data Processing – Proposed Hadoop Processing
       • Diagram labels: Web Server Logs; HDFS; Data Cleansing (MapReduce); DW; Web Server; Web Servers
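As a rough illustration of what moving the cleansing step into a map task could look like, here is a sketch of a map-only cleansing function in Python. Everything about the record layout is an assumption for the example: the slides do not describe the log format, so the four tab-separated fields (timestamp, session id, URL, referrer) and the validation rules are hypothetical.

```python
# Assumption: each raw log line has four tab-separated fields:
# timestamp, session_id, url, referrer (referrer may be empty).
EXPECTED_FIELDS = 4


def clean_record(line):
    """Return a normalized tab-separated record, or None if the line is unusable."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line: wrong number of fields
    timestamp, session_id, url, referrer = fields
    if not timestamp or not session_id or not url:
        return None  # required field missing
    return "\t".join([timestamp, session_id, url, referrer or "-"])


def clean_lines(lines):
    """Yield only the records that survive cleansing."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            yield cleaned


raw = [
    "2011-09-14T10:00:00\ts1\t/hotels/search\thttp://example.com/ad\n",
    "garbage line\n",
    "2011-09-14T10:00:05\t\t/hotels/42\t\n",
    "2011-09-14T10:00:09\ts2\t/hotels/7\t\n",
]
cleaned = list(clean_lines(raw))
print(len(cleaned), "of", len(raw), "records kept")
```

With Hadoop Streaming, a function like this could be driven from standard input and output, for example by printing each result of `clean_lines(sys.stdin)`. Only the reduced, cleansed output would then need to be loaded into the data warehouse.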
    24. Lessons Learned
       • Expect organizational resistance from unanticipated directions.
       • Advice for finding big data developers:
           • Don’t bother.
           • Instead, train smart and motivated internal resources or new hires.
       • But get help if you need it.
           • There are a number of experienced providers who can help you get started.
    25. Lessons Learned
       • The Hadoop market is still immature, but growing quickly. Better tools are on the way.
           • Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
       • Use the appropriate tool based on requirements. Treat Hadoop as a complement to, not a replacement for, traditional data stores.
    26. Lessons Learned
       • Work closely with your existing data management teams.
           • Your idea of what constitutes “big data” might quickly diverge from theirs.
       • The flip side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.
    27. In the Near Future…
       • Production cluster capacity increase:
           • ~500TB of raw storage.
       • Further integration with the data warehouse.
       • Deployment of analysis and reporting tools on top of Hadoop.
    28. Web Analytics and Big Data
    29. What is Web Analytics?
       • Understand the impact and economic value of the website
       • Rigorous outcome analysis
       • Passion for customer centricity by embracing voice-of-customer initiatives
       • Fail faster by leveraging the power of experimentation (MVT)
    30. Challenges
       • Site Analytics
           • Lack of multi-dimensional capabilities
           • Hard to find the right insight
           • Heavy investment in the tools
           • Precision vs. direction
    31. Challenges, continued…
       • Big Data
           • No data unification or uniform platform across organizations and business units
           • No easy data extraction capabilities
               • Business
           • Distinction between reporting and testing (MVT)
           • Minimal measurement of outcomes
    32. Data Categories
       • Traffic acquisition
       • Marketing optimization
       • User engagement
       • Ad optimization
       • User behaviour
    33. Web Analytics & Big Data
       • OWW generates a couple million air and hotel searches every day.
       • Massive amounts of data: hundreds of GB of log data per day.
       • Expensive and difficult to store and process this data using existing data infrastructure.
    34. Processing of Web Analytics Data
    35. Aggregating data into Data Warehouse
    36. Data Analysis Jobs
       • Traffic source and campaign activities
       • Daily jobs, weekly analysis
       • MapReduce job
           • ~20 minutes for one day of raw logs
           • ~3 minutes to load to Hive tables
           • Generates more than 25 million records for a month
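A traffic source and campaign job of the kind listed above boils down to extracting a (source, campaign) key from each visit and counting. The sketch below shows that aggregation on a handful of sample URLs. The query parameter names `src` and `cmp`, the example URLs, and the `direct`/`none` defaults are all hypothetical; the slides do not say how sources and campaigns were encoded.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def traffic_key(landing_url):
    """Extract a (source, campaign) pair from a landing URL's query string."""
    query = parse_qs(urlparse(landing_url).query)
    # Hypothetical parameter names; fall back to defaults when absent.
    source = query.get("src", ["direct"])[0]
    campaign = query.get("cmp", ["none"])[0]
    return source, campaign


def summarize(landing_urls):
    """Count visits per (source, campaign) pair."""
    return Counter(traffic_key(url) for url in landing_urls)


sample_urls = [
    "http://www.example.com/hotels?src=search&cmp=summer",
    "http://www.example.com/hotels?src=search&cmp=summer",
    "http://www.example.com/flights?src=email&cmp=newsletter",
    "http://www.example.com/",
]
daily_counts = summarize(sample_urls)
for (source, campaign), visits in daily_counts.most_common():
    print(source, campaign, visits)
```

In a MapReduce job the key extraction would run in the map step and the counting in the reduce step, with the per-day output loaded into Hive tables for the weekly analysis.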
    37. Business Insights
    38. Centralized Decentralization: Web Analytics team + SEO team + Hotel optimization team
    39. Model for success
       • Measure the performance of your feature and fail fast
       • Experimentation and testing should be ingrained into every key feature.
       • Break down into smaller chunks of data extraction
    40. Should everyone do this?
       • Do you have the technology strength to invest in and use Big Data?
       • Analytics using Big Data comes with a price (resources, time)
       • Big Data mining != analysis
       • Key data warehouse challenges still exist (time, data validity)
    41. Questions?