Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Published in: Technology, Business

  • Most people think of the Orbitz site, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, CheapTickets, The Away Network, ebookers and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999, and the Orbitz site launched in 2001.
  • A couple of years ago when I mentioned Hadoop I’d often get blank stares, even from developers. I think most folks now are at least aware of what Hadoop is.
  • This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the data warehouse (DW) vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical before for economic and technical reasons. Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way, you’re putting constraints around how the data is accessed and processed. With Hadoop, each application can process the raw data in whatever way is required. If you decide you need to analyze different attributes, you just run a new query.
  • The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
  • Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
  • Management was supportive of anything that facilitated the ML team’s efforts. But when we presented a hardware spec for servers with local, non-RAID storage, etc., systems engineering offered us blades with attached storage.
  • Hadoop is used to crunch data for input to a system that recommends products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front-end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources. Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices than other users do. This is just a handful of examples of how Hadoop is driving business value.
  • Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
  • Prior to 2011, Hadoop responsibilities were split across technology teams. Moving them under a single team centralized responsibility and resources for Hadoop.
  • Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing generates the final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could otherwise be used for reports, queries, etc.
  • The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the “heavy lifting” of processing the relatively large data sets to Hadoop, which will allow us to take advantage of Hadoop’s efficiencies and greatly speed up the processing.
  • The bad news is that we need to significantly increase the number of servers in our cluster; the good news is that this is because teams are using Hadoop and new projects are coming online.
  • Transcript

    • 1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure. Raghu Kashyap and Jonathan Seidman. Gartner Peer Forum, September 14, 2011
    • 2. Who We Are
      • Raghu Kashyap
        • Director, Web Analytics
        • [email_address]
        • @ragskashyap
      • Jonathan Seidman
        • Lead Engineer, Business Intelligence/Big Data Team
        • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
        • [email_address]
        • @jseidman
    • 3. Launched in 2001, Chicago, IL. Over 160 million bookings.
    • 4. What is Hadoop?
      • Open source software that supports the storage and analysis of extremely large volumes of data – typically terabytes to petabytes.
      • Two primary components:
        • Hadoop Distributed File System (HDFS) provides economical, reliable, fault tolerant and scalable storage of very large datasets across machines in a cluster.
        • MapReduce is a programming model for efficient distributed processing. Designed to reliably perform computations on large volumes of data in parallel.
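
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It does not use Hadoop's API; the record layout `(hotel_id, event)`, the function names and the sample data are all assumptions made up for illustration. On a real cluster, the framework runs the map and reduce functions in parallel across machines and performs the grouping between them.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (hotel_id, 1) for each click event; skip every other event."""
    hotel_id, event = record  # hypothetical record layout
    if event == "click":
        yield hotel_id, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the per-key counts emitted by the map phase."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence in one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up sample events, not real data.
sample = [("h1", "click"), ("h2", "view"), ("h1", "click"), ("h3", "click")]
click_counts = run_job(sample)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, the work can be split across many machines without the functions needing to know about each other.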
    • 5. Why Hadoop?
      • Hadoop allows us to store and process data that was previously impractical because of cost, technical issues, etc., and places no constraints on how that data is processed.
      [Chart: cost ($ per TB), data warehouse vs. Hadoop]
    • 6. Why We Started Using Hadoop: Optimizing hotel search…
    • 7. Why We Started Using Hadoop
      • In 2009, the Machine Learning team was formed to improve site performance. For example, improving hotel search results.
      • This required access to large volumes of behavioral data for analysis.
    • 8. The Problem…
      • The only archive of the required data went back about two weeks.
      [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches)]
    • 9. Hadoop Was Selected as a Solution…
      [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches); Hadoop]
    • 10. Unfortunately…
      • We faced organizational resistance to deploying Hadoop.
        • Not from management, but from other technical teams.
      • Required persistence to convince them that we needed to introduce a new hardware spec to support Hadoop.
    • 11. Current Big Data Infrastructure
      [Diagram labels: Hadoop (MapReduce, HDFS); MapReduce Jobs (Java, Python, R/RHIPE); Analytic Tools (Hive, Pig); Data Warehouse (Greenplum); psql, gpload, Sqoop; External Analytical Jobs (Java, R, etc.); Aggregated Data]
    • 12. Hadoop Architecture Details
      • Production cluster
        • About 200TB of raw storage
        • 336 (physical) cores
        • 672GB RAM
        • 4 client nodes (Hive, ad-hoc jobs, scheduled jobs, etc.)
      • Development cluster for user testing
      • Test cluster for testing upgrades, new software, etc.
      • Cloudera CDH3
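
As a rough back-of-the-envelope reading of the production cluster figures above, the sketch below derives RAM per core and an estimate of usable storage. The replication factor of 3 is an assumption (it is the HDFS default; the deck does not say what the cluster uses), and the estimate ignores space needed for intermediate job output and the operating system.

```python
# Figures taken from the slide ("about 200TB" is approximate).
RAW_STORAGE_TB = 200
PHYSICAL_CORES = 336
RAM_GB = 672

# Assumption: HDFS default replication factor; not stated in the deck.
REPLICATION_FACTOR = 3

# Memory available per physical core.
ram_per_core_gb = RAM_GB / PHYSICAL_CORES

# Each block is stored REPLICATION_FACTOR times, so unique data capacity
# is roughly the raw capacity divided by the replication factor.
usable_storage_tb = RAW_STORAGE_TB / REPLICATION_FACTOR

print(f"RAM per core: {ram_per_core_gb:.1f} GB")
print(f"Approx. usable storage: {usable_storage_tb:.1f} TB")
```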
    • 13. Deploying Hadoop Enabled Multiple Applications…
    • 14. But Brought New Challenges…
      • Most of these efforts are driven by development teams.
      • The challenge now is unlocking the value of this data for non-technical users.
    • 15. In Early 2011…
      • Big Data team is formed under Business Intelligence team at Orbitz Worldwide.
      • Reflects the importance of big data to the future of the company.
      • Allows the Big Data team to work more closely with the data warehouse and BI teams.
      • We’re also evaluating tools to facilitate analysis of Hadoop data by the wider organization.
    • 16. Karmasphere Analyst
    • 17. Karmasphere Analyst
    • 18. Datameer Analytics Solution
    • 19. Datameer Analytics Solution
    • 20. Not to Mention Other BI Vendors…
    • 21. One More Use Case – Click Data Processing
      • Still under development, but a good example of how Hadoop can be used to complement an existing data warehouse.
    • 22. Click Data Processing – Current Data Warehouse Processing
      [Diagram labels: Web Servers; Web Server Logs; ETL; DW; Data Cleansing (Stored procedure); DW; 3 hours; 2 hours; ~20% original data size]
    • 23. Click Data Processing – Proposed Hadoop Processing
      [Diagram labels: Web Servers; Web Server Logs; HDFS; Data Cleansing (MapReduce); DW]
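
The cleansing step in the proposed flow could look something like the map-only sketch below. This is not the authors' job: the tab-separated layout (timestamp, session ID, URL, referrer) and the cleansing rules are hypothetical, chosen only to show the shape of a record-at-a-time cleanser.

```python
EXPECTED_FIELDS = 4  # hypothetical layout: timestamp, session_id, url, referrer


def clean_line(line):
    """Return a normalized record, or None if the line should be dropped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line: wrong number of columns
    timestamp, session_id, url, referrer = (f.strip() for f in fields)
    if not timestamp or not session_id or not url:
        return None  # required value missing
    # Replace an empty referrer with a placeholder so every record has 4 values.
    return "\t".join([timestamp, session_id, url, referrer or "-"])


def clean_lines(lines):
    """Yield cleaned records, silently skipping lines that fail validation."""
    for line in lines:
        record = clean_line(line)
        if record is not None:
            yield record
```

Under Hadoop Streaming, which passes input records to a mapper on standard input and collects its standard output, a script like this would apply `clean_lines` to `sys.stdin` and print each result. Since every line is handled independently, no reduce step is needed for this kind of cleansing.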
    • 24. Lessons Learned
      • Expect organizational resistance from unanticipated directions.
      • Advice for finding big data developers:
        • Don’t bother.
        • Instead, train smart and motivated internal resources or new hires.
      • But get help if you need it.
        • There are a number of experienced providers who can help you get started.
    • 25. Lessons Learned
      • Hadoop market is still immature, but growing quickly. Better tools are on the way.
        • Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
      • Use the appropriate tool based on requirements. Treat Hadoop as a complement, not replacement, to traditional data stores.
    • 26. Lessons Learned
      • Work closely with your existing data management teams.
        • Your idea of what constitutes “big data” might quickly diverge from theirs.
      • The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.
    • 27. In the Near Future…
      • Production cluster capacity increase:
        • ~500TB of raw storage.
      • Further integration with the data warehouse.
      • Deployment of analysis and reporting tools on top of Hadoop.
    • 28. Web Analytics and Big Data
    • 29. What is Web Analytics?
      • Understand the impact and economic value of the website
      • Rigorous outcome analysis
      • Passion for customer centricity by embracing voice-of-customer initiatives
      • Fail faster by leveraging the power of experimentation (MVT)
    • 30. Challenges
      • Site Analytics
        • Lack of multi-dimensional capabilities
        • Hard to find the right insight
        • Heavy investment in tools
        • Precision vs Direction
    • 31. Challenges (continued)
      • Big Data
        • No data unification or uniform platform across organizations and business units
        • No easy data extraction capabilities
      • Business
        • Distinction between reporting and testing (MVT)
        • Minimal measurement of outcomes
    • 32. Data Categories
      • Traffic acquisition
      • Marketing optimization
      • User engagement
      • Ad optimization
      • User behaviour
    • 33. Web Analytics & Big Data
      • OWW generates a couple million air and hotel searches every day.
      • Massive amounts of data. Hundreds of GB of log data per day.
      • Expensive and difficult to store and process this data using existing data infrastructure.
    • 34. Processing of Web Analytics Data
    • 35. Aggregating data into Data Warehouse
    • 36. Data Analysis Jobs
      • Traffic Source and Campaign activities
      • Daily jobs, Weekly analysis
      • MapReduce job
        • ~20 minutes for one day of raw logs
        • ~3 minutes to load into Hive tables
        • Generates more than 25 million records for a month
    • 37. Business Insights
    • 38. Centralized Decentralization Web Analytics team + SEO team + Hotel optimization team
    • 39. Model for success
      • Measure the performance of your feature and fail fast
      • Experimentation and testing should be ingrained into every key feature.
      • Break down into smaller chunks of data extraction
    • 40. Should everyone do this?
      • Do you have the Technology strength to invest and use Big Data?
      • Analytics using Big Data comes with a price (resource, time)
      • Big Data mining != analysis
      • Key Data warehouse challenges still exist (time, data validity)
    • 41. Questions?