Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011

Published in: Technology, Business
  • Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999; the Orbitz site launched in 2001.
  • A couple of years ago when I mentioned Hadoop I’d often get blank stares, even from developers. I think most folks now are at least aware of what Hadoop is.
  • This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the DW vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical before for economic and technical reasons. Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way, you’re putting constraints around how the data is accessed and processed. With Hadoop, each application can process the raw data in whatever way is required. If you decide you need to analyze different attributes, you just run a new query.
  • The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
  • Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
  • Management was supportive of anything that facilitated ML team efforts. But when we presented a hardware spec for servers with local non-RAID storage, etc., syseng offered us blades with attached storage.
  • Hadoop is used to crunch data for input to a system to recommend products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources. Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices than other users do. This is just a handful of examples of how Hadoop is driving business value.
  • Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
  • Prior to 2011, Hadoop responsibilities were split across technology teams. Moving under a single team centralized responsibility and resources for Hadoop.
  • This slide shows the processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing generates the final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could otherwise be used for reports, queries, etc.
  • The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the “heavy lifting” of processing the relatively large data sets to Hadoop, which will allow us to take advantage of Hadoop’s efficiencies and greatly speed up the processing.
  • The bad news is that we need to significantly increase the number of servers in our cluster. The good news is that this is because teams are using Hadoop and new projects are coming online.
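
The click-data notes above describe moving the cleansing step out of a warehouse stored procedure and into MapReduce. The following is a minimal sketch of what such a cleansing mapper might look like, written in Python in the style of a Hadoop Streaming mapper. The log layout (tab-separated timestamp, session ID, URL, referrer), the `campaign` query parameter and the cleansing rules are all assumptions made for illustration; the presentation does not describe the real log format or rules.

```python
import sys
from urllib.parse import urlsplit, parse_qs

# Hypothetical log layout: timestamp \t session_id \t url \t referrer
EXPECTED_FIELDS = 4


def clean_record(line):
    """Return a cleansed tab-separated record, or None if the line is unusable."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # drop malformed lines
    timestamp, session_id, url, _referrer = (f.strip() for f in fields)
    if not timestamp or not session_id or not url:
        return None  # drop lines missing a required field
    parts = urlsplit(url)
    # "campaign" is a hypothetical marketing parameter name.
    campaign = parse_qs(parts.query).get("campaign", ["none"])[0]
    return "\t".join([timestamp, session_id, parts.path, campaign])


def run_mapper(lines, out=sys.stdout):
    """Write one cleansed record per usable input line."""
    for line in lines:
        record = clean_record(line)
        if record is not None:
            out.write(record + "\n")
```

Under Hadoop Streaming, a script like this would typically call `run_mapper(sys.stdin)` and be passed to the job as the mapper, so the filtering runs in parallel across the cluster rather than inside the data warehouse.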
  • Transcript of "Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011"

    1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure. Raghu Kashyap and Jonathan Seidman. Gartner Peer Forum, September 14, 2011
    2. Who We Are <ul><li>Raghu Kashyap </li></ul><ul><ul><li>Director, Web Analytics </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>@ragskashyap </li></ul></ul><ul><ul><li>http://kashyaps.com </li></ul></ul><ul><li>Jonathan Seidman </li></ul><ul><ul><li>Lead Engineer, Business Intelligence/Big Data Team </li></ul></ul><ul><ul><li>Co-founder/organizer of Chicago Hadoop User Group http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ and Chicago Big Data http://www.meetup.com/Chicago-Big-Data/ </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>@jseidman </li></ul></ul>
    3. Launched in 2001, Chicago, IL. Over 160 million bookings.
    4. What is Hadoop? <ul><li>Open source software that supports the storage and analysis of extremely large volumes of data – typically terabytes to petabytes. </li></ul><ul><li>Two primary components: </li></ul><ul><ul><li>Hadoop Distributed File System (HDFS) provides economical, reliable, fault-tolerant and scalable storage of very large datasets across machines in a cluster. </li></ul></ul><ul><ul><li>MapReduce is a programming model for efficient distributed processing. It is designed to reliably perform computations on large volumes of data in parallel. </li></ul></ul>
    5. Why Hadoop? <ul><li>Hadoop allows us to store and process data that was previously impractical because of cost, technical issues, etc., and places no constraints on how that data is processed. </li></ul> [Chart: $ per TB]
    6. Why We Started Using Hadoop: Optimizing hotel search…
    7. Why We Started Using Hadoop <ul><li>In 2009, the Machine Learning team was formed to improve site performance, for example by improving hotel search results. </li></ul><ul><li>This required access to large volumes of behavioral data for analysis. </li></ul>
    8. The Problem… <ul><li>The only archive of the required data went back about two weeks. </li></ul> [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-transactional Data (e.g. searches)]
    9. Hadoop Was Selected as a Solution… [Diagram labels: Transactional Data (e.g. bookings); Data Warehouse; Non-Transactional Data (e.g. searches); Hadoop]
    10. Unfortunately… <ul><li>We faced organizational resistance to deploying Hadoop. </li></ul><ul><ul><li>Not from management, but from other technical teams. </li></ul></ul><ul><li>Required persistence to convince them that we needed to introduce a new hardware spec to support Hadoop. </li></ul>
    11. Current Big Data Infrastructure [Diagram labels: Hadoop; MapReduce; HDFS; MapReduce Jobs (Java, Python, R/RHIPE); Analytic Tools (Hive, Pig); Data Warehouse (Greenplum); psql, gpload, Sqoop; External Analytical Jobs (Java, R, etc.); Aggregated Data; Aggregated Data]
    12. Hadoop Architecture Details <ul><li>Production cluster </li></ul><ul><ul><li>About 200TB of raw storage </li></ul></ul><ul><ul><li>336 (physical) cores </li></ul></ul><ul><ul><li>672GB RAM </li></ul></ul><ul><ul><li>4 client nodes (Hive, ad-hoc jobs, scheduled jobs, etc.) </li></ul></ul><ul><li>Development cluster for user testing </li></ul><ul><li>Test cluster for testing upgrades, new software, etc. </li></ul><ul><li>Cloudera CDH3 </li></ul>
    13. Deploying Hadoop Enabled Multiple Applications…
    14. But Brought New Challenges… <ul><li>Most of these efforts are driven by development teams. </li></ul><ul><li>The challenge now is unlocking the value of this data for non-technical users. </li></ul>
    15. In Early 2011… <ul><li>Big Data team is formed under Business Intelligence team at Orbitz Worldwide. </li></ul><ul><li>Reflects the importance of big data to the future of the company. </li></ul><ul><li>Allows the Big Data team to work more closely with the data warehouse and BI teams. </li></ul><ul><li>We’re also evaluating tools to facilitate analysis of Hadoop data by the wider organization. </li></ul>
    16. Karmasphere Analyst
    17. Karmasphere Analyst
    18. Datameer Analytics Solution
    19. Datameer Analytics Solution
    20. Not to Mention Other BI Vendors…
    21. One More Use Case – Click Data Processing <ul><li>Still under development, but a good example of how Hadoop can be used to complement an existing data warehouse. </li></ul>
    22. Click Data Processing – Current Data Warehouse Processing [Diagram labels: Web Servers; Web Server Logs; ETL; DW; Data Cleansing (stored procedure); DW; 3 hours; 2 hours; ~20% original data size]
    23. Click Data Processing – Proposed Hadoop Processing [Diagram labels: Web Servers; Web Server Logs; HDFS; Data Cleansing (MapReduce); DW]
    24. Lessons Learned <ul><li>Expect organizational resistance from unanticipated directions. </li></ul><ul><li>Advice for finding big data developers: </li></ul><ul><ul><li>Don’t bother. </li></ul></ul><ul><ul><li>Instead, train smart and motivated internal resources or new hires. </li></ul></ul><ul><li>But get help if you need it. </li></ul><ul><ul><li>There are a number of experienced providers who can help you get started. </li></ul></ul>
    25. Lessons Learned <ul><li>Hadoop market is still immature, but growing quickly. Better tools are on the way. </li></ul><ul><ul><li>Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups. </li></ul></ul><ul><li>Use the appropriate tool based on requirements. Treat Hadoop as a complement to, not a replacement for, traditional data stores. </li></ul>
    26. Lessons Learned <ul><li>Work closely with your existing data management teams. </li></ul><ul><ul><li>Your idea of what constitutes “big data” might quickly diverge from theirs. </li></ul></ul><ul><li>The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse. </li></ul>
    27. In the Near Future… <ul><li>Production cluster capacity increase: </li></ul><ul><ul><li>~500TB of raw storage. </li></ul></ul><ul><li>Further integration with the data warehouse. </li></ul><ul><li>Deployment of analysis and reporting tools on top of Hadoop. </li></ul>
    28. <ul><li>Web Analytics and Big Data </li></ul>
    29. What is Web Analytics? <ul><li>Understand the impact and economic value of the website </li></ul><ul><li>Rigorous outcome analysis </li></ul><ul><li>Passion for customer centricity by embracing voice-of-customer initiatives </li></ul><ul><li>Fail faster by leveraging the power of experimentation (MVT) </li></ul>
    30. Challenges <ul><li>Site Analytics </li></ul><ul><ul><li>Lack of multi-dimensional capabilities </li></ul></ul><ul><ul><li>Hard to find the right insight </li></ul></ul><ul><ul><li>Heavy investment in the tools </li></ul></ul><ul><ul><li>Precision vs Direction </li></ul></ul>
    31. Challenges (continued) <ul><li>Big Data </li></ul><ul><ul><li>No data unification or uniform platform across organizations and business units </li></ul></ul><ul><ul><li>No easy data extraction capabilities </li></ul></ul><ul><ul><ul><li>Business </li></ul></ul></ul><ul><ul><li>Distinction between reporting and testing (MVT) </li></ul></ul><ul><ul><li>Minimal measurement of outcomes </li></ul></ul>
    32. Data Categories <ul><li>Traffic acquisition </li></ul><ul><li>Marketing optimization </li></ul><ul><li>User engagement </li></ul><ul><li>Ad optimization </li></ul><ul><li>User behaviour </li></ul>
    33. Web Analytics & Big Data <ul><li>OWW generates a couple million air and hotel searches every day. </li></ul><ul><li>Massive amounts of data. Hundreds of GB of log data per day. </li></ul><ul><li>Expensive and difficult to store and process this data using existing data infrastructure. </li></ul>
    34. Processing of Web Analytics Data
    35. Aggregating data into Data Warehouse
    36. Data Analysis Jobs <ul><li>Traffic Source and Campaign activities </li></ul><ul><li>Daily jobs, Weekly analysis </li></ul><ul><li>MapReduce job </li></ul><ul><ul><li>~ 20 minutes for one day of raw logs </li></ul></ul><ul><ul><li>~ 3 minutes to load to Hive tables </li></ul></ul><ul><ul><li>Generates more than 25 million records for a month </li></ul></ul>
    37. Business Insights
    38. Centralized Decentralization: Web Analytics team + SEO team + Hotel optimization team
    39. Model for success <ul><li>Measure the performance of your feature and fail fast </li></ul><ul><li>Experimentation and testing should be ingrained into every key feature. </li></ul><ul><li>Break down into smaller chunks of data extraction </li></ul>
    40. Should everyone do this? <ul><li>Do you have the technology strength to invest in and use Big Data? </li></ul><ul><li>Analytics using Big Data comes with a price (resource, time) </li></ul><ul><li>Big Data mining != analysis </li></ul><ul><li>Key data warehouse challenges still exist (time, data validity) </li></ul>
    41. Questions? <ul><li>http://careers.orbitz.com </li></ul>
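
The MapReduce model described on slide 4, applied to the kind of traffic-source job mentioned on slide 36, can be sketched in plain Python. This is a single-process simulation of the map, shuffle and reduce phases, not Hadoop code and not the presenters' actual job. The record layout and the traffic-source names are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (traffic_source, 1) pair for one click record."""
    # Hypothetical record layout: (session_id, traffic_source)
    _session_id, source = record
    yield source, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts emitted for one traffic source."""
    return key, sum(values)


def count_by_source(records):
    """Run map, shuffle and reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls run in parallel on the nodes that hold the log blocks, and the shuffle moves each key's values to a single reducer. Here everything runs in one process so that the data flow is easy to follow.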
