Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz was started in 1999, and the Orbitz site launched in 2001.
You hear some benefits of Hadoop so many times that they almost become cliché, but based on our experience at Orbitz they’ve proven to be true, so they bear repeating.
On Orbitz alone we do millions of searches and transactions daily. All of this activity leads to extremely large volumes of data – hundreds of GB/day. Not all of this data has value – much of it’s logged for historic reasons and is no longer useful, but much of it is valuable. In addition, there’s more data that we’re not currently capturing that we know has value.
This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the DW vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical before for economic and technical reasons.
Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way you’re putting constraints around how the data is accessed and processed. With Hadoop each application can process the raw data in whatever way is required.
Our data warehouse contains a full archive of all transactions – every booking, refund, cancellation etc. Much valuable non-transactional data was just thrown away because it was uneconomical to store and didn’t necessarily have clear value.
Hadoop was deployed late 2009/early 2010 to begin collecting this non-transactional data. Orbitz has been using CDH for that entire period with great success. Much of this non-transactional data is contained in web analytics logs.
Having access to this data allows us to perform processing and analyses not previously possible.
Hadoop was first used to facilitate the machine learning team’s work. This team needed access to large amounts of data on user interaction in order to do things like optimize hotel ranking and show consumers hotels more closely matching their preferences.
Hadoop is used to crunch data for input to a system to recommend products to users.
Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources.
Hadoop collects and processes data for input to analyses to optimize cache performance.
Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices than other users do.
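A segment comparison like the one in the chart comes down to grouping clicked-hotel prices by segment and computing a mean and median for each group. The following is a minimal sketch of that calculation only. The records, field layout, and function name are hypothetical and made up for illustration; they are not Orbitz data and not the code behind the chart.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical click records: (browser segment, price of the clicked hotel in USD).
# These values are invented for illustration only.
clicks = [
    ("Safari", 220.0),
    ("Safari", 180.0),
    ("Safari", 260.0),
    ("Other", 120.0),
    ("Other", 150.0),
    ("Other", 90.0),
]

def price_stats_by_segment(records):
    """Group clicked-hotel prices by segment and return mean and median per segment."""
    prices = defaultdict(list)
    for segment, price in records:
        prices[segment].append(price)
    return {
        segment: {"mean": mean(values), "median": median(values)}
        for segment, values in prices.items()
    }

stats = price_stats_by_segment(clicks)
for segment, s in sorted(stats.items()):
    print(f"{segment}: mean={s['mean']:.2f} median={s['median']:.2f}")
```

At Hadoop scale the grouping step would be done by the cluster rather than in memory, but the per-segment statistics are the same.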
MAD: acronym for magnetic, agile, and deep. Agile: ability to quickly integrate new data sources. Deep: able to perform sophisticated analyses.
This would facilitate access to all of our data through standard BI tools. In addition, most of our BI developers, not to mention users, work in SQL, ETL, etc.; they are not Java developers and won’t be writing MapReduce jobs. We haven’t yet achieved this data warehouse nirvana.
QlikView is used extensively for reporting at Orbitz. Although QlikView is working on enhancements to facilitate integration with tools such as Hadoop, there’s no direct integration. This is understandable since QlikView uses an in-memory model, which presents a challenge when dealing with Hadoop-sized data. We can, however, use Hadoop to summarize data for export to QlikView.
This provides an example of a typical processing flow for the large volumes of non-transactional data we’re collecting. This processing allows us to convert large volumes of unstructured data into structured data that can be queried, extracted, etc. for further processing.
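To make the unstructured-to-structured step concrete, here is a minimal sketch of turning one raw log line into a flat record. The log layout (a timestamp followed by space-separated key=value pairs), the field names, and the function name are all assumptions made for illustration; the actual Orbitz log format is not described in this talk.

```python
import re
from typing import Optional

# Hypothetical log layout: a timestamp followed by space-separated key=value pairs.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")

def parse_log_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a flat record, or None if it cannot be parsed."""
    parts = line.strip().split(" ", 1)
    if len(parts) != 2:
        return None  # no fields after the timestamp
    timestamp, rest = parts
    fields = dict(KV_PATTERN.findall(rest))
    if not fields:
        return None  # nothing that looks like key=value
    return {"timestamp": timestamp, **fields}

# Invented sample line, for illustration only.
raw = "2011-04-26T10:15:30 session=abc123 action=hotel_click hotel_id=4567 price=189.00"
record = parse_log_line(raw)
print(record)
```

Once each line is a record with named fields, it can be loaded into a table, filtered, or aggregated by downstream jobs.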
This type of processing also allows us to summarize large volumes of data into a data set that can be exported to the data warehouse, allowing us to query and report on that data using all of our standard BI tools.
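The summarize-then-export pattern can be sketched as a small map, group, and reduce pipeline that collapses many detail records into a few summary rows. This is an in-memory illustration of the idea only, not a Hadoop job. The record fields, the choice of counting actions per day, and the CSV export format are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical detail records, invented for illustration.
records = [
    {"date": "2011-04-25", "action": "search"},
    {"date": "2011-04-25", "action": "search"},
    {"date": "2011-04-25", "action": "hotel_click"},
    {"date": "2011-04-26", "action": "search"},
]

def map_record(record):
    """Map step: emit a ((date, action), 1) pair for each record."""
    yield (record["date"], record["action"]), 1

def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)

def summarize(detail_records):
    """Group mapped pairs by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for record in detail_records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return [reduce_counts(key, values) for key, values in sorted(grouped.items())]

def to_csv(summary_rows):
    """Render the summary as CSV text, a format most warehouses and BI tools can load."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "action", "count"])
    for (date, action), count in summary_rows:
        writer.writerow([date, action, count])
    return buffer.getvalue()

summary = summarize(records)
print(to_csv(summary))
```

The point of the pattern is that only the small summary table leaves the cluster; the large detail data stays where it is cheap to store and process.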
Still being implemented, but a good example of how Hadoop allows us to offload time and resource intensive processing from the data warehouse.
Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure. Further downstream processing is done to generate final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could be used for reports, queries, etc.
The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce moves the “heavy lifting” of processing the relatively large data sets to Hadoop, and will allow us to take advantage of Hadoop’s efficiencies and greatly speed up the processing.
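As a rough sketch of what a cleansing step looks like when expressed as a map function, the code below drops malformed click records and normalizes the rest. The tab-separated layout, the four field names, and the cleansing rules are hypothetical; the talk does not describe what the stored procedure or its MapReduce replacement actually does.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical layout: timestamp, session_id, url, marketing_code (tab-separated).
EXPECTED_FIELDS = 4

def clean_line(line: str) -> Optional[str]:
    """Return a normalized record, or None if the line should be discarded."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # wrong number of columns
    timestamp, session_id, url, marketing_code = fields
    if not timestamp or not session_id or not url:
        return None  # required field is missing
    return "\t".join([timestamp, session_id, url, marketing_code.upper()])

def run_mapper(lines: Iterable[str]) -> Iterator[str]:
    """Apply clean_line to every input line and emit only the records that survive."""
    for line in lines:
        cleaned_line = clean_line(line)
        if cleaned_line is not None:
            yield cleaned_line

# Invented sample input, for illustration only.
sample = [
    "2011-04-26T10:15:30\tabc123\t/hotels/chicago\tem_spring\n",
    "malformed line without tabs\n",
    "2011-04-26T10:15:31\t\t/hotels/chicago\tem_spring\n",
    "2011-04-26T10:15:32\tdef456\t /flights \tsem_brand\n",
]
cleaned = list(run_mapper(sample))
for row in cleaned:
    print(row)
```

Because each line is cleaned independently of every other line, this kind of work splits naturally across many machines. With Hadoop Streaming, for instance, a script of this shape could read its input lines from standard input and write the cleaned records to standard output.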
Transcript of "Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop"
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster and Jonathan Seidman Chicago Data Summit April 26 | 2011
Who We Are <ul><li>Robert Lancaster </li></ul><ul><ul><li>Solutions Architect, Hotel Supply Team </li></ul></ul><ul><ul><li>@rob1lancaster </li></ul></ul><ul><li>Jonathan Seidman </li></ul><ul><ul><li>Lead Engineer, Business Intelligence/Big Data Team </li></ul></ul><ul><ul><li>Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) </li></ul></ul><ul><ul><li>@jseidman </li></ul></ul>
All of this is great, but… <ul><li>Most of these efforts are driven by development teams. </li></ul><ul><li>The challenge now is to unlock the value in this data by making it more available to the rest of the organization. </li></ul>
“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”* *MAD Skills: New Analysis Practices for Big Data
Click Data Processing – Current DW Processing [Diagram: Web Servers → Web Server Logs → ETL → DW → Data Cleansing (Stored procedure) → DW. Annotations: 3 hours, 2 hours, ~20% original data size.]
Click Data Processing – New Hadoop Processing [Diagram: Web Servers → Web Server Logs → HDFS → Data Cleansing (MapReduce) → DW.]
Conclusions <ul><li>Market is still immature, but Hadoop has already become a valuable business intelligence tool, and will become an increasingly important part of a BI infrastructure. </li></ul><ul><li>Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure. </li></ul><ul><li>Use Hadoop to offload the time and resource intensive processing of large data sets so you can free up your data warehouse to serve user needs. </li></ul><ul><li>The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility. </li></ul>
Oh, and also… <ul><li>Orbitz is looking for a Lead Engineer for the BI/Big Data team. </li></ul><ul><li>Go to http://careers.orbitz.com/ and search for IRC19035. </li></ul>
References <ul><li>Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, MAD Skills: New Analysis Practices for Big Data, 2009 </li></ul>