Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data
Integrating Hadoop into an Enterprise Data Infrastructure
Raghu Kashyap and Jonathan Seidman
Gartner Peer Forum, September 14, 2011
Who We Are
Director, Web Analytics
Lead Engineer, Business Intelligence/Big Data Team
Co-founder/organizer of Chicago Hadoop User Group http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ and Chicago Big Data http://www.meetup.com/Chicago-Big-Data/
Launched in 2001, Chicago, IL
Over 160 million bookings
What is Hadoop?
Open source software that supports the storage and analysis of extremely large volumes of data – typically terabytes to petabytes.
Two primary components:
Hadoop Distributed File System (HDFS) provides economical, reliable, fault-tolerant, and scalable storage of very large datasets across machines in a cluster.
MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel.
Hadoop allows us to store and process data that was previously impractical because of cost, technical issues, etc., and places no constraints on how that data is processed.
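To make the MapReduce programming model concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not taken from the presentation: the record layout (user, destination, date), the sample data, and the function names are all hypothetical. It only illustrates the three phases of the model: map each record to key/value pairs, group the values by key, and reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_search(record):
    """Map step: emit (destination, 1) for one search log record."""
    # Hypothetical tab-separated layout: user_id, destination, date
    fields = record.split("\t")
    yield (fields[1], 1)


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one key."""
    return (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())


# Made-up sample records, for illustration only
records = [
    "u1\tChicago\t2011-09-01",
    "u2\tParis\t2011-09-01",
    "u3\tChicago\t2011-09-02",
]
result = run_mapreduce(records, map_search, reduce_counts)
print(result)  # {'Chicago': 2, 'Paris': 1}
```

On a real cluster the map and reduce functions run in parallel on many machines and the framework performs the grouping step; the sketch keeps everything in memory so that the data flow is easy to follow.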
Chart: $ per TB
Why We Started Using Hadoop
Optimizing hotel search…
In 2009, the Machine Learning team was formed to improve site performance, for example by improving hotel search results.
This required access to large volumes of behavioral data for analysis.
The only archive of the required data went back about two weeks.
Diagram: Transactional Data (e.g. bookings), Data Warehouse, Non-transactional Data (e.g. searches)
Hadoop Was Selected as a Solution…
Diagram: Transactional Data (e.g. bookings), Data Warehouse, Non-Transactional Data (e.g. searches), Hadoop
We faced organizational resistance to deploying Hadoop.
Not from management, but from other technical teams.
It took persistence to convince them that we needed to introduce a new hardware spec to support Hadoop.
Current Big Data Infrastructure
Diagram components:
Hadoop: HDFS, MapReduce
MapReduce Jobs (Java, Python, R/RHIPE)
Analytic Tools (Hive, Pig)
Data Warehouse (Greenplum)
psql, gpload, Sqoop
External Analytical Jobs (Java, R, etc.)
Aggregated Data (appears twice in the diagram)
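The infrastructure slide labels some of its data flows "Aggregated Data". As one possible reading, raw events are summarized in Hadoop and only the summaries move on to the warehouse or to external jobs. The Python sketch below illustrates that idea under stated assumptions: the (date, destination) event layout, the sample data, and the function names are hypothetical, and the actual load step (for example with gpload or Sqoop) is not shown.

```python
import csv
import io
from collections import Counter


def aggregate_daily_searches(rows):
    """Roll raw (date, destination) search events up to daily counts."""
    return Counter((date, dest) for date, dest in rows)


def to_delimited(counts):
    """Serialize the aggregates as tab-delimited text, sorted for stable output."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for (date, dest), n in sorted(counts.items()):
        writer.writerow([date, dest, n])
    return buf.getvalue()


# Made-up raw events, for illustration only
raw = [
    ("2011-09-01", "Chicago"),
    ("2011-09-01", "Chicago"),
    ("2011-09-02", "Paris"),
]
daily = aggregate_daily_searches(raw)
text = to_delimited(daily)
print(text, end="")
```

Shipping aggregates rather than raw events keeps the volume loaded into the warehouse small, while the full detail remains available in HDFS for later analysis.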