Hadoop is open source software that supports the storage and analysis of extremely large volumes of data, typically terabytes to petabytes.
Two primary components:
Hadoop Distributed File System (HDFS) provides economical, reliable, fault-tolerant, and scalable storage of very large datasets across the machines in a cluster.
MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel (see the sketch below).
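To make the model concrete, here is a minimal sketch of MapReduce word counting in Python. It simulates the map, shuffle/sort, and reduce phases in a single process; on a real cluster, Hadoop runs the map and reduce tasks in parallel across machines and reads its input from HDFS. The function names and local driver are illustrative, not part of any Hadoop API.

```python
# A minimal local sketch of the MapReduce model: word count.
# On a cluster, Hadoop distributes map and reduce tasks across
# machines and handles the shuffle/sort between them.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for the same word.
    return (word, sum(counts))

def run(lines):
    # Shuffle/sort: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            groups[word].append(count)
    return [reduce_phase(w, c) for w, c in sorted(groups.items())]

print(run(["hadoop stores data", "hadoop processes data"]))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```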
Hadoop allows us to store and process data that was previously impractical to keep because of cost and technical limitations, and it places no constraints on how that data is processed.
[Chart: storage cost in $ per TB]
Why We Started Using Hadoop
In 2009, the Machine Learning team was formed to improve site performance, for example by improving hotel search results.
This required access to large volumes of behavioral data for analysis.
The only archive of the required data went back about two weeks.
[Diagram: transactional data (e.g. bookings) flows into the Data Warehouse; non-transactional data (e.g. searches) does not]
Hadoop Was Selected as a Solution…
[Diagram: transactional data (e.g. bookings) still flows into the Data Warehouse; non-transactional data (e.g. searches) now flows into Hadoop]
We faced organizational resistance to deploying Hadoop.
Not from management, but from other technical teams.
It took persistence to convince them that a new hardware spec was needed to support Hadoop.
Current Big Data Infrastructure
[Diagram: Hadoop (HDFS + MapReduce) runs MapReduce jobs (Java, Python, R/RHIPE) and analytic tools (Hive, Pig); aggregated data is loaded into the Data Warehouse (Greenplum) via psql, gpload, and Sqoop; external analytical jobs (Java, R, etc.) also consume the aggregated data]
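As one concrete illustration of the hand-off in the diagram above, the sketch below exports aggregated MapReduce output from HDFS into a warehouse table with Sqoop. The JDBC connection string, table name, and HDFS path are hypothetical placeholders; only the sqoop export flags themselves are real, and on this stack gpload or psql would be the alternative bulk-load paths into Greenplum.

```python
# Hypothetical sketch: export aggregated job output from HDFS into
# the warehouse using Sqoop. Host, database, table, and paths are
# made up for illustration.
import subprocess

export_cmd = [
    "sqoop", "export",
    "--connect", "jdbc:postgresql://warehouse.example.com/analytics",  # hypothetical warehouse
    "--table", "daily_search_counts",          # hypothetical target table
    "--export-dir", "/user/ml/searches/agg",   # HDFS dir holding the aggregated rows
    "--input-fields-terminated-by", "\t",      # job output is tab-delimited
]
subprocess.run(export_cmd, check=True)
```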