
Introduction to Hadoop



An introduction to Hadoop presentation geared towards educating potential clients on Hadoop's capabilities.


  1. 1. Object Partners Inc. Introduction to Hadoop Demo by: Presented by: Nick Adelman Joel Crabb
  2. 2. Object Partners Inc. Agenda Ø Terminology Ø Why does Hadoop Exist? Ø HDFS and Hbase Ø Examples Ø Getting Started Ø Demo
  3. 3. Object Partners Inc. Terminology Ø Hadoop – Core set of technologies hosted by the Apache Software Foundation for storing and searching data sets in the terabyte and petabyte range Ø HDFS – Hadoop Distributed File System, used as the basis for all Hadoop technologies Ø Hbase – Distributed Map based database which uses HDFS as its underlying data store Ø Map Reduce – A framework for programming distributed parallel processing algorithms
  4. 4. Object Partners Inc. Terminology Ø Distributed Computing – A computing paradigm that parallelizes computations over multiple compute nodes in order to decrease overall processing time Ø NoSQL – Programming paradigm which does not use a relational database as the backend data store Ø Big Data – Generic term used when working with large data sets Ø Name Node – Server that knows the location of all files in the cluster
  5. 5. Object Partners Inc. Enterprise Architecture 101 [Diagram labels: HDFS, Map Reduce, Data, Hbase, RDBMS]
  6. 6. Object Partners Inc. The New System Constraint Ø Hard disk seek time is the new constraint when working with a Petabyte data set – Spread the seek time among multiple servers – Isolate the data to a single read per disk – Faster to read too much data sequentially on disk and discard the excess Ø Working under this paradigm requires New Tools
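The seek-time argument on this slide can be made concrete with a rough back-of-envelope calculation. This is only a sketch: the 10 ms seek time and 100 MB/s sequential throughput are illustrative assumptions, not figures from the presentation, and the function names are hypothetical.

```python
# Back-of-envelope comparison of random seeks vs. sequential scans.
# All hardware figures below are illustrative assumptions.
SEEK_TIME_S = 0.010          # assumed average seek time per random read
SEQ_THROUGHPUT_BPS = 100e6   # assumed sequential read rate (100 MB/s)


def random_read_seconds(num_records: int) -> float:
    """Time to fetch records one seek at a time (transfer time ignored)."""
    return num_records * SEEK_TIME_S


def sequential_read_seconds(total_bytes: float, servers: int = 1) -> float:
    """Time to stream the whole data set, split evenly across servers."""
    return total_bytes / servers / SEQ_THROUGHPUT_BPS


one_tb = 1e12
print(random_read_seconds(10_000_000))       # 10M scattered records, one disk
print(sequential_read_seconds(one_tb))       # stream 1 TB from one disk
print(sequential_read_seconds(one_tb, 100))  # stream 1 TB over 100 servers
```

Under these assumed numbers, scanning far more data than needed and discarding the excess can still beat seeking to each record, and spreading the scan over many servers shortens it further.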
  7. 7. Object Partners Inc. New Tools: Why does Hadoop exist? Ø In the early 2000s Google had problems: Ø Problem 1: Store Tera and Petabytes of data: – Inexpensive, Reliable, Accessible Ø Answer: distributed file system Ø Problem 2: Distributed Computing is Hard Ø Answer: make distributed computing easier Ø Problem 3: Datasets too large for RDBMS Ø Answer: make a new way to store application data
  8. 8. Object Partners Inc. Google’s Solution: Tool 1 Ø Google File System (GFS) – A file system specifically built to manage large files and support distributed computing Ø Inexpensive: – Store files distributed across a cluster of cheap servers Ø Reliable: – Plan for server failure: if you have 1000 servers, one will fail every day – Always maintain three copies of each file (configurable) Ø Accessible: – File Chunk size is 64MB = Fewer file handles to manage – Master table keeps track of locations of each file copy Problem 1: Store Tera and Petabytes of data
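The chunking and replication ideas above can be sketched as a toy "master table". The 64 MB chunk size and three copies come from the slide; the round-robin placement policy, the server names, and the `place_chunks` function are assumptions made for illustration and are not how GFS actually chooses servers.

```python
import math

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as on the slide
REPLICAS = 3                    # three copies of each chunk (configurable)


def place_chunks(file_size: int, servers: list[str]) -> dict[int, list[str]]:
    """Return a toy master table: chunk index -> servers holding a copy."""
    if len(servers) < REPLICAS:
        raise ValueError("need at least as many servers as replicas")
    num_chunks = math.ceil(file_size / CHUNK_SIZE)
    table = {}
    for chunk in range(num_chunks):
        # Assumed round-robin placement: consecutive servers, so the
        # copies of one chunk always land on distinct machines.
        table[chunk] = [
            servers[(chunk + r) % len(servers)] for r in range(REPLICAS)
        ]
    return table


# A 200 MB file needs four 64 MB chunks (the last one partly filled).
table = place_chunks(200 * 1024 * 1024, ["s1", "s2", "s3", "s4"])
for chunk, holders in table.items():
    print(chunk, holders)
```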
  9. 9. Object Partners Inc. Google’s Solution: Tool 2 Ø Map Reduce – abstracts away the hard parts of distributed computing Ø Programmers no longer need to manage: – Where is the data? – What piece of data am I working on? – How do I move data and result sets? – How do I combine results? Ø Leverages the GFS – Send processing to the data – Multiple file copies means higher chance to use more nodes for each process Problem 2: Distributed Computing is Hard
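"Send processing to the data" can be illustrated with a tiny scheduling sketch. The `pick_node` function and its arguments are hypothetical; real schedulers weigh many more factors, so treat this as an assumption-laden illustration of why extra file copies give more placement options.

```python
def pick_node(chunk_holders: list[str], busy: set[str],
              all_nodes: list[str]) -> str:
    """Prefer an idle node that already stores the chunk; else any idle node."""
    for node in chunk_holders:
        if node not in busy:
            return node   # data-local: no network copy needed
    for node in all_nodes:
        if node not in busy:
            return node   # fallback: the chunk must be shipped to this node
    raise RuntimeError("no idle node available")


nodes = ["s1", "s2", "s3", "s4"]
local = pick_node(["s1", "s2", "s3"], busy={"s1"}, all_nodes=nodes)
remote = pick_node(["s1", "s2", "s3"], busy={"s1", "s2", "s3"}, all_nodes=nodes)
print(local, remote)
```

With three copies of a chunk, the task only has to fall back to a remote node when all three holders are busy.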
  10. 10. Object Partners Inc. Tool 2: Map Reduce Ø Distributed parallel processing framework Ø Map - done N times on N servers – Perform an operation (search) on a chunk (GBs) of data Ø Search 100 GB – Process Map on 25 servers with 4GB of memory – 100 GB processed in-parallel in-memory – Create Maps storing results (key-value pair) Ø Reduce – Take Maps from N nodes – Merge (reduce) maps to a single sorted map (result set) Problem 2: Distributed Computing is Hard
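The map and reduce steps described above can be simulated on one machine. This is a minimal in-memory sketch, not Hadoop's API: the chunk strings, search terms, and function names are made up for illustration, and the list comprehension stands in for work that would run on N servers.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk: str, terms: set[str]) -> Counter:
    """Map step: count occurrences of the search terms in one chunk."""
    return Counter(word for word in chunk.lower().split() if word in terms)


def reduce_maps(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial result maps by summing counts."""
    return left + right


chunks = [
    "hadoop stores data in hdfs",
    "hbase uses hdfs as its data store",
    "map reduce sends processing to the data",
]
terms = {"hdfs", "data"}

partials = [map_chunk(c, terms) for c in chunks]  # one map per chunk
merged = reduce(reduce_maps, partials, Counter())
result = dict(sorted(merged.items()))             # single sorted result map
print(result)
```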
  11. 11. Object Partners Inc. Google’s Solution: Tool 3 Ø Bigtable: new paradigm in storing large data sets – “a sparse, distributed, persistent multi-dimensional sorted map”* *Bigtable: A Distributed Storage System for Structured Data Ø Sparse: Few entries in map are populated Ø Distributed: Data spread across multiple logical machines in multiple copies Ø Multi-dimensional: Maps within maps organize and store data Ø Sorted: Sorted by lexicographic keys – Lexicographic = alphabetically including numbers Problem 3: Data sets too large for RDBMS
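Lexicographic ordering has a practical consequence for key design that a short example makes visible. The row keys below are invented for illustration, and zero-padding is shown as one common convention rather than something the slides prescribe.

```python
keys = ["row2", "row10", "row1", "Row3"]

# Byte-wise lexicographic order: uppercase sorts before lowercase,
# and "row10" sorts before "row2" because the byte '1' is less than '2'.
lexicographic = sorted(k.encode("ascii") for k in keys)
print(lexicographic)

# Zero-padding the numeric part (an assumed key-design convention)
# makes lexicographic order agree with numeric order.
padded = sorted(f"row{n:04d}" for n in (2, 10, 1))
print(padded)
```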
  12. 12. Object Partners Inc. Google’s Architecture [Diagram labels: Map Reduce, Direct Access, Map Reduce, Bigtable, GFS]
  13. 13. Object Partners Inc. Hadoop – If Something Works… Ø Hadoop was started to recreate these technologies in the Open Source community: GFS → HDFS, Bigtable → Hbase, Map Reduce → Map Reduce
  14. 14. Object Partners Inc. A Little More on HDFS Ø Plan for Failure – In a thousand node cluster, machines will fail often – HDFS is built to detect failure and redistribute files Ø Fast Data Access – Generally a batch processing system Ø Large Files – typically GB to TB files Ø Simple Coherency – Once a file is closed, it cannot be updated or appended Ø Cloud Ready – Set up on Amazon EC2 / S3 Summarized from:
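"Detect failure and redistribute files" can be sketched as restoring the replica count after a server drops out. This is a toy model under stated assumptions: the block table, server names, and `rereplicate` function are hypothetical, and it does not reflect HDFS's actual placement logic.

```python
def rereplicate(table: dict[int, list[str]], failed: str,
                live: list[str], replicas: int = 3) -> dict[int, list[str]]:
    """Drop a failed server from every block and restore the replica count."""
    repaired = {}
    for block, holders in table.items():
        survivors = [s for s in holders if s != failed]
        # Only servers that do not already hold the block are candidates.
        candidates = [s for s in live if s not in survivors]
        missing = max(0, replicas - len(survivors))
        repaired[block] = survivors + candidates[:missing]
    return repaired


blocks = {0: ["s1", "s2", "s3"], 1: ["s2", "s3", "s4"]}
repaired = rereplicate(blocks, failed="s2", live=["s1", "s3", "s4", "s5"])
print(repaired)
```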
  15. 15. Object Partners Inc. A Little More on Hbase Ø Multi-dimensional Map Ø Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>> Ø First Map: Row Key to Column Family Ø Second Map: Column Family to Column Label Ø Third Map: Column Label to Timestamp Ø Fourth Map: Timestamp to Value A Column Family is a grouping of columns, usually of the same data type.
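The four nested maps can be mimicked with plain dictionaries. This is a sketch of the data model only, not the Hbase client API; the `put` and `get_latest` helpers and the sample row are hypothetical.

```python
# Row key -> column family -> column label -> timestamp -> value
table: dict[bytes, dict[bytes, dict[bytes, dict[int, bytes]]]] = {}


def put(row: bytes, family: bytes, label: bytes, ts: int, value: bytes) -> None:
    """Store one versioned cell, creating the intermediate maps as needed."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(label, {})
    cell[ts] = value


def get_latest(row: bytes, family: bytes, label: bytes) -> bytes:
    """Return the value with the highest timestamp for one cell."""
    versions = table[row][family][label]
    return versions[max(versions)]


put(b"user1", b"info", b"email", 100, b"old@example.com")
put(b"user1", b"info", b"email", 200, b"new@example.com")
print(get_latest(b"user1", b"info", b"email"))
```

Because every level is a map, a row only stores the columns it actually has, which is what makes the table sparse.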
  16. 16. Object Partners Inc. Hbase Storage Model
  17. 17. Object Partners Inc. Hbase Access Ø REST interface – Ø Groovy – Ø Scala –
  18. 18. Object Partners Inc. Industry Examples Ø Web/File Search (Yahoo!) Ø Yahoo! is the main sponsor of and contributor to Hadoop Ø Has over 25,000 servers running Hadoop Ø Log aggregation (Amazon, Facebook, Baidu) Ø RDBMS replacement (Google Analytics) Ø Image store (Google Earth) Ø Email store (Gmail) Ø Natural Language Search (Microsoft) Ø Many more… * Information from
  19. 19. Object Partners Inc. Use Case #1: Yahoo! Search Ø Problem circa 2006 Ø Yahoo! search is seen as inferior to Google’s Ø Google is better at: – Storing Tera and Petabytes of unstructured data – Searching the data set efficiently – Applying custom analytics to data set – Presenting a more relevant result set
  20. 20. Object Partners Inc. Use Case #1: Yahoo! Search Ø Solution – Emulate Google with Hadoop’s HDFS, Pig and Map Reduce – HDFS • Stores Petabytes of web page data distributed over a cluster of compute nodes (1000s) • Runs on commodity hardware • Average server – 2X4 core, 4 – 32 GB RAM * – Pig (Hadoop Sub-project) • Analytics processing platform – Map Reduce • Build indexes from raw web data *
  21. 21. Object Partners Inc. Use Case #2: RDBMS Replacement Ø Google Analytics circa 2006 Ø Problem – Store Terabytes of analytics data about website usage – GBs of data added per hour – Data added in small increments – Access and display data in < 3 seconds per request
  22. 22. Object Partners Inc. Use Case #2: RDBMS Replacement Ø Solution – Bigtable, Map Reduce on GFS Ø Bigtable sits over GFS and takes in small bits of data Ø In 2006, GA cluster supported ~220 TB* Ø Raw Click Table (200 TB) – Rows keyed by WebsiteName + Session Time – All website data stored consecutively on disk Ø Summary Table (20 TB) – Map Reduce of Raw Click Table for customer web views Pattern: Collect data in one Bigtable instance, then Map Reduce to a View Bigtable instance *Bigtable: A Distributed Storage System for Structured Data
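Keying rows by website name plus session time means that one site's rows sit next to each other in sorted order, so they can be read as a single contiguous range. The sketch below illustrates that with a sorted list; the `#` separator, timestamp format, and `scan_prefix` helper are assumptions for illustration, not the actual Google Analytics key layout.

```python
from bisect import bisect_left

# Sorted row keys: website name + session time (hypothetical key format).
rows = sorted([
    "example.com#2006-01-01T10:00",
    "example.com#2006-01-01T10:05",
    "example.org#2006-01-01T09:00",
    "another.net#2006-01-02T12:00",
])


def scan_prefix(sorted_keys: list[str], prefix: str) -> list[str]:
    """Return the contiguous run of keys that start with the prefix."""
    start = bisect_left(sorted_keys, prefix)   # jump to the first candidate
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


site_rows = scan_prefix(rows, "example.com#")
print(site_rows)
```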
  23. 23. Object Partners Inc. Can You Use Hadoop? Ø IF… – You have a large amount of data (Terabytes+) – You can split your data collection data store from your online or analytics data store – You can order your data lexicographically – You can run analytics as batches – You cannot afford a large enough RDBMS – You need dynamic column additions – You need near linear performance as data set grows
  24. 24. Object Partners Inc. Other Hadoop Technologies Ø Hive – SQL-like query language for using Hadoop like a data warehouse Ø Pig – parallel data analysis framework Ø Zookeeper – Distributed application coordination framework Ø Chukwa – Data collection system for distributed computing Ø Avro – data serialization framework
  25. 25. Object Partners Inc. New Skills for IT Ø Learning to restructure data Ø Learning to write Map Reduce programs Ø Learning to maintain a Hadoop cluster Ø Forgetting RDBMS/SQL dominated design principles It takes a new style of creativity to both structure data in Hadoop and write useful Map Reduce programs.
  26. 26. Object Partners Inc. Getting Started Ø You can install a test system on a single Unix box Ø For a full system a minimum of 3 servers – 10 to 20 servers is a small cluster Ø Expect to spend a day to a week getting a multi-node cluster configured. Ø A book like Pro Hadoop, by Jason Venner, may save you time but is based on the 0.19 Hadoop release (currently at 0.20)
  27. 27. Object Partners Inc. Optional Quickstart Ø Cloudera has a preconfigured single node Hadoop instance available for download at: Ø Yahoo! has a Hadoop distribution as well at:
  28. 28. Object Partners Inc. Alternatives to Hbase Ø Project Voldemort – – Used by LinkedIn Ø Hypertable – – Used by Baidu (Search leader of China) Ø Cassandra – – Apache sponsored distributed database – Used by Facebook
  29. 29. Object Partners Inc. Helpful Information Ø Ø Ø Ø Ø Ø Ø Twitter: @hbase Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM
  30. 30. DEMO Object Partners Inc.