
Hadoop - A Very Short Introduction


A short introduction to Hadoop and its ecosystem.


  1. Hadoop: A Distributed Programming Framework. A Very Short Introduction. @Dewang_Mistry
  2. “Big data” is data that becomes large enough that it cannot be processed using conventional methods. (O’Reilly Radar)
  3. Hadoop. Apache Hadoop is not a database. It is not a single program, tool, or application, but a set of projects with a common goal, integrated under one umbrella term: Hadoop (Core).
  4. Distributed Systems: low-end/commodity machines (scale-out) versus huge monolithic servers (scale-up).
  5. Anatomy of a Hadoop Cluster: distributed computing (MapReduce) and distributed storage (HDFS), running on commodity hardware.
  6. Hadoop Architecture (diagram): the master node runs the Name Node and Job Tracker; each slave node runs a Data Node and Task Tracker on top of HDFS. The MapReduce master (Job Tracker) is responsible for organizing where computational work should be scheduled on the slave nodes. The HDFS master (Name Node) is responsible for partitioning the storage across the slave nodes and keeping track of where data is located. The guiding principle: let the data remain where it is and move the executable code to its hosting machine. (A minimal HDFS client sketch in Java appears after the slide list.)
  7. Hadoop Ecosystem (diagram): on top of Hadoop core (HDFS, MapReduce) sit high-level languages (Crunch, Cascading, Pig, Hive), predictive analytics (RHadoop, RHIPE, R, Mahout), and miscellaneous tools (Sqoop, Hue, Flume, HBase).
  8. MapReduce. Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over. MapReduce uses lists and (key, value) pairs as its main data primitives. In the example on the next slide, shapes are keys and their colors are values. (A word-count mapper/reducer sketch in Java follows the slide list.)
  9. MapReduce dataflow (diagram): many inputs feed Map tasks working on (k1, v1) pairs, whose output is grouped and passed to Reduce tasks working on (k2, v2) pairs, producing the final outputs.
  10. Data Logistics. Move data from an RDBMS into HDFS using Sqoop; move log files using Flume, Chukwa, or Scribe.
  11. Writing Map/Reduce Jobs. We can use multiple languages to write Map/Reduce jobs. Python with Hadoop Streaming (pros: fast development; cons: slower than Java, no access to the Hadoop API). Java (pros: fast, access to the Hadoop API; cons: verbose language). Pig (pros: very small scripts, faster than Streaming; cons: yet another language to learn). Hive (pros: SQL-like syntax that is easy for non-programmers, and a relational data model; cons: slower than Pig, more moving parts). (A Java job-driver sketch follows the slide list.)
  12. Use Cases. Where can we use Hadoop? Reporting: granular reports over large data sets spanning 5-7 years. Business analysis: risk analysis, predictive analysis. Operational analysis: root-cause analysis, latency analysis, better capacity planning (servers, people, bandwidth). Product features: recommendations (better than external parties, because of the amount of data).
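
To make slide 6 a little more concrete, here is a minimal sketch (not from the original deck) of how a Java client reads a file from HDFS: the client is pointed at the Name Node via fs.defaultFS and then streams the data blocks from the Data Nodes. The Name Node address hdfs://namenode:8020 and the path /data/sample.txt are illustrative assumptions.

    // Minimal HDFS read sketch: the client asks the Name Node where the
    // blocks live, then streams them from the Data Nodes.
    // Host, port, and path below are assumptions, not values from the slides.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed Name Node address
            FileSystem fs = FileSystem.get(conf);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // print each line of the HDFS file
                }
            }
        }
    }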
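
Slide 8's mapper/reducer idea can be sketched as a word count, the usual "hello world" of MapReduce. This is an illustrative sketch rather than code from the deck: the mapper filters and transforms each input line into (word, 1) pairs, and the reducer aggregates the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: filter/transform each input line into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE); // emit (word, 1)
                    }
                }
            }
        }

        // Reducer: aggregate all values that share a key into a single count.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum)); // emit (word, total)
            }
        }
    }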
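
For the Java option on slide 11, a job also needs a small driver that uses the Hadoop API to wire the mapper and reducer together and point the job at HDFS input and output paths. The class names and command-line arguments below are assumptions for illustration; the driver reuses the WordCount classes sketched above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The reducer doubles as a combiner because summing is associative.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }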