Hadoop - A Very Short Introduction
A short introduction to Hadoop and its ecosystem.


Published in Technology
  • 1. Hadoop: A Distributed Programming Framework. A Very Short Introduction. @Dewang_Mistry
  • 2. “Big data” is data that becomes large enough that it cannot be processed using conventional methods ~ O’Reilly Radar
  • 3. Hadoop: Apache Hadoop is not a database. Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under one umbrella term. Hadoop (Core)
  • 4. Distributed Systems: low-end/commodity machines (scale-out) vs. huge monolithic servers (scale-up)
  • 5. Anatomy of a Hadoop Cluster: distributed computing (MapReduce), distributed storage (HDFS), commodity hardware
  • 6. Hadoop Architecture
       Master: Name Node (HDFS), Job Tracker (MapReduce).
       Slaves (three shown): each runs a Data Node (HDFS) and a Task Tracker (MapReduce).
       The MapReduce master is responsible for organizing where computational work should be scheduled on the slave nodes.
       The HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located.
       Let the data remain where it is and move the executable code to its hosting machine.
  • 7. Hadoop Ecosystem
       High-level languages: Crunch, Cascading, Pig, Hive
       Predictive analytics: RHadoop, RHIPE, R, Mahout
       Misc.: Sqoop, Hue, Flume, HBase
       Hadoop: HDFS, MapReduce
  • 8. MapReduce: Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over. MapReduce uses lists and (key/value) pairs as its main data primitives. Example next: shapes are keys, and their colors are values.
  • 9. MapReduce (diagram): six inputs (IN) -> Map (k1, v1) -> Reduce (k2, v2) -> three outputs (OUT)
  • 10. Data Logistics (HDFS): Move data from an RDBMS into Hadoop using Sqoop. Move log files using Flume, Chukwa, or Scribe.
  • 11. Writing Map/Reduce Jobs
       We can use multiple languages to write Map/Reduce jobs:
       Python with Hadoop Streaming. Pros: fast development. Cons: slower than Java, no access to the Hadoop API.
       Java. Pros: fast, access to the Hadoop API. Cons: verbose language.
       Pig. Pros: very small scripts, faster than streaming. Cons: yet another language to learn.
       Hive. Pros: SQL-like syntax (easy for non-programmers) and a relational data model. Cons: slower than Pig, more moving parts.
  • 12. Use Cases: Where can we use Hadoop?
       Reporting: granular reports over a large data set spanning 5-7 years
       Business analysis: risk analysis, predictive analysis
       Operational analysis: root cause analysis, latency analysis, better capacity planning (servers, people, bandwidth)
       Product features: recommendations (better than external parties, because of the amount of data)
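
The mapper/reducer description on slides 8 and 9 can be illustrated with a small in-memory sketch. This is not Hadoop code: it only simulates the map, group-by-key, and reduce steps in a single Python process. The function names (`shape_mapper`, `count_colors`, `run_job`) and the sample records are hypothetical, chosen to mirror the "shapes are keys, colors are values" example.

```python
from collections import defaultdict


def shape_mapper(record):
    """Filter and transform one input record into (key, value) pairs.

    Assumption: each record is a (shape, color) tuple. Records without a
    color are filtered out; the rest are normalized to lower case.
    """
    shape, color = record
    if color:
        yield shape.lower(), color.lower()


def shuffle(pairs):
    """Group mapper output by key (the framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def count_colors(shape, colors):
    """Aggregate all values seen for one key: count each color of a shape."""
    counts = defaultdict(int)
    for color in colors:
        counts[color] += 1
    return dict(counts)


def run_job(records, mapper, reducer):
    """Run map -> shuffle -> reduce over an in-memory list of records."""
    pairs = [pair for record in records for pair in mapper(record)]
    grouped = shuffle(pairs)
    return {key: reducer(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    sample = [("Circle", "Red"), ("square", "blue"), ("circle", "red"),
              ("Circle", "Green"), ("square", "")]
    print(run_job(sample, shape_mapper, count_colors))
```

On a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; the sketch keeps only the data flow.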
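
As a rough illustration of the "Python with Hadoop Streaming" option on slide 11: Streaming feeds text lines to the mapper and reducer and, by default, treats the text before the first tab as the key; the reducer receives its input sorted by key. The sketch below follows that line-based convention with a word count, but works on plain Python iterables instead of stdin/stdout so it can be run without a cluster. The names `mapper`, `reducer`, and `run_locally` are hypothetical, and `run_locally` merely stands in for the sort step that the framework would perform.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def _key_of(line):
    """Return the key part (text before the first tab) of a key/value line."""
    return line.split("\t", 1)[0]


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Assumes the input is already sorted by key, as it would be when the
    framework hands mapper output to a reducer.
    """
    for word, group in groupby(sorted_lines, key=_key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Local stand-in for the job: map, sort by line, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_locally(["Hadoop is not a database", "hadoop is a framework"]):
        print(out)
```

In an actual Streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and writing to standard output; the generator form here is only a convenience for local experimentation.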