Hadoop
A Distributed Programming Framework

A Very Short Introduction

@Dewang_Mistry

DewangMistry.com
“Big data” is data that becomes large enough that it cannot be processed using conventional methods.
~ O’Reilly Radar

Hadoop

Apache Hadoop is not a database.
Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under one umbrella term: Hadoop (Core).

Distributed Systems

Hadoop scales out across many low-end/commodity machines (scale-out) rather than scaling up on a huge monolithic server (scale-up).

Anatomy of a Hadoop Cluster

Distributed computing (MapReduce)
Distributed storage (HDFS)
Commodity hardware

Hadoop Architecture

The MapReduce master (the Job Tracker) is responsible for organizing where computational work should be scheduled on the slave nodes.
The HDFS master (the Name Node) is responsible for partitioning the storage across the slave nodes and keeping track of where data is located.
Each slave node runs a Data Node (storage) alongside a Task Tracker (computation).
The guiding principle: let the data remain where it is and move the executable code to its hosting machine.
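
To make "move code to the data" concrete, here is a toy Python sketch of the scheduling idea (an illustration only, not the actual Job Tracker algorithm; the node and block names are hypothetical):

```python
# Toy model of Hadoop's data-locality scheduling: assign each map task
# to a node that already hosts its input block, so executable code moves
# to the data rather than data to the code. Node and block names are
# hypothetical; this is not the real Job Tracker algorithm.

# Which slave nodes hold a replica of each HDFS block.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}
ALL_NODES = ["node-a", "node-b", "node-c"]

def schedule(block_locations, all_nodes):
    """Assign one map task per block, preferring data-local nodes."""
    load = {node: 0 for node in all_nodes}   # tasks assigned per node
    assignments = {}
    for block, replicas in block_locations.items():
        # Prefer the least-loaded node holding a replica; fall back to
        # any node (a remote read) if no replica location is known.
        candidates = [n for n in replicas if n in load] or all_nodes
        node = min(candidates, key=lambda n: load[n])
        assignments[block] = node
        load[node] += 1
    return assignments

if __name__ == "__main__":
    for block, node in schedule(BLOCK_LOCATIONS, ALL_NODES).items():
        print(f"{block}: map task scheduled on {node}")
```

Each map task lands on a machine that already stores its input block, so only small task descriptions travel over the network instead of large data blocks.
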
Hadoop Ecosystem

[Diagram: the Hadoop core (HDFS and MapReduce) surrounded by ecosystem projects — high-level languages such as Pig and Hive; predictive analytics tools such as R, RHadoop, RHIPE, and Mahout; and miscellaneous projects such as Sqoop, Flume, Hue, Crunch, Cascading, and HBase, among others.]

MapReduce

Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over. MapReduce uses lists and (key, value) pairs as its main data primitives. Example on the next slide: shapes are keys, their colors are values.

MapReduce

[Diagram: multiple inputs fan in to Map tasks, which operate on (k1, v1) pairs; their output is grouped and passed to Reduce tasks, which operate on (k2, v2) pairs and produce the final outputs.]
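
The flow above can be sketched without Hadoop at all. This plain-Python illustration (not from the deck) runs the classic word count through explicit map, shuffle, and reduce steps, showing how inputs become (key, value) pairs that the reducer aggregates:

```python
# Framework-free sketch of the MapReduce data flow: map emits
# (key, value) pairs, a shuffle groups the values by key, and reduce
# aggregates each group. Word count is the classic example.
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Collapse a key's values into a single (key, total) pair.
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data big ideas", "data beats ideas"]
    mapped = (pair for line in lines for pair in mapper(line))
    for key, values in shuffle(mapped):
        print(reducer(key, values))
```
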
Data Logistics (HDFS)

Move data from an RDBMS into Hadoop using Sqoop.
Move log files using Flume, Chukwa, or Scribe.
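
As a hedged sketch of scripting such a transfer, a Sqoop import can be driven from Python; the JDBC URL, credentials, table, and directories below are placeholders, and the sqoop binary is assumed to be installed and on the PATH:

```python
# Hypothetical Sqoop import: pull an "orders" table from MySQL into
# HDFS. All connection details are placeholders.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # placeholder JDBC URL
    "--username", "etl_user",                          # placeholder user
    "--password-file", "/user/etl/.sqoop_pw",          # avoids a plaintext password flag
    "--table", "orders",                               # placeholder table
    "--target-dir", "/data/raw/orders",                # HDFS destination directory
    "--num-mappers", "4",                              # parallel import tasks
]

result = subprocess.run(sqoop_cmd, capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError(f"sqoop import failed:\n{result.stderr}")
print("sqoop import completed")
```
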
Writing Map/Reduce Jobs

We can use multiple languages to write Map/Reduce jobs:

Python with Hadoop Streaming
Pros: fast development (a minimal streaming example follows this list)
Cons: slower than Java, no access to the Hadoop API

Java
Pros: fast, access to the Hadoop API
Cons: verbose language

Pig
Pros: very small scripts, faster than streaming
Cons: yet another language to learn

Hive
Pros: SQL-like syntax (easy for non-programmers) and a relational data model
Cons: slower than Pig, more moving parts
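
As a minimal illustration of the streaming approach, here is a mapper/reducer pair that counts requests per HTTP status code. The log format and field positions are assumptions; streaming itself only requires reading stdin and writing tab-separated key/value lines to stdout.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper (assumes Apache-style access logs).
# Emits one tab-separated "status<TAB>1" line per request.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 9:       # enough fields for an Apache-style log line
        status = fields[8]     # HTTP status code position (an assumption)
        print(f"{status}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer. Streaming sorts mapper output
# by key, so all lines for a key arrive contiguously; sum them and emit
# one "key<TAB>total" line per key.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

A job like this is typically launched with the streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input logs/ -output status-counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
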
Use Cases

Where can we use Hadoop?

Reporting: granular reports over a large data set of 5-7 years
Business analysis: risk analysis, predictive analysis, operational analysis, root-cause analysis, latency analysis, better capacity planning (servers, people, bandwidth)
Product features: recommendations (better than external parties, because of the amount of data)