Hadoop - A Very Short Introduction
A short introduction to Hadoop and its ecosystem.

Presentation Transcript

  • Hadoop: A Distributed Programming Framework
    A Very Short Introduction
    @Dewang_Mistry, DewangMistry.com
  • “Big data” is data that becomes large enough that it cannot be processed using conventional methods ~ O’Reilly Radar
  • Hadoop
    Apache Hadoop is not a database.
    Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under one umbrella term.
    Hadoop (Core)
  • Distributed Systems
    Low-end/commodity machines (scale-out)
    Huge monolithic servers (scale-up)
  • Anatomy of a Hadoop Cluster
    Distributed computing (MapReduce)
    Distributed storage (HDFS)
    Commodity hardware
  • Hadoop Architecture
    Diagram: a master running the Name Node (HDFS) and the Job Tracker (MapReduce), and three slave nodes, each running a Data Node (HDFS) and a Task Tracker (MapReduce).
    The MapReduce master is responsible for organizing where computational work should be scheduled on the slave nodes.
    The HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located.
    Let the data remain where it is and move the executable code to its hosting machine.
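
The "move the code to the data" idea on the architecture slide can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions, not how Hadoop's Job Tracker is actually implemented: the function name `schedule`, the block and node names, and the slot counts are all hypothetical, and real scheduling also considers things like rack locality and failures.

```python
def schedule(block_locations, free_slots):
    """Toy model of locality-aware task placement (hypothetical, not Hadoop code).

    block_locations: block id -> list of nodes storing that block
                     (the kind of information the HDFS master tracks)
    free_slots:      node name -> number of free task slots
                     (the kind of information the MapReduce master tracks)
    Returns: block id -> (chosen node, True if the task runs where the data is)
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, nodes in block_locations.items():
        # Prefer a node that already stores the block, so no data has to move.
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            chosen = local[0]
        else:
            # Fall back to any node with a free slot; the data must be read remotely.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                break  # no capacity left; remaining blocks would have to wait
            chosen = candidates[0]
        slots[chosen] -= 1
        assignment[block] = (chosen, chosen in nodes)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["node1", "node2"], "b2": ["node1"], "b3": ["node3"]}
    slots = {"node1": 1, "node2": 1, "node3": 0}
    for block, (node, is_local) in schedule(blocks, slots).items():
        print(block, node, "local" if is_local else "remote")
```

In this made-up run, the task for `b1` is placed on a node that holds the block, the task for `b2` falls back to a remote node because its only host is busy, and `b3` is left unscheduled because no slots remain.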
  • Hadoop Ecosystem
    High-level languages: Crunch, Cascading, Pig, Hive
    Predictive analytics: RHadoop, RHIPE, R, Mahout
    Misc.: Sqoop, Hue, Flume, HBase
    Hadoop: HDFS, MapReduce
  • MapReduce
    Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
    MapReduce uses lists and (key/value) pairs as its main data primitives.
    Example (next slide): shapes are the keys, and their colors are the values.
  • MapReduce (diagram)
    IN (x6) -> Map (k1, v1) -> Reduce (k2, v2) -> OUT (x3)
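
The shapes-and-colors example can be sketched in plain Python to show the map, group, and reduce steps. This is a minimal in-memory illustration, not Hadoop code: the function names, the sample records, and the choice to count colors per shape are assumptions made for the example, since the slides do not say what the reducer computes.

```python
from collections import defaultdict


def map_record(record):
    """Map step: filter and transform one input record into (key, value) pairs."""
    shape, color = record
    if not color:
        return []  # filter: drop records without a color
    return [(shape.lower(), color.lower())]  # transform: normalize case


def group_by_key(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_shape(shape, colors):
    """Reduce step: aggregate over all values of one key (here, count them)."""
    return (shape, len(colors))


def run_job(records):
    """Run map, group, and reduce over a list of (shape, color) records."""
    mapped = [pair for record in records for pair in map_record(record)]
    grouped = group_by_key(mapped)
    return dict(reduce_shape(shape, colors) for shape, colors in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [("Circle", "red"), ("square", "blue"), ("circle", "green"), ("star", "")]
    print(run_job(sample))
```

On a real cluster the map calls run in parallel on the nodes that hold the input, and the grouping step moves pairs across the network so that each reducer sees all the values for its keys.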
  • Data Logistics (HDFS)
    Move data from an RDBMS into Hadoop using Sqoop.
    Move log files using Flume, Chukwa, or Scribe.
  • Writing Map/Reduce Jobs
    We can use multiple languages to write Map/Reduce jobs:
    Python with Hadoop Streaming
        Pros: fast development
        Cons: slower than Java, no access to the Hadoop API
    Java
        Pros: fast, access to the Hadoop API
        Cons: verbose language
    Pig
        Pros: very small scripts, faster than streaming
        Cons: yet another language to learn
    Hive
        Pros: SQL-like syntax (easy for non-programmers) and relational data model
        Cons: slower than Pig, more moving parts
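
As an illustration of the "Python with Hadoop Streaming" option, here is a minimal word-count sketch. Hadoop Streaming runs the mapper and reducer as ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The word-count task, the script layout, and the `map`/`reduce` command-line argument are assumptions chosen for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per word; expects input lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run the script with "map" or "reduce" as its argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for output_line in stage(sys.stdin):
        print(output_line)
```

The job can be tried locally without a cluster by sorting the mapper output before reducing, for example `list(reducer(sorted(mapper(["a b a", "b a"]))))`, which mimics the sort that happens between the two stages.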
  • Use Cases
    Where can we use Hadoop?
    Reporting
        Granular reports over a large data set spanning 5-7 years
    Business analysis
        Risk analysis
        Predictive analysis
    Operational analysis
        Root cause analysis
        Latency analysis
        Better capacity planning (servers, people, bandwidth)
    Product features
        Recommendations (better than external parties, because of the amount of data)