Introduction to Hadoop
 

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Presentation Transcript

    • Introduction to Hadoop
      Ran Ziv, © 2012
    • Who Am I? Ran Ziv
      • Current: Data Architect at Technology Research Group
      • Past: Architect, Data Platform & Analytics Group Manager, LivePerson; Analytics Project Manager, Software Industry; System Analyst, Telco Industry; Fraud Detection Systems Engineer, Telco Industry; Data Researcher, Internet Industry
    • What’s Ahead?
      • Solid introduction to Apache Hadoop
        • What it is
        • Why it’s relevant
        • How it works
        • The ecosystem
      • No prior experience needed
      • Feel free to ask questions
    • What Is Apache Hadoop?
      • Scalable data storage and processing
      • Open-source Apache project
      • Harnesses the power of commodity servers
      • Distributed and fault-tolerant
      • “Core” Hadoop consists of two main parts
        • HDFS (storage)
        • MapReduce (processing)
    • A Large Ecosystem
    • A Coherent Platform
    • How Did Apache Hadoop Originate?
      • Heavily influenced by Google’s architecture
        • Notably, the Google File System and MapReduce papers
      • Other Web companies quickly saw the benefits
        • Early adoption by Yahoo, Facebook and others
    • What Is Common Across Hadoop-able Problems?
      • Nature of the data
        • Complex data
        • Multiple data sources
        • Lots of it
      • Nature of the analysis
        • Parallel execution
        • Spread data over a cluster of servers and take the computation to the data
    • Benefits of Analyzing with Hadoop
      • Previously impossible/impractical to do this analysis
      • Analysis conducted at lower cost
      • Analysis conducted in less time
      • Greater flexibility
      • Linear scalability
    • Hadoop: How Does It Work?
      • Moore’s law… and not
    • Disk Capacity and Price
      • We’re generating more data than ever before
      • Fortunately, the size and cost of storage have kept pace
        • Capacity has increased while price has decreased
    • Disk Capacity and Performance
      • Disk performance has also increased in the last 15 years
      • Unfortunately, transfer rates haven’t kept pace with capacity
    • Architecture of a Typical HPC System
    • You Don’t Just Need Speed…
      • The problem is that we have way more data than code
    • You Need Speed at Scale
    • Collocated Storage and Processing
      • Solution: store and process data on the same nodes
      • Data locality: “bring the computation to the data”
      • Reduces I/O and boosts performance
    • Introducing HDFS
      • Hadoop Distributed File System
      • Scalable storage influenced by Google’s file system paper
      • It’s not a general-purpose file system
        • HDFS is optimized for Hadoop
        • Values high throughput much more than low latency
      • It’s a user-space Java process
      • Primarily accessed via command-line utilities and a Java API
    • HDFS Is (Mostly) UNIX-Like
      • In many ways, HDFS is similar to a UNIX file system
        • Hierarchical
        • UNIX-style paths (e.g. /foo/bar/myfile.txt)
        • File ownership and permissions
      • There are also some major deviations from UNIX
        • Files cannot be modified once written
    • HDFS High-Level Architecture
      • HDFS follows a master-slave architecture
      • There are two essential daemons in HDFS
        • Master: NameNode, responsible for the namespace and metadata
          • Namespace: file hierarchy
          • Metadata: ownership, permissions, block locations, etc.
        • Slave: DataNode, responsible for storing the actual data blocks
    • HDFS Blocks
      • When a file is added to HDFS, it’s split into blocks
      • This is a similar concept to native file systems
      • HDFS uses a much larger block size (64 MB by default), for performance
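    To make the block-splitting arithmetic concrete, here is a minimal sketch (not part of the original deck) that computes how many 64 MB blocks a file would occupy:

    ```python
    import math

    BLOCK_SIZE_MB = 64  # the default HDFS block size mentioned above

    def blocks_for(file_size_mb):
        """Number of HDFS blocks a file of the given size occupies.

        Even a tiny file occupies at least one block entry, and the
        final block of a larger file may be only partially filled.
        """
        return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))

    print(blocks_for(200))  # 4 blocks: three full 64 MB blocks plus one 8 MB block
    print(blocks_for(64))   # 1 block, exactly full
    print(blocks_for(65))   # 2 blocks: one full, one holding a single extra MB
    ```

    Note that unlike native file systems, a partially filled final block does not consume a full 64 MB of disk space on the DataNodes.
    
    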
    • HDFS Replication
      • Those blocks are then replicated across machines
        • The first block might be replicated to A, C and D
        • The next block might be replicated to B, D and E
        • The last block might be replicated to A, C and E
    • HDFS Reliability
      • Replication helps to achieve reliability
      • Even when a node fails, two copies of each block remain
        • These will be re-replicated to other nodes automatically
    • MapReduce High-Level Architecture
      • Like HDFS, MapReduce has a master-slave architecture
      • There are two daemons in “classical” MapReduce
        • Master: JobTracker, responsible for dividing, scheduling and monitoring work
        • Slave: TaskTracker, responsible for the actual processing
    • Gentle Introduction to MapReduce
      • MapReduce is conceptually like a UNIX pipeline
        • One function (Map) processes data
        • That output is ultimately the input to another function (Reduce)
    • The Map Function
      • Operates on each record individually
      • Typical uses include filtering, parsing, or transforming
    • Intermediate Processing
      • The Map function’s output is grouped and sorted
      • This is the automatic “shuffle and sort” process in Hadoop
    • The Reduce Function
      • Operates on all records in a group
      • Often used for sums, averages, or other aggregate functions
    • MapReduce Benefits
      • Complex details are abstracted away from the developer
        • No file I/O
        • No networking code
        • No synchronization
    • MapReduce Example in Python
      • MapReduce code for Hadoop is typically written in Java
      • But nearly any language can be used via Hadoop Streaming
      • I’ll show the log event counter using MapReduce in Python
      • It’s very helpful to see the data as well as the code
    • Job Input
      • Each mapper gets a chunk of the job’s input data
      • This “chunk” is called an InputSplit
    • Python Code for the Map Function
      • Our map function will parse the event type
      • And then output that event (key) and a literal 1 (value)
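    The slide's actual code is not preserved in this transcript, but a Hadoop Streaming mapper along these lines would do what the slide describes. This sketch assumes a hypothetical log format of "<timestamp> <level> <message>", where the level (INFO, WARN, ERROR) is the event type being counted:

    ```python
    #!/usr/bin/env python
    # Streaming mapper: emit "<level><TAB>1" for each log line on stdin.
    import sys

    def map_line(line):
        """Parse one log line; return (event_level, 1), or None if malformed."""
        fields = line.split()
        if len(fields) >= 2:
            return fields[1], 1  # assumes the level is the second field
        return None

    if __name__ == "__main__":
        # Hadoop Streaming feeds the InputSplit to the mapper via stdin
        for line in sys.stdin:
            kv = map_line(line)
            if kv is not None:
                print(f"{kv[0]}\t{kv[1]}")
    ```

    The tab-separated key/value convention is what Hadoop Streaming uses to split the mapper's output into keys and values for the shuffle.
    
    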
    • Output of the Map Function
      • The map function produces key/value pairs as output
    • Input to the Reduce Function
      • The Reducer receives a key and all values for that key
      • Keys are always passed to reducers in sorted order
      • Although it’s not obvious here, values are unordered
    • Python Code for the Reduce Function
      • The Reducer first extracts the key and value it was passed
      • Then simply adds up the values for each key
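    Again, the slide's code is not in the transcript; a Streaming reducer matching this description might look like the sketch below. It relies on the property noted above: keys arrive in sorted order, so all lines for one key are consecutive and a running total can be flushed whenever the key changes:

    ```python
    #!/usr/bin/env python
    # Streaming reducer: sum the "1" values emitted by the mapper for each key.
    import sys

    def reduce_stream(lines):
        """Sum values per key; keys are assumed to arrive grouped (sorted)."""
        counts = []
        current_key, total = None, 0
        for line in lines:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    counts.append((current_key, total))  # flush previous key
                current_key, total = key, 0
            total += int(value)
        if current_key is not None:
            counts.append((current_key, total))  # flush the final key
        return counts

    if __name__ == "__main__":
        for key, total in reduce_stream(sys.stdin):
            print(f"{key}\t{total}")
    ```

    Feeding it the mapper's sorted output (e.g. "ERROR\t1", "ERROR\t1", "INFO\t1") yields one line per event level with its total count.
    
    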
    • Output of the Reduce Function
      • The output of this Reduce function is a sum for each level
    • Recap of Data Flow
    • Input Splits Feed the Map Tasks
      • Input for the entire job is subdivided into InputSplits
      • An InputSplit usually corresponds to a single HDFS block
      • Each of these serves as input to a single Map task
    • Mappers Feed the Shuffle and Sort
      • Output of all Mappers is partitioned, merged, and sorted
        (No code required: Hadoop does this automatically)
    • Shuffle and Sort Feeds the Reducers
      • All values for a given key are then collapsed into a list
      • The key and all its values are fed to reducers as input
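    The shuffle-and-sort step above can be simulated in a few lines of plain Python: sort the mapper's unordered key/value pairs by key, then group consecutive pairs so each key sees all of its values together, just as a reducer would. (This is only an in-memory illustration; Hadoop performs the real thing across the cluster, with no user code.)

    ```python
    from itertools import groupby
    from operator import itemgetter

    # Unordered (key, value) pairs, as a mapper might emit them
    mapper_output = [("INFO", 1), ("ERROR", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1)]

    # Shuffle and sort: order the pairs by key
    shuffled = sorted(mapper_output, key=itemgetter(0))

    # Group consecutive pairs so each key's values are collapsed into a list
    for key, group in groupby(shuffled, key=itemgetter(0)):
        values = [v for _, v in group]
        print(key, values)
    # ERROR [1]
    # INFO [1, 1, 1]
    # WARN [1]
    ```

    Each printed line corresponds to one reducer invocation: a key plus the list of all values emitted for it.
    
    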
    • Each Reducer Has an Output File
      • These are stored in HDFS below your output directory
      • Use hadoop fs -getmerge to combine them into a local copy
    • Apache Hadoop Ecosystem: Overview
      • “Core Hadoop” consists of HDFS and MapReduce
      • These are the kernel of a much broader platform
      • Hadoop has many related projects
        • Some help you integrate Hadoop with other systems
        • Others help you analyze your data
        • Still others, like Oozie, help you use Hadoop more effectively
      • Most are open-source Apache projects like Hadoop
      • Also like Hadoop, they have funny names
    • Ecosystem: Apache Flume
    • Ecosystem: Apache Sqoop
      • Integrates with any JDBC-compatible database
      • Retrieves all tables, a single table, or a portion of a table to store in HDFS
      • Can also export data from HDFS back to the database
    • Ecosystem: Apache Hive
      • Hive allows you to run SQL-like queries on data in HDFS
      • It turns these into MapReduce jobs that run on your cluster
      • Reduces development time
    • Ecosystem: Apache Pig
      • Apache Pig has a similar purpose to Hive
      • It has a high-level language (Pig Latin) for data analysis
      • Scripts yield MapReduce jobs that run on your cluster
      • But Pig’s approach is much different from Hive’s
    • Ecosystem: Apache HBase
      • NoSQL database built on HDFS
      • Low latency and high performance for reads and writes
      • Extremely scalable
        • Tables can have billions of rows
        • And potentially millions of columns
    • When Is Hadoop (Not) a Good Choice?
      • Hadoop may be a great choice when:
        • You need to process non-relational (unstructured) data
        • You are processing large amounts of data
        • You can run your jobs in batch mode
      • Hadoop may not be a great choice when:
        • You’re processing small amounts of data
        • Your algorithms require communication among nodes
        • You need very low latency or transactions
      • As always, use the best tool for the job, and know how to integrate it with other systems
    • Conclusion
      • Thanks for your time!
      • Questions?