Hadoop - Just the Basics for Big Data Rookies

Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.

Hadoop Ecosystem Overview

HDFS Architecture

Hadoop MapReduce -- MRv1 and YARN (MRv2)

MapReduce Primer -- Components and Code Example

Published in: Technology

Hadoop - Just the Basics for Big Data Rookies

  1. 1. Hadoop Just the Basics for Big Data Rookies Adam Shook ashook@gopivotal.com © 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
  2. 2. Agenda • Hadoop Overview • HDFS Architecture • Hadoop MapReduce • Hadoop Ecosystem • MapReduce Primer • Buckle up!
  3. 3. Hadoop Overview
  4. 4. Hadoop Core • Open-source Apache project out of Yahoo! in 2006 • Distributed fault-tolerant data storage and batch processing • Provides linear scalability on commodity hardware • Adopted by many: – Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more
  5. 5. Why? • Bottom line: – Flexible – Scalable – Inexpensive
  6. 6. Overview • Great at – Reliable storage for multi-petabyte data sets – Batch queries and analytics – Complex hierarchical data structures with changing schemas, unstructured and structured data • Not so great at – Changes to files (can’t do it…) – Low-latency responses – Analyst usability • This is less of a concern now due to higher-level languages
  7. 7. Data Structure • Bytes! • No more ETL necessary • Store data now, process later • Structure on read – Built-in support for common data types and formats – Extendable – Flexible
  8. 8. Versioning • Version 0.20.x, 0.21.x, 0.22.x, 1.x.x – Two main MR packages: • org.apache.hadoop.mapred (deprecated) • org.apache.hadoop.mapreduce (new hotness) • Version 2.x.x, alpha’d in May 2012 – NameNode HA – YARN – Next Gen MapReduce
  9. 9. HDFS Architecture
  10. 10. HDFS Overview • Hierarchical UNIX-like file system for data storage – sort of • Splitting of large files into blocks • Distribution and replication of blocks to nodes • Two key services – Master NameNode – Many DataNodes • Checkpoint Node (Secondary NameNode)
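
The splitting of large files into blocks can be sketched in plain Java. This is a minimal illustration, not HDFS code: the class `BlockSplitSketch`, the method `splitIntoBlocks`, and the 64 MB block size are assumptions made for the example, since the slides do not state a block size.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {

    /** Returns the length in bytes of each block a file of the given size would occupy. */
    public static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            // Every block is full-sized except possibly the last one.
            long length = Math.min(blockSize, remaining);
            blocks.add(length);
            remaining -= length;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed block size of 64 MB (hypothetical value for this sketch).
        List<Long> blocks = splitIntoBlocks(200 * mb, 64 * mb);
        System.out.println(blocks.size() + " blocks, last block is "
                + blocks.get(blocks.size() - 1) / mb + " MB");
    }
}
```

Each of these blocks would then be distributed and replicated to DataNodes independently, which is why the last, smaller block costs no more than its actual length.
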
  11. 11. NameNode • Single master service for HDFS • Single point of failure (HDFS 1.x) • Stores file to block to location mappings in the namespace • All transactions are logged to disk • NameNode startup reads namespace image and logs
  12. 12. Checkpoint Node (Secondary NN) • Performs checkpoints of the NameNode’s namespace and logs • Not a hot backup! 1. Loads up namespace 2. Reads log transactions to modify namespace 3. Saves namespace as a checkpoint
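
The three checkpoint steps (load the namespace, replay the logged transactions, save the result) can be sketched as follows. This is a toy model under stated assumptions: the namespace is reduced to a map from path to file size, and the `CREATE <path> <size>` and `DELETE <path>` log records are a hypothetical format invented for the example, not the real edit log format.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CheckpointSketch {

    /** Applies the logged transactions to a copy of the image and returns the new checkpoint. */
    public static Map<String, Long> checkpoint(Map<String, Long> image, List<String> editLog) {
        // 1. Load up the namespace (copy, so the old image is left untouched).
        Map<String, Long> namespace = new TreeMap<>(image);
        // 2. Read log transactions to modify the namespace.
        for (String transaction : editLog) {
            String[] parts = transaction.split(" ");
            if (parts[0].equals("CREATE") && parts.length == 3) {
                namespace.put(parts[1], Long.parseLong(parts[2]));
            } else if (parts[0].equals("DELETE") && parts.length == 2) {
                namespace.remove(parts[1]);
            } else {
                throw new IllegalArgumentException("Unknown log record: " + transaction);
            }
        }
        // 3. The returned map is what would be saved as the checkpoint.
        return namespace;
    }
}
```

After a checkpoint the log can start from empty again, which keeps the replay at NameNode startup short.
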
  13. 13. DataNode • Stores blocks on local disk • Sends frequent heartbeats to NameNode • Sends block reports to NameNode • Clients connect to DataNode for I/O
  14. 14. How HDFS Works - Writes 1. Client contacts NameNode to write data 2. NameNode says write it to these nodes 3. Client sequentially writes blocks to DataNode [Diagram: Client, NameNode, and DataNodes A through D receiving blocks A1 through A4]
  15. 15. How HDFS Works - Writes • DataNodes replicate data blocks, orchestrated by the NameNode [Diagram: blocks A1 through A4, each stored three times across DataNodes A through D]
  16. 16. How HDFS Works - Reads 1. Client contacts NameNode to read data 2. NameNode says you can find it here 3. Client sequentially reads blocks from DataNode [Diagram: Client, NameNode, and DataNodes A through D holding replicated blocks A1 through A4]
  17. 17. How HDFS Works - Failure • Client connects to another node serving that block [Diagram: Client, NameNode, and DataNodes A through D holding replicated blocks A1 through A4]
  18. 18. Block Replication • Default of three replicas • Rack-aware system – One block on same rack – One block on same rack, different host – One block on another rack • Automatic re-copy by NameNode, as needed [Diagram: Rack 1 and Rack 2, each holding several DataNodes (DN)]
  19. 19. HDFS 2.0 Features • NameNode High-Availability (HA) – Two redundant NameNodes in active/passive configuration – Manual or automated failover • NameNode Federation – Multiple independent NameNodes using the same collection of DataNodes
  20. 20. Hadoop MapReduce
  21. 21. Hadoop MapReduce 1.x • Moves the code to the data • JobTracker – Master service to monitor jobs • TaskTracker – Multiple services to run tasks – Same physical machine as a DataNode • A job contains many tasks • A task contains one or more task attempts
  22. 22. JobTracker • Monitors job and task progress • Issues task attempts to TaskTrackers • Re-tries failed task attempts • Four failed attempts = one failed job • Schedules jobs in FIFO order – Fair Scheduler • Single point of failure for MapReduce
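
The "four failed attempts = one failed job" rule can be pictured as a bounded retry loop. The sketch below is only an illustration: `TaskRetrySketch`, `runTask`, and the `IntPredicate` that stands in for a task attempt are hypothetical names, and the real JobTracker issues attempts to TaskTrackers rather than calling a function.

```java
import java.util.function.IntPredicate;

public class TaskRetrySketch {

    /** Number of attempts before the task, and with it the job, is considered failed. */
    public static final int MAX_ATTEMPTS = 4;

    /**
     * Runs a task attempt up to MAX_ATTEMPTS times.
     * The predicate receives the attempt number (starting at 1) and returns true on success.
     * Returns the number of the successful attempt, or -1 if every attempt failed.
     */
    public static int runTask(IntPredicate attempt) {
        for (int n = 1; n <= MAX_ATTEMPTS; n++) {
            if (attempt.test(n)) {
                return n;
            }
        }
        return -1; // all attempts failed, so the job fails
    }

    public static void main(String[] args) {
        // A task that fails twice and succeeds on its third attempt.
        System.out.println(runTask(n -> n == 3));
    }
}
```
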
  23. 23. TaskTrackers • Runs on same node as DataNode service • Sends heartbeats and task reports to JobTracker • Configurable number of map and reduce slots • Runs map and reduce task attempts – Separate JVM!
  24. 24. Exploiting Data Locality • JobTracker will schedule task on a TaskTracker that is local to the block – 3 options! • If TaskTracker is busy, selects TaskTracker on same rack – Many options! • If still busy, chooses an available TaskTracker at random – Rare!
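
The three-step preference order (node-local, then rack-local, then any free TaskTracker at random) could be sketched like this. Everything here is an assumption made for illustration: the `Tracker` type, the `pick` method, and the idea of passing replica hosts and racks as sets are hypothetical and do not mirror the JobTracker's real scheduling code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class LocalitySketch {

    /** Hypothetical view of a TaskTracker: its host, its rack, and whether it has a free slot. */
    public static final class Tracker {
        final String host;
        final String rack;
        final boolean free;

        public Tracker(String host, String rack, boolean free) {
            this.host = host;
            this.rack = rack;
            this.free = free;
        }
    }

    /** Returns the host chosen to run a task over a block, or null if no tracker is free. */
    public static String pick(List<Tracker> trackers, Set<String> replicaHosts,
                              Set<String> replicaRacks, Random random) {
        // 1. Prefer a free tracker on a node that holds a replica of the block.
        for (Tracker t : trackers) {
            if (t.free && replicaHosts.contains(t.host)) return t.host;
        }
        // 2. Otherwise prefer a free tracker on a rack that holds a replica.
        for (Tracker t : trackers) {
            if (t.free && replicaRacks.contains(t.rack)) return t.host;
        }
        // 3. Otherwise choose any free tracker at random.
        List<Tracker> free = new ArrayList<>();
        for (Tracker t : trackers) {
            if (t.free) free.add(t);
        }
        return free.isEmpty() ? null : free.get(random.nextInt(free.size())).host;
    }
}
```

With three replicas per block, step 1 has up to three candidate nodes, which is why the random fallback in step 3 is described as rare.
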
  25. 25. How MapReduce Works 1. Client submits job to JobTracker 2. JobTracker submits tasks to TaskTrackers 3. Job output is written to DataNodes w/replication 4. JobTracker reports metrics [Diagram: Client, JobTracker, and four nodes each running a DataNode and a TaskTracker; input blocks A1 through A4 and output blocks B1 through B4]
  26. 26. How MapReduce Works - Failure • JobTracker assigns task to different node [Diagram: Client, JobTracker, and four nodes each running a DataNode and a TaskTracker; input blocks A1 through A4 and output blocks B1 through B4]
  27. 27. YARN • Abstract framework for distributed application development • Split functionality of JobTracker into two components – ResourceManager – ApplicationMaster • TaskTracker becomes NodeManager – Containers instead of map and reduce slots • Configurable amount of memory per NodeManager
  28. 28. MapReduce 2.x on YARN • MapReduce API has not changed – Rebuild required to upgrade from 1.x to 2.x • Application Master launches and monitors job via YARN • MapReduce History Server to store… history
  29. 29. Hadoop Ecosystem
  30. 30. Hadoop Ecosystem • Core Technologies – Hadoop Distributed File System – Hadoop MapReduce • Many other tools… – Which I will be describing… now
  31. 31. Moving Data • Sqoop – Moving data between RDBMS and HDFS – Say, migrating MySQL tables to HDFS • Flume – Streams event data from sources to sinks – Say, weblogs from multiple servers into HDFS
  32. 32. Flume Architecture
  33. 33. Higher Level APIs • Pig – Data-flow language, aptly named Pig Latin, that generates one or more MapReduce jobs against data stored locally or in HDFS • Hive – Data warehousing solution that lets users write SQL-like queries, which generate a series of MapReduce jobs against data stored in HDFS
  34. 34. Pig Word Count
      A = LOAD '$input';
      B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
      C = GROUP B BY word;
      D = FOREACH C GENERATE group AS word, COUNT(B);
      STORE D INTO '$output';
  35. 35. Key/Value Stores • HBase • Accumulo • Implementations of Google’s Big Table for HDFS • Provides random, real-time access to big data • Supports updates and deletes of key/value pairs
  36. 36. HBase Architecture [Diagram components: Client, ZooKeeper, Master, RegionServers, Regions, Stores, MemStores, StoreFiles, HDFS]
  37. 37. Data Structure • Avro – Data serialization system designed for the Hadoop ecosystem – Expressed as JSON • Parquet – Compressed, efficient columnar storage for Hadoop and other systems
  38. 38. Scalable Machine Learning • Mahout – Library for scalable machine learning written in Java – Very robust examples! – Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
  39. 39. Workflow Management • Oozie – Scheduling system for Hadoop Jobs – Support for: • Java MapReduce • Streaming MapReduce • Pig, Hive, Sqoop, Distcp • Any ol’ Java or shell script program
  40. 40. Real-time Stream Processing • Storm – Open-source project that streams data from sources, called spouts, through a series of execution agents called bolts – Scalable and fault-tolerant, with guaranteed processing of data – Benchmarks of over a million tuples processed per second per node
  41. 41. Distributed Application Coordination • ZooKeeper – An effort to develop and maintain an open-source server which enables highly reliable distributed coordination – Designed to be simple, replicated, ordered, and fast – Provides configuration management, distributed synchronization, and group services for applications
  42. 42. ZooKeeper Architecture
  43. 43. Hadoop Streaming • Write MapReduce mappers and reducers using stdin and stdout • Execute on command line using Hadoop Streaming JAR
      hadoop jar hadoop-streaming.jar -input input -output outputdir -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer /bin/wc
  44. 44. SQL on Hadoop • Apache Drill • Cloudera Impala • Hive Stinger • Pivotal HAWQ • MPP execution of SQL queries against HDFS data
  45. 45. HAWQ Architecture
  46. 46. That’s a lot of projects • I am likely missing several (Sorry, guys!) • Each cropped up to solve a limitation of Hadoop Core • Know your ecosystem • Pick the right tool for the right job
  47. 47. Sample Architecture [Diagram components: Website (Webserver), Sales, and Call Center sources, each with a Flume Agent; HDFS; HBase; MapReduce; Pig; SQL; Storm; Oozie]
  48. 48. MapReduce Primer
  49. 49. MapReduce Paradigm • Data processing system with two key phases • Map – Perform a map function on input key/value pairs to generate intermediate key/value pairs • Reduce – Perform a reduce function on intermediate key/value groups to generate output key/value pairs • Groups created by sorting map output
  50. 50. Map Input: (0, "hadoop is fun") (52, "I love hadoop") (104, "Pig is more fun")
      Map Output:
        Map Task 0: ("hadoop", 1) ("is", 1) ("fun", 1)
        Map Task 1: ("I", 1) ("love", 1) ("hadoop", 1)
        Map Task 2: ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
      SHUFFLE AND SORT
      Reducer Input Groups:
        Reduce Task 0: ("fun", {1,1}) ("hadoop", {1,1}) ("love", {1})
        Reduce Task 1: ("I", {1}) ("is", {1,1}) ("more", {1}) ("Pig", {1})
      Reducer Output:
        Reduce Task 0: ("fun", 2) ("hadoop", 2) ("love", 1)
        Reduce Task 1: ("I", 1) ("is", 2) ("more", 1) ("Pig", 1)
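
The data flow on this slide can be reproduced in a few lines of plain Java, without a cluster. This is a single-process sketch, assuming in-memory lists in place of map tasks and a `TreeMap` in place of the shuffle and sort; the class and method names (`WordCountFlowSketch`, `map`, `shuffleAndSort`, `reduce`, `run`) are made up for the example and are not Hadoop APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountFlowSketch {

    /** Map: emit (word, 1) for every whitespace-separated token in the line. */
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle and sort: group the map output values by key, with keys in sorted order. */
    public static SortedMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** Reduce: sum the counts collected for one word. */
    public static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Runs all three phases over the input lines and returns word -> total count. */
    public static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : lines) {
            mapOutput.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffleAndSort(mapOutput).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hadoop is fun", "I love hadoop", "Pig is more fun")));
    }
}
```

Unlike the slide, this sketch sends every group through one reducer; splitting the groups between Reduce Task 0 and Reduce Task 1 is the partitioner's job.
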
  51. 51. Hadoop MapReduce Components • Map Phase – Input Format – Record Reader – Mapper – Combiner – Partitioner • Reduce Phase – Shuffle – Sort – Reducer – Output Format – Record Writer
  52. 52. Writable Interfaces
      public interface Writable {
        void write(DataOutput out);
        void readFields(DataInput in);
      }
      public interface WritableComparable<T> extends Writable, Comparable<T> { }
      • BooleanWritable • BytesWritable • ByteWritable • DoubleWritable • FloatWritable • IntWritable • LongWritable • NullWritable • Text
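
To show what implementing this pair of methods looks like, here is a round-trip sketch that runs with only the JDK. It assumes a local stand-in interface, `SimpleWritable`, with the same two-method shape; it is not Hadoop's `Writable`, and `WordCountPair` and `roundTrip` are hypothetical names. The one rule the sketch illustrates is that `readFields` must read fields in exactly the order `write` wrote them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

    /** Local stand-in with the same shape as the interface on the slide (not the Hadoop type). */
    public interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** A hypothetical custom value type holding a word and its count. */
    public static final class WordCountPair implements SimpleWritable {
        public String word = "";
        public int count;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields are read back in the same order they were written.
            word = in.readUTF();
            count = in.readInt();
        }
    }

    /** Serializes the pair to bytes and deserializes it into a fresh instance. */
    public static WordCountPair roundTrip(WordCountPair original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordCountPair copy = new WordCountPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
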
  53. 53. InputFormat
      public abstract class InputFormat<K, V> {
        public abstract List<InputSplit> getSplits(JobContext context);
        public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context);
      }
  54. 54. RecordReader
      public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        public abstract void initialize(InputSplit split, TaskAttemptContext context);
        public abstract boolean nextKeyValue();
        public abstract KEYIN getCurrentKey();
        public abstract VALUEIN getCurrentValue();
        public abstract float getProgress();
        public abstract void close();
      }
  55. 55. Mapper
      public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        protected void setup(Context context) { /* NOTHING */ }
        protected void cleanup(Context context) { /* NOTHING */ }
        protected void map(KEYIN key, VALUEIN value, Context context) {
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
        public void run(Context context) {
          setup(context);
          while (context.nextKeyValue())
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          cleanup(context);
        }
      }
  56. 56. Partitioner
      public abstract class Partitioner<KEY, VALUE> {
        public abstract int getPartition(KEY key, VALUE value, int numPartitions);
      }
      • Default HashPartitioner uses the key’s hashCode(), masked to be non-negative, % numPartitions
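
A hash partitioner is small enough to sketch in full. The standalone version below assumes plain `Object` keys instead of Hadoop's generic `Partitioner` class, and the name `HashPartitionSketch` is made up for the example. It masks the hash code with `Integer.MAX_VALUE` before taking the remainder, because `hashCode()` may be negative and a negative partition number would be invalid.

```java
public class HashPartitionSketch {

    /** Maps a key to a partition number in the range [0, numPartitions). */
    public static int getPartition(Object key, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative even for negative hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "is", "fun", "Pig"};
        for (String word : words) {
            System.out.println(word + " -> partition " + getPartition(word, 2));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word lands on the same reducer, which is what makes the per-key grouping in the reduce phase possible.
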
  57. 57. Reducer
      public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        protected void setup(Context context) { /* NOTHING */ }
        protected void cleanup(Context context) { /* NOTHING */ }
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
          for (VALUEIN value : values)
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
        public void run(Context context) {
          setup(context);
          while (context.nextKey())
            reduce(context.getCurrentKey(), context.getValues(), context);
          cleanup(context);
        }
      }
  58. 58. OutputFormat
      public abstract class OutputFormat<K, V> {
        public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context);
        public abstract void checkOutputSpecs(JobContext context);
        public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context);
      }
  59. 59. RecordWriter
      public abstract class RecordWriter<K, V> {
        public abstract void write(K key, V value);
        public abstract void close(TaskAttemptContext context);
      }
  60. 60. Word Count Example
  61. 61. Problem • Count the number of times each word is used in a body of text • Uses TextInputFormat and TextOutputFormat
      map(byte_offset, line)
        foreach word in line
          emit(word, 1)
      reduce(word, counts)
        sum = 0
        foreach count in counts
          sum += count
        emit(word, sum)
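  62. 62. Mapper Code
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every whitespace-separated token in the input line.
        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
          }
        }
      }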
  63. 63. Shuffle and Sort 1. Mapper outputs to a single logically partitioned file 2. Reducers copy their parts 3. Reducer merges partitions, sorting by key [Diagram: Mappers 0 through 3 each write partitions P0 through P3; Reducers 0 through 3 each collect their own partition from every mapper]
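
Step 3, where a reducer merges the partitions it copied while keeping keys sorted, is essentially a k-way merge of already-sorted runs. The sketch below is one way to do that with a priority queue. It assumes each copied partition is an in-memory sorted list of string keys; `ShuffleMergeSketch` and `merge` are hypothetical names, and Hadoop's own merge works on spilled files rather than lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ShuffleMergeSketch {

    /** Merges several individually sorted runs of keys into one sorted list. */
    public static List<String> merge(List<List<String>> sortedRuns) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<int[]>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1]).compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = sortedRuns.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run the smallest key came from.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // One sorted run per mapper, as copied by a single reducer.
        System.out.println(merge(Arrays.asList(
                Arrays.asList("fun", "is"),
                Arrays.asList("hadoop", "is"),
                Arrays.asList("fun"))));
    }
}
```

After the merge, equal keys sit next to each other, so the reducer can hand each key and its values to `reduce` in a single pass.
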
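  64. 64. Reducer Code
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // The input value type must match the mapper's output value type (IntWritable).
      public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outvalue = new IntWritable();
        private int sum = 0;

        // Sums the counts for one word and emits (word, sum).
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          outvalue.set(sum);
          context.write(key, outvalue);
        }
      }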
  65. 65. So what’s so hard about it? [Diagram: a large box labeled "All the problems you'll ever have ever" with a tiny box inside it labeled "MapReduce"] • That’s a tiny box
  66. 66. So what’s so hard about it? • MapReduce is a limitation • Entirely different way of thinking • Simple processing operations such as joins are not so easy when expressed in MapReduce • Proper implementation is not so easy • Lots of configuration and implementation details for optimal performance – Number of reduce tasks, data skew, JVM size, garbage collection
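
To make the point about joins concrete, here is a single-process sketch of a reduce-side inner join, one common way to express a join in MapReduce terms. It is an illustration under assumptions: records are `{key, value}` string pairs, the tags "L" and "R" mark which input a record came from, and a `TreeMap` stands in for the shuffle. The names `ReduceSideJoinSketch` and `join` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    /** Inner-joins two lists of {key, value} records, emitting "key,leftValue,rightValue". */
    public static List<String> join(List<String[]> left, List<String[]> right) {
        // "Map" phase: tag each record with its source and group by join key.
        SortedMap<String, List<String[]>> groups = new TreeMap<>();
        for (String[] record : left) {
            groups.computeIfAbsent(record[0], k -> new ArrayList<>()).add(new String[] {"L", record[1]});
        }
        for (String[] record : right) {
            groups.computeIfAbsent(record[0], k -> new ArrayList<>()).add(new String[] {"R", record[1]});
        }
        // "Reduce" phase: per key, split the values by tag and emit their cross product.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : groups.entrySet()) {
            List<String> leftValues = new ArrayList<>();
            List<String> rightValues = new ArrayList<>();
            for (String[] tagged : group.getValue()) {
                if (tagged[0].equals("L")) {
                    leftValues.add(tagged[1]);
                } else {
                    rightValues.add(tagged[1]);
                }
            }
            for (String l : leftValues) {
                for (String r : rightValues) {
                    output.add(group.getKey() + "," + l + "," + r);
                }
            }
        }
        return output;
    }
}
```

Even in this toy form, a one-line SQL join turns into tagging, grouping, and buffering all values for a key in memory, which hints at why data skew and JVM size show up in the list of tuning concerns.
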
  67. 67. So what does this mean for you? • Hadoop is written primarily in Java • Components are extendable and configurable • Custom I/O through Input and Output Formats – Parse custom data formats – Read and write using external systems • Higher-level tools enable rapid development of big data analysis
  68. 68. Resources, Wrap-up, etc. • http://hadoop.apache.org • Very supportive community • Strata + Hadoop World Oct. 28th – 30th in Manhattan • Plenty of resources available to learn more – Blogs – Email lists – Books – Shameless plug: MapReduce Design Patterns
  69. 69. Getting Started • Pivotal HD Single-Node VM and Community Edition – http://gopivotal.com/pivotal-products/data/pivotal-hd • For the brave and bold -- Roll-your-own! – http://hadoop.apache.org/docs/current
  70. 70. Acknowledgements • Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation • Cloudera Impala is a trademark of Cloudera • Parquet is copyright Twitter, Cloudera, and other contributors • Storm is licensed under the Eclipse Public License
  71. 71. Learn More. Stay Connected. • Talk to us on Twitter: @springcentral • Find Session replays on YouTube: spring.io/video
