Hadoop & Pig

Lamine's presentation...


Transcript

  • 1. The analytics stack Hadoop & Pig
  • 2. Outline of the presentation
    Hadoop
      - Motivations. What is it? And high-level concepts
      - The ecosystem. The MapReduce model & framework, and HDFS
      - Programming with Hadoop
    Pig
      - What is it? Motivations
      - Model & components
      - Integration with Cassandra
  • 3. Please interrupt and ask questions!
  • 4. Traditional HPC systems
      - CPU-intensive computations
      - Relatively small amounts of data
      - Tightly coupled applications
      - Highly concurrent I/O requirements
      - Complex message-passing paradigms such as MPI, PVM…
      - Developers might need to spend some time designing for failure
  • 5. Challenges
    Data and storage
      - Locality: computation close to the data
    In large-scale systems, nodes fail
      - Mean time between failures: 1 node / 3 years, so with 1000 nodes roughly one failure per day
      - Built-in fault tolerance is needed
    Distributed programming is complex
      - Need a simple data-parallel programming model: users structure the application in high-level functions; the system distributes the data & jobs and handles communications and faults
  • 6. What requirements?
    A simple data-parallel programming model, designed for high scalability and resiliency:
      - Scalability to large-scale data volumes
      - Automated fault tolerance at the application level, rather than relying on high-availability hardware
      - Simplified I/O and task monitoring
      - All based on cost-efficient commodity machines (cheap, but unreliable) and a commodity network
  • 7. Hadoop's core concepts
      - Data is spread in advance, persistent (in terms of locality), and replicated
      - No inter-dependencies / shared-nothing architecture
      - Applications are written as two pieces of code
      - Developers do not have to worry about the underlying issues of networking, job interdependencies, scheduling, etc.
  • 8. Where does it come from?
      - Hadoop originated from Apache Nutch, an open-source web search engine
      - After the publication of the GFS and MapReduce papers, in 2003 & 2004, the Nutch developers decided to implement open-source versions
      - In February 2006 it became Hadoop, with a dedicated team at Yahoo!
      - September 2007: release 0.14.1; the latest release, 1.0.3, came out last week
      - Used by a large number of companies including Facebook, LinkedIn, Twitter, and Hulu, among many others
  • 9. The model
      - A map function processes a key/value pair to generate a set of intermediate key/value pairs, dividing the problem into smaller 'intermediate key/value' pairs
      - The reduce function merges all intermediate values associated with the same intermediate key
      - The run-time system takes care of: partitioning the input data across nodes (blocks/chunks, typically 64 MB to 128 MB); scheduling the data and execution (maps operate on a single block); managing node failures, replication, and re-submissions
  • 10. Simple Word Count

        # key: offset, value: line
        def mapper():
            for line in open("doc"):
                for word in line.split():
                    output(word, 1)

        # key: a word, value: iterator over counts
        def reducer():
            output(key, sum(values))
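The pseudocode above leaves the shuffle implicit. A minimal local sketch of the whole job in plain Python (hypothetical helper names, no Hadoop involved) makes the grouping between map and reduce explicit:

```python
from collections import defaultdict

def mapper(line):
    # key (the offset) is irrelevant here; value: one line of the document
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # key: a word, values: the counts gathered for that word
    return key, sum(values)

def run_word_count(lines):
    # the "shuffle": group all intermediate values by intermediate key
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(word, counts) for word, counts in groups.items())
```

For example, run_word_count(["the quick brown fox", "the lazy dog"]) counts "the" twice and every other word once.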
  • 11. The Combiner
      - A combiner is a local aggregation function for repeated keys produced by the map
      - Works for associative functions like sum, count, max
      - Decreases the size of intermediate data / communications
    Map-side aggregation for word count:

        def combiner():
            output(key, sum(values))
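To see the effect, here is a small Python sketch (illustrative names, not Hadoop API) where the combiner collapses repeated keys on the map side before anything is shuffled:

```python
from collections import Counter

def map_words(line):
    # one (word, 1) pair per word, as the mapper would emit
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # local aggregation of repeated keys; valid because sum is associative
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())

intermediate = map_words("to be or not to be")  # 6 pairs leave the mapper
shuffled = combiner(intermediate)               # only 4 pairs cross the network
```

Six intermediate pairs shrink to four before being sent to the reducers; on real workloads with skewed keys the saving is far larger.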
  • 12. Some other basic examples…
      - Distributed grep: the map function emits a line if it matches a supplied pattern; the reduce function is an identity function that copies the supplied intermediate data to the output
      - Count of URL accesses: the map function processes logs of web-page requests and outputs <URL, 1>; the reduce function adds together all values for the same URL, emitting <URL, total count> pairs
      - Reverse web-link graph: the map function outputs <tgt, src> for each link to a tgt in a page named src; reduce concatenates the list of all src URLs associated with a given tgt URL and emits the pair <tgt, list(src)>
      - Inverted index: the map function parses each document, emitting a sequence of <word, doc_ID>; reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
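The reverse web-link graph is the least obvious of these; a short Python sketch (made-up data and helper names, no Hadoop) of its map and reduce steps:

```python
from collections import defaultdict

def map_page(src, links):
    # emit <tgt, src> for each link to tgt found in the page named src
    return [(tgt, src) for tgt in links]

def reduce_graph(pages):
    # concatenate all src URLs pointing at each tgt URL
    inverted = defaultdict(list)
    for src, links in pages.items():
        for tgt, origin in map_page(src, links):
            inverted[tgt].append(origin)
    return {tgt: sorted(srcs) for tgt, srcs in inverted.items()}

pages = {"a.html": ["b.html"], "c.html": ["a.html", "b.html"]}
```

Here reduce_graph(pages) inverts the link direction: b.html is pointed at by both a.html and c.html.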
  • 13. Hadoop Ecosystem: core components (from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/)
  • 14. Hadoop components
      - Hadoop consists of two core components: the MapReduce framework and the Hadoop Distributed File System (HDFS)
      - MapReduce layer: JobTracker, TaskTrackers
      - HDFS layer: Namenode, secondary namenode, Datanodes
    [Slide shows an example of a typical physical distribution within a Hadoop cluster]
  • 15. HDFS
      - Scalable and fault-tolerant; based on Google's GFS
      - A single namenode stores metadata (file names, block locations, etc.)
      - Files are split into chunks, replicated across several datanodes (typically 3+); placement is rack-aware
      - Optimised for large files and sequential streaming reads, rather than random access
      - Files are written once, no append
    [Slide diagram: the namenode maps a file's blocks 1-4 onto the datanodes holding the replicas]
  • 16. HDFS
      - HDFS API / HDFS FS shell for the command line*

        > hadoop fs -copyFromLocal local_dir hdfs_dir
        > hadoop fs -copyToLocal hdfs_dir local_dir

      - Tools:
          Flume: collects, aggregates and moves log data from application servers to HDFS
          Sqoop: HDFS import and export to SQL
    * http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
  • 17. MapReduce execution
      - In Hadoop, a job (full program) is a set of tasks
      - Each task (mapper or reducer) is attempted at least once, or multiple times if it crashes; multiple attempts may also occur in parallel
      - Tasks run inside a separate JVM on the tasktracker
      - All the class files are assembled into a jar file, which is uploaded into HDFS before the tasktracker is notified
  • 18. MapReduce execution
    [Slide diagram: a Master coordinates Workers; input splits 0-4 are read by map workers, intermediate files are written locally, and reduce workers perform remote reads before writing the output files]
  • 19. Getting started…
      - Multiple choices: the vanilla Apache version, or one of the numerous existing distros
          hadoop.apache.org
          www.cloudera.com [a set of VMs is also provided]
          http://www.karmasphere.com/
          …
      - Three ways to write jobs in Hadoop:
          Java API
          Hadoop Streaming (for Python, Perl, etc.)
          Pipes API (C++)
  • 20. Word Count in Java

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setMapperClass(MapClass.class);
            conf.setCombinerClass(ReduceClass.class);
            conf.setReducerClass(ReduceClass.class);
            FileInputFormat.setInputPaths(conf, args[0]);
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            JobClient.runJob(conf);
        }
  • 21. Word Count in Java – mapper

        public class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    out.collect(new Text(itr.nextToken()), ONE);
                }
            }
        }
  • 22. Word Count in Java – reducer

        public class ReduceClass extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                out.collect(key, new IntWritable(sum));
            }
        }
  • 23. Getting keys and values
    [Slide diagram: an InputFormat splits the input file into input splits, each read by a RecordReader feeding a Mapper; on the output side, an OutputFormat's RecordWriters write each Reducer's results to the output files]
  • 24. Hadoop Streaming
    Mapper.py:

        #!/usr/bin/env python
        import sys
        for line in sys.stdin:
            for word in line.split():
                print "%s\t%s" % (word, 1)

    Reducer.py:

        #!/usr/bin/env python
        import sys
        counts = {}
        for line in sys.stdin:
            word, count = line.split("\t", 1)
            counts[word] = counts.get(word, 0) + int(count)
        for word, count in counts.items():
            print "%s\t%s" % (word.lower(), count)

    You can locally test your code on the command line:

        $> cat data | mapper | sort | reducer
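The streaming contract is just lines on stdin/stdout, so the same mapper/reducer logic can be exercised without Hadoop by chaining plain functions. A sketch mirroring the two scripts above (function names are mine, not part of the streaming API):

```python
def mapper(lines):
    # emit one "word<TAB>1" line per word, like Mapper.py
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)

def reducer(lines):
    # accumulate counts per tab-separated key, like Reducer.py
    counts = {}
    for line in lines:
        word, count = line.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

# local equivalent of: cat data | mapper | sort | reducer
data = ["hello world", "hello again"]
result = reducer(sorted(mapper(data)))
```

The sort between the two stages plays the role of Hadoop's shuffle, which delivers each reducer its keys in sorted order.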
  • 25. High-level tools
      - MapReduce is fairly low-level: you must think about keys, values, partitioning, etc.
      - How do you express parallel algorithms as a series of MapReduce jobs? It can be hard to capture common job building blocks
      - Different use cases require different tools!
  • 26. Pig
      - Apache Pig is a platform raising the level of abstraction for processing large datasets
      - Its language, Pig Latin, is a simple query algebra expressing data transformations and applying functions to records
      - Pig compiles to MapReduce jobs, which are submitted to Hadoop / HDFS
      - Started at Yahoo! Research; >60% of Hadoop jobs within Yahoo! are Pig jobs
  • 27. Motivations
      - MapReduce requires a Java programmer; the solution was to abstract it and create a system familiar to users of scripting languages
      - Beyond very trivial applications, MapReduce requires multiple stages, leading to long development cycles; Pig enables rapid prototyping and increased productivity
      - In MapReduce, users have to reinvent common functionality (join, filter, etc.); Pig provides it
  • 28. Used for
      - Rapid prototyping of algorithms for processing large datasets
      - Log analysis
      - Ad hoc queries across various large datasets
      - Analytics (including through sampling)
    PigMix provides a set of performance and scalability benchmarks; currently about 1.1 times MapReduce speed.
  • 29. Using Pig
      - Grunt, the Pig shell
      - Executing scripts directly
      - Embedding Pig in Java (using PigServer, similar to SQL via JDBC) or Python
      - A range of tools, including Eclipse plug-ins: PigPen, Pig Editor…
  • 30. Execution modes
    Pig has two execution types or modes: local mode and Hadoop mode
      - Local: Pig runs in a single JVM and accesses the local filesystem. Starting from v0.7 it uses the Hadoop job runner
      - Hadoop mode: Pig runs on a Hadoop cluster (you need to tell Pig about the version and point it to your Namenode and Jobtracker)
  • 31. Running Pig
      - Pig resides on the user's machine and can be independent from the Hadoop cluster
      - Pig is written in Java and is portable: it compiles scripts into MapReduce jobs and submits them to the cluster
      - No need to install anything extra on the cluster
  • 32. How does it work
      - Pig defines a DAG: a step-by-step set of operations, each performing a transformation
      - Pig builds a logical plan for these transformations: it parses, checks & optimises, plans the execution, compiles to maps & reduces, passes the jar to Hadoop, and monitors progress

        A = LOAD 'file' AS (line);
        B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
        C = GROUP B BY word;
        D = FOREACH C GENERATE group, COUNT(B);
        STORE D INTO 'output';
  • 33. Data types & expressions
      - Scalar types: int, long, float, double, chararray, bytearray
      - Complex types representing nested structures:
          Tuple: a sequence of fields of any type
          Bag: an unordered collection of tuples
          Map: a set of key-value pairs; keys must be atoms, values may be any type
      - Expressions are used in Pig as parts of statements: field name, position ($), arithmetic, conditional, comparison, Boolean, etc.
  • 34. Functions
      - Load / Store: data loaders such as PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump
      - Evaluation: many built-in functions, e.g. MAX, COUNT, SUM, DIFF, SIZE…
      - Filter: a special type of eval function used by the FILTER operator; IsEmpty is a built-in example
      - Comparison: functions used in ORDER statements; ASC | DESC
  • 35. Schemas
      - Schemas let you associate names and types with the fields in a relation
      - Schemas are optional but recommended whenever possible; type declarations result in better parse-time error checking and more efficient execution
      - They are defined using the AS keyword with operators. Schema definition for simple data types:

        > records = LOAD 'input/data' AS (id:int, date:chararray);
  • 36. Statements and aliases
      - Each statement, defining a data processing operator / relation, produces a dataset with an alias:

        grunt> records = LOAD 'input/data' AS (id:int, date:chararray);

      - LOAD returns a relation of tuples, whose elements can be referenced by position or by name
      - Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
  • 37. Filtering data
      - Filter is used to work with tuples and rows of data
      - Select the data you want, or remove the data you are not interested in
      - Filtering early in the processing pipeline minimises the amount of data flowing through the system, which can improve efficiency

        grunt> filtered_records = FILTER records BY id == 234;
  • 38. Foreach .. Generate
      - Foreach .. Generate acts on columns, on every row in a relation:

        grunt> ids = FOREACH records GENERATE id;

      - Positional reference; this statement has the same output:

        grunt> ids = FOREACH records GENERATE $0;

      - The elements of 'ids' however are not named 'id' unless you add 'AS id' at the end of your statement:

        grunt> ids = FOREACH records GENERATE $0 AS id;
  • 39. Grouping and joining
      - Group .. by makes an output bag containing grouped fields with the same schema, using a grouping key
      - Join performs an inner equijoin of two or more relations based on common field values; you can also perform outer joins using the keywords left, right and full
      - Cogroup is similar to Group but takes multiple relations and creates a nested set of output tuples
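The semantics of these operators can be mimicked in a few lines of Python on lists of tuples (hypothetical helpers, for intuition only): GROUP produces one (key, bag) pair per key, and JOIN is an equijoin on field values:

```python
from collections import defaultdict

def group_by(relation, key_index):
    # GROUP .. BY: one output per key, holding the bag of matching tuples
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

def join(left, lkey, right, rkey):
    # JOIN: inner equijoin of two relations on common field values
    index = group_by(right, rkey)
    return [l + r for l in left for r in index.get(l[lkey], [])]

users = [("alice", 20), ("bob", 30)]
pages = [("alice", "a.com"), ("alice", "b.com"), ("carol", "c.com")]
```

Joining users and pages on the name field keeps only alice's rows: bob has no page views and carol has no user record, so neither survives the inner join.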
  • 40. Ordering, combining, splitting…
      - Order imposes an order on the output, to sort a relation by one or more fields
      - The Limit statement limits the number of results
      - Split partitions a relation into two or more relations
      - The Sample operator selects a random data sample with the stated sample size
      - The Union operator merges the contents of two or more relations
  • 41. Stream
      - The Stream operator allows you to transform data in a relation using an external program or script:

        grunt> C = STREAM A THROUGH `cut -f 2`;

        (extracts the second field of A using cut)
      - Scripts are shipped to the cluster using DEFINE:

        grunt> DEFINE script `script.py` SHIP ('script.py');
        grunt> D = STREAM C THROUGH script AS (…);
  • 42. User-defined functions
      - Pig supports, and has a community of, user-defined functions (UDFs)
      - UDFs can encapsulate users' processing logic in filtering, comparison, evaluation, grouping, or storage; filter functions, for instance, are all subclasses of FilterFunc, which is itself a subclass of EvalFunc
      - PiggyBank: the Pig community sharing their UDFs
      - DataFu: LinkedIn's collection of Pig UDFs
  • 43. A simple eval UDF example

        package myudfs;
        import …

        public class UPPER extends EvalFunc<String> {
            public String exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0)
                    return null;
                try {
                    String str = (String) input.get(0);
                    return str.toUpperCase();
                } catch (Exception e) {
                    throw WrappedIOException.wrap("Caught exception processing input row ", e);
                }
            }
        }
  • 44. An Example
    Let's find the top 5 most visited pages by users aged 18-25. Input: a user data file, and a page-view data file.
    [Slide flow: Load Users, Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
  • 45. A simple script

        Users = LOAD 'users' AS (name, age);
        Filtered = FILTER Users BY age >= 18 AND age <= 25;
        Pages = LOAD 'pages' AS (user, url);
        Joined = JOIN Filtered BY name, Pages BY user;
        Grouped = GROUP Joined BY url;
        Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
        Sorted = ORDER Summed BY clicks DESC;
        Top5 = LIMIT Sorted 5;
        STORE Top5 INTO 'top5sites';
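For intuition, the whole pipeline collapses to a few lines of ordinary Python on in-memory lists (toy data and invented names); each comment maps onto a statement of the script:

```python
from collections import Counter

def top5_sites(users, pages):
    # FILTER Users BY age >= 18 AND age <= 25
    eligible = {name for name, age in users if 18 <= age <= 25}
    # JOIN on name, GROUP BY url, COUNT the clicks per url
    clicks = Counter(url for user, url in pages if user in eligible)
    # ORDER BY clicks DESC, LIMIT 5
    return clicks.most_common(5)

users = [("ann", 19), ("bob", 40), ("cat", 22)]
pages = [("ann", "x.com"), ("cat", "x.com"), ("ann", "y.com"), ("bob", "x.com")]
```

With this toy data, bob is filtered out by age, so x.com gets two clicks and y.com one. The point of Pig is that the same nine-statement script runs unchanged when users and pages are terabytes on HDFS.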
  • 46. In MapReduce!
    [Slide shows the same pipeline written as raw Java MapReduce: roughly 170 lines defining LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks and LimitClicks mapper/reducer classes, plus five chained JobConfs wired together with JobControl]
  • 47. Ease of Translation
    [Slide maps each step of the flow onto one Pig statement: Load Users / Load Pages → Users = LOAD …, Pages = LOAD …; Filter by age → Filtered = FILTER …; Join on name → Joined = JOIN …; Group on url → Grouped = GROUP …; Count clicks → Summed = … COUNT()…; Order by clicks → Sorted = ORDER …; Take top 5 → Top5 = LIMIT …]
  • 48. The Hadoop/Pig/Cassandra stack
      - Cassandra has gained some significant integration points with Hadoop and its analytics tools
      - To achieve Hadoop's data locality, Cassandra nodes must be part of the Hadoop cluster, each running a tasktracker process; the namenode and jobtracker can reside outside the Cassandra cluster
    [Slide shows a three-node Cassandra/Hadoop cluster with an external namenode / jobtracker]
  • 49. Hadoop jobs
      - Cassandra has a Java source package for Hadoop integration: org.apache.cassandra.hadoop
          ColumnFamilyInputFormat extends InputFormat
          ColumnFamilyOutputFormat extends OutputFormat
          ConfigHelper: a helper class to configure Cassandra-specific information
      - Hadoop output streaming was introduced in 0.7 but removed in 0.8
  • 50. Pig alongside Cassandra
      - The Pig integration, CassandraStorage() (a LoadFunc implementation), allows Pig to load/store data from/to Cassandra:

        grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();

      - The pig_cassandra script, shipped with the Cassandra source, performs the necessary initialisation (the Pig environment variables still need to be set)
      - Pygmalion is a set of scripts and UDFs to facilitate the use of Pig alongside Cassandra
  • 51. Workflow
      - A workflow system provides an infrastructure to set up & manage a sequence of interdependent jobs, or sets of jobs
      - The Hadoop ecosystem includes a set of workflow tools to run applications over MapReduce processes or high-level languages:
          Cascading (http://www.cascading.org/): a Java library defining data-processing workflows and rendering them to MapReduce jobs
          Oozie (http://yahoo.github.com/oozie/)
  • 52. Some links
      - http://hadoop.apache.org
      - http://pig.apache.org/
      - https://cwiki.apache.org/confluence/display/PIG/Index
      - PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
      - DataFu: https://github.com/linkedin/datafu
      - Pygmalion: https://github.com/jeromatron/pygmalion
      - http://code.google.com/edu/parallel/mapreduce-tutorial.html
      - Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
      - Interesting papers:
          http://bit.ly/rskJho - the original MapReduce paper
          http://bit.ly/KvFXxT - the Pig paper, 'Building a High-Level Dataflow System on top of MapReduce: The Pig Experience'
  • 53. A simple data flow
    [Slide flow: Load checkins data → Keep only the two ids → Group by user/loc id → Order → Limit to top 50, giving the top 50 users / locations; the same script with a different group key yields the other ranking]
  • 54. Another data flow
    [Slide flow: Load checkins data → Split_date (using Stream) → Group by date → Group by weeks → Count the tuples, giving all the checkins over weeks]