Hadoop pig

Lamine's presentation...

Transcript

  • 1. The analytics stack: Hadoop & Pig
  • 2. Outline of the presentation
      • Hadoop
          • Motivations. What is it? High-level concepts
          • The ecosystem. The MapReduce model & framework, and HDFS
          • Programming with Hadoop
      • Pig
          • What is it? Motivations
          • Model & components
      • Integration with Cassandra
  • 3. Please interrupt and ask questions!
  • 4. Traditional HPC systems
      • CPU-intensive computations
          • Relatively small amount of data
          • Tightly-coupled applications
          • Highly concurrent I/O requirements
          • Complex message passing paradigms such as MPI, PVM…
          • Developers might need to spend some time designing for failure
  • 5. Challenges
      • Data and storage
          • Locality: computation close to the data
      • In large-scale systems, nodes fail
          • Mean time between failures: 1 node / 3 years, 1000 nodes / 1 day
          • Built-in fault-tolerance
      • Distributed programming is complex
          • Need a simple data-parallel programming model. Users structure the application in high-level functions; the system distributes the data & jobs and handles communications and faults
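The failure figures on this slide follow from simple arithmetic. A minimal sketch in Python, assuming node failures are independent and occur at a constant rate (the slide does not state a failure model, so this is an assumption); the function name is illustrative only:

```python
DAYS_PER_YEAR = 365

def cluster_mtbf_days(node_mtbf_years, node_count):
    """Expected time between failures anywhere in the cluster, in days.

    Assumes independent nodes with a constant failure rate, so the
    cluster-wide failure rate is node_count times the per-node rate.
    """
    return node_mtbf_years * DAYS_PER_YEAR / node_count

print(cluster_mtbf_days(3, 1))     # a single node: 1095.0 days
print(cluster_mtbf_days(3, 1000))  # 1000 nodes: roughly one failure a day
```

With 1000 nodes the expected gap between failures drops to about a day, which is why fault tolerance has to be built into the framework.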
  • 6. What requirements?
      • A simple data-parallel programming model, designed for high scalability and resiliency
          • Scalability to large-scale data volumes
          • Automated fault-tolerance at application level rather than relying on high-availability hardware
          • Simplified I/O and task monitoring
          • All based on cost-efficient commodity machines (cheap, but unreliable), and commodity network
  • 7. Hadoop's core concepts
      • Data spread in advance, persistent (in terms of locality), and replicated
      • No inter-dependencies / shared-nothing architecture
      • Applications written in two pieces of code
          • And developers do not have to worry about the underlying issues in networking, job interdependencies, scheduling, etc.
  • 8. Where does it come from?
      • Hadoop originated from Apache Nutch, an open source web search engine
      • After the publication of the GFS and MapReduce papers, in 2003 & 2004, the Nutch developers decided to implement open source versions
      • In February 2006, it became Hadoop, with a dedicated team at Yahoo!
      • September 2007 - release 0.14.1
      • Last release 1.0.3 out last week
      • Used by a large number of companies including Facebook, LinkedIn, Twitter and Hulu, among many others
  • 9. The model
      • A map function processes a key/value pair to generate a set of intermediate key/value pairs
          • Divides the problem into smaller 'intermediate key/value' pairs
      • The reduce function merges all intermediate values associated with the same intermediate key
      • The run-time system takes care of:
          • Partitioning the input data across nodes (blocks/chunks typically of 64 MB to 128 MB)
          • Scheduling the data and execution. Maps operate on a single block.
          • Managing node failures, replication, re-submissions…
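The map, group-by-key and reduce steps described on this slide can be sketched on a single machine. This is an illustration in Python, not Hadoop's API: `run_mapreduce`, `wc_mapper` and `wc_reducer` are hypothetical names, and the in-memory dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run mapper and reducer over (key, value) records on one machine."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: collect the values of each intermediate key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per intermediate key.
    results = []
    for out_key in sorted(intermediate):
        results.extend(reducer(out_key, intermediate[out_key]))
    return results

def wc_mapper(offset, line):
    # key: offset of the line, value: the line itself
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    # key: a word, value: all counts emitted for that word
    yield word, sum(counts)

lines = [(0, "the quick fox"), (14, "the lazy dog")]
print(run_mapreduce(lines, wc_mapper, wc_reducer))
```

In a real cluster the framework runs many mappers in parallel, one per block, and routes each intermediate key to a single reducer.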
  • 10. Simple Word Count
        # key: offset, value: line
        def mapper():
            for line in open("doc"):
                for word in line.split():
                    output(word, 1)

        # key: a word, value: iterator over counts
        def reducer():
            output(key, sum(value))
  • 11. The Combiner
      • A combiner is a local aggregation function for repeated keys produced by the map
      • Works for associative and commutative functions like sum, count, max
      • Decreases the size of intermediate data / communications
      • Map-side aggregation for word count:
        def combiner():
            output(key, sum(values))
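To make the saving concrete, here is a small Python sketch of map-side aggregation for word count. The helper names `map_words` and `combine` are illustrative and not part of Hadoop; the sketch only counts how many pairs would leave one mapper with and without a combiner.

```python
from collections import Counter

def map_words(line):
    """Emit one (word, 1) pair per word, as the word-count mapper does."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Sum the counts of repeated keys within one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

pairs = map_words("to be or not to be")
combined = combine(pairs)
print(len(pairs), "pairs without a combiner")
print(len(combined), "pairs with a combiner")
```

Because summing is associative and commutative, the reducer produces the same totals whether or not the combiner ran.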
  • 12. Some other basic examples…
      • Distributed Grep:
          • Map function emits a line if it matches a supplied pattern
          • Reduce function is an identity function that copies the supplied intermediate data to the output
      • Count of URL accesses:
          • Map function processes logs of web page requests and outputs <URL, 1>
          • Reduce function adds together all values for the same URL, emitting <URL, total count> pairs
      • Reverse Web-Link graph:
          • Map function outputs <tgt, src> for each link to a tgt in a page named src
          • Reduce concatenates the list of all src URLs associated with a given tgt URL and emits the pair: <tgt, list(src)>
      • Inverted Index:
          • Map function parses each document, emitting a sequence of <word, doc_ID>
          • Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
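As an illustration of the last example, the inverted index can be sketched in a few lines of Python. This is a single-machine sketch with hypothetical function names; the dictionary stands in for the shuffle, and removing duplicate document IDs is a choice made here, not something the slide specifies.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    """Emit <word, doc_ID> for every word in the document."""
    for word in text.split():
        yield word, doc_id

def index_reducer(word, doc_ids):
    """Emit <word, list(doc_ID)>, with duplicates removed."""
    return word, sorted(set(doc_ids))

def build_index(documents):
    grouped = defaultdict(list)  # stands in for the shuffle
    for doc_id, text in documents.items():
        for word, emitted_id in index_mapper(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(index_reducer(word, ids) for word, ids in grouped.items())

docs = {"d1": "pig runs on hadoop", "d2": "hadoop stores data"}
index = build_index(docs)
print(index["hadoop"])
```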
  • 13. Hadoop Ecosystem: core components, from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 14. Hadoop components
      • Hadoop consists of two core components
          • The MapReduce framework, and
          • The Hadoop Distributed File System
      • MapReduce layer
          • JobTracker
          • TaskTrackers
      • HDFS layer
          • Namenode
          • Secondary namenode
          • Datanode
      • [Figure: example of a typical physical distribution within a Hadoop cluster]
  • 15. HDFS Scalable and fault-tolerant. Based on Namenode Google’s GFS File1 1 Single namenode stores metadata (file 2 3 names, block locations, etc.). 4 Files split into chunks, replicated across several datanodes (typically 3+). It is rack- aware Optimised for large files, sequential 1 2 1 3 streaming reads, rather than random 2 1 4 2 4 3 3 4 Files written once, no append Datanodes 15
  • 16. HDFS
      • HDFS API / HDFS FS Shell for command line*
        > hadoop fs -copyFromLocal local_dir hdfs_dir
        > hadoop fs -copyToLocal hdfs_dir local_dir
      • Tools
          • Flume: collects, aggregates and moves log data from application servers to HDFS
          • Sqoop: HDFS import and export to SQL
      * http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
  • 17. MapReduce execution
      • In Hadoop, a Job (full program) is a set of tasks
      • Each task (mapper or reducer) is attempted at least once, or multiple times if it crashes. Multiple attempts may also occur in parallel
      • Each task runs inside a separate JVM on the tasktracker
      • All the class files are assembled into a jar file, which will be uploaded into HDFS, before notifying the tasktracker
  • 18. MapReduce execution
      • [Figure: a MapReduce job. The master assigns input splits 0 to 4 to map workers, which read them and write intermediate files locally; reduce workers read those files remotely and write the output files]
  • 19. Getting Started…
      • Multiple choices - vanilla Apache version, or one of the numerous existing distros
          • hadoop.apache.org
          • www.cloudera.com [A set of VMs is also provided]
          • http://www.karmasphere.com/
          • …
      • Three ways to write jobs in Hadoop:
          • Java API
          • Hadoop Streaming (for Python, Perl, etc.)
          • Pipes API (C++)
  • 20. Word Count in Java
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setMapperClass(MapClass.class);
            conf.setCombinerClass(ReduceClass.class);
            conf.setReducerClass(ReduceClass.class);
            FileInputFormat.setInputPaths(conf, args[0]);
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            JobClient.runJob(conf);
        }
  • 21. Word Count in Java – mapper
        public class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    // Emit <word, 1> for every token on the line
                    out.collect(new Text(itr.nextToken()), ONE);
                }
            }
        }
  • 22. Word Count in Java – reducer
        public class ReduceClass extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                out.collect(key, new IntWritable(sum));
            }
        }
  • 23. Getting keys and values
      • [Figure: on the map side, the InputFormat divides the input file into input splits, and a RecordReader feeds each split to a Mapper; on the reduce side, each Reducer writes through the OutputFormat's RecordWriter to its own output file]
  • 24. Hadoop Streaming
      • Mapper.py:
        #!/usr/bin/env python
        import sys
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%s" % (word, 1))
      • Reducer.py:
        #!/usr/bin/env python
        import sys
        counts = {}
        for line in sys.stdin:
            word, count = line.split("\t", 1)
            counts[word] = counts.get(word, 0) + int(count)
        for word, count in counts.items():
            print("%s\t%s" % (word.lower(), count))
      • You can locally test your code on the command line:
        $> cat data | mapper | sort | reducer
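Because Hadoop Streaming hands the reducer its input sorted by key (which is what the `sort` in the local test pipeline imitates), a reducer can also total one word at a time instead of holding every word in a dictionary. A sketch of that alternative in Python; it is not the presenter's code, and `reduce_sorted` is an illustrative name:

```python
from itertools import groupby

def reduce_sorted(lines):
    """Yield (word, total) from 'word<TAB>count' lines sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, hence the sorted input.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

sorted_lines = ["be\t1\n", "be\t1\n", "or\t1\n", "to\t1\n", "to\t1\n"]
print(list(reduce_sorted(sorted_lines)))
```

In an actual streaming job the same function would be fed `sys.stdin` and each result printed as a tab-separated line.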
  • 25. High-level tools
      • MapReduce is fairly low-level: must think about keys, values, partitioning, etc.
      • How to express parallel algorithms by a series of MapReduce jobs
          • Can be hard to capture common job building blocks
      • Different use cases require different tools!
  • 26. Pig Apache Pig is a platform raising a level of abstraction for processing large datasets. Its language, Pig Latin is a simple query algebra expressing data transformations and applying functions to records Pig MapReduce jobs Hadoop / HDFS job submission Started at Yahoo! Research, >60% of Hadoop jobs within Yahoo! are Pig jobs 26
  • 27. Motivations MapReduce requires a Java programmer  Solution was to abstract it and create a system where users are familiar with scripting languages Other than very trivial applications, MapReduce requires multiple stages, leading to long development cycles  Rapid prototyping. Increased productivity In MapReduce users have to reinvent common functionality (join, filter, etc.)  Pig provides them 27
  • 28. Used for
      • Rapid prototyping of algorithms for processing large datasets
      • Log analysis
      • Ad hoc queries across various large datasets
      • Analytics (including through sampling)
      • Pig Mix provides a set of performance and scalability benchmarks. Currently 1.1 times MapReduce speed.
  • 29. Using Pig
      • Grunt, the Pig shell
      • Executing scripts directly
      • Embedding Pig in Java (using PigServer, similar to SQL using JDBC), or Python
      • A range of tools including Eclipse plug-ins
          • PigPen, Pig Editor…
  • 30. Execution modes
      • Pig has two execution types or modes: local mode and Hadoop mode
      • Local
          • Pig runs in a single JVM and accesses the local filesystem. Starting from v0.7 it uses the Hadoop job runner.
      • Hadoop mode
          • Pig runs on a Hadoop cluster (you need to tell Pig about the version and point it to your Namenode and Jobtracker)
  • 31. Running Pig
      • Pig resides on the user's machine and can be independent from the Hadoop cluster
      • Pig is written in Java and is portable
          • Compiles into MapReduce jobs and submits them to the cluster
      • No need to install anything extra on the cluster
      • [Figure: Pig client]
  • 32. How does it work
      • Pig defines a DAG: a step-by-step set of operations, each performing a transformation
      • Pig defines a logical plan for these transformations:
        A = LOAD 'file' AS (line);
        B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
        C = GROUP B BY word;
        D = FOREACH C GENERATE group, COUNT(B);
        STORE D INTO 'output';
      • Pig then:
          • Parses, checks, & optimises
          • Plans the execution
          • Maps & Reduces
          • Passes the jar to Hadoop
          • Monitors the progress
  • 33. Data types & expressions
      • Scalar types:
          • Int, Long, Float, Double, Chararray, Bytearray
      • Complex types representing nested structures:
          • Tuple: sequence of fields of any type
          • Bag: an unordered collection of tuples
          • Map: a set of key-value pairs. Keys must be atoms, values may be any type
      • Expressions:
          • Used in Pig as a part of a statement: field name, position ($), arithmetic, conditional, comparison, Boolean, etc.
  • 34. Functions
      • Load / Store
          • Data loaders: PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump
      • Evaluation
          • Many built-in functions: MAX, COUNT, SUM, DIFF, SIZE…
      • Filter
          • A special type of eval function used by the FILTER operator. IsEmpty is a built-in function
      • Comparison
          • Function used in ORDER statement; ASC | DESC
  • 35. Schemas
      • Schemas enable you to associate names and types with the fields in the relation
      • Schemas are optional but recommended whenever possible; type declarations result in better parse-time error checking and more efficient code execution
      • They are defined using the AS keyword with operators
          • Schema definition for simple data types:
            > records = LOAD 'input/data' AS (id:int, date:chararray);
  • 36. Statements and aliases
      • Each statement, defining a data processing operator / relation, produces a dataset with an alias
        grunt> records = LOAD 'input/data' AS (id:int, date:chararray);
      • LOAD returns tuples, whose elements can be referenced by position or by name
      • Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
  • 37. Filtering data
      • Filter is used to work with tuples and rows of data
      • Select the data you want, or remove the data you are not interested in
      • Filtering early in the processing pipeline minimises the amount of data flowing through the system, which can improve efficiency
        grunt> filtered_records = FILTER records BY id == 234;
  • 38. Foreach .. Generate
      • Foreach .. Generate acts on the columns of every row in a relation
        grunt> ids = FOREACH records GENERATE id;
      • Positional reference. This statement has the same output
        grunt> ids = FOREACH records GENERATE $0;
      • The elements of 'ids' however are not named 'id' unless you add 'AS id' at the end of your statement
        grunt> ids = FOREACH records GENERATE $0 AS id;
  • 39. Grouping and joining
      • Group .. by makes an output bag containing grouped fields with the same schema using a grouping key
      • Join performs an inner equijoin of two or more relations based on common field values. You can also perform outer joins using the keywords left, right and full
      • Cogroup is similar to Group, using multiple relations, and creates a nested set of output tuples
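For readers new to these operators, the shape of their results can be imitated on small in-memory lists. This is an analogy in Python, not how Pig implements them; `group_by` and `inner_join` are hypothetical helpers and the sample data is made up.

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Like GROUP .. BY: map each key to the bag (list) of its tuples."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

def inner_join(left, left_index, right, right_index):
    """Like JOIN: concatenate tuples whose key fields are equal."""
    right_groups = group_by(right, right_index)
    return [lrow + rrow
            for lrow in left
            for rrow in right_groups.get(lrow[left_index], [])]

users = [("ann", 21), ("bob", 40)]
pages = [("ann", "a.com"), ("ann", "b.com"), ("cat", "c.com")]
print(group_by(pages, 0))
print(inner_join(users, 0, pages, 0))
```

Rows without a match on the other side ("bob" and "cat" here) are dropped by the inner join; an outer join would keep them and pad the missing fields with nulls.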
  • 40. Ordering, combining, splitting…
      • Order imposes an order on the output, sorting a relation by one or more fields
      • The Limit statement limits the number of results
      • Split partitions a relation into two or more relations
      • The Sample operator selects a random data sample with the stated sample size
      • The Union operator merges the contents of two or more relations
  • 41. Stream The Stream operator allows to transform data in a relation using an external program or scriptgrunt> C = STREAM A THROUGH `cut -f 2`;  Extract the second field of A using cut The scripts are shipped to the cluster using grunt> DEFINE script `script.py` SHIP (‘script.py’); grunt> D = STREAM C THROUGH script AS (…); 41
  • 42. User defined functions
      • Support for, and a community around, user-defined functions (UDFs)
      • UDFs can encapsulate users' processing logic in filtering, comparison, evaluation, grouping, or storage
          • Filter functions for instance are all subclasses of FilterFunc, which itself is a subclass of EvalFunc
      • PiggyBank: the Pig community sharing their UDFs
      • DataFu: LinkedIn's collection of Pig UDFs
  • 43. A simple eval UDF example
        package myudfs;
        import …
        public class UPPER extends EvalFunc<String> {
            public String exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0)
                    return null;
                try {
                    String str = (String) input.get(0);
                    return str.toUpperCase();
                } catch (Exception e) {
                    throw WrappedIOException.wrap("Caught exception processing input row ", e);
                }
            }
        }
  • 44. An Example
      • Let's find the top 5 most visited pages by users aged 18 – 25
      • Input: user data file, and page view data file
      • [Figure: data flow. Load Users and Load Pages; Filter by age; Join on name; Group on url; Count clicks; Order by clicks; Take top 5]
  • 45. A simple script
        Users = LOAD 'users' AS (name, age);
        Filtered = FILTER Users BY age >= 18 AND age <= 25;
        Pages = LOAD 'pages' AS (user, url);
        Joined = JOIN Filtered BY name, Pages BY user;
        Grouped = GROUP Joined BY url;
        Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
        Sorted = ORDER Summed BY clicks DESC;
        Top5 = LIMIT Sorted 5;
        STORE Top5 INTO 'top5sites';
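To check the logic of this data flow on a small sample before running it on a cluster, the same steps can be written in plain Python. This is a single-machine sketch with made-up sample data; `top_pages` is a hypothetical helper, and it assumes each user name appears only once in the users data.

```python
from collections import Counter

def top_pages(users, pages, limit=5):
    """Top `limit` urls by clicks from users aged 18 to 25.

    users: iterable of (name, age); pages: iterable of (user, url).
    Assumes user names are unique.
    """
    # FILTER Users BY age >= 18 AND age <= 25
    young = {name for name, age in users if 18 <= age <= 25}
    # JOIN on name, GROUP BY url, COUNT
    clicks = Counter(url for user, url in pages if user in young)
    # ORDER BY clicks DESC, LIMIT
    return clicks.most_common(limit)

users = [("ann", 21), ("bob", 40), ("cat", 19)]
pages = [("ann", "a.com"), ("cat", "a.com"), ("bob", "b.com"), ("cat", "c.com")]
print(top_pages(users, pages))
```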
  • 46. In MapReduce!
      • [Slide shows the same job hand-written in Java against the MapReduce API: a full page of code defining the classes MRExample, LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks and LimitClicks, with the jobs chained through JobControl. The listing is not legible in this transcript.]
  • 47. Ease of Translation
      • Load Users: Users = LOAD …
      • Load Pages: Pages = LOAD …
      • Filter by age: Filtered = FILTER …
      • Join on name: Joined = JOIN …
      • Group on url: Grouped = GROUP …
      • Count clicks: Summed = … COUNT()…
      • Order by clicks: Sorted = ORDER …
      • Take top 5: Top5 = LIMIT …
  • 48. The Hadoop/Pig/Cassandra stack
      • Cassandra has gained some significant integration points with Hadoop and its analytics tools
      • In order to achieve Hadoop's data locality, Cassandra nodes must be part of the Hadoop cluster by running a tasktracker process. So the namenode and jobtracker can reside outside of the Cassandra cluster
      • [Figure: a three-node Cassandra/Hadoop cluster with external namenode / jobtracker]
  • 49. Hadoop jobs
      • Cassandra has a Java source package for Hadoop integration: org.apache.cassandra.hadoop
          • ColumnFamilyInputFormat extends InputFormat
          • ColumnFamilyOutputFormat extends OutputFormat
          • ConfigHelper: a helper class to configure Cassandra-specific information
      • Hadoop output streaming was introduced in 0.7 but removed from 0.8
  • 50. Pig alongside Cassandra
      • The Pig integration CassandraStorage() (a LoadFunc implementation) allows Pig to Load/Store data from/to Cassandra
        grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();
      • The pig_cassandra script, shipped with the Cassandra source, performs the necessary initialisation (Pig environment variables still need to be set)
      • Pygmalion is a set of scripts and UDFs to facilitate the use of Pig alongside Cassandra
  • 51. Workflow
      • A workflow system provides an infrastructure to set up & manage a sequence of interdependent jobs / set of jobs
      • The Hadoop ecosystem includes a set of workflow tools to run applications over MapReduce processes or high-level languages
          • Cascading (http://www.cascading.org/). A Java library defining data processing workflows and rendering them to MapReduce jobs
          • Oozie (http://yahoo.github.com/oozie/)
  • 52. Some links
      • http://hadoop.apache.org
      • http://pig.apache.org/
      • https://cwiki.apache.org/confluence/display/PIG/Index
      • PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
      • DataFu: https://github.com/linkedin/datafu
      • Pygmalion: https://github.com/jeromatron/pygmalion
      • http://code.google.com/edu/parallel/mapreduce-tutorial.html
      • Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
      • Interesting papers:
          • http://bit.ly/rskJho - Original MapReduce paper
          • http://bit.ly/KvFXxT - Pig paper: 'Building a High-Level Dataflow System on top of MapReduce: The Pig Experience'
  • 53. A simple data flow: top 50 users / locations [same script, different group key]
      • Load checkins data
      • Keep only the two ids
      • Group by user/loc id & Order
      • Limit to top 50
  • 54. Another data flow: all the checkins, over weeks
      • Load checkins data
      • Split_date
      • Group by date
      • Count the tuples
      • Group by weeks using Stream
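The slide does not show the script used with Stream to group by weeks. One possible approach, sketched in Python under the assumption that check-in dates are ISO 'YYYY-MM-DD' strings (the actual data format is not given in the deck), is to map each date to an ISO year and week key and group on that key; `week_key` is a hypothetical helper.

```python
from datetime import date

def week_key(day):
    """Return 'YYYY-Www' (ISO year and week) for a 'YYYY-MM-DD' string."""
    year, month, dom = (int(part) for part in day.split("-"))
    iso_year, iso_week, _ = date(year, month, dom).isocalendar()
    return "%d-W%02d" % (iso_year, iso_week)

print(week_key("2012-05-21"))
```

Using the ISO year rather than the calendar year keeps days around New Year in the same week group.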