A Brief Discussion on: Hadoop MapReduce, Pig, FlumeJava, Cascading & Dremel
Presented by: Somnath Mazumdar, 29th Nov 2011
MapReduce
- Based on Google's MapReduce programming framework
- File system: GFS for Google's MapReduce; HDFS for Hadoop
- Language: Google's MapReduce is written in C++, but Hadoop is in Java
- Basic functions: Map and Reduce, inspired by similar primitives in LISP and other languages...
Why should we use it?
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring
MapReduce
Map function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate key/value pairs
Syntax: map (key, value) -> list(key, inter_value)
Reduce function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values
Syntax: reduce (out_key, list(inter_value)) -> list(out_value)
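The map/reduce contract above can be sketched as a word count in plain Java. This is an in-memory simulation of the two functions and the shuffle between them, not the actual Hadoop API (real jobs implement Mapper/Reducer classes); all names here are illustrative:

```java
import java.util.*;

// Minimal in-memory simulation of the map/reduce contract (word count).
public class WordCount {
    // map(key, value) -> list(key, inter_value)
    static List<Map.Entry<String, Integer>> map(String docId, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1)); // emit (word, 1)
        return out;
    }

    // reduce(out_key, list(inter_value)) -> list(out_value)
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] docs = {"the quick brown fox", "the lazy dog the end"};
        // Shuffle phase: group all intermediate pairs by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < docs.length; i++)
            for (Map.Entry<String, Integer> kv : map("doc" + i, docs[i]))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(g.getKey() + "\t" + reduce(g.getKey(), g.getValue()));
        // "the" appears 3 times; every other word once.
    }
}
```

The framework, not the programmer, handles the grouping step in the middle; the programmer supplies only the two functions.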
MapReduce
Applications:
(1) Distributed grep & distributed sort
(2) Web link-graph reversal
(3) Web access log statistics
(4) Document clustering
(5) Machine learning, and so on...
To know more:
- "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
- "Hadoop: The Definitive Guide", O'Reilly Media
PIG
- Pig was first developed at Yahoo! Research around 2006 and later moved to the Apache Software Foundation.
- Pig is a data-flow programming environment for processing large files, based on MapReduce/Hadoop.
- A high-level platform for creating MapReduce programs, used with Hadoop and HDFS.
- An Apache library that interprets scripts written in Pig Latin and runs them on a Hadoop cluster. At Yahoo!, 40% of all Hadoop jobs are run with Pig.
PIG
Workflow:
Step 1: Load the input data.
Step 2: Manipulate the data with functions such as filter, foreach, distinct, or any user-defined function.
Step 3: Group the data.
Final step: Write the data to the DFS, or repeat the steps if another dataset arrives.
Scripts written in Pig Latin --(Pig library/engine)--> Hadoop-ready jobs
Take-away point: Do more with data, not with functions.
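The four workflow steps above can be sketched as a minimal Pig Latin script (the file names, field names, and schema below are illustrative assumptions, not from the original deck):

```
-- Step 1: load input data (tab-separated user/url pairs; schema assumed)
logs    = LOAD 'access_log' AS (user:chararray, url:chararray);

-- Step 2: manipulate: drop duplicates, then filter out internal traffic
uniq    = DISTINCT logs;
clean   = FILTER uniq BY NOT (url MATCHES '.*internal.*');

-- Step 3: group the data by user
by_user = GROUP clean BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(clean) AS hits;

-- Final step: write the result back to the DFS
STORE counts INTO 'hits_per_user';
```

Each line names a relation; the Pig engine compiles the whole script into a sequence of Hadoop-ready MapReduce jobs.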
Cascading
- A query API and query planner for defining, sharing, and executing data-processing workflows.
- Supports creating and executing complex data-processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.).
- Originally authored by Chris Wensel (founder of Concurrent, Inc.)
What does it offer?
- Data Processing API (core)
- Process planner
- Process scheduler
How to use it?
1. Install Hadoop.
2. Submit a Hadoop job .jar that contains the Cascading .jars.
Cascading: 'Source-Pipe-Sink'
How does it work?
- Source: data is captured from input sources.
- Pipes: created independently of the data they will process; supports a reusable 'pipes' concept.
- Sinks: results are stored in output files, or 'sinks'.
The Data Processing API provides this source-pipe-sink mechanism. Once a pipe assembly is tied to data sources and sinks, it is called a 'flow' (topological scheduler). Flows can be grouped into a 'cascade' (CascadeConnector class), and the process scheduler ensures a given flow does not execute until all its dependencies are satisfied.
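The source-pipe-sink idea can be mimicked in plain Java. This is a conceptual sketch only; the names below are hypothetical and are not the real Cascading API (which uses classes such as Tap, Pipe, and Flow):

```java
import java.util.*;
import java.util.function.*;

// Conceptual 'source -> pipe -> sink' composition; hypothetical names.
public class PipeDemo {
    // A pipe is a reusable transformation, defined independently of its data.
    static Function<List<String>, List<String>> upper =
        rows -> rows.stream().map(String::toUpperCase).toList();
    static Function<List<String>, List<String>> nonEmpty =
        rows -> rows.stream().filter(r -> !r.isBlank()).toList();

    // A 'flow' ties a source and a sink to a pipe assembly.
    static List<String> runFlow(List<String> source,
                                Function<List<String>, List<String>> pipe) {
        return pipe.apply(source); // the returned list stands in for a sink file
    }

    public static void main(String[] args) {
        List<String> source = List.of("hello", "", "world");
        // Pipes compose into an assembly before any data flows through them.
        var assembly = nonEmpty.andThen(upper);
        System.out.println(runFlow(source, assembly)); // [HELLO, WORLD]
    }
}
```

The point of the design: the same `assembly` can be reused against any source and sink, which is what makes pipes shareable across workflows.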
Cascading
Pipe assembly --(MR job planner)--> graph of dependent MapReduce jobs.
Also provides external data interfaces for data.
Efficiently supports splits, joins, grouping, and sorting.
Usages: log-file analysis, bioinformatics, machine learning, predictive analytics, web content mining, etc.
Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011.
FlumeJava
- A Java library/API that makes it easy to develop, test, and run efficient data-parallel pipelines.
- Born in May 2009 at Google.
- The library is a collection of immutable parallel collection classes.
FlumeJava:
1. Abstracts how data is represented: as an in-memory data structure or as a file.
2. Abstracts away implementation details, such as whether an operation runs as a local loop or as a remote MR job.
3. Implements parallel jobs using deferred evaluation.
FlumeJava
How does it work?
Step 1: The program invokes a parallel operation.
Step 2: FlumeJava does not run it immediately; instead it:
  2.1. Records the operation and its arguments,
  2.2. Saves them into an internal execution-plan graph structure,
  2.3. Constructs the execution plan for the whole computation.
Step 3: Optimizes the execution plan.
Step 4: Executes it.
The result is faster than a typical MR pipeline with the same logical structure, and easier to write.
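The four steps above can be sketched as a toy deferred-evaluation pipeline in plain Java. This is not FlumeJava's actual implementation; it only illustrates the record-then-optimize-then-run pattern with a trivial integer pipeline:

```java
import java.util.*;
import java.util.function.*;

// Toy deferred evaluation: operations are recorded into a plan,
// fused into one pass ("optimization"), and only run on demand.
public class DeferredPlan {
    private final List<Function<Integer, Integer>> plan = new ArrayList<>();

    // Steps 1-2: invoking an operation only records it in the plan.
    DeferredPlan then(Function<Integer, Integer> op) {
        plan.add(op);
        return this;
    }

    // Step 3: "optimize" by fusing all recorded ops into a single function,
    // loosely analogous to FlumeJava fusing ParallelDo ops into one MR pass.
    Function<Integer, Integer> optimize() {
        return plan.stream().reduce(Function.identity(), Function::andThen);
    }

    // Step 4: execute the optimized plan.
    int run(int input) {
        return optimize().apply(input);
    }

    public static void main(String[] args) {
        DeferredPlan p = new DeferredPlan()
            .then(x -> x + 1)   // recorded, not executed yet
            .then(x -> x * 10); // recorded, not executed yet
        System.out.println(p.run(4)); // (4 + 1) * 10 = 50
    }
}
```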
FlumeJava
Data model:
- PCollection<T>: the central class; an immutable bag of elements of type T. Can be unordered (a collection, which is more efficient) or ordered (a sequence).
- PTable<K, V>: the second central class; an immutable multi-map with keys of class K and values of class V.
Operators:
- parallelDo(PCollection<T>): the core parallel primitive
- groupByKey(PTable<K, V>)
- combineValues(PTable<K, Collection<V>>)
- flatten(): logical view of multiple PCollections as one PCollection
- join()
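A rough in-memory analogue of these primitives (illustrative only; real PCollections are deferred and distributed, while these plain lists are immediate and local):

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Local, eager stand-ins for FlumeJava's parallel primitives.
public class MiniFlume {
    // parallelDo: apply a DoFn-like function to every element
    static <T, R> List<R> parallelDo(List<T> pc, Function<T, R> fn) {
        return pc.stream().map(fn).collect(Collectors.toList());
    }

    // groupByKey: PTable<K, V> -> PTable<K, Collection<V>>
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        return table.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey, TreeMap::new,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // flatten: view several collections as one
    @SafeVarargs
    static <T> List<T> flatten(List<T>... pcs) {
        return Arrays.stream(pcs).flatMap(List::stream).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> table = List.of(
            Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(groupByKey(table));                  // {a=[1, 3], b=[2]}
        System.out.println(parallelDo(List.of(1, 2, 3), x -> x * x)); // [1, 4, 9]
        System.out.println(flatten(List.of(1, 2), List.of(3))); // [1, 2, 3]
    }
}
```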
Dremel
- A distributed system for interactive analysis of large datasets, in use at Google since 2006.
- Provides a custom, scalable data-management solution built over shared clusters of commodity machines.
Three key aspects:
1. Storage format: a column-striped storage representation for non-relational nested data (a lossless representation).
Why nested? It backs a platform-neutral, extensible mechanism for serializing structured data at Google.
Main aim: store all values of a given field consecutively, to improve retrieval efficiency.
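The column-striping aim (all values of one field stored consecutively) can be sketched in plain Java. This is illustrative only; Dremel's real format additionally stores repetition and definition levels to reconstruct nested records:

```java
import java.util.*;

// Row-oriented records vs a column-striped layout of the same data.
// A scan over one field touches only that field's contiguous column.
public class ColumnStripe {
    record Doc(int docId, String url) {}

    // Strip one field out of the records into its own contiguous column.
    static int[] docIdColumn(List<Doc> rows) {
        return rows.stream().mapToInt(Doc::docId).toArray();
    }

    public static void main(String[] args) {
        List<Doc> rows = List.of(
            new Doc(10, "http://a"), new Doc(20, "http://b"), new Doc(30, "http://c"));

        int[] docIds = docIdColumn(rows);                            // column 1
        String[] urls = rows.stream().map(Doc::url).toArray(String[]::new); // column 2

        // A query touching only docId reads the docIds column and skips urls.
        System.out.println(Arrays.stream(docIds).sum()); // 60
    }
}
```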
Dremel
2. Query language: provides a high-level, SQL-like language to express ad hoc queries. It is efficiently implementable on columnar nested storage. Fields are referenced using path expressions. Supports nested subqueries, inter- and intra-record aggregation, joins, etc.
3. Execution: a multi-level serving-tree concept (as in a distributed search engine). Several queries can execute simultaneously; a query dispatcher schedules queries based on priorities and balances the load.
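A query sketch in the spirit of the Dremel paper's examples illustrates the path expressions and intra-record aggregation mentioned above (table and field names are illustrative, and this is not guaranteed to be exact Dremel syntax):

```
SELECT DocId,
       COUNT(Name.Language.Code) WITHIN Name AS Cnt  -- intra-record aggregation
FROM t                                               -- t: a table of nested records
WHERE REGEXP(Name.Url, '^http')                      -- field named by a path expression
```

Because the storage is column-striped, this query only needs to read the DocId, Name.Language.Code, and Name.Url columns.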
I am lost... Are MR and Dremel the same?

Feature                   | MapReduce (aka MR)                            | Dremel
Birth year & place        | Since 2004 @ Google lab                       | Since 2006 @ Google lab
Type                      | Distributed & parallel programming framework  | Distributed interactive ad hoc query system
Scalable & fault tolerant | Yes                                           | Yes
Data processing           | Record-oriented                               | Column-oriented
Batch processing          | Yes                                           | No
In situ processing        | No                                            | Yes

Take-away point: Dremel complements MapReduce-based computing.