Simplifying Application Development on Hadoop
WidasConcepts Unternehmensberatung GmbH  Maybachstraße 2  71299 Wimsheim  http://www.widas.de
Big Data Engineer, WidasConcepts
Vinoth Kannan
Cascading User Group Meet
Berlin, Germany
26.05.2014
2. What is Hadoop?
“Apache Hadoop is an open-source software framework for storage
and large-scale processing of data-sets on clusters of commodity
hardware.” (Wikipedia)
Principles of Hadoop:
• Designed for: Batch Processing
• Possible to: Horizontal Scaling
• Works on: Bringing Computation to Data
3. Main Features
Reliable and Redundant
• No performance or data loss even on failure
Powerful
• Possible to have huge clusters (largest 40,000 nodes)
• Supports “Best of Breed Analytics”
Scalable
• Linearly scalable with increase in data volume
Cost Efficient
• No need for expensive hardware; supports commodity hardware
Simple and flexible APIs
• Great ecosystem with a multitude of supporting solutions
4. Traditional vs. Hadoop
Traditional:
More and larger servers are necessary to accomplish tasks:
• computing capacity
• data capacity
Hadoop:
Instead of upgrading the server, the cluster size is increased with more machines
5. What is MapReduce?
MapReduce is a programming model for running applications, mostly on Hadoop
Mapper
• Converts input (K,V) to new (K,V)
Shuffle
• Sorts and groups similar keys with all their values
Reducer
• Translates the values of each unique key to new (K,V)
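The three phases can be sketched in plain Java with no Hadoop dependency. This is an illustration only: the class `MapReduceSketch`, its method names and the word-count task are assumptions made for this example and are not Hadoop API. A real job would run the same phases distributed across the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the Map, Shuffle and Reduce phases (not Hadoop code).
final class MapReduceSketch {

    // Map: convert one input line into new (K,V) pairs of the form (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle: sort by key and collect all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: translate the values of one unique key into a single new value.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases one after another on in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data on Hadoop", "data flows on Hadoop")));
    }
}
```

Keeping each phase as its own method mirrors the slide: only `map` and `reduce` carry task-specific logic, while `shuffle` stays generic.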
6. MapReduce Paradigm
[Diagram: (K,V) pairs such as (K1, V1) and (K3, V3) passing through the Map, Shuffle and Reduce stages]
7. MapReduce with Multiple Data Sources
Input: HDFS, Cassandra, SQL, HBase
Processing: MapReduce job
Output: HDFS, Neo4j, SQL, MongoDB
8. Jumping on the Hadoop Bandwagon
9. Challenges with MapReduce
Complex jobs that require multiple mappers and reducers
Chaining multiple MR jobs and scheduling them together
Wrong level of granularity of MR
Transforming business rules into the MapReduce paradigm
Testing and maintaining the code
10. Growing Opportunities in Hadoop
With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand
Huge investment already made by enterprises in existing business processes and training
How to Train Your Elephant?!
Cascading
13. What is Cascading?
Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.
Developed by Chris Wensel in 2007
Underlying motivation for developing the Cascading Java framework:
• Difficulty for Java developers to write MapReduce code
• MapReduce is based on functional programming elements
14. Enterprise Data Flow - Challenge
Business Goals, Data Sources
Using existing skillset, business processes and tools
15. Cascading Building Blocks: High-level Overview
Cascading
MapReduce
HDFS (Distributed Storage)
16. Cascading in Short
Functional programming way to Hadoop
Alternative and easy API for MapReduce
Reusable Java components
Possibility for test-driven development
Can be used with any JVM-based language (Java, JRuby, Clojure, etc.)
17. Cascading Building Blocks
Pipes
Sinks
Taps
Flow
18. Sample Look of Cascading Flow
Source Tap
Sink Tap
Pipe Assembly
Flow
19. Cascading Pipe Assemblies
Original Tuple Streams
Transformed Tuple Streams
Pipe
Each
GroupBy
CoGroup
Every
SubAssembly
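How a pipe assembly turns an original tuple stream into a transformed one can be approximated with Java streams. The sketch below is an analogy only and uses no Cascading classes: `PipeAssemblySketch`, the `Login` tuple and its two fields are hypothetical names chosen for this example. The comments mark which step plays the role of Each, GroupBy and Every.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy for a pipe assembly; none of these names are Cascading classes.
final class PipeAssemblySketch {

    // Hypothetical tuple with two fields: user and status.
    static final class Login {
        final String user;
        final String status;

        Login(String user, String status) {
            this.user = user;
            this.status = status;
        }
    }

    // Transforms a stream of login tuples into (user, failedCount) results.
    static Map<String, Long> failedLoginsPerUser(List<Login> tuples) {
        return tuples.stream()
                // Role of "Each" with a filter: drop tuples that are not failures.
                .filter(tuple -> "FAILED".equals(tuple.status))
                // Role of "Each" with a function: normalise the user field.
                .map(tuple -> tuple.user.trim().toLowerCase())
                // Role of "GroupBy" on the user field, then "Every" with a counting aggregator.
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Login> logins = List.of(
                new Login("Ann", "FAILED"),
                new Login("bob", "OK"),
                new Login("ann ", "FAILED"));
        System.out.println(failedLoginsPerUser(logins));
    }
}
```

Wrapping the whole chain in one reusable method is loosely comparable to what a SubAssembly does for pipes.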
20-25. The quintessential WordCount Example
[Code screenshots; surviving labels: "Initialize properties and tell Hadoop which jar file to use" and "Word-count"]
26. Typical Pipe Assembly
CSV
NoSQL
Sequence File
Flow Definition
Flow A
27. Cascading Multiple Flows
Flow A
Flow E
Flow B
Flow C
Flow D
Flow F
Flow G
Flow H
28. Cascading Pipe Assemblies
lhs pipe definition
rhs pipe definition
Join lhs & rhs pipes
Join pipe assembly
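The idea of joining a left-hand and a right-hand tuple stream on a common key can be sketched in plain Java. This is a hedged illustration of an inner join, not the Cascading API: `JoinSketch`, `Row` and the sample fields are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogy for joining two tuple streams on a common key (inner join).
final class JoinSketch {

    // Hypothetical two-field tuple: a join key and one value.
    static final class Row {
        final String key;
        final String value;

        Row(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Emits "key,lhsValue,rhsValue" for every lhs/rhs pair that shares a key.
    static List<String> innerJoin(List<Row> lhs, List<Row> rhs) {
        // Group the lhs tuples by their key.
        Map<String, List<String>> lhsByKey = new HashMap<>();
        for (Row row : lhs) {
            lhsByKey.computeIfAbsent(row.key, k -> new ArrayList<>()).add(row.value);
        }
        // For every rhs tuple, emit one joined tuple per matching lhs tuple.
        List<String> joined = new ArrayList<>();
        for (Row row : rhs) {
            for (String lhsValue : lhsByKey.getOrDefault(row.key, List.of())) {
                joined.add(row.key + "," + lhsValue + "," + row.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Row> users = List.of(new Row("1", "ann"), new Row("2", "bob"));
        List<Row> clicks = List.of(new Row("1", "/home"), new Row("3", "/help"), new Row("1", "/cart"));
        System.out.println(innerJoin(users, clicks));
    }
}
```

Keys that appear on only one side are dropped here; a left, right or outer join would keep them and fill the missing side with an empty value.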
29. Cascading Real-world Data Flow Use Cases
Analytics on login information
Analytics from ClickStream Data
30. Support for Multiple Data Sources
HDFS
Cassandra
MongoDB
ElasticSearch
HBase
Memcached
Neo4j
Solr
ElephantDB
RDBMS
Splunk
http://www.cascading.org/extensions/
31. Support for Major Serializers
http://www.cascading.org/extensions/
JSON, Avro, Kryo, Thrift
Predictive Models on Hadoop
33. Cascading Pattern
Cascading Pattern is a machine learning project within the Cascading development framework used to build enterprise data workflows
Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group
PMML is supported by most of the popular analytical tools such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
http://www.dmg.org/
34. Cascading Pattern on CarbookPlus
Track trips
Maintain logbook
Get notified about best gas stations
Manage and compare vehicle cost
Fleet management
Social platform connecting drivers
www.carbookplus.com
35. CarbookPlus Fuel Cost Prediction
“MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time.
http://www.mdm-portal.de/
Our Objective:
• Store the data from MDM in HDFS
• Process and clean the data with Cascading
• Build a model with R, predicting the fuel price trend for the next 7 days & 24 hours
• Export the model as PMML
• Scale out on the Hadoop cluster with Cascading Pattern
• Store the results in MongoDB
36. Exporting PMML Model from R
Export model as PMML file
37. Cascading Pattern Flow Definition
38. Fuel Cost Predictor Result
39. Algorithms Supported by Cascading Pattern
Random Forest
Linear Regression
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial Model
https://github.com/cascading/pattern
40. Future of Cascading
Cascading Pattern to support more predictive models:
• Neural Network
• Support Vector Machine
More new features in Cascading 3.0
Cascading 3.0
Execution Engine: Spark, Tez, Storm
YARN (Cluster Resource Management)
HDFS (Distributed Storage)
When do you start?
42. Questions?
Q & A
Thank you!
Vinoth Kannan
Big Data Engineer
WidasConcepts GmbH
www.widas.de
@WidasConcepts @vinoth4v
/WidasConcepts
vinoth.kannan@widas.de
Credits
www.soundcloud.com
www.concurrentinc.com
www.cascading.org

Editor's Notes

  • #20 Pipe types:
    Each: defines the Filter or Function each tuple has to pass through.
    GroupBy: groups the tuple stream on selected field names. Also allows merging.
    CoGroup: joins on a common set of values. Joins can be Inner, Outer, Left or Right.
    Every: applies an aggregator to every group of tuples.
    SubAssembly: nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly.
  • #21 A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and alternately parsing or rendering the incoming or outgoing Tuple stream, respectively. A Scheme defines the type of resource data will be sourced from or sinked to.
  • #29 Meeting Notes (21/05/14 11:43):
    Each: Filters, Functions
    GroupBy: Merge
    CoGroup: Joins (Left, Right, Inner, Outer)
    Every: Aggregator (Sum & Count), Buffer
    SubAssembly: Nesting reusable pipes