Cascading User Group Meet


Leading Data with Innovation. My talk on using Cascading for data-driven applications, presented at the meetup in Berlin.

Published in: Engineering, Technology, Education

  • Pipe -
    Each - defines the Filter or Function that each tuple has to pass through
    GroupBy - groups the tuple stream on selected field names. Also allows merging of streams
    CoGroup - joins tuple streams on a common set of values. Joins can be inner, outer, left or right
    Every - applies an Aggregator to every group of tuples
    SubAssembly - nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly
  • A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and alternately parsing or rendering the incoming or outgoing Tuple stream, respectively.
    A Scheme defines the type of resource data will be sourced from or sinked to.

  • ----- Meeting Notes (21/05/14 11:43) -----
    Joins (Left, Right, Inner, Outer)
    Aggregator (Sum & Count)
    Nesting reusable pipe

  • Cascading User Group Meet

    1. Simplifying Application Development on Hadoop. Vinoth Kannan, Big Data Engineer, WidasConcepts. Cascading User Group Meet, Berlin, Germany, 26.05.2014. WidasConcepts Unternehmensberatung GmbH, Maybachstraße 2, 71299 Wimsheim
    2. What is Hadoop? “Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia) Principles of Hadoop: designed for batch processing; possible to scale horizontally; works on bringing computation to data
    3. Main Features. Reliable and redundant: no performance or data loss even on failure. Powerful: possible to have huge clusters (largest 40,000 nodes); supports “Best of Breed Analytics”. Scalable: linearly scalable with increase in data volume. Cost efficient: no need for expensive hardware; supports commodity hardware. Simple and flexible APIs: great ecosystem with a multitude of supporting solutions
    4. Traditional vs. Hadoop. Traditional: more and larger servers are necessary to accomplish tasks (computing capacity, data capacity). Hadoop: instead of upgrading the server, the cluster size is increased with more machines
    5. What is MapReduce? MapReduce is a programming model for running applications, mostly on Hadoop. Mapper: converts input (K,V) to new (K,V). Shuffle: sorts and groups similar keys with all their values. Reducer: translates the values of each unique key to new (K,V)
    6. MapReduce Paradigm. [Diagram: (K, V) pairs flowing through the Map, Shuffle and Reduce stages]
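
The Mapper, Shuffle and Reducer roles from slides 5 and 6 can be sketched in plain Java. This is only an in-memory illustration under assumed names (`MapReduceSketch`, `wordCount`); it does not use the Hadoop API and runs on a single machine.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the three phases; this is not Hadoop API code.
class MapReduceSketch {

    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: sort by key and collect all values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the values of each unique key into one (word, count) pair.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {big=1, clusters=1, data=2, hadoop=1, on=2}
        System.out.println(wordCount(Arrays.asList("big data on hadoop", "data on clusters")));
    }
}
```

On a real cluster the map and reduce calls run in parallel on different nodes and the framework performs the shuffle; the sketch only mirrors the data flow.
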
    7. MapReduce with Multiple Data Sources. Input: HDFS, Cassandra, SQL, HBase. Processing: MapReduce job. Output: HDFS, Neo4j, SQL, MongoDB
    8. Jumping on the Hadoop Bandwagon
    9. Challenges with MapReduce: complex jobs that require multiple mappers and reducers; chaining multiple MR jobs and scheduling them together; wrong level of granularity of MR; transforming business rules into the MapReduce paradigm; testing and maintaining the code
    10. Growing Opportunities in Hadoop. With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand. Enterprises have already made huge investments in existing business processes and training
    11. How to Train Your Elephant?!
    12. Cascading
    13. What is Cascading? Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop. Developed by Chris Wensel in 2007. Underlying motivation for developing the Cascading Java framework: it is difficult for Java developers to write MapReduce code; MapReduce is based on functional programming elements
    14. Enterprise Data Flow Challenge: Business Goals; Data Sources; using existing skillset, business processes and tools
    15. Cascading Building Blocks, High-level Overview. [Stack diagram: Cascading; MapReduce; HDFS (distributed storage)]
    16. Cascading in Short: functional programming way to Hadoop; alternative and easy API for MapReduce; reusable Java components; possibility of test-driven development; can be used with any JVM-based language (Java, JRuby, Clojure, etc.)
    17. Cascading Building Blocks: Pipes, Sinks, Taps, Flow
    18. Sample Look of a Cascading Flow: Source Tap, Pipe Assembly, Sink Tap, Flow
    19. Cascading Pipe Assemblies: original tuple streams, transformed tuple streams. Pipe types: Pipe, Each, GroupBy, CoGroup, Every, SubAssembly
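
How an Each, GroupBy and Every chain transforms a tuple stream can be approximated with ordinary Java collections. The sketch below is an analogy only: the class `PipeAssemblySketch`, the `Login` tuple and its fields are hypothetical, and none of it is Cascading API code. The Sum and Count aggregators follow the meeting notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogy of a pipe assembly; none of this is Cascading API code.
class PipeAssemblySketch {

    // A tuple with two fields, "user" and "seconds" (hypothetical login data).
    static final class Login {
        final String user;
        final int seconds;

        Login(String user, int seconds) {
            this.user = user;
            this.seconds = seconds;
        }
    }

    // "Each" with a filter: every tuple passes through it; invalid ones are dropped.
    static List<Login> each(List<Login> stream) {
        List<Login> kept = new ArrayList<>();
        for (Login login : stream) {
            if (login.seconds >= 0) {
                kept.add(login);
            }
        }
        return kept;
    }

    // "GroupBy": group the tuple stream on the "user" field.
    static Map<String, List<Login>> groupBy(List<Login> stream) {
        Map<String, List<Login>> groups = new TreeMap<>();
        for (Login login : stream) {
            groups.computeIfAbsent(login.user, k -> new ArrayList<>()).add(login);
        }
        return groups;
    }

    // "Every" with Sum and Count aggregators: one {sum, count} result per group.
    static Map<String, long[]> every(Map<String, List<Login>> groups) {
        Map<String, long[]> result = new TreeMap<>();
        for (Map.Entry<String, List<Login>> group : groups.entrySet()) {
            long sum = 0;
            for (Login login : group.getValue()) {
                sum += login.seconds;
            }
            result.put(group.getKey(), new long[] { sum, group.getValue().size() });
        }
        return result;
    }

    // The "assembly": the output of each step feeds the next one.
    static Map<String, long[]> run(List<Login> stream) {
        return every(groupBy(each(stream)));
    }

    public static void main(String[] args) {
        List<Login> stream = Arrays.asList(
                new Login("anna", 30), new Login("ben", -1), new Login("anna", 50));
        for (Map.Entry<String, long[]> e : run(stream).entrySet()) {
            System.out.println(e.getKey() + " sum=" + e.getValue()[0] + " count=" + e.getValue()[1]);
        }
    }
}
```

In Cascading itself the steps are declared as a pipe assembly and the planner translates them into MapReduce jobs; here they are simply called in sequence.
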
    20. The quintessential WordCount Example
    21. The quintessential WordCount Example
    22. The quintessential WordCount Example
    23. The quintessential WordCount Example: initialize properties and tell Hadoop which jar file to use
    24. The quintessential WordCount Example: word count
    25. The quintessential WordCount Example: word count
    26. Typical Pipe Assembly: CSV, NoSQL, Sequence File; Flow Definition; Flow A
    27. Cascading Multiple Flows: Flow A, Flow E, Flow B, Flow C, Flow D, Flow F, Flow G, Flow H
    28. Cascading Pipe Assemblies: lhs pipe definition; rhs pipe definition; join lhs & rhs pipes; join pipe assembly
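
The join types named for CoGroup (inner, left, right, outer) differ only in which unmatched keys survive. A minimal plain-Java sketch, assuming each side holds at most one value per key and using hypothetical names (`CoGroupSketch`, user and city data); it is not the Cascading CoGroup API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of join semantics; this is not the Cascading CoGroup API.
class CoGroupSketch {

    // Inner join: keep only keys present on both sides.
    static List<String> innerJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> l : lhs.entrySet()) {
            if (rhs.containsKey(l.getKey())) {
                out.add(l.getKey() + "," + l.getValue() + "," + rhs.get(l.getKey()));
            }
        }
        return out;
    }

    // Left join: keep every lhs key; a missing rhs value shows up as null.
    static List<String> leftJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> l : lhs.entrySet()) {
            out.add(l.getKey() + "," + l.getValue() + "," + rhs.get(l.getKey()));
        }
        return out;
    }

    // Outer join: the left join plus the rhs keys that have no lhs match.
    static List<String> outerJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> out = leftJoin(lhs, rhs);
        for (Map.Entry<String, String> r : rhs.entrySet()) {
            if (!lhs.containsKey(r.getKey())) {
                out.add(r.getKey() + ",null," + r.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new LinkedHashMap<>();   // lhs: user id -> name
        users.put("u1", "Anna");
        users.put("u2", "Ben");
        Map<String, String> cities = new LinkedHashMap<>();  // rhs: user id -> city
        cities.put("u2", "Berlin");
        cities.put("u3", "Munich");

        System.out.println(innerJoin(users, cities)); // [u2,Ben,Berlin]
        System.out.println(leftJoin(users, cities));  // [u1,Anna,null, u2,Ben,Berlin]
        System.out.println(outerJoin(users, cities)); // [u1,Anna,null, u2,Ben,Berlin, u3,null,Munich]
    }
}
```

A right join is the left join with the two sides swapped.
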
    29. Cascading Real-world Data Flow Use Cases: analytics on login information; analytics from clickstream data
    30. Support for Multiple Data Sources: HDFS, Cassandra, MongoDB, Elasticsearch, HBase, Memcached, Neo4j, Solr, ElephantDB, RDBMS, Splunk
    31. Support for Major Serializers: JSON, Avro, Kryo, Thrift
    32. Predictive Models on Hadoop
    33. Cascading Pattern. Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows. Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group. PMML is supported by most of the popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
    34. Cascading Pattern on CarbookPlus: track trips; maintain logbook; get notified about the best gas stations; manage and compare vehicle costs; fleet management; social platform connecting drivers
    35. CarbookPlus Fuel Cost Prediction. “MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time. Our objective: store the data from MDM in HDFS; process and clean the data with Cascading; build a model with R, predicting the fuel price trend for the next 7 days & 24 hours; export the model as PMML; scale out on the Hadoop cluster with Cascading Pattern; store the results in MongoDB
    36. Exporting PMML Model from R: export model as PMML file
    37. Cascading Pattern Flow Definition
    38. Fuel Cost Predictor Result
    39. Algorithms Supported by Cascading Pattern: Random Forest, Linear Regression, Logistic Regression, K-Means Clustering, Hierarchical Clustering, Multinomial Model
    40. Future of Cascading. Cascading Pattern to support more predictive models: Neural Network, Support Vector Machine. More new features in Cascading 3.0. [Stack diagram: Cascading 3.0; execution engines Spark, Tez and Storm; YARN (cluster resource management); HDFS (distributed storage)]
    41. When do you Start?
    42. Questions? Q & A. Thank you!! Vinoth Kannan, Big Data Engineer, WidasConcepts GmbH. @WidasConcepts, @vinoth4v, /WidasConcepts