Cascading User Group Meet

Leading Data with Innovation. My talk on using Cascading for data-driven applications, presented at the meetup in Berlin.

  • Pipe
    Each – applies a Filter or Function that every tuple has to pass through
    GroupBy – groups the tuple stream on the selected field names; also allows merging streams
    CoGroup – joins streams on a common set of values. Joins can be inner, outer, left or right
    Every – applies an Aggregator to every group of tuples
    SubAssembly – nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly
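
To make the roles of Each, GroupBy and Every concrete, here is a minimal plain-Java sketch of the same sequence of steps. It is only an analogy built on `java.util.stream`, not the Cascading API; the `Login` tuple, its `user` and `status` fields, and the method name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of a pipe assembly. Nothing here is the Cascading API.
class PipeAssemblySketch {

    // Hypothetical tuple with two fields: "user" and "status".
    static final class Login {
        final String user;
        final String status;

        Login(String user, String status) {
            this.user = user;
            this.status = status;
        }
    }

    // Counts failed logins per user.
    static Map<String, Long> failedLoginsPerUser(List<Login> tuples) {
        return tuples.stream()
                // Like Each with a filter: keep only failed logins.
                .filter(t -> "FAILED".equals(t.status))
                // Like Each with a function: normalize the user field.
                .map(t -> t.user.toLowerCase(Locale.ROOT))
                // Like GroupBy on "user" followed by Every with a count aggregator.
                .collect(Collectors.groupingBy(u -> u, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Login> input = Arrays.asList(
                new Login("Alice", "FAILED"),
                new Login("bob", "OK"),
                new Login("alice", "FAILED"),
                new Login("Bob", "FAILED"));
        System.out.println(failedLoginsPerUser(input)); // prints {alice=2, bob=1}
    }
}
```

In Cascading itself each of these steps would be a separate pipe chained onto the previous one; the stream chain above only mirrors that ordering.
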
  • A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and alternately parsing or rendering the incoming or outgoing Tuple stream, respectively.
    A Scheme defines the type of resource data will be sourced from or sinked to.
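
The parse/render duty described above can be sketched in plain Java as follows. This is a hypothetical stand-in, not Cascading's `Scheme` class: the class name `DelimitedSchemeSketch`, the delimiter, and the field names are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical stand-in for a scheme; not the Cascading Scheme class.
class DelimitedSchemeSketch {
    private final String delimiter;
    private final String[] fieldNames;

    // Declares the tuple field names, in column order.
    DelimitedSchemeSketch(String delimiter, String... fieldNames) {
        this.delimiter = delimiter;
        this.fieldNames = fieldNames.clone();
    }

    // Source side: parse one raw line into a tuple keyed by field name.
    Map<String, String> parse(String line) {
        String[] values = line.split(Pattern.quote(delimiter), -1);
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException(
                    "expected " + fieldNames.length + " fields: " + line);
        }
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            tuple.put(fieldNames[i], values[i]);
        }
        return tuple;
    }

    // Sink side: render a tuple back into one raw line.
    String render(Map<String, String> tuple) {
        StringJoiner line = new StringJoiner(delimiter);
        for (String name : fieldNames) {
            line.add(tuple.getOrDefault(name, ""));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        DelimitedSchemeSketch scheme = new DelimitedSchemeSketch(",", "user", "status");
        Map<String, String> tuple = scheme.parse("alice,FAILED");
        System.out.println(tuple);                // prints {user=alice, status=FAILED}
        System.out.println(scheme.render(tuple)); // prints alice,FAILED
    }
}
```
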

  • ----- Meeting Notes (21/05/14 11:43) -----
    Pipe
      Each: Filters, Functions
      GroupBy: Merge
      CoGroup: Joins (Left, Right, Inner, Outer)
      Every: Aggregator (Sum & Count), Buffer
      SubAssembly: Nesting reusable pipes

Transcript

  • 1. Simplifying Application Development on Hadoop
    Cascading User Group Meet, Berlin, Germany, 26.05.2014
    Vinoth Kannan, Big Data Engineer, WidasConcepts
    WidasConcepts Unternehmensberatung GmbH, Maybachstraße 2, 71299 Wimsheim, http://www.widas.de
  • 2. What is Hadoop?
    "Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware." (Wikipedia)
    Principles of Hadoop:
    • Designed for: Batch Processing
    • Possible to: Horizontal Scaling
    • Works on: Bringing Computation to Data
  • 3. Main Features
    Reliable and Redundant: No performance or data loss even on failure
    Powerful: Possible to have huge clusters (largest 40,000 nodes); supports "Best of Breed Analytics"
    Scalable: Linearly scalable with increase in data volume
    Cost Efficient: No need for expensive hardware; supports commodity hardware
    Simple and flexible APIs: Great ecosystem with a multitude of supporting solutions
  • 4. Traditional vs. Hadoop
    Traditional: More and larger servers are necessary to accomplish tasks (computing capacity, data capacity)
    Hadoop: Instead of upgrading the server, the cluster size is increased with more machines
  • 5. What is MapReduce?
    MapReduce is a programming model for running applications, mostly on Hadoop
    Mapper: Converts input (K,V) to new (K,V)
    Shuffle: Sorts and groups similar keys with all their values
    Reducer: Translates the values of each unique key to new (K,V)
  • 6. MapReduce Paradigm
    Map, Shuffle, Reduce
    [Diagram: key/value pairs (K1, V1) (K1, V1) (K1, V1) (K5, V5) (K2, V2) (K3, V3) (K3, V3) (K3, V3) (K6, V6) (K7, V7) passing through the three stages]
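
The three stages can be simulated in a single JVM with plain Java. The sketch below is an assumption-laden illustration, not Hadoop code: it uses no Hadoop classes, runs in memory, and picks word counting as the example job; the class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory simulation of map, shuffle and reduce. No Hadoop classes are used.
class MapReduceSketch {

    // Map: turn one input line into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: sort by key and collect all values that share a key.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the values of one key into a single output value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here all three are ordinary method calls.
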
  • 7. MapReduce with Multiple Data Sources
    Input: HDFS, Cassandra, SQL, HBase
    Processing: MapReduce job
    Output: HDFS, Neo4j, SQL, MongoDB
  • 8. Jumping on the Hadoop Bandwagon
  • 9. Challenges with MapReduce
    Complex jobs that require multiple mappers and reducers
    Chaining multiple MR jobs and scheduling them together
    Wrong level of granularity of MR
    Transforming business rules into the MapReduce paradigm
    Testing and maintaining the code
  • 10. Growing Opportunities in Hadoop
    With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand
    Huge investment already made by enterprises in existing business processes and training
  • 11. How to Train Your Elephant?!
  • 12. Cascading
  • 13. What is Cascading?
    Cascading is an open source Java framework that provides an application development platform for building data applications on Hadoop.
    Developed by Chris Wensel in 2007
    Underlying motivation for developing the Cascading Java framework:
    • Difficulty for Java developers to write MapReduce code
    • MapReduce is based on functional programming elements
  • 14. Enterprise Data Flow: Challenge
    Business Goals, Data Sources
    Using existing skillset, business processes and tools
  • 15. Cascading Building Blocks: High-Level Overview
    Cascading, MapReduce, HDFS (Distributed Storage)
  • 16. Cascading in Short
    Functional programming way to Hadoop
    Alternative and easy API for MapReduce
    Reusable Java components
    Possibility for test-driven development
    Can be used with any JVM-based language: Java, JRuby, Clojure, etc.
  • 17. Cascading Building Blocks
    Pipes, Sinks, Taps, Flow
  • 18. Sample Look of Cascading Flow
    Source Tap, Sink Tap, Pipe Assembly, Flow
  • 19. Cascading Pipe Assemblies
    Original Tuple Streams, Transformed Tuple Streams
    Pipe: Each, GroupBy, CoGroup, Every, SubAssembly
  • 20. The quintessential WordCount Example
  • 21. The quintessential WordCount Example
  • 22. The quintessential WordCount Example
  • 23. The quintessential WordCount Example
    Initialize properties and tell Hadoop which jar file to use
  • 24. The quintessential WordCount Example
    Word-count
  • 25. The quintessential WordCount Example
    Word-count
  • 26. Typical Pipe Assembly
    CSV, NoSQL, Sequence File, Flow Definition, Flow A
  • 27. Cascading Multiple Flows
    Flow A, Flow B, Flow C, Flow D, Flow E, Flow F, Flow G, Flow H
  • 28. Cascading Pipe Assemblies
    lhs pipe definition
    rhs pipe definition
    Join lhs & rhs pipes
    Join pipe assembly
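
The code on this slide did not survive in the transcript. As a rough illustration of what joining an lhs and an rhs stream on a common key means, here is a plain-Java hash join sketch. It is not Cascading's CoGroup; the `{key, value}` row layout, the sample data and the `keepUnmatchedLhs` flag are assumptions chosen to show the inner and left variants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java hash join of two tuple streams. Not the Cascading CoGroup pipe.
class JoinSketch {

    // Joins two streams of {key, value} rows on the key.
    // keepUnmatchedLhs = false gives an inner join, true gives a left join.
    static List<List<String>> join(List<String[]> lhs, List<String[]> rhs,
                                   boolean keepUnmatchedLhs) {
        // Index the rhs stream by key so each lhs row can look up its matches.
        Map<String, List<String>> rhsByKey = new HashMap<>();
        for (String[] row : rhs) {
            rhsByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<List<String>> joined = new ArrayList<>();
        for (String[] row : lhs) {
            List<String> matches = rhsByKey.getOrDefault(row[0], Collections.emptyList());
            for (String rhsValue : matches) {
                joined.add(Arrays.asList(row[0], row[1], rhsValue));
            }
            // Left join only: keep the lhs row and pad the rhs side with null.
            if (matches.isEmpty() && keepUnmatchedLhs) {
                joined.add(Arrays.asList(row[0], row[1], null));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"u1", "alice"},
                new String[] {"u2", "bob"});
        List<String[]> logins = Arrays.asList(
                new String[] {"u1", "2014-05-26"},
                new String[] {"u1", "2014-05-27"});
        System.out.println(join(users, logins, false));
        // prints [[u1, alice, 2014-05-26], [u1, alice, 2014-05-27]]
        System.out.println(join(users, logins, true));
        // prints [[u1, alice, 2014-05-26], [u1, alice, 2014-05-27], [u2, bob, null]]
    }
}
```
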
  • 29. Cascading Real-World Data Flow Use Cases
    Analytics on login information
    Analytics from ClickStream data
  • 30. Support for Multiple Data Sources
    HDFS, Cassandra, MongoDB, ElasticSearch, HBase, Memcached, Neo4j, Solr, ElephantDB, RDBMS, Splunk
    http://www.cascading.org/extensions/
  • 31. Support for Major Serializers
    JSON, Avro, Kryo, Thrift
    http://www.cascading.org/extensions/
  • 32. Predictive Models on Hadoop
  • 33. Cascading Pattern
    Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows
    Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group (http://www.dmg.org/)
    PMML is supported by most popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
  • 34. Cascading Pattern on CarbookPlus (www.carbookplus.com)
    Track trips
    Maintain logbook
    Get notified about the best gas stations
    Manage and compare vehicle costs
    Fleet management
    Social platform connecting drivers
  • 35. CarbookPlus Fuel Cost Prediction
    "MDM: Mobilitäts Daten Marktplatz" is a German federal government organization that provides open data about fuel prices across Germany in real time. http://www.mdm-portal.de/
    Our objective:
    • Store the data from MDM in HDFS
    • Process and clean the data with Cascading
    • Build a model with R, predicting the fuel price trend for the next 7 days and 24 hours
    • Export the model as PMML
    • Scale out on the Hadoop cluster with Cascading Pattern
    • Store the results in MongoDB
  • 36. Exporting PMML Model from R
    Export model as PMML file
  • 37. Cascading Pattern Flow Definition
  • 38. Fuel Cost Predictor Result
  • 39. Algorithms Supported by Cascading Pattern
    Random Forest, Linear Regression, Logistic Regression, K-Means Clustering, Hierarchical Clustering, Multinomial Model
    https://github.com/cascading/pattern
  • 40. Future of Cascading
    Cascading Pattern to support more predictive models: Neural Network, Support Vector Machine
    More new features in Cascading 3.0
    [Stack diagram: Cascading 3.0; Execution Engine (Spark, Tez, Storm); YARN (Cluster Resource Management); HDFS (Distributed Storage)]
  • 41. When Do You Start?
  • 42. Questions? Q & A. Thank you!!
    Vinoth Kannan, Big Data Engineer, WidasConcepts GmbH
    www.widas.de, vinoth.kannan@widas.de, @WidasConcepts, @vinoth4v, /WidasConcepts
    Credits: www.soundcloud.com, www.concurrentinc.com, www.cascading.org