Cascading meetup #4 @ BlueKai
 

Slides from Cascading meetup #4, held at BlueKai in Cupertino, CA on 2013-03-05

Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript

  • Cascading Meetup #4
    BlueKai, Cupertino, CA, 2013-03-05
    Copyright @2013, Concurrent, Inc.
  • Cascading Meetup agenda:
    1. Enterprise Data Workflows
    2. ANSI SQL Support
    3. Test-Driven Development
    [diagram: word-count workflow: Document Collection → Scrub/Tokenize → HashJoin (Left) with a Stop Word List (RHS) → GroupBy token → Count → Word Count; a code sketch of this workflow follows below]
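The diagram above is the canonical Cascading word-count flow. For readers following along, here is a minimal sketch of that flow in Java, closely following the public "Cascading for the Impatient" tutorial; it assumes Cascading 2.x on Hadoop, the paths and field names are illustrative, and the Scrub step and stop-word HashJoin from the diagram are omitted for brevity.

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountSketch
      {
      public static void main( String[] args )
        {
        String docPath = args[ 0 ]; // TSV of documents with a "text" column
        String wcPath = args[ 1 ];  // output: token + count

        Fields text = new Fields( "text" );
        Fields token = new Fields( "token" );

        // source and sink taps over tab-delimited files on HDFS
        Tap docTap = new Hfs( new TextDelimited( text, true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath, SinkMode.REPLACE );

        // tokenize each document, then group by token and count occurrences
        Pipe docPipe = new Each( "token", text,
          new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" ), Fields.RESULTS );
        Pipe wcPipe = new GroupBy( "wc", docPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, WordCountSketch.class );
        new HadoopFlowConnector( properties ).connect( flowDef ).complete();
        }
      }

Build the JAR and launch it with hadoop jar, passing the input and output paths as the two arguments.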
  • Enterprise Data Workflows
    Let’s consider an example app… at the front end: LOB use cases drive demand for apps.
    [diagram: enterprise data workflow: Customers → Web App → logs, Cache, Support logs → source taps and traps → Data Workflow with PMML modeling on a Hadoop Cluster → sink taps for Analytics Cubes, Customer Prefs DBs, Reporting]
    Notes: LOB use cases drive the demand for Big Data apps.
  • Enterprise Data Workflows
    An example… in the back office: organizations have substantial investments in people, infrastructure, process.
    [diagram: enterprise data workflow, as before]
    Notes: Enterprise organizations have seriously ginormous investments in existing back-office practices: people, infrastructure, processes.
  • Enterprise Data Workflows
    An example… for the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out.
    [diagram: enterprise data workflow, as before]
    Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • Two Avenues…
    Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.
    Start-ups: crave complexity and scale to become viable… new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
    [diagram axes: complexity ➞, scale ➞]
    Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity, and incumbent firms grappling with complexity.
  • Two Avenues… (continued)
    Same Enterprise / Start-ups framing as the previous slide, with a callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps.
    Notes: Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.
  • Cascading Meetup agenda:
    1. Enterprise Data Workflows
    2. ANSI SQL Support
    3. Test-Driven Development
    [diagram: word-count workflow, as before]
  • Cascading workflows – ANSI SQL
    • collab with Optiq – industry-proven code base
    • ANSI SQL parser/optimizer atop Cascading flow planner
    • JDBC driver to integrate into existing tools and app servers
    • relational catalog over a collection of unstructured data
    • SQL shell prompt to run queries
    [diagram: enterprise data workflow, as before]
    Notes: ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • Cascading workflows – ANSI SQL (continued)
    Premise: most SQL in the world gets written by machines… This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.
    [diagram: enterprise data workflow, as before]
    Notes: (same as the previous slide)
  • Cascading workflows – ANSI SQL
    • enable analysts without retraining on Hadoop, etc.
    • transparency for Support, Ops, Finance, et al.
    Callout: a language for queries – not a database, but ANSI SQL as a DSL for workflows.
    [diagram: enterprise data workflow, as before]
    Notes: (same as the previous slide)
  • ANSI SQL – reviews
    “Open Source Lingual Helps SQL Devs Unlock Hadoop”, Thor Olavsrud, 2013-02-22
      cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
    “Hadoop Apps Without MapReduce Mindsets”, Adrian Bridgwater, 2013-02-28
      drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
    “Concurrent gives old SQL users new Hadoop tricks”, Jack Clark, 2013-02-20
      theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
    “Concurrent Open Source Project Ties SQL to Hadoop”, Michael Vizard, 2013-02-21
      itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html
    “Concurrent Releases Lingual, a SQL DSL for Hadoop”, Boris Lublinsky, 2013-02-28
      infoq.com/news/2013/02/Lingual
  • ANSI SQL – CSV data in local file system
    cascading.org/lingual
    Notes: The test database for MySQL is available for download from https://launchpad.net/test-db/. Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • ANSI SQL – shell prompt, catalog
    cascading.org/lingual
    Notes: Use the “lingual” SQL shell prompt to run SQL queries interactively, show the catalog, etc.
  • ANSI SQL – queries
    cascading.org/lingual
    Notes: Here’s an example SQL query on that “employee” test database from MySQL.
  • ANSI SQL – layers of abstraction
    (layer: RDBMS | JVM Cluster)
    parser: ANSI SQL compliant parser | ANSI SQL compliant parser
    optimizer: logical plan, optimized based on stats | logical plan, optimized based on stats
    planner: physical plan | API “plumbing”
    machine data: query history, table stats | app history, tuple stats
    topology: b-trees, etc. | heterogenous, distributed: Hadoop, IMDG, etc.
    visualization: ERD | flow diagram
    schema: table schema | tuple schema
    catalog: relational catalog | tap usage DB
    provenance: (manual audit) | data set producers/consumers
    Notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
  • ANSI SQL – JDBC driver
    public void run() throws ClassNotFoundException, SQLException
      {
      Class.forName( "cascading.lingual.jdbc.Driver" );
      Connection connection = DriverManager.getConnection(
        "jdbc:lingual:local;schemas=src/main/resources/data/example" );
      Statement statement = connection.createStatement();

      ResultSet resultSet = statement.executeQuery(
        "select *\n"
        + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
        + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
        + "on e.\"EMPID\" = s.\"CUST_ID\"" );

      while( resultSet.next() )
        {
        int n = resultSet.getMetaData().getColumnCount();
        StringBuilder builder = new StringBuilder();

        for( int i = 1; i <= n; i++ )
          {
          builder.append( ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "=" + resultSet.getObject( i ) );
          }

        System.out.println( builder );
        }

      resultSet.close();
      statement.close();
      connection.close();
      }
    Notes: Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
  • ANSI SQL – JDBC driver, running the example
    $ gradle clean jar
    $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar

    CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
    CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

    Caveat: if you absolutely positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors)
    This ANSI SQL library is primarily intended for batch workflows – high throughput, not low-latency – for many under-represented use cases in Enterprise IT. It’s essentially ANSI SQL as a DSL.
    Notes: success
  • Cascading Meetup agenda:
    1. Enterprise Data Workflows
    2. ANSI SQL Support
    3. Test-Driven Development
    [diagram: word-count workflow, as before]
  • Test-Driven Development (TDD)
    source: Wikipedia
    Notes: A general view of the TDD process.
  • Test-Driven Development (TDD)
    In terms of Big Data apps, TDD is not generally part of the conversation.
    Notes: TDD is not usually high on the list when people start discussing Big Data apps.
  • Traps – Cascading “exceptional data”
    • assert patterns (regex) on the tuple streams
    • adjust assert levels, like log4j levels
    • define traps on branches
    • tuples which fail asserts get trapped
    [diagram: enterprise data workflow, as before]
    Notes: An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.
  • Traps – example code
    // set up...
    Pipe etlPipe = new Pipe( "etlPipe" );

    // some processing...
    AssertMatches assertMatches = new AssertMatches( ".*true" );
    etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

    // some processing...
    FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
      .addSource( etlPipe, jsonTap )
      .addTrap( etlPipe, trapTap )
      .addTailSink( etlPipe, cacheTap );

    if( options.has( "assert" ) )
      flowDef.setAssertionLevel( AssertionLevel.STRICT );
    else
      flowDef.setAssertionLevel( AssertionLevel.NONE );
    Notes: Example use in Cascading code.
  • Traps – redirect exceptions in production
    Shunt the trapped exceptional data to other parts of the organization:
    • Ops: notifications
    • QA: investigate data anomalies
    • Support: review customer records
    • Finance: audit
    [diagram: enterprise data workflow, as before]
  • TDD – practice at scale
    1. assert expected patterns in raw input
    2. run just that, to find edge cases
    3. handle the edge cases for input data
    4. assert expected patterns after first chunk of processing
    5. run just that, to verify failure
    6. code until test passes
    7. repeat #4 for each chunk
    (a minimal sketch of steps 1–3 follows below)
    [diagram: complex example flow: GIS export, tree/road/GPS parsing (Regex), joins, filters, estimates of tree height, road albedo, and shade, failure traps, geohash counts]
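As a concrete illustration of steps 1–3 above, here is a minimal, hypothetical sketch of asserting an expected pattern over raw input and trapping the edge cases, using the same AssertMatches / trap mechanism shown on the earlier “Traps – example code” slide. It assumes Cascading 2.x in local mode; the file paths and the regex are illustrative.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.AssertionLevel;
    import cascading.operation.assertion.AssertMatches;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    public class RawInputAssertionSketch
      {
      public static void main( String[] args )
        {
        // raw input, a sink for records that pass, and a trap for edge cases
        Tap inputTap = new FileTap( new TextLine(), "data/raw_input.txt" );
        Tap passedTap = new FileTap( new TextLine(), "output/passed.txt", SinkMode.REPLACE );
        Tap trapTap = new FileTap( new TextLine(), "output/edge_cases.txt", SinkMode.REPLACE );

        // step 1: assert the pattern every raw record is expected to match
        // (hypothetical: each record should contain an "OK" status token);
        // records which fail the assertion are diverted into the trap
        Pipe rawPipe = new Pipe( "raw" );
        rawPipe = new Each( rawPipe, AssertionLevel.STRICT,
          new AssertMatches( ".*\\bOK\\b.*" ) );

        FlowDef flowDef = FlowDef.flowDef().setName( "assert-raw-input" )
          .addSource( rawPipe, inputTap )
          .addTrap( rawPipe, trapTap )
          .addTailSink( rawPipe, passedTap );

        // step 2: run just this flow to surface the edge cases
        Flow flow = new LocalFlowConnector().connect( flowDef );
        flow.complete();
        }
      }

Running this once over a sample of raw input (step 2) leaves the edge cases in the trap file for inspection, so they can be handled explicitly (step 3) before the assertions are extended to later processing stages.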
  • TDD – Cascalog features
    Consider that TDD is about asserting and negating logical predicates…
    • Cascalog is based on logical predicates
    • function definitions as composable subqueries
    • functions are not particularly far from being unit tests
    • Midje: facts, mocks
    sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
    sritchie.github.com/2012/01/22/cascalog-testing-20.html
    Notes: Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology -- in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
  • Cascading Meetup agenda:
    1. Enterprise Data Workflows
    2. ANSI SQL Support
    3. Test-Driven Development
    …plus, a proposal
    [diagram: word-count workflow, as before]
  • ANSI SQL – multiple flows
    Suppose your organization is responsible for a large-scale app… Multiple teams develop reusable libraries…
    [diagram: complex example flow, as before]
    Notes: Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
  • ANSI SQL – multiple flows
    Data Analysts: ANSI SQL queries for data prep (displaces Hive, etc.)
    [diagram: complex example flow, as before]
    Notes: Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.
  • ANSI SQL – multiple flows
    Server-side Engineering: HBase tap for customer profiles (integrating other components)
    [diagram: complex example flow, as before]
    Notes: Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.
  • ANSI SQL – multiple flows
    Ops + Support: traps get routed to customer review (ties into notifications, etc.)
    [diagram: complex example flow, as before]
    Notes: Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.
  • ANSI SQL – multiple flows
    Data Scientists: R => PMML for predictive models (displaces SAS, etc.)
    [diagram: complex example flow, as before]
    Notes: Scientists perform their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.
  • ANSI SQL – multiple flows
    App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.); see the sketch after this slide.
    [diagram: complex example flow, as before]
    Notes: Generally the revenue apps require some custom business logic -- representing business process for LOB. These can migrate into a Cascading app to run on Hadoop.
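To make the “reusable libraries” point concrete, here is a minimal, hypothetical sketch of packaging a bit of business logic as a Cascading SubAssembly, so app engineering can ship it as a library that other teams compose into their own flows. It assumes Cascading 2.x; the class name, field names, and regexes are illustrative.

    import cascading.operation.regex.RegexFilter;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.pipe.SubAssembly;
    import cascading.tuple.Fields;

    /** Hypothetical reusable business-logic unit: split a "text" field into "token" tuples. */
    public class TokenizeAssembly extends SubAssembly
      {
      public TokenizeAssembly( Pipe previous )
        {
        // split the "text" field into one "token" per output tuple
        Pipe pipe = new Each( previous, new Fields( "text" ),
          new RegexSplitGenerator( new Fields( "token" ), "\\s+" ), Fields.RESULTS );

        // keep only tokens that contain non-whitespace characters
        pipe = new Each( pipe, new Fields( "token" ), new RegexFilter( "\\S+" ) );

        // register the tail so this assembly can be used like any other Pipe
        setTails( pipe );
        }
      }

Since a SubAssembly is itself a Pipe, a downstream team can write new TokenizeAssembly( new Pipe( "docs" ) ) and wire the result into its own FlowDef.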
  • ANSI SQL – multiple flows
    Front-end Engineering: Memcached tap for pushing updates to API (integrating other components)
    [diagram: complex example flow, as before]
    Notes: Engineering provides integration with the caching layer, for API updates. These can migrate into a Cascading app to run on Hadoop.
  • Cascading workflows – API principles
    • specify what is required, not how it must be achieved
    • plan far ahead, before consuming cluster resources – fail fast prior to submit
    • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
    • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
    • no surprises
    (a sketch of the “same JAR, any scale” idea follows below)
    [diagram: enterprise data workflow, as before]
    Notes: Some of the design principles for the pattern language.
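As one reading of the “same JAR, any scale” principle, here is a minimal, hypothetical sketch of selecting taps and a flow planner at launch time from a command-line argument, so the identical pipe assembly runs either locally or on a Hadoop cluster without a recompile. It assumes Cascading 2.x with both the local and Hadoop planners on the classpath; the mode flag, paths, and field names are illustrative.

    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.expression.ExpressionFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class SameJarAnyScale
      {
      public static void main( String[] args )
        {
        String mode = args[ 0 ];    // "local" or "hadoop"
        String inPath = args[ 1 ];
        String outPath = args[ 2 ];

        Fields fields = new Fields( "ip", "status" );
        Tap source;
        Tap sink;
        FlowConnector connector;

        if( "hadoop".equals( mode ) )
          {
          // HDFS taps plus the Hadoop flow planner
          source = new cascading.tap.hadoop.Hfs( new cascading.scheme.hadoop.TextDelimited( fields, "\t" ), inPath );
          sink = new cascading.tap.hadoop.Hfs( new cascading.scheme.hadoop.TextDelimited( fields, "\t" ), outPath, SinkMode.REPLACE );
          connector = new HadoopFlowConnector();
          }
        else
          {
          // local file taps plus the in-memory local flow planner
          source = new cascading.tap.local.FileTap( new cascading.scheme.local.TextDelimited( fields, "\t" ), inPath );
          sink = new cascading.tap.local.FileTap( new cascading.scheme.local.TextDelimited( fields, "\t" ), outPath, SinkMode.REPLACE );
          connector = new LocalFlowConnector();
          }

        // identical business logic in either mode: keep only records with HTTP status 200
        Pipe pipe = new Pipe( "etl" );
        pipe = new Each( pipe, new Fields( "status" ), new ExpressionFilter( "status != 200", Integer.TYPE ) );

        FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
          .addSource( pipe, source )
          .addTailSink( pipe, sink );

        connector.connect( flowDef ).complete();
        }
      }

Only the taps and the FlowConnector differ between the two branches; the pipe assembly and the packaged JAR stay the same.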
  • book… by Paco Nathan
    “Enterprise Data Workflows with Cascading”, O’Reilly, 2013
    amazon.com/dp/1449358721
    Notes: Our upcoming O’Reilly book, “Enterprise Data Workflows with Cascading”. Should be in Rough Cuts soon -- scheduled to be out in print this June.
  • drill-down… blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
    cascading.org
    zest.to/group11
    github.com/Cascading
    conjars.org
    goo.gl/KQtUL
    concurrentinc.com
    Join us for very interesting work!
    Copyright @2013, Concurrent, Inc.
    Notes: Links to our open source projects, developer community, etc…