Cascalog internal dsl_preso

  • A case study, not an intro. Shows some code, but does not describe the syntax (reference at the end).
  • Clojure is fully interoperable with Java. In this way we get the benefit of many simultaneous streams of work: optimizations (join optimizations, for example), new features, and adapters and interfaces to other technologies (e.g. sink taps for HBase or MySQL).
  • How to express the solution. Too much control: a full Java program (a lot of overhead to deal with before you can express your business logic). Too little control: an external query language, which either evolves into a Turing-complete language or forces you to construct queries by string manipulation. This is a case study of a problem we had. I'm going to talk about the problem, the three types of solutions and what they might look like, and, of course, go into the most detail on the Cascalog solution.
  • The requirement is to pre-aggregate a daily data set. Typical star-schema-type data, with sample dimensions. The highlighted columns are the "dimensions" or "attributes" of the data.
  • A possible aggregation
  • With 3 dimensions there are 8 rollups (2^3), counting both all dimensions and none. If we were doing this with SQL, this would be 7 different GROUP BY clauses. Our data has 6 dimensions, so 63 aggregations (2^6 - 1).
  • It's always possible to compute any aggregation from the original data set, but we can do better. Using A, B and C as the dimensions, each node in the tree is a possible rollup. It is more efficient to build a child node from its parent than from the original data. Aggregations with lower cardinality are preferred (fewer records to aggregate).
  • To look at something we're all familiar with, consider how you would do this in a relational database. The first query is based on the original data; the second is based on a previous aggregation, where group_avg is some UDF that can compute an average given a series of previous averages and a count from a previous grouping. DRY principle. For each query you have to: determine the table names for INSERT and FROM; determine the column lists and when to use the UDF average vs. the standard average; determine which previous aggregate table to use as a source.
  • The alternative is a custom map/reduce program. Not going into much detail here. Too low level: you have to do everything yourself. Limited composability, only what you build yourself. You have to choose and name the output path, select the best previous aggregation as the input path, and determine which fields need to be aggregated for this run and set up the JobConf.
  • In this way we get the benefit of many simultaneous streams of work: optimizations (join optimizations, for example), new features, and adapters and interfaces to other technologies (e.g. sink taps for HBase or MySQL).
  • Typical, idiomatic Clojure code: functional, recursion, immutability, etc. An internal DSL extends the language, but you still have the full power of the host language. None of the functions care about the contents or length of the DIMS vector. [A,B,C] yields [A,B], [A,C], [B,C]. For key a,b,c, selecting dimensions [B,C] yields "*,b,c".
  • Notice <-, which means just create the query; don't execute it for now. make-qry constructs a single job that does one aggregation. generate-query-tree is also responsible for choosing the best previous aggregation to use for the current rollup.
  • All of the functions we've seen are testable with unit tests or integration tests. Two tests, with sample input data and expected results.
  • This becomes the main function. input-dir and output-dir are command-line arguments to the program. They are paths in HDFS, for example, but could be local file system or S3 paths instead. Or, since this runs on top of Cascading, you could output to a database tap or HBase, for example.
  • Lastly, I had never used Clojure or any functional language before starting with Cascalog.
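
The rollup logic described in the notes above (counting rollups, deriving child rollups from a dimension vector, masking a key, and picking the cheapest previous aggregation) can be sketched roughly as follows. This is an illustration in Python, not the Clojure/Cascalog code from the slides: the function names, the `row_counts` mapping, and the sample record counts are all hypothetical, and only the A/B/C examples come from the notes.

```python
from itertools import combinations

# Assumed dimension vector, matching the A, B, C example in the notes.
DIMS = ("A", "B", "C")


def all_rollups(dims):
    """Every subset of dims, from all dimensions down to none."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def child_rollups(dims):
    """Rollups obtained by dropping exactly one dimension from dims."""
    return list(combinations(dims, len(dims) - 1))


def mask_key(key, selected, dims=DIMS):
    """Replace the value of every dimension not in `selected` with '*'."""
    return ",".join(v if d in selected else "*" for d, v in zip(dims, key))


def best_parent(target, row_counts):
    """Pick the smallest already-computed rollup that contains all of target.

    row_counts is a hypothetical mapping {dimension tuple: record count}.
    """
    candidates = [dims for dims in row_counts if set(target) < set(dims)]
    return min(candidates, key=lambda dims: row_counts[dims])


if __name__ == "__main__":
    print(len(all_rollups(DIMS)))
    print(child_rollups(DIMS))
    print(mask_key(("a", "b", "c"), ("B", "C")))
    # Made-up record counts, for illustration only.
    counts = {("A", "B", "C"): 1000, ("A", "B"): 200, ("B", "C"): 50}
    print(best_parent(("B",), counts))
```

None of these helpers depend on the length or contents of the dimension vector, which is the property the notes point out for the Clojure version.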
