Cascalog
                      Nathan Marz, BackType



Powerful and easy-to-use data analysis tool for Hadoop
About Me


Tech Lead at BackType

Have worked on many-terabyte-scale systems
for two years

 ETL workflows

 Data warehouses
What is Hadoop?

Distributed Filesystem

MapReduce Framework



Scales to thousands of machines and petabytes of data
What is Cascalog?


Clojure-based query language for Hadoop with
Datalog-inspired syntax

Queries compile to one or more MapReduce jobs

The tool I wish I had two years ago
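
To make "queries compile to one or more MapReduce jobs" concrete, here is a minimal sketch of the map, shuffle, and reduce steps behind a simple "count actions per person" query. It is plain Python, not Cascalog syntax, and it does not show how Cascalog actually plans its jobs; the function names and the sample `(person, action)` tuples are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, 1) pair per input record, keyed by the acting person."""
    for person, _action in records:
        yield person, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample data: (person, action) tuples.
actions = [("alice", "post"), ("bob", "like"), ("alice", "comment")]
action_counts = reduce_phase(shuffle(map_phase(actions)))
```

A query language sitting on top of this model lets you state only the grouping and the aggregate, and leaves the three phases to the compiler.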
Features

Inner and outer joins

Aggregators

Functions

Subqueries

Sorting

High performance
What sets Cascalog apart?

Super simple

Full power of Clojure always available

Easy to extend with custom operations

Dynamic queries

Arbitrary inputs and outputs
Experiment with Cascalog

Ships with a test dataset that can be queried locally (the “playground”)

5 minutes to set up Hadoop, Clojure, and Cascalog locally (see README)
News feed generator

Ranks events in a social network for each person,
based on “importance” and recency


38 lines of code
Demo time!
News Feed
“Follows” and “Action” data sources

 Text files on HDFS
News Feed
Custom aggregator to produce a news feed in JSON-like form
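
The demo code for this slide is not reproduced here, so the following is only a sketch of what such an aggregator does: it receives every scored item for one person and reduces them to a single feed value. It is plain Python rather than Cascalog, and the function name, the `(score, event)` tuple layout, and the `limit` parameter are assumptions.

```python
import json

def feed_aggregator(scored_items, limit=10):
    """Reduce one person's (score, event) pairs to a JSON feed, best first."""
    ranked = sorted(scored_items, key=lambda item: item[0], reverse=True)
    feed = [{"event": event, "score": score} for score, event in ranked[:limit]]
    return json.dumps(feed)

# Hypothetical scored items for a single person.
items_for_alice = [(0.5, "bob liked a photo"), (2.0, "carol posted a link")]
alice_feed = feed_aggregator(items_for_alice, limit=1)
```

The key property is that the aggregator sees one group at a time, which is what lets the grouping itself run as a reduce step.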
News Feed

Custom function to score each item in the feed
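
The deck says items are ranked by "importance" and recency but does not give the formula, so the scoring below is hypothetical. It assumes importance is the follower count of the person who acted and that recency decays with a one-day half-life; both choices, and all names, are illustrative only.

```python
import math

def score_item(follower_count, age_seconds, half_life_seconds=86400.0):
    """Hypothetical score: log-scaled importance with exponential recency decay."""
    importance = math.log1p(follower_count)  # dampens very large follower counts
    recency = 0.5 ** (age_seconds / half_life_seconds)  # halves every half-life
    return importance * recency

fresh = score_item(follower_count=100, age_seconds=0)
day_old = score_item(follower_count=100, age_seconds=86400)
```

Because the function works on one item at a time and keeps no state, it maps naturally onto a per-record operation in a query.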
News Feed



Data sources
News Feed

Subquery to compute follower count for each person
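
As a rough illustration of what this subquery computes, the sketch below counts followers per person in plain Python. It assumes each "Follows" record is a `(follower, followed)` pair, which the deck does not specify, and it is not Cascalog syntax.

```python
from collections import Counter

def follower_counts(follows):
    """Count followers per person from (follower, followed) pairs."""
    return Counter(followed for _follower, followed in follows)

# Hypothetical sample data.
follows = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
counts = follower_counts(follows)
```

The result can then be joined back against the main query on the person's name, which is the role a subquery plays here.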
News Feed




Tie everything together in a single Cascalog query
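
The sketch below shows how the pieces could fit together: a follower-count subquery, a join of follows against actions, a per-item score, and a per-person aggregation. It is plain Python rather than the 38-line Cascalog program from the demo, and the record layouts, the scoring formula, and every name in it are assumptions.

```python
import json
import math
from collections import Counter, defaultdict

def build_feeds(follows, actions, now, limit=10):
    """Join follows with actions, score each item, and aggregate per reader."""
    # Subquery: follower count per person, used here as "importance".
    followers = Counter(followed for _follower, followed in follows)

    # Index actions by actor so the join below is a dictionary lookup.
    by_actor = defaultdict(list)
    for actor, event, timestamp in actions:
        by_actor[actor].append((event, timestamp))

    # Join: each reader sees the actions of the people they follow.
    scored = defaultdict(list)
    for reader, followed in follows:
        for event, timestamp in by_actor[followed]:
            age_days = (now - timestamp) / 86400.0
            score = math.log1p(followers[followed]) * 0.5 ** age_days
            scored[reader].append((score, event))

    # Aggregate: keep each reader's top items, highest score first.
    return {
        reader: json.dumps([event for _s, event in sorted(items, reverse=True)[:limit]])
        for reader, items in scored.items()
    }

# Hypothetical sample data: (follower, followed) and (actor, event, unix_time).
follows = [("alice", "bob"), ("alice", "carol"), ("dave", "bob")]
actions = [("bob", "bob posted", 1000), ("carol", "carol posted", 1000)]
feeds = build_feeds(follows, actions, now=1000)
```

In a declarative query the join, the grouping, and the ordering of these steps are inferred from shared variables rather than written out as loops.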
Questions?


Project page:
http://www.github.com/nathanmarz/cascalog

Tutorial:
http://nathanmarz.com/blog/introducing-cascalog

Follow me on Twitter: @nathanmarz
