Pig with Cassandra: Adventures in Analytics


This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.


Published in Technology, Business
  • Make this section interactive. How many are using Cassandra? Find out why, and with what types of data. How many are using Hadoop, and with what types of data? What would they like to get from that data?
  • Mention Jacob’s involvement


  • 1. Pig with Cassandra
    Adventures in Analytics
  • 2. Motivation
    What’s our need?
    How do we get at data in Cassandra with ad-hoc queries?
    Don’t reinvent the wheel
  • 3. Enter Pig
    Pig was created at Yahoo! as an abstraction for MapReduce
    Designed to eat anything
    Load/store function created for Cassandra
  • 4. How it works
    Perform queries over all rows in a column family or set of column families
    Intermediate results stored in HDFS or CFS
    Can mix and match inputs and outputs
  • 5. Uses
    Data exploration
    How many items did I get from New Jersey?
    Data validation
    How many items were missing a field and when were they created?
    Data correction
    Company name correction over all data
    Expand Cassandra data model
    Make a new column family for querying by US State and back-populate with Pig
    Bootstrap local dev environment
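
The exploration and validation questions on this slide amount to a full scan followed by a filter or a group-by. Here is a minimal sketch of that logic in plain Python. It is not Pig Latin and not the Pygmalion API, and the rows and field names (`state`, `company`, `created`) are hypothetical stand-ins for one column family read in full.

```python
from collections import Counter

# Hypothetical rows standing in for a column family that is scanned in full.
rows = [
    {"key": "a1", "state": "NJ", "company": "Acme", "created": "2011-05-01"},
    {"key": "a2", "state": "TX", "company": "Acme", "created": "2011-05-02"},
    {"key": "a3", "state": "NJ", "created": "2011-06-10"},         # no "company"
    {"key": "a4", "company": "Initech", "created": "2011-06-11"},  # no "state"
]


def count_by(rows, field):
    """Data exploration: count rows per value of `field`, skipping rows without it."""
    return Counter(r[field] for r in rows if field in r)


def missing(rows, field):
    """Data validation: (key, created) for every row that lacks `field`."""
    return [(r["key"], r.get("created")) for r in rows if field not in r]


by_state = count_by(rows, "state")      # "How many items did I get from New Jersey?"
no_company = missing(rows, "company")   # "Which items were missing a field, and when?"
```

In a real job the same two steps would be expressed as a Pig script, so the scan runs as MapReduce tasks next to the data instead of in one process.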
  • 6. Pygmalion
    Figure in Greek mythology, sounds like Pig
    UDFs and example scripts for using Pig with Cassandra
    Used in production at The Dachis Group
  • 7. Digging in the Dirt
    Pygmalion basic examples
  • 8. Tips
    Develop incrementally
    Output intermediate data frequently to verify
    Validate data on input if possible
    Use Cassandra data type validation for inputs and outputs
    Pygmalion for tabular data
    Penny in Pig 0.9!
  • 9. Cluster Configuration
    Split cluster – virtual datacenters
    Brisk (built-in Pig support in 1.0 beta 2+)
    Task trackers on all analytic nodes
    With HDFS:
    Separate namenode/jobtracker
    Data nodes on all analytic nodes
    A few settings to bridge the two
    Start the server processes
    Distributed cache and intermediate data
    With Brisk:
    Startup includes CFS, job tracker, and task trackers
  • 10. Topology configuration
    # from conf/
    # Cassandra Node IP=Data Center:Rack

    # default for unknown nodes
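
The entries from this slide did not survive in the transcript, only the comments describing the `IP=Data Center:Rack` format and a default for unknown nodes. The sketch below shows how a file in that format could be read. The IP addresses and the datacenter and rack names are hypothetical, and the `default` key is an assumption based on Cassandra's property-file topology convention.

```python
def parse_topology(text):
    """Parse `ip=datacenter:rack` lines and return (nodes, default)."""
    nodes, default = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        dc, _, rack = value.partition(":")
        entry = (dc.strip(), rack.strip())
        if key.strip() == "default":
            default = entry
        else:
            nodes[key.strip()] = entry
    return nodes, default


def locate(ip, nodes, default):
    """Return (datacenter, rack) for `ip`, falling back to the default entry."""
    return nodes.get(ip, default)


# Hypothetical sample: one node per virtual datacenter.
SAMPLE = """
# Cassandra Node IP=Data Center:Rack
10.0.0.1=Cassandra:RAC1
10.0.0.2=Analytics:RAC1

# default for unknown nodes
default=Cassandra:RAC1
"""

nodes, default = parse_topology(SAMPLE)
```

With a split cluster, the analytic nodes would be listed under their own datacenter name so that Hadoop tasks read from replicas in that virtual datacenter.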
  • 11. Configuration Priorities
    Data locality
    Data locality – no really, biggest performance factor
    Memory needs
    Cassandra requires lots of memory
    Hadoop requires lots of memory
    Plan with your data model and analytics in mind
    CPU needs
    Cassandra doesn’t need a lot of CPU horsepower
    Hadoop loves CPU cores
    Analytic nodes need to be close to one another
  • 12. Cassandra/Hadoop properties
    Splits and batches
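
The slide gives only the phrase "Splits and batches". The sketch below assumes it refers to two settings: the input split size (rows handed to each map task) and the range batch size (rows fetched from Cassandra per request). To my understanding these correspond to the `cassandra.input.split.size` and `cassandra.range.batch.size` properties, but treat those names and the numbers used here as assumptions. The arithmetic shows roughly how the two values shape a job.

```python
import math


def plan(estimated_rows, split_size, batch_size):
    """Rough estimate of (map tasks, range requests per full split)."""
    splits = math.ceil(estimated_rows / split_size)
    rows_in_split = min(estimated_rows, split_size)
    batches_per_split = math.ceil(rows_in_split / batch_size)
    return splits, batches_per_split


# Illustrative values only: one million rows, 65,536 rows per split,
# 4,096 rows per request.
splits, batches = plan(1_000_000, 65_536, 4_096)
```

Smaller splits mean more, shorter map tasks. Smaller batches mean more round trips per task but less memory held per request.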
  • 13. Future Work
    Better data type handling (Cassandra-2777)
    MapReduce over subsets of rows (Cassandra-1600)
    MapReduce over secondary indexes (Cassandra-1600)
    Pig pushdown projection
    Pig pushdown filter
    HCatalog support for Cassandra
    Better Cassandra wide-row support (Cassandra-2688)
    Support for immutable/snapshot inputs (Cassandra-2527)
  • 14. Questions
    Contact info
    Jeremy Hanna
    @jeromatron on Twitter
    jeremy.hanna1234 <at> gmail
    jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)