Pig with Cassandra: Adventures in Analytics


This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.

Published in: Technology, Business

  • Make this section interactive: ask how many are using Cassandra (find out why, and for what types of data), how many are using Hadoop (and for what types of data), and what they would like to get from that data
  • Mention Jacob’s involvement
  • Pig with Cassandra: Adventures in Analytics

    1. Pig with Cassandra: Adventures in Analytics
    2. Motivation
        • What’s our need?
        • How do we get at data in Cassandra with ad-hoc queries?
        • Don’t reinvent the wheel
    3. Enter Pig
        • Pig was created at Yahoo! as an abstraction for MapReduce
        • Designed to eat anything
        • A load/store function (loadstorefunc) was created for Cassandra
    4. How it works
        • Perform queries over all rows in a column family or set of column families
        • Intermediate results stored in HDFS or CFS
        • Can mix and match inputs and outputs
    5. Uses
        • Analytics
        • Data exploration
            • How many items did I get from New Jersey?
        • Data validation
            • How many items were missing a field and when were they created?
        • Data correction
            • Company name correction over all data
        • Expand Cassandra data model
            • Make a new column family for querying by US state and back-populate with Pig
        • Bootstrap local dev environment
    6. Pygmalion
        • Figure in Greek mythology, sounds like Pig
        • UDFs and example scripts for using Pig with Cassandra
        • Used in production at The Dachis Group
        • https://github.com/jeromatron/pygmalion/
    7. Digging in the Dirt
        • Pygmalion basic examples
    8. Tips
        • Develop incrementally
        • Output intermediate data frequently to verify
        • Validate data on input if possible
        • Use Cassandra data type validation for inputs and outputs
        • Pygmalion for tabular data
        • Penny in Pig 0.9!
    9. Cluster Configuration
        • Split cluster – virtual datacenters
        • Brisk (built-in Pig support in 1.0 beta 2+)
        • Task trackers on all analytic nodes
        • With HDFS:
            • Separate namenode/jobtracker
            • Data nodes on all analytic nodes
            • A few settings to bridge the two
            • Start the server processes
            • Distributed cache and intermediate data
        • With Brisk:
            • Startup includes CFS, job tracker, and task trackers
    10. Topology configuration
        # from conf/cassandra-topology.properties
        ###
        # Cassandra Node IP=Data Center:Rack
        # default for unknown nodes
        default=DC-Realtime-West:Rack-1c
    11. Configuration Priorities
        • Data locality
        • Data locality – no really, biggest performance factor
        • Memory needs
            • Cassandra requires lots of memory
            • Hadoop requires lots of memory
            • Plan with your data model and analytics in mind
        • CPU needs
            • Cassandra doesn’t need a lot of CPU horsepower
            • Hadoop loves CPU cores
        • Interconnected
            • Analytic nodes need to be close to one another
    12. Cassandra/Hadoop properties
        • Reference: org.apache.cassandra.hadoop.ConfigHelper.java
        • Basics
            • cassandra.thrift.address
            • cassandra.thrift.port
            • cassandra.partitioner.class
        • Consistency
            • cassandra.consistencylevel.read
            • cassandra.consistencylevel.write
        • Splits and batches
            • cassandra.input.split.size
            • cassandra.range.batch.size
    13. Future Work
        • Better data type handling (Cassandra-2777)
        • MapReduce over subsets of rows (Cassandra-1600)
        • MapReduce over secondary indexes (Cassandra-1600)
        • Pig pushdown projection
        • Pig pushdown filter
        • HCatalog support for Cassandra
        • Better Cassandra wide-row support (Cassandra-2688)
        • Support for immutable/snapshot inputs (Cassandra-2527)
    14. Questions
        • Contact info
            • Jeremy Hanna
            • @jeromatron on Twitter
            • jeremy.hanna1234 <at> gmail
            • jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)
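
The topology and Cassandra/Hadoop properties slides can be tied together with a small sketch. The Python below is not from the talk. It parses text in the `Node IP=Data Center:Rack` format of `conf/cassandra-topology.properties`, lists the nodes in one virtual datacenter, and builds a dictionary keyed by the "Basics" and "Consistency" property names from the slides. The IP addresses, the `DC-Analytics-West` datacenter name, the helper function names, and all property values are assumptions made for illustration. How the resulting properties reach a Hadoop or Pig job depends on the deployment and is not shown.

```python
from typing import Dict, List, Tuple

# Sample topology in the "Node IP=Data Center:Rack" format from the slides.
# The IP addresses and the DC-Analytics-West name are hypothetical.
SAMPLE_TOPOLOGY = """\
# Cassandra Node IP=Data Center:Rack
10.0.0.1=DC-Realtime-West:Rack-1c
10.0.0.2=DC-Analytics-West:Rack-1c
10.0.0.3=DC-Analytics-West:Rack-1c
# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each node IP (or 'default') to a (datacenter, rack) pair."""
    topology: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node, _, location = line.partition("=")
        datacenter, _, rack = location.partition(":")
        topology[node.strip()] = (datacenter.strip(), rack.strip())
    return topology


def locate(topology: Dict[str, Tuple[str, str]], ip: str) -> Tuple[str, str]:
    """Return the (datacenter, rack) for an IP, using 'default' if unknown."""
    return topology.get(ip, topology["default"])


def nodes_in_datacenter(topology: Dict[str, Tuple[str, str]], dc: str) -> List[str]:
    """List the explicitly configured node IPs in one datacenter, sorted."""
    return sorted(ip for ip, (node_dc, _) in topology.items()
                  if ip != "default" and node_dc == dc)


def job_properties(thrift_address: str) -> Dict[str, str]:
    """Build a property dictionary using key names from the slides.

    All values are illustrative assumptions, not recommendations.
    """
    return {
        "cassandra.thrift.address": thrift_address,
        "cassandra.thrift.port": "9160",
        "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
        "cassandra.consistencylevel.read": "ONE",
        "cassandra.consistencylevel.write": "ONE",
    }


topology = parse_topology(SAMPLE_TOPOLOGY)
analytic_nodes = nodes_in_datacenter(topology, "DC-Analytics-West")
# Point the job at an analytic node so reads stay in that virtual datacenter.
properties = job_properties(analytic_nodes[0])
```

Choosing the Thrift address from the analytics datacenter reflects the data locality priority in the slides: the job talks to nodes that sit alongside the task trackers rather than to the realtime nodes.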