Pig with Cassandra: Adventures in Analytics

This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.

  • Make this section interactive:
    How many are using Cassandra – find out why, what types of data
    How many are using Hadoop – what types of data
    What they would like to get from that data
  • Mention Jacob’s involvement
  • Pig with Cassandra: Adventures in Analytics

    1. Pig with Cassandra
       Adventures in Analytics
    2. Motivation
       What’s our need?
       How do we get at data in Cassandra with ad-hoc queries
       Don’t reinvent the wheel
    3. Enter Pig
       Pig was created at Yahoo! as an abstraction for MapReduce
       Designed to eat anything
       loadstorefunc created for Cassandra
    4. How it works
       Perform queries over all rows in a column family or set of column families
       Intermediate results stored in HDFS or CFS
       Can mix and match inputs and outputs
    5. Uses
       Analytics
       Data exploration
         How many items did I get from New Jersey?
       Data validation
         How many items were missing a field and when were they created?
       Data correction
         Company name correction over all data
       Expand Cassandra data model
         Make a new column family for querying by US State and back-populate with Pig
       Bootstrap local dev environment
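The exploration and validation questions on the Uses slide amount to a scan over every row followed by a filter or a group-and-count. As a rough illustration only, the Python sketch below runs that logic in memory. It assumes each row arrives as a key plus a bag of (column name, value) pairs; the sample rows, the `state` and `company` column names, and both helper functions are hypothetical. In practice this work would be written in Pig and run as MapReduce jobs across the cluster.

```python
from collections import Counter

# Hypothetical sample rows, shaped as (row_key, [(column_name, value), ...]).
ROWS = [
    ("item-1", [("state", "NJ"), ("company", "Acme")]),
    ("item-2", [("state", "CA"), ("company", "Acme")]),
    ("item-3", [("state", "NJ")]),
    ("item-4", [("company", "Initech")]),  # missing the state column
]


def count_by_column(rows, column_name):
    """Count rows per value of one column, skipping rows that lack it."""
    counts = Counter()
    for _key, columns in rows:
        values = dict(columns)
        if column_name in values:
            counts[values[column_name]] += 1
    return counts


def keys_missing_column(rows, column_name):
    """Return the keys of rows that lack a column (the data validation case)."""
    return [key for key, columns in rows if column_name not in dict(columns)]


# Data exploration: how many items came from each state?
by_state = count_by_column(ROWS, "state")

# Data validation: which items are missing the field?
missing_state = keys_missing_column(ROWS, "state")
```

The two helpers correspond to the "Data exploration" and "Data validation" bullets; a correction or back-population job follows the same scan-every-row pattern but writes results back out.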
    6. Pygmalion
       Figure in Greek mythology, sounds like Pig
       UDFs, example scripts for using Pig with Cassandra
       Used in production at The Dachis Group
       https://github.com/jeromatron/pygmalion/
    7. Digging in the Dirt
       Pygmalion basic examples
    8. Tips
       Develop incrementally
       Output intermediate data frequently to verify
       Validate data on input if possible
       Use Cassandra data type validation for inputs and outputs
       Pygmalion for tabular data
       Penny in Pig 0.9!
    9. Cluster Configuration
       Split cluster – virtual datacenters
       Brisk (built-in Pig support in 1.0 beta 2+)
       Task trackers on all analytic nodes
       With HDFS:
         Separate namenode/jobtracker
         Data nodes on all analytic nodes
         A few settings to bridge the two
         Start the server processes
         Distributed cache and intermediate data
       With Brisk:
         Startup includes CFS, job tracker, and task trackers
    10. Topology configuration
        # from conf/cassandra-topology.properties
        ###
        # Cassandra Node IP=Data Center:Rack

        10.20.114.10=DC-Analytics:Rack-1b
        10.20.114.11=DC-Analytics:Rack-1b
        10.20.114.12=DC-Analytics:Rack-2b

        10.0.0.10=DC-Realtime-East:Rack-1a
        10.0.0.11=DC-Realtime-East:Rack-1a
        10.0.0.12=DC-Realtime-East:Rack-2a

        10.21.119.13=DC-Realtime-West:Rack-1c
        10.21.119.14=DC-Realtime-West:Rack-1c
        10.21.119.15=DC-Realtime-West:Rack-2c

        # default for unknown nodes
        default=DC-Realtime-West:Rack-1c
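To make the split-cluster layout in the topology file easier to check, here is a small Python sketch that parses the `IP=Data Center:Rack` format shown on the slide and groups the nodes by data center. The sample text is copied from the slide. The helper names `parse_topology` and `nodes_by_datacenter` are hypothetical and not part of Cassandra, which reads this file itself through its snitch.

```python
from collections import defaultdict

# Entries copied from the slide's conf/cassandra-topology.properties example.
TOPOLOGY = """\
# Cassandra Node IP=Data Center:Rack

10.20.114.10=DC-Analytics:Rack-1b
10.20.114.11=DC-Analytics:Rack-1b
10.20.114.12=DC-Analytics:Rack-2b

10.0.0.10=DC-Realtime-East:Rack-1a
10.0.0.11=DC-Realtime-East:Rack-1a
10.0.0.12=DC-Realtime-East:Rack-2a

10.21.119.13=DC-Realtime-West:Rack-1c
10.21.119.14=DC-Realtime-West:Rack-1c
10.21.119.15=DC-Realtime-West:Rack-2c

# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text):
    """Return ({ip: (datacenter, rack)}, default_location) from the file text."""
    nodes = {}
    default_location = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, location = line.partition("=")
        datacenter, _, rack = location.partition(":")
        if host == "default":
            default_location = (datacenter, rack)
        else:
            nodes[host] = (datacenter, rack)
    return nodes, default_location


def nodes_by_datacenter(nodes):
    """Group node IPs by data center name, preserving file order."""
    grouped = defaultdict(list)
    for host, (datacenter, _rack) in nodes.items():
        grouped[datacenter].append(host)
    return dict(grouped)


nodes, default_location = parse_topology(TOPOLOGY)
grouped = nodes_by_datacenter(nodes)
```

Listing `grouped["DC-Analytics"]` is a quick way to confirm which nodes are meant to run task trackers, since those should be the analytic nodes only.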
    11. Configuration Priorities
        Data locality
          Data locality – no really, biggest performance factor
        Memory needs
          Cassandra requires lots of memory
          Hadoop requires lots of memory
          Plan with your data model and analytics in mind
        CPU needs
          Cassandra doesn’t need a lot of CPU horsepower
          Hadoop loves CPU cores
        Interconnected
          Analytic nodes need to be close to one another
    12. Cassandra/Hadoop properties
        Reference: org.apache.cassandra.hadoop.ConfigHelper.java
        Basics
          cassandra.thrift.address
          cassandra.thrift.port
          cassandra.partitioner.class
        Consistency
          cassandra.consistencylevel.read
          cassandra.consistencylevel.write
        Splits and batches
          cassandra.input.split.size
          cassandra.range.batch.size
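The two "Splits and batches" properties mainly affect how a job is carved up. As a back-of-the-envelope sketch, the Python below estimates the number of map tasks and the number of range fetches per split. It assumes `cassandra.input.split.size` is an approximate row count per input split and `cassandra.range.batch.size` is the number of rows fetched per range request. The default values of 65536 and 4096 are assumptions to verify against ConfigHelper for your Cassandra version, and the function name is hypothetical. Real split counts will differ because Cassandra only estimates row counts.

```python
import math


def estimate_job_shape(total_rows, split_size=65536, batch_size=4096):
    """Estimate (map_tasks, fetches_per_split) for a full column family scan.

    split_size stands in for cassandra.input.split.size and batch_size for
    cassandra.range.batch.size; both defaults are assumptions, not verified.
    """
    map_tasks = math.ceil(total_rows / split_size)
    rows_in_a_split = min(split_size, total_rows)
    fetches_per_split = math.ceil(rows_in_a_split / batch_size)
    return map_tasks, fetches_per_split


# One million rows with the assumed defaults.
default_shape = estimate_job_shape(1_000_000)

# A smaller split size yields more, shorter map tasks.
small_split_shape = estimate_job_shape(1_000_000, split_size=16384)
```

The trade-off to look for: very small splits spend most of the job on per-task overhead, while very large splits reduce parallelism across the analytic nodes.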
    13. Future Work
        Better data type handling (Cassandra-2777)
        MapReduce over subsets of rows (Cassandra-1600)
        MapReduce over secondary indexes (Cassandra-1600)
        Pig pushdown projection
        Pig pushdown filter
        HCatalog support for Cassandra
        Better Cassandra wide-row support (Cassandra-2688)
        Support for immutable/snapshot inputs (Cassandra-2527)
    14. Questions
        Contact info
        Jeremy Hanna
        @jeromatron on twitter
        jeremy.hanna1234 <at> gmail
        jeromatron on irc (in #cassandra, #hadoop-pig, and #hadoop)
