Pig with Cassandra: Adventures in Analytics

This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.

Speaker notes:
  • Make this section interactive. How many are using Cassandra – find out why, and what types of data. How many are using Hadoop – what types of data. What would they like to get from that data?
  • Mention Jacob's involvement.
  • Transcript

    • 1. Pig with Cassandra: Adventures in Analytics
    • 2. Motivation
      What's our need?
      How do we get at data in Cassandra with ad hoc queries?
      Don't reinvent the wheel.
    • 3. Enter Pig
      Pig was created at Yahoo! as an abstraction over MapReduce.
      Designed to eat anything.
      A Pig load/store func (CassandraStorage) was created for Cassandra.
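      For context, a minimal load through that func might look like the sketch below. The keyspace and column family names are placeholders, and it assumes the Cassandra jars are on Pig's classpath.

        -- sketch: load every row of a column family (MyKeyspace/MyColumnFamily are placeholders)
        rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
               USING org.apache.cassandra.hadoop.pig.CassandraStorage()
               AS (key, columns: bag {T: tuple(name, value)});
        DUMP rows;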
    • 4. How it works
      Perform queries over all rows in a column family or set of column families.
      Intermediate results are stored in HDFS or CFS.
      Inputs and outputs can be mixed and matched.
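      A hedged sketch of mixing inputs and outputs: read from Cassandra, checkpoint an intermediate copy into HDFS (or CFS), and compute a simple result. Paths and names are placeholders.

        -- load from Cassandra
        raw = LOAD 'cassandra://MyKeyspace/Items'
              USING org.apache.cassandra.hadoop.pig.CassandraStorage()
              AS (key, columns: bag {T: tuple(name, value)});
        -- intermediate results land in HDFS (or CFS)
        STORE raw INTO '/tmp/items_checkpoint';
        -- a simple aggregate over the same data
        grouped = GROUP raw ALL;
        total   = FOREACH grouped GENERATE COUNT(raw);
        DUMP total;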
    • 5. Uses
      Analytics
      Data exploration: how many items did I get from New Jersey?
      Data validation: how many items were missing a field, and when were they created?
      Data correction: company name correction over all data.
      Expand the Cassandra data model: make a new column family for querying by US state and back-populate it with Pig.
      Bootstrap a local dev environment.
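      As an illustration of the data-exploration case, the New Jersey question might be answered with a sketch like the following (it assumes each row carries a 'state' column; keyspace and column family names are placeholders):

        items = LOAD 'cassandra://MyKeyspace/Items'
                USING org.apache.cassandra.hadoop.pig.CassandraStorage()
                AS (key, columns: bag {T: tuple(name, value)});
        -- one (key, name, value) row per column
        flat  = FOREACH items GENERATE key, FLATTEN(columns) AS (name, value);
        -- keep only the assumed 'state' column with value 'NJ'
        nj    = FILTER flat BY (chararray)name == 'state' AND (chararray)value == 'NJ';
        njg   = GROUP nj ALL;
        njcnt = FOREACH njg GENERATE COUNT(nj);
        DUMP njcnt;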
    • 6. Pygmalion
      A figure in Greek mythology; sounds like "Pig".
      UDFs and example scripts for using Pig with Cassandra.
      Used in production at The Dachis Group.
      https://github.com/jeromatron/pygmalion/
    • 7. Digging in the Dirt
      Pygmalion basic examples.
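      The Pygmalion examples turn the (key, column-bag) shape into ordinary tabular fields. A hedged sketch of that style is below; the jar path is a placeholder, and the UDF package and signature should be checked against the repository rather than taken from here.

        REGISTER /path/to/pygmalion.jar;                      -- placeholder path
        -- assumed UDF name/package; see the Pygmalion repo for the exact definition
        DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

        rows    = LOAD 'cassandra://MyKeyspace/Items'
                  USING org.apache.cassandra.hadoop.pig.CassandraStorage()
                  AS (key, columns: bag {T: tuple(name, value)});
        -- pull named columns out of the column bag into plain fields
        tabular = FOREACH rows GENERATE key,
                  FLATTEN(FromCassandraBag('company,state,created_at', columns))
                  AS (company, state, created_at);
        DUMP tabular;

      Its counterpart, ToCassandraBag, packs fields back into the column-bag shape for storing to a column family.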
    • 8. Tips
      Develop incrementally.
      Output intermediate data frequently to verify.
      Validate data on input if possible.
      Use Cassandra data type validation for inputs and outputs.
      Pygmalion for tabular data.
      Penny in Pig 0.9!
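      A small sketch of the incremental style, reusing aliases from the earlier examples: inspect the schema, dump a small sample, and checkpoint intermediate data before running the full job.

        DESCRIBE flat;                            -- check the schema Pig has inferred
        sample_rows = LIMIT flat 20;
        DUMP sample_rows;                         -- eyeball a handful of rows
        STORE flat INTO '/tmp/flat_checkpoint';   -- keep intermediate output to verify later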
    • 9. Cluster Configuration
      Split cluster: virtual datacenters.
      Brisk (built-in Pig support in 1.0 beta 2+).
      Task trackers on all analytic nodes.
      With HDFS:
        Separate namenode/jobtracker.
        Data nodes on all analytic nodes.
        A few settings to bridge the two.
        Start the server processes.
        Distributed cache and intermediate data.
      With Brisk:
        Startup includes CFS, the job tracker, and the task trackers.
    • 10. Topology configuration
      # from conf/cassandra-topology.properties
      ###
      # Cassandra Node IP=Data Center:Rack

      10.20.114.10=DC-Analytics:Rack-1b
      10.20.114.11=DC-Analytics:Rack-1b
      10.20.114.12=DC-Analytics:Rack-2b

      10.0.0.10=DC-Realtime-East:Rack-1a
      10.0.0.11=DC-Realtime-East:Rack-1a
      10.0.0.12=DC-Realtime-East:Rack-2a

      10.21.119.13=DC-Realtime-West:Rack-1c
      10.21.119.14=DC-Realtime-West:Rack-1c
      10.21.119.15=DC-Realtime-West:Rack-2c

      # default for unknown nodes
      default=DC-Realtime-West:Rack-1c
    • 11. Configuration Priorities
      Data locality
      Data locality – no really, the biggest performance factor
      Memory needs
        Cassandra requires lots of memory.
        Hadoop requires lots of memory.
        Plan with your data model and analytics in mind.
      CPU needs
        Cassandra doesn't need a lot of CPU horsepower.
        Hadoop loves CPU cores.
      Interconnected
        Analytic nodes need to be close to one another.
    • 12. Cassandra/Hadoop properties
      Reference: org.apache.cassandra.hadoop.ConfigHelper.java
      Basics
        cassandra.thrift.address
        cassandra.thrift.port
        cassandra.partitioner.class
      Consistency
        cassandra.consistencylevel.read
        cassandra.consistencylevel.write
      Splits and batches
        cassandra.input.split.size
        cassandra.range.batch.size
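      One hedged way to supply these from a Pig script is with SET statements; depending on the Cassandra and Pig versions they may instead be passed as -D options or environment variables, so check the version you run. The values below are placeholders.

        SET cassandra.thrift.address '10.20.114.10';
        SET cassandra.thrift.port '9160';
        SET cassandra.partitioner.class 'org.apache.cassandra.dht.RandomPartitioner';
        SET cassandra.consistencylevel.read 'ONE';
        SET cassandra.consistencylevel.write 'ONE';
        SET cassandra.input.split.size '65536';
        SET cassandra.range.batch.size '1024';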
    • 13. Future Work
      Better data type handling (CASSANDRA-2777)
      MapReduce over subsets of rows (CASSANDRA-1600)
      MapReduce over secondary indexes (CASSANDRA-1600)
      Pig pushdown projection
      Pig pushdown filter
      HCatalog support for Cassandra
      Better Cassandra wide-row support (CASSANDRA-2688)
      Support for immutable/snapshot inputs (CASSANDRA-2527)
    • 14. Questions
      Contact info:
      Jeremy Hanna
      @jeromatron on Twitter
      jeremy.hanna1234 <at> gmail
      jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)
