Pig with Cassandra: Adventures in Analytics


Motivation
- What’s our need?
- How do we get at data in Cassandra with ad-hoc queries?
- Don’t reinvent the wheel

Enter Pig
- Pig was created at Yahoo! as an abstraction for MapReduce
- Designed to eat anything
- LoadStoreFunc created for Cassandra

How it works
- Perform queries over all rows in a column family or set of column families
- Intermediate results stored in HDFS or CFS
- Can mix and match inputs and outputs

Uses
- Analytics
- Data exploration
  - How many items did I get from New Jersey?
- Data validation
  - How many items were missing a field and when were they created?
- Data correction
  - Company name correction over all data
- Expand Cassandra data model
  - Make a new column family for querying by US State and back-populate with Pig
- Bootstrap local dev environment
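The exploration and validation questions above come down to filter, group, and count steps. The following is a minimal sketch of that dataflow in plain Python, not a Pig script: the rows and the field names `state`, `company`, and `created` are hypothetical and stand in for columns read from a column family. In Pig the same steps would be expressed with relational operators such as FILTER and GROUP and run over the whole column family.

```python
from collections import Counter

# Hypothetical rows standing in for columns read from a column family.
rows = [
    {"key": "item1", "state": "NJ", "company": "Acme", "created": "2011-05-01"},
    {"key": "item2", "state": "NY", "company": "Acme", "created": "2011-05-02"},
    {"key": "item3", "state": "NJ", "created": "2011-05-03"},  # no company column
    {"key": "item4", "state": "NJ", "company": "Initech", "created": "2011-05-03"},
]


def count_by_state(rows):
    """Data exploration: group rows by state and count each group."""
    return Counter(row["state"] for row in rows if "state" in row)


def missing_field(rows, field):
    """Data validation: (key, created) for every row that lacks `field`."""
    return [(row["key"], row.get("created")) for row in rows if field not in row]


state_counts = count_by_state(rows)
missing_company = missing_field(rows, "company")

print(state_counts["NJ"])
print(missing_company)
```

Checking each step on a handful of rows like this mirrors the "develop incrementally" advice: confirm the small case before running over all the data.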

Pygmalion
- Figure in Greek mythology, sounds like Pig
- UDFs and example scripts for using Pig with Cassandra
- Used in production at The Dachis Group
- https://github.com/jeromatron/pygmalion/

Digging in the Dirt
- Pygmalion basic examples

Tips
- Develop incrementally
- Output intermediate data frequently to verify
- Validate data on input if possible
- Use Cassandra data type validation for inputs and outputs
- Pygmalion for tabular data
- Penny in Pig 0.9!

Cluster Configuration
- Split cluster – virtual datacenters
- Brisk (built-in Pig support in 1.0 beta 2+)
- Task trackers on all analytic nodes
- With HDFS:
  - Separate namenode/jobtracker
  - Data nodes on all analytic nodes
  - A few settings to bridge the two
  - Start the server processes
  - Distributed cache and intermediate data
- With Brisk:
  - Startup includes CFS, job tracker, and task trackers

Topology configuration

    # from conf/cassandra-topology.properties
    ###
    # Cassandra Node IP=Data Center:Rack
    10.20.114.10=DC-Analytics:Rack-1b
    10.20.114.11=DC-Analytics:Rack-1b
    10.20.114.12=DC-Analytics:Rack-2b

    10.0.0.10=DC-Realtime-East:Rack-1a
    10.0.0.11=DC-Realtime-East:Rack-1a
    10.0.0.12=DC-Realtime-East:Rack-2a

    10.21.119.13=DC-Realtime-West:Rack-1c
    10.21.119.14=DC-Realtime-West:Rack-1c
    10.21.119.15=DC-Realtime-West:Rack-2c

    # default for unknown nodes
    default=DC-Realtime-West:Rack-1c
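To make the mapping concrete, here is a minimal sketch of how a file in this `IP=Data Center:Rack` format resolves a node to its datacenter and rack, including the `default` fallback for unknown nodes. It is illustrative Python, not Cassandra's own snitch code; the helper names `parse_topology` and `locate` are hypothetical, and only a few of the entries from the slide are included.

```python
# A subset of the entries from the slide, in the same IP=Data Center:Rack format.
TOPOLOGY = """\
# Cassandra Node IP=Data Center:Rack
10.20.114.10=DC-Analytics:Rack-1b
10.0.0.10=DC-Realtime-East:Rack-1a
10.21.119.13=DC-Realtime-West:Rack-1c
# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text):
    """Return {node: (datacenter, rack)}, skipping comments and blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, _, location = line.partition("=")
        datacenter, _, rack = location.partition(":")
        mapping[node.strip()] = (datacenter.strip(), rack.strip())
    return mapping


def locate(mapping, ip):
    """Look up a node, falling back to the 'default' entry for unknown IPs."""
    return mapping.get(ip, mapping["default"])


topology = parse_topology(TOPOLOGY)
print(locate(topology, "10.20.114.10"))
print(locate(topology, "192.168.1.1"))
```

The point of the split cluster is visible in the data: analytic nodes sit in their own virtual datacenter (`DC-Analytics`), separate from the two realtime datacenters.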

Configuration Priorities
- Data locality
  - Data locality – no really, biggest performance factor
- Memory needs
  - Cassandra requires lots of memory
  - Hadoop requires lots of memory
  - Plan with your data model and analytics in mind
- CPU needs
  - Cassandra doesn’t need a lot of CPU horsepower
  - Hadoop loves CPU cores
- Interconnected
  - Analytic nodes need to be close to one another

Cassandra/Hadoop properties
- Reference: org.apache.cassandra.hadoop.ConfigHelper.java
- Basics
  - cassandra.thrift.address
  - cassandra.thrift.port
  - cassandra.partitioner.class
- Consistency
  - cassandra.consistencylevel.read
  - cassandra.consistencylevel.write
- Splits and batches
  - cassandra.input.split.size
  - cassandra.range.batch.size
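As a rough illustration of how these property names might be collected for a job, the sketch below renders them in Hadoop's `<configuration>`/`<property>` XML layout. The property names come from the slide; every value is an illustrative assumption (address, port, partitioner, consistency levels, and sizes should be checked against your own cluster and against ConfigHelper), and the helper `to_hadoop_xml` is hypothetical. Whether you set these in a configuration file or on the job itself depends on your setup.

```python
from xml.sax.saxutils import escape

# Property names are from the slide; all values are illustrative assumptions.
job_properties = {
    "cassandra.thrift.address": "10.20.114.10",
    "cassandra.thrift.port": "9160",
    "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
    "cassandra.consistencylevel.read": "ONE",
    "cassandra.consistencylevel.write": "ONE",
    "cassandra.input.split.size": "65536",
    "cassandra.range.batch.size": "4096",
}


def to_hadoop_xml(properties):
    """Render name/value pairs in Hadoop's configuration XML layout."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


config_xml = to_hadoop_xml(job_properties)
print(config_xml)
```

Keeping the settings in one mapping makes it easy to review the split and batch sizes alongside the consistency levels when tuning a job.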

Future Work
- Better data type handling (Cassandra-2777)
- MapReduce over subsets of rows (Cassandra-1600)
- MapReduce over secondary indexes (Cassandra-1600)
- Pig pushdown projection
- Pig pushdown filter
- HCatalog support for Cassandra
- Better Cassandra wide-row support (Cassandra-2688)
- Support for immutable/snapshot inputs (Cassandra-2527)

Questions
- Contact info
  - Jeremy Hanna
  - @jeromatron on twitter
  - jeremy.hanna1234 <at> gmail
  - jeromatron on irc (in #cassandra, #hadoop-pig, and #hadoop)
