Pig with Cassandra: Adventures in Analytics

Motivation
  • What’s our need?
  • How do we get at data in Cassandra with ad-hoc queries?
  • Don’t reinvent the wheel

Enter Pig
  • Pig was created at Yahoo! as an abstraction for MapReduce
  • Designed to eat anything
  • LoadStoreFunc created for Cassandra

How it works
  • Perform queries over all rows in a column family or set of column families
  • Intermediate results stored in HDFS or CFS
  • Can mix and match inputs and outputs

Uses
  • Analytics
  • Data exploration
      • How many items did I get from New Jersey?
  • Data validation
      • How many items were missing a field and when were they created?
  • Data correction
      • Company name correction over all data
  • Expand Cassandra data model
      • Make a new column family for querying by US State and back-populate with Pig
  • Bootstrap local dev environment
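
The exploration and validation questions above boil down to a full scan followed by a filter, a group, and a count. The sketch below is a small in-memory stand-in written in Python, not a Pig script; the row layout and the column names `state`, `company`, and `created` are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical rows: row key -> {column name: value}, roughly how a
# column family might look once loaded. All names and values are made up.
rows = {
    "item1": {"state": "NJ", "company": "Acme", "created": "2011-06-01"},
    "item2": {"state": "NY", "company": "Acme", "created": "2011-06-01"},
    "item3": {"state": "NJ", "created": "2011-06-02"},
    "item4": {"state": "TX", "created": "2011-06-02"},
}


def count_where(rows, column, value):
    """Data exploration: count rows whose `column` equals `value`."""
    return sum(1 for cols in rows.values() if cols.get(column) == value)


def missing_by_created(rows, required):
    """Data validation: count rows lacking `required`, grouped by creation date."""
    return Counter(
        cols.get("created", "unknown")
        for cols in rows.values()
        if required not in cols
    )


# "How many items did I get from New Jersey?"
nj_items = count_where(rows, "state", "NJ")

# "How many items were missing a field and when were they created?"
missing_company = missing_by_created(rows, "company")
```

In a real job the scan would run as MapReduce tasks over the column family rather than over a Python dict; only the shape of the query carries over.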

Pygmalion
  • Figure in Greek mythology, sounds like Pig
  • UDFs, example scripts for using Pig with Cassandra
  • Used in production at The Dachis Group
  • https://github.com/jeromatron/pygmalion/

Digging in the Dirt
  • Pygmalion basic examples

Tips
  • Develop incrementally
  • Output intermediate data frequently to verify
  • Validate data on input if possible
  • Use Cassandra data type validation for inputs and outputs
  • Pygmalion for tabular data
  • Penny in Pig 0.9!

Cluster Configuration
  • Split cluster: virtual datacenters
  • Brisk (built-in Pig support in 1.0 beta 2+)
  • Task trackers on all analytic nodes
  • With HDFS:
      • Separate namenode/jobtracker
      • Data nodes on all analytic nodes
      • A few settings to bridge the two
      • Start the server processes
      • Distributed cache and intermediate data
  • With Brisk:
      • Startup includes CFS, job tracker, and task trackers

Topology configuration
    # from conf/cassandra-topology.properties
    #
    # Cassandra Node IP=Data Center:Rack
    10.20.114.10=DC-Analytics:Rack-1b
    10.20.114.11=DC-Analytics:Rack-1b
    10.20.114.12=DC-Analytics:Rack-2b

    10.0.0.10=DC-Realtime-East:Rack-1a
    10.0.0.11=DC-Realtime-East:Rack-1a
    10.0.0.12=DC-Realtime-East:Rack-2a

    10.21.119.13=DC-Realtime-West:Rack-1c
    10.21.119.14=DC-Realtime-West:Rack-1c
    10.21.119.15=DC-Realtime-West:Rack-2c

    # default for unknown nodes
    default=DC-Realtime-West:Rack-1c
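
To make the file format concrete, here is a minimal Python sketch that reads `ip=DataCenter:Rack` lines and falls back to the `default` entry for unknown nodes. It only illustrates the format shown on the slide; it is not how Cassandra itself loads the file, and the helper names `parse_topology` and `locate` are hypothetical.

```python
def parse_topology(text):
    """Parse 'ip=DataCenter:Rack' lines; return (nodes, default)."""
    nodes, default = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        dc, _, rack = value.partition(":")
        location = (dc.strip(), rack.strip())
        if key.strip() == "default":
            default = location
        else:
            nodes[key.strip()] = location
    return nodes, default


def locate(ip, nodes, default):
    """Return (datacenter, rack) for ip, or the default for unknown nodes."""
    return nodes.get(ip, default)


# A shortened copy of the topology from the slide.
SAMPLE = """\
# Cassandra Node IP=Data Center:Rack
10.20.114.10=DC-Analytics:Rack-1b
10.0.0.10=DC-Realtime-East:Rack-1a
# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""

nodes, default = parse_topology(SAMPLE)

# Nodes in the analytics datacenter, i.e. where task trackers would run.
analytics = sorted(ip for ip, (dc, _) in nodes.items() if dc == "DC-Analytics")
```

Listing the nodes per datacenter this way is a quick sanity check that the analytic nodes are grouped apart from the realtime ones.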

Configuration Priorities
  • Data locality
      • Data locality: no really, biggest performance factor
  • Memory needs
      • Cassandra requires lots of memory
      • Hadoop requires lots of memory
      • Plan with your data model and analytics in mind
  • CPU needs
      • Cassandra doesn’t need a lot of CPU horsepower
      • Hadoop loves CPU cores
  • Interconnected
      • Analytic nodes need to be close to one another

Cassandra/Hadoop properties
  • Reference: org.apache.cassandra.hadoop.ConfigHelper.java
  • Basics
      • cassandra.thrift.address
      • cassandra.thrift.port
      • cassandra.partitioner.class
  • Consistency
      • cassandra.consistencylevel.read
      • cassandra.consistencylevel.write
  • Splits and batches
      • cassandra.input.split.size
      • cassandra.range.batch.size
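
As a rough illustration of how these properties might be collected in one place, the Python sketch below renders them in the `<property>` layout used by Hadoop configuration files. The property names come from the slide; every value is a placeholder assumed for this example, not a recommendation, and the helper `to_hadoop_xml` is hypothetical.

```python
from xml.sax.saxutils import escape

# Property names are from the slide; all values are illustrative placeholders.
settings = {
    "cassandra.thrift.address": "localhost",
    "cassandra.thrift.port": "9160",
    "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
    "cassandra.consistencylevel.read": "ONE",
    "cassandra.consistencylevel.write": "ONE",
    "cassandra.input.split.size": "65536",
    "cassandra.range.batch.size": "4096",
}


def to_hadoop_xml(props):
    """Render a dict as Hadoop-style <configuration> XML text."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


xml_text = to_hadoop_xml(settings)
print(xml_text)
```

Check ConfigHelper in the Cassandra version you run before relying on any of these names or values, since they can change between releases.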

Future Work
  • Better data type handling (Cassandra-2777)
  • MapReduce over subsets of rows (Cassandra-1600)
  • MapReduce over secondary indexes (Cassandra-1600)
  • Pig pushdown projection
  • Pig pushdown filter
  • HCatalog support for Cassandra
  • Better Cassandra wide-row support (Cassandra-2688)
  • Support for immutable/snapshot inputs (Cassandra-2527)

Questions
  • Contact info
      • Jeremy Hanna
      • @jeromatron on Twitter
      • jeremy.hanna1234 <at> gmail
      • jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)


Editor's Notes

  • #3 Make this section interactive
      • How many are using Cassandra: find out why, what types of data
      • How many are using Hadoop: what types of data
      • What they would like to get from that data
  • #7 Mention Jacob’s involvement