Pig with Cassandra: Adventures in Analytics
This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.

  • Speaker note: make this section interactive – how many are using Cassandra (find out why, and what types of data), how many are using Hadoop (and with what types of data), and what they would like to get from that data
  • Speaker note: mention Jacob's involvement

Pig with Cassandra: Adventures in Analytics Presentation Transcript

  • 1. Pig with Cassandra
    Adventures in Analytics
  • 2. Motivation
    What’s our need?
    How do we run ad hoc queries against data in Cassandra?
    Don’t reinvent the wheel
  • 3. Enter Pig
    Pig was created at Yahoo! as an abstraction for MapReduce
    Designed to eat anything
    LoadFunc/StoreFunc (CassandraStorage) created for Cassandra
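
As a rough sketch of what that LoadFunc/StoreFunc looks like in use (the keyspace and column family names here are invented, and the connection settings are assumed to come from the cassandra.thrift.* properties covered on the configuration slides):

    -- Minimal sketch: 'MyKeyspace' and 'MyColumnFamily' are placeholders.
    rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage()
           AS (key, columns: bag {T: tuple (name, value)});

    -- Each row comes back as its row key plus a bag of (name, value) column tuples.
    DUMP rows;
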
  • 4. How it works
    Perform queries over all rows in a column family or set of column families
    Intermediate results stored in HDFS or CFS
    Can mix and match inputs and outputs
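
A small sketch of the mix-and-match part, with invented names and paths: read from Cassandra, park an intermediate result in HDFS (or CFS under Brisk), and start a later step from that snapshot.

    -- Read rows out of Cassandra ...
    raw = LOAD 'cassandra://MyKeyspace/Items'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage()
          AS (key, columns: bag {T: tuple (name, value)});

    -- ... store an intermediate result as plain delimited text in HDFS/CFS ...
    STORE raw INTO '/tmp/items_snapshot' USING PigStorage();

    -- ... and a later script can pick up from the snapshot instead of Cassandra.
    snapshot = LOAD '/tmp/items_snapshot' USING PigStorage();
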
  • 5. Uses
    Analytics
    Data exploration
    How many items did I get from New Jersey?
    Data validation
    How many items were missing a field and when were they created?
    Data correction
    Company name correction over all data
    Expand Cassandra data model
    Make a new column family for querying by US State and back-populate with Pig
    Bootstrap local dev environment
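
The data-exploration question above ("How many items did I get from New Jersey?") comes down to a few lines of Pig once the columns are flattened into fields; the path, field names, and 'NJ' encoding below are invented for illustration.

    -- Assumes items were already flattened to one field per column
    -- (see the Pygmalion sketch further down); names are placeholders.
    items = LOAD '/tmp/items_flat' USING PigStorage()
            AS (id: chararray, company: chararray, state: chararray);

    nj = FILTER items BY state == 'NJ';
    nj_all = GROUP nj ALL;
    nj_count = FOREACH nj_all GENERATE COUNT(nj);
    DUMP nj_count;
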
  • 6. Pygmalion
    Figure in Greek mythology, sounds like Pig
    UDFs and example scripts for using Pig with Cassandra
    Used in production at The Dachis Group
    https://github.com/jeromatron/pygmalion/
  • 7. Digging in the Dirt
    Pygmalion basic examples
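
Along the lines of those basic examples, here is a sketch of the Pygmalion-style flow. The FromCassandraBag UDF and its package are written from memory of the project, everything else (jar path, keyspace, column families, fields) is invented, and the exact tuple layout CassandraStorage expects on store is worth double-checking against the Pygmalion README.

    -- Sketch only: paths and names are placeholders.
    register /path/to/pygmalion.jar;
    define FromCassandraBag org.pygmalion.udf.FromCassandraBag();

    raw = LOAD 'cassandra://MyKeyspace/Items'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage()
          AS (key, columns: bag {T: tuple (name, value)});

    -- Turn the bag of (name, value) columns into ordinary tabular fields.
    items = FOREACH raw GENERATE
              key,
              FLATTEN(FromCassandraBag('company,state', columns))
                AS (company: chararray, state: chararray);

    -- Back-populate a column family keyed by US state (the "expand the data
    -- model" use case): one wide row per state, one column per item. Pygmalion's
    -- ToCassandraBag helps build output bags like this; the builtin TOBAG/TOTUPLE
    -- form is used here to avoid guessing its exact signature.
    by_state = FOREACH items GENERATE state, TOBAG(TOTUPLE(key, company));
    STORE by_state INTO 'cassandra://MyKeyspace/ItemsByState'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage();
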
  • 8. Tips
    Develop incrementally
    Output intermediate data frequently to verify
    Validate data on input if possible
    Use Cassandra data type validation for inputs and outputs
    Pygmalion for tabular data
    Penny in Pig 0.9!
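
Concretely, developing incrementally mostly means leaning on Pig's diagnostic operators between steps; a small illustration using the relations from the sketches above:

    -- Check the schema Pig has inferred before building the next step on it.
    DESCRIBE items;

    -- Spot-check a handful of records instead of dumping a whole column family.
    sample_items = LIMIT items 10;
    DUMP sample_items;

    -- Checkpoint an intermediate relation so later steps can be verified and
    -- re-run without re-reading Cassandra.
    STORE items INTO '/tmp/items_checkpoint' USING PigStorage();
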
  • 9. Cluster Configuration
    Split cluster – virtual datacenters
    Brisk (built-in Pig support in 1.0 beta 2+)
    Task trackers on all analytic nodes
    With HDFS:
    Separate namenode/jobtracker
    Data nodes on all analytic nodes
    A few settings to bridge the two
    Start the server processes
    Distributed cache and intermediate data
    With Brisk:
    Startup includes CFS, job tracker, and task trackers
  • 10. Topology configuration
    # from conf/cassandra-topology.properties
    ###
    # Cassandra Node IP=Data Center:Rack
     
    10.20.114.10=DC-Analytics:Rack-1b
    10.20.114.11=DC-Analytics:Rack-1b
    10.20.114.12=DC-Analytics:Rack-2b
     
    10.0.0.10=DC-Realtime-East:Rack-1a
    10.0.0.11=DC-Realtime-East:Rack-1a
    10.0.0.12=DC-Realtime-East:Rack-2a
     
    10.21.119.13=DC-Realtime-West:Rack-1c
    10.21.119.14=DC-Realtime-West:Rack-1c
    10.21.119.15=DC-Realtime-West:Rack-2c
     
    # default for unknown nodes
    default=DC-Realtime-West:Rack-1c
  • 11. Configuration Priorities
    Data locality
    Data locality – no really, biggest performance factor
    Memory needs
    Cassandra requires lots of memory
    Hadoop requires lots of memory
    Plan with your data model and analytics in mind
    CPU needs
    Cassandra doesn’t need a lot of CPU horsepower
    Hadoop loves CPU cores
    Interconnected
    Analytic nodes need to be close to one another
  • 12. Cassandra/Hadoop properties
    Reference: org.apache.cassandra.hadoop.ConfigHelper.java
    Basics
    cassandra.thrift.address
    cassandra.thrift.port
    cassandra.partitioner.class
    Consistency
    cassandra.consistencylevel.read
    cassandra.consistencylevel.write
    Splits and batches
    cassandra.input.split.size
    cassandra.range.batch.size
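
One way to supply these from a script is Pig's set statement, which places properties into the job configuration; the property names are the ones listed above, while the values below are purely illustrative assumptions.

    -- Illustrative values only; adjust for your cluster.
    set cassandra.thrift.address '10.20.114.10';
    set cassandra.thrift.port '9160';
    set cassandra.partitioner.class 'org.apache.cassandra.dht.RandomPartitioner';
    set cassandra.consistencylevel.read 'ONE';
    set cassandra.consistencylevel.write 'ONE';
    set cassandra.input.split.size '65536';
    set cassandra.range.batch.size '1024';
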
  • 13. Future Work
    Better data type handling (CASSANDRA-2777)
    MapReduce over subsets of rows (CASSANDRA-1600)
    MapReduce over secondary indexes (CASSANDRA-1600)
    Pig pushdown projection
    Pig pushdown filter
    HCatalog support for Cassandra
    Better Cassandra wide-row support (CASSANDRA-2688)
    Support for immutable/snapshot inputs (CASSANDRA-2527)
  • 14. Questions
    Contact info
    Jeremy Hanna
    @jeromatron on Twitter
    jeremy.hanna1234 <at> gmail
    jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)