Pig with Cassandra: Adventures in Analytics

This presentation was given at Cassandra SF 2011. I go over how to use Pig with Apache Cassandra, including things we've learned putting a system into production at The Dachis Group.

  • Make this section interactive:
      How many are using Cassandra – find out why, what types of data
      How many are using Hadoop – what types of data
      What they would like to get from that data
  • Mention Jacob’s involvement
  • Transcript

    • 1. Pig with Cassandra
      Adventures in Analytics
    • 2. Motivation
      What’s our need?
      How do we get at data in Cassandra with ad-hoc queries?
      Don’t reinvent the wheel
    • 3. Enter Pig
      Pig was created at Yahoo! as an abstraction for MapReduce
      Designed to eat anything
      Load/store function (CassandraStorage) created for Cassandra
    • 4. How it works
      Perform queries over all rows in a column family or set of column families
      Intermediate results stored in HDFS or CFS
      Can mix and match inputs and outputs
    • 5. Uses
      Analytics
      Data exploration
        How many items did I get from New Jersey?
      Data validation
        How many items were missing a field and when were they created?
      Data correction
        Company name correction over all data
      Expand Cassandra data model
        Make a new column family for querying by US State and back-populate with Pig
      Bootstrap local dev environment
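      The data exploration and data validation questions above can be sketched in plain Python to show the shape of the computation. This is not Pig Latin and not the CassandraStorage API. It only assumes that each row arrives as a key plus a list of (name, value) columns; the column names `state` and `created_at` and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (row key, list of (column name, value) pairs).
ROWS = [
    ("item1", [("state", "NJ"), ("created_at", "2011-06-01")]),
    ("item2", [("state", "TX"), ("created_at", "2011-06-02")]),
    ("item3", [("created_at", "2011-06-03")]),  # missing "state"
    ("item4", [("state", "NJ"), ("created_at", "2011-06-04")]),
]


def to_record(key, columns):
    """Flatten one row's columns into a dict, keeping the row key."""
    record = dict(columns)
    record["key"] = key
    return record


def count_by(records, field):
    """Count records per value of `field`, skipping records without it."""
    return Counter(r[field] for r in records if field in r)


def missing_field(records, field):
    """Return (key, created_at) for every record that lacks `field`."""
    return [(r["key"], r.get("created_at")) for r in records if field not in r]


records = [to_record(key, columns) for key, columns in ROWS]
items_per_state = count_by(records, "state")         # data exploration
rows_missing_state = missing_field(records, "state")  # data validation
```

      In a real job the same steps would run as a Pig script over every row of the column family, with the flattening step handled by a UDF rather than by hand.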
    • 6. Pygmalion
      Figure in Greek mythology, sounds like Pig
      UDFs and example scripts for using Pig with Cassandra
      Used in production at The Dachis Group
      https://github.com/jeromatron/pygmalion/
    • 7. Digging in the Dirt
      Pygmalion basic examples
    • 8. Tips
      Develop incrementally
      Output intermediate data frequently to verify
      Validate data on input if possible
      Use Cassandra data type validation for inputs and outputs
      Pygmalion for tabular data
      Penny (tooling for monitoring and debugging Pig scripts) in Pig 0.9!
    • 9. Cluster Configuration
      Split cluster – virtual datacenters
      Brisk (built-in Pig support in 1.0 beta 2+)
      Task trackers on all analytic nodes
      With HDFS:
        Separate namenode/jobtracker
        Data nodes on all analytic nodes
        A few settings to bridge the two
        Start the server processes
        Distributed cache and intermediate data
      With Brisk:
        Startup includes CFS, job tracker, and task trackers
    • 10. Topology configuration
      # from conf/cassandra-topology.properties
      ###
      # Cassandra Node IP=Data Center:Rack
       
      10.20.114.10=DC-Analytics:Rack-1b
      10.20.114.11=DC-Analytics:Rack-1b
      10.20.114.12=DC-Analytics:Rack-2b
       
      10.0.0.10=DC-Realtime-East:Rack-1a
      10.0.0.11=DC-Realtime-East:Rack-1a
      10.0.0.12=DC-Realtime-East:Rack-2a
       
      10.21.119.13=DC-Realtime-West:Rack-1c
      10.21.119.14=DC-Realtime-West:Rack-1c
      10.21.119.15=DC-Realtime-West:Rack-2c
       
      # default for unknown nodes
      default=DC-Realtime-West:Rack-1c
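      To make the mapping in this file concrete, here is a minimal Python sketch that reads the `ip=Data Center:Rack` format and resolves a node to its virtual datacenter, falling back to the `default` entry. It is not Cassandra's snitch code; the function names `parse_topology`, `locate`, and `nodes_in` are hypothetical, and the sample text is a shortened copy of the slide.

```python
TOPOLOGY = """\
# Cassandra Node IP=Data Center:Rack
10.20.114.10=DC-Analytics:Rack-1b
10.0.0.10=DC-Realtime-East:Rack-1a
10.21.119.13=DC-Realtime-West:Rack-1c
# default for unknown nodes
default=DC-Realtime-West:Rack-1c
"""


def parse_topology(text):
    """Parse 'ip=DC:Rack' lines into {ip: (datacenter, rack)}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node, _, location = line.partition("=")
        datacenter, _, rack = location.partition(":")
        mapping[node.strip()] = (datacenter.strip(), rack.strip())
    return mapping


def locate(mapping, ip):
    """Return (datacenter, rack) for ip, or the 'default' entry if unknown."""
    return mapping.get(ip, mapping.get("default"))


def nodes_in(mapping, datacenter):
    """List the node IPs assigned to one datacenter, excluding 'default'."""
    return sorted(
        ip for ip, (dc, _rack) in mapping.items()
        if dc == datacenter and ip != "default"
    )


topology = parse_topology(TOPOLOGY)
analytics_nodes = nodes_in(topology, "DC-Analytics")
```

      Listing the analytics nodes this way is a quick check that the nodes meant to run task trackers are the ones assigned to the analytics datacenter.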
    • 11. Configuration Priorities
      Data locality
      Data locality – no really, biggest performance factor
      Memory needs
        Cassandra requires lots of memory
        Hadoop requires lots of memory
        Plan with your data model and analytics in mind
      CPU needs
        Cassandra doesn’t need a lot of CPU horsepower
        Hadoop loves CPU cores
      Interconnected
        Analytic nodes need to be close to one another
    • 12. Cassandra/Hadoop properties
      Reference: org.apache.cassandra.hadoop.ConfigHelper.java
      Basics
        cassandra.thrift.address
        cassandra.thrift.port
        cassandra.partitioner.class
      Consistency
        cassandra.consistencylevel.read
        cassandra.consistencylevel.write
      Splits and batches
        cassandra.input.split.size
        cassandra.range.batch.size
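      As a rough illustration of how these properties fit together, the Python sketch below collects them in one place, checks the numeric ones, and renders them as `-Dname=value` flags. The property names come from the slide; every value is an assumption for illustration, the helper names are hypothetical, and whether your launcher accepts the flags in this form should be verified for your setup.

```python
# Property names are from the slide; every value below is an assumption
# for illustration and should be checked against your own cluster.
JOB_PROPERTIES = {
    "cassandra.thrift.address": "10.20.114.10",
    "cassandra.thrift.port": "9160",
    "cassandra.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
    "cassandra.consistencylevel.read": "ONE",
    "cassandra.consistencylevel.write": "ONE",
    "cassandra.input.split.size": "65536",
    "cassandra.range.batch.size": "4096",
}

NUMERIC_PROPERTIES = [
    "cassandra.thrift.port",
    "cassandra.input.split.size",
    "cassandra.range.batch.size",
]


def invalid_numeric(properties, names):
    """Return the names whose values are missing or not positive integers."""
    bad = []
    for name in names:
        value = properties.get(name, "")
        if not value.isdigit() or int(value) <= 0:
            bad.append(name)
    return bad


def as_java_options(properties):
    """Render properties as -Dname=value flags, sorted by name."""
    return " ".join(
        f"-D{name}={value}" for name, value in sorted(properties.items())
    )


problems = invalid_numeric(JOB_PROPERTIES, NUMERIC_PROPERTIES)
options = as_java_options(JOB_PROPERTIES)
```

      Keeping the settings in one mapping makes it easier to compare split and batch sizes between runs when tuning a job.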
    • 13. Future Work
      Better data type handling (Cassandra-2777)
      MapReduce over subsets of rows (Cassandra-1600)
      MapReduce over secondary indexes (Cassandra-1600)
      Pig pushdown projection
      Pig pushdown filter
      HCatalog support for Cassandra
      Better Cassandra wide-row support (Cassandra-2688)
      Support for immutable/snapshot inputs (Cassandra-2527)
    • 14. Questions
      Contact info
      Jeremy Hanna
      @jeromatron on twitter
      jeremy.hanna1234 <at> gmail
      jeromatron on irc (in #cassandra, #hadoop-pig, and #hadoop)