Partners in Crime: Cassandra Analytics and ETL with Hadoop
 


Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.


The code is more interesting: http://github.com/stuhood/cassandra-summit-demo

    Presentation Transcript

    • Partners in Crime: Cassandra Analytics and ETL with Hadoop
      Cassandra Summit 2010
      Date: August 10th, 2010
    • What is Hadoop?
      • Distributed processing framework (MapReduce)
        – Moves processing to the data
      • Distributed filesystem
        – Allows data to move when processing can't
    • Why use Hadoop with Cassandra?
      Perfect partners for big data laundering
      • Cassandra optimized for access
      • Hadoop optimized for processing
        – Many analytics frameworks
        – Existing integrations
      • RDBMS → Hadoop → Cassandra
    • Cluster Layouts
      • Existing Hadoop cluster?
        – Start Hadoop tasktrackers on Cassandra cluster
        – Processing performed on local nodes
    • Cluster Layouts
      • No Hadoop cluster?
        – Start all Hadoop daemons on 2-3 nodes
          • MapReduce depends lightly on HDFS
        – Start Hadoop tasktrackers on Cassandra cluster
    • Hadoop Integration Points
      • JVM MapReduce
        – Keys/values iterated in process
      • Hadoop Streaming
        – Performs IPC on stdin/stdout to arbitrary processes
      • Apache Pig
        – High level relational language (SQL alternative)
      • Apache Hive
        – Forthcoming support for Cassandra storage
    • Demo
      • Code
        – github.com/stuhood/cassandra-summit-demo
      • Flow
        – Load with Hadoop Streaming
        – Analyze with Apache Pig
        – Load/Process with JVM MapReduce
    • Hadoop Streaming Summary
      • Mapper/Reducer scripts
        – Any language
      • Script is moved to the data
          cat $input | mapper | sort | reducer > $output
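
The `cat | mapper | sort | reducer` pipeline on this slide can be imitated locally without a cluster. The sketch below is a plain-Python stand-in, not Hadoop itself: `mapper`, `reducer`, and `run_pipeline` are hypothetical names, and word count is only an illustrative job that does not come from the talk. It shows the contract Streaming relies on: the mapper emits tab-separated key/value lines, the lines are sorted, and the reducer then sees each key's lines contiguously.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; relies on the input being sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_pipeline(lines):
    """Local equivalent of: cat $input | mapper | sort | reducer > $output."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_pipeline(["a b a", "b c"]):
        print(out)
```

Because the reducer only assumes sorted input, the same two functions could in principle be split into separate scripts that read stdin and write stdout, which is the shape Streaming expects.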
    • ETL with Streaming
      • ETL to Cassandra in ~50 lines
      Load!
    • ETL with Streaming
      1) Files in HDFS
      2) Hadoop Streaming
      3) bin/load-mapper.py (the code you write)
      4) Cassandra's Streaming Shim
      5) Cassandra
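
Step 3 is the only part of this flow the user writes. The slides do not show the record format that Cassandra's streaming shim expects, so the sketch below stops at a neutral tab-separated line and makes no claim about the real `bin/load-mapper.py` in the demo repository. The input layout (`row_key,column,value`) and the names `parse_record` and `map_lines` are assumptions made for illustration.

```python
def parse_record(line):
    """Split one 'row_key,column,value' line; return None if it is malformed."""
    parts = [p.strip() for p in line.split(",", 2)]
    if len(parts) != 3 or not parts[0]:
        return None
    return tuple(parts)


def map_lines(lines):
    """Yield one tab-separated output line per well-formed input line."""
    for line in lines:
        record = parse_record(line)
        if record is not None:
            # Hypothetical output shape; the shim's actual format is not
            # given in the slides.
            yield "\t".join(record)
```

To run something like this under Streaming, a script would pass `sys.stdin` to `map_lines` and print each result, since Streaming feeds input on stdin and collects stdout. Malformed lines are dropped silently here; a real load job would more likely count or log them.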
    • Apache Pig Summary
      • Declarative relational language
    • Analytics with Pig
      • Analytics from Cassandra in ~20 lines
      Analyze!
    • Analytics with Pig
      1) Data stored in Cassandra
      2) Cassandra's Pig LoadFunc
      3) bin/analyze.pig (the code you write)
      4) Files in HDFS
    • JVM MapReduce Summary
      • Extend Mapper/Reducer base classes
      • Hadoop:
        – Transports the Jar to nodes near the data
        – Efficiently streams data through
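
Hadoop's real Mapper and Reducer are Java classes; the sketch below is only a Python analogue of the "extend the base classes" pattern, kept in the same language as the other examples here. Every name in it (`Mapper`, `Reducer`, `ActionCountMapper`, `SumReducer`, `run_job`) is hypothetical, the `user,action` input layout is an assumption, and the single-process driver stands in for everything Hadoop does to ship code to the data and stream records through it.

```python
from collections import defaultdict


class Mapper:
    """Base class: override map() to emit (key, value) pairs."""

    def map(self, key, value):
        raise NotImplementedError


class Reducer:
    """Base class: override reduce() to fold all values for one key."""

    def reduce(self, key, values):
        raise NotImplementedError


class ActionCountMapper(Mapper):
    def map(self, key, value):
        # value is a 'user,action' line; emit the action with a count of 1.
        _, action = value.split(",", 1)
        yield action, 1


class SumReducer(Reducer):
    def reduce(self, key, values):
        yield key, sum(values)


def run_job(records, mapper, reducer):
    """Toy single-process driver: map, group by key, reduce, collect output."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper.map(key, value):
            grouped[out_key].append(out_value)
    output = {}  # in-memory stand-in for the job's output sink
    for key in sorted(grouped):
        for out_key, out_value in reducer.reduce(key, grouped[key]):
            output[out_key] = out_value
    return output
```

The user-written part is limited to the two small subclasses; the driver, grouping, and output handling are what the framework supplies in a real job.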
    • Load/Process with MapReduce
      • Efficient bulk loading in ~80 lines
      Summarize!
    • Load/Process with MapReduce
      1) Files in HDFS
      2) MapReduce
      3) Mapper/Reducer (the code you write)
      4) Cassandra's ColumnFamilyOutputFormat
      5) Cassandra
    • Future Work
      • Pig Output
      • Hive
      • Hadoop Streaming Input
      • Optimizations
    • Questions?
    • References
      • Code available at
        – github.com/stuhood/cassandra-summit-demo
      • Open issues
        – CASSANDRA-1315
        – CASSANDRA-1322
        – CASSANDRA-1368
      • “Hadoop + Cassandra” - Jeremy Hanna
        – slideshare.net/jeromatron/cassandrahadoop-4399672