Partners in Crime: Cassandra Analytics and ETL with Hadoop

Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.

The code is more interesting: http://github.com/stuhood/cassandra-summit-demo

Transcript

  • 1. Partners in Crime: Cassandra Analytics and ETL with Hadoop
      Cassandra Summit 2010
      Date: August 10th, 2010
  • 2. What is Hadoop?
      • Distributed processing framework (MapReduce)
          – Moves processing to the data
      • Distributed filesystem
          – Allows data to move when processing can't
  • 3. Why use Hadoop with Cassandra?
      Perfect partners for big data laundering
      • Cassandra optimized for access
      • Hadoop optimized for processing
          – Many analytics frameworks
          – Existing integrations
      • RDBMS → Hadoop → Cassandra
  • 4. Cluster Layouts
      • Existing Hadoop cluster?
          – Start Hadoop tasktrackers on Cassandra cluster
          – Processing performed on local nodes
  • 5. Cluster Layouts
      • No Hadoop cluster?
          – Start all Hadoop daemons on 2-3 nodes
              • MapReduce depends lightly on HDFS
          – Start Hadoop tasktrackers on Cassandra cluster
  • 6. Hadoop Integration Points
      • JVM MapReduce
          – Keys/values iterated in process
      • Hadoop Streaming
          – Performs IPC on stdin/stdout to arbitrary processes
      • Apache Pig
          – High level relational language (SQL alternative)
      • Apache Hive
          – Forthcoming support for Cassandra storage
  • 7. Demo
      • Code
          – github.com/stuhood/cassandra-summit-demo
      • Flow
          – Load with Hadoop Streaming
          – Analyze with Apache Pig
          – Load/Process with JVM MapReduce
  • 8. Hadoop Streaming Summary
      • Mapper/Reducer scripts
          – Any language
      • Script is moved to the data
      cat $input | mapper | sort | reducer > $output
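
The `cat $input | mapper | sort | reducer > $output` pipeline from slide 8 can be imitated locally in a few lines. The sketch below is not taken from the demo repository: the word-count task and the names `mapper`, `reducer`, and `run_pipeline` are assumptions chosen for illustration. It only shows the contract a streaming script follows: the mapper emits tab-separated key/value lines, the framework sorts them, and the reducer sees all records for one key together.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; relies on the records arriving sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_pipeline(lines):
    """Local stand-in for: cat $input | mapper | sort | reducer > $output"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_pipeline(["cassandra hadoop", "hadoop pig hadoop"]):
        print(record)
```

Under Hadoop Streaming the mapper and reducer would be separate scripts reading standard input and writing standard output; here they are plain generators so the whole flow can be run in one process.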
  • 9. ETL with Streaming
      • ETL to Cassandra in ~50 lines
      Load!
  • 10. ETL with Streaming
      1) Files in HDFS
      2) Hadoop Streaming
      3) bin/load-mapper.py (the code you write)
      4) Cassandra's Streaming Shim
      5) Cassandra
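
Step 3 of the flow above is the only part the user writes. The real script is `bin/load-mapper.py` in the demo repository and is not reproduced here; the slides also do not show the record layout that Cassandra's streaming shim expects. The sketch below therefore assumes a hypothetical tab-separated input (`user_id`, `name`, `email`) and a hypothetical JSON output record, purely to show the shape of such a mapper: parse a line, pick a row key, and emit the remaining fields as columns.

```python
import json

# Hypothetical input layout: user_id <TAB> name <TAB> email
FIELDS = ("name", "email")


def to_row(line):
    """Turn one tab-separated line into a (row_key, columns) pair, or None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS) + 1 or not parts[0]:
        return None
    return parts[0], dict(zip(FIELDS, parts[1:]))


def map_lines(lines):
    """Yield one JSON record per well-formed line (the output layout is an assumption)."""
    for line in lines:
        row = to_row(line)
        if row is not None:
            key, columns = row
            yield json.dumps({"key": key, "columns": columns}, sort_keys=True)


if __name__ == "__main__":
    # Under Hadoop Streaming this would iterate over sys.stdin instead of a sample list.
    sample = ["42\tAda\tada@example.com\n", "malformed line\n"]
    for record in map_lines(sample):
        print(record)
```

Malformed lines are skipped rather than raising, a common choice for bulk loads where a single bad record should not fail the whole task.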
  • 11. Apache Pig Summary
      • Declarative relational language
  • 12. Analytics with Pig
      • Analytics from Cassandra in ~20 lines
      Analyze!
  • 13. Analytics with Pig
      1) Data stored in Cassandra
      2) Cassandra's Pig LoadFunc
      3) bin/analyze.pig (the code you write)
      4) Files in HDFS
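
To give a feel for what a "declarative relational" analysis expresses, here is a small Python sketch of a group, count, order, and limit flow. It is not the contents of `bin/analyze.pig`, and it works on in-memory dictionaries rather than on rows read through Cassandra's Pig LoadFunc; the `state` column and the function name `top_values` are assumptions. The comments name the Pig operators each step roughly corresponds to.

```python
from collections import Counter


def top_values(rows, column, n):
    """Count rows per value of `column` and return the n most frequent (value, count) pairs."""
    # Roughly Pig's GROUP ... BY followed by FOREACH ... GENERATE group, COUNT(...).
    counts = Counter(row[column] for row in rows if column in row)
    # Roughly Pig's ORDER ... BY count DESC followed by LIMIT n.
    # Ties are broken by value so the result is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    rows = [
        {"key": "u1", "state": "TX"},
        {"key": "u2", "state": "CA"},
        {"key": "u3", "state": "TX"},
        {"key": "u4", "state": "NY"},
    ]
    print(top_values(rows, "state", 2))
```

In Pig the same intent is written as a handful of relational statements and the framework plans the MapReduce jobs; the Python version only runs on a single machine.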
  • 14. JVM MapReduce Summary
      • Extend Mapper/Reducer base classes
      • Hadoop:
          – Transports the Jar to nodes near the data
          – Efficiently streams data through
  • 15. Load/Process with MapReduce
      • Efficient bulk loading in ~80 lines
      Summarize!
  • 16. Load/Process with MapReduce
      1) Files in HDFS
      2) MapReduce
      3) Mapper/Reducer (the code you write)
      4) Cassandra's ColumnFamilyOutputFormat
      5) Cassandra
  • 17. Future Work
      • Pig Output
      • Hive
      • Hadoop Streaming Input
      • Optimizations
  • 18. Questions?
  • 19. References
      • Code available at
          – github.com/stuhood/cassandra-summit-demo
      • Open issues
          – CASSANDRA-1315
          – CASSANDRA-1322
          – CASSANDRA-1368
      • “Hadoop + Cassandra” - Jeremy Hanna
          – slideshare.net/jeromatron/cassandrahadoop-4399672