Partners in Crime
Cassandra Analytics and ETL with Hadoop




Cassandra Summit 2010

Date: August 10th, 2010
Partners in Crime: Cassandra Analytics and ETL with Hadoop

Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.

The code is more interesting: http://github.com/stuhood/cassandra-summit-demo

Published in: Technology

Transcript of "Partners in Crime: Cassandra Analytics and ETL with Hadoop"

  1. Partners in Crime: Cassandra Analytics and ETL with Hadoop
     Cassandra Summit 2010
     Date: August 10th, 2010
  2. What is Hadoop?
     • Distributed processing framework (MapReduce)
       – Moves processing to the data
     • Distributed filesystem
       – Allows data to move when processing can't
  3. Why use Hadoop with Cassandra?
     Perfect partners for big data laundering
     • Cassandra optimized for access
     • Hadoop optimized for processing
       – Many analytics frameworks
       – Existing integrations
           • RDBMS → Hadoop → Cassandra
  4. Cluster Layouts
     • Existing Hadoop cluster?
       – Start Hadoop tasktrackers on Cassandra cluster
       – Processing performed on local nodes
  5. Cluster Layouts
     • No Hadoop cluster?
       – Start all Hadoop daemons on 2-3 nodes
           • MapReduce depends lightly on HDFS
       – Start Hadoop tasktrackers on Cassandra cluster
  6. Hadoop Integration Points
     • JVM MapReduce
       – Keys/values iterated in process
     • Hadoop Streaming
       – Performs IPC on stdin/stdout to arbitrary processes
     • Apache Pig
       – High level relational language (SQL alternative)
     • Apache Hive
       – Forthcoming support for Cassandra storage
  7. Demo
     • Code
       – github.com/stuhood/cassandra-summit-demo
     • Flow
       – Load with Hadoop Streaming
       – Analyze with Apache Pig
       – Load/Process with JVM MapReduce
  8. Hadoop Streaming Summary
     • Mapper/Reducer scripts
       – Any language
     • Script is moved to the data

      cat $input | mapper | sort | reducer > $output
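
The `cat $input | mapper | sort | reducer > $output` pipeline on this slide can be imitated in a single process to see what each stage contributes. The following is a minimal sketch, assuming a word-count job and tab-separated `key<TAB>value` records; the function names (`mapper`, `reducer`, `run_pipeline`) and the sample input are illustrative and do not come from the demo repository.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; relies on records arriving sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_pipeline(lines):
    """In-process stand-in for: cat $input | mapper | sort | reducer > $output"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Made-up input lines, standing in for the contents of $input.
    for record in run_pipeline(["cassandra hadoop", "hadoop pig hadoop"]):
        print(record)
```

The `sorted` call plays the role of the sort (shuffle) stage: the reducer only works because all records for a key reach it contiguously.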
  9. ETL with Streaming
     • ETL to Cassandra in ~50 lines
     Load!
  10. ETL with Streaming
      1) Files in HDFS
      2) Hadoop Streaming
      3) bin/load-mapper.py (the code you write)
      4) Cassandra's Streaming Shim
      5) Cassandra
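
To make step 3 of this flow concrete, here is a rough sketch of the shape a load mapper might take. It is not the demo's `bin/load-mapper.py`: the input layout (`row_key<TAB>column<TAB>value`), the JSON record, and every function name below are assumptions made for illustration, and the real streaming shim defines its own serialization format, which is not reproduced here.

```python
import json
import time


def parse_line(line):
    """Split one 'row_key<TAB>column<TAB>value' line; return None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or not parts[0]:
        return None
    return tuple(parts)


def to_mutation(row_key, column, value, timestamp):
    """Build a hypothetical mutation record (NOT the shim's real wire format)."""
    return {"key": row_key, "column": column, "value": value, "timestamp": timestamp}


def map_lines(lines, timestamp):
    """Yield one serialized mutation per well-formed line, skipping the rest."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            yield json.dumps(to_mutation(*parsed, timestamp), sort_keys=True)


if __name__ == "__main__":
    # Under Hadoop Streaming the mapper would iterate sys.stdin;
    # a made-up sample list keeps this sketch self-contained.
    sample = ["u1\tname\tstu\n", "not a valid record\n"]
    now_micros = int(time.time() * 1_000_000)
    for record in map_lines(sample, now_micros):
        print(record)
```

Keeping parsing, record construction, and output separate means only `to_mutation` and the serialization call would need to change to target a real output format.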
  11. Apache Pig Summary
      • Declarative relational language
  12. Analytics with Pig
      • Analytics from Cassandra in ~20 lines
      Analyze!
  13. Analytics with Pig
      1) Data stored in Cassandra
      2) Cassandra's Pig LoadFunc
      3) bin/analyze.pig (the code you write)
      4) Files in HDFS
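
For readers unfamiliar with relational dataflow languages, the kind of load, group, count, order, and limit pipeline that such a script typically expresses can be written out imperatively. This is only a sketch in Python with made-up rows; it is not a translation of `bin/analyze.pig`, whose contents are not shown in these slides, and the row shape (`(row_key, columns)` pairs) and function names are assumptions.

```python
from collections import Counter


def load_rows():
    """Stand-in for a load step: (row_key, columns) pairs with made-up data."""
    return [
        ("u1", {"country": "US", "lang": "en"}),
        ("u2", {"country": "DE", "lang": "de"}),
        ("u3", {"country": "US", "lang": "es"}),
    ]


def count_by_column(rows, column):
    """GROUP BY the value of `column`, then COUNT; rows lacking it are skipped."""
    return Counter(cols[column] for _, cols in rows if column in cols)


def top_n(counts, n):
    """ORDER BY count DESC (ties broken by value), then LIMIT n."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    for value, count in top_n(count_by_column(load_rows(), "country"), 2):
        print(f"{value}\t{count}")
```

In a declarative language each of these functions collapses to roughly one statement, which is how an analysis can fit in about 20 lines.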
  14. JVM MapReduce Summary
      • Extend Mapper/Reducer base classes
      • Hadoop:
        – Transports the Jar to nodes near the data
        – Efficiently streams data through
  15. Load/Process with MapReduce
      • Efficient bulk loading in ~80 lines
      Summarize!
  16. Load/Process with MapReduce
      1) Files in HDFS
      2) MapReduce
      3) Mapper/Reducer (the code you write)
      4) Cassandra's ColumnFamilyOutputFormat
      5) Cassandra
  17. Future Work
      • Pig Output
      • Hive
      • Hadoop Streaming Input
      • Optimizations
  18. Questions?
  19. References
      • Code available at
        – github.com/stuhood/cassandra-summit-demo
      • Open issues
        – CASSANDRA-1315
        – CASSANDRA-1322
        – CASSANDRA-1368
      • “Hadoop + Cassandra” - Jeremy Hanna
        – slideshare.net/jeromatron/cassandrahadoop-4399672