Partners in Crime: Cassandra Analytics and ETL with Hadoop

7,520
-1

Published on

Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.

The code is more interesting: http://github.com/stuhood/cassandra-summit-demo

Published in: Technology

Partners in Crime: Cassandra Analytics and ETL with Hadoop

  1. 1. Partners in Crime Cassandra Analytics and ETL with Hadoop Cassandra Summit 2010 Date: August 10th, 2010
  2. 2. What is Hadoop? • Distributed processing framework (MapReduce) – Moves processing to the data • Distributed filesystem – Allows data to move when processing can't
  3. 3. Why use Hadoop with Cassandra? Perfect partners for big data laundering • Cassandra optimized for access • Hadoop optimized for processing – Many analytics frameworks – Existing integrations • RDBMS → Hadoop → Cassandra
  4. 4. Cluster Layouts • Existing Hadoop cluster? – Start Hadoop tasktrackers on Cassandra cluster – Processing performed on local nodes
  5. 5. Cluster Layouts • No Hadoop cluster? – Start all Hadoop daemons on 2-3 nodes • MapReduce depends lightly on HDFS – Start Hadoop tasktrackers on Cassandra cluster
  6. 6. Hadoop Integration Points • JVM MapReduce – Keys/values iterated in process • Hadoop Streaming – Performs IPC on stdin/stdout to arbitrary processes • Apache Pig – High level relational language (SQL alternative) • Apache Hive – Forthcoming support for Cassandra storage
  7. 7. Demo • Code – github.com/stuhood/cassandra-summit-demo • Flow – Load with Hadoop Streaming – Analyze with Apache Pig – Load/Process with JVM MapReduce
  8. 8. Hadoop Streaming Summary • Mapper/Reducer scripts – Any language • Script is moved to the data cat $input | mapper | sort | reducer > $output
  9. 9. ETL with Streaming • ETL to Cassandra in ~50 lines Load!
  10. 10. ETL with Streaming 1)Files in HDFS 2)Hadoop Streaming 3)bin/load-mapper.py (the code you write) 4)Cassandra's Streaming Shim 5)Cassandra
  11. 11. Apache Pig Summary • Declarative relational language
  12. 12. Analytics with Pig • Analytics from Cassandra in ~20 lines Analyze!
  13. 13. Analytics with Pig 1)Data stored in Cassandra 2)Cassandra's Pig LoadFunc 3)bin/analyze.pig (the code you write) 4)Files in HDFS
  14. 14. JVM MapReduce Summary • Extend Mapper/Reducer base classes • Hadoop: – Transports the Jar to nodes near the data – Efficiently streams data through
  15. 15. Load/Process with MapReduce • Efficient bulk loading in ~80 lines Summarize!
  16. 16. Load/Process with MapReduce 1)Files in HDFS 2)MapReduce 3)Mapper/Reducer (the code you write) 4)Cassandra's ColumnFamilyOutputFormat 5)Cassandra
  17. 17. Future Work • Pig Output • Hive • Hadoop Streaming Input • Optimizations
  18. 18. Questions?
  19. 19. References • Code available at – github.com/stuhood/cassandra-summit-demo • Open issues – CASSANDRA-1315 – CASSANDRA-1322 – CASSANDRA-1368 • “Hadoop + Cassandra” - Jeremy Hanna – slideshare.net/jeromatron/cassandrahadoop-4399672

×