Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.
The code is more interesting: http://github.com/stuhood/cassandra-summit-demo
2. What is Hadoop?
• Distributed processing framework (MapReduce)
– Moves processing to the data
• Distributed filesystem
– Allows data to move when processing can't
3. Why use Hadoop with Cassandra?
Perfect partners for big data laundering
• Cassandra optimized for access
• Hadoop optimized for processing
– Many analytics frameworks
– Existing integrations
• RDBMS → Hadoop → Cassandra
4. Cluster Layouts
• Existing Hadoop cluster?
– Start Hadoop tasktrackers on Cassandra cluster
– Processing performed on local nodes
5. Cluster Layouts
• No Hadoop cluster?
– Start all Hadoop daemons on 2-3 nodes
• MapReduce depends lightly on HDFS
– Start Hadoop tasktrackers on Cassandra cluster
6. Hadoop Integration Points
• JVM MapReduce
– Keys/values iterated in process
• Hadoop Streaming
– Performs IPC with arbitrary processes over stdin/stdout
• Apache Pig
– High-level relational language (SQL alternative)
• Apache Hive
– Forthcoming support for Cassandra storage
7. Demo
• Code
– github.com/stuhood/cassandra-summit-demo
• Flow
– Load with Hadoop Streaming
– Analyze with Apache Pig
– Load/Process with JVM MapReduce
8. Hadoop Streaming Summary
• Mapper/Reducer scripts
– Any language
• Script is moved to the data
cat $input | mapper | sort | reducer > $output
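The shell pipeline above can be simulated in-process. A minimal word-count sketch in Python (the word-count task and function names are illustrative, not from the talk — a real Streaming job would read stdin and write stdout):

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word\t1' pair per word, as a Streaming mapper would on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; Hadoop's sort phase guarantees input arrives grouped by key."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local equivalent of: cat $input | mapper | sort | reducer > $output
    lines = ["to be or not to be"]
    print(list(reducer(sorted(mapper(lines)))))  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Because the contract is just lines on stdin/stdout plus an external sort, any language that can read and write text can play either role — which is the point of Streaming.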
13. Analytics with Pig
1) Data stored in Cassandra
2) Cassandra's Pig LoadFunc
3) bin/analyze.pig (the code you write)
4) Files in HDFS
14. JVM MapReduce Summary
• Extend Mapper/Reducer base classes
• Hadoop:
– Transports the Jar to nodes near the data
– Efficiently streams data through
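The contract behind those base classes can be sketched as a toy in-memory engine. This is a conceptual illustration in Python, not the actual org.apache.hadoop.mapreduce API — the class and function names here are invented, and real Hadoop does the shuffle/sort across nodes after shipping your jar:

```python
from itertools import groupby

class Mapper:
    """Stand-in for the base class a job extends; map() emits (key, value) pairs."""
    def map(self, key, value):
        raise NotImplementedError

class Reducer:
    """Stand-in for the reducer base class; reduce() sees one key with all its values."""
    def reduce(self, key, values):
        raise NotImplementedError

def run_job(records, mapper, reducer):
    # Toy version of what the framework does for you:
    # map every record, shuffle/sort by key, then reduce each key group.
    shuffled = sorted(
        (kv for k, v in records for kv in mapper.map(k, v)),
        key=lambda kv: kv[0],
    )
    out = []
    for key, group in groupby(shuffled, key=lambda kv: kv[0]):
        out.extend(reducer.reduce(key, [v for _, v in group]))
    return out

class WordCountMapper(Mapper):
    def map(self, key, value):
        return [(word, 1) for word in value.split()]

class WordCountReducer(Reducer):
    def reduce(self, key, values):
        return [(key, sum(values))]

result = run_job([(0, "to be or not to be")], WordCountMapper(), WordCountReducer())
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The JVM route trades Streaming's flexibility for efficiency: no per-record process IPC, and the serialized keys/values stay inside one JVM on the node holding the data.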