
End-to-end Analytics with Apache Cassandra


Published in: Technology


  1. End-to-end Analytics with Apache Cassandra • C* #cassandra12
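  2. Basics • ColumnFamilyInputFormat (CFIF), ColumnFamilyOutputFormat (CFOF), ColumnFamilyRecordReader (CFRR), ColumnFamilyRecordWriter (CFRW) • BulkOutputFormat • Input locality (identical to HDFS) • Wide row, secondary index (2I), and composite support • Pig, Hive, Mahout, Sqoop, Oozie, DSE Analytics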
  3. Why Cassandra? • Excellent Hadoop capabilities built in • Multi-datacenter support, load isolation • Operationally an order of magnitude simpler • DSE Analytics is all-in-one, simpler still • Review your requirements, do your homework
  4. Use Cases • Trends, recommendations, reporting, etc. • Detect and fix problems in data • New realtime (or analytic) query pattern? • Backpopulate a new column family (CF) with historical data
  5. Data Model • Consider growth patterns and analytic query patterns for your data • One-off inquiry or regular processing? • Fast growing, want only small slices? • Consider active/archive CFs • Secondary indexes for small inputs
  6. Miscellaneous Tips • Don’t forget about tombstones • Use ByteOrderedPartitioner (BOP) to enable range slices (rows with keys ‘A*’ to ‘F*’)
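
The range-slice tip on slide 6 can be illustrated with a small model. This is a minimal sketch in plain Python, not Cassandra code: `md5_token`, `byte_order_token`, and `range_slice` are hypothetical helpers, and the MD5-based token only approximates the idea of a hash partitioner. It shows that when rows are ordered by the raw key bytes, the keys from ‘A*’ to ‘F*’ form one contiguous range, whereas a hash-derived order gives no such guarantee.

```python
import hashlib


def md5_token(key: bytes) -> int:
    """Hash-style token (assumption: a stand-in for a hash partitioner)."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


def byte_order_token(key: bytes) -> bytes:
    """Byte-ordered token: the key itself, compared byte by byte."""
    return key


def range_slice(keys, token, start, end):
    """Return keys whose token lies in [token(start), token(end)), in token order."""
    lo, hi = token(start), token(end)
    ring = sorted(keys, key=token)
    return [k for k in ring if lo <= token(k) < hi]


keys = [b"Alice", b"Bob", b"Erin", b"Frank", b"Zed", b"Carol"]

# Keys 'A*' through 'F*': start at b"A", stop just before b"G".
ordered = range_slice(keys, byte_order_token, b"A", b"G")

# With hashed tokens the same bounds select an arbitrary subset (possibly empty).
hashed = range_slice(keys, md5_token, b"A", b"G")

print(ordered)
print(hashed)
```

With byte ordering the slice returns exactly the keys starting with A through F, in sorted order. The hashed variant is included only for contrast, since its result bears no relation to the key prefixes.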
  7. Cassandra + Oozie • Workflows: cohesive, nestable, scheduled • Web UI, CLI, web service • Cassandra properties go in the Oozie job properties and workflow.xml • Writing out to Cassandra: set mapreduce.fileoutputcommitter.marksuccessfuljobs to false • DSE Analytics works with Oozie 3.2.1+
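
The output setting on slide 7 can be sketched as a Hadoop-style `<configuration>` fragment of the kind that sits inside a workflow.xml action. This is a minimal sketch assuming only the one property named on the slide. The helper `hadoop_configuration` is hypothetical, and any Cassandra connection properties a real job needs are deliberately left out rather than guessed.

```python
import xml.etree.ElementTree as ET


def hadoop_configuration(props: dict) -> ET.Element:
    """Build a <configuration> element with one <property> per entry."""
    config = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return config


# The only property taken from the slide; "false" disables the success marker
# file that the file output committer would otherwise write.
config = hadoop_configuration(
    {"mapreduce.fileoutputcommitter.marksuccessfuljobs": "false"}
)
fragment = ET.tostring(config, encoding="unicode")
print(fragment)
```

The slide does not say why the marker is disabled. Presumably it is because a job writing to Cassandra has no output directory on a file system for the marker to go into.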
  8. Cassandra + Pig • Data with validators • Pig tuples (, address.value) • Data with default validator • Bag of key, value pairs (tuples) • Unmarshal with Pygmalion • Select by regex (e.g. ‘1369*’, ‘link*’)
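
The data shapes on slide 8 can be modeled in a few lines. This is a plain-Python sketch, not Pig and not Pygmalion's implementation: `from_bag` and `select_prefix` are hypothetical helpers, the sample row is invented for illustration, and the pattern is treated as a simple trailing-star prefix match because that is what the slide's examples look like.

```python
def from_bag(columns, *names):
    """Pull named columns out of a bag of (name, value) tuples, in the order asked."""
    lookup = dict(columns)
    return tuple(lookup.get(name) for name in names)


def select_prefix(columns, pattern):
    """Keep columns whose name starts with the pattern's prefix (e.g. 'link*')."""
    prefix = pattern.rstrip("*")
    return [(name, value) for name, value in columns if name.startswith(prefix)]


# A row as it might arrive with the default validator: (key, bag of (name, value)).
row = (
    "user42",
    [
        ("name", "Ann"),
        ("link_a", "http://a.example"),
        ("link_b", "http://b.example"),
        ("age", "31"),
    ],
)

key, columns = row
fields = from_bag(columns, "name", "age")
links = select_prefix(columns, "link*")

print(key, fields)
print(links)
```

Unmarshalling turns the loosely typed bag into a fixed tuple of named fields, which is the tabular shape most Pig operations expect. The prefix selection keeps a variable number of matching columns as a smaller bag.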
  9. Cassandra + Pig • Output to Cassandra • Can output directly with (key, (name, value), (name, value)...) format • For tabular data, format output with Pygmalion’s ToCassandraBag • Use BulkOutputFormat (C* 1.1)
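
The output format on slide 9 is easier to see with a concrete record. This is a plain-Python sketch of the shape only: `to_output_tuple` is a hypothetical helper, not Pygmalion's ToCassandraBag, and the assumption that the first field of a tabular record is the row key is made purely for illustration.

```python
def to_output_tuple(column_names, record):
    """Turn a tabular record (key, v1, v2, ...) into (key, (name, v1), (name, v2), ...)."""
    key, *values = record
    if len(values) != len(column_names):
        raise ValueError("expected one column name per non-key field")
    return (key, *zip(column_names, values))


# A tabular record: row key first, then one value per column.
record = ("user42", "Ann", "31")
output = to_output_tuple(["name", "age"], record)
print(output)
```

Each non-key field is paired with its column name, so the writer knows which column each value belongs to.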
  10. Cassandra + Pig • Composite column support (C* 1.0.9+) • Counter support (C* 1.0.9+) • Secondary Index support for relatively small slices (C* 1.1+) • Wide row support (C* 1.1+) • Composite key support (C* 1.1.3+)
  11. Cassandra + Hadoop X • Example: Cassandra + CDH3 • Start with Cassandra ring • Add NameNode (NN), JobTracker (JT), Oozie server • TaskTracker, DataNode on each node • Jobs/launching point have Cassandra info • Segregate out analytics with a virtual datacenter (DC)
  12. Current/Future Work • Cassandra core Hive support (outside of Brisk) (CASSANDRA-4131)
  13. For More Information • Follow @CassandraHadoop on Twitter • HadoopSupport
  14. Questions?