Introduction to Real-Time Analytics with Cassandra and Hadoop

This presentation examines the benefits of using Cassandra to store data, and how the Hadoop ecosystem can fit in to add aggregation functionality to your cluster.

Accompanying code can be found online at bit.ly/1aB8Jy8.

Talk delivered at StrataConf + Hadoop World 2013.

Introduction to Real-Time Analytics with Cassandra and Hadoop Presentation Transcript

  • 1. Real-Time Analytics with Cassandra and Hadoop Patricia Gorla Download code: bit.ly/1aB8Jy8 (12KB) #strataconf + #hw2013
  • 2. About Me • Solr • Cassandra • Datastax MVP
  • 3. Outline • Introduction to Cassandra + 2 labs 15m Break ~ 14:30 • Analytics + 1 lab 15m Break ~ 16:30 • Extra Credit
  • 4. Introduction
  • 5. Getting Started Architecture Data Modeling
  • 6. History • Powered inbox search at Facebook • Open-sourced in 2008
  • 7. Why Cassandra? • Linear scalability • Availability • Set it and forget it
  • 8. ... Many companies use Cassandra.
  • 9. What is Cassandra? • Dynamo distributed cluster (no vector clocks) • Bigtable data model • No SPOF • Tunably consistent
  • 10. Cluster Keyspace Architecture
  • 11. Keyspace Column Family 1 Column Family 2
  • 12. Keyspace • Column Family 1 row1: {col1: val1, time, TTL; … } • Column Family 2
  • 13. Lab introduction/1-getting-started.md
  • 14. Getting Started Architecture Data Modeling
  • 15. Writes Commit Log -> Memtable -> SSTables Source: datastax.com
  • 16. Incoming write to cluster.
  • 17. Data replicated to replicas.
  • 18. Data partitioning by token ranges.
  • 19. Data partitioning by virtual nodes.
  • 20. Reads
  • 21. High-level overview of reads. Source: fusionio.com
  • 22. Source: datastax.com
  • 23. Reading from cluster.
  • 24. Reading from cluster.
  • 25. Reading from cluster.
  • 26. Reading from cluster.
  • 27. Fault tolerance
  • 28. Reading from cluster.
  • 29. Reading from cluster.
  • 30. Reading from cluster.
  • 31. Reading from cluster.
  • 32. Deletes • Distributed deletes are tricky • Tombstones may not be propagated • Don’t rely on a delete-heavy system
  • 33. Getting Started Architecture Data Modeling
  • 34. Protocols
    Thrift          Binary
    Thrift, CQL     CQL
    Synchronous     Asynchronous
  • 35. Why CQL? • Familiar syntax • Flexible data model over Cassandra
  • 36. CQL: Creating a Keyspace
    create KEYSPACE "Patisserie" with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    use "Patisserie";
  • 37. CQL: Creating a Column Family
    create TABLE "customers" (customer text, age int, PRIMARY KEY (customer));
    CQL Schema
    customer        age
    Yves Laurent    77
    Coco Chanel     130
    Pierre Cardin
  • 38. CQL: Creating a Column Family
    create TABLE "customers" (customer text, age int, PRIMARY KEY (customer));
    Physical Representation
    "Yves Laurent": {"age": 77}
    "Coco Chanel": {"age": 130}
    "Pierre Cardin": {}
  • 39. CQL: Composite Columns
    create TABLE "customer_purchases" (customer text, day text, item text, PRIMARY KEY (customer, day));
    customer    day    item
    ylaurent    M      rivoli
    ylaurent    T      mille feuille
    cchanel     M      pain au chocolat
    pcardin     W      mille feuille
    pcardin     F      croissant
  • 40. CQL: Composite Columns
    create TABLE "customer_purchases" (customer text, day text, item text, PRIMARY KEY (customer, day));
    "ylaurent": {"M:item": "rivoli", "T:item": "mille feuille"}
    "cchanel": {"M:item": "pain au chocolat"}
    "pcardin": {"W:item": "mille feuille", "F:item": "croissant"}
  • 41. CQL: Composite Primary Keys
    create TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));
    day    customer    hour    item
    M      cchanel     13      rivoli
    M      cchanel     15      mille feuille
    M      ylaurent    4       rivoli
    T      cchanel     17      mille feuille
    W      pcardin     20      croissant
  • 42. CQL: Composite Primary Keys
    create TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));
    "M:cchanel": {"13:item": "rivoli", "15:item": "mille feuille"}
    "M:ylaurent": {"4:item": "rivoli"}
    "T:cchanel": {"17:item": "mille feuille"}
    "W:pcardin": {"20:item": "croissant"}
  • 43. CQL: Collections
    create TABLE "customer_purchases" (customer text, day text, item list<text>, PRIMARY KEY (customer, day));
    customer    day    item
    ylaurent    M      ['rivoli', 'rivoli', 'javanais']
    cchanel     M      ['pain au chocolat']
    pcardin     W      ['mille feuille', 'croissant']
    pcardin     F      ['croissant']
  • 44. Data Modeling Lab introduction/2-data-modeling.md
  • 45. Analytics
  • 46. Cassandra and Analytics Adapting the Data Model MapReduce Paradigms
  • 47. An Unlikely Union • Batch processing analytics and real-time data store • MapReduce, Hive, Pig, Sqoop, Mahout
  • 48. Why Cassandra and Hadoop? • Unified workload • Availability • Simpler deployment
  • 49. Data Locality (Datastax Enterprise)
  • 50. Job Tracker Task Trackers Datastax Enterprise
  • 51. MapReduce CFS Writing in / out is passed through the CassandraFS layer
  • 52. Starting Analytics Node
    $ bin/dse cassandra -t -j
    # Starts task tracker and job tracker on node
  • 53. Hello, Wordcount
    $ bin/dse hadoop fs -put wikipedia /
    $ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
  • 54. Cassandra and Hadoop Adapting the Data Model MapReduce Paradigms
  • 55. Hive • SQL-like MapReduce abstraction • Data types • Efficient JOINs, GROUP BY
  • 56. Cassandra and Hive • Hive still has to have separate tables. • DSE stores them in a separate keyspace. • 1:1 mapping to Cassandra CFs • Schemas must match or columns will be inaccessible.
  • 57. MapReduce CFS Hive Hive Metastore is persisted in Cassandra layer
  • 58. Hive: Creating a DB
    hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
          STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
          TBLPROPERTIES (
            "cassandra.ks.name" = "Oberweis",
            "cassandra.ks.repfactor" = "2",
            "cassandra.ks.strategy" = "o.a.c.l.SimpleStrategy" );
  • 59. Hive: Multiple Data Centers
    hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
          STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
          TBLPROPERTIES (
            "cassandra.ks.name" = "Oberweis",
            "cassandra.ks.stratOptions" = "DC1:3, DC2:1",
            "cassandra.ks.strategy" = "o.a.c.l.NTStrategy" );
  • 60. Hive • What about composite columns? • They must be retrieved as binary data and then deserialized with a UDF.
  • 61. Hive: Lab • For each person, calculate how many pastries (and of what kind) they purchased.
  • 62. Hive: Multiple Data Centers
    hive> SELECT b.name, a.item, sum(a.amount)
          FROM Oberweis.daily_purchases a
          JOIN Oberweis.person b ON (a.person = b.id)
          GROUP BY b.name, a.item;
  • 63. Extra Credit
  • 64. Real Time Considerations • What about real time? • Neither Hadoop nor Hive is built for real time • Cassandra provides you with data locality
  • 65. Cassandra 2.0 • Transactions • Triggers • Prepared Statements
  • 66. Q&A @patriciagorla pgorla@o19s.com pgorla on IRC (#cassandra, #python) #strataconf + #hw2013
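
Slide 15 gives the write path as Commit Log -> Memtable -> SSTables. The Python sketch below is a toy, in-memory illustration of that flow only. The `ToyNode` class, its `memtable_limit` threshold, and the flush rule are assumptions made for this example and are not how Cassandra itself is implemented.

```python
class ToyNode:
    """Toy model of the write path on slide 15 (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of writes not yet flushed
        self.memtable = {}        # in-memory view of recent writes
        self.sstables = []        # immutable, sorted snapshots (oldest first)
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        # 1. Record the write in the commit log, 2. apply it to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        # 3. Once the memtable is "full", flush it to a new SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is modelled as an immutable, key-sorted tuple of pairs.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed writes no longer need the log

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


node = ToyNode(memtable_limit=2)
node.write("ylaurent", "rivoli")
node.write("cchanel", "pain au chocolat")   # second write triggers a flush
node.write("ylaurent", "mille feuille")     # newer value stays in the memtable
```

In this toy model a read for "ylaurent" is answered from the memtable, while "cchanel" is found in the flushed SSTable.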
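
Slides 17 and 18 describe replicating data and partitioning it by token ranges. As a rough sketch of the idea, the Python below places nodes on a hash ring and walks clockwise to pick replicas. The `Ring` class, the `token` helper, the use of MD5, and the node names are all assumptions for illustration. Cassandra's own partitioners and replication strategies differ, and virtual nodes (slide 19) are not modelled here.

```python
import bisect
import hashlib


def token(key):
    """Map a key to a ring position (MD5 is used purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring: each node owns the range ending at its token."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = min(replication_factor, len(nodes))
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        # Find the first node whose token is >= the key's token,
        # wrapping around to the start of the ring if necessary.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        # The following nodes on the ring hold the additional replicas.
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
print(ring.replicas("ylaurent"))  # two distinct node names
```

Because placement depends only on the hash of the key, any node can work out which replicas own a row without consulting a central coordinator, which fits the "No SPOF" point on slide 9.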
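
Slides 39 and 40 show the same `customer_purchases` data twice: as CQL rows and as the wide rows stored underneath. The Python sketch below reproduces that mapping for the slide's data. The `to_wide_rows` function is a hypothetical helper written for this example, not part of any Cassandra driver. Python dictionaries also keep insertion order, whereas Cassandra orders the cells within a partition by clustering column.

```python
def to_wide_rows(rows, partition_key, clustering_key):
    """Fold CQL-style rows into {partition: {"clustering:column": value}}."""
    wide = {}
    for row in rows:
        partition = wide.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns become part of the cell name instead
            partition[f"{row[clustering_key]}:{column}"] = value
    return wide


# Rows from slide 39.
purchases = [
    {"customer": "ylaurent", "day": "M", "item": "rivoli"},
    {"customer": "ylaurent", "day": "T", "item": "mille feuille"},
    {"customer": "cchanel", "day": "M", "item": "pain au chocolat"},
    {"customer": "pcardin", "day": "W", "item": "mille feuille"},
    {"customer": "pcardin", "day": "F", "item": "croissant"},
]

wide_rows = to_wide_rows(purchases, partition_key="customer", clustering_key="day")
```

Here `wide_rows["ylaurent"]` holds the two cells "M:item" and "T:item", matching the physical representation on slide 40.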
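
The lab on slide 61 asks how many pastries of each kind each person bought, and slide 62 answers it with a Hive JOIN and GROUP BY. To make the semantics of that query concrete, here is a plain-Python sketch of the same aggregation. The `pastries_per_person` function and the sample rows are invented for illustration; only the column names (`person`, `item`, `amount`, `id`, `name`) come from the slide.

```python
from collections import defaultdict


def pastries_per_person(purchases, people):
    """Mimic: SELECT b.name, a.item, sum(a.amount) ... GROUP BY b.name, a.item."""
    names = {person["id"]: person["name"] for person in people}
    totals = defaultdict(int)
    for purchase in purchases:
        name = names.get(purchase["person"])
        if name is None:
            continue  # inner join: drop purchases with no matching person
        totals[(name, purchase["item"])] += purchase["amount"]
    return dict(totals)


# Hypothetical sample data, not taken from the talk.
people = [
    {"id": 1, "name": "Yves Laurent"},
    {"id": 2, "name": "Pierre Cardin"},
]
daily_purchases = [
    {"person": 1, "item": "rivoli", "amount": 2},
    {"person": 1, "item": "rivoli", "amount": 1},
    {"person": 2, "item": "croissant", "amount": 4},
    {"person": 3, "item": "javanais", "amount": 5},  # no matching person
]

totals = pastries_per_person(daily_purchases, people)
```

Hive compiles the equivalent query into MapReduce jobs over the Cassandra-backed tables, so the result is the same grouping but computed in batch across the cluster.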