Real-Time Analytics with
Cassandra and Hadoop
Patricia Gorla

Download code: bit.ly/1aB8Jy8 (12KB)
#strataconf + #hw2013
About Me
• Solr
• Cassandra
• Datastax MVP

Outline
• Introduction to Cassandra + 2 labs
15m Break ~ 14:30
• Analytics + 1 lab
15m Break ~ 16:30
• Extra Credit
Introduction

Getting Started
Architecture
Data Modeling

History
• Powered inbox search at Facebook
• Open-sourced in 2008
Why Cassandra?
• Linear scalability
• Availability
• Set it and forget it

Many companies use Cassandra.
What is Cassandra?
• Dynamo distributed cluster (no vector clocks)
• Bigtable data model
• No SPOF
• Tunably consistent
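As a rough illustration of "tunably consistent", the sketch below applies the usual replica-overlap rule for Dynamo-style stores: a read is guaranteed to see the latest acknowledged write when the read and write replica counts together exceed the replication factor. The function names are hypothetical helpers for this example, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
# Quorum reads plus quorum writes overlap on at least one replica.
strong = read_sees_latest_write(quorum(rf), quorum(rf), rf)
# Reading and writing a single replica each trades consistency for latency.
weak = read_sees_latest_write(1, 1, rf)
```

Choosing the two counts per request is what lets the same cluster serve both latency-sensitive and consistency-sensitive workloads.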
Architecture
Cluster > Keyspace > Column Family 1, Column Family 2
row1: {col1:val1,time,TTL; … }
Lab
introduction/1-getting-started.md

Getting Started
Architecture
Data Modeling
Writes
Commit Log -> Memtable -> SSTables

Source: datastax.com
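The write path named on this slide can be sketched as a toy, single-node model: every write is appended to a commit log for durability, applied to an in-memory memtable, and flushed to an immutable sorted table once the memtable fills up. The class name `ToyNode`, the `flush_threshold` parameter and the newest-first read are assumptions made for illustration; real reads merge columns by timestamp.

```python
class ToyNode:
    """Toy model of Commit Log -> Memtable -> SSTables on one node."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []      # append-only durability log
        self.memtable = {}        # in-memory, mutable
        self.sstables = []        # immutable, sorted by key
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                       # 3. spill to disk when full

    def flush(self):
        # An SSTable is written once, sorted, and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(flush_threshold=2)
node.write("ylaurent", "rivoli")
node.write("cchanel", "pain au chocolat")   # triggers a flush
node.write("ylaurent", "mille feuille")     # newer value lives in the memtable
```

Because SSTables are never updated in place, an overwrite simply lands in a newer structure and wins at read time.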
Incoming write to cluster.
Data replicated to replicas.
Data partitioning by token ranges.
Data partitioning by virtual nodes.
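A minimal sketch of token-range partitioning, assuming a hash ring where each node owns one or more tokens (several tokens per node approximates virtual nodes). The `Ring` class and the use of MD5 are illustrative choices made here because they need only the standard library; they are not claimed to match Cassandra's partitioner or replication strategies.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a 64-bit position on the ring (illustrative hash)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes_per_node=1):
        # Each node claims vnodes_per_node tokens scattered around the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, replication_factor=1):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replication_factor:
                break
        return owners


ring = Ring(["node1", "node2", "node3"], vnodes_per_node=8)
owners = ring.replicas("ylaurent", replication_factor=2)
```

With more tokens per node, each node's share of the ring is split into many small ranges, which evens out load when nodes join or leave.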
Reads
High-level overview of reads.
Source: fusionio.com
Source: datastax.com
Reading from cluster.
Fault tolerance
Reading from cluster.
Deletes
• Distributed deletes are tricky
• Tombstones may not be propagated
• Don’t rely on a delete-heavy system
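Why distributed deletes are tricky can be shown with a small last-write-wins model: if one replica misses a delete and the other simply forgets the key, reconciliation resurrects the old value, whereas a timestamped tombstone lets the delete win. The helper names and the dictionary-based replicas are assumptions for this sketch only.

```python
TOMBSTONE = object()  # marker stored in place of a deleted value


def write(replica, key, value, timestamp):
    """Keep the version with the newest timestamp (last write wins)."""
    current = replica.get(key)
    if current is None or timestamp > current[1]:
        replica[key] = (value, timestamp)


def delete(replica, key, timestamp):
    """A delete is just a write of a tombstone."""
    write(replica, key, TOMBSTONE, timestamp)


def reconcile(key, replicas):
    """Return the newest live value across replicas, or None."""
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    value, _ = max(versions, key=lambda version: version[1])
    return None if value is TOMBSTONE else value


# Both replicas hold the row; the delete only reaches replica_a.
replica_a, replica_b = {}, {}
write(replica_a, "cchanel", "croissant", timestamp=1)
write(replica_b, "cchanel", "croissant", timestamp=1)
delete(replica_a, "cchanel", timestamp=2)
with_tombstone = reconcile("cchanel", [replica_a, replica_b])

# Without a tombstone, replica_a just forgets the key and the value returns.
forgetful_a = {}
without_tombstone = reconcile("cchanel", [forgetful_a, replica_b])
```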
Getting Started
Architecture
Data Modeling
Protocols
Thrift: Thrift, CQL; synchronous
Binary: CQL; asynchronous
Why CQL?
• Familiar syntax
• Flexible data model over Cassandra
CQL: Creating a Keyspace
create KEYSPACE "Patisserie"
with replication =
{'class': 'SimpleStrategy',
'replication_factor':...
CQL: Creating a Column Family
create TABLE "customers"
(customer text,
age int,
PRIMARY KEY (customer) ) ;
customer

CQL S...
CQL: Creating a Column Family
create TABLE "customers"
(customer text,
age int,
PRIMARY KEY (customer) ) ;

Physical
Repre...
CQL: Composite Columns
create TABLE "customer_purchases"
(customer text,
day text,
item text,
PRIMARY KEY (customer,day) )...
CQL: Composite Columns
create TABLE "customer_purchases"
(customer text,
day text,
item text,
PRIMARY KEY (customer,day) )...
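How a table with PRIMARY KEY (customer, day) maps onto the wide-row physical layout can be sketched as follows: the first key column becomes the row key, and the second is prefixed onto each remaining column name. The function `to_physical` is a hypothetical helper written for this example, and the sample rows follow the deck's customer_purchases data.

```python
from collections import defaultdict


def to_physical(rows, partition_key, clustering_key):
    """Fold CQL-style rows into {row_key: {"<clustering>:<column>": value}}."""
    physical = defaultdict(dict)
    for row in rows:
        prefix = row[clustering_key]
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns are encoded in the row key and prefix
            physical[row[partition_key]][f"{prefix}:{column}"] = value
    return dict(physical)


cql_rows = [
    {"customer": "ylaurent", "day": "M", "item": "rivoli"},
    {"customer": "ylaurent", "day": "T", "item": "mille feuille"},
    {"customer": "cchanel", "day": "M", "item": "pain au chocolat"},
]
layout = to_physical(cql_rows, partition_key="customer", clustering_key="day")
```

All purchases for one customer end up in a single physical row, which is why queries by customer are cheap and queries across customers are not.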
CQL: Composite Primary Keys
create TABLE "daily_sales_by_item"
(day text,
customer text,
hour timestamp,
item text,
PRIMAR...
CQL: Composite Primary Keys
create TABLE "daily_sales_by_item"
(day text,
customer text,
hour timestamp,
item text,
PRIMAR...
CQL: Collections
create TABLE "customer_purchases"
(customer text,
day text,
item list<text>,
PRIMARY KEY (customer,day) )...
Data Modeling Lab
introduction/2-data-modeling.md
Analytics
Cassandra and Analytics
Adapting the Data Model
MapReduce Paradigms
An Unlikely Union
• Batch processing analytics and real-time data store
• MapReduce, Hive, Pig, Sqoop, Mahout
Why Cassandra and Hadoop?
• Unified workload
• Availability
• Simpler deployment
Data Locality

Datastax Enterprise
Job Tracker

Task Trackers

Datastax Enterprise
MapReduce
CFS
Writing in / out is passed through the CassandraFS layer
Starting Analytics Node
$ bin/dse cassandra -t -j
# Starts task tracker and job tracker on node
Hello, Wordcount
$ bin/dse hadoop fs -put wikipedia /
$ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
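For readers new to MapReduce, the logic a word-count job performs can be sketched in plain Python: a map step emits a (word, 1) pair per word, and a reduce step sums the pairs per word. This is an in-memory illustration only; it does not show the contents of wordcount.jar, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def wordcount(lines):
    """Run the map step over every line, then reduce the combined output."""
    return reduce_counts(chain.from_iterable(map_words(line) for line in lines))


result = wordcount(["Hello hadoop", "hello cassandra"])
```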
Cassandra and Hadoop
Adapting the Data Model
MapReduce Paradigms
Hive
• SQL-like MapReduce abstraction
• Data types
• Efficient JOINs, GROUP BY
Cassandra and Hive
• Hive still has to have separate tables.
• DSE stores them in a separate keyspace.
• 1:1 mapping to Ca...
MapReduce
CFS
Hive

Hive Metastore is persisted in Cassandra layer
Hive: Creating a DB
hive> CREATE EXTERNAL TABLE customers (
id string, name string, age int
)
STORED BY
'o.a.h.h.cassandra...
Hive: Multiple Data Centers
hive> CREATE EXTERNAL TABLE customers (
id string, name string, age int
)
STORED BY
'o.a.h.h.c...
Hive
• What about composite columns?
• Must be retrieved as binary data and then deserialized with a UDF.
Hive: Lab
• For each person, calculate how many pastries (and of what kind) they purchased.
Hive: Multiple Data Centers
hive> SELECT
b.name, a.item, sum(a.amount)
FROM Oberweis.daily_purchases a
JOIN Oberweis.perso...
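The shape of the result this query produces can be previewed with a small in-memory equivalent: join each purchase to its person by id, then sum the amounts per (name, item) pair. The column names follow the query in the deck; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def pastries_per_person(purchases, persons):
    """Inner-join purchases to persons on person == id, then SUM(amount)."""
    names = {person["id"]: person["name"] for person in persons}
    totals = defaultdict(int)
    for purchase in purchases:
        name = names.get(purchase["person"])
        if name is None:
            continue  # inner join: skip purchases with no matching person
        totals[(name, purchase["item"])] += purchase["amount"]
    return dict(totals)


# Hypothetical sample data, not taken from the lab files.
persons = [{"id": "1", "name": "Coco Chanel"}, {"id": "2", "name": "Yves Laurent"}]
purchases = [
    {"person": "1", "item": "rivoli", "amount": 2},
    {"person": "1", "item": "rivoli", "amount": 1},
    {"person": "2", "item": "croissant", "amount": 4},
    {"person": "9", "item": "croissant", "amount": 7},
]
totals = pastries_per_person(purchases, persons)
```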
Extra Credit
Real Time Considerations
• What about real time?
• Neither Hadoop nor Hive is built for real-time
• Cassandra provides yo...
Cassandra 2.0
• Transactions
• Triggers
• Prepared Statements
Q&A
@patriciagorla
pgorla@o19s.com
pgorla on IRC
(#cassandra, #python)

#strataconf + #hw2013
Introduction to Real-Time Analytics with Cassandra and Hadoop
This presentation examines the benefits of using Cassandra to store data, and how the Hadoop ecosystem can fit in to add aggregation functionality to your cluster.

Accompanying code can be found online at bit.ly/1aB8Jy8.

Talk delivered at StrataConf + Hadoop World 2013.

Transcript of "Introduction to Real-Time Analytics with Cassandra and Hadoop"

  1. Real-Time Analytics with Cassandra and Hadoop. Patricia Gorla. Download code: bit.ly/1aB8Jy8 (12KB) #strataconf + #hw2013
  2. About Me • Solr • Cassandra • Datastax MVP
  3. Outline • Introduction to Cassandra + 2 labs; 15m Break ~ 14:30 • Analytics + 1 lab; 15m Break ~ 16:30 • Extra Credit
  4. Introduction
  5. Getting Started; Architecture; Data Modeling
  6. History • Powered inbox search at Facebook • Open-sourced in 2008
  7. Why Cassandra? • Linear scalability • Availability • Set it and forget it
  8. Many companies use Cassandra.
  9. What is Cassandra? • Dynamo distributed cluster (no vector clocks) • Bigtable data model • No SPOF • Tunably consistent
  10. Architecture: Cluster > Keyspace
  11. Keyspace > Column Family 1, Column Family 2
  12. Keyspace > Column Family 1, Column Family 2; row1: {col1:val1,time,TTL; … }
  13. Lab: introduction/1-getting-started.md
  14. Getting Started; Architecture; Data Modeling
  15. Writes: Commit Log -> Memtable -> SSTables. Source: datastax.com
  16. Incoming write to cluster.
  17. Data replicated to replicas.
  18. Data partitioning by token ranges.
  19. Data partitioning by virtual nodes.
  20. Reads
  21. High-level overview of reads. Source: fusionio.com
  22. Source: datastax.com
  23. Reading from cluster.
  24. Reading from cluster.
  25. Reading from cluster.
  26. Reading from cluster.
  27. Fault tolerance
  28. Reading from cluster.
  29. Reading from cluster.
  30. Reading from cluster.
  31. Reading from cluster.
  32. Deletes • Distributed deletes are tricky • Tombstones may not be propagated • Don’t rely on a delete-heavy system
  33. Getting Started; Architecture; Data Modeling
  34. Protocols. Thrift: Thrift, CQL; synchronous. Binary: CQL; asynchronous.
  35. Why CQL? • Familiar syntax • Flexible data model over Cassandra
  36. CQL: Creating a Keyspace. create KEYSPACE "Patisserie" with replication = {'class': 'SimpleStrategy', 'replication_factor': 1 } ; use "Patisserie";
  37. CQL: Creating a Column Family. create TABLE "customers" (customer text, age int, PRIMARY KEY (customer) ) ; CQL Schema (customer | age): Yves Laurent | 77; Coco Chanel | 130; Pierre Cardin | (no age)
  38. CQL: Creating a Column Family. create TABLE "customers" (customer text, age int, PRIMARY KEY (customer) ) ; Physical Representation: "Yves Laurent": {"age":77} "Coco Chanel": {"age":130} "Pierre Cardin": {}
  39. CQL: Composite Columns. create TABLE "customer_purchases" (customer text, day text, item text, PRIMARY KEY (customer,day) ) ; (customer | day | item): ylaurent | M | rivoli; ylaurent | T | mille feuille; cchanel | M | pain au chocolat; pcardin | W | mille feuille; pcardin | F | croissant
  40. CQL: Composite Columns. create TABLE "customer_purchases" (customer text, day text, item text, PRIMARY KEY (customer,day) ) ; "ylaurent": { "M:item": "rivoli", "T:item": "mille feuille" } "cchanel": { "M:item": "pain au chocolat" } "pcardin": { "W:item": "mille feuille", "F:item": "croissant" }
  41. CQL: Composite Primary Keys. create TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day,customer), hour) ) ; (day | customer | hour | item): M | cchanel | 13 | rivoli; M | cchanel | 15 | mille feuille; M | ylaurent | 4 | rivoli; T | cchanel | 17 | mille feuille; W | pcardin | 20 | croissant
  42. CQL: Composite Primary Keys. create TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day,customer), hour) ) ; "M:cchanel": { "13:item": "rivoli", "15:item": "mille feuille" } "M:ylaurent": { "4:item": "rivoli" } "T:cchanel": { "17:item": "mille feuille" } "W:pcardin": { "20:item": "croissant" }
  43. CQL: Collections. create TABLE "customer_purchases" (customer text, day text, item list<text>, PRIMARY KEY (customer,day) ) ; (customer | day | item): ylaurent | M | ['rivoli', 'rivoli', 'javanais']; cchanel | M | ['pain au chocolat']; pcardin | W | ['mille feuille', 'croissant']; pcardin | F | ['croissant']
  44. Data Modeling Lab: introduction/2-data-modeling.md
  45. Analytics
  46. Cassandra and Analytics; Adapting the Data Model; MapReduce Paradigms
  47. An Unlikely Union • Batch processing analytics and real-time data store • MapReduce, Hive, Pig, Sqoop, Mahout
  48. Why Cassandra and Hadoop? • Unified workload • Availability • Simpler deployment
  49. Data Locality. Datastax Enterprise
  50. Job Tracker; Task Trackers. Datastax Enterprise
  51. MapReduce; CFS. Writing in / out is passed through the CassandraFS layer
  52. Starting Analytics Node. $ bin/dse cassandra -t -j # Starts task tracker and job tracker on node
  53. Hello, Wordcount. $ bin/dse hadoop fs -put wikipedia / $ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
  54. Cassandra and Hadoop; Adapting the Data Model; MapReduce Paradigms
  55. Hive • SQL-like MapReduce abstraction • Data types • Efficient JOINs, GROUP BY
  56. Cassandra and Hive • Hive still has to have separate tables. • DSE stores them in a separate keyspace. • 1:1 mapping to Cassandra CFs • Schemas must match or columns will be inaccessible.
  57. MapReduce; CFS; Hive. Hive Metastore is persisted in Cassandra layer
  58. Hive: Creating a DB. hive> CREATE EXTERNAL TABLE customers ( id string, name string, age int ) STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler' TBLPROPERTIES ( "cassandra.ks.name" = "Oberweis", "cassandra.ks.repfactor" = "2", "cassandra.ks.strategy" = "o.a.c.l.SimpleStrategy" );
  59. Hive: Multiple Data Centers. hive> CREATE EXTERNAL TABLE customers ( id string, name string, age int ) STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler' TBLPROPERTIES ( "cassandra.ks.name" = "Oberweis", "cassandra.ks.stratOptions" = "DC1:3, DC2:1", "cassandra.ks.strategy" = "o.a.c.l.NTStrategy" );
  60. Hive • What about composite columns? • Must be retrieved as binary data and then deserialized with a UDF.
  61. Hive: Lab • For each person, calculate how many pastries (and of what kind) they purchased.
  62. Hive: Multiple Data Centers. hive> SELECT b.name, a.item, sum(a.amount) FROM Oberweis.daily_purchases a JOIN Oberweis.person b ON (a.person = b.id) GROUP BY b.name, a.item;
  63. Extra Credit
  64. Real Time Considerations • What about real time? • Neither Hadoop nor Hive is built for real-time • Cassandra provides you with data locality
  65. Cassandra 2.0 • Transactions • Triggers • Prepared Statements
  66. Q&A: @patriciagorla, pgorla@o19s.com, pgorla on IRC (#cassandra, #python) #strataconf + #hw2013