C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza

Cassandra &
Next Generation Analysis
Cassandra for a high-velocity data
ingestion and real-time analysis system.
Ameet Chaubal & Fausto Inestroza

Presentation Route
• Describe conventional technology solution
• Highlight deficiencies
• Showcase new solution implemented using Cassandra
• Layout architecture with improvements

Business Case
•  Capture messages from a high-volume e-Commerce site.
•  Store them into a database
•  Perform near real-time queries for
troubleshooting
•  Perform deeper analysis a la BI.

Olden Days …
eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data warehouse → Analysis

Business Case, Details…
Messages: 5000 msg/sec
~ 250 million / day
Message size : 1 Kb
eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data warehouse
•  JMS Queue: Decouple UI from storage; multiple sinks
•  Transient Storage: Dedicated storage for triage
•  Data warehouse: Data analysis, Business Intelligence
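As a rough check on the message figures above, the sketch below works out the data rates they imply. It assumes "1 Kb" means 1,000 bytes per message, which the slide does not state, and the variable names are illustrative only.

```python
# Back-of-the-envelope sizing from the slide's figures.
# Assumption: "1 Kb" is read as 1,000 bytes per message.
MSG_PER_SEC = 5_000
MSG_PER_DAY = 250_000_000
MSG_BYTES = 1_000
SECONDS_PER_DAY = 86_400

bytes_per_sec = MSG_PER_SEC * MSG_BYTES            # ingest rate at 5,000 msg/sec
bytes_per_day = MSG_PER_DAY * MSG_BYTES            # raw volume per day, before replication
avg_msg_per_sec = MSG_PER_DAY / SECONDS_PER_DAY    # average rate implied by the daily total

print(f"{bytes_per_sec / 1e6:.0f} MB/s at 5,000 msg/sec")
print(f"{bytes_per_day / 1e9:.0f} GB/day before replication")
print(f"{avg_msg_per_sec:.0f} msg/sec averaged over a day")
```

The daily total averages out to roughly 2,900 msg/sec, so 5,000 msg/sec reads more like a peak than a sustained rate. That interpretation is an inference from the two numbers, not something the slide says.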

What’s the problem?
[Diagram labels: SITE I, SITE II, JMS Queue, Transient storage, Batch Load, Data warehouse]
•  Queue replication problems
•  Message loss
•  Other applications affected in case of failover
•  Triage data isolated
•  No universal view
•  Data consolidation adds delay
•  Inability to keep up with increasing messages
•  Analysis always lagging the action
•  No low-latency queries

Problems Recap
•  Over 5000 msg/sec
–  High write speed
•  Extraction & Load very slow
–  ETL from Transient storage to Data warehouse takes over 4 hours
•  Analysis always lags events by hours
–  ETL performed in batches 4 hours apart
•  No high availability
–  No Geo-Redundancy for Transient Storage
•  Data stored in disparate buckets
–  No universal view of data for “Triage” applications/troubleshooting
•  No dashboard
–  No low-latency queries
•  No immediate alert, pattern detection
–  No real-time analysis

Solution Blueprint
[Architecture diagram, reconstructed from its labels]
•  Online e-Commerce Application → JMS Publisher: write event to queue (Event JMS)
•  Consumers (Hector / Java Client 1 … n): fetch from queue
•  Thrift Connection Pool → Load Balancer VIP → Cassandra nodes A1 to A7, with Replication
•  Cassandra + Hadoop nodes A8, A9, A10, A12: Map/Reduce, Hive Queries / BI
•  Real-Time Dashboard

Role of Data Model
Before we get there:
what features are missing from Cassandra in
comparison to a traditional RDBMS?

Shortcomings… Opportunities
•  No Joins across Column Families
•  No analytical functions such as sum, count…
•  Difficulty in constructing “WHERE” clause
predicates across composite columns
•  Inability to order range of Keys in Random
Partitioner

Importance of Data model - Cassandra
•  In lieu of JOINS, “smart” de-normalization techniques
are crucial.
•  Need to use “FEATURES” of Cassandra to effectively
model the business rules and business data
•  “Client” or “Application” code becomes extremely
important.
•  “APPLICATION” + “DATABASE” => Full Package

Features of Cassandra Modeling
•  “WIDE” Column Family
–  Organize data in “horizontal” as opposed to “vertical” fashion as in RDBMS
•  Automatic Sorting of Columns
–  Important to “MODEL” the data in “COLUMNS” as opposed to rows.
•  Faster Access to ALL COLUMNS of a Row Key
–  All columns of a row key stored on ONE server =>fast iteration/aggregations
•  Useful info in “COLUMN NAME”
–  Ground breaking from RDBMS perspective
–  Enables “MORE” “INFORMATION” to be PACKED
–  “COLUMN” as entity becomes “MORE POWERFUL”.
•  COMPOSITE Column NAMES:
–  Column names can be COMPOSITES !!! Made up of multiple columns
–  Auto sorting still works
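To illustrate how sorted composite column names behave, here is a minimal sketch in plain Python. It is not Cassandra code: tuples stand in for composite names, `sorted` stands in for the comparator, and the component names (`dc`, `seq`, `user`) and sample values are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Each tuple plays the role of one composite column name: (dc, seq, user).
# All values here are invented for illustration.
columns = {
    ("dc2", 3, "bob"): "msg-c",
    ("dc1", 2, "alice"): "msg-b",
    ("dc1", 1, "carol"): "msg-a",
    ("dc2", 1, "alice"): "msg-d",
}

# Composite names sort component by component, the same way tuples do.
names = sorted(columns)

def slice_by_first_component(sorted_names, value):
    """Return the contiguous run of names whose first component equals value."""
    firsts = [name[0] for name in sorted_names]
    lo = bisect_left(firsts, value)
    hi = bisect_right(firsts, value)
    return sorted_names[lo:hi]

dc1_names = slice_by_first_component(names, "dc1")
print(dc1_names)
```

Because the names are kept in sorted order, everything sharing a leading component sits in one contiguous run, which is what makes a range read over a prefix cheap.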

Data Model
Wide rows with sharding
Row Key = “<min>|<part#>”
Role of partition #:
•  Each row is stored on a single server. At 5,000 x 60 = 300,000 events per minute, a row keyed by minute alone would put the whole load for that minute on a single server.
•  A “partition” contraption aims to “break” this huge row, remove hotspots and spread the load to possibly all servers.
•  The # of partitions is some multiple of the # of servers.
•  A finite # of partitions keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch records for them.
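The key scheme described above can be sketched as follows. This is a minimal illustration, not the presenters' code: the partition count of 12, numbering partitions from 1, random partition choice on write, and the function names are all assumptions. The key layout follows the sample keys shown in the deck (for example `2012-07-18-08-13-p-1`).

```python
import random
from datetime import datetime

# Assumed value: the slides only say "some multiple of the # of servers".
NUM_PARTITIONS = 12

def minute_bucket(ts: datetime) -> str:
    """Minute-granularity prefix of the row key."""
    return ts.strftime("%Y-%m-%d-%H-%M")

def write_key(ts: datetime, rng=random) -> str:
    """Pick one partition at random so writes for a minute spread across rows."""
    part = rng.randrange(1, NUM_PARTITIONS + 1)
    return f"{minute_bucket(ts)}-p-{part}"

def read_keys(ts: datetime) -> list:
    """Enumerate every row key for a minute; possible because partitions are finite."""
    bucket = minute_bucket(ts)
    return [f"{bucket}-p-{p}" for p in range(1, NUM_PARTITIONS + 1)]

ts = datetime(2012, 7, 18, 8, 13, 45)
print(write_key(ts))
print(read_keys(ts)[:3])
```

A reader fetching one minute of events issues a multi-row read over `read_keys(ts)` and merges the results, trading a slightly wider read for evenly spread writes.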

Composite Columns
•  Composite Columns:
–  Actual message stored as part of composite column
•  Variable granularity grouping
–  Minute: Row key based on minute
Min_partition (TEXT)   | DC:TimeUUID:UserID:Message (Composite) | …
2012-07-18-08-13-p-1   | Status                                 | …
…                      | …                                      |
2012-07-19-11-21-p-3   | Status                                 |

Geo-Redundancy
•  Data Center 1 (RW)
•  Data Center 2 (RW)
•  Data Center 3 (RO)
•  Data Center 4 (RO)

Data Consolidation and Extraction
•  Single view of data across multiple locations
•  Data extraction can be performed in parallel
•  Data extraction is performed in a
dedicated cluster of machines.

Low-Latency & Batch Applications
•  Triaging
–  Troubleshooting customer issues within 10 minutes of
occurrence
–  Feeding a dashboard of live feed data through
aggregations performed in Counter CFs
•  Analysis
–  Analytical and ad hoc queries to eventually replace the need
for a remote data warehouse
–  Map/Reduce via Hive without ETL
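The dashboard feed from counter column families can be pictured with the sketch below. It uses an in-memory `collections.Counter` as a stand-in for counter columns; the bucket layout, the `dc` and `status` dimensions, and the function names are assumptions made for illustration, not details given in the talk.

```python
from collections import Counter
from datetime import datetime

# In-memory stand-in for a counter column family: key -> running count.
counters = Counter()

def record(ts: datetime, dc: str, status: str) -> None:
    """Increment the per-minute counters touched by one event."""
    minute = ts.strftime("%Y-%m-%d-%H-%M")
    counters[(minute, "dc", dc)] += 1
    counters[(minute, "status", status)] += 1

def dashboard_row(minute: str) -> dict:
    """Read back every counter for one minute, as a dashboard tile might."""
    return {key[1:]: count for key, count in counters.items() if key[0] == minute}

# Hypothetical events.
record(datetime(2012, 7, 18, 8, 13, 5), "dc1", "OK")
record(datetime(2012, 7, 18, 8, 13, 40), "dc2", "ERROR")
record(datetime(2012, 7, 18, 8, 14, 2), "dc1", "OK")

print(dashboard_row("2012-07-18-08-13"))
```

Aggregating at write time keeps dashboard reads to a handful of small lookups instead of a scan over raw events.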

Opportunities Remaining
•  Near real-time pattern detection and
response
•  Message loss in JMS queue
•  JMS queue replication
•  Reducing the impact of queue failover on
other applications

Further Improvements…
HOW ???

Accenture Cloud Platform
Recommender as a Service
…
Network Analytics Services
Big Data Platform

Drivers
consumer devices
video usage
Issues
Operational Costs
Understanding service quality degradation
Inefficient capacity planning

INGEST
PROCESS

VISUALIZE

ANALYZE

STORE

What do we need?
Scalability
Reliability
Fault-tolerance
Data types, size, velocity
Mission critical data
Processing, computation, etc.
Time series / pattern analysis
Multiple use cases

How do we get this from Storm?
Processing guarantees
Low-level primitives
Parallelization
Robust fail-over strategies
Scalability
Reliability
Fault-tolerance
Processing, computation, etc.

•  Stream: Request info (IP, user-agent, etc)
•  Spout: Pull messages from distributed queue
•  Bolt: Sessionization, speed calculation
•  Topology: Suboptimal network speed, geospatial analysis
[Diagram: a stream drawn as a sequence of tuples]
Integration with Cassandra
•  Cassandra
–  Optimal for time series data
–  Near-linear scalability
–  Low read/write latency
–  Scales in conjunction with Storm
•  Custom Bolt
–  Uses Hector API to access Cassandra
–  Creates dynamic columns per request
–  Stores relevant network data

SUBOPTIMAL NETWORK SPEED TOPOLOGY: AN EXAMPLE

Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra
[Diagram: tuples labelled (ip 1) shown at each stage]

Parallelism
Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra
[Diagram: tuples labelled (ip 1) and (ip 2) shown at each stage]
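One way to picture per-IP parallelism is to route every tuple for a given IP to the same worker, so that per-IP state such as a running speed never has to be shared. The sketch below does this with a stable hash, similar in spirit to what Storm calls a fields grouping; the slides do not say which grouping was used, and the task count, names and sample speeds are assumptions.

```python
import zlib
from collections import defaultdict

NUM_TASKS = 3  # assumed number of parallel instances of the "per IP" stage

def task_for(ip: str, num_tasks: int = NUM_TASKS) -> int:
    """Stable hash of the grouping field, so one IP always maps to one task."""
    return zlib.crc32(ip.encode("utf-8")) % num_tasks

# Hypothetical (ip, speed) tuples arriving interleaved.
tuples = [("ip1", 120), ("ip2", 80), ("ip1", 100), ("ip2", 60), ("ip1", 140)]

# task index -> ip -> speeds seen by that task
per_task = defaultdict(lambda: defaultdict(list))
for ip, speed in tuples:
    per_task[task_for(ip)][ip].append(speed)

# Each task can average its own IPs without coordinating with other tasks.
averages = {
    ip: sum(speeds) / len(speeds)
    for task_state in per_task.values()
    for ip, speeds in task_state.items()
}
print(averages)
```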

Branching and Joins
Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Join → Compare Speed → Store in Cassandra → Cassandra
[Diagram: other labels are a second Kafka Spout, Speed by Location, Stream 1 and Stream 2; tuple labels are (ip 1), (ip 1/NY) and (NY)]
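A join of a per-IP speed stream with a per-location speed stream, followed by a comparison, might look like the sketch below. It is plain Python rather than Storm code, and the field names, the sample values and the rule "below half of the location's speed is suboptimal" are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

RATIO = 0.5  # assumed threshold: below half the location's speed counts as suboptimal

def join_and_compare(
    ip_tuples: Iterable[Dict],
    speed_by_location: Dict[str, float],
    ratio: float = RATIO,
) -> Iterator[Dict]:
    """Join each per-IP tuple to its location's speed, then flag slow ones."""
    for t in ip_tuples:
        baseline = speed_by_location.get(t["location"])
        if baseline is None:
            continue  # no location figure yet, so nothing to compare against
        yield {
            **t,
            "baseline_kbps": baseline,
            "suboptimal": t["kbps"] < ratio * baseline,
        }

# Hypothetical inputs: stream 1 carries per-IP speeds tagged with a location,
# stream 2 carries the speed observed for each location.
stream_1 = [
    {"ip": "ip1", "location": "NY", "kbps": 300.0},
    {"ip": "ip2", "location": "NY", "kbps": 900.0},
]
stream_2 = {"NY": 1200.0}

results = list(join_and_compare(stream_1, stream_2))
print(results)
```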

Lessons Learned
•  Rebalance Topology

•  Tweak parallelism in bolt

•  Isolation of Topologies

•  Use TimeUUIDUtils

•  Log4j level set to INFO by default
