Using a Fast Operational Database to Build Real-time Streaming Aggregations


© 2016 VoltDB PROPRIETARY
THE DATA-FICATION OF LIFE
•  It’s a data-intensive world
•  Your business is only as fast, as competitive as your database
The Trillion Device World: UC Berkeley Professor Vincentelli, Computerworld, September 2015

DATA IS TRANSFORMING BUSINESS
Big Data
Fast Data
“Perishable insights can have exponentially more value than after-the-fact traditional historical analytics.”
Mike Gualtieri, Principal Analyst, Forrester Research

VOLTDB: WE DON’T MAKE APPS, WE MAKE APPS…
Smarter
• Real-time intelligence and context for richer interactions
• Make different decisions on each individual event or person
• Analyze and act on streaming data
Faster
• 100X faster than traditional databases
• World record performance in the cloud (YCSB)
• Millisecond response time
• High-speed data ingestion
Simpler
• Simpler apps, easier to test and maintain
• Easier to program with SQL + Java
• Seamless ecosystem integration
• Data is always consistent and correct, never lost
10 Trillion Device World
100X Traditional DB
100% Consistent, Correct

FAST DATA APPLICATION REQUIREMENTS
Fast Data
Rapid Data Ingestion and Transformation
Streaming Analytics
-  Filtering
-  Windowing
-  Aggregation
-  Enrichment
-  Correlations
Operational Interaction/Transactions
-  Context-aware
-  Personal
-  Real-time
+
Big Data
Batch/Iterative Analytics
-  Statistical correlations
-  Multi-dimensional analysis
-  Predictive analytics

Fast Data = 1 Ingest + 2 Analyze + 3 Decide + 4 Export
[Diagram: Fast Data (ingestion and transformation; Streaming Analytics: Filtering, Windowing, Aggregation, Enrichment, Correlations; Operational Interaction/Transactions: Context-aware, Personal, Real-time) + Big Data (Batch/Iterative Analytics: Predictive analytics), with the stages numbered 1 to 4]

BUILDING FAST DATA APPLICATIONS
1.  Ingest: Unbounded Streams of Data
•  Stream data into an operational store
•  VoltDB has in-process (in database) importers
2.  Analyze: Operational Store processes data
•  Compute real-time analytics
3.  Decide: Make Per-event Decisions
•  Transactions
4.  Export: To historical data store
•  VoltDB has in-process Export connectors
•  Push data downstream to a “data lake”
•  For Historical Analysis/Machine Learning
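
The four stages above can be sketched in a few lines of Python. This is a toy, in-memory illustration and not VoltDB code: the event fields (`customer`, `bytes`), the `threshold` rule, and the `export_sink` list standing in for a data lake are all assumptions made up for the example.

```python
from collections import defaultdict


def run_pipeline(events, threshold=1000):
    """Toy ingest -> analyze -> decide -> export loop (illustrative only)."""
    totals = defaultdict(int)   # 2. Analyze: running aggregate per customer
    decisions = []              # 3. Decide: one decision per event
    export_sink = []            # 4. Export: stand-in for a downstream data lake

    for event in events:        # 1. Ingest: consume the stream one event at a time
        totals[event["customer"]] += event["bytes"]
        # Per-event decision based on the aggregate as of this event.
        over_limit = totals[event["customer"]] > threshold
        decisions.append((event["customer"], "throttle" if over_limit else "ok"))
        export_sink.append(event)   # raw event goes downstream for later analysis

    return dict(totals), decisions, export_sink


events = [
    {"customer": "a", "bytes": 600},
    {"customer": "b", "bytes": 100},
    {"customer": "a", "bytes": 500},
]
totals, decisions, exported = run_pipeline(events)
print(totals)
print(decisions)
```

The point of the sketch is that the decision for each event sees the aggregate updated by that same event, which is what distinguishes this pattern from after-the-fact batch analytics.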


Biography
-  Technical:
-  Started programming in 1985
-  Developed kernel apps like printer drivers and high-performance networking tools in C
-  MS in Electrical Engineering from the Technical University in Graz, Austria, in 1995
-  Filed for two patents for improving RDBMS performance, in 2005 (Symantec Corp) and 2008 (FOX News)
-  Hobbies:
-  Running (Marathons)
-  Photography
-  RC Airplanes
-  Electronics

Agenda
-  Vision
-  Technical requirements
-  System architecture
-  Why use VoltDB over HBase or Cassandra
-  VoltDB: things to consider when designing solutions
-  Conclusion
-  Resources

Vision
-  Building a real-time analytic engine for:
-  real-time diagnoses of our Edge Servers
-  MaxCDN-Predict
-  Elastic Provisioning
-  Improving Serving performance
-  Using this data to bill customers

Technical Requirements
-  The system should have the following features:
-  Horizontally scalable
-  Real-time: a 15-second SLA from the time content is served until it shows up in the aggregates
-  Zero production support:
-  Zero-touch crash recovery
-  No data clean-up/recovery required
-  Guaranteed no data loss
-  SQL interface for mining and drill-down
-  Ad hoc queries against the non-aggregated raw data

MaxCDN’s Lambda Architecture

System Architecture
-  When Nginx serves content, it logs the transaction.
-  These logs are streamed into the aggregation farm from around the world. We get ~32 TB of logs per day. This data gets pushed into 4 RabbitMQ queues.
-  A farm of 4 machines cleans up and pre-aggregates this data. They create batches of 70K raw-data rows along with the corresponding aggregates and push each batch into a RabbitMQ queue.
-  The VoltDB cluster runs with:
-  7 machines with k-factor=0
-  Sync logging mode for “no data loss”
-  48 SitesPerHost, for a total of 7*48 = 336 partitions
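
As a quick back-of-the-envelope check of the numbers on this slide, the sketch below recomputes the partition count and the average log rate. It assumes decimal terabytes and a perfectly even arrival rate, neither of which the slides state, and it measures raw logs before the pre-aggregation farm cleans them up, so it is an upper bound on what reaches VoltDB.

```python
HOSTS = 7
SITES_PER_HOST = 48
LOG_BYTES_PER_DAY = 32 * 10**12   # ~32 TB/day, assuming decimal terabytes
SECONDS_PER_DAY = 86_400

partitions = HOSTS * SITES_PER_HOST
avg_bytes_per_sec = LOG_BYTES_PER_DAY / SECONDS_PER_DAY

print(partitions)                         # 336
print(round(avg_bytes_per_sec / 10**6))   # average MB/s across the whole farm
```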

System Architecture
-  VoltDB clients read these batches from RabbitMQ and push the data into the 7-machine VoltDB cluster. They use VoltDB’s “hashinator” to group each batch so that an array of data goes into only “one procedure call per Table per Partition”. The clients guarantee batch-level atomic processing across 1680 (=5*336) VoltDB stored procedure calls.
-  Tables are maintained in a ring-buffer fashion.
-  We can only keep ~30 min of the most recent raw data.
-  In terms of its “no data loss” guarantee, the system behaves exactly like a distributed transactional RDBMS.
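
The “one procedure call per Table per Partition” idea amounts to grouping a batch by target partition before sending it. The sketch below illustrates that grouping only; it is not the VoltDB client API. The CRC32-based `partition_for` is a stand-in for VoltDB’s real hashinator (which uses its own algorithm), and the `zone_id` partitioning column is a hypothetical name.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 336  # 7 hosts * 48 sites per host, as in the slides


def partition_for(key: str) -> int:
    # Stand-in hash for illustration; not VoltDB's actual hashinator.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def group_by_partition(rows, key_field):
    """Return {partition_id: [rows]} so each partition receives one array."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_for(row[key_field])].append(row)
    return groups


# Hypothetical 70K-row batch keyed by an assumed "zone_id" column.
batch = [{"zone_id": f"zone-{i % 5000}", "bytes": i} for i in range(70_000)]
groups = group_by_partition(batch, "zone_id")

# One call per non-empty group, so at most NUM_PARTITIONS calls for this table.
print(len(groups), sum(len(g) for g in groups.values()))
```

With one such grouping per table, the number of calls per batch is bounded by tables times partitions, which matches the 5*336 = 1680 figure above if the batch touches five tables.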

System Architecture
-  Zero-touch crash recovery:
-  When VoltDB crashes:
-  Clients go into pause mode
-  Supervisord starts up the VoltDB cluster in recovery mode
-  When VoltDB clients or other components crash:
-  VoltDB clients and all the other critical components run under Supervisord, so they get restarted automatically
-  Completely transactional processing by combining:
-  VoltDB’s atomic processing at the stored procedure level
-  RabbitMQ’s replay guarantee
-  Idempotency

Why use VoltDB over HBase or Cassandra
-  Simply because of “multi-row WRITE atomicity”.
-  Multi-row WRITE atomicity results in much less CPU and I/O load as well as an easier implementation.
-  To make this clear, consider our use case of pushing batches of 70K raw logs into a storage system:
-  VoltDB:
-  VoltDB gives us atomicity at the stored procedure level. The current implementation pushes 70,000 rows into 336 partitions, so each stored procedure call writes 70,000/336 = ~208 rows into the rawlogs table. For these 208 rows, we add a single row to the TX table with the batch-id of this batch.
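
The pattern described above, many rows plus one batch marker written atomically, is what makes replayed batches harmless. The sketch below shows the idea with Python’s built-in `sqlite3` as a stand-in for a VoltDB stored procedure; the table layouts and the `apply_batch` helper are assumptions for illustration, not the author’s schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rawlogs (line TEXT)")
conn.execute("CREATE TABLE tx (batch_id TEXT PRIMARY KEY)")


def apply_batch(conn, batch_id, rows):
    """Insert all rows plus one TX marker atomically; skip if already applied."""
    with conn:  # commits on success, rolls back if an exception is raised
        seen = conn.execute(
            "SELECT 1 FROM tx WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if seen:
            return False  # replayed batch: already applied, nothing to do
        conn.executemany("INSERT INTO rawlogs (line) VALUES (?)", [(r,) for r in rows])
        conn.execute("INSERT INTO tx (batch_id) VALUES (?)", (batch_id,))
        return True


rows = [f"log line {i}" for i in range(208)]
first = apply_batch(conn, "batch-1", rows)
replay = apply_batch(conn, "batch-1", rows)   # e.g. the queue redelivers the batch
count = conn.execute("SELECT COUNT(*) FROM rawlogs").fetchone()[0]
print(first, replay, count)
```

Because the marker and the rows commit together, a single existence check per call is enough to decide whether a redelivered batch still needs to be applied.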

Advantage of Multi-Write Atomicity

Why use VoltDB over HBase or Cassandra
-  HBase:
-  HBase only offers single-row atomicity. Assume the same 336 partitions: with HBase, we have to include the batch-id in each row, so we write the batch-id 208 times instead of once. When we apply the batch, we have to go through 208 “IF statements”, one per row, and apply each row only if needed. This means much higher CPU, I/O, and space requirements.
-  If the batch size grows from 70K to 140K, these 208 WRITEs and “IF statements” grow to 416.
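
The arithmetic behind the 208 and 416 figures can be reproduced directly. This small sketch only restates the slide’s comparison: with multi-row atomicity there is one batch marker per partition call, while with single-row atomicity the marker and the check are repeated per row. The function name is made up for the example.

```python
PARTITIONS = 336


def rows_per_partition(batch_size, partitions=PARTITIONS):
    # Integer part of the even split, matching the slide's ~208 and 416.
    return batch_size // partitions


for batch_size in (70_000, 140_000):
    n = rows_per_partition(batch_size)
    multi_row_markers = 1    # one TX row per stored procedure call
    single_row_markers = n   # batch-id stored and checked on every row
    print(batch_size, n, multi_row_markers, single_row_markers)
```

The per-call marker cost stays constant under multi-row atomicity, while the per-row cost scales linearly with the batch size.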

VoltDB, Things to Consider when Designing Solutions
-  Good things:
-  SQL interface, unlike Trident or Spark Streaming
-  Merges the good things of the old world, like SQL and transactions, with the good things of the new world, like ‘no-locks’, ‘k-factor’ HA, etc.
-  Very simple and intuitive API and usage
-  k-factor + logs + snapshots eliminate the need to back up the system.
-  Fast query performance
-  Horizontal scalability

VoltDB, Things to Consider when Designing Solutions
-  Each partition has only one thread of execution for INSERT/UPDATE.
-  Workarounds:
-  Get faster CPUs
-  Pre-process the data outside VoltDB
-  The maximum amount of data coming out of a partition is limited to 50 MB.
-  Workarounds:
-  Make sure no relevant query has a qualifying result set bigger than 50 MB for any partition
-  The more partitions, the better
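
One way to reason about the 50 MB limit is to estimate the per-partition result size before running a query. The sketch below is a rough sizing aid under stated assumptions: it treats 50 MB as 50 * 1024 * 1024 bytes, assumes rows are spread evenly across partitions, and uses made-up row counts and row sizes; the helper names are hypothetical.

```python
LIMIT_BYTES = 50 * 1024 * 1024   # 50 MB limit from the slide (binary MB assumed)


def fits_partition_limit(rows_per_partition, avg_row_bytes, limit=LIMIT_BYTES):
    """True if one partition's share of the result stays within the limit."""
    return rows_per_partition * avg_row_bytes <= limit


def min_partitions(total_rows, avg_row_bytes, limit=LIMIT_BYTES):
    """Smallest partition count keeping an evenly spread result under the limit."""
    total_bytes = total_rows * avg_row_bytes
    return -(-total_bytes // limit)   # ceiling division


print(fits_partition_limit(208, 200))
print(min_partitions(10_000_000, 200))
```

This is why more partitions help: for a fixed result size, each partition’s share shrinks as the partition count grows.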

Conclusion
-  VoltDB merges the good things of the old world and the new world.
-  It provides an easy and scalable solution for real-time streaming aggregation.
-  Like any other tool, it has some limitations that need to be taken into account when designing a solution.

Resources
-  VoltDB Docs: https://docs.voltdb.com/
-  Lambda Architecture: https://voltdb.com/blog/simplifying-complex-lambda-architecture
-  Lambda Architecture: http://lambda-architecture.net/
-  Storm/Trident: http://storm.apache.org/documentation/Trident-tutorial.html
-  Spark Streaming: http://spark.apache.org/streaming/
I am available by email: bpirvali@gmail.com
