USING A FAST OPERATIONAL
DATABASE TO BUILD REAL-TIME
STREAMING AGGREGATIONS
© 2016 VoltDB PROPRIETARY
•  It’s a data-intensive world
•  Your business is only as fast, as
competitive as your database
The Trillion Device World
UC Berkeley Professor Vincentelli,
Computerworld, September 2015
THE DATA-FICATION OF LIFE
Big Data
“Perishable insights can have exponentially more value than
after-the-fact traditional historical analytics.”
Mike Gualtieri, Principal Analyst, Forrester Research
Fast Data
DATA IS TRANSFORMING BUSINESS
VOLTDB: WE DON’T MAKE APPS, WE MAKE APPS…
• Real-time intelligence and context for richer interactions
• Make different decisions on each individual event or person
• Analyze and act on streaming data
• 100X faster than traditional databases
• World record performance in the cloud (YCSB)
• Millisecond response time
• High-speed data ingestion
• Simpler apps, easier to test and maintain
• Easier to program with SQL + Java
• Seamless ecosystem integration
• Data is always consistent and correct, never lost
Smarter
Faster
Simpler
10 Trillion Device World
100X Traditional DB
100% Consistent, Correct
FAST DATA APPLICATION REQUIREMENTS
Big Data
Batch/Iterative Analytics
-  Statistical correlations
-  Multi-dimensional analysis
-  Predictive analytics
+
Fast Data
Rapid Data Ingestion and Transformation
Streaming Analytics
-  Filtering
-  Windowing
-  Aggregation
-  Enrichment
-  Correlations
Operational Interaction/Transactions
-  Context-aware
-  Personal
-  Real-time
FAST DATA APPLICATION REQUIREMENTS
Fast Data = Ingest (1) + Analyze (2) + Decide (3) + Export (4)
[Diagram: the Fast Data and Big Data boxes from the previous slide, labeled with steps 1 to 4.]
BUILDING FAST DATA APPLICATIONS
1.  Ingest: Unbound Streams of Data
•  Stream data into an operational store
•  VoltDB has in-process (in database) importers
2.  Analyze: Operational Store processes data
•  Compute Real-time analytics
3.  Decide: Make Per-event Decisions
•  Transactions
4.  Export: To historical data store
•  VoltDB has in-process Export connectors
•  Push data downstream “data lake”
•  For Historical Analysis/Machine Learning
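The four steps above can be sketched as a toy pipeline in plain Python. This is only an illustration under stated assumptions: the event fields (`customer`, `bytes`), the `LIMIT_BYTES` threshold, and the in-memory `store` and `data_lake` are hypothetical stand-ins, not VoltDB's importer or export APIs.

```python
from collections import defaultdict

# Hypothetical threshold used by the per-event decision step.
LIMIT_BYTES = 1000

def run_pipeline(events):
    """Toy ingest -> analyze -> decide -> export loop over an event stream."""
    store = defaultdict(int)   # stand-in for the operational store
    decisions = []             # one decision per event
    data_lake = []             # stand-in for the downstream historical store
    for event in events:                                  # 1. Ingest
        store[event["customer"]] += event["bytes"]        # 2. Analyze: running aggregate
        over = store[event["customer"]] > LIMIT_BYTES     # 3. Decide per event
        decisions.append((event["customer"], "throttle" if over else "ok"))
        data_lake.append(event)                           # 4. Export
    return dict(store), decisions, data_lake

events = [
    {"customer": "a", "bytes": 600},
    {"customer": "b", "bytes": 100},
    {"customer": "a", "bytes": 500},
]
totals, decisions, lake = run_pipeline(events)
print(totals)
print(decisions)
```

The point of the sketch is that the decision in step 3 reads the aggregate updated in step 2 for the same event, which is what doing all of the work inside one operational store makes possible.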
  
	
  
Biography
-  Technical:
-  Started programming in 1985
-  Developed kernel apps like printer drivers and
high-performance networking tools in C
-  MS in Electrical Engineering from the Technical
University in Graz, Austria, in 1995
-  Filed for two patents for improving RDBMS
performance in 2005 (Symantec Corp) and 2008
(Fox News)
-  Hobbies:
-  Running (Marathons)
-  Photography
-  RC Airplanes
-  Electronics
Agenda
-  Vision
-  Technical requirements
-  System Architecture
-  Why use VoltDB over HBase or Cassandra
-  Things to consider when designing
solutions with VoltDB
-  Conclusion
-  Resources
Vision
-  Building a real-time analytic engine for:
-  real-time diagnoses of our Edge Servers
-  MaxCDN-Predict
-  Elastic Provisioning
-  Improving Serving performance
-  Using this data to bill customers
Technical Requirements
-  The system should have the following features:
-  Horizontally scalable
-  Real-time: a 15-second SLA from the time content is served until it shows up
in the aggregates.
-  Zero production support:
-  Zero-touch crash recovery
-  No data clean-up/recovery required
-  Guaranteed no data loss
-  SQL interface for mining and drill-down
-  Ad hoc queries on the non-aggregated raw data
MaxCDN’s Lambda Architecture
System Architecture
-  When Nginx serves content, it logs the transaction.
-  These logs are streamed into the aggregation farm from around the world. We
get ~32 TB of logs per day. This data is pushed into 4 RabbitMQ queues.
-  A farm of 4 machines cleans up and pre-aggregates this data. They create
batches of 70K raw-data rows along with the corresponding aggregates and push each batch into a
RabbitMQ queue.
-  The VoltDB cluster runs with:
-  7 machines at k-factor=0
-  Synchronous logging mode for “no data loss”
-  48 SitesPerHost, for a total of 7*48 = 336 partitions.
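As a rough back-of-the-envelope check of the figures above, the short sketch below converts the daily log volume into an average rate and recomputes the partition count. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load over the day, neither of which is stated on the slide.

```python
# Figures quoted on the slide.
LOGS_TB_PER_DAY = 32
HOSTS = 7
SITES_PER_HOST = 48

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
mb_per_second = LOGS_TB_PER_DAY * 10**12 / SECONDS_PER_DAY / 10**6

# With k-factor = 0 there are no replica copies, so every site is a distinct partition.
partitions = HOSTS * SITES_PER_HOST

print(f"average raw log rate: {mb_per_second:.0f} MB/s")
print(f"partitions: {partitions}")
```

Under these assumptions the farm sees roughly 370 MB of raw logs per second on average, before clean-up and pre-aggregation.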
System Architecture
-  VoltDB clients read these batches from RabbitMQ and push the data into the 7-machine VoltDB
cluster. They use VoltDB’s “hashinator” to group the rows so that each array
of data goes out as only “one procedure call per table per partition”. These clients
guarantee batch-level atomic processing across 1,680 (= 5*336) VoltDB stored
procedure calls.
-  Tables are maintained in a ring-buffer fashion.
-  We can keep only ~30 min of the most recent raw data.
-  In terms of its “no data loss” guarantee, the system behaves like a distributed
transactional RDBMS.
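The “one procedure call per table per partition” idea can be illustrated with a minimal Python sketch. The partitioner here is a plain CRC32-modulo stand-in and is an assumption for illustration only; it is not VoltDB's actual hashinator, and the row fields (`zone`, `bytes`) are hypothetical.

```python
import zlib
from collections import defaultdict

PARTITIONS = 336  # 7 hosts * 48 sites per host, from the slide

def partition_for(key: str) -> int:
    """Stand-in partitioner (CRC32 modulo); NOT VoltDB's actual hashinator."""
    return zlib.crc32(key.encode("utf-8")) % PARTITIONS

def group_by_partition(rows, key_field):
    """Bucket rows so each partition receives one array, i.e. one procedure call."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition_for(row[key_field])].append(row)
    return buckets

# Hypothetical raw-log rows; the field names are illustrative only.
batch = [{"zone": f"zone-{i % 1000}", "bytes": i} for i in range(70_000)]
buckets = group_by_partition(batch, "zone")

calls_for_table = len(buckets)  # at most one call per partition for this table
print(calls_for_table, sum(len(b) for b in buckets.values()))
```

Repeating this grouping for each of 5 tables gives at most 5*336 = 1,680 calls per batch, which matches the figure quoted above.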
System Architecture
-  Zero-touch crash recovery:
-  When VoltDB crashes:
-  Clients go into pause mode
-  Supervisord starts up the VoltDB cluster in recovery mode
-  When VoltDB clients or other components crash:
-  VoltDB clients and all the other critical components run under Supervisord, so they
are restarted automatically
-  Fully transactional processing by combining:
-  VoltDB’s atomic processing at the stored procedure level
-  RabbitMQ’s replay guarantee
-  Idempotency
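How atomic writes, queue replay, and idempotency can combine is sketched below in plain Python. This is an assumed design for illustration, not the presenter's code: the `PartitionStore` class and its `apply_batch` method are hypothetical, and a real deployment would implement the same check inside a VoltDB stored procedure.

```python
class PartitionStore:
    """Toy stand-in for one partition: a raw-log table plus a TX table of batch ids."""

    def __init__(self):
        self.rawlogs = []
        self.tx = set()

    def apply_batch(self, batch_id, rows):
        """Mimic an atomic stored procedure: write the rows and the TX marker
        together, or do nothing if this batch was already applied."""
        if batch_id in self.tx:
            return False          # replayed batch: skip, state is unchanged
        self.rawlogs.extend(rows)
        self.tx.add(batch_id)
        return True

store = PartitionStore()
first = store.apply_batch("batch-1", [{"bytes": 10}, {"bytes": 20}])
# Simulate the queue redelivering the same message after a crash.
replay = store.apply_batch("batch-1", [{"bytes": 10}, {"bytes": 20}])
print(first, replay, len(store.rawlogs))
```

Because a replayed batch is detected and skipped, redelivery from the queue cannot double-count rows, so at-least-once delivery is enough.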
Why Use VoltDB over HBase or Cassandra
-  Simply because of “multi-row WRITE atomicity”.
-  Multi-row WRITE atomicity results in much lower CPU and I/O load as well as an easier
implementation.
-  To make this clear, consider our use case of pushing our 70K-row batches of raw
logs into a storage system:
-  VoltDB:
-  VoltDB gives us atomicity at the stored-procedure level. The current implementation pushes 70,000
rows into 336 partitions, so each stored-procedure call writes 70,000/336 = ~208 rows into the
rawlogs table. For these 208 rows, we add a single row to the TX table carrying the batch-id of this
batch.
Advantage of Multi-Write Atomicity
Why Use VoltDB over HBase or Cassandra
-  HBase:
-  HBase offers only single-row atomicity. Assume we again have 336 partitions. With
HBase, we have to include the batch-id in every row, so we write the batch-id 208 times
instead of once. When we apply the batch, we have to go through 208 “IF statements”, one for
each row, and apply it if needed. This means much higher CPU, I/O, and space
requirements.
-  If the batch size grows from 70K to 140K, these 208 WRITEs and “IF statements” also grow
to 416.
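The arithmetic behind the VoltDB and HBase cases above can be checked with a few lines of Python. The function name `per_partition_costs` is made up for this sketch, and it counts only batch-id writes per partition; it does not model real CPU or I/O cost.

```python
PARTITIONS = 336

def per_partition_costs(batch_rows):
    """Return (rows per partition, batch-id writes with multi-row atomicity,
    batch-id writes with single-row atomicity) for one batch."""
    rows = batch_rows // PARTITIONS   # floor, matching the slide's ~208
    multi_row = 1                     # one TX-table row covers the whole call
    single_row = rows                 # batch-id repeated on every row
    return rows, multi_row, single_row

for size in (70_000, 140_000):
    print(size, per_partition_costs(size))
```

With multi-row atomicity the batch-id cost stays constant at one write per partition, while with single-row atomicity it grows linearly with the batch size (208 for 70K rows, 416 for 140K).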
VoltDB, Things to Consider when Designing Solutions
-  Good things:
-  SQL interface, unlike Trident or Spark Streaming
-  Merges the good things of the old world, like SQL and transactions, with the
good things of the new world, like ‘no-locks’ and ‘k-factor’ HA.
-  Very simple and intuitive API and usage
-  k-factor + logs + snapshots eliminate the need to back up the system.
-  Fast query performance
-  Horizontal scalability
VoltDB, Things to Consider when Designing Solutions
-  Each partition has only one thread of execution for INSERT/UPDATE.
-  Workarounds:
-  Get faster CPUs
-  Pre-process the data outside VoltDB
-  The maximum data coming out of a partition is limited to 50 MB.
-  Workarounds:
-  Make sure no relevant query has a qualifying result set larger than 50 MB on any
partition
-  The more partitions, the better
Conclusion
-  VoltDB merges the good things of the old world and the new world.
-  It provides an easy and scalable solution for real-time streaming aggregation.
-  Like any other tool, it has some limitations that need to be taken into account when
designing a solution.
Resources
-  VoltDB Docs: https://docs.voltdb.com/
-  Lambda Architecture:
https://voltdb.com/blog/simplifying-complex-lambda-architecture
-  Lambda Architecture: http://lambda-architecture.net/
-  Storm/Trident: http://storm.apache.org/documentation/Trident-tutorial.html
-  Spark Streaming: http://spark.apache.org/streaming/
I am available by email: bpirvali@gmail.com
