• Multicore is the future.
• CPUs are getting faster faster than cache, cache is getting faster faster
than RAM, and RAM is getting faster faster than disk.
• Specialized systems can be 10x better.
• Having lots of RAM isn’t weird.
• This “cloud” thing is a thing, but with lousy hardware.
Buffer Pool Management
Concurrency
Use Main Memory
Single Threaded
Waiting on users leaves CPU idle
Single-threaded doesn’t jibe with a multicore world
Waiting On Users
• External transaction control and performance are not
friends with each other.
• Use server side transactional logic.
• Move the logic to the data, not the other way around.
Using *All* The Cores
• Partitioning data is a requirement for scale-out.
• Single-threaded is desired for efficiency.
• Why not partition to the core instead of the node?
• Concurrency via scheduling, not shared memory.
• Partition Data: across a cluster, each partition running a pipeline of work.
• Serial Execution: do one thing after another, with no locking or latching.
• Never Block: run code next to the data so transactions never block on the
network; keep data in memory so they never block on disk.
Keep CPUs full of real work and you win.
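The "partition to the core" idea above can be sketched in plain Java: route each operation by its partition key to one of N single-threaded executors, so each partition's state is touched by exactly one thread and needs no locks. All names here (`PartitionedScheduler`, `increment`) are hypothetical, not VoltDB's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of "concurrency via scheduling, not shared memory": one
// single-threaded executor per partition. Work for a given key always lands
// on the same thread, so each partition's state needs no locks.
public class PartitionedScheduler {
    private final ExecutorService[] partitions;
    private final List<Map<String, Long>> state; // one private map per partition

    public PartitionedScheduler(int nPartitions) {
        partitions = new ExecutorService[nPartitions];
        state = new ArrayList<>();
        for (int i = 0; i < nPartitions; i++) {
            partitions[i] = Executors.newSingleThreadExecutor();
            state.add(new HashMap<>()); // only its own thread ever touches it
        }
    }

    // Route a partitioning key to the core that owns it.
    private int route(String key) {
        return Math.floorMod(key.hashCode(), partitions.length);
    }

    // "Transactions" are queued to the owning partition and run serially.
    public Future<Long> increment(String key) {
        final int p = route(key);
        return partitions[p].submit(() -> state.get(p).merge(key, 1L, Long::sum));
    }

    public void shutdown() {
        for (ExecutorService e : partitions) e.shutdown();
    }
}
```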
• Two kinds of tables, partitioned and replicated.
• Partitioned tables have a column partitioning key.
• Two kinds of transactions, partitioned and global.
• Partitioned transactions are routed to the data
partition they need.
• Global transactions can read and update all partitions.
• Partition to CPU cores. More machines = more cores.
• "Buddy up" cores across machines for HA.
• Fully synchronous replication within a cluster.
• Asynchronous replication across clusters (WAN).
• Partitioned workloads parallelize linearly.
Commodity Cluster Scale Out / Active-Active HA
Millions of ACID serializable operations per second
Synchronous Disk Persistence
Avg latency under 1ms; 99.999th percentile under 50ms.
Multi-TB customers in production.
Explaining the tech to people is not a good way to sell it.
• VoltDB is not just a traditional RDBMS
with some tweaks, sitting on RAM rather
than disk.
• VoltDB is weird and new and exciting,
and not compatible with Hibernate.
• VoltDB sounds like MemSQL, NuoDB,
Clustrix, HANA or whomever on first
blush, but has a really, really different
architecture.
• MySQL can’t do it.
• Analytic RDBMSs can’t do it.
• Hadoop can’t do it.
(BUT VoltDB isn’t a drop-in replacement for MySQL)
• No sprawling apps built with Hibernate.
• No websites where reads are 95%.
More on OLTP vs OLAP
• Nobody wants black-box state.
Real-time understanding has value.
• OLTP apps smell like stream processing apps.
• Processing and state management go well together.
• By following customers, we ended up with a fantastic
streaming analytics / stream processing tool.
• Strong consistency & transactions make streaming better.
What is Fast Data?
• Digital Ad Tech
• Smart Devices / IoT / Sensors
• Financial Exchange Streams
• Online Gaming
• High Write Throughput
• Partitionable Actions
• Global Live Understanding
• Long Term Storage in
HDFS or Analytic RDBMS
• Global Live Understanding
  Materialized views for live aggregation
  Special index support for materialized views
  Function-based indexes
  HTTP/JSON queries
• Long Term Storage in HDFS or Analytic RDBMS
  “Export” to HDFS, CSV, JDBC, specific systems
  Snapshot to CSV / HDFS
• Full SQL support for operating on JSON docs
• Looks like write only tables in your schema.
• Each is really a persistent message queue.
• Can be connected to consumers:
• HDFS, CSV, JDBC, HTTP, Vertica Bulk
What About Java?
Time For Implementation Choices
No external transaction control.
Multi-statement transactions use stored procedures.
Stored Procedure Needs
• Easy for us (VoltDB devs) to implement
Rules out bespoke options, like our own PL-SQL or DSL
• Not slow
Rules out Ruby, Python, etc…
• Can’t crash the system easily
Rules out C, C++, Fortran
• Familiar, or easy to learn if not
Rules out Erlang and some weird stuff
• Has to exist in 2008
Rules out Rust, Swift, Go
• Runs on Linux (in 2008)
Rules out .Net languages
We picked Java
• Once we picked Java as the user stored procedure language, we
decided to implement much of the system in Java.
• 2008 Java was much more appealing than 2008 C++ to write SQL
optimizers, procedure runtimes, transaction lifecycle management,
networking interfaces, and all the other trappings.
C++17 is much improved.
Rust is interesting.
Swift might be good in 2020.
2007 Conference on Analysis of Algorithms (AofA 07), DMTCS proc. AH, 2007, 127–146
“HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”
Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier
Algorithms Project, INRIA–Rocquencourt, F78153 Le Chesnay (France); LIRMM, 161 rue Ada, 34392 Montpellier (France)

“This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10⁹ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.”

“The purpose of this note is to present and analyse an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets and are usually massive streams (read-once sequences). This problem has received a great deal of attention over the past two decades, finding an ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed…”
A method of estimating cardinality with O(1) space/time.
blob = update(hashedVal, blob)
integer = estimate(blob)
A few kilobytes to get 99% accuracy.
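The update/estimate API above maps onto a minimal HyperLogLog, sketched here in plain Java. This is a toy, not VoltDB's implementation; `hash64` is a stand-in mixer based on the splitmix64 finalizer, and all names are illustrative.

```java
// Minimal HyperLogLog sketch: m = 2^p registers, standard error ~1.04/sqrt(m).
public final class Hll {
    static byte[] create(int p) { return new byte[1 << p]; }

    // blob = update(hashedVal, blob)
    static byte[] update(long hashedVal, byte[] regs) {
        int p = Integer.numberOfTrailingZeros(regs.length);
        int idx = (int) (hashedVal >>> (64 - p)); // first p bits pick a register
        long rest = hashedVal << p;               // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1; // position of first 1-bit
        if (rank > regs[idx]) regs[idx] = (byte) rank;
        return regs;
    }

    // integer = estimate(blob)
    static long estimate(byte[] regs) {
        int m = regs.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) { sum += Math.pow(2, -r); if (r == 0) zeros++; }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias-correction constant
        double e = alpha * m * m / sum;          // raw harmonic-mean estimate
        if (e <= 2.5 * m && zeros > 0)           // small-range correction
            e = m * Math.log((double) m / zeros);
        return Math.round(e);
    }

    // Stand-in 64-bit mixer (splitmix64 finalizer); any good hash works.
    static long hash64(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

With p = 12 (4096 one-byte registers, ~4KB), the standard error is about 1.6%, which matches the "few kilobytes for 99% accuracy" claim.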
Example: Unique Device ID
• appid = 87, deviceid = 12
• A mobile phone’s request is sent to
VoltDB to decide if it
should be let through.
• A single transaction looks at state
and decides if this call:
  is permitted under plan?
  has prepaid balance to cover?
• Recent activity for both numbers.
• 99.999% of txns
respond in 50ms.
Example: Micro Personalization
• User clicks a link on a website. This
sends a request to VoltDB.
• A VoltDB transaction scans
a table of rules and
checks which apply; the
transaction decides what
to show the user next.
• That decision is
exported to HDFS.
• Spark ML is used to look at
historical data in HDFS and
generate new rules.
• These rules are loaded into
VoltDB every few hours.
Cool Java Beneﬁt: Hot Swap Code
• Java classloaders are pretty cool.
• Where code needs to be dynamically changed, we set up one
custom classloader per thread.
• Transitioning to a new Jarﬁle can be done asynchronously.
• Happy to talk more about this.
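As a rough illustration of the idea (not VoltDB's actual mechanism), here is a minimal hot-swap: compile each version of a class with the JDK's in-process compiler and load it through a fresh `URLClassLoader`, so old versions become unreachable and can be collected. `HotSwapDemo` and `Proc` are hypothetical names.

```java
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class HotSwapDemo {
    // Compile `source` (which must declare `public class Proc`) into `dir`,
    // then load it in a fresh classloader and invoke its static run() method.
    public static int compileAndRun(Path dir, String source) throws Exception {
        Path file = dir.resolve("Proc.java");
        Files.writeString(file, source);
        // Requires a JDK (not a bare JRE); nonzero return = compile error.
        int rc = ToolProvider.getSystemJavaCompiler()
                .run(null, null, null, file.toString());
        if (rc != 0) throw new IllegalStateException("compile failed");
        // A fresh loader per version: once dropped, old classes can be GC'd.
        try (URLClassLoader loader =
                 new URLClassLoader(new URL[]{ dir.toUri().toURL() })) {
            Class<?> proc = Class.forName("Proc", true, loader);
            return (Integer) proc.getMethod("run").invoke(null);
        }
    }
}
```

Each call sees the latest compiled version of `Proc`, which is the same effect a per-thread classloader swap achieves without restarting the process.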
Cool Java Beneﬁt: Debuggers
• Can debug in Eclipse or other IDEs, stepping through user code.
First Java Problem: Heap
• We’re building an in-memory database.
• Users storing 128GB of data in memory isn’t crazy.
• 128GB Java heap is no fun. Very hard to avoid long GC pauses.
• Multi-lifecycle data is the worst possible case.
• Direct ByteBuffers
• Pooled Direct ByteBuffers
• A full persistence layer with good caching
(possibly even with ORM)
• Use a FOSS/COTS in-memory, in-process thingy
• Build your own storage layer in native code.
(a last resort)
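A minimal sketch of the "pooled direct ByteBuffers" option above, assuming a single fixed buffer size (`DirectBufferPool` is a hypothetical name, not a VoltDB class): the point is that the bulk of the data lives off-heap where the GC never scans it, and allocation cost is paid rarely.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedDeque;

// Allocate off-heap buffers once and recycle them, keeping large data out of
// the Java heap and away from the garbage collector.
public class DirectBufferPool {
    private final ConcurrentLinkedDeque<ByteBuffer> free = new ConcurrentLinkedDeque<>();
    private final int bufferSize;

    public DirectBufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    public ByteBuffer acquire() {
        ByteBuffer b = free.pollFirst();
        // Allocate off-heap only when the pool is empty.
        return (b != null) ? b : ByteBuffer.allocateDirect(bufferSize);
    }

    public void release(ByteBuffer b) {
        b.clear();          // reset position/limit for the next user
        free.offerFirst(b); // LIFO keeps recently-used buffers warm in cache
    }
}
```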
VoltDB C++ Storage Engine
• A class called VoltDBEngine manages tables, indexes, and hot snapshots.
• It accepts pseudo-compiled SQL statements and modifies or queries table data.
• Clearly defined interface over JNI.
• Java heap is 1–4GB; C++ stores up to 1TB.
How to Debug
• Abstract JNI interface and implement it over sockets
One mixed-lang process becomes two.
• Can use GDB/Valgrind/XCode/EclipseCDT/etc...
• If the problem only reproduces in JNI or in a distributed system, we
resort too often to printf / log4j.
• Goal is to keep C++ code as simple and task-focused as possible
so horrible native bugs are the exception, not the rule.
How to Proﬁle
• This is the big downside to non-trivial JNI.
• Much performance tuning is generic (auto-measure).
• oprofile/perf have recently gotten better at C++ under JNI.
• Sampling in Java gives the best results, though it’s less clear with many threads.
• Profiling one thread doesn’t always inform multi-thread performance.
• Profiling a release build is confusing; a debug build is off.
• Isolate and micro-benchmark/micro-profile if possible.
Related Problem: Serialization
• Subproblem: How do you represent a row-based, relational table in bytes?
• Subproblem: Best way to serialize POJOs?
How do you represent a row-based, relational table in bytes?
• Array of arrays of objects is often the wrong answer.
• We serialize rows by native type to a ByteBuffer with a binary header
format. Lazy de-serialization.
• Since we support variable-sized rows, we’ve made this buffer format variable-length.
• No great way to use a library like protobufs for this. Avro comes close?
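As a toy illustration of the approach (the real VoltDB wire format differs), here is a two-column row flattened into a `ByteBuffer` with a binary length header, read back lazily without materializing an array-of-arrays of objects. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Row layout: [int rowLength][long col0][int strLen][utf8 bytes col1]
public class RowBuffer {
    public static ByteBuffer writeRow(long id, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        int rowLen = 8 + 4 + nameBytes.length; // variable-sized rows supported
        ByteBuffer buf = ByteBuffer.allocate(4 + rowLen);
        buf.putInt(rowLen);
        buf.putLong(id);
        buf.putInt(nameBytes.length);
        buf.put(nameBytes);
        buf.flip();
        return buf;
    }

    // Lazy deserialization: pull one fixed-width column straight out of the
    // bytes, with no per-cell Object allocation.
    public static long readId(ByteBuffer row) {
        return row.getLong(4); // skip the 4-byte length header
    }

    public static String readName(ByteBuffer row) {
        int len = row.getInt(4 + 8);
        byte[] b = new byte[len];
        ByteBuffer dup = row.duplicate(); // don't disturb the shared position
        dup.position(4 + 8 + 4);
        dup.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }
}
```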
What about POJOs?
• java.io.Serializable is slow. Needs classloading.
• java.io.Externalizable is the right idea.
• VoltDB breaks fast serializing into two steps:
• How big are you?
• Flatten to this buffer (Externalizable-style)
• Preﬁx with type/length indicators when needed.
• Protobufs, Avro, Thrift, MessagePack, Parquet
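The two-step protocol above might look like this in outline. The interface and method names are made up for illustration (VoltDB's actual interfaces differ); the point is that objects first report their flattened size, then write themselves into a caller-provided buffer.

```java
import java.nio.ByteBuffer;

// Externalizable-style two-step serialization, as described above.
interface FastSerializable {
    int serializedSize();           // step 1: how big are you?
    void flattenTo(ByteBuffer buf); // step 2: flatten to this buffer
}

class Point implements FastSerializable {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public int serializedSize() { return 8; } // two 4-byte ints
    @Override public void flattenTo(ByteBuffer buf) { buf.putInt(x).putInt(y); }

    // Because sizes are known up front, one exact-sized buffer holds a whole
    // batch with no intermediate copies or reallocation.
    static ByteBuffer flattenAll(FastSerializable... items) {
        int total = 0;
        for (FastSerializable s : items) total += s.serializedSize();
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (FastSerializable s : items) s.flattenTo(buf);
        buf.flip();
        return buf;
    }
}
```

A type/length indicator would be prefixed per item when the reader can't know the concrete type, as the slide notes.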
OLTP Data Fits In Memory
• Memory is getting cheaper faster than OLTP data is growing.
• Need to split up your app though. Driven by scale pain.
• There is value in ridiculously consistent performance.
Cache + K/V
• Some apps have hot-cold patterns with lots of cold data.
• NVRAM is coming.
Recovery From Peers, Not Disk
• No disk persistence is a non-starter.
Disk. Check. Now?
• Recovery from peers is actually pretty cloud-friendly.
• Making all nodes identical, with the same failure and
replacement semantics, has been a big win.
Cluster Native and Commodity Boxes
VM and Cloud Friendly
• Clustering is hard. So, so hard. Especially if you care about consistency.
• Monitoring clusters is still something many users aren’t good at.
• Debugging clusters is hard, especially beyond key-value stores.
• Partitioning is getting easier to explain/sell thanks to NoSQL.
• I’m super skeptical about automated partitioners for operational work.
• Alternative is 1TB mega-machines? PCIe networks/fabrics?
• “What, you don’t have 1000 node clusters?”
External Transaction Control
is an Anti-Pattern
• Downside: We self-disqualify from all of the ORM apps out there.
• Upside: We self-disqualify from all of the ORM apps out there.
• Server-side logic is a really good fit for event processing and the
fast data apps we target.
Active-Active Intra-Cluster Through
Deterministic Logical Replication
• V1 used clock-sync to globally order transactions.
• Basing replication on clocks was a no-go unless you’re Google.
• Sync latency was too high.
• Now more like Raft.
• Traded a global pre-order for global post-order.
• Happy with where we ended up.
OLTP Doesn’t Need Long Transactions
• Big engineering wins from a single-threaded SQL execution engine.
• Lots of people want long transactions, though many apps do without.
• Drives us to integrate.
• Fat ﬁngers are problematic.
• Added ability to set timeouts globally or per-call.
• Biggest differentiator. Real transactions. Real throughput. Low Latency.
An OLTP-Focused Database
Needs Much Less SQL Support
• Always supported more powerful state
manipulation and queries than NoSQL.
• Always got compared to mature RDBMS.
• In 2014, our SQL got rich enough for us
to switch to offense.
Only took 6 years or so.
How Hard is Integration?
• I hate you Google.
• Integrating two things for one use case is easier
than integrating two things as a vendor.
• Using a vendor-supplied integration is often much
smarter than building your own.
Others are using it. The Vendor tests it. Etc…
Let’s Ingest from Kafka
Kafka → Kafkaloader → VoltDB
• Manage acks to ensure at-least-once delivery,
even when any of the three pieces fails in any order.
• One Kafka “topic” is routed to one table or one ingest procedure.
That middle guy was lame…
• VoltDB nodes are elected
leaders for Kafka topics.
• If any failure occurs on either side,
the cluster coordinates to resume work.
• Guarantees at-least-once delivery
when used correctly.
• Leverage ACID to get
idempotency, to get
effectively-exactly-once semantics.
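The "ACID gives idempotency gives effectively exactly once" chain can be sketched as follows: record the highest applied Kafka offset in the same transaction as the state change, so redelivered messages are detected and skipped. This standalone class stands in for what would be one database transaction; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// At-least-once delivery + idempotent apply = effectively exactly once.
public class IdempotentConsumer {
    private final Map<Integer, Long> lastApplied = new HashMap<>(); // partition -> offset
    private long total = 0;

    // In the database this would be one ACID transaction:
    // check the offset, apply the change, and record the offset atomically.
    public synchronized boolean apply(int partition, long offset, long amount) {
        long last = lastApplied.getOrDefault(partition, -1L);
        if (offset <= last) return false;   // duplicate delivery: ignore
        total += amount;                    // the actual state change
        lastApplied.put(partition, offset); // committed together with it
        return true;
    }

    public synchronized long total() { return total; }
}
```

Because the offset check and the update commit or abort together, a redelivery after a crash can never double-apply a message.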
Users want more!
• But what if data for
many tables shares a topic?
• What if message
content dictates how it
should be processed?
Integrations So Far
• OLAP, like Vertica and Netezza
• Generic, like JDBC and MySQL
• Kafka, RabbitMQ, Kinesis
• HDFS/Spark and Hadoop
• CSV and raw sockets
• HTTP APIs
• Various AWS things
• Wrap core VoltDB jar with python scripts.
Looks like Hadoop tools or Cassandra
• Wrap native libraries for Linux + macOS
in the jar
Stole this idea from libsnappy
• You can use one Jar in eclipse to test
Same jar as client lib
Same jar as JDBC driver
• Some apps have one client connection.
Some apps have 5000.
• Some clients are long lived.
Some are transient.
• VoltDB is often bottlenecked on # packets, not just throughput.
• Use NIO to handle worst case client load.
Small penalty when handling best case.
• One network for the highest-priority internal traffic, one for everything else.
• Use pooled direct ByteBuffers for network IO.
• Dedicated network threads (proportional to cores).
• Use ListenableFutureTasks for serialization in dedicated threads.
• Split NIO Selectors on many-core.
• Example SLA:
99.999% of txns return in 50ms
• Chief problems:
• Garbage collection
• Operational events
• Non-java compaction and cleanup
Ariel Weisberg at Strangeloop 2014
Latency: Mullet Revisited
• Data storage in C++
• Per-transaction stuff
• Config + other stuff that lasts a while
• Initiating a snapshot used to take 200ms. Better now.
• Failing a node gracefully used to take about 1s. Better now.
• Failing a node by cutting its cord can take even longer.
• Some operational events require restart.
• VoltDB scales well to 16 cores, then starts to scale less well to 32, and it’s not ideal at 64.
We have lots of thoughts about this and I could talk more about it.
Some of it is Java. Some not.
Customers don’t care much yet?
• Fragmentation in native memory allocation has been a big issue for us. It’s not much of a
Java issue, but is interesting.
• When to use an off the shelf tool vs when to roll own.
• We’ve run into people who are resistant to using JVM software or writing stored procs in Java.
• Kafka has 4 different popular versions. Had to use OSGi module loading. Ugh.
all images from wikimedia w/ cc license unless otherwise noted
• Please ask me questions now or later.
• Feedback on what was interesting,
helpful, confusing, or boring is ALWAYS welcome.
• Happy to talk about:
Systems software dev
Stuff I Don't Know
Stuff I Know
This Talk