At the StampedeCon 2013 Big Data conference in St. Louis, Adam Kocoloski, co-founder and CTO of Cloudant and an Apache CouchDB committer, presented "CouchDB at Its Core: Global Data Storage and Rich Incremental Indexing at Cloudant." Cloudant operates database clusters comprising 100+ nodes based on BigCouch, the company’s fork of CouchDB. Key elements of CouchDB’s design have proven instrumental to success at this scale, including version histories, append-only storage, and multi-master replication. In the talk, Kocoloski discusses lessons learned from running production CouchDB clusters bigger than many well-publicized Hadoop deployments, and how Cloudant’s experience at scale is informing development work on the next release of Apache CouchDB.
1. CouchDB at Its Core
Global Data Storage and Rich Incremental Indexing at Cloudant
Adam Kocoloski
StampedeCon 2013
2. What is Cloudant?
• Founded by “big data” scientists
• Particle physicists @ MIT analyzing petabytes of collider data
• Frustrated by inadequate tools, founders became experts in scaling CouchDB (“BigCouch”)
• Started Cloudant in 2008 as a managed data layer
• Premise: Apps should grow into their data layer, not out of it
• Built: Scalable, global, fault-tolerant data layer managed service
• Funded by Avalon, Devonshire (Fidelity), IQT, Rackspace, Samsung Ventures, Toba Capital, Y Combinator
3. Cloudant Overview
• Operational JSON document store
• Web service
• Advanced APIs
• Replication & Sync
• Full-text Search
• Geospatial
• Incremental MapReduce
• Scalable, Highly Available Performance
• Cross-data center data distribution & fail over
• Geo load balancing
• Multi-tenant and single-tenant clusters
• Monitoring, admin & dev dashboards
• Managed 24x7 by experts
5. Anatomy of the Cloudant Data Network
[Diagram labels]
• US-EAST “Node”: single-tenant cluster, multi-tenant cluster
• HTTP POST, GET,… {JSON doc}
• Edge Database Cluster
• Mobile Devices
• Filtered Replication & Sync
• Secondary Data Centers (for DR & distributed access): AP-JP, EU-NL
6. Horizontal Clustering Framework
How CouchDB Fits In
[Diagram labels]
• Developer APIs
• Visualization
• Lucene Search
• Chainable MapReduce
• Geospatial Indexing
• Management
• Monitoring
• IOQ: Prioritizing IO types; prevents “noisy neighbors” in multi-tenancy
• Fabric, Mem3, Rexi (Horizontal Clustering Framework): Clustering API, sharding, intra-cluster messaging
• Apache CouchDB: Docs (JSON, attachments); GET/PUT docs, Views, Replication…
• Geo-Load Balancing: Connects users to closest copy of data
• Dashboards: Monitoring, Admin, Development
7. Why CouchDB?
• Durable append-only storage engine
• Sequence tree enabling incremental processing of updates
• Data structures supporting eventual consistency
• Sophisticated replication & synchronization
The right primitives for a global data network
9. Append-only Storage
• Rewrite path to root in each index on document update
• Large sequential writes, smaller random reads
• Wasted space must be periodically vacuumed
• Disk is cheap
• SSD-friendly access pattern
• We build what we run ➜ we make things that are easy to run
• (We automated the heck out of the compactor)
This used to be controversial, now everyone does it
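The append-only idea can be illustrated with a minimal sketch. This is a toy model, not CouchDB's storage engine: the class name `AppendOnlyStore` is hypothetical, a plain dictionary stands in for the B-tree whose root path is rewritten on each update, and an in-memory `bytearray` stands in for the database file.

```python
import json


class AppendOnlyStore:
    """Toy append-only document store: an update never overwrites old bytes."""

    def __init__(self):
        self.log = bytearray()  # stands in for the database file
        self.index = {}         # doc id -> (offset, length) of the latest version

    def put(self, doc_id, doc):
        record = json.dumps({"id": doc_id, "doc": doc}).encode("utf-8")
        offset = len(self.log)
        self.log.extend(record)  # sequential write at the end of the file
        self.index[doc_id] = (offset, len(record))  # older versions become garbage

    def get(self, doc_id):
        offset, length = self.index[doc_id]  # one random read
        raw = bytes(self.log[offset:offset + length])
        return json.loads(raw.decode("utf-8"))["doc"]

    def compact(self):
        """Copy only live versions into a fresh log, reclaiming wasted space."""
        fresh = AppendOnlyStore()
        for doc_id in self.index:
            fresh.put(doc_id, self.get(doc_id))
        return fresh
```

Updating a document twice leaves the first version in the log as dead space until `compact()` is run, which is the "vacuum" step the slide refers to.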
12. Sequence Index
[Figure: sequence index with entries 1 foo, 2 bar, 3 baz, 4 bif; after bar is updated, its entry moves to 5 bar]
GET /db/_changes
{"seq":1, "id": "foo", "rev":"1-..."}
{"seq":3, "id": "baz", "rev":"1-..."}
{"seq":4, "id": "bif", "rev":"1-..."}
{"seq":5, "id": "bar", "rev":"2-..."}
OR
GET /db/_changes?since=4
{"seq":5, "id": "bar", "rev":"2-..."}
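The behaviour shown on this slide can be reproduced with a small in-memory model. This is a sketch under assumptions, not Cloudant's implementation: the class `SequenceIndex` and its methods are hypothetical names, and the revision strings are placeholders because the slide elides the real values.

```python
class SequenceIndex:
    """Toy sequence index: each document appears once, at its latest update."""

    def __init__(self):
        self.last_seq = 0
        self.by_seq = {}  # seq -> (doc_id, rev)
        self.by_id = {}   # doc_id -> seq of its most recent update

    def update(self, doc_id, rev):
        old_seq = self.by_id.get(doc_id)
        if old_seq is not None:
            del self.by_seq[old_seq]  # drop the superseded entry
        self.last_seq += 1
        self.by_seq[self.last_seq] = (doc_id, rev)
        self.by_id[doc_id] = self.last_seq

    def changes(self, since=0):
        """Return entries with seq greater than `since`, in sequence order."""
        return [
            {"seq": seq, "id": doc_id, "rev": rev}
            for seq, (doc_id, rev) in sorted(self.by_seq.items())
            if seq > since
        ]


index = SequenceIndex()
for doc_id, rev in [("foo", "1-x"), ("bar", "1-x"), ("baz", "1-x"),
                    ("bif", "1-x"), ("bar", "2-x")]:  # placeholder revs
    index.update(doc_id, rev)
```

After the second write to `bar`, `index.changes()` lists sequences 1, 3, 4, 5, and `index.changes(since=4)` lists only sequence 5, matching the two requests above.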
13. Sequence Index
• Index each document in order of most recent update
• Allows incremental, resumable processing in the background
• Originally, MapReduce views
• First class API endpoint ➜ DIY integrations (c.f. ElasticSearch)
• Lucene-based text search
• Geospatial indexes and querying
• First class internal service ➜ add additional consumers as need arises
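A consumer of the sequence index can be sketched as a loop that remembers the last sequence it handled. The function name `process_changes` and the list-of-dicts feed are assumptions made for illustration; a real consumer would read the feed over HTTP and persist its checkpoint.

```python
def process_changes(feed, checkpoint, handle):
    """Handle every row newer than `checkpoint`; return the new checkpoint.

    `feed` is a list of change rows ordered by "seq". Because the checkpoint
    is returned to the caller, an interrupted run can resume where it stopped.
    """
    for row in feed:
        if row["seq"] <= checkpoint:
            continue  # already processed in an earlier run
        handle(row)
        checkpoint = row["seq"]
    return checkpoint


feed = [{"seq": 1, "id": "foo"}, {"seq": 3, "id": "baz"}, {"seq": 4, "id": "bif"}]
seen = []
checkpoint = process_changes(feed, 0, seen.append)

# A later update arrives; only the new row is handled on the next pass.
feed.append({"seq": 5, "id": "bar"})
checkpoint = process_changes(feed, checkpoint, seen.append)
```

Each additional consumer (a view builder, a search indexer, a geospatial indexer) would keep its own checkpoint against the same feed.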
15. Eventual Consistency
• CAP theorem (Brewer)
• Often over-simplified
• I’ll offer my own oversimplification: “You must choose P”
• When faced with a network partition, you optimize for consistency or availability
• Cloudant is an ODS
• Availability is paramount
• Strong consistency across geographies introduces unacceptable latency*
✱ Unless you’re Google and you install atomic clocks in your data centers
16. Eventual Consistency: Hash Histories
• Multiple concurrent versions of data will happen
• Default strategy cannot be to discard user data
• Hash histories track versions of a document
• Baked into every document
• Think git
• Document versions derived from contents + edit history
• Same series of edits, applied in same order, yield same version ID
• History comparison detects divergences and how the versions fit into the “family tree”
[Figure: revision tree with revisions 1-5a4..., 2-ab6..., 3-085..., 3-f57..., 4-7ba..., 4-8bf..., 5-d4e...; the history branches at generation 3]
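The properties listed above can be illustrated with a short sketch. It is not CouchDB's actual revision algorithm: the hashing scheme (SHA-256 over a JSON encoding, truncated to 8 hex digits) and the names `next_rev`, `apply_edits`, and `common_ancestor` are assumptions chosen for illustration.

```python
import hashlib
import json


def next_rev(parent_rev, doc):
    """Derive a revision ID from the document contents and its parent revision."""
    generation = 1 if parent_rev is None else int(parent_rev.split("-")[0]) + 1
    payload = json.dumps([parent_rev, doc], sort_keys=True).encode("utf-8")
    return f"{generation}-{hashlib.sha256(payload).hexdigest()[:8]}"


def apply_edits(edits):
    """Return the revision history (root first) produced by a series of edits."""
    history, parent = [], None
    for doc in edits:
        parent = next_rev(parent, doc)
        history.append(parent)
    return history


def common_ancestor(history_a, history_b):
    """Return the last revision two histories share, or None if they share none."""
    ancestor = None
    for rev_a, rev_b in zip(history_a, history_b):
        if rev_a != rev_b:
            break
        ancestor = rev_a
    return ancestor


replica_1 = apply_edits([{"n": 1}, {"n": 2}])
replica_2 = apply_edits([{"n": 1}, {"n": 2}])  # same edits, same order
replica_3 = apply_edits([{"n": 1}, {"n": 3}])  # diverges at the second edit
```

Here `replica_1` and `replica_2` end up with identical histories, while `common_ancestor(replica_1, replica_3)` returns their shared first revision, the point where the "family tree" branches.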
21. Replication & Sync
• Not your RDBMS’ notion of replication
• Transfers updates from any source DB to any target DB
• Builds on earlier primitives
• Leverages sequence index to determine what’s changed
• Leverages hash histories to determine what’s missing on the target
• Critical “anti-entropy” element in clusters
• DBs are divided into partitions, copies of each partition are stored on multiple distinct nodes
• Partition copies replicate with each other to ensure that documents are durably stored and that consistency is achieved ... eventually
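How replication builds on the two earlier primitives can be sketched as follows. This is a simplified model, not the CouchDB replication protocol: the `Database` class and `replicate` function are hypothetical, revision IDs are opaque placeholder strings, and the change log keeps every write instead of one entry per document.

```python
class Database:
    """Toy database holding a set of revisions per document plus a change log."""

    def __init__(self):
        self.revs = {}     # doc_id -> set of revision IDs
        self.changes = []  # list of (seq, doc_id, rev), ordered by seq

    def write(self, doc_id, rev):
        known = self.revs.setdefault(doc_id, set())
        if rev in known:
            return False  # nothing new to store
        known.add(rev)
        self.changes.append((len(self.changes) + 1, doc_id, rev))
        return True


def replicate(source, target, checkpoint=0):
    """Copy revisions the target lacks; return the new source checkpoint."""
    for seq, doc_id, rev in source.changes:
        if seq <= checkpoint:
            continue  # the sequence index tells us what has changed
        if rev not in target.revs.get(doc_id, set()):
            target.write(doc_id, rev)  # the history tells us what is missing
        checkpoint = seq
    return checkpoint


a, b = Database(), Database()
a.write("foo", "1-aaa")  # placeholder revision IDs
b.write("bar", "1-bbb")
a_to_b = replicate(a, b)  # any source to any target
replicate(b, a)           # and back again
```

After both passes the two copies hold the same revisions. Running `replicate(a, b, a_to_b)` later transfers only writes made to `a` since the checkpoint, which is how partition copies can converge in the background.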
22. Why CouchDB Recap
• Durable append-only storage engine
• Sequence tree enabling incremental processing of updates
• Data structures supporting eventual consistency
• Sophisticated replication & synchronization
23. What’s Next?
• BigCouch ➜ CouchDB
• Cloudant will continue development under ASF umbrella
• Fewer code forks ➜ better velocity
• New CouchDB web UI “Fauxton”
• Better developer tooling for server-side code
• Plugins for Cloudant-specific functionality
• Cloudant is betting on data “at the edge”