Introducing Venice
A Derived Data Store for Batch, Streaming & Lambda Architectures
Felix GV
Engineer
Yan Yan
Engineer
Today’s Agenda
Introducing Venice
2:55 Intro
3:05 Venice
3:10 Architecture
3:20 Hybrid Stores
3:25 Conclusion
3:30 Q&A
Intro
Primary & Derived Data, Data Lifecycle, Voldemort, Venice
Kinds of Data
Primary Data
• Source of Truth
• Example use case:
• Profile
• Example systems:
• SQL
• Document Stores
• K-V Stores
Derived Data
• Derived by computing over primary data
• Example use case:
• People You May Know
• Example systems:
• Search Indices
• Graph Databases
• K-V Stores
Data Lifecycle
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps]
Data Lifecycle
[Diagram: Apps → Kafka → HDFS → Pig, Hive, Spark… → …]
Voldemort Read-Only
• Generates binary files on Hadoop
• Bulk loads data from Hadoop
• (in the background)
• Swaps new data when ready
• Keeps last dataset as a backup
• Allows quick rollbacks
Voldemort Read-Only
• At LinkedIn:
• ~1000 nodes
• > 500 stores
• > 240 TB refreshed / day
• > 600K QPS
Data Lifecycle
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps]
Data Lifecycle, Today
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps, now with Stream Processing added alongside the batch path]
How can we serve both batch and stream data?
Lambda Architecture
[Diagram: Kafka → Stream Processing → Speed Layer; Hadoop → Batch Processing → Bulk Store; the App reads from both the Speed Layer and the Bulk Store]
Downsides
Lambda Architecture
• Read path limited by:
• The slower of the two systems
• The less available of the two systems
• Extra application complexity
Lambda Architecture, v2
[Diagram: Kafka → Stream Processing and Hadoop → Batch Processing, with the App reading from a single serving path rather than two separate stores]
Venice
Design Goals, API, Features, Scale, Tradeoffs
Design Goals
Venice
• To replace Voldemort Read-Only
• Drop-in replacement
• More efficient
• More resilient
• More operable
• To enable new use cases “as a service”
• Nearline derived data
• Lambda Architecture
Read/Write API
Venice
• Derived data K-V store
• Single Get
• Batch Get
• High throughput ingestion from:
• Hadoop
• Samza
• Or both (hybrid)
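To make the read API above concrete, here is a minimal sketch of what a client for such a derived-data K-V store could look like. The interface, names, and use of CompletableFuture are illustrative assumptions, not the actual Venice client API.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a derived-data K-V read client, per the slide above.
// Names and signatures are illustrative only, not the real Venice client API.
public interface DerivedDataStoreClient<K, V> {

  // Single Get: fetch the value for one key, asynchronously.
  CompletableFuture<V> get(K key);

  // Batch Get: fetch many keys in one round trip; keys with no value
  // are simply absent from the returned map.
  CompletableFuture<Map<K, V>> batchGet(Set<K> keys);
}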
Features
Venice
• Dataset versioning
• Same semantics as Voldemort RO
• Bulk loads in the background
• Swapped in when ready
• Quick rollback
Features
Venice
• Avro schema evolution
• Service discovery via D2
• Helix cluster management
• Fully automatic replica placement
• Cluster expansion
• Self-healing
• Rack-awareness
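The Avro schema evolution feature listed above can be illustrated with plain Avro APIs: the sketch below checks that a new value schema can still read data written with the old one. The Profile schemas are made-up examples, not Venice internals.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
  public static void main(String[] args) {
    // Old value schema already registered for the store (illustrative example).
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
            + "{\"name\":\"firstName\",\"type\":\"string\"}]}");

    // New schema adds a field with a default, which keeps it backward compatible.
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
            + "{\"name\":\"firstName\",\"type\":\"string\"},"
            + "{\"name\":\"headline\",\"type\":\"string\",\"default\":\"\"}]}");

    // Can a reader using the new schema decode data written with the old schema?
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

    System.out.println("Compatibility: " + result.getType());
  }
}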
Scale
Venice
• Large scale
• Multi-Datacenter
• Multi-Cluster
• Run “as a service”
• Self-service onboarding
• Each cluster is multi-tenant
• Resource isolation
Tradeoffs
Venice
• All writes go through Kafka
• Scalable
• Burst tolerant
• Asynchronous
• No “read your writes” semantics
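As a rough illustration of this tradeoff, the sketch below assumes a plain Kafka producer as the nearline write path, with hypothetical broker, topic, and serializer settings: the send is acknowledged once Kafka has persisted the record, not once storage nodes have consumed it, which is why there is no read-your-writes guarantee.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AsyncWriteExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The write "succeeds" as soon as Kafka acknowledges the record...
      RecordMetadata meta = producer
          .send(new ProducerRecord<>("store_realtime_buffer", "member:42", "newValue"))
          .get();
      System.out.println("Acknowledged at offset " + meta.offset());

      // ...but storage nodes consume asynchronously, typically seconds later,
      // so a read issued right now may still return the previous value:
      // no "read your writes" guarantee.
    }
  }
}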
Architecture
Components, Global Replication, Kafka Usage
Components
Architecture
• Server Processes
• Storage Node
• Router
• Controller
• Libraries
• Client
• Hadoop to Venice Push Job
• Samza System Producer
Venice Components
[Diagram: the Hadoop Push Job and the Samza producer write data in; Storage Nodes host it; the Client reads through the Router; the Controller manages the cluster]
Global Replication
Architecture
• Voldemort pain point:
• Duplicated copies sent over the WAN
Global Replication
[Diagram: the Hadoop Push Job writes once into Kafka in the source datacenter; Kafka Mirror Makers replicate the topic across the datacenter boundary to each remote datacenter; Storage Nodes consume from their local Kafka; a Parent Controller coordinates the per-datacenter Controllers]
Metadata Replication
Architecture
• Admin operations performed on the parent
• Store creation/deletion
• Schema evolution
• Quota changes, etc.
• Metadata replicated via “admin topic”
• Resilient to transient DC failures
Kafka Usage
Architecture
• One topic per store-version
• Kafka is fully managed by the controller
• Dynamic topic creation/deletion
• Infinite retention
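To illustrate the one-topic-per-store-version idea, here is a small sketch using the standard Kafka AdminClient. The topic naming scheme, partition count, replication factor, and configs are assumptions for illustration, not Venice's actual settings.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class StoreVersionTopics {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker

    try (AdminClient admin = AdminClient.create(props)) {
      // New store-version pushed: create its dedicated topic with
      // unlimited retention (assumed naming scheme "storeName_v<N>").
      NewTopic v8 = new NewTopic("myStore_v8", 16, (short) 3)
          .configs(Map.of("retention.ms", "-1"));
      admin.createTopics(Collections.singleton(v8)).all().get();

      // Oldest version retired: its topic is deleted outright.
      admin.deleteTopics(Collections.singleton("myStore_v6")).all().get();
    }
  }
}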
Step 1/3: Steady State, In-between Bulk Loads
[Diagram: Hadoop (data source); store-version topics v6 and v7 (Kafka topics); the Router (Venice processes) serves reads from v7; the v6 topic is not consumed unless a failed replica is being restored]
Step 2/3: Offline Bulk Load Into New Store-Version
[Diagram: the Push Job writes from Hadoop into a new store-version topic (v8); reads keep being served from v7 through the Router; the v6 topic is still retained]
Step 3/3: Bulk Load Finished, Router Swaps to New Version
[Diagram: the v8 push is complete and the Router swaps reads from v7 to v8; the oldest version (v6) is retired]
Hybrid Stores
Overview, Data Merging
Overview
Hybrid Store
• Hybrid Stores aim to
• Merge batch and streaming data
• Not compromise read path performance
• Minimize application complexity
Data Merge
Hybrid Store
• Write-time merge
• All writes go through Kafka
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
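A minimal sketch of the buffer replay, assuming plain Kafka consumer and producer clients and made-up topic names: records written by Samza into the real-time buffer topic are copied into the current store-version topic, where storage nodes consume them alongside the batch data.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BufferReplay {
  public static void main(String[] args) {
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
    consumerProps.put("group.id", "buffer-replay");
    consumerProps.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    consumerProps.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    producerProps.put("value.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(Collections.singleton("myStore_rt_buffer"));  // assumed RTB topic name
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
          // Replay each real-time write into the current store-version topic,
          // so it is merged with the bulk-loaded data at write time.
          producer.send(new ProducerRecord<>("myStore_v8", record.key(), record.value()));
        }
      }
    }
  }
}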
Step 1/4: Steady State, In-between Bulk Loads
[Diagram: Hadoop and Samza (data sources); Kafka topics; reads are served from the current store-version (v7) through the Router]
Step 2/4: Offline Bulk Load Into New Store-Version
[Diagram: the Push Job writes from Hadoop into a new store-version topic (v8) while Samza keeps writing and v7 keeps serving reads through the Router]
Step 3/4: Bulk Load Finished, Start Buffer Replay
[Diagram: the bulk load into v8 is complete; the Real-Time Buffer is replayed into the v8 topic while v7 still serves reads through the Router]
Step 4/4: Replay Caught Up, Router Swaps to New Version
[Diagram: once the buffer replay has caught up, the Router swaps reads from v7 to v8; Hadoop and Samza remain the data sources]
Conclusion
Production Status, Killing Voldemort
Production Status
Conclusion
• Venice is running in production
• Batch stores since late 2016
• Hybrid stores since September 2017
Killing Voldemort
Conclusion
• Migration of Voldemort RO to Venice
• Tooling complete
• Seamless
• Starting now
Thank you
Backup Slides
Build & Push
[Diagram: Hadoop Cluster, Scheduled Job, Voldemort Cluster, Lag Measurement]


Editor's Notes

  • #17 Thanks Felix. I’m Yan from the Venice team, and I’m going to give you a brief introduction to Venice, our new-generation derived data platform, highlighting our design goals, the API and main features, the scale Venice has to support, and some tradeoffs we made when we designed the system.
  • #18 Let’s start with the design goals. Venice is the successor of Voldemort Read-Only, so it should be able to take over everything Voldemort Read-Only has been doing, but more efficiently, more resiliently, and with better operability. One important point is that we want Venice to be a drop-in replacement for Voldemort, because hundreds of users are living on Voldemort and all of their data will eventually be moved to Venice. Migrating that much data is not easy, which is why we have to make the migration as smooth as possible. Another main goal is that, in addition to offline derived data, Venice should also serve nearline derived data, and should be able to merge offline and nearline data together to give users a unified view of both in one system.
  • #19 All right, now that the goals are clear, let’s see which APIs and features Venice provides to achieve them. On the read path, you can think of Venice as a distributed key-value store, so single-get and batch-get APIs are required. On the write path, high-throughput ingestion from both Hadoop and Samza keeps the system efficient, because hundreds of terabytes of data are written into the system every day. Venice writes data from the data source asynchronously; I will explain the details when we talk about tradeoffs.
  • #20 The first feature I want to introduce is data versioning. Venice keeps multiple versions of your data. Once a user starts ingesting a new offline dataset, that data is written in the background without impacting the current version, which keeps serving read requests. When the new version is ready to serve, Venice does an atomic swap so that all read requests hit the new version instead. The whole process is almost transparent to the user: the only thing the user needs to do is start the ingestion, and Venice manages everything else. And in case you find your data has an issue, you can roll back quickly, just by telling Venice which version you want to use.
  • #21 On this slide I want to talk about three new features in Venice, all aimed at solving pain points we hit with Voldemort. First, Avro schema evolution allows users to update the schema of their data instead of creating a new store, which is what they had to do in Voldemort. Second is dynamic service discovery, built on top of D2, a dynamic discovery framework open-sourced by LinkedIn. On the client side, users do not need to specify which endpoint to talk to; Venice finds the proper server based on the store being used, and in case of a server failure, Venice routes to another server as the replacement, so the application can focus on its business logic. We also introduced Helix, the open-source cluster management framework widely used at LinkedIn, which provides fully automatic, rack-aware replica placement. On top of it we implemented zero-downtime cluster expansion and upgrades, a big availability improvement over Voldemort: users can continue their data ingestion during maintenance windows that used to take 2-3 hours each time.
  • #22 In terms of scalability, there are two meanings in Venice. First, we have to support large numbers of machines and large volumes of data: Venice runs across multiple datacenters on different continents, and can also run multiple clusters in one physical datacenter to get better resource utilization for different use cases. Second, we want to run Venice as a service, which means supporting a large number of users. That is why we provide self-service onboarding through our internal cloud management platform, so users can manage their stores without involving our SREs or developers. Since each cluster is a multi-tenant environment, users share resources like CPU and storage, so we introduced several forms of resource isolation, such as QPS quotas, storage quotas, and multiple clusters, to prevent users from impacting each other.
  • #23 On to tradeoffs. As I said, we need high-throughput ingestion to keep the system efficient, and the main tradeoff we made is how to ingest large amounts of data into Venice. Basically we had two options: fetch data from the data source directly, or write data into an intermediary and then fetch it from that intermediary asynchronously. We decided to make all writes go through Kafka: we write all data into Kafka first, and treat Kafka as the source of truth for Venice. Kafka is a scalable message queue with good support for high-throughput writes, so we can accept users’ data as fast as it comes. Remember that we bulk load from Hadoop, so we always face bursts of messages; with Kafka we are burst tolerant, because Kafka persists those messages first and Venice consumes them gradually, instead of running out of capacity when a large number of messages arrive in a short period. What we pay for this asynchronous push mechanism is that, in nearline cases, Venice does not provide “read your writes” semantics. Imagine your data has been written into Kafka and Venice reports that your write succeeded: you still cannot see the data you just wrote, because Venice normally takes several seconds to consume it and persist it locally before it becomes visible to clients. We think that is acceptable for most of our use cases, and we are also working on workarounds to support read-your-writes semantics on top of this push mechanism.
  • #24 Now I want to describe the architecture of Venice, to help you understand how we implement the features mentioned above. I’m going to introduce the main components in Venice and how they interact with each other, then jump into global replication, which lets us sync data and metadata across multiple datacenters, and finally explain how we use Kafka, because the way we use it is slightly different from the common case.
  • #25 There are two kinds of components in Venice. We have processes running on the server side: the storage node, which actually hosts the datasets; the router, which is the gateway of the cluster, so every request hits a router first and is then forwarded to the proper nodes; and the controller, which manages the whole cluster. The other kind of component is the libraries embedded in users’ applications. We have the client library for reading data from Venice, an H2V (Hadoop to Venice) push job plugin in Azkaban that lets users push offline derived data into Venice, and, for nearline derived data, a Samza system producer so users can push data from their Samza jobs into Venice as well.
  • #26 This diagram shows how the components interact with each other. The blue shapes are the Venice components we built and the gray shapes are the dependencies we rely on. Your data starts on a data source like Hadoop; to ingest it into Venice, you start a new push job and wait for it to complete. Underneath the job, the Venice controller creates all the essentials for you, such as a new data version and a new Kafka topic, and also picks the proper storage nodes to host your dataset. Those storage nodes consume the data written by your push job from Kafka in parallel and report status to the controller regularly. Once the controller has enough information and considers your data ready to serve, it notifies the routers to use the new data version; after the version swap, your data is visible to read clients. That is the whole lifecycle of an offline push job. For nearline derived data, Felix will give more details later.
  • #27 As I said, Venice is a planet-scale system, so each piece of data must be replicated to multiple datacenters located on different continents. In Voldemort, global replication means each replica in each datacenter reads its own copy of the data from the source datacenter, so duplicated copies are sent over the WAN, which eats a lot of overseas bandwidth and slows down the whole push.
  • #28 In Venice we built a new global replication mechanism: the push job writes data into Kafka in the source datacenter, and we rely on Kafka MirrorMaker to replicate each message to all target datacenters. Note that we only send one copy of the data to a remote datacenter; to keep enough replicas, multiple storage nodes consume that same copy from their local Kafka and persist it in their local storage engine. With this new global replication mechanism, we saved 40% of the time spent on the whole push job, and also reduced cross-datacenter network usage: depending on the replication factor, two thirds or half of the bandwidth cost is saved.
  • #29 Besides the data, we replicate our metadata across datacenters as well. There is a dedicated Kafka topic, called the “admin topic”, that we use to transfer metadata operations. When you create a store in the source datacenter, all target datacenters receive the admin message and execute the corresponding operation, creating the store there to keep metadata consistent. If an entire datacenter fails, we still have that data in Kafka, either in the source Kafka cluster or in the target Kafka cluster. Once the datacenter recovers, the admin messages are eventually consumed by the Venice controller running there and handled properly, so no manual operation is needed to deal with datacenter failures or metadata inconsistency.
  • #30 Unlike most Kafka use cases, we create one topic whenever a new data version is created, and delete that topic once the associated version is retired. All of this topic creation and deletion is dynamic and fully managed by the controller, so a topic here is no longer a pre-created resource with a long-term lifecycle.
  • #31 Imagine we already have two data versions in your store, v6 and v7. V7 is the current version serving read requests.
  • #32 Now you start a new push job, so Venice creates v8 for this push.
  • #33 Once the push job succeeds, v8 is ready to serve, so Venice swaps the current version from v7 to v8, and meanwhile retires the oldest version, v6, deleting the associated Kafka topic and the data persisted in the local storage engine. We still keep two versions of your dataset, and we have completed one round of version swap.
  • #34 All right, that’s all about the Venice architecture and the offline push job. I’ll hand it back to Felix to introduce more about our hybrid design. Thank you.