Introducing Venice - Strata NYC 2017

Introducing Venice
A Derived Data Store for Batch, Streaming & Lambda Architectures
Jeff Weiner
Chief Executive Officer
Felix GV
Engineer
Yan Yan
Engineer

Today’s Agenda
Introducing Venice
2:55 Intro
3:05 Venice
3:10 Architecture
3:20 Hybrid Stores
3:25 Conclusion
3:30 Q&A

Intro
Primary & Derived Data, Data Lifecycle, Voldemort, Venice

Kinds of Data
Primary Data
• Source of Truth
• Example use case: Profile
• Example systems: SQL, Document Stores, K-V Stores
Derived Data
• Derived from computing primary data
• Example use case: People You May Know
• Example systems: Search Indices, Graph Databases, K-V Stores

Data Lifecycle
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage]

Data Lifecycle
[Diagram: Apps → Kafka → HDFS → Pig, Hive, Spark… → …]

Voldemort
The data store who must not be named

Overview
Voldemort Read-Only
• Generates binary files on Hadoop
• Bulk loads data from Hadoop (in the background)
• Swaps new data when ready
• Keeps last dataset as a backup
• Allows quick rollbacks
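
The swap-and-rollback behavior listed above can be illustrated with a minimal sketch. This is not Voldemort code: the class and method names are hypothetical, and the "dataset" is an in-memory dict rather than binary files generated on Hadoop.

```python
class VersionedStore:
    """Toy read-only store that keeps the serving dataset plus one backup."""

    def __init__(self):
        self.current = {}    # dataset currently serving reads
        self.backup = None   # previous dataset, kept for rollback

    def swap(self, new_dataset):
        # Assumes new_dataset was fully loaded in the background beforehand.
        self.backup = self.current
        self.current = new_dataset

    def rollback(self):
        # Go back to the last dataset without reloading anything.
        if self.backup is None:
            raise RuntimeError("no backup dataset to roll back to")
        self.current, self.backup = self.backup, None

    def get(self, key):
        return self.current.get(key)


ro_store = VersionedStore()
ro_store.swap({"member:1": ["a", "b"]})   # first bulk load
ro_store.swap({"member:1": ["c"]})        # second bulk load replaces it
ro_store.rollback()                       # bad data? revert to the backup
```

Because the backup is already in place, the rollback is a pointer swap rather than a new bulk load, which is what makes it quick.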

Scale
Voldemort Read-Only
• At LinkedIn:
• ~1000 nodes
• > 500 stores
• ~ 1 PB of SSD storage
• > 240 TB refreshed / day
• ~ 500K queries / second

Data Lifecycle
Today
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage, plus Stream Processing]

How can we serve both batch and stream data?

Lambda Architecture
[Diagram: Kafka → Stream Processing → Speed Layer → App; Hadoop → Batch Processing → Bulk Store → App]

Downsides
Lambda Architecture
• Read path limited by:
• Slowest of two systems
• Least available of two systems
• Extra application complexity
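
A small arithmetic sketch of the read-path limits above. It assumes the application must query both the speed layer and the bulk store on every read and that the two systems fail independently. The function name and the numbers are hypothetical.

```python
def lambda_read_path(latency_ms_a, latency_ms_b, avail_a, avail_b):
    """Latency and availability of a read that needs both systems."""
    # The read completes only when the slower system has answered.
    latency = max(latency_ms_a, latency_ms_b)
    # Both systems must be up; with independent failures the
    # availabilities multiply, so the result is never better than
    # the least available system.
    availability = avail_a * avail_b
    return latency, availability


read_latency, read_availability = lambda_read_path(5.0, 20.0, 0.999, 0.995)
```

With these made-up inputs the read takes as long as the 20 ms system and is slightly less available than the 99.5% system.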

Lambda Architecture, v2
[Diagram: Kafka → Stream Processing; Hadoop → Batch Processing; App]

Venice
Design Goals, API, Features, Scale, Tradeoffs

Design Goals
Venice
• To replace Voldemort Read-Only
• Drop-in replacement
• More efficient
• More resilient
• More operable
• To enable new use cases “as a service”
• Nearline derived data
• Lambda Architecture

Read/Write API
Venice
• Derived data K-V store
• Single Get
• Batch Get
• High throughput ingestion from:
• Hadoop
• Samza
• Or both (hybrid)
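
To make the read side of this API concrete, here is a minimal sketch of a key-value client offering single get and batch get. The class name, method names and in-memory backing dict are assumptions made for illustration; they are not the Venice client API.

```python
from typing import Dict, Iterable, Optional


class DerivedDataClient:
    """Hypothetical read-only K-V client with single and batch lookups."""

    def __init__(self, data: Dict[str, bytes]):
        self._data = dict(data)  # stands in for the remote store

    def get(self, key: str) -> Optional[bytes]:
        # Single Get: one key in, one value (or None) out.
        return self._data.get(key)

    def batch_get(self, keys: Iterable[str]) -> Dict[str, bytes]:
        # Batch Get: many keys in, a map of the keys that were found.
        return {k: self._data[k] for k in keys if k in self._data}


kv_client = DerivedDataClient({"m1": b"pymk-1", "m2": b"pymk-2"})
single = kv_client.get("m1")
batch = kv_client.batch_get(["m1", "m2", "missing"])
```

There is no `put` on the client in this sketch because, per the slide, writes arrive through ingestion from Hadoop or Samza rather than through the read API.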

Features
Venice
• Dataset versioning
• Same semantics as Voldemort RO
• Bulk loads data in the background
• Swapped in when ready
• Quick rollback

Features
Venice
• Avro schema evolution
• Service discovery via D2
• Helix cluster management
• Fully automatic replica placement
• Cluster expansion
• Self-healing
• Rack-awareness

Scale
Venice
• Large scale
• Multi-Datacenter
• Multi-Cluster
• Run “as a service”
• Self-service onboarding
• Each cluster is multi-tenant
• Resource isolation

Tradeoffs
Venice
• All writes go through Kafka
• Scalable
• Burst tolerant
• Asynchronous
• No “read your writes” semantics
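
The tradeoff above can be seen in a toy model where a deque stands in for a Kafka topic and a dict stands in for a storage node. All names are hypothetical. The point is only that a write is acknowledged before it becomes readable.

```python
from collections import deque


class AsyncStore:
    """Toy store whose writes are buffered in a log before being applied."""

    def __init__(self):
        self.log = deque()  # stands in for a Kafka topic
        self.table = {}     # stands in for a storage node's local data

    def put(self, key, value):
        # Returns as soon as the record is in the log, not when it is readable.
        self.log.append((key, value))

    def get(self, key):
        return self.table.get(key)

    def consume(self):
        # The storage node drains the log at its own pace.
        while self.log:
            key, value = self.log.popleft()
            self.table[key] = value


async_store = AsyncStore()
async_store.put("k", "v1")
read_before_consume = async_store.get("k")  # write not visible yet
async_store.consume()
read_after_consume = async_store.get("k")   # visible once consumed
```

The log absorbs bursts of writes, and the price is that a reader cannot assume it will see its own write immediately.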

Architecture
Components, Global Replication, Kafka Usage

Components
Architecture
• Server Processes
• Storage Node
• Router
• Controller
• Libraries
• Client
• Hadoop to Venice Push Job
• Samza System Producer

Venice Components
[Diagram labels: Hadoop, Push Job, Samza, Controller, Storage Node, Router, Client]

Global Replication
Architecture
• In Voldemort RO:
• Each server pulls replicas redundantly
• Waste of WAN bandwidth

Global Replication
[Diagram labels: Hadoop, Push Job, Mirror Maker, Storage Nodes; Source DC, DC 1, DC 2, DC 3]
Single replica pushed across the WAN
Many replicas consumed locally

Global Replication
Architecture
• In Venice:
• Only one replica pushed across the WAN
• ~ 40% faster push time
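
A back-of-the-envelope sketch of the WAN savings. It assumes that every replica in a remote datacenter either pulls its own copy over the WAN (the Voldemort RO behavior described earlier) or that a single copy per remote datacenter crosses the WAN and is then consumed locally. The function name, dataset size, replica count and datacenter count are all hypothetical, and the sketch does not attempt to derive the ~40% push time figure.

```python
def wan_bytes(dataset_bytes, replicas_per_dc, remote_dcs, single_copy_per_dc):
    """Bytes sent over the WAN for one push of one dataset."""
    copies_per_dc = 1 if single_copy_per_dc else replicas_per_dc
    return dataset_bytes * copies_per_dc * remote_dcs


TB = 10**12
redundant_pull = wan_bytes(1 * TB, replicas_per_dc=3, remote_dcs=3,
                           single_copy_per_dc=False)
single_push = wan_bytes(1 * TB, replicas_per_dc=3, remote_dcs=3,
                        single_copy_per_dc=True)
```

Under these assumptions the WAN traffic drops by a factor equal to the replication factor.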

Metadata Replication
Architecture
• Admin operations performed on parent
• Store creation/deletion
• Schema evolution
• Quota changes, etc.
• Metadata replicated via “admin topic”
• Resilient to transient DC failures

Kafka Usage
Architecture
• One topic per store-version
• Kafka is fully managed by the controller
• Dynamic topic creation/deletion
• Infinite retention

Step 1/3: Steady State, In-between Bulkloads
[Diagram columns: Data Source, Kafka Topics, Venice Processes. Labels: Hadoop, Store v6, Store v7, Router.]
Not consuming, unless restoring a failed replica.

Step 2/3: Offline Bulkload Into New Store-Version
[Diagram labels: Hadoop, Push Job, Store v6, Store v7, Store v8, Router.]

Step 3/3: Bulkload Finished, Router Swaps to New Version
[Diagram labels: Hadoop, Push Job, Store v6, Store v7, Store v8, Router.]

Hybrid Stores
Overview, Data Merging, Usage Patterns

Overview
Hybrid Store
• Hybrid Stores aim to
• Merge batch and streaming data
• Better read path performance than Lambda Architecture
• Minimize application complexity

Data Merge
Hybrid Store
• Write-time merge
• All writes go through Kafka
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
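
The write-time merge described above can be sketched as follows, with lists standing in for Kafka topics. The function names are hypothetical, and the sketch assumes that a later record for a key simply overwrites an earlier one when the storage node consumes the store-version topic.

```python
def build_version_topic(batch_records, rtb_records):
    """Ordered contents of a new store-version topic."""
    topic = list(batch_records)   # the Hadoop push is written first
    topic.extend(rtb_records)     # then the Real-Time Buffer is replayed
    return topic


def materialize(topic):
    """What a storage node holds after consuming the topic in order."""
    table = {}
    for key, value in topic:
        table[key] = value        # a later record overwrites an earlier one
    return table


batch_records = [("a", 1), ("b", 2)]       # from Hadoop
rtb_records = [("b", 20), ("c", 30)]       # from Samza, via the RTB
merged = materialize(build_version_topic(batch_records, rtb_records))
```

Since the merge happens as data is written, a reader queries a single table and needs no merge logic of its own.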

Step 1/4: Steady State, In-between Bulkloads
[Diagram columns: Data Sources, Kafka Topics, Venice Processes. Labels: Hadoop, Samza, Store v7, Router.]

Step 2/4: Offline Bulkload Into New Store-Version
[Diagram labels: Hadoop, Push Job, Samza, Store v7, Store v8, Router.]

Step 3/4: Bulkload Finished, Start Buffer Replay
[Diagram labels: Hadoop, Push Job, Samza, Store v7, Store v8, Router.]

Step 4/4: Replay Caught Up, Router Swaps to New Version
[Diagram labels: Hadoop, Samza, Store v7, Store v8, Router.]
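
The four steps above imply a condition for when the router may swap to the new version: the bulkload has finished and the buffer replay has caught up. A minimal sketch of such a check follows. The function name, the offset-based notion of "caught up" and the `max_lag` parameter are assumptions made for illustration, not the actual Venice logic.

```python
def ready_to_swap(bulkload_done, replayed_offset, rtb_end_offset, max_lag=0):
    """Decide whether the new store-version can start serving reads."""
    if not bulkload_done:
        return False  # steps 2 and 3: still loading or not yet replaying
    lag = rtb_end_offset - replayed_offset
    return lag <= max_lag  # step 4: replay caught up with the buffer


still_loading = ready_to_swap(False, replayed_offset=0, rtb_end_offset=100)
replaying = ready_to_swap(True, replayed_offset=40, rtb_end_offset=100)
caught_up = ready_to_swap(True, replayed_offset=100, rtb_end_offset=100)
```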

Usage Patterns
Hybrid Store
• Offline Source
• Traditional Hadoop job
• Samza “reprocessing” job
• Nearline Source
• Overwrite same keys
• Write into different keys

Conclusion
Production Status, Killing Voldemort

Production Status
Conclusion
• Venice is running in production
• Batch stores since late 2016
• Hybrid stores since September 2017

Killing Voldemort
Conclusion
• Migration of Voldemort RO to Venice
• Tooling complete
• Seamless
• Starting now
