Introducing Venice
A Derived Data Store for Batch, Streaming & Lambda Architectures
Felix GV
Engineer
Yan Yan
Engineer
Today’s Agenda
Introducing Venice
2:55 Intro
3:05 Venice
3:10 Architecture
3:20 Hybrid Stores
3:25 Conclusion
3:30 Q&A
Intro
Primary & Derived Data, Data Lifecycle, Voldemort, Venice
Kinds of Data
Primary Data
• Source of Truth
• Example use case:
• Profile
• Example systems:
• SQL
• Document Stores
• K-V Stores
Derived Data
• Derived by computing over primary data
• Example use case:
• People You May Know
• Example systems:
• Search Indices
• Graph Databases
• K-V Stores
Data Lifecycle
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps]
Data Lifecycle
[Diagram: Apps → Kafka → HDFS → Pig, Hive, Spark… → …]
Voldemort Read-Only
• Generates binary files on Hadoop
• Bulk loads data from Hadoop
• (in the background)
• Swaps new data when ready
• Keeps last dataset as a backup
• Allows quick rollbacks
Voldemort Read-Only
• At LinkedIn:
• ~1000 nodes
• > 500 stores
• > 240 TB refreshed / day
• > 600K QPS
Data Lifecycle
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps]
Data Lifecycle, Today
[Diagram: Apps → Events Buffer → Offline Storage → Batch Jobs → Online Storage → Apps, now with Stream Processing added alongside the batch path]
How can we serve both batch and stream data?
Lambda Architecture
[Diagram: Kafka → Stream Processing → Speed Layer; Hadoop → Batch Processing → Bulk Store; the App reads from both the Speed Layer and the Bulk Store]
Downsides
Lambda Architecture
• Read path limited by:
• The slower of the two systems
• The less available of the two systems
• Extra application complexity
Lambda Architecture, v2
[Diagram: Kafka → Stream Processing and Hadoop → Batch Processing, with the App reading from a single serving path rather than two separate stores]
Venice
Design Goals, API, Features, Scale, Tradeoffs
Design Goals
Venice
• To replace Voldemort Read-Only
• Drop-in replacement
• More efficient
• More resilient
• More operable
• To enable new use cases “as a service”
• Nearline derived data
• Lambda Architecture
Read/Write API
Venice
• Derived data K-V store
• Single Get
• Batch Get
• High throughput ingestion from:
• Hadoop
• Samza
• Or both (hybrid)
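To make the read API above concrete, here is a minimal sketch of what a client for such a derived-data K-V store could look like. The interface, names, and use of CompletableFuture are illustrative assumptions, not the actual Venice client API.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a derived-data K-V read client, per the slide above.
// Names and signatures are illustrative only, not the real Venice client API.
public interface DerivedDataStoreClient<K, V> {

  // Single Get: fetch the value for one key, asynchronously.
  CompletableFuture<V> get(K key);

  // Batch Get: fetch many keys in one round trip; keys with no value
  // are simply absent from the returned map.
  CompletableFuture<Map<K, V>> batchGet(Set<K> keys);
}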
Features
Venice
• Dataset versioning
• Same semantics as Voldemort RO
• Bulk loads in the background
• Swapped in when ready
• Quick rollback
Features
Venice
• Avro schema evolution
• Service discovery via D2
• Helix cluster management
• Fully automatic replica placement
• Cluster expansion
• Self-healing
• Rack-awareness
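The Avro schema evolution feature listed above can be illustrated with plain Avro APIs: the sketch below checks that a new value schema can still read data written with the old one. The Profile schemas are made-up examples, not Venice internals.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
  public static void main(String[] args) {
    // Old value schema already registered for the store (illustrative example).
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
            + "{\"name\":\"firstName\",\"type\":\"string\"}]}");

    // New schema adds a field with a default, which keeps it backward compatible.
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
            + "{\"name\":\"firstName\",\"type\":\"string\"},"
            + "{\"name\":\"headline\",\"type\":\"string\",\"default\":\"\"}]}");

    // Can a reader using the new schema decode data written with the old schema?
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

    System.out.println("Compatibility: " + result.getType());
  }
}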
Scale
Venice
• Large scale
• Multi-Datacenter
• Multi-Cluster
• Run “as a service”
• Self-service onboarding
• Each cluster is multi-tenant
• Resource isolation
Tradeoffs
Venice
• All writes go through Kafka
• Scalable
• Burst tolerant
• Asynchronous
• No “read your writes” semantics
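As a rough illustration of this tradeoff, the sketch below assumes a plain Kafka producer as the nearline write path, with hypothetical broker, topic, and serializer settings: the send is acknowledged once Kafka has persisted the record, not once storage nodes have consumed it, which is why there is no read-your-writes guarantee.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AsyncWriteExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The write "succeeds" as soon as Kafka acknowledges the record...
      RecordMetadata meta = producer
          .send(new ProducerRecord<>("store_realtime_buffer", "member:42", "newValue"))
          .get();
      System.out.println("Acknowledged at offset " + meta.offset());

      // ...but storage nodes consume asynchronously, typically seconds later,
      // so a read issued right now may still return the previous value:
      // no "read your writes" guarantee.
    }
  }
}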
Architecture
Components, Global Replication, Kafka Usage
Components
Architecture
• Server Processes
• Storage Node
• Router
• Controller
• Libraries
• Client
• Hadoop to Venice Push Job
• Samza System Producer
Venice Components
[Diagram: the Hadoop Push Job and the Samza producer write data in; Storage Nodes host it; the Client reads through the Router; the Controller manages the cluster]
Global Replication
Architecture
• Voldemort pain point:
• Duplicated copies sent over the WAN
Global Replication
[Diagram: the Hadoop Push Job writes once into Kafka in the source datacenter; Kafka Mirror Makers replicate the topic across the datacenter boundary to each remote datacenter; Storage Nodes consume from their local Kafka; a Parent Controller coordinates the per-datacenter Controllers]
Metadata Replication
Architecture
• Admin operations performed on the parent
• Store creation/deletion
• Schema evolution
• Quota changes, etc.
• Metadata replicated via “admin topic”
• Resilient to transient DC failures
Kafka Usage
Architecture
• One topic per store-version
• Kafka is fully managed by the controller
• Dynamic topic creation/deletion
• Infinite retention
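To illustrate the one-topic-per-store-version idea, here is a small sketch using the standard Kafka AdminClient. The topic naming scheme, partition count, replication factor, and configs are assumptions for illustration, not Venice's actual settings.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class StoreVersionTopics {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker

    try (AdminClient admin = AdminClient.create(props)) {
      // New store-version pushed: create its dedicated topic with
      // unlimited retention (assumed naming scheme "storeName_v<N>").
      NewTopic v8 = new NewTopic("myStore_v8", 16, (short) 3)
          .configs(Map.of("retention.ms", "-1"));
      admin.createTopics(Collections.singleton(v8)).all().get();

      // Oldest version retired: its topic is deleted outright.
      admin.deleteTopics(Collections.singleton("myStore_v6")).all().get();
    }
  }
}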
Step 1/3: Steady State, In-between Bulk Loads
[Diagram: Hadoop (data source); store-version topics v6 and v7 (Kafka topics); the Router (Venice processes) serves reads from v7; the v6 topic is not consumed unless a failed replica is being restored]
Step 2/3: Offline Bulk Load Into New Store-Version
[Diagram: the Push Job writes from Hadoop into a new store-version topic (v8); reads keep being served from v7 through the Router; the v6 topic is still retained]
Step 3/3: Bulk Load Finished, Router Swaps to New Version
[Diagram: the v8 push is complete and the Router swaps reads from v7 to v8; the oldest version (v6) is retired]
Hybrid Stores
Overview, Data Merging
Overview
Hybrid Store
• Hybrid Stores aim to
• Merge batch and streaming data
• Not compromise read path performance
• Minimize application complexity
Data Merge
Hybrid Store
• Write-time merge
• All writes go through Kafka
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
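A minimal sketch of the buffer replay, assuming plain Kafka consumer and producer clients and made-up topic names: records written by Samza into the real-time buffer topic are copied into the current store-version topic, where storage nodes consume them alongside the batch data.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BufferReplay {
  public static void main(String[] args) {
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
    consumerProps.put("group.id", "buffer-replay");
    consumerProps.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    consumerProps.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    producerProps.put("value.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(Collections.singleton("myStore_rt_buffer"));  // assumed RTB topic name
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
          // Replay each real-time write into the current store-version topic,
          // so it is merged with the bulk-loaded data at write time.
          producer.send(new ProducerRecord<>("myStore_v8", record.key(), record.value()));
        }
      }
    }
  }
}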
Step 1/4: Steady State, In-between Bulk Loads
[Diagram: Hadoop and Samza (data sources); Kafka topics; reads are served from the current store-version (v7) through the Router]
Step 2/4: Offline Bulk Load Into New Store-Version
[Diagram: the Push Job writes from Hadoop into a new store-version topic (v8) while Samza keeps writing and v7 keeps serving reads through the Router]
Step 3/4: Bulk Load Finished, Start Buffer Replay
[Diagram: the bulk load into v8 is complete; the Real-Time Buffer is replayed into the v8 topic while v7 still serves reads through the Router]
Step 4/4: Replay Caught Up, Router Swaps to New Version
[Diagram: once the buffer replay has caught up, the Router swaps reads from v7 to v8; Hadoop and Samza remain the data sources]
Conclusion
Production Status, Killing Voldemort
Production Status
Conclusion
• Venice is running in production
• Batch stores since late 2016
• Hybrid stores since September 2017
Killing Voldemort
Conclusion
• Migration of Voldemort RO to Venice
• Tooling complete
• Seamless
• Starting now
Thank you
Backup Slides
Build & Push
[Diagram: Hadoop Cluster, Scheduled Job, Voldemort Cluster, Lag Measurement]


Editor's Notes

  • #17 Thanks Felix. I’m Yan from the Venice team, and I’m going to give you a brief introduction to Venice, our new-generation derived data platform, highlighting our design goals, the API and main features, the scale Venice has to support, and some tradeoffs we made when we designed the system.
  • #18 Let’s start with the design goals. Venice is the successor of Voldemort Read-Only, so it should be able to take over everything Voldemort Read-Only has been doing, but more efficiently, more resiliently, and with better operability. One important point is that we want Venice to be a drop-in replacement for Voldemort, because hundreds of users are living on Voldemort and all of their data will eventually be moved to Venice. Migrating that much data is not easy, which is why we have to make the migration as smooth as possible. Another main goal is that, in addition to offline derived data, Venice should also serve nearline derived data, and should be able to merge offline and nearline data together to give users a unified view of both in one system.
  • #19 All right, now that the goals are clear, let’s see which APIs and features Venice provides to achieve them. On the read path, you can think of Venice as a distributed key-value store, so single-get and batch-get APIs are required. On the write path, high-throughput ingestion from both Hadoop and Samza keeps the system efficient, because hundreds of terabytes of data are written into the system every day. Venice writes data from the data source asynchronously; I will explain the details when we talk about tradeoffs.
  • #20 The first feature I want to introduce is data versioning. Venice keeps multiple versions of your data. Once a user starts ingesting a new offline dataset, that data is written in the background without impacting the current version, which keeps serving read requests. When the new version is ready to serve, Venice does an atomic swap so that all read requests hit the new version instead. The whole process is almost transparent to the user: the only thing the user needs to do is start the ingestion, and Venice manages everything else. And in case you find your data has an issue, you can roll back quickly, just by telling Venice which version you want to use.
  • #21 On this slide I want to talk about three new features in Venice, all aimed at solving pain points we hit with Voldemort. First, Avro schema evolution allows users to update the schema of their data instead of creating a new store, which is what they had to do in Voldemort. Second is dynamic service discovery, built on top of D2, a dynamic discovery framework open-sourced by LinkedIn. On the client side, users do not need to specify which endpoint to talk to; Venice finds the proper server based on the store being used, and in case of a server failure, Venice routes to another server as the replacement, so the application can focus on its business logic. We also introduced Helix, the open-source cluster management framework widely used at LinkedIn, which provides fully automatic, rack-aware replica placement. On top of it we implemented zero-downtime cluster expansion and upgrades, a big availability improvement over Voldemort: users can continue their data ingestion during maintenance windows that used to take 2-3 hours each time.
  • #22 In terms of scalability, there are two meanings in Venice. First, we have to support large numbers of machines and large volumes of data: Venice runs across multiple datacenters on different continents, and can also run multiple clusters in one physical datacenter to get better resource utilization for different use cases. Second, we want to run Venice as a service, which means supporting a large number of users. That is why we provide self-service onboarding through our internal cloud management platform, so users can manage their stores without involving our SREs or developers. Since each cluster is a multi-tenant environment, users share resources like CPU and storage, so we introduced several forms of resource isolation, such as QPS quotas, storage quotas, and multiple clusters, to prevent users from impacting each other.
  • #23 On to tradeoffs. As I said, we need high-throughput ingestion to keep the system efficient, and the main tradeoff we made is how to ingest large amounts of data into Venice. Basically we had two options: fetch data from the data source directly, or write data into an intermediary and then fetch it from that intermediary asynchronously. We decided to make all writes go through Kafka: we write all data into Kafka first, and treat Kafka as the source of truth for Venice. Kafka is a scalable message queue with good support for high-throughput writes, so we can accept users’ data as fast as it comes. Remember that we bulk load from Hadoop, so we always face bursts of messages; with Kafka we are burst tolerant, because Kafka persists those messages first and Venice consumes them gradually, instead of running out of capacity when a large number of messages arrive in a short period. What we pay for this asynchronous push mechanism is that, in nearline cases, Venice does not provide “read your writes” semantics. Imagine your data has been written into Kafka and Venice reports that your write succeeded: you still cannot see the data you just wrote, because Venice normally takes several seconds to consume it and persist it locally before it becomes visible to clients. We think that is acceptable for most of our use cases, and we are also working on workarounds to support read-your-writes semantics on top of this push mechanism.
  • #24 Now I want to describe the architecture of Venice, to help you understand how we implement the features mentioned above. I’m going to introduce the main components in Venice and how they interact with each other, then jump into global replication, which lets us sync data and metadata across multiple datacenters, and finally explain how we use Kafka, because the way we use it is slightly different from the common case.
  • #25 There are two kinds of components in Venice. We have processes running on the server side: the storage node, which actually hosts the datasets; the router, which is the gateway of the cluster, so every request hits a router first and is then forwarded to the proper nodes; and the controller, which manages the whole cluster. The other kind of component is the libraries embedded in users’ applications. We have the client library for reading data from Venice, an H2V (Hadoop to Venice) push job plugin in Azkaban that lets users push offline derived data into Venice, and, for nearline derived data, a Samza system producer so users can push data from their Samza jobs into Venice as well.
  • #26 This diagram shows how the components interact with each other. The blue shapes are the Venice components we built and the gray shapes are the dependencies we rely on. Your data starts on a data source like Hadoop; to ingest it into Venice, you start a new push job and wait for it to complete. Underneath the job, the Venice controller creates all the essentials for you, such as a new data version and a new Kafka topic, and also picks the proper storage nodes to host your dataset. Those storage nodes consume the data written by your push job from Kafka in parallel and report status to the controller regularly. Once the controller has enough information and considers your data ready to serve, it notifies the routers to use the new data version; after the version swap, your data is visible to read clients. That is the whole lifecycle of an offline push job. For nearline derived data, Felix will give more details later.
  • #27 As I said, Venice is a planet-scale system, so each piece of data must be replicated to multiple datacenters located on different continents. In Voldemort, global replication means each replica in each datacenter reads its own copy of the data from the source datacenter, so duplicated copies are sent over the WAN, which eats a lot of overseas bandwidth and slows down the whole push.
  • #28 In Venice we built a new global replication mechanism: the push job writes data into Kafka in the source datacenter, and we rely on Kafka MirrorMaker to replicate each message to all target datacenters. Note that we only send one copy of the data to a remote datacenter; to keep enough replicas, multiple storage nodes consume that same copy from their local Kafka and persist it in their local storage engine. With this new global replication mechanism, we saved 40% of the time spent on the whole push job, and also reduced cross-datacenter network usage: depending on the replication factor, two thirds or half of the bandwidth cost is saved.
  • #29 Besides the data, we replicate our metadata across datacenters as well. There is a dedicated Kafka topic, called the “admin topic”, that we use to transfer metadata operations. When you create a store in the source datacenter, all target datacenters receive the admin message and execute the corresponding operation, creating the store there to keep metadata consistent. If an entire datacenter fails, we still have that data in Kafka, either in the source Kafka cluster or in the target Kafka cluster. Once the datacenter recovers, the admin messages are eventually consumed by the Venice controller running there and handled properly, so no manual operation is needed to deal with datacenter failures or metadata inconsistency.
  • #30 Unlike most Kafka use cases, we create one topic whenever a new data version is created, and delete that topic once the associated version is retired. All of this topic creation and deletion is dynamic and fully managed by the controller, so a topic here is no longer a pre-created resource with a long-term lifecycle.
  • #31 Imagine we already have two data versions in your store, v6 and v7. V7 is the current version serving read requests.
  • #32 Now you start a new push job, so Venice creates v8 for this push.
  • #33 Once the push job succeeds, v8 is ready to serve, so Venice swaps the current version from v7 to v8, and meanwhile retires the oldest version, v6, deleting the associated Kafka topic and the data persisted in the local storage engine. We still keep two versions of your dataset, and we have completed one round of version swap.
  • #34 All right, that’s all about the Venice architecture and the offline push job. I’ll hand it back to Felix to introduce more about our hybrid design. Thank you.