Ambry is an open-source object store responsible for storing all media content at LinkedIn. This talk covers the development of Ambry at LinkedIn and goes into its architecture in some detail.
2. Agenda
Media’s significance
Nature of media infrastructure at LinkedIn before Ambry
Motivation and design goals
API
Architecture and selected internals
Evaluation
How to get started
4. Content ecosystem @ LinkedIn before Ambry
[Diagram: clients 1..N talk to a fragmented ecosystem of media backends: filers (media server), Google App Engine, Voldemort, Espresso, a cache, and multiple media-processing libraries (Lib 1, Lib 2, …).]
5. Content ecosystem @ LinkedIn before Ambry
[Diagram: the same fragmented ecosystem (filers, media server, Google App Engine, Voldemort, Espresso, cache, multiple resize libraries).]
Calls for a unified solution
6. Why not File Systems or Key-Value Stores?
File systems (HDFS, Ceph, etc.)
Have extra capabilities that are not required for an object store
Overhead due to metadata lookups
Key-value stores (Cassandra, DynamoDB, etc.)
Not built to support large objects; might copy data multiple times
Lack streaming and chunking
7. Ambry: A Scalable Geo-distributed Object Store
Design principles
Low latency and high throughput for a variety of immutable objects
Unstructured content
Geo-distribution
Highly available
8. Design principles
Horizontally Scalable
Low operational overhead
Active-active setup
Simple design and ease of use
Cost Effective
9. Use cases
Store a variety of objects (documents, slides, etc.)
Store any kind of media file (pictures, sounds, videos)
Store backups
Any other use case that needs to store content of larger size
Range query support for storing videos
11. API
POST
Upload content to Ambry
Returns a handle (AmbryID) to the uploaded blob/content
GET
Blob: fetches the content associated with the AmbryID
BlobInfo: fetches the properties and user metadata associated with the blob for the AmbryID
DELETE
Deletes the content associated with the AmbryID
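The three calls above can be sketched as a tiny in-memory mock that mirrors their semantics (POST returns a handle, GET fetches by handle, DELETE tombstones it). `AmbryStore` and its method names are illustrative, not the real client API:

```python
import uuid

class AmbryStore:
    """In-memory sketch of the POST/GET/DELETE semantics above (illustrative)."""

    def __init__(self):
        self._blobs = {}

    def post(self, content: bytes, properties: dict) -> str:
        """Upload content; return a handle (the 'AmbryID')."""
        ambry_id = uuid.uuid4().hex
        self._blobs[ambry_id] = {
            "content": content, "properties": properties, "deleted": False,
        }
        return ambry_id

    def get_blob(self, ambry_id: str) -> bytes:
        """GET Blob: content associated with the AmbryID."""
        record = self._blobs[ambry_id]
        if record["deleted"]:
            raise KeyError(f"{ambry_id} is deleted")
        return record["content"]

    def get_blob_info(self, ambry_id: str) -> dict:
        """GET BlobInfo: properties/user metadata for the AmbryID."""
        record = self._blobs[ambry_id]
        if record["deleted"]:
            raise KeyError(f"{ambry_id} is deleted")
        return record["properties"]

    def delete(self, ambry_id: str) -> None:
        """DELETE: mark the blob as deleted (blobs are immutable, never edited)."""
        self._blobs[ambry_id]["deleted"] = True

store = AmbryStore()
handle = store.post(b"profile picture bytes", {"content-type": "image/jpeg"})
assert store.get_blob(handle) == b"profile picture bytes"
assert store.get_blob_info(handle)["content-type"] == "image/jpeg"
store.delete(handle)
```

Note that there is no PUT/update call: objects in Ambry are immutable, so the whole API surface is upload, fetch, and delete.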
14. Frontend
HTTP
Stateless
Pluggable validation and authentication service
Pluggable change capture
Router to route requests
15. Router Library
Supports streaming for large blobs
Service or embedded library
Proxies requests for stronger consistency
Zero-cost failure detection
Avoids down resources
16. Data Layer
Partitions, replicas
Log structured
Asynchronous replication for remote replicas
[Diagram: an append-only log (0 to 100 GB) holding blob id 50 at offset 640, blob id 30 at 700, blob id 70 at 770, and blob id 40 at 850, with the log end offset at 900. An index segment, sorted on blob id, maps each blob id to <offset, flags, TTL>: id 30 → 700, -, ∞; id 40 → 850, -, 1/1/17; id 70 → 770, -, ∞. The start offset of the current index segment is tracked alongside the log end offset.]
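The log-plus-index-segment layout above can be sketched as follows. This is a minimal illustration, not Ambry's on-disk format: the index here maps blob id → <offset, size, TTL> (size stands in for the flags field in the diagram), and writes always append to the log end:

```python
class LogStore:
    """Sketch of a log-structured store: an append-only log plus an index
    segment mapping blob id -> (offset, size, ttl), as in the slide's diagram.
    Names and layout are illustrative, not Ambry's actual format."""

    def __init__(self):
        self.log = bytearray()   # the append-only log
        self.index = {}          # blob_id -> (offset, size, ttl)

    def put(self, blob_id: str, data: bytes, ttl=float("inf")) -> int:
        offset = len(self.log)   # O(1): a write always goes to the log end
        self.log.extend(data)
        self.index[blob_id] = (offset, len(data), ttl)
        return offset

    def get(self, blob_id: str) -> bytes:
        # One index lookup gives the exact log range; no directory walk needed.
        offset, size, ttl = self.index[blob_id]
        return bytes(self.log[offset:offset + size])

store = LogStore()
off_50 = store.put("blob_50", b"x" * 60)
off_30 = store.put("blob_30", b"y" * 70)
```

Keeping the index sorted on blob id (as the slide shows) is what makes binary search and Bloom-filter checks per index segment cheap.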
17. Data Layer
Key optimizations
O(1) disk I/O for writes
Avoid unnecessary movement of actual data
Bloom filters for index segments
Rely on the OS page cache
Zero-copy for gets
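Of the optimizations above, the Bloom filter is the easiest to illustrate: before touching an index segment on disk, a reader consults a small in-memory filter that can say "definitely not here" with no false negatives. A minimal sketch (parameters and hashing scheme are assumptions, not Ambry's implementation):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: lets a reader skip index segments that
    definitely do not contain a blob id. May report false positives,
    never false negatives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0            # bit array packed into one int

    def _positions(self, key: str):
        # Derive `hashes` bit positions by salting the key (illustrative scheme).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for blob_id in ["id_30", "id_40", "id_70"]:
    bf.add(blob_id)
```

A negative answer saves a disk read of the whole index segment, which matters most for small-object workloads where seeks dominate.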
19. Evaluation - setup
Small cluster
– A beefy single Datanode
• 24-core CPU, 64 GB of RAM, 14 1 TB HDD disks, and a full-duplex 1 Gb/s Ethernet network
– Workloads: read-only, 50-50 read-write, and write-only
• Fixed-size objects in each test
• Objects read at random (worse than real-world workloads with a skewed distribution)
20. Throughput
Large objects:
• All cases saturate the network.
• Read-write saturates both the inbound and outbound links (reaching 2x network bandwidth).
Small objects:
• Write saturates the network.
• Throughput in read and read-write drops linearly because of frequent disk seeks.
• For 50 KB objects, more than 94% of read latency is disk seek.
21. Latency
Large objects:
• Latency scales proportionally to object size.
Small objects:
• Write latency scales proportionally.
• Read latency is dominated by disk seek (almost constant across sizes).
23. How to get started!
https://github.com/linkedin/ambry
Quick start guide
Mailing list: ambrydev@googlegroups.com
LinkedIn blog post
SIGMOD 2016 paper
Apache 2.0 licensed
Contributions are welcome and encouraged!
26. Challenges
Huge diversity: tens of KBs to a few GBs
Fast, durable, and highly available
Geo-replication with low latency
Ever-growing data and requests
> 800 M req/day (~120 TB)
Rate doubled in 12 months
Uploaded once, never modified, rarely deleted
Most recent uploads are accessed most often
27. API
• BlobProperties
• Size, TTL, Creation time, Content type
• UserMetadata
• List of <attribute, value>
• AmbryBlobOutput
• InputStream, Size, Last Modified Time
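The three API types above can be rendered roughly as follows. These are illustrative Python dataclasses; the real Ambry classes are Java and may differ in fields and naming:

```python
from dataclasses import dataclass, field
from typing import BinaryIO, Dict

@dataclass
class BlobProperties:
    """Size, TTL, creation time, content type (per the slide)."""
    size: int
    ttl_seconds: int        # -1 in the REST API means the default TTL
    creation_time_ms: int
    content_type: str

@dataclass
class UserMetadata:
    """A list of <attribute, value> pairs, modeled here as a dict."""
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class AmbryBlobOutput:
    """InputStream, size, last modified time (stream type is illustrative)."""
    stream: BinaryIO
    size: int
    last_modified_ms: int

props = BlobProperties(size=1024, ttl_seconds=-1,
                       creation_time_ms=0, content_type="image/jpeg")
meta = UserMetadata({"album": "vacation"})
```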
28. API
• POST /

Request Header        Type    Description
x-ambry-blob-size     Long    The size of the blob
x-ambry-service-id    String  The ID of the service that is uploading the blob
x-ambry-content-type  String  The type of content in the blob
29. API
• POST / (continued)

Request Header     Type    Description
x-ambry-ttl        Long    The time in seconds for which the blob is valid. Defaults to -1
x-ambry-owner-id   String  The owner of the blob
x-ambry-um-        String  Prefix for user-metadata headers

Returns the handle (blob id) to the object uploaded
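Assembling a POST request from the two header tables above might look like this. The helper function and its defaults are a sketch; only the header names come from the tables:

```python
def post_headers(blob_size: int, service_id: str, content_type: str,
                 ttl: int = -1, owner_id: str = None,
                 user_metadata: dict = None) -> dict:
    """Build the request headers for POST / per the tables above.
    User-metadata keys get the x-ambry-um- prefix."""
    headers = {
        "x-ambry-blob-size": str(blob_size),
        "x-ambry-service-id": service_id,
        "x-ambry-content-type": content_type,
        "x-ambry-ttl": str(ttl),   # -1 is the documented default
    }
    if owner_id is not None:
        headers["x-ambry-owner-id"] = owner_id
    for key, value in (user_metadata or {}).items():
        headers[f"x-ambry-um-{key}"] = value
    return headers
```

The returned dict would be sent along with the blob body; the frontend replies with the blob id handle.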
30. API
• GET /<ambry-id>/<sub-resource>
• Sub-resources: BlobInfo, UserMetadata

Request Header  Type    Description
ambry-id        String  The ID of the blob whose content is requested
sub-resource    String  One of the listed sub-resources

Returns the content of the blob or the requested sub-resource
31. API
DELETE /<ambry-id>

Request Header  Type    Description
ambry-id        String  The ID of the blob to be deleted

Returns a successful response on deletion

HEAD /<ambry-id>

Request Header  Type    Description
ambry-id        String  The ID of the blob whose properties are requested

Returns the blob properties of the blob as response headers
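The read-side calls (GET, DELETE, HEAD) all address a blob by its id in the path. A small sketch of building these request lines, with the sub-resource rule from the GET slide enforced; the function itself is illustrative:

```python
def build_request(method: str, ambry_id: str, sub_resource: str = None):
    """Return (method, path) for the GET/DELETE/HEAD calls above.
    Sub-resources (BlobInfo, UserMetadata) apply only to GET."""
    if sub_resource is not None:
        if method != "GET" or sub_resource not in ("BlobInfo", "UserMetadata"):
            raise ValueError("sub-resources are GET-only: BlobInfo or UserMetadata")
        return method, f"/{ambry_id}/{sub_resource}"
    if method not in ("GET", "DELETE", "HEAD"):
        raise ValueError(f"unsupported method: {method}")
    return method, f"/{ambry_id}"
```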
35. Cluster Manager

Hardware Layout

Node         Disk     Size   State
DC1:Node_1   Disk_0   4 TB   Up
             Disk_1   4 TB   Up
             …        …      …
             Disk_n   4 TB   Up
DC1:Node_2   Disk_0   4 TB   Up
             Disk_1   4 TB   Down
             …        …      …
             Disk_p   4 TB   Up
…            …        …      …
DC2:Node_i   Disk_0   4 TB   Up
             Disk_1   4 TB   Up
             …        …      …
             Disk_m   4 TB   Down

Partition Layout

Partition_id   State        Replicas
Partition_0    Read-Write   DC1:Node1:Disk_0, DC1:Node4:Disk_2, …, DC3:Node2:Disk_1
Partition_1    Read-Only    DC2:Node0:Disk_0, DC1:Node4:Disk_3, …, DC2:Node2:Disk_2
…              …            …
Partition_k    Read-Write   DC1:Node1:Disk_1, DC2:Node1:Disk_2, …, DC3:Node0:Disk_0
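The two layouts above can be sketched as plain data structures plus a lookup: new blobs may only go to Read-Write partitions, while Read-Only partitions still serve gets. The structures and partition contents below are illustrative samples from the tables, not a real cluster map:

```python
import random

# Partition layout sample (illustrative): partition -> state + replica placement.
partition_layout = {
    "Partition_0": {"state": "Read-Write",
                    "replicas": ["DC1:Node1:Disk_0", "DC1:Node4:Disk_2",
                                 "DC3:Node2:Disk_1"]},
    "Partition_1": {"state": "Read-Only",
                    "replicas": ["DC2:Node0:Disk_0", "DC1:Node4:Disk_3",
                                 "DC2:Node2:Disk_2"]},
}

def writable_partitions(layout: dict) -> list:
    """Partitions eligible for new puts (Read-Write state only)."""
    return [pid for pid, info in layout.items()
            if info["state"] == "Read-Write"]

def choose_partition_for_put(layout: dict) -> str:
    """Pick a writable partition for a new blob (random choice here;
    the real placement policy is an implementation detail)."""
    return random.choice(writable_partitions(layout))

def replicas_for(layout: dict, partition_id: str) -> list:
    """Replica placement for a partition, spanning disks across datacenters."""
    return layout[partition_id]["replicas"]
```

Marking a partition Read-Only as it fills up is how the cluster ages out write traffic while keeping old blobs readable.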
36. Zero-cost Failure Recovery
[Diagram: a resource's state over time: available → (failures observed on requests) → temp down → wait period → temp available → (success) → available.]
• No extra messages; leverages ordinary request messages
• Effective, simple, and consumes very little bandwidth
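The state cycle above can be sketched as a small detector driven purely by ordinary request outcomes (no heartbeats or pings). The threshold, wait period, and integer clock are assumptions for illustration, not Ambry's actual tuning:

```python
class FailureDetector:
    """Zero-cost failure detection sketch: availability is inferred from
    request outcomes alone. After `threshold` consecutive failures a resource
    is marked temp-down and skipped until `wait_period` ticks pass, then it
    becomes eligible for a retry (temp available)."""

    def __init__(self, threshold: int = 3, wait_period: int = 10):
        self.threshold = threshold
        self.wait_period = wait_period
        self.failures = 0
        self.down_until = None    # tick at which we allow a retry

    def available(self, now: int) -> bool:
        """Should we route a request to this resource right now?"""
        return self.down_until is None or now >= self.down_until

    def on_response(self, now: int, ok: bool) -> None:
        """Fold an ordinary request outcome into the state; no extra messages."""
        if ok:
            self.failures, self.down_until = 0, None      # back to available
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.down_until = now + self.wait_period  # temp down

fd = FailureDetector()
for t in range(3):
    fd.on_response(t, ok=False)
assert not fd.available(now=3)    # temp down after 3 straight failures
assert fd.available(now=13)       # wait period over: eligible for retry
```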
37. Replication Protocol

Replica R1 of Partition P1 (journal): 640 → b1, 700 → b5, 770 → b6, 850 → b4, 920 → b3
Replica R2 of Partition P1 (journal): 600 → b5, 690 → b1, 750 → b2, 810 → b4, 880 → b7

R1 pulls from R2, with last known offset of R2 = 690:
1. Get blob ids since offset 690
2. R2 returns blob info {b1, b2, b4, b7}; R1's last known offset of R2 advances to 880
3. R1 filters out the blob ids it already has, leaving the missing ones
4. Get blobs {b2, b7}
5. R2 returns blob content {b2, b7}
6. R1 appends the blobs to its local replica (journal entries 1050 → b2, 1110 → b7)
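The six steps above can be sketched over two in-memory journals, each a list of (offset, blob_id) pairs. The function name, the 100-byte toy offsets on append, and the list representation are assumptions for illustration, not Ambry's wire protocol:

```python
def replicate(local: list, remote: list, last_known_offset: int) -> int:
    """One round of the pull-based protocol: local pulls what it is missing
    from remote. Returns the new last-known offset of the remote replica."""
    # Step 1-2: ask remote for blob ids written at/after last_known_offset.
    newer = [(off, bid) for off, bid in remote if off >= last_known_offset]
    new_offset = max((off for off, _ in newer), default=last_known_offset)
    # Step 3: filter out blob ids the local replica already has.
    have = {bid for _, bid in local}
    missing = [bid for _, bid in newer if bid not in have]
    # Steps 4-6: fetch the missing blobs and append them to the local log.
    next_off = max((off for off, _ in local), default=0) + 100  # toy offsets
    for bid in missing:
        local.append((next_off, bid))
        next_off += 100
    return new_offset

r1 = [(640, "b1"), (700, "b5"), (770, "b6"), (850, "b4"), (920, "b3")]
r2 = [(600, "b5"), (690, "b1"), (750, "b2"), (810, "b4"), (880, "b7")]
assert replicate(r1, r2, last_known_offset=690) == 880
assert [bid for _, bid in r1[-2:]] == ["b2", "b7"]   # b2 and b7 were missing
```

Because the exchange is keyed on offsets into the remote journal, a repeat round with the updated offset transfers nothing new, making the protocol naturally incremental.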
38. Replication
Multi-master replication
Asynchronous
Pull based
Optimizations
Batching requests
Inter and intra-colo thread pools
Prioritization for lagging replicas
- Disk repairs
In-Memory Journaling
- Map<offset, blob_id>
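The in-memory journal above (Map<offset, blob_id>) exists so a replication peer's "everything after offset X" query can be answered without scanning index segments. A minimal sketch using parallel sorted lists; the class and method names are illustrative:

```python
import bisect

class Journal:
    """In-memory journal sketch: an offset-ordered map of offset -> blob_id,
    supporting the 'entries since offset' query used by replication."""

    def __init__(self):
        self.offsets = []    # log offsets, kept sorted
        self.blob_ids = []   # blob id written at the matching offset

    def record(self, offset: int, blob_id: str) -> None:
        # Writes arrive in log order, so plain appends keep offsets sorted.
        self.offsets.append(offset)
        self.blob_ids.append(blob_id)

    def entries_since(self, offset: int) -> list:
        """All (offset, blob_id) entries at offsets >= the given offset,
        found via binary search."""
        i = bisect.bisect_left(self.offsets, offset)
        return list(zip(self.offsets[i:], self.blob_ids[i:]))

journal = Journal()
for off, bid in [(600, "b5"), (690, "b1"), (750, "b2"), (810, "b4"), (880, "b7")]:
    journal.record(off, bid)
assert [bid for _, bid in journal.entries_since(690)] == ["b1", "b2", "b4", "b7"]
```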
Editor's Notes
Before proceeding, a few words on importance of media. 15 seconds
Media includes images, videos, documents and possibly in the future Virtual Reality. Media content are the biggest influencers to user engagement and virality. This trend will continue to grow not just on LinkedIn but across the web. To build great media products, we need to have a world class infrastructure to support it. That is the vision and focus of this group.
File systems have extra capabilities that are not needed for a blob store. A blob store does not need the FS's arbitrarily nested (hierarchical) directory structure, nor rich per-object metadata like ACLs and access times.
Not the best fit for storing a large number of small objects, due to the per-object metadata that has to be kept
Name Node issues
KV stores are more general purpose; they can be used for storing objects but are not optimized for this purpose, and extra burden comes with them.
Handle key collisions
Data is mutable: additional overhead to maintain the consistency model
So, when we started designing Ambry, we had a few design principles in mind.
First and foremost, it should be a low-latency and high-throughput system. Since media objects tend to vary from small to very large, we can't give up on either throughput or latency. And small blobs in particular should be served with very low latency.
Next is geo-distribution. Geo-distribution is very much necessary when dealing with media. Whenever you upload a picture and share it via your social network, very likely someone on the other end of the globe will try to see it. So our object store should have a global presence.
Highly available: neither downstream applications nor users can tolerate unavailability. Consider video ads and the monetization tied to them. So availability is pretty much a must-have feature in any data system.
Next is scalability. If we don't build the system to be scalable right from the beginning, we either pay a hefty price later when we revisit the design to make it scalable, or it may not be feasible to scale at all. Scaling Ambry is just a matter of adding new nodes with new partitions.
You can't claim to have designed a good system if it is tough to operationalize. No one will use a system with operational issues. So Ambry has been built from the ground up with this in mind, and we have a lot of tooling around Ambry for ease of operation.
Active-active setup:
A master-slave setup adds overhead in making sure slaves are synced with the master all the time, and the master can become a bottleneck. Ours is an active-active setup, where any replica can take puts and gets. This avoids bottlenecks and helps spread requests across the replicas of a partition.
Simple design and ease of use:
We will talk about Ambry's design in the forthcoming slides. But again, we wanted to ensure that our design is simple to understand and easy for developers to use.
Cheap:
Media objects are rarely deleted. Older data becomes cold over time and has very low read QPS. Also, objects are usually large and take up a lot of space. The design should enable JBOD, support hard disks, and keep space amplification to a minimum.
Ambry can be used as a source of truth for all your immutable-content needs, with high availability and scalability.
The frontend understands HTTP.
Having a frontend also helps us coexist with the legacy system (requests can be routed to Ambry or the media server).
The frontend is also where we can plug in other features, like virus scanning, or pushing to a change-capture system like Kafka for tracking puts and deletes.
The frontend has a router library which has all the core logic to work with the Ambry backend/datanode.
The router handles how an operation should be performed and has configurable policies for it.
Storage is divided into partitions
A partition has a set of replicas
Any given blob goes into one partition
A server node consists of replicas of several partitions
A replica has a set of disk/memory data structures
We deployed Ambry with a single Datanode. The Datanode was running on a 24-core CPU with 64 GB of RAM, 14 1 TB HDD disks, and a full-duplex 1 Gb/s Ethernet network. 4 GB of the RAM was set aside for the Datanode's internal use and the rest was left to be used as Linux cache.
The maximum throughput (in MB/s) stays constant and close to the maximum network bandwidth across all blob sizes. Similarly, throughput in terms of requests/s scales proportionally.
However, for Read and Read-Write, the read throughput in terms of MB/s drops linearly for smaller sizes. This drop is because our micro-benchmark reads blobs randomly, incurring frequent disk seeks. The effect of disk seeks is amplified for smaller blobs. By further profiling the disk using Bonnie++ [1] (an IO benchmark for measuring disk performance), we confirmed that disk seeks are the dominant source of latency for small blobs. For example, when reading a 50 KB blob, more than 94% of latency is due to disk seek (6.49 ms for disk seek, and 0.4 ms for reading the data).
Handling blobs poses a number of unique challenges. First, due to diversity in media types, blob sizes vary significantly from tens of KBs (e.g., profile pictures) to a few GBs (e.g., videos). The system needs to store both massive blobs and a large number of small blobs efficiently.
Second, users expect the uploading process to be fast, durable, and highly available. When a user uploads a blob, all his/her friends from all around the globe should be able to see the blob with very low latency, even if parts of the internal infrastructure fail. To provide these properties, data has to be reliably replicated across the globe in multiple datacenters, while maintaining low latency for each request.
Finally, there is an ever-growing number of blobs that need to be stored and served. Currently, LinkedIn serves more than 800 million put and get operations per day (over 120 TB in size). In the past 12 months, the request rate has almost doubled, from 5k requests/s to 9.5k requests/s. This rapid growth in requests magnifies the necessity for a linearly scalable system (with low overhead).
Third, the variability in workload and cluster expansions can create unbalanced load, degrading the latency and throughput of the system
Talk about REST APIs & Router APIs
Also talk about ease of usability (either as a service or embed as lib and reduce one hop)
Ambry employs a zero-cost failure detection mechanism involving no extra messages, such as heartbeats and pings, by leveraging request messages. In practice, we found our failure detection mechanism is effective, simple, and consumes very little bandwidth.
As we saw earlier, we have quorum-based writes for PUT requests, so there could be some replicas where certain blobs are missing; we rely on replication to bring such blobs to all replicas. Also, our operation policy for PUT writes to local replicas and relies on replication to copy the data to remote replicas. Thus replication plays an important role in Ambry. Multiple masters can take writes for high availability. Replication is completely asynchronous and pull-based.