Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

Smooth is a distributed storage system for managing structured time series data at Two Sigma. Smooth’s design emphasizes scale, both in terms of size and aggregate request bandwidth, reliability and storage efficiency. It is optimized for large parallel streaming read/write accesses over provided time ranges. Smooth has a clear separation between the metadata and data layers, and supports multiple pluggable object stores for storing data files. Data can be replicated or moved between different stores and data centers to support availability, performance and storage tiering objectives. Smooth is widely used at Two Sigma by various applications including modeling research workflows, data pipelines and various data analysis jobs. Smooth has been in development for about 5 years, currently stores multiple PBs of compressed data, and serves peak aggregate throughput in excess of 100 GB/s. In this talk I will discuss the design and implementation of Smooth, our experience running it over the past two years, ongoing challenges and future directions.

  1. Smooth Storage
     A storage system for managing structured time series data at Two Sigma
     Saurabh Goel (saurabh.goel@twosigma.com), www.twosigma.com
     September 13, 2018
     Proprietary and Confidential – Not for Redistribution
  2. Disclaimer
     This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
  3. Outline
     • Motivation and design emphasis
     • Data Model and API
     • Implementation of the data model
     • System Architecture
     • Looking Forward
  4. Motivation
     • Why have specialized storage for time series data?
       - Time series data is extremely common at Two Sigma
       - Time is one of the primary dimensions along which applications want to partition and filter data
       - Scale, in terms of both size and access
       - Optimizing for the target application workload and requirements
  5. Smooth’s design emphasis
     • Optimized for range queries and range updates executed in parallel per table
     • File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
     • Centrally managed service at Two Sigma (TS)
     • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
     • Storage efficiency is also a major concern given the overall size of data stored
     File system ------------------------------ Smooth --------------- Database
  6. Target Application characteristics
     • Parallel time partitioned jobs that move a lot of data
     • Tend to be batch oriented; care more about throughput than latency
     • New use cases are demanding better latency, smaller IO, more query power
     • Not good for workloads that require very low latencies or issue large numbers of small reads and writes
  8. Data Model
     • Tables with schema; mandatory time column
     • Rows ordered and indexed by time
     • Not relational – duplicate timestamps/rows allowed; no notion of primary key but users can enforce PK constraints in their applications
     • Easy to update schema
     • Can store wide sparse schemas efficiently
  9. Write API
     Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows

         WriteSession s = write(table, [10, 42));
         s.addRow(<10, ..>);
         s.addRow(<15, ..>);
         // repeated timestamp is ok
         s.addRow(<15, ..>);
         // invalid: rows must be added in non-decreasing order
         s.addRow(<10, ..>);
         // invalid: rows must lie within the given time range
         s.addRow(<50, ..>);
         s.commit();
  10. Write API
      • Set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
      • Distributed atomic writes are possible
      • Delete is just a special case of update where no new rows are written
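The write semantics on slides 9 and 10 can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not Smooth's client library: `Table`, `WriteSession.add_row`, and the shard tuple layout are hypothetical names chosen for the example, and the slide's Java-style `addRow` is rendered in Python style.

```python
import itertools


class Table:
    """Hypothetical in-memory table: an append-only list of committed shards."""

    def __init__(self):
        self.shards = []                      # (commit_ts, start, end, rows)
        self._clock = itertools.count(1)      # strictly increasing logical commit time

    def write(self, start, end):
        """Open a write session for the half-open time range [start, end)."""
        return WriteSession(self, start, end)


class WriteSession:
    """Buffers rows locally; nothing is visible until commit()."""

    def __init__(self, table, start, end):
        self.table, self.start, self.end = table, start, end
        self.rows = []

    def add_row(self, ts, payload=None):
        if not (self.start <= ts < self.end):
            raise ValueError(f"timestamp {ts} outside [{self.start}, {self.end})")
        if self.rows and ts < self.rows[-1][0]:
            raise ValueError("rows must be added in non-decreasing time order")
        self.rows.append((ts, payload))       # equal timestamps are accepted

    def commit(self):
        # A single append publishes the replacement for the whole range at once.
        commit_ts = next(self.table._clock)
        self.table.shards.append((commit_ts, self.start, self.end, tuple(self.rows)))
        return commit_ts
```

In this sketch a delete needs no extra machinery: committing a session with no rows records a shard that covers the range and contributes nothing, which matches the slide's description of delete as a special case of update.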
  11. Read API
      • Snapshot reads over a given time range
      • Rows returned are based on the latest committed view of the table at the start of the read operation; the read remains isolated from concurrent writes

          Iterator<Row> i = read(table, time range);
          while (i.hasNext()) {
              doSomething(i.next());
          }
  12. Other Operations
      • Some operations that are not officially supported but are a natural fit for Smooth:
        - Distributed snapshot reads
        - Reads in the past, permanent snapshots
        - Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
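The read-modify-write idea on slide 12 can be illustrated as follows. This is a sketch of generic optimistic concurrency control keyed on a commit timestamp, not Smooth's actual mechanism; `VersionedRange`, `write_if_unchanged`, and `read_modify_write` are hypothetical names, and a single lock stands in for whatever the metadata layer would really do.

```python
import threading


class ConflictError(Exception):
    """Raised when another write committed after our snapshot was taken."""


class VersionedRange:
    """Hypothetical store for one time range, tracking the last commit timestamp."""

    def __init__(self):
        self._lock = threading.Lock()
        self.commit_ts = 0
        self.rows = ()

    def read(self):
        """Return (commit timestamp of the snapshot, rows)."""
        with self._lock:
            return self.commit_ts, self.rows

    def write_if_unchanged(self, expected_ts, new_rows):
        """Commit only if nothing has been committed since expected_ts."""
        with self._lock:
            if self.commit_ts != expected_ts:
                raise ConflictError(
                    f"expected commit {expected_ts}, found {self.commit_ts}")
            self.commit_ts += 1
            self.rows = tuple(new_rows)
            return self.commit_ts


def read_modify_write(store, transform, max_attempts=5):
    """Retry loop: read a snapshot, transform it, commit if still current."""
    for _ in range(max_attempts):
        snapshot_ts, rows = store.read()
        try:
            return store.write_if_unchanged(snapshot_ts, transform(rows))
        except ConflictError:
            continue                          # someone else won; re-read and retry
    raise ConflictError("gave up after repeated conflicts")
```

The writer never holds a lock while computing `transform`; a conflicting commit is detected only at commit time, which is the defining trait of the optimistic approach.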
  14. Table Implementation
      [Figure: axes "Time column" and "Commit time" (c1, c2); elements labeled Shard 1, Shard 2, overwritten time range, Data file, Replica, Metadata layer, Data layer]
      • Shard is the internal representation of an update operation; semantically immutable
      • Data file contains the new set of ordered rows; immutable and indexed; potentially replicated
  15. Read Algorithm
      [Figure: axes "Time column" and "Commit time"; Shard 1 to Shard 4, with the range to read and the start of the read marked]
      • Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
      • The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
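One way to compute such a read plan is sketched below. It assumes each shard is described by a commit timestamp and a half-open time range, and that a newer shard hides every older shard over the range it covers; the `Shard` type and `read_plan` function are illustrative names, not Smooth's API.

```python
from typing import NamedTuple


class Shard(NamedTuple):
    commit_ts: int   # logical commit timestamp
    start: int       # covered time range is [start, end)
    end: int


def read_plan(shards, lo, hi, snapshot_ts):
    """Visible (start, end, commit_ts) fragments for a read of [lo, hi).

    Shards committed after snapshot_ts are ignored, which gives snapshot isolation.
    """
    uncovered = [(lo, hi)]                    # parts of the read range not yet claimed
    plan = []
    visible = [s for s in shards if s.commit_ts <= snapshot_ts]
    for shard in sorted(visible, key=lambda s: s.commit_ts, reverse=True):
        remaining = []
        for a, b in uncovered:
            x, y = max(a, shard.start), min(b, shard.end)
            if x >= y:                        # no overlap with this gap
                remaining.append((a, b))
                continue
            plan.append((x, y, shard.commit_ts))
            if a < x:
                remaining.append((a, x))      # left part still uncovered
            if y < b:
                remaining.append((y, b))      # right part still uncovered
        uncovered = remaining
    return sorted(plan)
```

Walking the shards from newest to oldest means each stretch of time is claimed by the latest shard that covers it. Fragments come back ordered by start time, so a reader can stream rows from each fragment's data file in turn without comparing individual rows across shards.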
  16. Data File format
      The underlying data file is indexed using a simple two-level static B+ tree
  17. Data File format
      A data file has one index block and individually compressed data blocks laid out contiguously
      • Data block is the unit of read; variable sized and compressed; typically small number of MBs; allow random access and parallelization
      • Currently use lz4 for most of the files; very low overhead but still gives us about 2x compression on average; have used gzip for some of the cold data files
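The layout described above can be sketched as an index that maps each block's first timestamp to its byte offset and length, followed by the compressed blocks. Several things here are assumptions made for the example: rows are encoded as JSON, blocks hold a fixed number of rows rather than a target byte size, and `zlib` from the Python standard library stands in for lz4. The two-level structure is read here as the index block acting as the root and the data blocks as leaves.

```python
import bisect
import json
import zlib


def build_data_file(rows, rows_per_block=2):
    """rows: list of (timestamp, value) sorted by timestamp.

    Returns (index, body) where index holds (first_ts, offset, length) per block.
    """
    index, body = [], b""
    for i in range(0, len(rows), rows_per_block):
        block = zlib.compress(json.dumps(rows[i:i + rows_per_block]).encode())
        index.append((rows[i][0], len(body), len(block)))
        body += block
    return index, body


def read_range(index, body, lo, hi):
    """Rows with lo <= timestamp < hi, decompressing only the blocks that can match."""
    first_keys = [entry[0] for entry in index]
    # The block just before the first one starting at >= lo may still hold rows >= lo.
    start = max(bisect.bisect_left(first_keys, lo) - 1, 0)
    # Blocks whose first key is >= hi cannot hold rows < hi.
    stop = bisect.bisect_left(first_keys, hi)
    out = []
    for _, offset, length in index[start:stop]:
        block_rows = json.loads(zlib.decompress(body[offset:offset + length]))
        out.extend((ts, value) for ts, value in block_rows if lo <= ts < hi)
    return out
```

Because every block is compressed on its own, a range read touches only the blocks selected through the index, and those blocks could be fetched and decompressed in parallel.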
  18. Compaction
      Problem: overwrites of random time ranges and small writes
      • Excessive fragmentation of the read plan; leads to slow reads, and excessive seeks on the backend data stores reducing overall serving capacity
      • Metadata bloat; small shards/files mean larger metadata on Smooth and object stores
      • Garbage; data under hidden ranges can be garbage collected
  19. Compaction Process
      [Figure: axes "Time column" and "Commit time"; Shard 1 to Shard 4 and a new compacted shard; annotations "New compacted shard committed here" and "Deleted after the new shard is committed"]
      • Underlying data files are not immediately deleted to support ongoing reads
      • Only contiguous fragments can be combined together!
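The contiguity rule can be made concrete with a small sketch. It assumes compaction takes a run of read-plan fragments, each carrying its visible rows, and produces one new shard covering the combined range; `compact` and the fragment tuple layout are hypothetical.

```python
def compact(fragments):
    """Merge fragments into one (start, end, rows) shard.

    fragments: list of (start, end, rows) sorted by start, where each fragment
    covers [start, end) and rows are already in time order.
    """
    if not fragments:
        raise ValueError("nothing to compact")
    for (_, prev_end, _), (next_start, _, _) in zip(fragments, fragments[1:]):
        if prev_end != next_start:
            raise ValueError("only contiguous fragments can be combined")
    # Fragments are disjoint and ordered, so concatenation preserves time order.
    rows = [row for _, _, frag_rows in fragments for row in frag_rows]
    return fragments[0][0], fragments[-1][1], rows
```

In this model the compacted shard replaces everything in its range once committed, so a gap between inputs would hide rows that belong to shards outside the compaction. That is one plausible reading of why the slide restricts compaction to contiguous fragments.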
  20. Comparing with LSM
      Similar to Log Structured Merge (LSM) tree
      • Smooth implementation is log structured
      • Immutable shards with embedded B-trees are similar to “sstables”
      • Both have compaction processes aimed at similar objectives
      • Differ in details – each shard carries with itself a “bulk delete” tombstone whose handling is deferred till compaction time
      • Read algorithm is different – no row level comparison for “next” operation
      • Key-value stores can use similar ideas to optimize bulk deletes
  21. Write Amplification
      • Write amplification = actual bytes written to storage / bytes written by user
      • Has not been an issue in practice – less than 10 on average
      • If the write workload gets more challenging (i.e. higher rate of small random writes)
      • Use leveled compaction similar to traditional key-value based LSM storage engines
      • by allowing non-contiguous shards to be combined – shards essentially get moved into data files
      • would make our read algorithm more complex - need to merge read plans from all levels
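As a worked example of the ratio defined on this slide, the helper below computes write amplification from two byte counts. The numbers in the usage note are made up for illustration and are not measurements from Smooth.

```python
def write_amplification(storage_bytes, user_bytes):
    """Actual bytes written to storage divided by bytes written by the user."""
    if user_bytes <= 0:
        raise ValueError("user_bytes must be positive")
    return storage_bytes / user_bytes
```

Suppose a user writes 100 MB and those rows are later rewritten by two compactions of 100 MB each. Storage then sees 300 MB of writes for 100 MB of user data, so `write_amplification(300, 100)` gives 3.0.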
  23. System Architecture
  24. System Architecture
      • All Smooth metadata is stored on Microsoft SQL Server, which gets replicated to backup servers in a remote data center
      • Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
      • Applications link with a Smooth client library in order to access Smooth
  25. System Architecture
      • Data files are stored in object stores
      • Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
      • Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at TS
  26. Virtues of Immutability
      • A design principle we have been using is immutability - both physical (write-once data files) and semantic (shards)
      • The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and physical data movement process can operate in parallel with no interference and with minimal coordination
      • Data files can be cached without worrying about consistency
      This simple model has been central to keeping the system simple, robust and scalable.
  27. Some Statistics
      • Multiple PBs of unique compressed data
      • Read peaks in excess of 100 GB/s (before decompressing)
      • 100s of millions of files/shards
      • 10s of millions of tables
      • 10s of thousands of concurrent requests
  29. Looking Forward
      • Multi-datacenter and public cloud read scaling
      • CDN-like distributed caching layer that spans even to sites that don’t store data
      • Encryption at rest may be important for cloud use cases
      • More cost-efficient multi-DC replication and cold data storage
      • Data stores that use erasure coding
      • More efficient data encoding and compression
      • Data stores that can replicate data across data centers and support desirable failover semantics
  30. Looking Forward
      • Performance
      • Performance consistency is a major concern - tail latencies are a major issue with HDFS
      • Issues with slow serialization and parsing of rows
      • More challenging workloads
      • Interactive workloads are becoming common – latency sensitive
      • Column filtering
      • Complex read queries
  31. Looking Forward
      Complex queries
      • Common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
      • Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
      • In reality each ticker has its own time range, and there are several variations of this query
      • Looking at new kinds of indexing
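The example query on this slide can be illustrated with a per-ticker index. The slide does not say which indexing schemes are under consideration, so this is only one possible approach: `build_ticker_index`, `query`, and the `(timestamp, ticker, value)` row shape are all assumptions made for the sketch.

```python
import bisect
import heapq
from collections import defaultdict


def build_ticker_index(rows):
    """rows: iterable of (timestamp, ticker, value) sorted by timestamp.

    Returns a mapping from ticker to its own time-ordered sub-series.
    """
    index = defaultdict(list)
    for row in rows:
        index[row[1]].append(row)
    return index


def query(index, tickers, lo, hi):
    """Rows for the given tickers with lo <= timestamp < hi, ordered by time."""
    slices = []
    for ticker in tickers:
        series = index.get(ticker, [])
        times = [row[0] for row in series]
        first = bisect.bisect_left(times, lo)
        last = bisect.bisect_left(times, hi)
        slices.append(series[first:last])
    # k-way merge of already sorted sub-series; ties keep the order of `tickers`.
    return list(heapq.merge(*slices, key=lambda row: row[0]))
```

Only the requested sub-series are touched, so the cost scales with the size of the selected tickers' data rather than with the whole table. A variation where each ticker has its own time range would pass a per-ticker `(lo, hi)` pair instead of a shared one.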
  32. Looking Forward
      • Moving away from a “thick” Smooth client
      • Enables quick iteration and bug fixes
      • Multi-language support
      • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
      • Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
  33. Thank You!
