www.twosigma.com
Smooth Storage
A storage system for managing structured time series data at Two Sigma
September 13, 2018
Proprietary and Confidential – Not for Redistribution
Saurabh Goel
saurabh.goel@twosigma.com
Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer
to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon
for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without
notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of
such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two
Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark
does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
Motivation
• Why have specialized storage for time series data?
  • Extremely common at Two Sigma
  • Time is one of the primary dimensions along which applications want to partition and filter data
  • Scale – in terms of both size and access
  • Optimizing for the target application workload and requirements
Smooth’s design emphasis
• Optimized for range queries and range updates executed in parallel per table
• File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
• Centrally managed service at Two Sigma (TS)
• Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
• Storage efficiency is also a major concern given the overall size of data stored
File system ------------------------------ Smooth --------------- Database
Target Application characteristics
• Parallel, time-partitioned jobs that move a lot of data
• Tend to be batch oriented; care more about throughput than latency
• New use cases are demanding better latency, smaller IO, and more query power
• Not a good fit for workloads that require very low latencies or issue large numbers of small reads and writes
Data Model
• Tables with schema; mandatory time column
• Rows ordered and indexed by time
• Not relational – duplicate timestamps/rows allowed; no notion of primary key, but users can enforce primary key (PK) constraints in their applications
• Easy to update schema
• Can store wide sparse schemas efficiently
Write API
Updates a given time range atomically; the existing rows belonging to the range
are replaced by the given set of new rows
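// open a write session over the half-open time range [10, 42)
WriteSession s = write(table, [10, 42));
s.addRow(<10, ..>);
s.addRow(<15, ..>);
// repeated timestamp is ok
s.addRow(<15, ..>);
// invalid: rows must be added in non-decreasing time order
s.addRow(<10, ..>);
// invalid: rows must lie within the given time range
s.addRow(<50, ..>);
// atomically replace the existing rows in the range with the rows added above
s.commit();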
• Set of write operations to a table forms a total order; internally each write
gets a unique, strictly monotonically increasing logical commit timestamp
• Distributed atomic writes are possible
• Delete is just a special case of update where no new rows are written
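The write rules above (a half-open time range, rows added in non-decreasing time order, atomic replacement on commit, delete as an empty update) can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `WriteSketch`, `Table`, `WriteSession`, `Row`, and `replaceRange` are hypothetical names, not Smooth's actual client API, and a single synchronized list stands in for the real shard-based storage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the write semantics; not Smooth's real client API.
final class WriteSketch {
    record Row(long time, String payload) {}

    static final class Table {
        private List<Row> rows = new ArrayList<>(); // kept ordered by time
        private long lastCommit = 0;                // logical commit timestamp

        // Atomically replace every existing row in [start, end) with newRows.
        synchronized long replaceRange(long start, long end, List<Row> newRows) {
            List<Row> next = new ArrayList<>();
            for (Row r : rows) if (r.time() < start) next.add(r);
            next.addAll(newRows);
            for (Row r : rows) if (r.time() >= end) next.add(r);
            rows = next;
            return ++lastCommit; // strictly increasing per table
        }

        synchronized List<Row> snapshot() { return List.copyOf(rows); }
    }

    static final class WriteSession {
        private final Table table;
        private final long start, end; // half-open range [start, end)
        private final List<Row> pending = new ArrayList<>();

        WriteSession(Table table, long start, long end) {
            this.table = table; this.start = start; this.end = end;
        }

        void addRow(Row r) {
            if (r.time() < start || r.time() >= end)
                throw new IllegalArgumentException("row lies outside the write range");
            if (!pending.isEmpty() && r.time() < pending.get(pending.size() - 1).time())
                throw new IllegalArgumentException("rows must be in non-decreasing time order");
            pending.add(r);
        }

        // Committing with no rows added deletes the range.
        long commit() { return table.replaceRange(start, end, pending); }
    }

    public static void main(String[] args) {
        Table t = new Table();
        WriteSession s = new WriteSession(t, 10, 42);
        s.addRow(new Row(10, "a"));
        s.addRow(new Row(15, "b"));
        s.addRow(new Row(15, "c")); // repeated timestamp is ok
        System.out.println("commit " + s.commit() + ", rows " + t.snapshot().size());
    }
}
```

In this sketch a delete is simply a session that commits without adding rows, which mirrors the "special case of update" described above.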
Read API
• Snapshot reads over a given time range
Iterator<Row> i = read(table, time range);
while (i.hasNext()) {
    doSomething(i.next());
}
• Rows returned are based on the latest committed view of the table at the start of the read operation. The read remains isolated from concurrent writes.
Other Operations
• Some operations that are not officially supported but are a natural fit for Smooth:
  • Distributed snapshot reads
  • Reads in the past, permanent snapshots
  • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
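A read-modify-write built on optimistic concurrency control over the commit time could look roughly like the following. This is a minimal sketch, assuming a conditional write that succeeds only if the table's latest commit timestamp still equals the one observed by the read; `OccSketch`, `Version`, `writeIfUnchanged`, and `readModifyWrite` are hypothetical names, and a single string stands in for table contents.

```java
import java.util.function.UnaryOperator;

// Hypothetical OCC sketch: a write commits only if no other write
// has committed since the commit time observed by the read.
final class OccSketch {
    record Version(long commitTime, String data) {}

    private Version latest = new Version(0, "");

    synchronized Version read() { return latest; }

    // Returns the new version, or null if another write committed first.
    synchronized Version writeIfUnchanged(long readCommitTime, String newData) {
        if (latest.commitTime() != readCommitTime) return null; // conflict
        latest = new Version(readCommitTime + 1, newData);
        return latest;
    }

    // Retry loop: re-read and re-apply the modification after each conflict.
    Version readModifyWrite(UnaryOperator<String> modify) {
        while (true) {
            Version seen = read();
            Version committed = writeIfUnchanged(seen.commitTime(), modify.apply(seen.data()));
            if (committed != null) return committed;
        }
    }

    public static void main(String[] args) {
        OccSketch table = new OccSketch();
        Version v = table.readModifyWrite(d -> d + "x");
        System.out.println(v.commitTime() + " " + v.data());
        // A writer holding a stale commit time is rejected.
        System.out.println(table.writeIfUnchanged(0, "stale") == null);
    }
}
```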
Table Implementation
[Figure: two shards (Shard 1, Shard 2) plotted against the time column and commit time (c1, c2), with the overwritten time range marked. The metadata layer holds the shards; the data layer holds the data file and its replica.]
• Data file contains the new set of ordered rows; immutable and indexed; potentially replicated
• Shard is the internal representation of an update operation; semantically immutable
Read Algorithm
[Figure: Shards 1 to 4 plotted against the time column and commit time, with the range to read and the start of the read marked.]
• Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
• The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
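One way to picture the read plan is as follows: for every piece of the requested time range, pick the shard with the highest commit time among those committed before the read started. The sketch below makes that concrete under stated assumptions. `ReadPlanSketch`, `Shard`, `Fragment`, and `plan` are hypothetical names, shard ranges are taken to be half-open, and the quadratic scan is chosen for clarity rather than speed; the real implementation is not described in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of a read plan; names and types are illustrative only.
final class ReadPlanSketch {
    record Shard(String name, long start, long end, long commitTime) {} // covers [start, end)
    record Fragment(String shard, long start, long end) {}

    static List<Fragment> plan(List<Shard> shards, long qs, long qe, long snapshot) {
        List<Fragment> out = new ArrayList<>();
        if (qs >= qe) return out;
        // Cut the query range at every visible shard boundary that falls inside it.
        TreeSet<Long> cuts = new TreeSet<>(List.of(qs, qe));
        for (Shard s : shards) {
            if (s.commitTime() > snapshot) continue; // committed after the read started
            if (s.start() > qs && s.start() < qe) cuts.add(s.start());
            if (s.end() > qs && s.end() < qe) cuts.add(s.end());
        }
        List<Long> pts = new ArrayList<>(cuts);
        for (int i = 0; i + 1 < pts.size(); i++) {
            long lo = pts.get(i), hi = pts.get(i + 1);
            Shard top = null; // visible shard with the highest commit time covering [lo, hi)
            for (Shard s : shards) {
                if (s.commitTime() <= snapshot && s.start() <= lo && s.end() >= hi
                        && (top == null || s.commitTime() > top.commitTime())) top = s;
            }
            if (top == null) continue; // nothing was ever written to this piece
            int n = out.size();
            if (n > 0 && out.get(n - 1).shard().equals(top.name()) && out.get(n - 1).end() == lo) {
                // Extend the previous fragment when the same shard stays on top.
                out.set(n - 1, new Fragment(top.name(), out.get(n - 1).start(), hi));
            } else {
                out.add(new Fragment(top.name(), lo, hi));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Shard> shards = List.of(
            new Shard("S1", 0, 100, 1), new Shard("S2", 20, 40, 2),
            new Shard("S3", 30, 60, 3), new Shard("S4", 50, 70, 4));
        // Read [10, 80) as of commit time 3: S4 is not yet visible.
        plan(shards, 10, 80, 3).forEach(System.out::println);
    }
}
```

Because the plan is computed against a fixed commit time, later writes do not change it, which is how the snapshot isolation of reads falls out of this model.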
Data File format
The underlying data file is indexed using a simple two-level static B+ tree
A data file has one index block and individually compressed data blocks laid out contiguously
• Data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
• Currently use lz4 for most of the files; very low overhead but still gives us about 2x compression on average; have used gzip for some of the cold data files
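To show how a single index block can narrow a time-range read down to a few data blocks, here is a minimal sketch. It assumes the index block stores the timestamp of the first row of each data block, which the slides do not state; `BlockIndexSketch`, `lowerBound`, and `blocksFor` are hypothetical names. Since duplicate timestamps are allowed, rows with the same time may straddle two blocks, so the sketch also includes the block just before the first one whose first timestamp reaches the start of the range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical index lookup: firstTimes[i] is the timestamp of the first row in data block i.
final class BlockIndexSketch {
    // Index of the first element >= key in a sorted array (classic lower bound).
    static int lowerBound(long[] sorted, long key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Data blocks that may contain rows with start <= time < end.
    static List<Integer> blocksFor(long[] firstTimes, long start, long end) {
        List<Integer> blocks = new ArrayList<>();
        if (start >= end) return blocks;
        // The block before the first one starting at >= start may still hold rows >= start.
        int first = Math.max(0, lowerBound(firstTimes, start) - 1);
        // The last candidate is the last block whose first row is strictly before end.
        int last = lowerBound(firstTimes, end) - 1;
        for (int i = first; i <= last; i++) blocks.add(i);
        return blocks;
    }

    public static void main(String[] args) {
        long[] firstTimes = {0, 100, 200, 300};
        System.out.println(blocksFor(firstTimes, 150, 250));
    }
}
```

Each selected block can then be fetched and decompressed independently, which is what makes random access and parallel reads possible with per-block compression.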
Compaction
Problem: overwrites of random time ranges and small writes
• Excessive fragmentation of the read plan; leads to slow reads and excessive seeks on the backend data stores, reducing overall serving capacity
• Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
• Garbage; data under hidden ranges can be garbage collected
Compaction Process
[Figure: Shards 1 to 4 plotted against the time column and commit time, with the new compacted shard and its commit point marked.]
• Shards replaced by compaction are deleted after the new shard is committed
• Underlying data files are not immediately deleted, to support ongoing reads
• Only contiguous fragments can be combined together!
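The contiguity rule can be illustrated with a short sketch that scans a read plan and groups runs of adjacent fragments into candidate ranges for a new compacted shard. The names `CompactionSketch`, `Fragment`, `Candidate`, and `candidates` are hypothetical, and treating every run of two or more touching fragments as a candidate is an assumption made for illustration; a real policy would presumably also weigh fragment sizes and cost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: find runs of contiguous read-plan fragments worth compacting.
final class CompactionSketch {
    record Fragment(String shard, long start, long end) {}            // visible piece of a shard
    record Candidate(long start, long end, List<String> sources) {}  // range of a new compacted shard

    // plan must be ordered by start time, as a read plan is.
    static List<Candidate> candidates(List<Fragment> plan) {
        List<Candidate> out = new ArrayList<>();
        int i = 0;
        while (i < plan.size()) {
            int j = i;
            List<String> sources = new ArrayList<>();
            sources.add(plan.get(i).shard());
            // Extend the run while the next fragment starts exactly where this one ends.
            while (j + 1 < plan.size() && plan.get(j + 1).start() == plan.get(j).end()) {
                j++;
                sources.add(plan.get(j).shard());
            }
            // A single isolated fragment gains nothing from compaction.
            if (j > i) out.add(new Candidate(plan.get(i).start(), plan.get(j).end(), sources));
            i = j + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Fragment> plan = List.of(
            new Fragment("S1", 10, 20), new Fragment("S2", 20, 30),
            new Fragment("S3", 30, 60), new Fragment("S4", 70, 80));
        System.out.println(candidates(plan));
    }
}
```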
Comparing with LSM
Similar to a Log-Structured Merge (LSM) tree
• Smooth implementation is log structured
• Immutable shards with embedded B-trees are similar to “sstables”
• Both have compaction processes aimed at similar objectives
• Differ in details – each shard carries with itself a “bulk delete” tombstone whose handling is deferred till compaction time
• Read algorithm is different – no row-level comparison for the “next” operation
• Key-value stores can use similar ideas to optimize bulk deletes
Write Amplification
• Write amplification = actual bytes written to storage / bytes written by user
• Has not been an issue in practice – less than 10 on average
• If the write workload gets more challenging (i.e. a higher rate of small random writes):
  • Use leveled compaction, similar to traditional key-value based LSM storage engines, by allowing non-contiguous shards to be combined – shards essentially get moved into data files
  • This would make our read algorithm more complex – need to merge read plans from all levels
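As a quick worked example of the write amplification ratio, with purely hypothetical numbers (they are not measurements from Smooth): suppose a user writes 1 GB and two later compaction passes each rewrite that same 1 GB.

```latex
\[
\text{write amplification}
  = \frac{\text{bytes written to storage}}{\text{bytes written by user}}
\]
% Hypothetical numbers: 1 GB user write, then two compaction passes of 1 GB each.
\[
\frac{1\,\text{GB} + 2 \times 1\,\text{GB}}{1\,\text{GB}} = 3
\]
```

Every additional rewrite of the same data by compaction adds one to the ratio, which is why small random writes that trigger frequent compaction push it up.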
System Architecture
• All Smooth metadata is stored in Microsoft SQL Server, which is replicated to backup servers in a remote data center
• Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
• Applications link with a Smooth client library in order to access Smooth
• Data files are stored in object stores
• Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
• Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at Two Sigma
Virtues of Immutability
• A design principle we have been using is immutability - both physical (write-once data files) and semantic (shards)
• The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
• Data files can be cached without worrying about consistency
This simple model has been central to keeping the system simple, robust and scalable.
Some Statistics
• Multiple PBs of unique compressed data
• Read peaks in excess of 100 GB/s (before decompressing)
• 100s of millions of files/shards
• 10s of millions of tables
• 10s of thousands of concurrent requests
Looking Forward
• Multi-datacenter and public cloud read scaling
  • CDN-like distributed caching layer that spans even to sites that don’t store data
  • Encryption at rest may be important for cloud use cases
• More cost-efficient multi-dc replication and cold data storage
  • Data stores that use erasure coding
  • More efficient data encoding and compression
  • Data stores that can replicate data across data centers and support desirable failover semantics
• Performance
  • Performance consistency is a major concern - tail latencies are a major issue with HDFS
  • Issues with slow serialization and parsing of rows
• More challenging workloads
  • Interactive workloads are becoming common – latency sensitive
  • Column filtering
  • Complex read queries
Complex queries
• Common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
• Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
• In reality each ticker has its own time range, and there are several variations of this query
• Looking at new kinds of indexing
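For reference, the example query can be written as a plain scan-and-filter over rows that are already ordered by time. This is only a baseline sketch with no secondary index, which is roughly what a time-indexed table allows on its own; `TickerQuerySketch`, `Row`, the `price` column, and `query` are hypothetical names used for illustration.

```java
import java.util.List;
import java.util.Set;

// Hypothetical baseline: scan the time range and filter by ticker, with no secondary index.
final class TickerQuerySketch {
    record Row(long time, String ticker, double price) {}

    // rowsByTime must already be ordered by time; the result keeps that order.
    static List<Row> query(List<Row> rowsByTime, Set<String> tickers, long start, long end) {
        return rowsByTime.stream()
            .filter(r -> r.time() >= start && r.time() < end) // half-open time range
            .filter(r -> tickers.contains(r.ticker()))
            .toList();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row(1, "AAA", 10.0), new Row(1, "BBB", 20.0),
            new Row(2, "CCC", 30.0), new Row(3, "AAA", 11.0));
        System.out.println(query(rows, Set.of("AAA", "CCC"), 1, 3));
    }
}
```

With a column of 10k to 20k distinct values and a small requested subset, this baseline reads and discards most rows in the time range, which is the cost that a new kind of index would aim to avoid.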
• Moving away from a “thick” Smooth client
  • Enables quick iteration and bug fixes
  • Multi-language support
  • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
• Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
Thank You!
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Two Sigma
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesTwo Sigma
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeTwo Sigma
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied VolatilityTwo Sigma
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API DesignTwo Sigma
 

More from Two Sigma (19)

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
BeakerX - Tiezheng Li
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
 
Engineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee JooEngineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee Joo
 
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
 
Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
 
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia YeResponsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
 
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
 
Algorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + PracticeAlgorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + Practice
 
HUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
 
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
 

Recently uploaded

NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...Amil baba
 
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Aimil Ltd
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfKamal Acharya
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Krakówbim.edu.pl
 
Paint shop management system project report.pdf
Paint shop management system project report.pdfPaint shop management system project report.pdf
Paint shop management system project report.pdfKamal Acharya
 
retail automation billing system ppt.pptx
retail automation billing system ppt.pptxretail automation billing system ppt.pptx
retail automation billing system ppt.pptxfaamieahmd
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamDr. Radhey Shyam
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdfKamal Acharya
 
Maestro Scripting Language CNC programacion
Maestro Scripting Language CNC programacionMaestro Scripting Language CNC programacion
Maestro Scripting Language CNC programacionliberfusta1
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdfKamal Acharya
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf884710SadaqatAli
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfKamal Acharya
 
Peek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfPeek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfAyahmorsy
 
An improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyAn improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyBOHRInternationalJou1
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfPipe Restoration Solutions
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopEmre Günaydın
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDrGurudutt
 

Recently uploaded (20)

NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
 
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
Soil Testing Instruments by aimil ltd.- California Bearing Ratio apparatus, c...
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Kraków
 
Paint shop management system project report.pdf
Paint shop management system project report.pdfPaint shop management system project report.pdf
Paint shop management system project report.pdf
 
retail automation billing system ppt.pptx
retail automation billing system ppt.pptxretail automation billing system ppt.pptx
retail automation billing system ppt.pptx
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Maestro Scripting Language CNC programacion
Maestro Scripting Language CNC programacionMaestro Scripting Language CNC programacion
Maestro Scripting Language CNC programacion
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdf
 
Peek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfPeek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdf
 
An improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyAn improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technology
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
 

Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

  • 1. Smooth Storage
    A storage system for managing structured time series data at Two Sigma
    Saurabh Goel, saurabh.goel@twosigma.com
    www.twosigma.com
    September 13, 2018
    Proprietary and Confidential – Not for Redistribution
  • 2. Disclaimer This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
  • 3. Outline
    • Motivation and design emphasis
    • Data Model and API
    • Implementation of the data model
    • System Architecture
    • Looking Forward
  • 4. Motivation
    • Why have specialized storage for time series data?
        • Extremely common at Two Sigma
        • Time is one of the primary dimensions along which applications want to partition and filter data
        • Scale, in terms of both size and access
        • Optimizing for the target application workload and requirements
  • 5. Smooth's design emphasis
    • Optimized for range queries and range updates executed in parallel per table
    • File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
    • Centrally managed service at Two Sigma (TS)
    • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
    • Storage efficiency is also a major concern given the overall size of data stored
    [Diagram: a spectrum running from file system to database, with Smooth placed between the two]
  • 6. Target Application characteristics
    • Parallel time-partitioned jobs that move a lot of data
    • Tend to be batch oriented; care more about throughput than latency
    • New use cases are demanding better latency, smaller IO, more query power
    • Not good for workloads that require very low latencies or issue large numbers of small reads and writes
  • 7. Outline
    • Motivation and design emphasis
    • Data Model and API
    • Implementation of the data model
    • System Architecture
    • Looking Forward
  • 8. Data Model
    • Tables with schema; mandatory time column
    • Rows ordered and indexed by time
    • Not relational: duplicate timestamps/rows are allowed; there is no notion of primary key, but users can enforce PK constraints in their applications
    • Easy to update schema
    • Can store wide sparse schemas efficiently
  • 9. Write API
    Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows.
        WriteSession s = write(table, [10, 42));
        s.addRow(<10, ..>);
        s.addRow(<15, ..>);
        // repeated timestamp is ok
        s.addRow(<15, ..>);
        // rows must be added in non-decreasing order
        s.addRow(<10, ..>);
        // rows must lie within the given time range
        s.addRow(<50, ..>);
        s.commit();
  • 10. Write API
    • The set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
    • Distributed atomic writes are possible
    • Delete is just a special case of update where no new rows are written
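The write rules on slides 9 and 10 can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `SketchTable`, `Shard`, and the exception types are hypothetical names, rows are reduced to their timestamps, and nothing here reflects the real Smooth client library. It shows the three rules from the slides: rows must fall inside the session's half-open range, rows must arrive in non-decreasing time order, and each commit receives a strictly increasing logical commit timestamp.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory stand-in for one Smooth table; not the real client library. */
class SketchTable {

    /** One committed range update: the rows that replace everything in [start, end). */
    static final class Shard {
        final long start, end, commitTime;
        final List<Long> rowTimes;

        Shard(long start, long end, long commitTime, List<Long> rowTimes) {
            this.start = start;
            this.end = end;
            this.commitTime = commitTime;
            this.rowTimes = rowTimes;
        }
    }

    private final List<Shard> shards = new ArrayList<>();
    private long lastCommitTime = 0;

    /** Opens a session that will replace the half-open time range [start, end). */
    WriteSession write(long start, long end) {
        if (start >= end) throw new IllegalArgumentException("empty time range");
        return new WriteSession(start, end);
    }

    /** Returns a copy of the committed shards, oldest first. */
    synchronized List<Shard> shards() {
        return new ArrayList<>(shards);
    }

    final class WriteSession {
        private final long start, end;
        private final List<Long> rowTimes = new ArrayList<>();
        private boolean committed = false;

        private WriteSession(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Buffers one row, represented here only by its timestamp. */
        void addRow(long time) {
            if (committed) throw new IllegalStateException("session already committed");
            if (time < start || time >= end)
                throw new IllegalArgumentException("row outside the session's time range");
            if (!rowTimes.isEmpty() && time < rowTimes.get(rowTimes.size() - 1))
                throw new IllegalArgumentException("rows must be added in non-decreasing order");
            rowTimes.add(time);
        }

        /** Publishes the buffered rows in one step and returns the commit timestamp. */
        long commit() {
            if (committed) throw new IllegalStateException("session already committed");
            committed = true;
            synchronized (SketchTable.this) {
                long commitTime = ++lastCommitTime; // strictly increasing per table
                List<Long> rows = Collections.unmodifiableList(new ArrayList<>(rowTimes));
                shards.add(new Shard(start, end, commitTime, rows));
                return commitTime;
            }
        }
    }
}
```

In this model a delete is simply `table.write(start, end).commit()` with no `addRow` calls, which matches the slide's remark that delete is a special case of update.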
  • 11. Read API
    • Snapshot reads over a given time range
        Iterator<Row> i = read(table, time range);
        while (i.hasNext()) {
            doSomething(i.next());
        }
    • Rows returned are based on the latest committed view of the table at the start of the read operation. The read remains isolated from concurrent writes.
  • 12. Other Operations
    • Some operations that are not officially supported but are a natural fit for Smooth:
        • Distributed snapshot reads
        • Reads in the past, permanent snapshots
        • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
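Slide 12 mentions read-modify-write using optimistic concurrency control on the commit time without showing how it might look. One possible shape, as a sketch only: the class `OccSketch`, its methods, and the choice to condition on a single table-level commit time are all assumptions made for illustration, and the table contents are reduced to one `long` value.

```java
/** Hypothetical model of OCC on the commit time; not an actual Smooth API. */
class OccSketch {

    /** What a snapshot read observes: the data plus the commit time it was read at. */
    static final class Snapshot {
        final long commitTime;
        final long value;

        Snapshot(long commitTime, long value) {
            this.commitTime = commitTime;
            this.value = value;
        }
    }

    private long commitTime = 0;
    private long value = 0; // stands in for the rows of one time range

    /** Snapshot read: returns the latest committed state. */
    synchronized Snapshot read() {
        return new Snapshot(commitTime, value);
    }

    /** Commits only if nothing else has committed since expectedCommitTime. */
    synchronized boolean writeIfUnchanged(long expectedCommitTime, long newValue) {
        if (commitTime != expectedCommitTime) {
            return false; // a concurrent write won; the caller must re-read and retry
        }
        value = newValue;
        commitTime++;
        return true;
    }

    /** Read-modify-write loop: re-reads and retries until the conditional write succeeds. */
    long increment() {
        while (true) {
            Snapshot s = read();
            long updated = s.value + 1;
            if (writeIfUnchanged(s.commitTime, updated)) {
                return updated;
            }
        }
    }
}
```

The writer never holds a lock across the read and the write; a conflicting commit is detected only at commit time, which is the usual trade-off of optimistic schemes.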
  • 13. Outline
    • Motivation and design emphasis
    • Data Model and API
    • Implementation of the data model
    • System Architecture
    • Looking Forward
  • 14. Table Implementation
    [Figure: Shard 1 and Shard 2 drawn against the time column and commit time (c1, c2), with the overwritten time range marked; shards belong to the metadata layer, data files and their replicas to the data layer]
    • A data file contains the new set of ordered rows; it is immutable and indexed, and potentially replicated
    • A shard is the internal representation of an update operation; it is semantically immutable
  • 15. Read Algorithm
    [Figure: Shards 1 to 4 drawn against the time column and commit time, with the requested time range and the start of the read marked]
    • Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
    • The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
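The read plan on slide 15 can be sketched as follows. The code is illustrative rather than the actual implementation: `ReadPlanSketch`, `Shard`, `Fragment`, and the `snapshot` parameter are hypothetical names, and the quadratic scan is chosen for clarity, not speed. For every elementary interval between shard boundaries, the visible shard is assumed to be the covering shard with the highest commit time that is not later than the snapshot taken at the start of the read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical read-plan builder; a sketch of the idea, not Smooth's code. */
class ReadPlanSketch {

    /** A committed update covering the half-open time range [start, end). */
    static final class Shard {
        final String name;
        final long start, end, commitTime;

        Shard(String name, long start, long end, long commitTime) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.commitTime = commitTime;
        }
    }

    /** A visible subrange [start, end) that should be read from one shard. */
    static final class Fragment {
        final Shard shard;
        final long start, end;

        Fragment(Shard shard, long start, long end) {
            this.shard = shard;
            this.start = start;
            this.end = end;
        }
    }

    /** Builds the time-ordered read plan for [qStart, qEnd) as of commit time snapshot. */
    static List<Fragment> plan(List<Shard> shards, long qStart, long qEnd, long snapshot) {
        List<Fragment> plan = new ArrayList<>();
        if (qStart >= qEnd) return plan;

        // Every shard boundary inside the query range starts a new elementary interval.
        TreeSet<Long> cuts = new TreeSet<>();
        cuts.add(qStart);
        cuts.add(qEnd);
        for (Shard s : shards) {
            if (s.commitTime > snapshot) continue; // committed after the read started
            if (s.start > qStart && s.start < qEnd) cuts.add(s.start);
            if (s.end > qStart && s.end < qEnd) cuts.add(s.end);
        }

        Long prev = null;
        for (long cut : cuts) {
            if (prev != null) {
                // The latest visible shard covering [prev, cut) hides all older ones.
                Shard top = null;
                for (Shard s : shards) {
                    boolean covers = s.start <= prev && s.end >= cut;
                    if (covers && s.commitTime <= snapshot
                            && (top == null || s.commitTime > top.commitTime)) {
                        top = s;
                    }
                }
                if (top != null) {
                    Fragment last = plan.isEmpty() ? null : plan.get(plan.size() - 1);
                    if (last != null && last.shard == top && last.end == prev) {
                        // Extend the previous fragment instead of adding a new one.
                        plan.set(plan.size() - 1, new Fragment(top, last.start, cut));
                    } else {
                        plan.add(new Fragment(top, prev, cut));
                    }
                }
            }
            prev = cut;
        }
        return plan;
    }
}
```

A shard that carries no rows still appears in the plan for the range it covers; that is how a delete hides older data in this model. Reading "in the past" falls out of the same function by passing an older `snapshot` value.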
  • 16. Data File format
    The underlying data file is indexed using a simple two-level static B+Tree.
  • 17. Data File format
    A data file has one index block and individually compressed data blocks laid out contiguously.
    • The data block is the unit of read; blocks are variable sized and compressed, typically a small number of MBs, and allow random access and parallelization
    • Currently lz4 is used for most of the files; it has very low overhead but still gives about 2x compression on average; gzip has been used for some of the cold data files
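Slides 16 and 17 describe one index block over many data blocks. A minimal sketch of how such an index could select the blocks needed for a time range follows. The layout is an assumption: the slides do not say what the index block stores, so this example supposes it holds the first timestamp of each data block. `BlockIndexSketch` and `blocksFor` are hypothetical names.

```java
/** Hypothetical two-level lookup: one sorted index over the data blocks of a file. */
class BlockIndexSketch {

    // Assumed index content: the first timestamp of each data block, non-decreasing.
    private final long[] firstTimes;

    BlockIndexSketch(long[] firstTimes) {
        this.firstTimes = firstTimes.clone();
    }

    /** Number of blocks whose first timestamp is strictly less than key (binary search). */
    private int countLess(long key) {
        int lo = 0, hi = firstTimes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTimes[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /**
     * Returns {from, to}: the half-open range of block numbers that may hold rows
     * with timestamps in [start, end). from == to means no block needs to be read.
     */
    int[] blocksFor(long start, long end) {
        if (start >= end) return new int[] {0, 0};
        // Duplicate timestamps may spill across a block boundary, so the block just
        // before the first one starting at or after `start` must also be included.
        int from = Math.max(countLess(start) - 1, 0);
        int to = countLess(end);
        if (to <= from) return new int[] {0, 0};
        return new int[] {from, to};
    }
}
```

Because each block is compressed on its own, a reader in this model only has to fetch and decompress blocks `from` to `to - 1`, and could do so in parallel.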
  • 18. Compaction
    Problem: overwrites of random time ranges and small writes
    • Excessive fragmentation of the read plan; leads to slow reads, and to excessive seeks on the backend data stores, reducing overall serving capacity
    • Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
    • Garbage; data under hidden ranges can be garbage collected
  • 19. Compaction Process
    [Figure: Shards 1 to 4 drawn against the time column and commit time, with the new compacted shard and the point where it is committed marked]
    • The input shards are deleted after the new compacted shard is committed
    • Underlying data files are not immediately deleted, to support ongoing reads
    • Only contiguous fragments can be combined together!
  • 20. Comparing with LSM
    • Similar to a Log Structured Merge (LSM) tree:
        • The Smooth implementation is log structured
        • Immutable shards with embedded B-trees are similar to “sstables”
        • Both have compaction processes aimed at similar objectives
    • The two differ in details:
        • Each shard carries with itself a “bulk delete” tombstone whose handling is deferred until compaction time
        • The read algorithm is different: no row-level comparison for the “next” operation
    • Key-value stores can use similar ideas to optimize bulk deletes
  • 21. Write Amplification
    • Write amplification = actual bytes written to storage / bytes written by user
    • Has not been an issue in practice: less than 10 on average
    • If the write workload gets more challenging (i.e. a higher rate of small random writes):
        • Use leveled compaction similar to traditional key-value based LSM storage engines, by allowing non-contiguous shards to be combined; shards essentially get moved into data files
        • This would make the read algorithm more complex: read plans from all levels would need to be merged
  • 22. Outline
    • Motivation and design emphasis
    • Data Model and API
    • Implementation of the data model
    • System Architecture
    • Looking Forward
  • 23. System Architecture
  • 24. System Architecture
    • All Smooth metadata is stored on Microsoft SQL Server, which gets replicated to backup servers in a remote data center
    • Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
    • Applications link with a Smooth client library in order to access Smooth
  • 25. System Architecture
    • Data files are stored in object stores
    • Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
    • Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at TS
  • 26. Virtues of Immutability
    • A design principle we have been using is immutability, both physical (write-once data files) and semantic (shards)
    • The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
    • Data files can be cached without worrying about consistency
    This simple model has been central to keeping the system simple, robust and scalable.
  • 27. Some Statistics
    • Multiple PBs of unique compressed data
    • Read peaks in excess of 100 GB/s (before decompressing)
    • 100s of millions of files/shards
    • 10s of millions of tables
    • 10s of thousands of concurrent requests
  • 28. Outline
    • Motivation and design emphasis
    • Data Model and API
    • Implementation of the data model
    • System Architecture
    • Looking Forward
  • 29. Looking Forward
    • Multi-datacenter and public cloud read scaling
        • CDN-like distributed caching layer that spans even to sites that don’t store data
        • Encryption at rest may be important for cloud use cases
    • More cost-efficient multi-dc replication and cold data storage
        • Data stores that use erasure coding
        • More efficient data encoding and compression
        • Data stores that can replicate data across data centers and support desirable failover semantics
  • 30. Looking Forward
    • Performance
        • Performance consistency is a major concern; tail latencies are a major issue with HDFS
        • Issues with slow serialization and parsing of rows
    • More challenging workloads
        • Interactive workloads are becoming common and are latency sensitive
        • Column filtering
        • Complex read queries
  • 31. Looking Forward
    Complex queries
    • It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
    • Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
    • In reality each ticker has its own time range, and there are several variations of this query
    • Looking at new kinds of indexing
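To make the example query on slide 31 concrete, here is the baseline that a time-only index allows: scan the time range in order and filter by ticker. This is a sketch for illustration. `TickerQuerySketch`, `Row`, and the `price` column are hypothetical, and the code shows the naive scan, not the new indexing the slide says is being explored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical baseline for the ticker-subset query: scan by time, filter by ticker. */
class TickerQuerySketch {

    /** One row of a merged series; ticker identifies the sub-series. */
    static final class Row {
        final long time;
        final String ticker;
        final double price;

        Row(long time, String ticker, double price) {
            this.time = time;
            this.ticker = ticker;
            this.price = price;
        }
    }

    /** Returns rows in [start, end) whose ticker is in the given set, in time order. */
    static List<Row> query(Iterable<Row> timeOrderedRows, Set<String> tickers,
                           long start, long end) {
        List<Row> out = new ArrayList<>();
        for (Row r : timeOrderedRows) {
            if (r.time >= end) break; // input is time ordered, so nothing later matches
            if (r.time >= start && tickers.contains(r.ticker)) {
                out.add(r);
            }
        }
        return out;
    }
}
```

The cost of this scan grows with every row in the time range, regardless of how few tickers are requested. With a ticker cardinality in the 10k to 20k range, a query for a handful of tickers reads far more data than it returns, which is presumably the motivation for the new kinds of indexing mentioned on the slide.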
  • 32. Looking Forward
    • Moving away from a “thick” Smooth client
        • Enables quick iteration and bug fixes
        • Multi-language support
        • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
    • Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
  • 33. Thank You!

Editor's Notes

  1. A shard is semantically immutable, i.e. it always returns the same set of rows. The physical representation of the underlying data can change in format or storage location, or be replicated.
  2. The compaction process gets the read plan for the entire time range and finds areas with excessive fragmentation (many small fragments). It selects a contiguous segment of the read plan containing fragments to be fixed, and rewrites them as a single new shard. The commit time of the new shard is the maximum of the commit times of the participating input shards; this makes sure the compaction process does not interfere with ongoing writes. The underlying data files for the deleted shards are not immediately removed, so that references from the read plans of ongoing reads remain valid.
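The compaction step in note 2 can be sketched as below. The selection policy is an assumption made for illustration: a fragment counts as "small" when its row count is under a threshold, and the first run of two or more adjacent small fragments is chosen. `CompactionSketch`, `Fragment`, and `smallRows` are hypothetical names. What the sketch takes from the note is that only contiguous fragments are combined and that the new shard's commit time is the maximum over its inputs.

```java
import java.util.List;

/** Hypothetical compaction planner over a read plan; not Smooth's actual policy. */
class CompactionSketch {

    /** A read-plan fragment, or a shard produced by compaction, covering [start, end). */
    static final class Fragment {
        final long start, end, commitTime;
        final int rows;

        Fragment(long start, long end, long commitTime, int rows) {
            this.start = start;
            this.end = end;
            this.commitTime = commitTime;
            this.rows = rows;
        }
    }

    /**
     * Finds the first run of at least two adjacent fragments that each hold fewer
     * than smallRows rows and describes the single shard that would replace them.
     * Returns null when the plan contains no such run.
     */
    static Fragment compactFirstRun(List<Fragment> plan, int smallRows) {
        int i = 0;
        while (i < plan.size()) {
            if (plan.get(i).rows >= smallRows) {
                i++;
                continue;
            }
            // Extend the run while the next fragment is small and touches this one.
            int j = i;
            while (j + 1 < plan.size()
                    && plan.get(j + 1).rows < smallRows
                    && plan.get(j + 1).start == plan.get(j).end) {
                j++;
            }
            if (j > i) {
                long commitTime = Long.MIN_VALUE;
                int rows = 0;
                for (int k = i; k <= j; k++) {
                    commitTime = Math.max(commitTime, plan.get(k).commitTime);
                    rows += plan.get(k).rows;
                }
                // Commit time is the max of the inputs, so any later user write,
                // which has a higher commit time, still takes precedence.
                return new Fragment(plan.get(i).start, plan.get(j).end, commitTime, rows);
            }
            i = j + 1;
        }
        return null;
    }
}
```

Fragments separated by a gap in time are never merged here, in line with the rule that only contiguous fragments can be combined.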