Terark (Y Combinator W17) has built a new storage engine based on a nested succinct trie, which provides a 10x-500x performance improvement, a 10:1 compression ratio and very low latency compared with Google's LevelDB and Facebook's RocksDB. It is usable as a standalone key-value store, or as a storage engine for MySQL and MongoDB.
2. Terark built the fastest storage engine with the best compression.
It is compatible with MySQL, MongoDB and RocksDB, and makes random reads 200X faster and storage 10X smaller. It is a general-purpose engine optimized for read-heavy scenarios, giving big data applications greater scalability at lower cost.
Brief Introduction
Terark Confidential
3. Y Combinator is the world's leading startup incubator (total valuation of portfolio companies: $100+ billion). Its best-known companies are Airbnb and Dropbox.
We Are a YC Company
4. Paying Customers
Terark technology helps cloud, big data and Internet companies achieve better performance at lower cost.
E-Commerce Giant around the Globe
Terark technology supports its business growth through
Alibaba Cloud.
5. Proven Results
Terark Compression: TPC-H Dataset 550G, Terark 47G
TCO (on the same data size), Hardware & Ops Cost: Others (6 servers) $30,000, Terark (1 server) $5,000
6. Use Terark’s CO-Index and PA-Zip to implement RocksDB’s SSTable.
• Much better compression
• Much better random read performance
• Terark trades off compression speed for high compression ratio and performance
• Use universal compaction to minimize write amplification
TerarkDB: Compatible with RocksDB
7. Strong Compression ( > 10:1 compression ratio)
- Lift Data Capacity
- Increase Memory Utilization, Lower Disk I/O
- Save Data Infrastructure Cost
Extreme Performance (QPS 15~500X that of Competitors)
- Lower Latency, Higher Throughput and Concurrency
Simple DevOps & HA
- Leverage the MySQL & MongoDB ecosystem
- Support proven DevOps tools
- HA based on MySQL and MongoDB
MySQL on TerarkDB, Mongo on TerarkDB
8. Core Technology
● CO-Index (Compressed Ordered Index)
Direct search on a highly compressed index
● PA-Zip (Point Accessible Zip)
Direct point access to a single datum in a globally compressed dataset
Our breakthrough technology is protected by 5 patents in the US, China and worldwide.
10. Appendix 1: TCO & ROI Details
Product | Hardware Cost (1 server ~ $5,000 a year, based on AWS pricing) | Operational Cost (~20% of the hardware cost)
Terark | $5,000 | $1,000
Other Product | $30,000 | $6,000
11. Appendix 1: TCO & ROI Details
Year(s) | Cost Savings | Estimated Revenue Lift due to Performance/Scalability Improvement (~20% of Cost Savings)
1 | $30,000 | $6,000
3 | $90,000 | $18,000
5 | $150,000 | $30,000
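The figures in the two appendix tables can be reproduced with a few lines of arithmetic. The sketch below assumes the server counts from the Proven Results slide (1 server for Terark, 6 for the other product); the constant and function names are illustrative only.

```python
# Inputs taken from the appendix tables; names are illustrative.
SERVER_COST = 5_000   # hardware cost per server per year (USD)
OPS_RATE_PCT = 20     # operational cost as a percentage of hardware cost
REV_LIFT_PCT = 20     # estimated revenue lift as a percentage of cost savings


def yearly_tco(servers):
    """Hardware plus operational cost for one year."""
    hardware = servers * SERVER_COST
    ops = hardware * OPS_RATE_PCT // 100
    return hardware + ops


def savings(years, terark_servers=1, other_servers=6):
    """Return (cost savings, estimated revenue lift) over `years` years."""
    saved = years * (yearly_tco(other_servers) - yearly_tco(terark_servers))
    lift = saved * REV_LIFT_PCT // 100
    return saved, lift


for years in (1, 3, 5):
    print(years, savings(years))
```

The yearly difference is ($30,000 + $6,000) - ($5,000 + $1,000) = $30,000, which is the one-year Cost Savings entry.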
13. Feature | Hash | B+Tree | Terark Nested Succinct Trie
Compression | None | OK | ✔✔✔ Excellent
Searching | ✔✔ Very Fast | OK | ✔ Fast
Exact Searching | ✔ Supported | ✔ Supported | ✔ Supported
Range Searching | Not Supported | ✔ Supported | ✔ Supported
Prefix Searching | Not Supported | ✔ Supported | ✔ Supported
Regex Searching | Not Supported | Not Supported | ✔ Supported
Reverse Searching (id to key) | Not Supported (can be worked around) | Not Supported | ✔ Supported
Index Comparison
14. Feature | Block-based: LevelDB, RocksDB, WiredTiger… | Short data: Terark Nested Succinct Trie | Long data: Terark Global Compression
Compression Ratio | OK | ✔✔✔ Excellent | ✔✔✔ Excellent
Random Read | Slow | ✔ Fast | ✔ Fast
Sequential Read | ✔ Fast | OK | ✔ Fast
Double Cache Problem | YES | NO | NO
Compression Speed | ✔ Fast | Slow | Slow
Data (Value) Compression
15. 2 bits per node
Pre-order: DFUDS, 101110000100
Needs findopen, findclose and enclose, which are much slower than rank/select; rarely used
Level-order: LOUDS, 101110010000
Simple, fast and small:
Parent(c) = rank0(select1(c))
Child(p, i) = select0(p) - p + i
A succinct data structure represents data in space close to the theoretical lower bound. It stores data in bitmaps and navigates them with rank and select operations. It can reduce memory usage tremendously, but it is very complex to implement. Terark has its own implementation, which achieves much better performance than open-source implementations.
CO-Index: Succinct Tree
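The two LOUDS formulas on this slide can be checked with a small Python sketch. It assumes 1-based bit positions, 1-based level-order node ids, a leading "10" super-root, and naive linear-time rank/select (a real succinct index answers these in constant time with auxiliary tables). The example tree and all function names are illustrative and are not Terark's implementation.

```python
from collections import deque


def louds_encode(children, root):
    """Level-order unary degree sequence: super-root '10', then for each
    node in BFS order one '1' per child followed by a terminating '0'."""
    bits = ["10"]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)


def rank0(bits, pos):
    """Number of 0 bits in positions 1..pos (inclusive, 1-based)."""
    return bits[:pos].count("0")


def select(bits, bit, k):
    """1-based position of the k-th occurrence of `bit` ('0' or '1')."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == bit:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def parent(bits, c):
    """Parent(c) = rank0(select1(c)); returns 0 for the root."""
    return rank0(bits, select(bits, "1", c))


def degree(bits, p):
    """Number of children of node p: the 1s between the p-th and (p+1)-th 0."""
    return select(bits, "0", p + 1) - select(bits, "0", p) - 1


def child(bits, p, i):
    """Child(p, i) = select0(p) - p + i, for 1 <= i <= degree(p)."""
    if not 1 <= i <= degree(bits, p):
        raise IndexError("node has no such child")
    return select(bits, "0", p) - p + i


# Hypothetical tree: R has children A, B, C; B has one child D.
# Level-order ids: R=1, A=2, B=3, C=4, D=5.
tree = {"R": ["A", "B", "C"], "B": ["D"]}
bits = louds_encode(tree, "R")
print(bits)
print(parent(bits, 5), child(bits, 1, 3), child(bits, 3, 1))
```

Every navigation step reduces to rank and select on one bitmap, which is why the slide calls LOUDS simple and fast.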
16. Patricia Trie: A Compressed Trie
Path compression: compress all one-child nodes in a path into a single node
Nested: convert the compressed paths into a new trie
Requirement: the trie must support “reverse searching”, that is, extracting the string from a node
CO-Index: Patricia Trie + Nesting
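Path compression and the first step of nesting can be sketched in Python as follows. This is a minimal illustration using nested dicts, with an empty-string key as an assumed end-of-key marker; the function names and sample keys are hypothetical. The slides do not describe how Terark links the outer trie to the nested one, so the sketch stops at collecting the labels that the next-level trie would store.

```python
def build_trie(keys):
    """Plain character trie as nested dicts; '' marks the end of a key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def compress(node):
    """Path compression: merge every chain of one-child nodes into a
    single edge labelled with the concatenated characters."""
    out = {}
    for label, sub in node.items():
        while len(sub) == 1 and "" not in sub:
            (next_label, next_sub), = sub.items()
            label += next_label
            sub = next_sub
        out[label] = compress(sub)
    return out


def edge_labels(node):
    """Yield every edge label in the compressed trie."""
    for label, sub in node.items():
        yield label
        yield from edge_labels(sub)


keys = ["romane", "romanus", "romulus"]
patricia = compress(build_trie(keys))
print(patricia)

# Nesting: the multi-character labels become the key set of a second trie.
long_labels = sorted({l for l in edge_labels(patricia) if len(l) > 1})
nested = compress(build_trie(long_labels))
print(long_labels)
```

Recovering a full key then requires reading a label back out of the nested trie from a node reference, which is the "reverse searching" requirement stated on the slide.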
17. • Global Compression
• Global + Local Dictionary
• Short data friendly (~50 bytes)
• Larger dataset, better compression
• Point accessible (via record id)
• Inspired by LZ77
PA-Zip (Point Accessible Zip)
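The combination of a global dictionary with point access by record id can be illustrated with Python's zlib preset dictionary. This is a stand-in technique, not PA-Zip itself: each record is compressed independently against one shared dictionary, and an offset table maps a record id to its compressed bytes. The dictionary contents, sample records and function names are assumptions made for the example.

```python
import zlib


def build_store(records, dictionary):
    """Compress each record on its own against a shared dictionary and
    remember where each compressed record starts."""
    blob = bytearray()
    offsets = [0]
    for rec in records:
        comp = zlib.compressobj(level=9, zdict=dictionary)
        blob += comp.compress(rec) + comp.flush()
        offsets.append(len(blob))
    return bytes(blob), offsets


def get(blob, offsets, dictionary, record_id):
    """Decompress only the requested record; no other record is touched."""
    chunk = blob[offsets[record_id]:offsets[record_id + 1]]
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(chunk) + decomp.flush()


# Hypothetical short records (~50 bytes) sharing common substrings.
dictionary = b"user_id= status=active status=inactive country="
records = [
    b"user_id=1001 status=active country=US",
    b"user_id=1002 status=inactive country=DE",
    b"user_id=1003 status=active country=CN",
]
blob, offsets = build_store(records, dictionary)
print(get(blob, offsets, dictionary, 1))
```

Short records benefit most from the shared dictionary because, compressed alone, each has too little history to find repeated substrings.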