Some statistics: - Feeds: 1.6 B, 700 GB hard drive in 4 DB instances, 8 caching servers, 136 GB memory cache in used. - User Profiles: 44.5 M registered accounts, 2 database instances, 30 GB memory cache. - Comments: 350 M, 50 GB hard drive in 2 DB instances, 20 GB memory cache
Access time L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns Mutex lock/unlock 100 ns Main memory reference 100 ns Compress 1K bytes with Zippy 10,000 ns Send 2K bytes over 1 Gbps network 20,000 ns Read 1 MB sequentially from memory 250,000 ns Round trip within same datacenter 500,000 ns Disk seek 10,000,000 ns Read 1 MB sequentially from network 10,000,000 ns Read 1 MB sequentially from disk 30,000,000 ns Send packet CA->Netherlands->CA 150,000,000 ns by Jeff Dean (http://labs.google.com/people/jeff)
Standard & Real Requirement - Time to load a page < 200 ms - Read data rate ~12K ops/sec - Write data rate ~8K ops/sec - Caching service/Database recovery time < 5 mins
Existent thing - RDBMS (MySQL, MSSQL): Write: too slow; Read: so so with a small DB, too bad with a huge DB - Cassandra (by Facebook): difficult to do operation/maintain, and performance is not so good - HBase/Hadoop: We use this for log system - MongoDB, Membase, Tokyo Tyrant, .. : OK! we use these in several cases, but not suitable for all
ZNonblockingServer - Based on TNonblockingServer (Apache Thrift) - 185K reqs/sec (original TNonblockingServer is just 45K reqs/sec) - Serialize/Deserialize data - Prevent overload server - Data is not secured while transferring - Protect service from invalid requests
ICache - Least Recently Used/Time based expiration strategy - zlru_table<key_type, value_type>: hash table data structure - Re-write malloc/free functions instead of using standard malloc/free in glibc to reduce memory fragment - Support dirty-items marking => for lazy DB flush
ZiDB - Separate into DataFile & IndexFile - 1 seek for a read, 1-2 seeks for a write - IndexFile (hash structure) is loaded onto memory as a mapping file (shared memory) to reduce system call - Write-ahead log to avoid data loss - Data magic-padding - Checksum & checkpoint for repair data - Partitioning DB for easier maintenance
2 Models: - Centralized: 1 addressing server & multiple storage servers => bottleneck & single-point-of-failure - Peer-peer: Each server includes addressing module & storage 2 Types of routing: - Client routing: Each client itself does the addressing and query data - Server routing: The addressing is done at server
Operation Flows * Addressing module is moved into each storage node in Peer-peer model Business Logic Server Addressing Server (DHT) Storage Layer Storage Node 1 ICache ZiDB Storage Module Storage Node N ICache ZiDB Storage Module … (1) Request key locations (2) Key locations (3) Get & Set operations (4) Operation returns
Addressing: - Provide key locations of resources - Basically a Distributed Hash Table, using consistent hashing - Hashing: Jenkins, Murmur, or any algorithm that satisfies two conditions: - Uniform distribution of generated keys in the key space - Consistency (MD5, SHA are bad choice since performance)
Addressing - Node location: Each node is assigned a continuous range of IDs (hashed key)
Addressing - Node location: Golden ratio principle (a/b = 2b/a) - Init ratio = 1.618 - Max ratio ~ 2.6 - Easy to implement - Easy for routing from client 2 3 4 5 1
Server 1: 1,2,3 Server 2: 4,5,6,7 Server 3: 8,9 1 4 7 3 6 2 5 8 9 Addressing - Node location: Virtual nodes - Each real server has multiple virtual nodes on ring - More virtual nodes, more balance of load - Hard to maintain table of nodes
A A A B B C Addressing – Multi-layer rings - Store the change history of system - Provide availability/reconfigurability - Able to put a node on ring manually * Write: data is located on the highest ring * Read: data is located on the highest ring, then lower rings if not found
Replication & Backup - Each node has one primary range of IDs, and Some secondary range of IDs - Each real node need a backup instance to replace in case it’s down * Data is queried from primary node, then secondary nodes
Configuration: to find the best parameters to configure DB or to choose the suitable DB type. - How many read/write per second? - Length Deviation of data: data length is same same or much different each others, - Has updation/deletion data? - How important of data: acceptable loss or not - The old data can be recycled?
Q & A Contact: Nguyễn Quang Nam [email_address] http://me.zing.vn/nam.nq
A particular slide catching your eye?
Clipping is a handy way to collect important slides you want to go back to later.