DaStor/Cassandra report for CDR solution

DaStor is a data store based on Cassandra.

  1. DaStor Evaluation Report for CDR Storage & Query
     "making data alive!"
     (DaStor is based on Cassandra, a flawed project)
     Schubert Zhang, Big Data Engineering Team
     Oct. 28, 2010
  2. Testbed
     • Hardware
       – Cluster with 9 nodes
         • 5 nodes: DELL PowerEdge R710
           – CPU: Intel(R) Xeon(R) E5520 @ 2.27GHz, cache size = 8192 KB
           – Cores: 2x 4-core CPUs with HyperThreading => 16 cores
           – RAM: 16GB
           – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
         • 4 nodes: DELL PowerEdge 2970
           – CPU: Quad-Core AMD Opteron(tm) 2378, cache size = 512 KB
           – Cores: 2x 4-core CPUs => 8 cores
           – RAM: 16GB
           – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
         • Total: 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB)
       – Network: a single 1Gbps switch
     • Linux: RedHat EL 5.3, kernel 2.6.18-128.el5
     • File System: Ext3
     • JDK: Sun Java 1.6.0_20-b02
     • Note: the existing testbed and configuration are not ideal for performance. Preferred:
       – Commit log on a dedicated hard disk
       – File system: XFS/EXT4
       – More memory to cache more indexes and metadata
  3. DaStor Configuration
     • Release version: 1.6.6-001
     • Memory heap quota: 10GB
     • CommitLog and data storage share the same 2TB volume (RAID0) with the Linux OS.
     • The important performance-related parameters:
       – Max Heap Size: 10GB
       – Memtable Size: 1GB *
       – Index Interval: 128
       – Key Cache Capacity: 100,000
       – Replication Factor: 2
       – CommitLog Segment Size: 128MB
       – CommitLog Sync Period: 10s
       – Concurrent Writers (Threads): 32
       – Concurrent Readers (Threads): 16
       – Cell Block Size: 64KB
       – Consistency Check: false
       – Concurrent Compaction *: false
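As a rough illustration of how these tuning values might be loaded at startup, here is a minimal properties-loading sketch. The property names and the DastorConfig wrapper are hypothetical; only the default values come from the table above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical sketch only: property names and this wrapper class are
// illustrative; only the default values come from the configuration table.
public class DastorConfig {
    final long maxHeapBytes;
    final long memtableBytes;
    final int indexInterval;
    final int keyCacheCapacity;
    final int replicationFactor;

    DastorConfig(Properties p) {
        maxHeapBytes      = Long.parseLong(p.getProperty("max_heap_gb", "10")) << 30;
        memtableBytes     = Long.parseLong(p.getProperty("memtable_gb", "1")) << 30;
        indexInterval     = Integer.parseInt(p.getProperty("index_interval", "128"));
        keyCacheCapacity  = Integer.parseInt(p.getProperty("key_cache_capacity", "100000"));
        replicationFactor = Integer.parseInt(p.getProperty("replication_factor", "2"));
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        if (args.length > 0) p.load(new FileInputStream(args[0])); // optional override file
        DastorConfig cfg = new DastorConfig(p);
        System.out.println("memtable size (bytes) = " + cfg.memtableBytes);
    }
}
```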
  4. Data Schema for CDR
     [Diagram: one row per user spanning day buckets (e.g. 20101020 … 20101024); within each bucket the user's CDR cells are sorted by timestamp]
     • Schema
       – Key: the user ID (phone number), string
       – Bucket: the date (day) name, string
       – Cell: a CDR, Thrift- (or ProtocolBuffer-) compacted encoding
     • Semantics
       – Each user's CDRs for each day are sorted by timestamp and stored together.
     • Data patterns
       – A short set of temporal data that tends to be volatile.
       – An ever-growing set of data that rarely gets accessed.
     • Stored files
       – The SSTable files are separated by bucket.
     • Flexible and applicable to various CDR structures.
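A minimal sketch of the addressing scheme described on this slide: row key = user ID, bucket = day, cells ordered by call timestamp, with the cell value being the Thrift-compacted CDR. The CdrAddress class and its fields are hypothetical illustrations, not DaStor's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of the CDR addressing scheme: one row per user,
// one bucket per day (e.g. "20101024"), cells ordered by call timestamp.
public class CdrAddress {
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    final String rowKey;      // user id / phone number
    final String bucket;      // day bucket, e.g. "20101024"
    final long cellTimestamp; // cells within a bucket are sorted by this

    CdrAddress(String phoneNumber, LocalDate day, long callTimestampMillis) {
        this.rowKey = phoneNumber;
        this.bucket = DAY.format(day);
        this.cellTimestamp = callTimestampMillis;
    }

    // The real cell value would be the Thrift- (or ProtocolBuffer-)
    // compacted CDR, roughly ~200 bytes per record.
    byte[] placeholderCellValue() {
        return ("cdr-" + rowKey + "-" + cellTimestamp).getBytes(StandardCharsets.UTF_8);
    }
}
```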
  5. Storage Architecture
     [Diagram: writes append to the commit log (on a dedicated disk) and to per-bucket memtables; a flush, triggered by data size or lifetime, serializes cells to a data file on disk with a companion index file of sparse key offsets (e.g. K128, K256, K384) and an in-memory Bloom filter of keys]
     • The storage architecture borrows from techniques published by Google and used in other databases.
     • It is similar to Bigtable, but its indexing scheme is different.
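The two flush triggers named in the diagram (data size and lifetime) could be captured in a policy like the following sketch. The class is a hypothetical illustration; only the 1GB memtable size comes from the configuration slide, the lifetime value is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the two flush triggers named on this slide:
// a memtable is flushed to an SSTable either when it grows past a size
// threshold or when its bucket has been open longer than a lifetime limit.
public class MemtableFlushPolicy {
    static final long MAX_BYTES = 1L << 30;                 // 1GB, matching the config table
    static final long MAX_LIFETIME_MS = 24L * 3600 * 1000;  // assumed: one day per bucket

    private final AtomicLong liveBytes = new AtomicLong();
    private final long createdAtMillis = System.currentTimeMillis();

    void onWrite(int serializedCellSize) {
        liveBytes.addAndGet(serializedCellSize);
    }

    boolean shouldFlush(long nowMillis) {
        return liveBytes.get() >= MAX_BYTES
            || (nowMillis - createdAtMillis) >= MAX_LIFETIME_MS;
    }
}
```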
  6. Indexing Scheme
     [Diagram: the four index levels, from consistent hashing across the node ring down to the per-row block index]
     • Level 1: consistent hashing — h(key) maps each row key to a node on the ring (replication factor N=3 in the figure).
     • Level 2: sparse key index — every 128th key (interval changeable) is recorded with its position in the index file on disk (cachable); a Bloom filter of the SSTable's keys and a key cache are kept in memory.
     • Level 3: per-row sorted map (with Bloom filter) of the row's cells, stored as a sequence of cell blocks (64KB by default, changeable).
     • Level 4: block index — a B-tree-style index (binary search) maps each cell block to its position in the data file.
     • In total, 4 levels of indexing; the indexes are relatively small.
     • Very well suited to storing per-individual data (e.g. per user), and good for serving CDR data.
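The level-2 lookup amounts to a binary search over the sampled keys (every 128th key) to find where a sequential scan of the index file should start. A minimal sketch, with hypothetical class and field names:

```java
import java.util.Arrays;

// Hypothetical sketch of the level-2 lookup: the index samples every 128th
// key (the "index interval"); a binary search over the samples yields the
// narrow on-disk range that must then be scanned for the exact key.
public class SparseKeyIndex {
    private final String[] sampledKeys;   // every 128th key, sorted
    private final long[] sampledOffsets;  // byte offset of each sampled key in the index file

    SparseKeyIndex(String[] sampledKeys, long[] sampledOffsets) {
        this.sampledKeys = sampledKeys;
        this.sampledOffsets = sampledOffsets;
    }

    /** Returns the offset at which a sequential scan for {@code key} should start. */
    long scanStartOffset(String key) {
        int pos = Arrays.binarySearch(sampledKeys, key);
        if (pos >= 0) return sampledOffsets[pos];    // exact sample hit
        int insertion = -pos - 1;                    // first sample greater than key
        return insertion == 0 ? 0 : sampledOffsets[insertion - 1];
    }
}
```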
  7. Benchmark for Writes
     • Each node runs 6 clients (threads), 54 clients in total.
     • Each client generates random CDRs for 50 million users/phone numbers and puts them into DaStor one by one.
       – Key space: 50 million
       – Size of a CDR: Thrift-compacted encoding, ~200 bytes
     [Chart: write throughput and latency of one node of the 9-node cluster]
     • Throughput: average ~80K ops/s; per node: average ~9K ops/s
     • Latency: average ~0.5ms
     • Bottleneck: network (and memory)
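For reference, a sketch of what one benchmark node's write loop looks like conceptually: 6 client threads, each putting random ~200-byte CDRs for users drawn from the 50-million key space. The DastorClient interface here is an assumption, not the real client API.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of one benchmark node: 6 client threads, each putting
// random ~200-byte CDRs for users drawn from a 50-million key space.
public class WriteBenchmark {
    interface DastorClient {                       // assumed client API, not the real one
        void put(String userId, String bucket, long ts, byte[] cdr);
    }

    static final long KEY_SPACE = 50_000_000L;
    static final int CLIENTS_PER_NODE = 6;

    public static void main(String[] args) {
        DastorClient client = (u, b, t, c) -> { /* send to the cluster */ };
        ExecutorService pool = Executors.newFixedThreadPool(CLIENTS_PER_NODE);
        for (int i = 0; i < CLIENTS_PER_NODE; i++) {
            pool.execute(() -> {
                Random rnd = new Random();
                byte[] cdr = new byte[200];        // ~200 bytes, Thrift-compacted in reality
                while (!Thread.currentThread().isInterrupted()) {
                    String userId = String.valueOf(10_000_000_000L + (long) (rnd.nextDouble() * KEY_SPACE));
                    rnd.nextBytes(cdr);
                    client.put(userId, "20101028", System.currentTimeMillis(), cdr);
                }
            });
        }
    }
}
```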
  8. Benchmark for Writes (cluster overview)
     [Chart: cluster-wide write throughput over time]
     • The wave pattern in throughput is caused by: (1) GC, (2) compaction.
  9. Benchmark for Reads
     • Each node runs 8 clients (threads), 72 clients in total.
     • Each client randomly picks a user-id/phone-number out of the 50-million key space and gets its most recent 20 CDRs (one page) from DaStor.
     • All clients read CDRs of the same day/bucket.
     • The 1st run: before compaction.
       – An average of 8 SSTables per node for each day.
     • The 2nd run: after compaction.
       – Only one SSTable per node for each day.
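The read path being exercised is a paged query: pick a random user and fetch its newest 20 CDRs from one day bucket. A sketch under the same assumption of a hypothetical client interface:

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of one read client: pick a random user out of the
// 50-million key space and fetch its most recent 20 CDRs (one page)
// from a single day bucket, newest first.
public class ReadBenchmark {
    interface DastorClient {                        // assumed client API, not the real one
        List<byte[]> getLatest(String userId, String bucket, int count);
    }

    static final long KEY_SPACE = 50_000_000L;
    static final int PAGE_SIZE = 20;

    static void readOnePage(DastorClient client, Random rnd, String bucket) {
        String userId = String.valueOf(10_000_000_000L + (long) (rnd.nextDouble() * KEY_SPACE));
        long start = System.nanoTime();
        List<byte[]> page = client.getLatest(userId, bucket, PAGE_SIZE);
        long latencyMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("user=%s cdrs=%d latency=%dms%n", userId, page.size(), latencyMs);
    }
}
```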
  10. Benchmark for Reads (before compaction)
      [Chart: distribution of read latencies on one node of the 9-node cluster; percentage of read ops per 100ms bin]
      • Throughput: average ~140 ops/s; per node: average ~16 ops/s
      • Latency: average ~500ms, 97% < 2s (SLA)
      • Bottleneck: disk IO (random seeks); CPU load is very low
  11. Benchmark for Reads (after compaction)
      [Chart: distribution of read latencies on one node of the 9-node cluster; percentage of read ops per 100ms bin]
      • Compaction of ~8 SSTables (~200GB) took about 1h40m on a 16-core node and about 2h25m on an 8-core node.
      • Throughput: average ~1.1K ops/s; per node: average ~120 ops/s
      • Latency: average ~60ms, 95% < 500ms (SLA)
      • Bottleneck: disk IO (random seeks); CPU load is very low
  12. Benchmark for (Writes + Reads)
  13. Experiences
      • A large memtable reduces the frequency of regular compaction.
        – We found 1GB to be fine.
      • A large key space requires more memory, because of more key indexes.
        – Especially for the key cache.
        – Memory-mapped index files.
      • Compaction is sensitive to the number of CPU cores and the L2/L3 cache.
        – On a 16-core node, a large (e.g. 200GB) compaction may take ~100 minutes.
        – On an 8-core node, a large (e.g. 200GB) compaction may take ~150 minutes.
        – A long compaction may leave many small SSTables around, which reduces read performance.
        – We now support concurrent compaction.
      • Number of CPU cores, L2/L3 cache, disks, RAM size:
        – CPU cores, L2/L3 cache: writes and compaction
        – Disks: random seeks and reads
        – RAM: memtables for writes, index caches for random reads
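Because each bucket owns its own SSTables, compactions of different buckets share no state, which is what makes the concurrent compaction mentioned above possible. A minimal sketch of bucket-parallel scheduling, with hypothetical names:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of bucket-independent concurrent compaction:
// SSTables are grouped per bucket (day), so compactions of different
// buckets have no shared state and can run on separate threads.
public class ConcurrentCompactor {
    interface Compaction {                       // assumed hook, not the real one
        void compact(String bucket, List<String> sstablePaths);
    }

    private final ExecutorService pool;

    ConcurrentCompactor(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    void submitAll(Map<String, List<String>> sstablesByBucket, Compaction compaction) {
        // One task per bucket; buckets compact independently and in parallel.
        sstablesByBucket.forEach((bucket, paths) ->
            pool.execute(() -> compaction.compact(bucket, paths)));
    }
}
```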
  14. Maintenance Tools
      • Daily Flush Tool
        – Flushes memtables of old buckets.
        – Uses the dastor-admin tool.
      • Daily Compaction Tool
        – Compacts SSTables of old buckets.
        – Uses the dastor-admin tool.
      • DaStor Admin Tool (bin/dastor-admin, bin/dastor-admin-shell)
  15. Admin Web
  16. CDR Query Web Page for Demo
  17. Developed Features
      • Admin Tools
        – Configuration improvement based on config files
        – Script framework and scripts
        – Admin tools
        – CLI shell
        – WebAdmin
        – Ganglia, Jmxetric
      • Compression
        – New serialization format
        – Support for Gzip and LZO
      • Bucket mapping and reclaim
        – Mapping plug-in
        – Reclaim command and mechanism
      • Java Client API
      • Concurrent Compaction
        – From a single thread to bucket-independent multi-threading
      • Scalability
        – Easy to scale out
        – More controllable
      • Benchmarks
        – Writes and reads
        – Throughput and latency
      • Bug fixes
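Of the two codecs, only the Gzip variant can be sketched with the standard JDK (LZO needs a third-party codec). A minimal illustration of compressing one 64KB cell block as a unit; the class name is hypothetical and this is not the actual serialization format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the Gzip path of block compression: a 64KB cell
// block is compressed as one unit before it is written into the data file.
// (The LZO variant would plug in a third-party codec the same way.)
public class BlockCompressor {
    static byte[] gzipBlock(byte[] cellBlock) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(cellBlock.length / 2);
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(cellBlock);
        }
        return out.toByteArray();
    }
}
```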
  18. Controllable Carrier-level Scale-out
      • Existing cluster:
        (1) Available Partitioning-A
        (2) Existing buckets with data
      • New machines are added to the cluster, but not yet online:
        (1) Available Partitioning-A
        (2) Existing buckets with data
        (3) New Partitioning-B for future buckets, not yet available
      • The added machines come online:
        (1) Available Partitioning-A
        (2) Existing buckets with data
        (3) New Partitioning-B available for service, coexisting with Partitioning-A. No data movement.
      • As time goes by, data in the old buckets is reclaimed:
        (1) gone
        (2) gone
        (3) Only Partitioning-B remains, available for service.
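The scheme boils down to a routing rule: buckets created before the cut-over date keep resolving against Partitioning-A, later buckets resolve against Partitioning-B, so no data ever moves. A sketch with hypothetical names (the ring abstraction and cut-over bucket are assumptions, not DaStor internals):

```java
import java.util.List;

// Hypothetical sketch of the routing rule behind the scale-out scheme:
// buckets older than the cut-over date stay on the old ring (Partitioning-A),
// newer buckets go to the expanded ring (Partitioning-B); nothing is moved.
public class BucketRouter {
    interface Partitioning {                    // assumed ring abstraction
        List<String> endpointsFor(String rowKey);
    }

    private final Partitioning oldRing;   // Partitioning-A (original nodes)
    private final Partitioning newRing;   // Partitioning-B (original + added nodes)
    private final String cutoverBucket;   // first bucket served by the new ring, e.g. "20101101"

    BucketRouter(Partitioning oldRing, Partitioning newRing, String cutoverBucket) {
        this.oldRing = oldRing;
        this.newRing = newRing;
        this.cutoverBucket = cutoverBucket;
    }

    List<String> route(String rowKey, String bucket) {
        // Day buckets are yyyyMMdd strings, so lexicographic order is date order.
        return bucket.compareTo(cutoverBucket) < 0
                ? oldRing.endpointsFor(rowKey)
                : newRing.endpointsFor(rowKey);
    }
}
```

Once the old buckets expire and are reclaimed, Partitioning-A (and a router like this) can be retired entirely.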
  19. Data Processing and Analysis for BI & DM
      [Diagram: stack of BI & DM apps, QL / Hive API / Table Meta, the MapReduce framework with InputFormat/OutputFormat plug-ins, and DaStor as the data storage layer]
      • Integration with MapReduce, Hive, etc.
      • Provides an SQL-like query language and a rich API for BI and DM.
      • Built-in plug-ins for the MapReduce framework.
      • Flexible data-structure description and tabular management.
      • The simple and flexible data model of DaStor is well suited for analysis, since past buckets are stable.
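Assuming the built-in InputFormat plug-in emits (user id, serialized CDR cell) pairs, an analysis job can be written against the standard Hadoop Mapper API. The DaStor-side key/value types here are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical analysis job: assuming a DaStor InputFormat that emits
// (user id, serialized CDR cell) pairs, this mapper counts CDRs per user.
// Summing with a standard LongSumReducer gives calls-per-user for a bucket.
public class CdrCountMapper extends Mapper<Text, BytesWritable, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(Text userId, BytesWritable cdrCell, Context context)
            throws IOException, InterruptedException {
        // The cell payload (Thrift-compacted CDR) could be decoded here
        // to filter by call type, duration, etc.
        context.write(userId, ONE);
    }
}
```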
  20. Further Work
      • More flexible and manageable cache
        – Capacity/memory control
        – Methods to load and free the cache
      • Scalability for operational scale
        – Version 1.6.6 + controllable scalability
      • Compression improvement
        – To reduce the number of disk seeks
      • Admin tools
        – Configuration, monitoring, control …
        – More professional and easier to use
      • Client API enhancement
        – Hide the individual node to be connected.
        – More API methods
      • Flexible consistency check
      • Deployment tools
        – Consider: Capistrano, PyStatus, Puppet, Chef …
      • Data analysis
        – Hadoop
      • Documents
        – API manual
        – Admin manual
      • Test
        – New features
        – Performance
        – mmap …
  21. DaStor/Cassandra vs. Bigtable
      • Scalability: Bigtable has better scalability.
        – The scale of DaStor must be controlled carefully, and scaling may affect services; this is a big problem.
        – Scaling Bigtable is easy.
      • Data distribution: Bigtable's high-level partitioning/indexing scheme is more fine-grained, and therefore more effective.
        – DaStor's consistent-hash partitioning is too coarse-grained, so we must cut up the bucket-level partitions. But that trade-off is not always easy to make on big data.
      • Indexing: Bigtable may need less memory to hold indexes.
        – Bigtable's indexes are more general and their cost is amortized across different users/rows, especially under data skew.
        – There is only one copy of the indexes in Bigtable, even with multiple storage replicas, since Bigtable relies on the GFS layer for replication (multiple copies of data, one copy of indexes).
      • Local storage engine: Bigtable provides better read performance and fewer disk seeks.
        – Bigtable vs. Cassandra is like InnoDB vs. MyISAM.
      • Bigtable's write/mutation performance is lower.
        – Commit log: if GFS/HDFS supported fine-grained configuration to put an individual directory on a dedicated disk, then …
      • So Bigtable's architecture and data model make more sense.
        – The Cassandra project is a mistake; mixing Dynamo and Bigtable was a big mistake.
        – In my opinion, Cassandra is only a partial Dynamo, aimed at the wrong field (data storage). The result is distorted.
