Cassandra implementation for collecting data and presenting data


Published in: Technology

  1. Cassandra implementation for collecting data and presenting data
     Robert Chen
  2. Agenda
     • SQL vs NoSQL
     • Why Cassandra
     • Cassandra introduction
     • Our architecture and design
     • Configuration best practice
     • How we write data
     • How we read data
     • Demo
  3. A highly scalable, eventually consistent, distributed, structured key-value store
     Cassandra™ is a highly scalable, high-performance distributed data infrastructure. It distributes data across multiple data centers and scales incrementally with no single point of failure, which makes it the logical choice when you need reliability without compromising performance. Leading companies such as Netflix, Twitter, Cisco, Rackspace, Ooyala, Openwave, and many more rely on Cassandra.
  4. SQL vs NoSQL
     • NoSQL
       • Not only SQL; schema-free
       • Big data
       • Can serve heavy read/write workloads
       • Reads are probably not consistent in real time
     • SQL
       • Supports complex join relationships
       • Oracle RAC as a big-data solution? Too expensive
       • Typical RDBMS implementations are tuned for small but frequent read/write transactions, or for large batch transactions with rare write access
       • RDBMSs (they say) have shown poor performance on data-intensive applications, including:
         • Indexing a large number of documents
         • Serving pages on high-traffic websites
         • Handling the volumes of social networking data
         • Delivering streaming media
       • Consistent on every read
  5. Why Cassandra
     • To solve the storage bottleneck on our central NetApp filer
     • We chose Cassandra instead of HBase
     • No single point of failure
     • Fast development
     • Big data and a dynamically changing environment
     • Good fit for a horizontally scaled production environment
     • Low total cost of ownership
     • No special hardware needed, just some x86 boxes
  6. Cassandra Design
     • High availability ("a wily hare has three burrows")
     • Eventual consistency
       • Trades strong consistency for high availability
       • Lets you choose strong consistency or varying degrees of more relaxed consistency
     • Incremental scalability (linearly scalable), horizontal!
       • Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations
     • Optimistic replication
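The choice between strong and relaxed consistency is usually summarized by a simple overlap rule: with N replicas, a read of R replicas is guaranteed to see the latest write to W replicas when R + W > N. A minimal sketch of that rule follows; the function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas (what a QUORUM-style level waits for)."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # Quorum writes + quorum reads overlap on at least one replica.
    print(is_strongly_consistent(n, q, q))
    # Writing and reading a single replica trades consistency for availability.
    print(is_strongly_consistent(n, 1, 1))
```

Lower R and W keep the cluster available when nodes are down, at the cost of possibly reading stale data.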
  7. Cassandra Design II
     • All nodes are identical: decentralized/symmetric
     • No master or SPOF
     • Adding nodes is simple
     • Distributed, read/write-anywhere design
     • Massively scalable peer-to-peer architecture
     • Based on the best of Amazon Dynamo and Google BigTable
     • Minimal administration
     • Multi-datacenter replication
     • No caching layer required
  8. Cassandra Design III
     • Very fast writes
     • Fault tolerant, guaranteed data safety
     • Automatic provisioning of new nodes
     • Big data
     • Transparent fault detection and recovery
       • Cassandra uses gossip protocols to detect machine failure and to recover when a machine is brought back into the cluster, all without your application noticing
  9. Write op
  10. Write op (continued)
      • Writes go to the log and to the memory table
      • Periodically the memory table is merged with the disk table
      [Figure: Cassandra node. RAM: Memtable. Disk: Log, SSTable file. Arrow labels: Update, (later)]
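The write path above can be sketched as a toy, in-memory model. This is only an illustration: the class `ToyNode` and its size-based flush threshold are assumptions made for the sketch (the slide only says the merge happens periodically), and real nodes keep the log and SSTables on disk.

```python
class ToyNode:
    """Toy model of a node's write path: log first, then memtable, then flush."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only log (on disk in a real node)
        self.memtable = {}     # recent writes held in RAM
        self.sstables = []     # immutable sorted tables (on disk in a real node)
        self.flush_threshold = flush_threshold  # hypothetical trigger

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the log for durability
        self.memtable[key] = value            # 2. update the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable table and start afresh.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data first: memtable, then the most recently flushed tables.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Because a write only appends to the log and updates memory, it never waits for a random disk seek, which is why writes are fast.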
  11. Read
      [Figure: the client sends a Query to the Cassandra cluster. The closest replica (Replica A) returns the Result; Replicas B and C each receive a Digest Query and return a Digest Response. The Result goes back to the client. Read repair if digests differ.]
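The digest comparison and read repair in the figure can be sketched as follows. Replicas are modeled as plain dicts mapping a key to a `(timestamp, value)` cell; the function names, the dict model, and the use of MD5 for the digest are assumptions of this sketch, not Cassandra's actual code.

```python
import hashlib


def digest(cell):
    """Short fingerprint of a cell, so replicas need not ship the full data."""
    return hashlib.md5(repr(cell).encode()).hexdigest()


def read_with_repair(replicas, key):
    """Read from the closest replica, compare digests, repair stale replicas."""
    closest, others = replicas[0], replicas[1:]
    result = closest.get(key)
    if all(digest(r.get(key)) == digest(result) for r in others):
        return result  # all replicas agree
    # Digests differ: fetch the full cells and let the newest timestamp win.
    cells = [r[key] for r in replicas if key in r]
    newest = max(cells, key=lambda cell: cell[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # read repair: push the newest cell to stale replicas
    return newest
```

For example, with `a = {"k": (1, "old")}`, `b = {"k": (2, "new")}` and `c = {}`, calling `read_with_repair([a, b, c], "k")` returns the newer cell and leaves all three replicas holding it.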
  12. Configuration best practice
      • Put the data files on well-performing RAID volumes
      • Start with Sun JDK 1.6+
      • Configure with the Java native libraries
      • Keep the clocks on all nodes synchronized, so that insert timestamps are comparable across the cluster
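Clock synchronization matters because conflicting versions of a column are reconciled by timestamp, with the highest timestamp winning. A minimal sketch, assuming cells are `(timestamp, value)` tuples; the function name is hypothetical and tie-breaking is simplified.

```python
def resolve(cell_a, cell_b):
    """Return the winning (timestamp, value) cell: the highest timestamp wins."""
    return cell_a if cell_a[0] >= cell_b[0] else cell_b


if __name__ == "__main__":
    # Node 1 (correct clock) writes "v1" at real time 1000.
    first = (1000, "v1")
    # Node 2 writes "v2" two seconds later, but its clock runs 5 s behind,
    # so the cell is stamped 997 and the older write wins.
    second = (1000 + 2 - 5, "v2")
    print(resolve(first, second))
```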
  13. Data collection architecture
      [Figure: Web UI (Highcharts/jQuery), ActiveMQ (message bus)]
      1. Collected data is sent to ActiveMQ
      2. Consumers read the data and save it to Cassandra
      3. The data is filtered and shown on the plots
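The three steps can be sketched end to end with stand-ins: a `queue.Queue` plays the role of the ActiveMQ message bus and a plain dict plays the role of Cassandra. All function names and the message layout are assumptions made for illustration only.

```python
import queue


def collect(bus, host, metric, ts, value):
    """Step 1: a collector publishes one sample to the message bus."""
    bus.put({"host": host, "metric": metric, "ts": ts, "value": value})


def consume(bus, store):
    """Step 2: a consumer drains the bus and saves each sample to the store."""
    while True:
        try:
            msg = bus.get_nowait()
        except queue.Empty:
            break
        row = store.setdefault((msg["metric"], msg["host"]), {})
        row[msg["ts"]] = msg["value"]


def select(store, metric, host, stime, etime):
    """Step 3: filter one host's samples by time range, ready for plotting."""
    row = store.get((metric, host), {})
    return sorted((ts, v) for ts, v in row.items() if stime <= ts <= etime)
```

Decoupling collectors from the store through a bus means a slow or restarting consumer does not block the hosts that produce the data.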
  14. Data structure
      • keyspace: settings (e.g., partitioner)
        • column family: settings (e.g., comparator, type [Std])
          • column: name, value, clock
  15. Our Data Model
      • CoreMetrics (keyspace)
        • LoadAvg1 (column family)
          • host1_131696 (row)
            • Column: 6449, value: 0.04
            • Column: 5546, value: 0.02
          • host2_131811 (row)
            • Column: 8227, value: 0.46
            • Column: 9792, value: 1.30
  16. Our Data Model
      • CoreMetrics (keyspace)
        • Primary (column family)
          • host1:loadAvg1 (row)
            • Column: 1316966449, value: 0.04
            • Column: 1316965546, value: 0.02
          • host2:loadAvg1 (row)
            • Column: 1318118227, value: 0.46
            • Column: 1318119792, value: 1.30
  17. Our Meta Data Model
      • CoreMetrics (keyspace)
        • PrimaryMeta (column family)
          • (row)
            • Column: loadAvg15:Total, value: 1
            • Column: loadAvg15:Total, value: 1
          • host2 (row)
            • Column: loadAvg15:Total, value: 1
            • Column: loadAvg15:Total, value: 1
  18. Our HBase Data Model
      • Primary (column family)
        • host1:loadAvg1:1 (row: host:metric:instance)
          • Column: c:1316966449, value: 0.04
          • Column: c:1316965546, value: 0.02
        • host2:loadAvg1:1 (row: host:metric:instance)
          • Column: 1318118227, value: 0.46
          • Column: 1318119792, value: 1.30
  19. Our Data Model (II)
      • Keyspace: CoreMetrics (the database name), one per application
      • Column families: one per metric
        • loadAvg1
        • loadAvg5
        • etc. (about 80 server metrics)
      • Rows and columns: inspired by the design of HBase and OpenTSDB, we split each timestamp between the row key and the column key, which improves read performance tremendously
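The split of a timestamp between row key and column key can be sketched as below. The bucket size of 10,000 seconds is inferred from the example keys on the data model slide (1316966449 becomes row `host1_131696`, column `6449`); treat it, and the function names, as assumptions.

```python
BUCKET = 10_000  # seconds covered by one row; inferred from the slide's example keys


def split_timestamp(host, ts):
    """Map a host and Unix timestamp to a (row key, column name) pair."""
    return f"{host}_{ts // BUCKET}", ts % BUCKET


def join_timestamp(row_key, column):
    """Inverse of split_timestamp: recover the host and the full timestamp."""
    host, prefix = row_key.rsplit("_", 1)
    return host, int(prefix) * BUCKET + int(column)


if __name__ == "__main__":
    print(split_timestamp("host1", 1316966449))
    print(join_timestamp("host2_131811", 8227))
```

With this layout a time-range query touches only a handful of rows, and each row stays bounded in width no matter how long data is collected.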
  20. How we write to Cassandra
      Multiple data loaders connect to port 9160 on the Cassandra nodes and insert data like this:
        $CLIENT = new Cassandra::CassandraClient($PROTOCOL);
        $CLIENT->set_keyspace($keyspace);
        $CLIENT->insert($rowkey, $column_parent, $column, $consistency_level);
  21. How we read data from Cassandra
      We use pycassa to multiget the rows, and aggregate the result when too many data points are returned.
        get_coremetrics(metric_name, host, stime, etime, samples = 1000):
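The two parts of such a read can be sketched in plain Python: listing the row keys that a time range spans (the keys one would pass to a multiget), and reducing the returned points to at most `samples` values. The deck does not say how the aggregation is done; averaging fixed-size chunks is an assumption here, as are the function names and the 10,000-second row bucket.

```python
def row_keys_for_range(host, stime, etime, bucket=10_000):
    """Row keys covering [stime, etime], one per time bucket."""
    return [f"{host}_{b}" for b in range(stime // bucket, etime // bucket + 1)]


def downsample(points, samples=1000):
    """Average (timestamp, value) points into at most `samples` chunks."""
    if len(points) <= samples:
        return list(points)
    points = sorted(points)
    size = -(-len(points) // samples)  # ceiling division: points per chunk
    out = []
    for i in range(0, len(points), size):
        chunk = points[i:i + size]
        average = sum(value for _, value in chunk) / len(chunk)
        out.append((chunk[0][0], average))  # label the chunk by its first timestamp
    return out


if __name__ == "__main__":
    print(row_keys_for_range("host2", 1318118227, 1318121000))
    print(downsample([(t, float(t)) for t in range(10)], samples=5))
```

Capping the number of points keeps the payload sent to the browser small, however long the requested time range is.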
  22. Demo: data model view
  23. Demo: graphing the data
  24. Cassandra monitoring
      1. Nagios plugin for Cassandra
      2. JMX
  25. Thoughts and future
      1. Migrate more applications to Cassandra
      2. Livestat data (Bids/Listings…)
      3. Help other teams with data collection and graphing?
  26. Reference URLs
      • Thrift (12 language bindings!)
      • Pycassa