C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop

Speaker: Dmitry Mezhensky

Published in: Technology, Economy & Finance

  1. Analytics on top of Cassandra and Hadoop
     Dmitry Mezhensky | Mirantis Inc
     #CASSANDRAEU
  2. What we will discuss today
     ● Analytics on Cassandra using Hadoop
     ● Various types of statistics and their implementation
     ● Scalability of the approach
  3. Problems
     ● Too many statistics (more than 100)
     ● Various types
       ○ Top N
       ○ Time series
       ○ Min/max/average/median
       ○ Extremum values on a time interval
       ○ Fraud analysis
     ● Huge amount of data
     ● Scalability of the approach
  4. Statistics implementation on Hadoop
  5. Top N
     ● The map phase generates <Key, Value> pairs; the top N is built by Value
     ● The reduce phase accumulates values; results are persisted to Cassandra via a custom output format
     ● A suitable comparator was used for the top N entities in Cassandra
  6. Top N
     ● On the write stage to Cassandra, sorting is done by value
     ● On the read stage, the first N records are the top N values
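
The two Top N slides can be illustrated with a small in-memory sketch. It is not the speaker's code and uses neither the Hadoop nor the Cassandra API: `map_phase`, `reduce_phase`, `write_sorted_row` and `read_top_n` are hypothetical names, summation is only an assumed way of "accumulating" values, and a sorted Python list stands in for a Cassandra row whose comparator orders entries by value.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for key, value in records:
        yield key, value


def reduce_phase(pairs):
    """Accumulate values per key (assumption: accumulation means summing)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def write_sorted_row(totals):
    """Stand-in for the write stage: keep entries ordered by value, largest first.

    Ties are broken by key so the order is deterministic.
    """
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def read_top_n(row, n):
    """Stand-in for the read stage: the first n entries are the top n."""
    return row[:n]


# Hypothetical sample input: (entity, count) pairs.
records = [("page_a", 3), ("page_b", 5), ("page_a", 4), ("page_c", 1), ("page_b", 1)]
row = write_sorted_row(reduce_phase(map_phase(records)))
top_2 = read_top_n(row, 2)
print(top_2)
```

Because the ordering is fixed at write time, reading the top N needs no sorting or aggregation on the read path, which appears to be the point of using a value-based comparator.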
  7. Time series
     ● The map phase generates <Time, Value> pairs
     ● The reduce phase accumulates values (the behaviour differs between statistics)
     ● Results are persisted to Cassandra using a custom output format, with one row key per statistic and one column per date
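
The time series layout described above might look roughly like the following in-memory sketch. All names (`map_phase`, `reduce_phase`, `persist`, `daily_total`, `daily_max`) and the sample events are assumptions made for illustration; a nested dictionary stands in for Cassandra, with the outer key playing the role of the row key and the inner key the role of the per-date column.

```python
from collections import defaultdict
from datetime import datetime


def map_phase(events):
    """Emit (date, value) pairs, bucketing each event by calendar day."""
    for timestamp, value in events:
        yield timestamp.date().isoformat(), value


def reduce_phase(pairs, accumulate=sum):
    """Group values by date, then apply a statistic-specific accumulator."""
    grouped = defaultdict(list)
    for day, value in pairs:
        grouped[day].append(value)
    return {day: accumulate(values) for day, values in grouped.items()}


def persist(store, statistic_name, per_day):
    """Write one row per statistic and one column per date into the stand-in store."""
    store.setdefault(statistic_name, {}).update(per_day)


# Hypothetical sample input: (timestamp, value) events.
events = [
    (datetime(2013, 10, 17, 9, 0), 2),
    (datetime(2013, 10, 17, 18, 30), 3),
    (datetime(2013, 10, 18, 11, 15), 7),
]

store = {}
persist(store, "daily_total", reduce_phase(map_phase(events)))
persist(store, "daily_max", reduce_phase(map_phase(events), accumulate=max))
print(store)
```

Passing the accumulator as a parameter is one way to model the "behaviour differs between statistics" remark: the map and persist steps stay the same and only the reduce logic changes.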
  8. Maximum, minimum, extremum on interval
     ● Max/min values are simple to calculate
     ● The extremum on an interval is calculated similarly to time series
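
As a rough sketch of an extremum on an interval, the function below keeps a running minimum and maximum per fixed-size time bucket. The function name `interval_extrema`, the use of integer second timestamps, and the one-hour interval are assumptions for illustration and do not come from the talk.

```python
def interval_extrema(samples, interval_seconds):
    """Return {interval_start: (min_value, max_value)} for each time bucket.

    `samples` is an iterable of (timestamp_in_seconds, value) pairs.
    """
    extrema = {}
    for timestamp, value in samples:
        # Align the timestamp to the start of its interval.
        start = timestamp - timestamp % interval_seconds
        low, high = extrema.get(start, (value, value))
        extrema[start] = (min(low, value), max(high, value))
    return extrema


# Hypothetical sample input: (seconds, value) pairs spanning two hours.
samples = [(0, 5.0), (1800, 2.0), (3599, 9.0), (3600, 4.0), (5400, 6.0)]
hourly = interval_extrema(samples, 3600)
print(hourly)
```

The running (min, max) pair can be merged with another pair for the same bucket, so partial results computed separately could be combined before a final write.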
  9. Fraud analysis
     ● Fraud analysis runs after all statistics are calculated
     ● Processed data is filtered by fraud filters
  10. Scalability approach
      ● Data is read from and written to Cassandra only
      ● Hadoop is elastically scalable
      ● Cassandra is elastically scalable
      ● No bottleneck
  11. Questions?
  12. Thank you!
