C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and a feat made even more challenging when analytic results have to be produced in real-time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.

Published in: Technology, Business

  1. Real Time Analytics with Cassandra, Hive, and Solr
  2. Real Time Analytics with Cassandra, Hive, and Solr
      Aaron Stannard, Founder & CEO of MarkedUp
  3. Powerful analytics tools for native apps
      • Understand your audience. Gain valuable data on your users.
      • Monitor your app’s health. Log errors and crashes remotely.
      • Drive more sales. Better data = more revenue.
  4. Do we really need real-time analytics?
  5. Real time analytics isn’t inherently superior or necessary.
  6. Building your own real-time analytics service with Cassandra and DataStax Enterprise
  7. Cassandra Setup on EC2
  8. Write Strategy
  9. Read Strategy
  10. Analytics Schema Strategy
      • All row keys should be predictable (not always possible)
      • Utilize physical sortability of columns
      • Use predictably sortable data types for column names (integers, dates)
      • Learn to love composite keys
      • Batch mutations are your friend
      • Use distributed counters for real-time metrics
      • Use TTL for automatic data expiration (if necessary)
  11. Time Series Schema 0: All Knowns
  12. Time Series Schema 1: Bounded Number of Unknowns
  13. Time Series Schema 2: Unbounded Number of Unknowns
  14. Schema Tips
  15. Adding Hive and Hadoop to the Mix: Mo’ data, mo’ problems
  16. When is Hadoop necessary?
      • Large volumes of data (100GB+)
      • Queries require retrospective / historical analysis
      • Need consistent results
      • Need to perform multi-stage analysis
      • Speed isn’t a concern (Hadoop is sloooooooooow)
  17. Hadoop on easy mode: Hive
      • SQL abstraction on top of Hadoop (more familiar)
      • Easier to deploy and test
      • Simplifies data warehousing
      • Easy to automatically import data from Cassandra
      • DSE eliminates need for HDFS
  18. C* to Hive
  19. Hive Syntax
      Query: count the number of items where “key” is greater than 100
      RDBMS> select key, count(1) from kv1 where key > 100 group by key;
      Hive> select key, count(1) from kv1 where key > 100 group by key;
  20. Hive Tips and Tricks
      • Don’t write data from Hive back to a hot Cassandra column family
      • If writing data from Hive to Cassandra, use dedicated column families
      • You can write to multiple places on a single Hive read (table, CSV file, etc.)
      • Use sampling to test Hive queries on scaled-down data sets
  21. How do you count millions of distinct items in real-time?
  22. • Solr: Lucene-based indexing engine
      • Part of Apache Foundation
      • Full-text search
      • Faceted search
      • Distributed
      • Integrates well with Cassandra
  23. Solr Index Setup
  24. Solr Search
  25. Questions or Comments?
      aaron@markedup.com
      https://markedup.com/
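
The schema advice on slide 10 (predictable row keys, sortable column names, composite keys, counters) can be illustrated with a small sketch. The transcript does not include the actual MarkedUp schema, so everything below is an assumption made for illustration: the key layout (app, metric, UTC day), the minute-of-day column names, and the in-memory dictionary standing in for a counter column family.

```python
from datetime import datetime, timezone

def row_key(app_id, metric, ts):
    """Predictable composite row key: one row per app, metric, and UTC day.

    The (app, metric, day) layout is a hypothetical choice, not the
    schema from the talk.
    """
    return f"{app_id}:{metric}:{ts.strftime('%Y%m%d')}"

def column_name(ts):
    """Sortable integer column name: minute of the day (0..1439)."""
    return ts.hour * 60 + ts.minute

# In-memory stand-in for a counter column family:
# {row key: {column name: count}}
counters = {}

def increment(app_id, metric, ts, delta=1):
    """Add delta to the counter for the minute that contains ts."""
    row = counters.setdefault(row_key(app_id, metric, ts), {})
    col = column_name(ts)
    row[col] = row.get(col, 0) + delta

def read_range(app_id, metric, day, start_minute, end_minute):
    """Return (column, count) pairs in column order for one day's row."""
    row = counters.get(row_key(app_id, metric, day), {})
    return [(c, row[c]) for c in sorted(row) if start_minute <= c <= end_minute]

# Four hypothetical session events on the same day.
for hour, minute in [(9, 15), (9, 15), (9, 47), (13, 2)]:
    increment("app42", "sessions",
              datetime(2013, 6, 11, hour, minute, tzinfo=timezone.utc))

day = datetime(2013, 6, 11, tzinfo=timezone.utc)
morning = read_range("app42", "sessions", day, 9 * 60, 10 * 60)
print(morning)  # [(555, 2), (587, 1)]
```

Because the row key can be computed from values the reader already knows, no lookup is needed to find the row, and integer column names keep a time-range read contiguous.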
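
Slide 19 makes the point that the same query text works in an RDBMS and in Hive. A minimal way to try the RDBMS side locally is shown below, using Python's built-in sqlite3 module as a stand-in database. The table layout and sample rows are assumptions; only the query string comes from the slide.

```python
import sqlite3

# In-memory SQLite database standing in for the "RDBMS" on slide 19.
conn = sqlite3.connect(":memory:")
conn.execute("create table kv1 (key integer, value text)")

# Hypothetical sample rows; the slide does not show the table contents.
conn.executemany(
    "insert into kv1 values (?, ?)",
    [(50, "a"), (150, "b"), (150, "c"), (200, "d")],
)

# Query text as it appears on the slide.
query = "select key, count(1) from kv1 where key > 100 group by key"

# Sort explicitly, since the query has no ORDER BY clause.
rows = sorted(conn.execute(query).fetchall())
print(rows)  # [(150, 2), (200, 1)]
conn.close()
```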
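
Slides 21 to 24 name Solr's faceted search as the tool for counting distinct items, but the transcript does not show the queries used. As a rough sketch, a facet request that returns per-value counts for one field could be built as follows. The host, the core name `analytics.events`, and the field `user_id` are hypothetical; the request parameters (`q`, `rows`, `facet`, `facet.field`, `facet.limit`, `facet.mincount`, `wt`) are standard Solr parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def facet_count_url(base_url, core, field, query="*:*", limit=10):
    """Build a Solr select URL that asks for facet counts on one field.

    rows=0 suppresses the matching documents so that only the facet
    counts are returned.
    """
    params = {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,    # number of facet values to return
        "facet.mincount": 1,     # skip values with no matching documents
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Hypothetical host, core, and field names.
url = facet_count_url("http://localhost:8983", "analytics.events", "user_id")
print(url)
```

The function only builds the URL, so it can be inspected without a running Solr node; sending it with any HTTP client would return the facet counts as JSON.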
