C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

10,906 views

Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and a feat made even more challenging when analytic results have to be produced in real-time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.

Published in: Technology, Business

  1. Real Time Analytics with Cassandra, Hive, and Solr
  2. Real Time Analytics with Cassandra, Hive, and Solr. Aaron Stannard, Founder & CEO of MarkedUp
  3. Powerful analytics tools for native apps. Understand your audience. Gain valuable data on your users. Monitor your app’s health. Log errors and crashes remotely. Drive more sales. Better data = more revenue.
  4. Do we really need real-time analytics?
  5. Real-time analytics isn’t inherently superior or necessary.
  6. Building your own real-time analytics service with Cassandra and DataStax Enterprise
  7. Cassandra Setup on EC2
  8. Write Strategy
  9. Read Strategy
  10. Analytics Schema Strategy
      • All row keys should be predictable (not always possible)
      • Utilize physical sortability of columns
      • Use predictably sortable data types for column names (integers, dates)
      • Learn to love composite keys
      • Batch mutations are your friend
      • Use distributed counters for real-time metrics
      • Use TTL for automatic data expiration (if necessary)
  11. Time Series Schema 0: All Knowns
  12. Time Series Schema 1: Bounded Number of Unknowns
  13. Time Series Schema 2: Unbounded Number of Unknowns
  14. Schema Tips
  15. Adding Hive and Hadoop to the Mix: Mo’ data, mo’ problems
  16. When is Hadoop necessary?
      • Large volumes of data (100GB+)
      • Queries require retrospective / historical analysis
      • Need consistent results
      • Need to perform multi-stage analysis
      • Speed isn’t a concern (Hadoop is sloooooooooow)
  17. Hadoop on easy mode: Hive
      • SQL abstraction on top of Hadoop (more familiar)
      • Easier to deploy and test
      • Simplifies data warehousing
      • Easy to automatically import data from Cassandra
      • DSE eliminates need for HDFS
  18. C* to Hive
  19. Hive Syntax
      Query: count the number of items where “key” is greater than 100
      RDBMS> select key, count(1) from kv1 where key > 100 group by key;
      Hive> select key, count(1) from kv1 where key > 100 group by key;
  20. Hive Tips and Tricks
      • Don’t write data from Hive back to a hot Cassandra column family
      • If writing data from Hive to Cassandra, use dedicated column families
      • You can write to multiple places on a single Hive read (table, CSV file, etc.)
      • Use sampling to test Hive queries on scaled-down data sets
  21. How do you count millions of distinct items in real-time?
  22. Solr
      • Solr: Lucene-based indexing engine
      • Part of Apache Foundation
      • Full-text search
      • Faceted search
      • Distributed
      • Integrates well with Cassandra
  23. Solr Index Setup
  24. Solr Search
  25. Questions or Comments? aaron@markedup.com https://markedup.com/
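
The schema advice on the "Analytics Schema Strategy" slide (predictable row keys, sortable column names, counters for real-time metrics) can be sketched in a few lines. The transcript does not include the actual schemas from the time-series slides, so everything below is an assumption made for illustration: the key layout, the hourly bucketing, and the names `row_key`, `column_name`, `increment`, and `read_day` are all hypothetical, and a plain in-memory dictionary stands in for a Cassandra counter column family.

```python
from collections import defaultdict
from datetime import datetime, timezone

def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Predictable row key (assumed layout): one row per app, metric, and day."""
    return f"{app_id}:{metric}:{ts.strftime('%Y%m%d')}"

def column_name(ts: datetime) -> int:
    """Sortable column name (assumed): the hour of the day as an integer."""
    return ts.hour

# In-memory stand-in for a counter column family:
# row key -> {column name -> running count}
counters = defaultdict(lambda: defaultdict(int))

def increment(app_id: str, metric: str, ts: datetime, delta: int = 1) -> None:
    """Bump the counter for the hour bucket that contains ts."""
    counters[row_key(app_id, metric, ts)][column_name(ts)] += delta

def read_day(app_id: str, metric: str, day: datetime) -> list:
    """Return (hour, count) pairs for one day, ordered by hour.

    The reader can rebuild the row key from what it already knows,
    so no lookup or scan is needed to find the row.
    """
    row = counters.get(row_key(app_id, metric, day), {})
    return sorted(row.items())

# Record three events and read the day back.
morning = datetime(2013, 6, 11, 9, 30, tzinfo=timezone.utc)
afternoon = datetime(2013, 6, 11, 14, 0, tzinfo=timezone.utc)
increment("app1", "sessions", morning)
increment("app1", "sessions", morning)
increment("app1", "sessions", afternoon)
day_counts = read_day("app1", "sessions", morning)
```

Because the row key is derived only from values the reader already has (app, metric, date), a read for a given day goes straight to one row, and integer column names keep the hours in order.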

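The "Hive Syntax" slide shows the same query text for an RDBMS and for Hive. As a rough check of the RDBMS half only, the query can be run unchanged against SQLite from Python's standard library. SQLite stands in for "an RDBMS" here, and the `kv1` column types and the sample rows are made up for the example; nothing below runs against Hive or Cassandra.

```python
import sqlite3

# In-memory database; the kv1 schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO kv1 VALUES (?, ?)",
    [(50, "a"), (150, "b"), (150, "c"), (200, "d")],
)

# Query text taken from the slide.
query = "select key, count(1) from kv1 where key > 100 group by key;"
rows = conn.execute(query).fetchall()

# GROUP BY does not guarantee an output order, so sort before comparing.
result = sorted(rows)
conn.close()
```

With the sample rows above, `result` holds one `(key, count)` pair for each key greater than 100.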