Real-Time Analytics with Cassandra, Hive, and Solr

Aaron Stannard and Caglar Oner, founding engineers at MarkedUp Analytics (www.markedup.com), on how they use Cassandra, Hive, and Solr to create real-time reports for hundreds of customers under burst loads.

Learn how to take advantage of Cassandra schema design, counter columns, Hive analysis, and Solr indexing to create robust, scalable analytics solutions.

This talk was presented at the Los Angeles Cassandra Users meetup (http://www.meetup.com/Los-Angeles-Cassandra-Users/events/104649892/) on March 12, 2013.

Published in: Technology

  1. Real Time Analytics with Cassandra, Hive, and Solr. LA Cassandra Users Group. By Aaron Stannard and Caglar Oner, March 2013
  2. What is MarkedUp? MarkedUp: powerful analytics tools for native apps.
     • Understand your audience. Gain valuable data on your users.
     • Monitor your app's health. Log errors and crashes remotely.
     • Drive more sales. Better data = more revenue.
  3. What Our Customers See
  4. The Analytics CAP Theorem - SCV
  5. SCV Examples
  6. The Truth about Analytics: real-time analytics isn't inherently superior or necessary.
  7. Analytic Speed is a Business Decision
  8. MarkedUp is a Blend of Both
  9. Why Cassandra for Real Time?
  10. Our Technology Stack
  11. Cassandra Setup on EC2
  12. Schema Strategy
  13. Write Strategy
  14. Read Strategy
  15. Time Series Schema 0: All Knowns
  16. Time Series Schema 1: Bounded Number of Unknowns
  17. Time Series Schema 2: Unbounded Number of Unknowns
  18. Schema Design Considerations
  19. Data, data and more data. More Data = More Problems
     • Every day is our peak day.
     • Every day we sign up more apps and our apps sign up more users.
  20. How Do We Manage this Data?
  21. Managing a Cluster
  22. Hadoop Overview
     • Two components: HDFS and Map / Reduce
     • Mimics Google File System and Map / Reduce
     • Started by Yahoo!
     • Part of Apache Foundation
     • Mature and has rich ecosystem
     • Scalable
     • Highly available
     • SLOOOOOWW
  23. Hive Overview
     • A SQL Map / Reduce abstraction
     • Originally developed as a BI tool for data warehousing @ Facebook
     • Rich data types (structs, lists and maps)
     • Efficient implementation of joins, group-bys
     • Scalable: easy to shard across nodes
     • Tunable performance
     • Extensible
  24. Hive in Production
     • Simple Aggregations
         SELECT applicationId, COUNT(1) AS dataPoints
         FROM ApplicationTable
         WHERE date = "2013-03-12"
         GROUP BY applicationId
     • Data Mining: User Retention, User Engagement
     • Ad hoc Analysis: How many sales by X?
  25. Hive Syntax. Query: count the number of items where "key" is greater than 100
         RDBMS> select key, count(1) from kv1 where key > 100 group by key;
         Hive>  select key, count(1) from kv1 where key > 100 group by key;
  26. Hive Tables vs. Cassandra Column Families
     • When working with Cassandra, Hive still requires its own tables
     • These are typically stored in HDFS; in DataStax Enterprise Edition they are stored in a dedicated Cassandra keyspace
     • Hive tables are mapped directly onto Cassandra column families
  27. Mapping Hive to Cassandra
         CREATE EXTERNAL TABLE AverageTime (rowkey binary, dates bigint, total bigint)
         STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
         WITH SERDEPROPERTIES (
           "cassandra.ks.name" = "myKeySpace",
           "cassandra.cf.name" = "myColumnFamily",
           "cassandra.columns.mapping" = ":key, :column, :value" )
         TBLPROPERTIES (
           "cassandra.range.size" = "100",
           "cassandra.slice.predicate.size" = "100" );
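One way to picture the `:key, :column, :value` mapping on this slide is that every column of a wide Cassandra row surfaces as its own Hive row. The sketch below only illustrates that reading; it is an assumption about how the storage handler behaves, not something the slides spell out. The function `to_hive_rows` and the sample data are hypothetical.

```python
def to_hive_rows(column_family):
    """Flatten {row_key: {column_name: value}} into (rowkey, dates, total) tuples.

    Assumed reading of ":key, :column, :value": one Hive row per Cassandra column.
    """
    rows = []
    for row_key, columns in column_family.items():
        # Cassandra keeps columns sorted by name inside a row, so sort here too.
        for column_name, value in sorted(columns.items()):
            rows.append((row_key, column_name, value))
    return rows


# Hypothetical contents of myColumnFamily: one wide row per app, one column per day.
my_column_family = {
    b"app-1": {20130311: 420, 20130312: 515},
    b"app-2": {20130312: 90},
}

hive_rows = to_hive_rows(my_column_family)
for row in hive_rows:
    print(row)
```

Under this reading, a time-series row with many date columns becomes many narrow Hive rows, which is what lets ordinary `GROUP BY` queries run over it.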
  28. Writing from Hive to Cassandra
         INSERT OVERWRITE TABLE AverageTime
         SELECT secondQuery.Project, secondQuery.SessionDay, AVG(secondQuery.SessionSum)
         FROM (
           SELECT firstQuery.Project, firstQuery.SessionDay,
                  SUM(firstQuery.SessionLength) as SessionSum
           FROM (
             SELECT HSessions.Project as Project, HSessions.Day as SessionDay,
                    HSessions.Length as SessionLength, HSessions.User as UserId
             FROM HSessions
           ) firstQuery
           GROUP BY firstQuery.Project, firstQuery.SessionDay, firstQuery.UserId
         ) secondQuery
         GROUP BY secondQuery.Project, secondQuery.SessionDay;
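The nested query on this slide aggregates in two stages: it first sums session length per user within each project and day, then averages those per-user sums per project and day. A minimal Python sketch of the same logic follows, assuming sessions arrive as `(project, day, user, length)` tuples; the function name and sample rows are hypothetical and not part of the talk.

```python
from collections import defaultdict


def average_daily_time(sessions):
    """Return {(project, day): average of per-user total session length}."""
    # Stage 1 (inner GROUP BY project, day, user): total session length per user.
    per_user = defaultdict(int)
    for project, day, user, length in sessions:
        per_user[(project, day, user)] += length

    # Stage 2 (outer GROUP BY project, day): average of the per-user totals.
    sums_by_day = defaultdict(list)
    for (project, day, _user), session_sum in per_user.items():
        sums_by_day[(project, day)].append(session_sum)
    return {key: sum(values) / len(values) for key, values in sums_by_day.items()}


# Hypothetical sample rows standing in for the HSessions table.
sessions = [
    ("app", "2013-03-12", "u1", 10),
    ("app", "2013-03-12", "u1", 20),
    ("app", "2013-03-12", "u2", 50),
]
result = average_daily_time(sessions)
print(result)
```

Averaging the per-user sums rather than the raw session lengths is what makes the result "time per user per day" instead of "time per session".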
  29. Hive User-Defined Functions
         # export HIVE_AUX_JARS_PATH=/home/user/scripts
         # custom function
         hive> create temporary function myFunc as 'com.markedup.hive.udf.Compositor';
         SELECT myFunc(Apps.AppId, "DailySessions") FROM Apps WHERE ….
  30. Hive Tips and Tricks
     • Don't write data from Hive back to a hot Cassandra column family
     • Create dedicated Cassandra column families for Hive
     • You can write to multiple places on a single Hive read
     • Use sampling to test Hive queries on scaled-down data sets
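The sampling tip can be illustrated with a small sketch. The slides do not say which sampling method was used; Hive itself offers a `TABLESAMPLE` clause, and the Python below only mimics the general idea of hash-bucket sampling so that a query can be tried on a fraction of the data. The names `in_sample` and `sample_rows` and the generated rows are hypothetical.

```python
import zlib


def in_sample(key, buckets=100, bucket=0):
    """Deterministically assign a key to one of `buckets` and keep a single bucket."""
    return zlib.crc32(key.encode("utf-8")) % buckets == bucket


def sample_rows(rows, key_of, buckets=100):
    """Keep only rows whose key hashes into bucket 0 (roughly 1/buckets of the keys)."""
    return [row for row in rows if in_sample(key_of(row), buckets)]


# Hypothetical data: 10,000 events spread over 1,000 users.
rows = [("user-%d" % (i % 1000), i) for i in range(10000)]
sample = sample_rows(rows, key_of=lambda r: r[0], buckets=10)
print(len(sample), "of", len(rows), "rows kept")
```

Hashing on the user key rather than picking random rows keeps every event for a sampled user together, so per-user aggregates computed on the sample still make sense.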
  31. Solr
     • Solr: Lucene-based indexing engine
     • Part of Apache Foundation
     • Full-text search
     • Faceted search
     • Distributed
     • Integrates well with Cassandra
  32. Real World Example: How do you count a billion distinct items in real-time?
  33. Solr Index Setup
  34. Solr Search
  35. We Are Hiring! markedup.com/jobs
  36. THANK YOU. Questions? markedup.com, team@markedup.com