Real Time Analytics with Cassandra, Hive, and Solr

Aaron Stannard and Caglar Oner, founding engineers at MarkedUp Analytics (www.markedup.com), on how they use Cassandra, Hive, and Solr to create real-time reports for hundreds of customers under burst loads.

Learn how to take advantage of Cassandra schema design, counter columns, Hive analysis, and Solr indexing to create robust, scalable analytics solutions.

This talk was presented at the Los Angeles Cassandra Users meetup (http://www.meetup.com/Los-Angeles-Cassandra-Users/events/104649892/) on March 12, 2013.

Published in: Technology

Transcript of "Real Time Analytics with Cassandra, Hive, and Solr"

  1. Real Time Analytics with Cassandra, Hive, and Solr. LA Cassandra User’s Group. By Aaron Stannard and Caglar Oner, March 2013.
  2. What is MarkedUp? MarkedUp: powerful analytics tools for native apps.
     • Understand your audience. Gain valuable data on your users.
     • Monitor your app’s health. Log errors and crashes remotely.
     • Drive more sales. Better data = more revenue.
  3. What Our Customers See
  4. The Analytics CAP Theorem - SCV
  5. SCV Examples
  6. The Truth about Analytics: Real time analytics isn’t inherently superior or necessary.
  7. Analytic Speed is a Business Decision
  8. MarkedUp is a Blend of Both
  9. Why Cassandra for Real Time?
  10. Our Technology Stack
  11. Cassandra Setup on EC2
  12. Schema Strategy
  13. Write Strategy
  14. Read Strategy
  15. Time Series Schema 0: All Knowns
  16. Time Series Schema 1: Bounded Number of Unknowns
  17. Time Series Schema 2: Unbounded Number of Unknowns
  18. Schema Design Considerations
  19. Data, data and more data. More Data = More Problems
      • Every day is our peak day.
      • Every day we sign up more apps and our apps sign up more users.
  20. How Do We Manage this Data?
  21. Managing a Cluster
  22. Hadoop Overview
      • Two components:
        • HDFS
        • Map / Reduce
      • Mimics Google File System and Map / Reduce
      • Started by Yahoo!
      • Part of Apache Foundation
      • Mature and has rich ecosystem
      • Scalable
      • Highly available
      • SLOOOOOWW
  23. Hive Overview
      • A SQL Map / Reduce abstraction
      • Originally developed as a BI tool for data warehousing @ Facebook
      • Rich data types (structs, lists and maps)
      • Efficient implementation of joins, group-bys
      • Scalable – easy to shard across nodes
      • Tunable performance
      • Extensible
  24. Hive in Production
      • Simple Aggregations
          SELECT applicationId, COUNT(1) AS dataPoints
          FROM ApplicationTable
          WHERE date = "2013-03-12"
          GROUP BY applicationId
      • Data Mining: User Retention, User Engagement
      • Ad hoc Analysis: How many sales by X?
  25. Hive Syntax. Query: count the number of items where "key" is greater than 100
        RDBMS> select key, count(1) from kv1 where key > 100 group by key;
        Hive>  select key, count(1) from kv1 where key > 100 group by key;
  26. Hive Tables vs. Cassandra Column Families
      • When working with Cassandra, Hive still requires its own tables
      • These are typically stored in HDFS
        • In DataStax Enterprise Edition they are implemented in a dedicated Cassandra keyspace
      • Hive tables are mapped directly onto Cassandra column families
  27. Mapping Hive to Cassandra
        CREATE EXTERNAL TABLE AverageTime (rowkey binary, dates bigint, total bigint)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES (
          "cassandra.ks.name" = "myKeySpace",
          "cassandra.cf.name" = "myColumnFamily",
          "cassandra.columns.mapping" = ":key, :column, :value" )
        TBLPROPERTIES (
          "cassandra.range.size" = "100",
          "cassandra.slice.predicate.size" = "100" );
  28. Writing from Hive to Cassandra
        INSERT OVERWRITE TABLE AverageTime
        SELECT secondQuery.Project, secondQuery.SessionDay, AVG(secondQuery.SessionSum)
        FROM (
          SELECT firstQuery.Project, firstQuery.SessionDay,
                 SUM(firstQuery.SessionLength) as SessionSum
          FROM (
            SELECT HSessions.Project as Project,
                   HSessions.Day as SessionDay,
                   HSessions.Length as SessionLength,
                   HSessions.User as UserId
            FROM HSessions
          ) firstQuery
          GROUP BY firstQuery.Project, firstQuery.SessionDay, firstQuery.UserId
        ) secondQuery
        GROUP BY secondQuery.Project, secondQuery.SessionDay;
  29. Hive User-Defined Functions
        # export HIVE_AUX_JARS_PATH=/home/user/scripts
        # custom function
        hive> create temporary function myFunc as 'com.markedup.hive.udf.Compositor';
        SELECT myFunc(Apps.AppId, "DailySessions") FROM Apps WHERE ….
  30. Hive Tips and Tricks
      • Don’t write data from Hive back to a hot Cassandra column family
      • Create dedicated Cassandra column families for Hive
      • You can write to multiple places on a single Hive read
      • Use sampling to test Hive queries on scaled-down data sets
  31. Solr
      • Solr: Lucene-based indexing engine
      • Part of Apache Foundation
      • Full-text search
      • Faceted search
      • Distributed
      • Integrates well with Cassandra
  32. Real World Example: How do you count a billion distinct items in real-time?
  33. Solr Index Setup
  34. Solr Search
  35. We Are Hiring! We’re hiring. markedup.com/jobs
  36. THANK YOU. Questions? markedup.com team@markedup.com
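
The INSERT OVERWRITE query on the "Writing from Hive to Cassandra" slide aggregates in two levels: it first sums session length per project, day, and user, then averages those per-user sums per project and day. The sketch below mirrors that logic on in-memory rows in Python so the result is easy to check by hand. It is an illustration only, not the authors' code: the function name, the tuple layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def average_session_time(sessions):
    """Mirror the slide's two-level aggregation on in-memory rows.

    `sessions` is an iterable of (project, day, user, length) tuples
    (a hypothetical layout standing in for the HSessions table).
    Returns {(project, day): average of per-user total session length}.
    """
    # Inner query: SUM(SessionLength) grouped by project, day, and user.
    per_user = defaultdict(int)
    for project, day, user, length in sessions:
        per_user[(project, day, user)] += length

    # Outer query: AVG of those per-user sums grouped by project and day.
    totals = defaultdict(lambda: [0, 0])  # (project, day) -> [sum, user count]
    for (project, day, _user), session_sum in per_user.items():
        bucket = totals[(project, day)]
        bucket[0] += session_sum
        bucket[1] += 1

    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample rows: (project, day, user, session length).
sample_sessions = [
    ("app-a", 20130312, "u1", 30),
    ("app-a", 20130312, "u1", 90),
    ("app-a", 20130312, "u2", 60),
    ("app-b", 20130312, "u3", 45),
]

result = average_session_time(sample_sessions)
```

Here user u1 contributes a per-user sum of 120 and u2 contributes 60, so the value stored for ("app-a", 20130312) is their average. Averaging per-user sums rather than raw session lengths is what makes the number a per-user figure instead of a per-session one.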
