Real time analytics with cassandra hive and solr
 


Aaron Stannard and Caglar Oner, founding engineers at MarkedUp Analytics (www.markedup.com), on how they use Cassandra, Hive, and Solr to create real-time reports for hundreds of customers under burst loads.

Learn how to take advantage of Cassandra schema design, counter columns, Hive analysis, and Solr indexing to create robust, scalable analytics solutions.

This talk was presented at the Los Angeles Cassandra Users meetup (http://www.meetup.com/Los-Angeles-Cassandra-Users/events/104649892/) on March 12, 2013.


Usage Rights

© All Rights Reserved

    Real time analytics with cassandra hive and solr: Presentation Transcript

    • Real Time Analytics with Cassandra, Hive, and Solr. LA Cassandra User’s Group. By Aaron Stannard and Caglar Oner. March 2013
    • What is MarkedUp? MarkedUp: powerful analytics tools for native apps.
      • Understand your audience. Gain valuable data on your users.
      • Monitor your app’s health. Log errors and crashes remotely.
      • Drive more sales. Better data = more revenue.
    • What Our Customers See
    • The Analytics CAP Theorem - SCV
    • SCV Examples
    • The Truth about Analytics: Real time analytics isn’t inherently superior or necessary.
    • Analytic Speed is a Business Decision
    • MarkedUp is a Blend of Both
    • Why Cassandra for Real Time?
    • Our Technology Stack
    • Cassandra Setup on EC2
    • Schema Strategy
    • Write Strategy
    • Read Strategy
    • Time Series Schema 0: All Knowns
    • Time Series Schema 1: Bounded Number of Unknowns
    • Time Series Schema 2: Unbounded Number of Unknowns
    • Schema Design Considerations
    • Data, data and more data: More Data = More Problems
      • Every day is our peak day.
      • Every day we sign up more apps and our apps sign up more users.
    • How Do We Manage this Data?
    • Managing a Cluster
    • Hadoop Overview
      • Two components:
        • HDFS
        • Map / Reduce
      • Mimics Google File System and Map / Reduce
      • Started by Yahoo!
      • Part of Apache Foundation
      • Mature and has a rich ecosystem
      • Scalable
      • Highly available
      • SLOOOOOWW
    • Hive Overview
      • A SQL Map / Reduce abstraction
      • Originally developed as a BI tool for data warehousing @ Facebook
      • Rich data types (structs, lists and maps)
      • Efficient implementation of joins, group-bys
      • Scalable – easy to shard across nodes
      • Tunable performance
      • Extensible
    • Hive in Production
      • Simple Aggregations
          SELECT applicationId, COUNT(1) AS dataPoints
          FROM ApplicationTable
          WHERE date = "2013-03-12"
          GROUP BY applicationId
      • Data Mining: User Retention, User Engagement
      • Ad hoc Analysis: How many sales by X?
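The simple aggregation on this slide can be tried locally before running it on a cluster. The following is a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in for Hive; the sample rows and the `data_points` variable are hypothetical and not from the talk. Note that the `WHERE` filter has to come before `GROUP BY`.

```python
import sqlite3

# Hypothetical rows: (applicationId, date). SQLite stands in for Hive here.
rows = [
    ("app-a", "2013-03-12"),
    ("app-a", "2013-03-12"),
    ("app-b", "2013-03-12"),
    ("app-a", "2013-03-11"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ApplicationTable (applicationId TEXT, date TEXT)")
conn.executemany("INSERT INTO ApplicationTable VALUES (?, ?)", rows)

# WHERE filters individual rows first; GROUP BY then aggregates what is left.
query = """
    SELECT applicationId, COUNT(1) AS dataPoints
    FROM ApplicationTable
    WHERE date = '2013-03-12'
    GROUP BY applicationId
    ORDER BY applicationId
"""
data_points = conn.execute(query).fetchall()
print(data_points)
```

The `ORDER BY` is only there to make the printed result stable; it is not part of the slide's query.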
    • Hive Syntax. Query: count the number of items where "key" is greater than 100
        RDBMS> select key, count(1) from kv1 where key > 100 group by key;
        Hive> select key, count(1) from kv1 where key > 100 group by key;
    • Hive Tables vs. Cassandra Column Families
      • When working with Cassandra, Hive still requires its own tables
      • These are typically stored in HDFS
        • In DataStax Enterprise Edition it’s implemented as a dedicated Cassandra keyspace
      • Hive tables are mapped directly onto Cassandra column families
    • Mapping Hive to Cassandra
        CREATE EXTERNAL TABLE AverageTime (rowkey binary, dates bigint, total bigint)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES (
          "cassandra.ks.name" = "myKeySpace",
          "cassandra.cf.name" = "myColumnFamily",
          "cassandra.columns.mapping" = ":key, :column, :value" )
        TBLPROPERTIES (
          "cassandra.range.size" = "100",
          "cassandra.slice.predicate.size" = "100" );
    • Writing from Hive to Cassandra
        INSERT OVERWRITE TABLE AverageTime
        SELECT secondQuery.Project, secondQuery.SessionDay, AVG(secondQuery.SessionSum)
        FROM (
          SELECT firstQuery.Project, firstQuery.SessionDay,
                 SUM(firstQuery.SessionLength) as SessionSum
          FROM (
            SELECT HSessions.Project as Project, HSessions.Day as SessionDay,
                   HSessions.Length as SessionLength, HSessions.User as UserId
            FROM HSessions
          ) firstQuery
          GROUP BY firstQuery.Project, firstQuery.SessionDay, firstQuery.UserId
        ) secondQuery
        GROUP BY secondQuery.Project, secondQuery.SessionDay;
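To make the two-level aggregation on this slide easier to follow, here is a small local sketch, again assuming SQLite via Python's `sqlite3` module in place of Hive and Cassandra. The session rows are hypothetical, the innermost renaming subquery is folded into the per-user step, and the result is read back rather than written to a column family. It first sums session length per user per day, then averages those sums per project per day.

```python
import sqlite3

# Hypothetical session rows: (Project, Day, Length, User).
sessions = [
    ("p1", 20130312, 10, "u1"),
    ("p1", 20130312, 20, "u1"),
    ("p1", 20130312, 50, "u2"),
    ("p1", 20130313, 45, "u1"),
    ("p2", 20130312, 5, "u3"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HSessions (Project TEXT, Day INTEGER, Length INTEGER, User TEXT)"
)
conn.executemany("INSERT INTO HSessions VALUES (?, ?, ?, ?)", sessions)

# Inner query: total session time per user per project per day.
# Outer query: average of those per-user totals per project per day.
query = """
    SELECT perUser.Project, perUser.SessionDay, AVG(perUser.SessionSum)
    FROM (
        SELECT Project, Day AS SessionDay, User AS UserId,
               SUM(Length) AS SessionSum
        FROM HSessions
        GROUP BY Project, Day, User
    ) perUser
    GROUP BY perUser.Project, perUser.SessionDay
    ORDER BY perUser.Project, perUser.SessionDay
"""
average_time = conn.execute(query).fetchall()
for project, day, avg_sum in average_time:
    print(project, day, avg_sum)
```

For project `p1` on 20130312, user `u1` totals 30 and user `u2` totals 50, so the average per-user session time is 40.0.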
    • Hive User-Defined Functions
        # export HIVE_AUX_JARS_PATH=/home/user/scripts
        # custom function
        hive> create temporary function myFunc as 'com.markedup.hive.udf.Compositor';
        SELECT myFunc(Apps.AppId, "DailySessions") FROM Apps WHERE ….
    • Hive Tips and Tricks
      • Don’t write data from Hive back to a hot Cassandra column family
      • Create dedicated Cassandra column families for Hive
      • You can write to multiple places on a single Hive read
      • Use sampling to test Hive queries on scaled-down data sets
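The sampling tip can be illustrated with a small sketch. The slides do not say which sampling method was used; Hive itself offers `TABLESAMPLE` for this. The code below is only an assumption-laden illustration of the idea in plain Python: it keeps rows whose key hashes into one of `num_buckets` buckets, so every row for a sampled user stays together. The helper names `in_sample` and `sample_rows` and the generated rows are hypothetical.

```python
import zlib


def in_sample(key: str, bucket: int, num_buckets: int) -> bool:
    """Return True when the key hashes into the chosen bucket (0-based)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_buckets == bucket


def sample_rows(rows, key_index, bucket=0, num_buckets=10):
    """Keep only the rows whose key column falls into the chosen bucket."""
    return [row for row in rows if in_sample(row[key_index], bucket, num_buckets)]


# Hypothetical data set: 200 users with 10 rows each.
rows = [(f"user-{i % 200}", i) for i in range(2000)]

sample = sample_rows(rows, key_index=0)
sampled_users = {user for user, _ in sample}
print(len(rows), len(sample), len(sampled_users))
```

Sampling by key rather than by row keeps per-user aggregates (such as sessions per user) meaningful on the scaled-down data set, at the cost of a sample size that is only roughly one tenth of the input.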
    • Solr
      • Solr: Lucene-based indexing engine
      • Part of Apache Foundation
      • Full-text search
      • Faceted search
      • Distributed
      • Integrates well with Cassandra
    • Real World Example: How do you count a billion distinct items in real-time?
    • Solr Index Setup
    • Solr Search
    • We Are Hiring! We’re hiring. markedup.com/jobs
    • THANK YOU. Questions? markedup.com team@markedup.com