Counters for real-time statistics

Using counters in Apache Cassandra for real-time statistics.


    1. Counters for real-time statistics (Aug 2011)
    2. Quick Cassandra storage primer
    3. Standard columns
       • Idempotent writes: last client timestamp wins
       • Store byte[] values; can have validators
       • No internal locking
       • No read before write
       • Example:
         set Users['ecapriolo']['fname']='ed';
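The "last client timestamp wins" rule above can be sketched in a few lines. This is a minimal Python illustration of the conflict-resolution idea, not Cassandra's actual implementation; the `resolve` function and the timestamps are invented for the example.

```python
# Sketch (illustrative, not Cassandra internals): last-client-timestamp-wins
# resolution for a standard column. Replaying writes in any order gives the
# same winner, which is what makes the writes idempotent.

def resolve(writes):
    """Return the value of the (timestamp, value) pair with the highest timestamp."""
    return max(writes, key=lambda w: w[0])[1]

# Two clients set Users['ecapriolo']['fname'] with different client timestamps.
writes = [(1000, "ed"), (999, "edward")]
print(resolve(writes))                  # -> ed
print(resolve(list(reversed(writes))))  # same result regardless of arrival order
```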
    4. Counter columns
       • Store integral values only
       • Can be incremented or decremented with a single RPC
       • Local read before write
       • Merged on read
       • Example:
         incr followers['ecapriolo']['x'] by 30
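"Merged on read" is the key difference from standard columns. A rough Python sketch of the idea, with invented replica names: each replica accumulates its own partial count, and a read sums the partials.

```python
# Sketch (illustrative only): each replica holds a local partial count,
# incremented with a local read-before-write; the total is merged on read.
from collections import defaultdict

shards = defaultdict(int)  # replica id -> local partial count

def incr(replica, delta):
    shards[replica] += delta  # local read-modify-write on one replica

def read():
    return sum(shards.values())  # merge all partials on read

incr("replica1", 30)
incr("replica2", 12)
print(read())  # -> 42
```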
    5. Counters combine powers with:
       • Composite keys: incr stats['user/date']['page'] by 1;
       • Scale: distribute writes across the cluster
       And you get:
       • A distributed system to record events
       • Pre-calculated real-time stats
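The composite-key bullet above can be sketched with plain dictionaries. This is an illustrative Python model of the `stats['user/date']['page']` pattern (the names and values are examples, not the deck's code): composing user and date into the row key spreads increments over many rows instead of one hot row.

```python
# Sketch: a composite row key like 'user/date' distributes writes across
# many rows, so no single row becomes a write hotspot.
from collections import Counter

counts = Counter()

def incr_stat(user, date, page, delta=1):
    row_key = f"{user}/{date}"  # composite key, as on the slide
    counts[(row_key, page)] += delta

incr_stat("ecapriolo", "2011-08-01", "page")
incr_stat("ecapriolo", "2011-08-01", "page")
print(counts[("ecapriolo/2011-08-01", "page")])  # -> 2
```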
    6. Other ways to collect and report
       • Store in files, process into reports
         • Example: data -> HDFS -> Hive queries -> reports
         • Light work on the front end
         • Heavy on the back end
       • Store in a relational database
         • Example: data -> RDBMS (ind) -> real-time queries & reports -> reports
         • Divides work between front end and back end
         • Indexes can become choke points
    7. Example data set
       url        | username | event_time          | time_to_serve_millis
       /page1.htm | edward   | 2011-01-02 04:01:04 | 45
       /page1.htm | stacey   | 2011-01-02 04:01:05 | 46
       /page1.htm | stacey   | 2011-01-02 04:02:07 | 40
       /page2.htm | edward   | 2011-01-02 04:02:45 | 22
    8. "Query" one: hit count, bucketed by minute
       page       | time             | count
       /page1.htm | 2011-01-02 04:01 | 2
       /page1.htm | 2011-01-02 04:02 | 1
       /page2.htm | 2011-01-02 04:02 | 1
    9. "Query" two: resources consumed by user per hour
       user   | time          | total_time_to_serve
       edward | 2011-01-02 04 | 67
       stacey | 2011-01-02 04 | 86
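Both "queries" can be derived from the example data set with plain dictionaries, which shows exactly what the counters pre-compute at write time. A Python sketch (the truncation-based bucketing here stands in for the deck's SimpleDateFormat buckets):

```python
# Sketch: compute "query" one and "query" two from the example data set.
from collections import Counter

rows = [
    ("/page1.htm", "edward", "2011-01-02 04:01:04", 45),
    ("/page1.htm", "stacey", "2011-01-02 04:01:05", 46),
    ("/page1.htm", "stacey", "2011-01-02 04:02:07", 40),
    ("/page2.htm", "edward", "2011-01-02 04:02:45", 22),
]

hits = Counter()    # query one: (page, minute bucket) -> hit count
served = Counter()  # query two: (user, hour bucket) -> total time_to_serve

for url, user, ts, millis in rows:
    hits[(url, ts[:16])] += 1          # "yyyy-MM-dd HH:mm" minute bucket
    served[(user, ts[:13])] += millis  # "yyyy-MM-dd HH" hour bucket

print(hits[("/page1.htm", "2011-01-02 04:01")])  # -> 2
print(served[("stacey", "2011-01-02 04")])       # -> 86
```

The sums match the two result tables on the slides: two hits on /page1.htm in minute 04:01, and 46 + 40 = 86 ms served for stacey in hour 04.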
    10. Turn a record line into a POJO
        class Record {
            String url, username;
            Date date;
            int timeToServe;
        }
        Use your imagination here:
        public static List<Record> readRecords(String file) throws Exception {
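The deck leaves `readRecords` to the imagination. One possible sketch of that parsing step, in Python rather than the deck's Java (field names mirror the `Record` class; the pipe-delimited format matches the example data set):

```python
# Sketch: parse one pipe-delimited record line into a Record object.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Record:
    url: str
    username: str
    date: datetime
    time_to_serve: int

def parse_line(line):
    url, user, ts, millis = [field.strip() for field in line.split("|")]
    return Record(url, user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), int(millis))

r = parse_line("/page1.htm | edward | 2011-01-02 04:01:04 | 45")
print(r.username, r.time_to_serve)  # -> edward 45
```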
    11. writeRecord() method
        public static void writeRecord(Cassandra.Client c, Record r) throws Exception {
            DateFormat bucketByMinute = new SimpleDateFormat("yyyy-MM-dd HH:mm");
            DateFormat bucketByDay = new SimpleDateFormat("yyyy-MM-dd");
            DateFormat bucketByHour = new SimpleDateFormat("yyyy-MM-dd HH");
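The three SimpleDateFormat patterns bucket one timestamp at three granularities. The same idea with Python's strftime equivalents, for one timestamp from the example data set:

```python
# Sketch: one event time, three bucket granularities (minute, hour, day),
# mirroring the three SimpleDateFormat patterns in writeRecord().
from datetime import datetime

d = datetime(2011, 1, 2, 4, 1, 4)            # 2011-01-02 04:01:04
bucket_by_minute = d.strftime("%Y-%m-%d %H:%M")  # "2011-01-02 04:01"
bucket_by_hour = d.strftime("%Y-%m-%d %H")       # "2011-01-02 04"
bucket_by_day = d.strftime("%Y-%m-%d")           # "2011-01-02"
print(bucket_by_minute, "|", bucket_by_hour, "|", bucket_by_day)
```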
    12. "Query" 1: page counts by minute
        CounterColumn counter = new CounterColumn();
        ColumnParent cp = new ColumnParent("page_counts_by_minute");
        counter.setName(ByteBufferUtil.bytes(bucketByMinute.format(r.date)));
        counter.setValue(1);
        c.add(ByteBufferUtil.bytes(bucketByDay.format(r.date) + "-" + r.url),
              cp, counter, ConsistencyLevel.ONE);
    13. "Query" 2: usage by user per hour
        CounterColumn counter2 = new CounterColumn();
        ColumnParent cp2 = new ColumnParent("user_usage_by_minute");
        counter2.setName(ByteBufferUtil.bytes(bucketByHour.format(r.date)));
        counter2.setValue(r.timeToServe);
        c.add(ByteBufferUtil.bytes(bucketByDay.format(r.date) + "-" + r.username),
              cp2, counter2, ConsistencyLevel.ONE);
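Both add() calls follow the same shape: a day-bucketed composite row key, a finer-grained bucket as the column name, and a delta. A Python sketch of the key/column/delta each call produces for the first example record:

```python
# Sketch: the row key, column name, and delta the two c.add() calls build
# for the record (/page1.htm, edward, 2011-01-02 04:01:04, 45).
day, minute, hour = "2011-01-02", "2011-01-02 04:01", "2011-01-02 04"
url, user, time_to_serve = "/page1.htm", "edward", 45

# page_counts_by_minute: one row per page per day, one column per minute, +1
q1_row, q1_col, q1_delta = f"{day}-{url}", minute, 1

# user_usage_by_minute: one row per user per day, one column per hour, +millis
q2_row, q2_col, q2_delta = f"{day}-{user}", hour, time_to_serve

print(q1_row, q1_col, q1_delta)  # -> 2011-01-02-/page1.htm 2011-01-02 04:01 1
print(q2_row, q2_col, q2_delta)  # -> 2011-01-02-edward 2011-01-02 04 45
```

These row keys and column names are exactly what shows up in the `list` output on the results slides.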
    14. How this works
    15. Results
        [default@counttest] list user_usage_by_minute;
        -------------------
        RowKey: 2011-01-02-stacey
        => (counter=2011-01-02 04, value=86)
        -------------------
        RowKey: 2011-01-02-edward
        => (counter=2011-01-02 04, value=67)
    16. More results
        [default@counttest] list page_counts_by_minute;
        -------------------
        RowKey: 2011-01-02-/page1.htm
        => (counter=2011-01-02 04:01, value=2)
        => (counter=2011-01-02 04:02, value=1)
        -------------------
        RowKey: 2011-01-02-/page2.htm
        => (counter=2011-01-02 04:02, value=1)
    17. Recap
        • Counters push work to the "front end"
          • Data is bucketed, sorted, and indexed on insert
          • Data is already "ready" on read
          • Schema is designed around how you want to read the data
        • Writes are distributed across the cluster
          • Data is bucketed by time, user, page, etc.
          • A different contention point than a table/index
    18. Questions? Full code at: http://www.jointhegrid.com/highperfcassandra/?cat=7
