Real-World Cassandra at ShareThis
Use Cases, Data Modeling, and Hector

For this upcoming meetup Juan Valencia, Principal Engineer at ShareThis, will be presenting on their real-world use of Apache Cassandra for high throughput and mission critical applications.

This meetup will cover how to set up your projects successfully by having a good data model, running Cassandra, and using the Hector Java client. We will have a Q&A session at the end of Juan's presentation, to ensure everyone's questions are answered.

Hope you can make it!

What You Will Learn at this Meetup:

• Real-World Use Case on ShareThis + Apache Cassandra

• Data Modeling with Apache Cassandra

• Using the Java Hector Client Library with Cassandra

Abstract

Juan Valencia, Principal Engineer at ShareThis, will be presenting on the use of Cassandra for high throughput applications. ShareThis has been running on Cassandra since version 0.6 and currently runs 4 Cassandra clusters, powering batch analytics, real-time analytics, a counter service, and a data lookup service.

Published in: Technology, Business
  • We can change the look of the slide (and featured publishers), but I feel the ecosystem is a cool concept and graphic for getting a quick overview of who we are. The text below can be worked in somehow too, with the new look of this slide. Maybe the text can be cut down too.
    ShareThis empowers publishers with solutions to improve and drive value from the social engagement of their site. People share content that's most relevant to them, with people who they believe will also enjoy the content. More than 2.5 million publishers increase eyeballs, engagement, and advertising revenue through the ShareThis sharing platform.
  • Real-World Cassandra at ShareThis

    1. Real-World Cassandra at ShareThis: Use Cases, Data Modeling, and Hector
    2. ShareThis + Our Customers: Keys to Unlocking Social
        1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES)
        2. TAKE YOUR SOCIAL INVENTORY TO MARKET
        3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY
    3. Largest Ecosystem For Sharing and Engagement Across The Web
        ● 120 SOCIAL CHANNELS
        ● SHARETHIS ECOSYSTEM: 211 MILLION PEOPLE (95.1% of the web)
        ● 2.4 MILLION PUBLISHERS
        Source: ComScore U.S. January 2013; internal numbers, January 2013
    4. Data Modeling and Why it Matters (Keep it even, Keep it slice-able)
    5. Use Cases
    6. A New Product: SnapSets
        3 - x1.large
    7. Use Case: SnapSets, A New Product
    8. Use Case: SnapSets, A New Product (Continued)

        CF: Users (userId)
            meta:first_name=Ronald
            meta:last_name=Melencio
            meta:username=ronsharethis
            scrapbook:timestamp:scrapbookId:name=Scrapbook 1
            scrapbook:timestamp:scrapbookId:date_created=Jan 10
            url1:sid:clipID={LOCATION DATA}
            url1:sid:456={LOCATION DATA}

        CF: Scrapbooks (scrapbookId)
            clip:timestamp:clipId:url=sharethis.com
            clip:timestamp:clipId:title=Clip 1
            clip:timestamp:clipId:likes=10

        CF: Clip (clipId)
            comment:timestamp:commentId={"name":"Ronald","timestamp":"jan 10","comment":"hi"}

        CF: Stats (user:userId,application,publisher:pubId)
            meta:total_scrapbooks=1
            meta:total_clips=100
            meta:total_scrapbook_comments=100
            scrapbook:timestamp:scrapbookId:total_comments=10
            scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:likes=10
            scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:dislikes=10
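The column names on slide 8 put the grouping first (`meta:`, `scrapbook:`), which is what makes a row slice-able: because columns are stored sorted by name, one range query can pull a whole group. The sketch below illustrates the idea with a plain `TreeMap` standing in for a single row. It does not use Hector, and the class name, the timestamp `1357776000`, and the id `sb1` are made-up placeholders.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: a sorted in-memory map stands in for one Cassandra row.
final class PrefixSliceSketch {

    /** Returns every column whose name starts with the given prefix, in sorted order. */
    static SortedMap<String, String> slice(NavigableMap<String, String> row, String prefix) {
        // '\uffff' sorts after any character used in these names, so the range
        // [prefix, prefix + '\uffff') covers exactly the names with that prefix.
        return row.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    /** A sample Users row modeled on slide 8 (timestamp and id are hypothetical). */
    static NavigableMap<String, String> sampleUserRow() {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("meta:first_name", "Ronald");
        row.put("meta:last_name", "Melencio");
        row.put("meta:username", "ronsharethis");
        row.put("scrapbook:1357776000:sb1:name", "Scrapbook 1");
        row.put("scrapbook:1357776000:sb1:date_created", "Jan 10");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(slice(sampleUserRow(), "meta:"));
        System.out.println(slice(sampleUserRow(), "scrapbook:"));
    }
}
```

Putting the timestamp directly after the group name means a slice over `scrapbook:` comes back in time order, which is presumably why the names are laid out that way.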
    9. Use Cases
    10. High Velocity Reads and Writes: Count Service
        9 – hi1.4xlarge
        9 – x1.large
    11. Use Case: Count Service for URLs
        ● 1 billion pageviews per day ≈ 12k pageviews per second
        ● 60 million social referrals per day ≈ 700 social referrals per second
        ● 1 million shares per day ≈ 12 shares per second
        ● No expiration of data* (3bn rows)
        ● Requires the minimum possible latency
        ● Multiple read requests per page on blogs
        ● Normalize and hash the URL for a row key
        ● Each social channel is a column
        ● Retrieve the whole row for counts
        * Fix it by cheating ^_^
    12. Use Cases
    13. Insights that Matter – Your Social Analytics Dashboard
        Timely Social Analytics
        ● Benchmark your social engagement with SQI
        ● Identify popular articles
        ● Dive deeper into your most social content
        ● Measure social activity on an hourly, daily, weekly & monthly basis
        ● Uncover which social channels are driving the most social traffic
        12 - x1.large
    14. Use Case: Loading Processed Batch Data
        ● Backend Hadoop stack for processing analytics
        ● 58 JSON schemas map tabular data to key/value storage for slicing
        ● MongoDB* did not scale for frequent row-level writes on the same table
        ● Needed to maintain read throughput during write spikes when analytics jobs finished
        ● No TTL* - Works daily, doesn't work hourly
        ● Switching from Astyanax to Hector
        ● Using a Hector client through Java APIs
    15. Use Case: Loading Processed Batch Data (continued)

        {
          "schema": [
            {"column_name":"publisher","column_type":"UTF8Type","column_level":"common","column_master":""},
            {"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""},
            {"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
            {"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
            {"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
            {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""},
            {"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""}
          ],
          "row_key_format": "publisher:domain:month",
          "column_family_name": "sqi_table"
        }

        CF -> Data Type
        Row -> Publisher:domain:timestamp
        Columns -> master:slave = value (topics, categories, urls, timestamps, etc)
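To make the mapping on slide 15 concrete, here is a hedged sketch of how one tabular record might be turned into a row key and `master:slave` column names. It is an interpretation of the slide, not the loader ShareThis ran: the assumption is that the row key is the listed fields joined by `:` and that each composite column name is the master field's value followed by the slave field's name. The class name, method names, and sample values are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: one reading of "row_key_format" and "master:slave = value".
final class BatchRowMappingSketch {

    /** Builds a row key by substituting each field named in the format, e.g. "publisher:domain:month". */
    static String rowKey(Map<String, String> record, String rowKeyFormat) {
        StringBuilder key = new StringBuilder();
        for (String field : rowKeyFormat.split(":")) {
            if (key.length() > 0) {
                key.append(':');
            }
            key.append(record.get(field));
        }
        return key.toString();
    }

    /** Builds "masterValue:slaveField" column names mapped to the slave field's value. */
    static Map<String, String> compositeColumns(
            Map<String, String> record, String masterField, String... slaveFields) {
        Map<String, String> columns = new TreeMap<>();
        String masterValue = record.get(masterField);
        for (String slave : slaveFields) {
            columns.put(masterValue + ":" + slave, record.get(slave));
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> record = new TreeMap<>();
        record.put("publisher", "examplepub");
        record.put("domain", "example.com");
        record.put("month", "2013-01");
        record.put("category", "sports");
        record.put("sqi", "0.8");
        System.out.println(rowKey(record, "publisher:domain:month"));
        System.out.println(compositeColumns(record, "category", "sqi"));
    }
}
```

With the master value leading the column name, all statistics for one category sit next to each other in the row and can be read with a single slice.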
    16. Use Cases
    17. Insights that Matter – Your Social Analytics Dashboard
        Real Time Social Analytics
        ● Benchmark your social engagement with SQI
        ● Identify trending articles in real-time
        ● Dive deeper into your most social content
        ● Measure social activity on an hourly, daily, weekly & monthly basis
        ● Uncover which social channels are driving the most social traffic
        12 cc1.4xlarge
    18. Insights that Matter – Your Social Analytics Dashboard
    19. Insights that Matter – And aren't accessible
    20. Insights that Matter – And aren't accessible
    21. Insights that Matter – And aren't accessible
        ● Too many columns – unbounded url / channel sets
        ● Cascading failure
        ● Solutions:
            – Bigger boxes – meh...
            – Split up the columns – split the rowkeys
                ● Hash URLs and keep stats separate
            – Split up the columns – split the CF
                ● Move URLs to their own space
            – Split up the columns – split the Keyspace
                ● Keyspace is a timestamp
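One of the fixes on slide 21 is to split an over-wide row across several row keys. The slide does not say how the split was done, so the following is only a sketch of a common approach: derive a bucket number from the column name and append it to the base row key. The bucket count, key layout, and all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: bucketing scheme and key layout are assumptions.
final class RowBucketSketch {

    /** Picks the row key that a given column should be written to. */
    static String bucketedRowKey(String baseKey, String columnName, int buckets) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(columnName.hashCode(), buckets);
        return baseKey + ":" + bucket;
    }

    /** Lists every row key a reader must query to see all columns for baseKey. */
    static List<String> allBucketKeys(String baseKey, int buckets) {
        List<String> keys = new ArrayList<>(buckets);
        for (int i = 0; i < buckets; i++) {
            keys.add(baseKey + ":" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(bucketedRowKey("examplepub:example.com", "http://example.com/story", 16));
        System.out.println(allBucketKeys("examplepub:example.com", 4));
    }
}
```

The trade-off is on the read side: a write touches one bucket, but reading everything for a publisher means one query per bucket, so the bucket count has to stay fixed (or be recorded) once data has been written.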
    22. Ask Good Data Modeling Questions
    23. ● How many rows will there be?
        ● How many columns per row will you need?
        ● How will you slice your data?
        ● What is the maximum number of rows?
        ● What is the maximum number of columns?
        ● Is your data relational?
        ● How long will your data live?
    24. Hector: https://github.com/hector-client/hector/wiki/User-Guide
    25. Hector Imports

        import me.prettyprint.cassandra.model.BasicColumnFamilyDefinition;
        import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
        import me.prettyprint.cassandra.serializers.LongSerializer;
        import me.prettyprint.cassandra.serializers.StringSerializer;
        import me.prettyprint.cassandra.service.ColumnSliceIterator;
        import me.prettyprint.cassandra.service.ThriftCfDef;
        import me.prettyprint.cassandra.service.ThriftKsDef;
        import me.prettyprint.cassandra.service.template.ColumnFamilyResult;
        import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate;
        import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate;
        import me.prettyprint.hector.api.beans.ColumnSlice;
        import me.prettyprint.hector.api.beans.HColumn;
        import me.prettyprint.hector.api.beans.HCounterColumn;
        import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
        import me.prettyprint.hector.api.ddl.ComparatorType;
        import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
        import me.prettyprint.hector.api.exceptions.HectorException;
        import me.prettyprint.hector.api.factory.HFactory;
        import me.prettyprint.hector.api.mutation.Mutator;
        import me.prettyprint.hector.api.query.ColumnQuery;
        import me.prettyprint.hector.api.query.CounterQuery;
        import me.prettyprint.hector.api.query.QueryResult;
        import me.prettyprint.hector.api.query.SliceCounterQuery;
        import me.prettyprint.hector.api.query.SliceQuery;

        // Not listed on the slide, but used by the snippets on the following slides:
        import me.prettyprint.hector.api.Cluster;
        import me.prettyprint.hector.api.Keyspace;
        import java.util.HashMap;
        import java.util.Iterator;
        import java.util.List;
        import java.util.Map;
    26. Hector: Add a keyspace

        public static Cluster getCluster(String name, String hosts) {
            return HFactory.getOrCreateCluster(name, hosts);
        }

        public static KeyspaceDefinition createKeyspaceDefinition(String keyspaceName, int replication) {
            return HFactory.createKeyspaceDefinition(
                keyspaceName,
                ThriftKsDef.DEF_STRATEGY_CLASS, // "org.apache.cassandra.locator.SimpleStrategy"
                replication,
                null // ArrayList of CF definitions
            );
        }

        public static void addKeyspace(Cluster cluster, KeyspaceDefinition ksDef) {
            // Only create the keyspace if the cluster does not already know it.
            KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(ksDef.getName());
            if (keyspaceDef == null) {
                cluster.addKeyspace(ksDef, true); // true = block until the schema change completes
                System.out.println("Created keyspace: " + ksDef.getName());
            } else {
                System.err.println("Keyspace already exists");
            }
        }
    27. Hector: Define a CF

        public static ColumnFamilyDefinition createGenericColumnFamilyDefinition(
                String ksName, String cfName, ComparatorType ctName) {
            BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
            columnFamilyDefinition.setKeyspaceName(ksName);
            columnFamilyDefinition.setName(cfName);
            columnFamilyDefinition.setDefaultValidationClass(ctName.getClassName());
            columnFamilyDefinition.setReplicateOnWrite(true);
            return new ThriftCfDef(columnFamilyDefinition);
        }

        public static ColumnFamilyDefinition createCounterColumnFamilyDefinition(String ksName, String cfName) {
            BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
            columnFamilyDefinition.setKeyspaceName(ksName);
            columnFamilyDefinition.setName(cfName);
            columnFamilyDefinition.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
            columnFamilyDefinition.setReplicateOnWrite(true);
            return new ThriftCfDef(columnFamilyDefinition);
        }
    28. Hector: Add a CF

        // Obtain a Keyspace handle for an existing keyspace:
        Keyspace k = HFactory.createKeyspace(nameString, cluster);

        public static void addColumnFamily(Cluster cluster, Keyspace keyspace, ColumnFamilyDefinition cfDef) {
            KeyspaceDefinition ksDef = cluster.describeKeyspace(keyspace.getKeyspaceName());
            if (ksDef != null) {
                List<ColumnFamilyDefinition> list = ksDef.getCfDefs();
                String cfName = cfDef.getName();
                boolean exists = false;
                // Skip creation if a column family with this name is already defined.
                for (ColumnFamilyDefinition myCfDef : list) {
                    if (myCfDef.getName().equals(cfName)) {
                        exists = true;
                        System.err.println("Found Column Family: " + cfName + ". Did not insert.");
                    }
                }
                if (!exists) {
                    cluster.addColumnFamily(cfDef, true);
                    System.out.println("Created column family: " + cfDef.getName());
                }
            } else {
                System.err.println("Keyspace definition is null");
            }
        }
    29. Hector: Insert Column

        public static void insertColumn(
                Cluster cluster, Keyspace keyspace,
                String cfName, String rowKey,
                String columnName, String columnValue) {
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            //HFactory.createColumn(columnName, columnValue, StringSerializer.get(), StringSerializer.get())
            HColumn<String, String> hCol = HFactory.createStringColumn(columnName, columnValue);
            mutator.insert(rowKey, cfName, hCol);
            mutator.execute();
        }

        public static void incrementCounter(
                Cluster cluster, Keyspace keyspace,
                String cfName, String rowKey,
                String counterColumnName) {
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            mutator.insertCounter(
                rowKey, cfName,
                HFactory.createCounterColumn(counterColumnName, 1, StringSerializer.get()));
            mutator.execute();
        }
    30. Hector: Read Column

        public static String getColumn(
                Cluster cluster, Keyspace keyspace,
                String cfName, String rowKey,
                String columnName) {
            ColumnQuery<String, String, String> query = HFactory.createColumnQuery(
                keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            query.setColumnFamily(cfName).setKey(rowKey).setName(columnName);
            HColumn<String, String> value = query.execute().get();
            if (value != null) {
                return value.getValue();
            }
            return ""; // column not found
        }
    31. Hector: Read Column
    32. Hector: Read Column

        public static long getCounter(
                Cluster cluster, Keyspace keyspace,
                String cfName, String rowKey,
                String counterColumnName) {
            CounterQuery<String, String> query =
                HFactory.createCounterColumnQuery(keyspace, StringSerializer.get(), StringSerializer.get());
            query.setColumnFamily(cfName).setKey(rowKey).setName(counterColumnName);
            HCounterColumn<String> counter = query.execute().get();
            if (counter != null) {
                return counter.getValue();
            }
            return 0; // counter not found
        }
    33. Hector: Read A Slice

        public static Map<String, String> getSlice(
                Cluster cluster, Keyspace keyspace,
                String cfName, String rowKey,
                String start, String end, boolean reversed, int count) {
            SliceQuery<String, String, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            // for counter use HFactory.createCounterSliceQuery
            query.setColumnFamily(cfName);
            query.setKey(rowKey);
            query.setRange(start, end, reversed, count);
            Iterator<HColumn<String, String>> iter = query.execute().get().getColumns().iterator();
            Map<String, String> answer = new HashMap<String, String>();
            while (iter.hasNext()) {
                HColumn<String, String> temp = iter.next();
                answer.put(temp.getName(), temp.getValue());
            }
            return answer;
        }
    34. Hector: Read All Columns

        public static Map<String, String> getAllValues(
                Cluster cluster, String keyspace,
                String cf, String rowkey) {
            HashMap<String, String> values = new HashMap<String, String>();
            Keyspace keyspaceObject = HFactory.createKeyspace(keyspace, cluster);
            SliceQuery<String, String, String> query = HFactory.createSliceQuery(
                keyspaceObject, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            // Empty start and end select the whole row; the count caps the result at 10000 columns.
            query.setColumnFamily(cf).setKey(rowkey).setRange("", "", true, 10000);
            QueryResult<ColumnSlice<String, String>> result = query.execute();
            Iterator<HColumn<String, String>> iter = result.get().getColumns().iterator();
            while (iter.hasNext()) {
                HColumn<String, String> current = iter.next();
                values.put(current.getName(), current.getValue());
            }
            return values;
        }
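The `getAllValues` method on slide 34 asks for at most 10000 columns in a single slice, so a wider row would come back truncated. A usual way around a fixed cap is to read in pages, starting each page just after the last column name seen. The sketch below shows only that paging logic, with a `TreeMap` standing in for the row and a hypothetical `page` method standing in for one bounded slice query. It is not Hector code, although Hector's `ColumnSliceIterator` (imported on slide 25) serves a similar purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: paging logic over a sorted in-memory stand-in for one row.
final class PagedSliceSketch {

    /** Stand-in for one bounded slice: up to count columns named after afterExclusive (null = from the start). */
    static List<Map.Entry<String, String>> page(
            NavigableMap<String, String> row, String afterExclusive, int count) {
        NavigableMap<String, String> tail =
            afterExclusive == null ? row : row.tailMap(afterExclusive, false);
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> entry : tail.entrySet()) {
            if (out.size() == count) {
                break;
            }
            out.add(entry);
        }
        return out;
    }

    /** Reads the whole row by repeating bounded slices until a short page signals the end. */
    static Map<String, String> readAll(NavigableMap<String, String> row, int pageSize) {
        Map<String, String> all = new TreeMap<>();
        String last = null;
        while (true) {
            List<Map.Entry<String, String>> current = page(row, last, pageSize);
            for (Map.Entry<String, String> entry : current) {
                all.put(entry.getKey(), entry.getValue());
            }
            if (current.size() < pageSize) {
                return all; // fewer columns than requested: nothing left to read
            }
            last = current.get(current.size() - 1).getKey();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < 25; i++) {
            row.put(String.format("col%03d", i), "v" + i);
        }
        System.out.println(readAll(row, 10).size());
    }
}
```

Paging keeps each request small, which matters for the unbounded rows described on slide 21.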
    35. Hector: DANGER

        // Drops every keyspace except "system" and "OpsCenter". This destroys data.
        private static void dropAllKeyspaces(Cluster cluster) {
            for (KeyspaceDefinition ksDef : cluster.describeKeyspaces()) {
                if (!(ksDef.getName().equals("system") || ksDef.getName().equals("OpsCenter"))) {
                    cluster.dropKeyspace(ksDef.getName(), true);
                    System.out.println("Dropped keyspace: " + ksDef.getName());
                }
            }
        }

        private static void dropKeyspace(Cluster cluster, String keyspace) {
            KeyspaceDefinition ksDef = createKeyspaceDefinition(keyspace, Hector.replication);
            cluster.dropKeyspace(ksDef.getName(), true);
            System.out.println("Dropped keyspace: " + ksDef.getName());
        }

        private static void dropColumnFamily(Cluster cluster, String keyspace, String cf) {
            cluster.dropColumnFamily(keyspace, cf);
            System.out.println("Dropped Column Family: " + cf);
        }
    36. Conclusions
        ● Data Modeling is Important
        ● Use Cassandra for write throughput
        ● Keep your ring even and your data slice-able
        ● Wrap your libraries and switch when you need to
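The last conclusion, wrapping your libraries so you can switch, is what made the move from Astyanax to Hector on slide 14 manageable. As a rough sketch of the idea, application code can depend on a small interface while each client library gets its own implementation behind it. The interface and class names below are hypothetical, and the in-memory implementation exists only so the example runs without a cluster; a Hector-backed version would delegate to methods like those on slides 29 and 30.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: the interface the application codes against.
interface ColumnStore {
    void insertColumn(String cfName, String rowKey, String columnName, String value);

    /** Returns the column value, or "" when it is missing (same convention as slide 30). */
    String getColumn(String cfName, String rowKey, String columnName);
}

// In-memory stand-in; a real implementation would wrap a Cassandra client.
final class InMemoryColumnStore implements ColumnStore {
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    private static String rowId(String cfName, String rowKey) {
        return cfName + "/" + rowKey;
    }

    @Override
    public void insertColumn(String cfName, String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowId(cfName, rowKey), k -> new ConcurrentHashMap<>())
            .put(columnName, value);
    }

    @Override
    public String getColumn(String cfName, String rowKey, String columnName) {
        Map<String, String> row = rows.get(rowId(cfName, rowKey));
        return row == null ? "" : row.getOrDefault(columnName, "");
    }
}

final class WrapperDemo {
    /** Application code sees only ColumnStore, so the backing client can be swapped. */
    static String roundTrip(ColumnStore store) {
        store.insertColumn("Users", "user42", "meta:first_name", "Ronald");
        return store.getColumn("Users", "user42", "meta:first_name");
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new InMemoryColumnStore()));
    }
}
```

Keeping the interface narrow (strings in, strings out) is a design choice for the sketch; a production wrapper would likely also expose slices and counters.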
    37. We're hiring: http://www.sharethis.com/about/careers
        ● Work with REAL big data, billions of requests per day
        ● Work on products that millions of people see and interact with on a daily basis
        ● Work with a real-time pipeline, machine learning, complex user models
        ● #1 fastest growing company in San Francisco
        ● Free lunches
        ● ... and of course work with a bunch of fun, smart people and PhDs
    38. Thank You!
