Cassandra eu

Presentation on how The Dachis Group uses Cassandra and Hadoop to do social analytics.

Speaker notes

  • Slide 3: This is how they see their capability after, for example, the Super Bowl.
  • Slide 4: When managers ask how effective the campaign was, the marketing department says it was awesome. When asked how they know that, they say that Zoltar told them so. In reality there are a lot of homegrown methods, some good, some not so good. Some of what we did grew out of a spreadsheet that was manually updated, validated, and refined over time with one of our major customers.
  • Slide 5: What brands does Berkshire Hathaway have under its gigantic umbrella?!? Can mention Red Bull, Disney, HP, Levi's, Samsung, Honda, etc.
  • Slide 6: Operationally simple doesn’t mean that you don’t need to learn a lot about it, just that there aren’t a lot of moving parts. Unique use case in that it’s hybrid: lots of writes as well as analytics and reads.
  • Slide 8: It’s just scads of text, but we do classify: conversations, long/short (the difference between microblogs and blogs). We may use Hadoop to generate alternate CFs for specific queries as we need them.
  • Slide 9: Company information is unique because we had to buy, borrow, steal, and yes, crowdsource that data. Pig handles joins really well, for example joining account snapshots and signals for enrichment.
  • Slide 10: Mention Brandon’s work to make things better with CassandraStorage and newer versions of Pig, including regression tests. Speculative execution.
  • Slide 11: Mention having looked at Azkaban as well. No real way around the logs; it just takes getting used to. User impersonation is a product of the authorization framework; a patch was added to DSE.
  • Slide 12: Mention consistency level choices. Rotate racks: yeah, that wasn’t documented except in the code. Backup/restore. Root causes are sometimes difficult to determine. Scaling up: each order-of-magnitude jump has its own problems.
  • Slide 13: But the long sleepless summer finally pays off...
  • Slide 14: Everlasting gobstoppers are a fun phase for the projects.
  • Slide 15: Reveals numbers.
  • Slide 16: Explanation. Great working as a team. Mention Boxing Day.
  • Slide 17: Also customer-curated topics in the future.
  • Slide 19: Data consistency: periodic checks, staging cluster, unit tests, integration testing. Repairable data. Sometimes incredibly painful, but possible. Mention backup/restore. Mention root causes.
  • Slide 20: Be active in the communities of these new projects. If necessary, start building communities around them. Don’t just take: answer questions, follow mailing lists (but have a filter), and contribute docs, bug submissions, feature requests, votes, representation, tests, patches/pull requests.

Cassandra eu Presentation Transcript

  • 1. Powering Social Business Intelligence
    Cassandra and Hadoop at the Dachis Group
  • 2. Social Business Wha?
    Big Data meets Big Budgets
      • Brand marketers spend
          • $450 (~£270) billion annually on traditional media
          • $50 (~£30) billion annually on SEO/SEM
      • Starting to transition to social media
  • 3. Effectiveness - Traditional
    Measure all the things!
    Measuring traditional marketing effectiveness
  • 4. Effectiveness - Social
    Measure all the things!
    Measuring social marketing effectiveness
  • 5. The Dachis Group
    Measure all the things!
      • Jeff Dachis amasses small army of social strategists
      • Funds team to create social analytics platform
          • Measure business outcomes of social media strategies
          • Track social media surrounding Forbes Global 2000
          • Include all brands, all subsidiaries, all social media types
  • 6. Architecture
      • Raw data in S3
      • Cassandra
          • Realtime queries to return raw data
          • Hadoop analytic integration for foundational measures
          • Horizontal scalability
          • Operationally simple
      • RDBMS
          • Time rollups of measures
          • Aggregates and composite measures
          • Arbitrary dimensional queries
          • Mini data warehouse
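The "time rollups of measures" bullet on the Architecture slide can be illustrated with a minimal Python sketch. The deck does not show the RDBMS schema or queries, so the row shape `(timestamp, measure, value)`, the function name `rollup`, and the sample data are all assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows, bucket_format):
    """Sum values per (time bucket, measure).

    rows: iterable of (timestamp, measure, value) tuples (hypothetical shape).
    bucket_format: strftime pattern that defines the bucket, e.g. one per day.
    """
    totals = defaultdict(float)
    for timestamp, measure, value in rows:
        totals[(timestamp.strftime(bucket_format), measure)] += value
    return dict(totals)


# Made-up sample data, not taken from the presentation.
rows = [
    (datetime(2012, 3, 1, 9, 15), "mentions", 3),
    (datetime(2012, 3, 1, 9, 45), "mentions", 2),
    (datetime(2012, 3, 1, 17, 5), "mentions", 4),
    (datetime(2012, 3, 2, 8, 0), "mentions", 1),
]

hourly = rollup(rows, "%Y-%m-%d %H:00")  # one bucket per hour
daily = rollup(rows, "%Y-%m-%d")         # one bucket per day
```

In a relational store the same idea would typically be a `GROUP BY` over a truncated timestamp; the sketch only shows the shape of the aggregation.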
  • 7. Pipeline
    [Diagram. Components: Memcached, AWS S3, Cassandra, Postgres. Labels as extracted: "Raw Signal Signal Metrics / Storage Repository Store". Stages: Normalization, Enrichment, Analysis.]
  • 8. Normalization
      • Parallel copy from S3 to HDFS
      • MapReduce to Cassandra from Raw to Normalized CF
      • Normalized data model
          • Decent investment to get right
          • Mostly for conceptual reasons rather than concerns about queries
          • Secondary indexes vs app-maintained indexes
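To make the "secondary indexes vs app-maintained indexes" trade-off concrete, here is a minimal Python sketch of an application-maintained index. Plain dictionaries stand in for two column families; the class name `SignalStore`, the `brand` column, and the row keys are hypothetical and do not come from the presentation. With a built-in secondary index the database would maintain the lookup itself; here the application has to write both structures on every update.

```python
class SignalStore:
    """In-memory stand-in for a data column family plus an index column family."""

    def __init__(self):
        self.signals = {}   # row key -> columns (the data)
        self.by_brand = {}  # brand -> set of row keys (the app-maintained index)

    def put(self, key, columns):
        # If the row already exists under a different brand, remove the stale index entry.
        old = self.signals.get(key)
        if old is not None and old.get("brand") != columns.get("brand"):
            self.by_brand.get(old.get("brand"), set()).discard(key)
        # Write the data row and the index entry together.
        self.signals[key] = dict(columns)
        self.by_brand.setdefault(columns.get("brand"), set()).add(key)

    def find_by_brand(self, brand):
        # Read the index first, then fetch each referenced row.
        return [self.signals[k] for k in sorted(self.by_brand.get(brand, ()))]


store = SignalStore()
store.put("s1", {"brand": "acme", "text": "first post"})
store.put("s2", {"brand": "acme", "text": "second post"})
store.put("s1", {"brand": "globex", "text": "first post, reassigned"})
acme_rows = store.find_by_brand("acme")
```

The cost of this approach is visible in `put`: every write has to keep the index consistent, which is the work a secondary index would otherwise do for the application.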
  • 9. Enrichment
      • Enrich with
          • Unique company/brand information
          • Sentiment
          • Relationships
          • Conversations
          • Social graph information
      • Enter Pig
      • Enter Oozie
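The enrichment step is described as joining signals with company and account information, which the team did in Pig. The following Python sketch only shows the shape of such a join; it is not the team's Pig script, and the field names (`account_id`, `brand`, `followers`) and sample records are assumptions.

```python
def enrich(signals, accounts):
    """Left-join each signal to its account snapshot on account_id."""
    by_id = {account["account_id"]: account for account in accounts}
    enriched = []
    for signal in signals:
        # Signals without a matching account keep None for the added fields.
        account = by_id.get(signal["account_id"], {})
        enriched.append({
            **signal,
            "brand": account.get("brand"),
            "followers": account.get("followers"),
        })
    return enriched


# Made-up sample records for illustration.
signals = [
    {"signal_id": "s1", "account_id": "a1", "text": "love the new phone"},
    {"signal_id": "s2", "account_id": "a9", "text": "meh"},
]
accounts = [{"account_id": "a1", "brand": "ExampleBrand", "followers": 1200}]
enriched = enrich(signals, accounts)
```

In Pig the equivalent would be a `JOIN` (or a left outer join) between the two relations, executed as MapReduce jobs rather than in memory.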
  • 10. The Bleeding Edge
    Pig
      • newlogicalplan in 0.8.0
      • Debugging/tracing?
      • Incremental development
      • Working with Cassandra
      • Pygmalion - facilitating to and from Cassandra
      • Experience, unit test framework, UDFs, community slowly became
  • 11. The Bleeding Edge
    Oozie
      • Learning curve and common errors
          • User impersonation
          • Logs, we haz them, lots of them
          • Web UI needs love
      • Specific to Cassandra
          • mapreduce.fileoutputcommitter.marksuccessfuljobs
          • See http://wiki.apache.org/cassandra/HadoopSupport#Oozie
      • Still very good DAG workflow crunching tool
          • Subworkflows, fork/join, regular scheduling, dataset detection
          • Extensible
          • Apache Incubator (@oozie on Twitter, #oozie on Freenode)
  • 12. The Bleeding Edge
    Cassandra
      • Rack aware snitch and replication
      • Always rotate racks in order in topology
      • In EC2 this likely means rotate AZs
      • Dealing with scanning over column families
      • Project early
      • General tuning and unique workload
      • Mahout and other higher memory hadoop tasks
      • EC2 instance types
      • Visualization tool helped (OpsCenter, Acunu has Control Center)
      • Community++
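The advice to "always rotate racks in order in topology" can be motivated with a toy model in Python. This is not Cassandra's replica placement code. It assumes a simplified rule: for each node's primary range, walk the ring clockwise and prefer nodes on racks not yet used for that range. The node names, rack names, and the function `replica_load` are hypothetical.

```python
from collections import Counter


def replica_load(ring, rf):
    """Count how many ranges each node stores under a simplified rack-aware rule.

    ring: list of (node, rack) tuples in token order.
    rf: replication factor.
    """
    load = Counter()
    n = len(ring)
    for start in range(n):
        chosen, used_racks, skipped = [], set(), []
        for step in range(n):
            node, rack = ring[(start + step) % n]
            if rack in used_racks:
                skipped.append(node)      # same rack as an earlier replica
            else:
                chosen.append(node)       # first node seen on a new rack
                used_racks.add(rack)
            if len(chosen) == rf:
                break
        # With fewer racks than replicas, fall back to nodes skipped earlier.
        chosen.extend(skipped[: rf - len(chosen)])
        load.update(chosen)
    return load


# Racks alternate around the ring.
rotated = [("a1", "r1"), ("b1", "r2"), ("a2", "r1"),
           ("b2", "r2"), ("a3", "r1"), ("b3", "r2")]
# Racks are clumped together around the ring.
clumped = [("a1", "r1"), ("a2", "r1"), ("a3", "r1"),
           ("b1", "r2"), ("b2", "r2"), ("b3", "r2")]

rotated_load = replica_load(rotated, rf=2)
clumped_load = replica_load(clumped, rf=2)
```

Under this model the rotated ring gives every node the same number of ranges, while the clumped ring concentrates replicas on the first node of each rack. In EC2, where the slide suggests treating availability zones as racks, the same reasoning would apply to zones.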
  • 13. Social Business Index
    Launches September 2011
      • Global Ranking of Companies
      • Industry Rankings
      • Visualization of strategy
  • 14. This might actually work!
      • Fall 2011, built up the team
      • Expertise in Pig, Lucene/Solr, machine learning, statistics, event prediction and analysis
      • Making everlasting gobstoppers
  • 15. Social Performance Monitor
    The measures behind the score
  • 16. Topics topics topics
      • Black Friday
          • Science project!
          • Mallet, Pig
          • Custom analysis
      • Super Bowl
      • Oscars
  • 17. Productizing Topics
      • Ongoing automated topic detection
      • Lessons from one-off topic analysis
      • Represented by term distributions
      • Threads with detail like
          • Signal volume
          • Participants
          • Links
          • Sentiment gauge
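The phrase "represented by term distributions" can be illustrated with a short Python sketch that turns a handful of signal texts into relative term frequencies. The deck mentions Mallet and Pig for the actual topic work; this sketch does not reproduce either, and the whitespace tokenization, function name, and sample texts are assumptions chosen for brevity.

```python
from collections import Counter


def term_distribution(texts):
    """Return each term's share of all tokens across the given texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


# Made-up signal texts for a hypothetical "Black Friday" topic.
texts = ["black friday deals", "Friday sale"]
distribution = term_distribution(texts)
```

A topic model produces one such distribution per topic rather than one per hand-picked set of texts, but the resulting structure, a mapping from terms to probabilities, is the same.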
  • 18. Advocates
      • Auto-discovery of potential advocates
      • Curated set of known advocates
      • Example signal (from Cassandra)
      • Reports and other useful bits
  • 19. Lessons learned
      • Emerging products are sometimes frustrating, but well worth the pain in their respective niche.
      • “Never underestimate the massive impact of small bugs in big data.” (@peteskomoroch at LinkedIn)
      • Community karma
  • 20. A Note on Community
      • Community involvement
          • IRC, mailing lists, Twitter, conferences, meetups
          • Newer projects have little or outdated docs
          • Some features may be
              • Deprecated
              • Not ready for primetime
              • Not a fit for your use case
      • Community karma
          • Don’t just take
          • Be a bridge builder
          • Positive karma helps
  • 21. Questions?
      • We’re hiring
      • Ping me @jeromatron (Twitter and IRC)