Cassandra eu


Presentation on how The Dachis Group uses Cassandra and Hadoop to do social analytics.

Published in: Technology, Business

  • Slide 3: This is how they see their capability after, for example, the Super Bowl.
  • Slide 4: When managers ask how effective the campaign was, the marketing department says it was awesome. When asked how they know that, they say that Zoltar told them so. In reality there are a lot of home-grown methods, some good, some not so good. Some of what we did grew out of a spreadsheet that was manually updated, validated and refined over time with one of our major customers.
  • Slide 5: What brands does Berkshire Hathaway have under its gigantic umbrella?!? Can mention Red Bull, Disney, HP, Levi's, Samsung, Honda, etc.
  • Slide 6: Operationally simple doesn’t mean that you don’t need to learn a lot about it, just that there aren’t a lot of moving parts. Unique use case in that it’s hybrid: both lots of writes and lots of analytics and reads.
  • Slide 8: It’s just scads of text, but we do classify: conversations, long/short (the difference between microblogs and blogs). We may use Hadoop to generate alternate CFs for specific queries as we need them.
  • Slide 9: Company information is unique because we had to buy, borrow, steal and, yes, crowdsource that data. Pig handles joins really well, for example account snapshots and signal for enrichment.
  • Slide 10: Mention Brandon’s work to make things better with CassandraStorage and newer versions of Pig, including regression tests. Speculative execution.
  • Slide 11: Mention having looked at Azkaban as well. No real way around the logs, it just takes getting used to. User impersonation is a product of the authorization framework; patch added to DSE.
  • Slide 12: Mention consistency level choices. Rotate racks - yeah, that wasn’t documented except in the code. Backup/restore. Root causes are sometimes difficult to determine. Scaling up - each order of magnitude jump has its own problems.
  • Slide 13: But the long sleepless summer finally pays off...
  • Slide 14: Everlasting gobstoppers are a fun phase for the projects.
  • Slide 15: Reveals numbers.
  • Slide 16: Explanation. Great working as a team. Mention Boxing Day.
  • Slide 17: Also customer-curated topics in the future.
  • Slide 19: Data consistency - periodic checks, staging cluster, unit tests, integration testing. Repairable data: sometimes incredibly painful, but possible. Mention backup/restore. Mention root causes.
  • Slide 20: Be active in the communities of these new projects. If necessary, start building communities around them. Don’t just take: answer questions, follow mailing lists (but have a filter), and contribute docs, bug submissions, feature requests, votes, representation, tests, patches/pull requests.

    1. Powering Social Business Intelligence: Cassandra and Hadoop at the Dachis Group
    2. Social Business Wha? Big Data meets Big Budgets
        • Brand marketers spend
            • $450 (~£270) billion annually on traditional media
            • $50 (~£30) billion annually on SEO/SEM
        • Starting to transition to social media
    3. Effectiveness - Traditional: Measure all the things! Measuring traditional marketing effectiveness
    4. Effectiveness - Social: Measure all the things! Measuring social marketing effectiveness
    5. The Dachis Group: Measure all the things!
        • Jeff Dachis amasses small army of social strategists
        • Funds team to create social analytics platform
            • Measure business outcomes of social media strategies
            • Track social media surrounding Forbes Global 2000
            • Include all brands, all subsidiaries, all social media types
    6. Architecture
        • Raw data in S3
        • Cassandra
            • Realtime queries to return raw data
            • Hadoop analytic integration for foundational measures
            • Horizontal scalability
            • Operationally simple
        • RDBMS
            • Time rollups of measures
            • Aggregates and composite measures
            • Arbitrary dimensional queries
            • Mini data warehouse
    7. Pipeline (diagram). Components: Memcached, AWS S3 (raw signal storage), Cassandra (signal repository), Postgres (metrics store). Stages: normalization, enrichment, analysis.
    8. Normalization
        • Parallel copy from S3 to HDFS
        • MapReduce to Cassandra from Raw to Normalized CF
        • Normalized data model
            • Decent investment to get right
            • Mostly for conceptual reasons rather than concerns about queries
            • Secondary indexes vs app maintained indexes
    9. Enrichment
        • Enrich with
            • Unique company/brand information
            • Sentiment
            • Relationships
            • Conversations
            • Social graph information
        • Enter Pig
        • Enter Oozie
    10. The Bleeding Edge: Pig
        • newlogicalplan in 0.8.0
        • Debugging/tracing?
        • Incremental development
        • Working with Cassandra
        • Pygmalion - facilitating to and from Cassandra
        • Experience, unit test framework, UDFs, community slowly became
    11. The Bleeding Edge: Oozie
        • Learning curve and common errors
            • User impersonation
            • Logs, we haz them, lots of them
            • Web UI needs love
        • Specific to Cassandra
            • mapreduce.fileoutputcommitter.marksuccessfuljobs
            • See
        • Still very good DAG workflow crunching tool
            • Subworkflows, fork/join, regular scheduling, dataset detection
            • Extensible
            • Apache Incubator (@oozie on twitter, #oozie on freenode)
    12. The Bleeding Edge: Cassandra
        • Rack aware snitch and replication
        • Always rotate racks in order in topology
        • In EC2 this likely means rotate AZs
        • Dealing with scanning over column families
        • Project early
        • General tuning and unique workload
        • Mahout and other higher memory hadoop tasks
        • EC2 instance types
        • Visualization tool helped (OpsCenter, Acunu has Control Center)
        • Community++
    13. Social Business Index: Launches September 2011
        • Global Ranking of Companies
        • Industry Rankings
        • Visualization of strategy
    14. This might actually work!
        • Fall 2011, built up the team
        • Expertise in Pig, Lucene/Solr, machine learning, statistics, event prediction and analysis
        • Making everlasting gobstoppers
    15. Social Performance Monitor: The measures behind the score
    16. Topics topics topics
        • Black Friday
            • Science project!
            • Mallet, Pig
            • Custom analysis
        • Super Bowl
        • Oscars
    17. Productizing Topics
        • Ongoing automated topic detection
        • Lessons from one-off topic analysis
        • Represented by term distributions
        • Threads with detail like
            • Signal volume
            • Participants
            • Links
            • Sentiment gauge
    18. Advocates
        • Auto-discovery of potential advocates
        • Curated set of known advocates
        • Example signal (from Cassandra)
        • Reports and other useful bits
    19. Lessons learned
        • Emerging products are sometimes frustrating, but well worth the pain in their respective niche.
        • “Never underestimate the massive impact of small bugs in big data.” (@peteskomoroch at LinkedIn)
        • Community karma
    20. A Note on Community
        • Community involvement
            • IRC, mailing lists, twitter, conferences, meetups
            • Newer projects have little or outdated docs
            • Some features may be
                • Deprecated
                • Not ready for primetime
                • Not a fit for your use case
        • Community karma
            • Don’t just take
            • Be a bridge builder
            • Positive karma helps
    21. Questions?
        • We’re hiring
        • Ping me @jeromatron (Twitter and IRC)
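
The rack-rotation advice on slide 12 ("always rotate racks in order in topology", which in EC2 likely means rotating AZs) can be sketched in Python. This is a minimal illustration under stated assumptions, not the Dachis Group's tooling: the function names, the node and AZ names, and the evenly spaced tokens over a 2**127 ring (the RandomPartitioner range) are all hypothetical choices made for the example. The idea is to order nodes so that consecutive ring positions cycle through the racks in a fixed order, then hand out tokens in that order.

```python
from itertools import zip_longest


def rotate_racks(nodes_by_rack):
    """Interleave nodes so consecutive ring positions cycle through racks in order.

    nodes_by_rack maps a rack (or EC2 availability zone) name to its node names.
    """
    racks = sorted(nodes_by_rack)  # fixed, repeatable rack order
    ordered = []
    # Take one node from each rack per round; skip racks that have run out.
    for group in zip_longest(*(nodes_by_rack[rack] for rack in racks)):
        ordered.extend(node for node in group if node is not None)
    return ordered


def evenly_spaced_tokens(count, ring_size=2**127):
    """Evenly spaced tokens on a ring of the assumed size (hypothetical default)."""
    return [i * ring_size // count for i in range(count)]


def assign_tokens(nodes_by_rack):
    """Pair each node, in rack-rotated order, with an evenly spaced token."""
    ordered = rotate_racks(nodes_by_rack)
    return list(zip(ordered, evenly_spaced_tokens(len(ordered))))


if __name__ == "__main__":
    # Hypothetical cluster: two nodes in each of three availability zones.
    cluster = {
        "us-east-1a": ["a1", "a2"],
        "us-east-1b": ["b1", "b2"],
        "us-east-1c": ["c1", "c2"],
    }
    for node, token in assign_tokens(cluster):
        print(node, token)
```

With the hypothetical cluster above, the ring order alternates a, b, c, a, b, c, so neighbouring token ranges never sit in the same zone. Uneven racks still interleave as far as the smaller racks allow, but the tail of the ring then comes from the larger rack alone.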