Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data


Often, analyzing more and more data doesn’t improve your results: you just make the same mistakes at a larger scale. Crimson Hexagon CTO Christopher Bingham discusses several techniques that leverage the quantity of data, increasing accuracy as you scale. Big data can thus lead to better analysis, not just bigger analysis.

Published in: Technology, Education

  1. BETTER ALGORITHMS FROM BIGGER DATA. Chris Bingham, CTO, Crimson Hexagon. April 26th, 2012
  2. INTRODUCTION: Crimson Hexagon and me
  3. ABOUT CRIMSON HEXAGON
     • Founded 4 years ago; now 40+ employees in Boston
     • Help companies make actionable business decisions
     • Based on unique analysis of social media and internal data
     • Customers include F100, agencies, UN
     • Tech stack:
       • Java, with R for algorithms
       • Massive Lucene infrastructure with custom shard management
       • Distributed computing framework for analysis
       • Hadoop increasingly used
  4. BIG DATA, BETTER DATA, BETTER ALGORITHMS
     • World’s largest searchable social media archive
     • >200 billion posts in 2012
     • Adding 1 billion every 2-3 days
     • Twitter, Facebook, blogs, forums, comments, news, etc.
  5. BIG DATA, BETTER DATA, BETTER ALGORITHMS
     • Who’s talking and listening?
       • Demographics
       • Interests
       • Relationships
     • Trends and comparisons
       • Compared to yourself, over time
       • Compared to industry, competitors, etc.
     • Human input
       • Define specific business question and possible answers
       • Provides focus and context
  6. BIG DATA, BETTER DATA, BETTER ALGORITHMS
     • Based on work by co-founder Gary King at Harvard
     • Takes all those billions of posts, plus the human input
     • Leverages the human judgment to massive scale
     • Quantitative answers to specific business questions
     • Accurate in any language
  7. ALGORITHMS AND BIG DATA: The problem of leverage
  8. MACHINE LEARNING: Let’s consider a typical data-analysis problem using machine learning. How does having more data help (or hurt) us?
  9. DEFINE CATEGORIES: Some set of user-defined categories (AKA topics, classes, etc.). [Diagram: categories A, B, C, D]
  10. PROVIDE TRAINING: Training examples to map features to categories. [Diagram: categories A, B, C, D]
  11. LEARN A MODEL: Algorithm classifies items into categories based on training data. [Diagram: categories A, B, C, D]
  12. CLASSIFY ITEMS: Incoming unknown items to be classified. [Diagram: unknown items w, x, y, z next to categories A, B, C, D]
  13. OBTAIN RESULTS: Result: Items are classified, hopefully correctly! [Diagram: A holds y; B holds w; C holds x and z; D is empty]
  14. DID IT WORK? Compare algorithm to human(s) to measure accuracy. Here “z” was incorrectly classified. [Diagram: two side-by-side classifications that agree on y in A, w in B, and x in C; one puts z in C, the other puts z in D]
  15. ERROR RATE: We were wrong 25% of the time. What happens when we add more data? [Chart: 75% correct, 25% wrong]
  16. SCALE TO BIG DATA: We just make the same mistakes on a larger scale. [Charts: 75% correct and 25% wrong at both the small and the large scale]
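
The point of slides 14 to 16 can be checked with a few lines of Python. This is only a sketch: the four labels come from the slides' toy example, while the "scaled" data set is a hypothetical stand-in that simply repeats those four items 1,000 times. The function names `accuracy` and `error_count` are made up for illustration.

```python
def accuracy(predicted, truth):
    """Fraction of items whose predicted category matches the human label."""
    correct = sum(1 for item in truth if predicted[item] == truth[item])
    return correct / len(truth)


def error_count(predicted, truth):
    """Number of items the algorithm placed in the wrong category."""
    return sum(1 for item in truth if predicted[item] != truth[item])


# Toy example from the slides: the algorithm puts "z" in C, the human puts it in D.
human = {"w": "B", "x": "C", "y": "A", "z": "D"}
algorithm = {"w": "B", "x": "C", "y": "A", "z": "C"}

# Hypothetical scale-up (an assumption): 1,000 batches that look like the first one.
COPIES = 1000
big_human = {f"{item}-{i}": label
             for i in range(COPIES) for item, label in human.items()}
big_algorithm = {f"{item}-{i}": label
                 for i in range(COPIES) for item, label in algorithm.items()}

small_accuracy = accuracy(algorithm, human)
big_accuracy = accuracy(big_algorithm, big_human)
small_errors = error_count(algorithm, human)
big_errors = error_count(big_algorithm, big_human)

print(f"4 items:     {small_accuracy:.0%} correct, {small_errors} wrong")
print(f"4,000 items: {big_accuracy:.0%} correct, {big_errors} wrong")
```

The accuracy rate stays at 75% while the absolute number of mistakes grows with the data, which is the sense in which scaling alone does not help.
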
  17. CAN MORE DATA HELP? Can bigger data help us? In some ways.
     • It can enable more types of analysis
     • It can enable analysis of more categories
     • It can provide more raw material for training and validation
     What about accuracy? [Diagram: categories A, B, C, D, E, F]
  18. HUMAN SCALE: More training usually improves accuracy, but we need not just more data, but more humans. Humans don’t scale.
  19. FEEDBACK: For some applications, users can implicitly provide feedback through their use, e.g. ad placement; spam detection. But this isn’t possible in all cases, and you can’t be too wrong to begin with.
  20. BOOTSTRAPPING: We can also feed the classified items back into the training set (no human intervention). Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.
  21. BOOTSTRAPPING RESULT: The more data you have, the more you can classify. The more you classify, the more training data you obtain. The more training data, the more accurate the results. And we didn’t have to scale the human involvement. [Diagram: categories A to D accumulating additional classified items]
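
The bootstrapping loop on slides 20 and 21 might look roughly like the sketch below. The slides do not say which classifier is used, so a one-dimensional nearest-centroid model stands in here as an assumption, and every number (the seed examples, the unlabeled items, the "true" labels) is invented for illustration. The improvement this toy shows is by construction and is not evidence for any real system.

```python
from statistics import mean


def train(examples):
    """Nearest-centroid 'model': the mean feature value of each category."""
    return {label: mean(values) for label, values in examples.items()}


def classify(model, value):
    """Assign the category whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(value - model[label]))


def bootstrap(seed, unlabeled):
    """Classify unlabeled items, add them to the training set, and retrain."""
    model = train(seed)
    grown = {label: list(values) for label, values in seed.items()}
    for value in unlabeled:
        grown[classify(model, value)].append(value)  # no human involved
    return grown, train(grown)


def accuracy(model, truth):
    """Share of items whose predicted category matches the true label."""
    hits = sum(1 for value, label in truth.items() if classify(model, value) == label)
    return hits / len(truth)


# Hypothetical human-labeled seed data and unlabeled items (all made up).
seed = {"A": [3.0, 4.0], "B": [9.0]}
truth = {1.0: "A", 2.0: "A", 5.5: "B", 6.0: "B", 7.0: "B", 8.0: "B"}
unlabeled = list(truth)

seed_model = train(seed)
grown, bootstrapped_model = bootstrap(seed, unlabeled)

before = accuracy(seed_model, truth)
after = accuracy(bootstrapped_model, truth)
training_size = sum(len(values) for values in grown.values())

print(f"seed model:         {before:.0%} correct")
print(f"bootstrapped model: {after:.0%} correct, {training_size} training items")
```

In this toy run, two items (5.5 and 6.0) enter the training set with the wrong label, yet the retrained model still classifies one of them correctly afterwards. That matches the slide's remark that incorrect classifications in the training set do not necessarily hurt, though nothing guarantees this outcome in general.
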
  22. INDIVIDUAL VS. AGGREGATE: So far we’ve considered classification of individual items. This is the conventional machine-learning approach. [Diagram: items w, x, y, z each assigned to one of categories A, B, C, D]
  23. INDIVIDUAL VS. AGGREGATE: What if we want to know the size of each category, rather than which items are in which category? e.g. epidemiology, polls, market research. [Diagram: A 25%, B 25%, C 50%, D 0%]
  24. INDIVIDUAL VS. AGGREGATE: When considered individually, there’s a limited amount of information we have about each item. As a result, there will be limited correlation with the training data, and therefore poor accuracy. [Diagram: each of w, x, y, z judged separately as A, B, C, or D; 75% correct, 25% wrong]
  25. INDIVIDUAL VS. AGGREGATE: When considered in the aggregate, there’s much more data correlating with the training data for each category. As a result, we can make more accurate estimates of the category proportions. [Diagram: w + x + y + z taken together to estimate the percentages of A, B, C, D; 85% correct, 15% wrong]
  26. INDIVIDUAL VS. AGGREGATE: Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data. [Diagram: s + t + u + v + w + x + y + z taken together to estimate the percentages of A, B, C, D; 95% correct, 5% wrong]
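
One way to see why aggregate estimates can improve with volume is the sketch below. It does not reproduce Crimson Hexagon's algorithm, which the slides do not describe. It swaps in a simpler, well-known stand-in, the "adjusted classify and count" correction for a two-category problem, and assumes the classifier's true-positive and false-positive rates (`TPR`, `FPR`) are already known from a fixed set of human-labeled examples. All numbers are invented.

```python
import random


def simulate_flagged(n, true_share, tpr, fpr, rng):
    """Count how many of n items a noisy classifier flags as positive."""
    flagged = 0
    for _ in range(n):
        is_positive = rng.random() < true_share
        flag_probability = tpr if is_positive else fpr
        if rng.random() < flag_probability:
            flagged += 1
    return flagged


def classify_and_count(flagged, n):
    """Naive estimate: the share of items the classifier flagged."""
    return flagged / n


def adjusted_count(flagged, n, tpr, fpr):
    """Correct the naive share using the classifier's known error rates."""
    estimate = (flagged / n - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # keep the result a valid proportion


# Assumed values: 30% of items are truly positive, and the classifier is
# right about any single item 75% of the time, whatever the data volume.
TRUE_SHARE, TPR, FPR = 0.30, 0.75, 0.25
rng = random.Random(0)

results = {}
for n in (100, 10_000, 200_000):
    flagged = simulate_flagged(n, TRUE_SHARE, TPR, FPR, rng)
    results[n] = (classify_and_count(flagged, n),
                  adjusted_count(flagged, n, TPR, FPR))
    naive, adjusted = results[n]
    print(f"n={n:>7}: naive {naive:.3f}, adjusted {adjusted:.3f}, true {TRUE_SHARE:.3f}")
```

Per-item accuracy stays at 75% throughout, so counting individual classifications settles near 40% rather than the true 30%. The corrected aggregate estimate, by contrast, tightens around the true share as more items arrive, without any additional human labeling.
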
  27. CONCLUSION
     • Bigger data is important
     • Better data is important
     • Better algorithms are important
     • The sweet spot is when one leverages the other
     Bigger data can lead to better algorithms.
  28. QUESTIONS?