copyright 2015
Big data debunking some of the myths

A look at the practicalities of big data and why algorithms (and those who understand them) are important

Published in: Software

  1. Big Data: debunking some of the myths
     Chris Swan, @cpswan
  2. Agenda
     • My background
     • What do I mean by big data?
     • Know your algorithm
     • Know your data
     • Performance
  3. My background
     • CTO
     • CTO Client Experience
     • Co-head CTO Security
     • Corporate Finance
     • fintech, early stage
     • IT R&D – Networks and security
     • Grid, app server engineering
     • Combat System Engineer
  4. Recent adventure with Big Data
  5. Misquoting Roger Needham
     "Whoever thinks their analytics problem is solved by big data, doesn’t understand their analytics problem and doesn’t understand big data"
  6. What do I mean by ‘big data’?
  7. Overview
     Based on a blog post from April 2012 – http://is.gd/swbdla
     [Chart: Problem Types. Axes: Algorithm Complexity, Data Volume. Regions: Simple, Big Data, Quant]
  8. Simple problems
     Low data volume, low algorithm complexity
     [Chart: Problem Types, as on slide 7]
  9. Quant Problems
     Any data volume, high algorithm complexity
     [Chart: Problem Types, as on slide 7]
  10. Big Data Problems
     High data volume, low algorithm complexity
     [Chart: Problem Types, as on slide 7]
     Types of Big Data Problem:
     1. Inherent
     2. More data gives better result than more complex algorithm
  11. The good, the bad and the ugly of Big Data
     Good
     - Lots of new tools, mostly open source
     Bad
     - Term being abused by marketing departments
     Ugly
     - Can easily lead to over-reliance on systems that lack transparency and ignore specific data points
     - 'Computer says no', but nobody can explain why
  12. It’s important to know your algorithms
  13. Turning an assumption into a line
  14. There are lots of algorithms to understand
  15. Statisticians
  16. Quants
  17. Data scientist
  18. It’s also important to know your data
  19. Whatever we call our ‘experts’
  20. Who’s heard of Anscombe’s quartet?
  21. Same statistical properties, but…
     http://en.wikipedia.org/wiki/Anscombe's_quartet
  22. Performance
  23. Don’t agonise over distros
     "The performance of Hadoop distros are all the same to within 1 server within a cluster"
     Stefan Groschupf, one of the creators of Hadoop
  24. Small is still beautiful
  25. Because latency
  26. In terms of distance
     http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm
  27. Interactive > Real time
  28. Questions?
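
The Anscombe's quartet slides (20 and 21) make the "know your data" point with four small datasets that share almost identical summary statistics yet look nothing alike when plotted. The sketch below is not from the deck; it is a minimal illustration that uses the standard published values of the quartet and computes the mean, least-squares line and correlation for each set by hand. The names QUARTET, summarise and SUMMARIES are invented for this example.

```python
from statistics import mean

# Anscombe's quartet (standard published values). Sets I-III share x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return means, least-squares slope/intercept and Pearson r for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "slope": slope,
        "intercept": my - slope * mx,
        "r": sxy / (sxx * syy) ** 0.5,
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARIES.items():
    # Rounded for display; the four rows come out almost the same.
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Each dataset fits a line of roughly y = 3 + 0.5x with a correlation near 0.82, which is why the summary numbers alone cannot tell the four apart; plotting the points is what reveals the curve in set II and the outliers in sets III and IV.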
