copyright 2015
Big Data: debunking some of the myths
Chris Swan
@cpswan
Agenda
• My background
• What do I mean by big data?
• Know your algorithm
• Know your data
• Performance
My background
CTO
CTO Client Experience
Co-head CTO Security
Corporate Finance
fintech, early stage
IT R&D – Networks and security
Grid, app server engineering
Combat System Engineer
Recent adventure with Big Data
Misquoting Roger Needham
Whoever thinks their analytics problem is solved by big data, doesn’t understand their analytics problem and doesn’t understand big data
What do I mean by ‘big data’?
Overview
Based on a blog post from April 2012 – http://is.gd/swbdla
[Figure: Problem Types chart. Axes: Algorithm Complexity and Data Volume. Regions: Simple, Big Data, Quant]
Simple problems
Low data volume, low algorithm complexity
Quant Problems
Any data volume, high algorithm complexity
Big Data Problems
High data volume, low algorithm complexity
Types of Big Data Problem:
1. Inherent
2. More data gives a better result than a more complex algorithm
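The three problem types can be sketched as a small classifier. This is only an illustration of the taxonomy described on these slides: the function name `problem_type` and its boolean inputs are assumptions made here, and where "low" ends and "high" begins is deliberately left open.

```python
def problem_type(high_data_volume: bool, high_algorithm_complexity: bool) -> str:
    """Map the two axes of the Problem Types chart to a region name."""
    if high_algorithm_complexity:
        # Quant: any data volume, high algorithm complexity
        return "Quant"
    if high_data_volume:
        # Big Data: high data volume, low algorithm complexity
        return "Big Data"
    # Simple: low data volume, low algorithm complexity
    return "Simple"


# Walk all four corners of the chart.
for volume in (False, True):
    for complexity in (False, True):
        print(volume, complexity, problem_type(volume, complexity))
```

Checking algorithm complexity first reflects the "any data volume" wording for Quant problems: once the algorithm is complex, data volume no longer changes the category.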
The good, the bad and the ugly of Big Data
Good
- Lots of new tools, mostly open source
Bad
- Term being abused by marketing departments
Ugly
- Can easily lead to over-reliance on systems that lack transparency and ignore specific data points
'Computer says no', but nobody can explain why
It’s important to know your algorithms
Turning an assumption into a line
There are lots of algorithms to understand
Statisticians
Quants
Data scientists
It’s also important to know your data
Whatever we call our ‘experts’
Who’s heard of Anscombe’s quartet?
Same statistical properties, but…
http://en.wikipedia.org/wiki/Anscombe's_quartet
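A minimal sketch of the point behind Anscombe's quartet: the four datasets share almost identical summary statistics, so the numbers alone cannot tell them apart. The data values are the commonly published ones for the quartet; the helper name `summarise` and the dictionary layout are assumptions made for this example.

```python
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Each dataset comes out at roughly mean x = 9, mean y = 7.5, correlation 0.82 and a fitted line of y = 3 + 0.5x. Only plotting the points shows how different the four really are, which is why knowing the data matters as much as knowing the algorithm.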
Performance
Don’t agonise over distros
The performance of Hadoop distros are all the same to within 1 server within a cluster
Stefan Groschupf
One of the creators of Hadoop
Small is still beautiful
Because latency
In terms of distance
http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm
Interactive > Real time
Questions?
