Data Science @ Scale
@davidcoallier
Part of an amazing team at Barricade.io
Data Science is
Hard
Data Hacking is
“Easy”
Data Analysis is
“Easy”
Data Expertise is
“Easy”
Got all three?
Having all three is really hard!
Is that it?
Well, don’t forget your purpose.
You are not an economist.
/ɪˈkɒnəmɪst/: Someone with all the answers, and none of the questions.
The Data Scientific
Method
Find a question.
Use the data you have
Features & Tests
Analyse Results
You will be sad.
Conversate
Talk about your findings.
Good Chats
Imply egoless and collaborative data scientists.
Recap.
1. Hacking
2. Maths & Stats
3. Expertise
And
1. Question
2. Be Pragmatic
3. Features
4. Analyse
5. Share.
A team!
Rarely a single-person effort.
An Example
Fraud Prevention — Business Prevention
I knew better.
Obviously… duh
We didn’t share.
Science has historically been shared.
Not with p-values
Empathise.
Use human language, not lingo.
For us at
Barricade
Doing this at
scale is hard.
We’re still small
About a billion data points a day.
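To put that figure in perspective, here is a quick back-of-the-envelope calculation. It assumes the load is spread evenly over the day, which real traffic rarely is, so treat the result as an average and not a peak.

```python
# Average ingest rate implied by "about a billion data points a day".
# Assumption: an even spread over 24 hours (real traffic is bursty).
SECONDS_PER_DAY = 24 * 60 * 60

points_per_day = 1_000_000_000
avg_per_second = points_per_day / SECONDS_PER_DAY

print(f"~{avg_per_second:,.0f} data points per second on average")
```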
Humble Beginnings
Typically… a Queue and an API.
This had issues.
Hard to scale, hard to decouple, etc.
Enter the
Lambda Architecture.
Speed Layer
Batch Layer
Speed Layer: ∪ new behaviour from new data
Batch Layer: All classified behaviour since T
Serving Layer
Serving Layer: Batch Layer ∪ Speed Layer
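The three layers can be sketched in a few lines. This is a toy model and not Barricade's implementation: the event fields (`ts`, `source`, `label`), the labels, and the `last_batch_run` cutoff are all assumptions made for illustration.

```python
# Toy Lambda Architecture: the serving view is the union of a batch view
# and a speed view. All names and fields here are hypothetical.

def batch_view(events, last_batch_run):
    """Classified behaviour for everything the last batch job has seen."""
    return {e["source"]: e["label"] for e in events if e["ts"] <= last_batch_run}

def speed_view(events, last_batch_run):
    """Behaviour from data that arrived after the last batch job."""
    return {e["source"]: e["label"] for e in events if e["ts"] > last_batch_run}

def serving_view(batch, speed):
    """Union of both views; the speed view wins on conflict because it is newer."""
    return {**batch, **speed}

events = [
    {"ts": 1, "source": "10.0.0.1", "label": "benign"},
    {"ts": 2, "source": "10.0.0.2", "label": "benign"},
    {"ts": 3, "source": "10.0.0.1", "label": "suspicious"},  # arrived after the batch run
]

batch = batch_view(events, last_batch_run=2)
speed = speed_view(events, last_batch_run=2)
serving = serving_view(batch, speed)
```

When the next batch job runs, it recomputes the batch view from all the data, and the speed view only has to cover whatever arrived since.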
Cache Layer
On AWS
Identifying an
Attack.
Ahh! What’s that?
Kafka Queue.
Distributed messaging system
Append-only log
Consumers have offsets
Partition for parallelism
Replicate for redundancy
Message order guaranteed, per-partition
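The properties above can be illustrated with a small in-memory model. This is not the Kafka client API: `ToyLog` and `ToyConsumer` are hypothetical names, CRC32 stands in for Kafka's own partitioner, and replication is left out entirely.

```python
import zlib

class ToyLog:
    """Append-only log split into partitions (in-memory, no replication)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always maps to the same partition, so records for
        # one key keep their order. Order across partitions is not defined.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1  # (partition, offset)

class ToyConsumer:
    """Each consumer tracks its own offset per partition; reading never deletes."""

    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)

    def poll(self, partition, max_records=10):
        start = self.offsets[partition]
        records = self.log.partitions[partition][start:start + max_records]
        self.offsets[partition] = start + len(records)
        return records

log = ToyLog(num_partitions=3)
p1, o1 = log.append("customer-a", "request-1")
p2, o2 = log.append("customer-a", "request-2")

consumer = ToyConsumer(log)
first_poll = consumer.poll(p1)   # both records, in append order
second_poll = consumer.poll(p1)  # nothing new since the last offset
```

Because the log is never mutated by reads, a second consumer starting at offset 0 sees the same records again, which is what makes replaying data into a batch layer possible.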
Barricade
Customer
Questions?
@davidcoallier
@barricadeio

Data Science at Scale @ barricade.io