DevOps in a
Machine Learning World
@leonardaustin
As machine learning moves from niche to
mainstream tech stacks, how do DevOps engineers
prepare for a very different set of problems? A brief
look at the new issues that arise from machine
learning, an overview of cutting-edge "old school"
solutions and how to drag data science (kicking and
screaming) into a world of automation.
Leonard Austin
Cofounder at Ravelin
CTO, Software Engineer, DevOps, Recruiter...
@leonardaustin
Ravelin
Fraud Detection. Ravelin examines your visitor and
payment data in real time, telling your systems
which customers are fraudsters. We use Machine
Learning, Rule Engines, Graph Networks and
Industry Expertise to respond with scores in
milliseconds. Perfect for an on-demand world.
Raised $2m last year. Fintech. Hiring
Fraud?
$14B
Lost to fraud
Growing rapidly as fraudsters move online
Detection is Hard
One fraudster leads to lots of cost
3D Secure
3D Secure
Kills Conversion
Stack
Go + Python
AWS
MicroServices
Storage: Cassandra, Postgres, ElasticSearch, Redis, Graph Database X, ZooKeeper
Queue: NSQ, Kinesis
Instrumentation: InfluxDB, Grafana
Docker - but only for local dev
Doing Things The
Right Way
Terraform
100% Automation
Horizontally Scalable
Continuous Integration
No need for SSH access
100% Visibility - Metrics & Logs
Servers & MicroServices
Servers & MicroServices
“Livestock, not pets. It gets sick, terminate it” - DevOps guy on the internet
Machine Learning
Challenges
> Data Warehousing
Resource on Demand
Deploy
Hardware Requirement
Life Cycle
(Explore, Train, Deploy)
Data Warehousing
What?
Why we need it for Ravelin
How much data
$10m
IBM, Oracle, Microsoft
v1
$1m
Massively Parallel Processing - MPP
IBM, Oracle, Microsoft, Teradata, Vertica, Greenplum
v1.5
$200k
Hadoop MapReduce, Spark, Hive, Impala
v2
$500
BigQuery
v3
$5.00
BigQuery per Terabyte
We ♡ BigQuery
Costs - $5 per terabyte, 5c per range query per terabyte
Managed - and no reserved compute resources needed!
Distributed, columnar, easy to append
Dataflow
Restrictions:
Can’t Update
No Indexes
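
As a rough illustration of the $5-per-terabyte figure on this slide, the sketch below estimates what a single query would cost from the number of bytes it scans. The function name, the decimal definition of a terabyte (10**12 bytes) and the example sizes are assumptions made for illustration; this is not Ravelin's tooling and it does not model Google's actual billing rules.

```python
# Back-of-the-envelope query cost at the slide's $5 per terabyte scanned.
# Assumption: 1 TB = 10**12 bytes; real billing may define or round this differently.
BYTES_PER_TB = 10**12


def estimate_query_cost(bytes_scanned, price_per_tb=5.0):
    """Return the estimated cost in dollars for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb
```

Under these assumptions, a query that scans 100 GB comes out at about 50 cents, which is why exploratory queries are cheap enough not to need capacity planning.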
Probably need to mention AWS Redshift
Stack
Go + Python
AWS & Google Cloud Platform
MicroServices
DB: Cassandra, Postgres, ElasticSearch, Redis, Graph Databases, ZooKeeper
Queue: NSQ, Kinesis, Google Pub/Sub
Warehouse: BigQuery, Dataflow
Machine Learning
Challenges
Data Warehousing
> Resource on Demand
Deploy
Hardware Requirement
Life Cycle
(Explore, Train, Deploy)
Work on the Cloud!
“Stephen’s laptop was measurably heavier because of the amount
of data he had on it. We asked him nicely to move everything to
the cloud and now the internet is a little heavier” - Science 2016
Data
“Single point of success”- Jose CTO Hailo 2014
AWS
32 Cores 244GB RAM
Google Cloud Platform
32 Cores 208GB RAM
Azure
16 Cores 112GB RAM
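
To judge whether a training set would fit on one of the machines listed above, a rough estimate is rows x columns x bytes per value. The sketch below assumes a dense matrix of 64-bit floats and ignores library overhead; the function names and the 50% headroom factor are hypothetical choices, not figures from the talk.

```python
# Rough check: does a dense float64 training matrix fit in a machine's RAM?
# Assumptions: 1 GB = 10**9 bytes, 8 bytes per value, and only half of the
# RAM is treated as usable so the training algorithm has working space.
BYTES_PER_GB = 10**9


def dense_matrix_gb(rows, cols, bytes_per_value=8):
    """Size in GB of a dense rows x cols matrix."""
    return rows * cols * bytes_per_value / BYTES_PER_GB


def fits_in_ram(rows, cols, ram_gb, headroom=0.5):
    """True if the matrix fits within `headroom` of the machine's RAM."""
    return dense_matrix_gb(rows, cols) <= ram_gb * headroom
```

For example, 100 million rows by 100 columns is about 80 GB under these assumptions, which fits within half of a 244GB machine but not within half of a 112GB one.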
Machine Learning
Challenges
Data Warehousing
Resource on Demand
> Deploy
Hardware Requirement
Life Cycle
(Explore, Train, Deploy)
Deploying Models
Train - sample
Pickle
S3
Deploy
Simple
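
The train, pickle, S3, deploy flow on this slide can be sketched as follows. This is a minimal illustration, not Ravelin's pipeline: `ThresholdModel`, `InMemoryStore`, `train`, `publish` and `load` are hypothetical names, the "training" is a toy mean threshold, and an in-memory dictionary stands in for the S3 bucket so the example runs without AWS credentials.

```python
import pickle


class ThresholdModel:
    """Toy stand-in for a trained model (hypothetical)."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, amount):
        # Flag any amount above the learned threshold.
        return amount > self.threshold


class InMemoryStore:
    """Stand-in for an S3 bucket: maps object keys to raw bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]


def train(sample):
    """'Train' on a sample: here, just use the sample mean as the threshold."""
    return ThresholdModel(sum(sample) / len(sample))


def publish(model, store, key):
    """Serialise the model with pickle and upload the bytes."""
    store.put(key, pickle.dumps(model))


def load(store, key):
    """Download the bytes and rebuild the model inside the serving process."""
    return pickle.loads(store.get(key))
```

A serving process would call `load` at start-up and keep the model in memory. Unpickling executes code from the payload, so this only makes sense for artefacts you produced yourself and stored somewhere you control.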
Hardware - GPUs
Specific for Deep Learning
AWS have a GPU machine but $$$
No virtualization
Buy and build your own server
Q. How Deep is your problem?
Speech, Video, Images
Summary
Data Warehousing
BigQuery
Dataflow
On Demand Resource
1 Machine (because clustering is expensive)
Big Machines on the Cloud
Persistent Volumes on Google Cloud Compute
Hiring Smart People
DevOps - Mid Level & Senior
Data Scientist - Junior & Mid Level
Software Engineer - Junior, Mid Level & Senior
Product Owner
Thanks
@leonardaustin
@ravelinhq
ravelin.com
leonard.austin@ravelin.com
Remember we are hiring

Leonard Austin (Ravelin) - DevOps in a Machine Learning World

Editor's Notes

  • #3 Machine learning is becoming commonplace and SaaS solutions are coming out. The challenges have changed; I’ll try to cover some of the solutions my company has come up with.
  • #4 But first, who am I, and what authority do I have to speak about DevOps or machine learning? Confession: I’m not really a full-time DevOps engineer, but when we founded the company I had to do a little bit of everything. I had help, but I learnt a lot.
  • #5 So what do we do? Fraud detection: lots and lots of data in real time, using lots of tools. ML -> rule engine, graph networks etc. Raised $2m, growing rapidly. For those of you who like buzzwords, we are “fintech”, and we are hiring (I’ll come back to that later).
  • #6 So, fraud, you say - what do you mean? How big a problem is it really? How much data are you talking about?
  • #7 How can this be? “I’ve never lost money to fraud.” Merchants do; in fact they lose most of the above figure. I’m sure 90% of people in this room have had a transaction declined or a card blocked because you were overseas, or for some completely unknown reason. That is because of fraud. “Chip and PIN stopped fraud, right?” Sure, kind of, but that $14B just moved online. The USA has only moved to chip this year (but not the PIN).
  • #8 Detecting fraud is hard. (There is an N in all of the Ms.) Most startups out there sign up customers and have drop-outs - not the hockey stick of Silicon Valley - and are constantly messing around with their funnel to find those evangelical customers. Fraudsters look like your best customer ever: they sign up and start spending money, lots of it! The cost of one fraudster to your business could be as much as …...
  • #9 100 real customers (depending on your margins). So how do you stop fraudsters? Glad you asked. It’s a good job we have cutting-edge companies like Visa, Mastercard, Amex and all those trustworthy banks on it. With all their might, their solution…..
  • #10 3D Secure! I’m sure 100% of the people in this room have seen this page before. Awesome, no more fraud. I mean, fraudsters don’t know your secret password, so job done. For those who have been looking for the N, it is….
  • #11 here…. So Ravelin might as well pack up and go home; the banks have solved the issue for all of us. I mean, everyone remembers that random password you set up once 18 months ago whilst trying to buy a stupid wedding list gift. A password you only use every 3 months. Problem is, neither you nor any other customer can remember it! So conversion drops….
  • #12 Typically 20 - 25% on websites. I have spoken to merchants who experience a 50% dropout rate on mobile!!! So that’s the history; so what tools are we using that the banks are not…. (except Mondo of course).
  • #13 Go is a binary you compile and put somewhere. DevOps need to get on this; life is better. Got to make room for Python: basically, the machine learning libraries are all in Python. AWS: obviously. Microservices: obviously, because we are a startup and we are cool. Storage: lots of different databases for specific needs; the right DB for the job. Instrumentation: guys, do it! It is so useful; if something goes wrong it is the first thing we look at. Docker: not a fan (as you get most of the workflow improvements from Go anyway), but we do use it for local dev, which it is awesome at. Could rant about Docker for 20 mins but need to move on.
  • #14 Terraform: so much better than CloudFormation, but infrastructure as code in general gets a big thumbs up! 100% automation: kill a box and it comes straight back up again. Spin up everything at the click of a button, right? CI: always be building. We just moved to a mono-repo, but that is a talk for another time. SSH: if someone SSHes into prod, alarms should be sounding. In this day and age you shouldn’t need access. Metrics: guys, get on this. How many of you have metrics? But seriously, a word about servers...
  • #15 This is not a server (or microservice) - it’s a puppy, aka a pet. Never name your servers…
  • #16 treat them like livestock, with numbers. Infrastructure is a working farm, not a house. Servers/services are livestock. True story: we use ZooKeeper. It’s very important to us; it deals with service discovery and global locking. Last Wednesday AWS decided to terminate one of our ZK nodes, for no reason. Those of you who have worked with ZK will know that as long as you have quorum (2 out of 3) you’ll be OK, and we were. Not only that, a new ZK box came online and rejoined the cluster without any manual help. I mean, we were worried, and 3 of us in the office were looking at our metrics page for 30 mins straight, but it just worked.
  • #17 So those are our DevOps beliefs at Ravelin. What I want to cover now is the specific issues with machine learning and how we solved them. Let’s start with data warehousing.
  • #18 What: We have databases for operational needs, e.g. Postgres, Cassandra etc., for real-time requests from services. We don’t want to unleash our data scientists and their non-optimised queries on them. Why: Exploratory work on real data without impacting production databases. Query anything: unlimited resources for long-running queries. How much data: Terabytes of data. If I had 100GB I would be tempted to move to BQ. I want to walk through the history of data warehousing. 15 - 20 years ago...
  • #19 Per year. Then, you know, 10 - 15 years ago it came down in price a lot.
  • #20 I’m calling this v1.5. And v2 is what I assume all of you guys are used to..
  • #21 The licence is free, but devs, servers and consultants cost a pretty penny. Anyone here a Hadoop contractor? Bet you own a house in London. Anyone want to guess how much v3 costs?
  • #24 We had the ability and skill to build our own cluster, but: we don’t need to plan for capacity because we have on-demand resource; maintenance time; DevOps time; only pay when you use; Dataflow.
  • #26 One thing I would say is, we have a really good account manager at Google who was a huge help. If you guys are serious and have big data ping me and I will personally introduce you.
  • #29 We know ML can work on distributed systems, however it is complex. E.g. some algorithms require super-fast network cards. But the majority of the algorithms are built for a single machine, and you can just throw loads of memory at it. 37signals etc. - just scale up.
  • #32 So RAM is all good, but GPUs are another kettle of fish. A GPU is good for a specific
  • #33 When I say expensive, I mean in terms of money but also time
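
The ZooKeeper story in note #16 rests on majority quorum: the ensemble keeps working as long as more than half of its nodes are up, which is why losing one node out of three was survivable. A minimal sketch of that arithmetic follows; the function names are hypothetical and nothing here calls ZooKeeper itself.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can be lost while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)
```

With three nodes the quorum is two, so one failure is tolerated. A fourth node raises the quorum to three without tolerating any more failures, which is why odd ensemble sizes are the usual choice.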