Michael Roytman
Data Scientist, Risk I/O

My name is Michael Roytman and I’m the data scientist at Risk I/O. I came to Risk I/O just when we were starting to build out our predictive analytics functionality, and I have been wrangling that process like a wild animal ever since.

When I started, we didn’t have Hadoop clusters, we didn’t have Hive or Spark, didn’t have custom machines running in our own data center, and didn’t even think about machine learning. I’m proud to say that my data science operation has kept everything that way… except we recently got rid of MongoDB.

Such wow, so few science, you might say. You’d be wrong. In fact, very intentionally limiting our complexity, both statistical and technological, while still generating actionable insights and new near-real-time data products is our biggest win to date.
Less Is More: Behind The Data at Risk I/O

I don’t like dealing in platitudes. Taken at face value, this is a meaningless statement, and even in something as simple as security, nothing is quite so black and white.

But today I want you to consider the situations and impacts of resisting the current trends: “more data, more data science, more Hadoop”. Yes, it’s 2014, and yes, we need new methods for generating value from our data. But that doesn’t mean we need to buy tulips to do it.
Less Tools
Less Data Scientists
Less Data
Less Model Complexity

More Impact

There are four contentions to my rant.
First, there is a propensity to implement complex and costly-to-maintain tools without the explicit need for them. I am versed in Hadoop lore, but there’s only a slight difference between foreseeing needs and overspending.
Second, data is everywhere and so are data scientists. It’s surprising and wonderful how much data-driven work can be done within an existing environment once the directive is given.
Third, collecting all the things is great, but data comes at a cost - the cost of storage, the cost of cleaning, the cost of complexity. It is important to know which questions you’re answering before you begin.
And lastly, all of these “constraints” are actually blessings. Much like our beloved Twitter and its 140 characters, limiting the scope and the tools makes for a much more precise and useful product.

Say “Big Data” One More Time
As resistant as people are to change, I hear about a lot of organizations jumping the gun on technological or organizational change.

I want to explain an ongoing trend I like to call “knee-jerk Hadoop”. Tons of organizations think they’ll get a fast datastore or a mature analytics practice just because they hired 5 people to run a Hadoop cluster.

A horn and a horse does not a unicorn make. Real efficiency comes from understanding the run-time complexity of your backend and making it as efficient as possible - which takes time and specific knowledge of the system.
Cautionary Tales

A great example of this is Value America, of dot-com bubble fame. (There’s a similar Groupon tale, but that one’s still ongoing.) Value America was one of the first just-in-time models, connecting customers directly to manufacturers, much like Dell. It was backed by Microsoft and FedEx founders. At the peak of their success, they were hiring 100 people a month for over a year. When others caught on to the model and hard times hit, they fired 300 people and continued to fire 100 a month - and the morale and communication impacts from the layoffs were the cause of bankruptcy a year later. This is the cost of a bad forecast that changes organizational and technical structure.
“It don’t matta if you win by an inch or a mile. Winning’s winning.”
-Vin Diesel, The Fast and the Furious, 2001
Winner, Best Movie, MTV Movie Awards

At Risk I/O we launched our first predictive models back in March of 2013 - they were crude predictions of priority vulnerabilities. They weren’t even written by a proper developer. They ran on a Ruby backend, took half a day to compute, and even longer to index. They pulled from Mongo, used that to look up aggregations in MySQL, did calculations in Ruby, and pushed back up to Mongo. It was horrible, but it worked. One year later, the same method runs every half an hour, and indexing takes minutes.

We’ve expanded the scope of the model, and it includes 4 times as many inputs. Our code is smarter, our algorithms ignore duplication at every step, we’ve scrapped Mongo, and we’ve modified the query DSL for Ruby to fit our needs. We could have easily said “this is slow. Ruby sucks. Bring me Hadoop with a side of Python please”.
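
The notes don't say how the duplication is avoided. One common way, sketched here in Python purely for illustration (the pipeline described above is Ruby), is to run the expensive step once per distinct input and reuse the result. The names `score_all`, `toy_score`, and the placeholder CVE identifiers are hypothetical.

```python
def score_all(findings, score_cve):
    """Score each distinct CVE once, then reuse the result for every finding."""
    cache = {}
    results = []
    for asset, cve in findings:
        if cve not in cache:
            cache[cve] = score_cve(cve)  # expensive step runs once per distinct CVE
        results.append((asset, cve, cache[cve]))
    return results


calls = []  # records how often the expensive step actually runs


def toy_score(cve):
    """Stand-in for a real scoring function; the value itself is meaningless."""
    calls.append(cve)
    return len(cve)


# Placeholder identifiers, not real vulnerabilities.
findings = [
    ("host-1", "CVE-0000-0001"),
    ("host-2", "CVE-0000-0001"),
    ("host-3", "CVE-0000-0002"),
]
scored = score_all(findings, toy_score)
print(len(scored), "findings scored with", len(calls), "scoring calls")
```

Three findings need only two scoring calls here. The same idea applies at each stage where many rows share one underlying input.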

Here’s what we gained by NOT doing that:
1. We’ve saved on infrastructure costs.
2. We took the time to understand exactly what the algorithms are doing, where the deltas are, and what kind of behavior we can expect moving forward.
3. We didn’t spend time and energy hiring, and then tasking our engineers with knowledge transfer.
4. We gained a clean and easy-to-expand analytics module which requires no specialized skills to work with. If my plane crashes tomorrow into the side of Ed Bellis’s house, a 20-year-old without prior CS knowledge could scale the algorithms.

And most importantly, we can still make that move when we need to. By now, we have Spark and Julia. New tools have evolved that might serve our purpose better and put us ahead of the status quo.
Everyone is a Data Scientist
Don’t Save For Tomorrow What You Can Do In Excel Today.

But also don’t use Excel. I come from an academic background, and I am well versed in R and, more recently, Wizard, a tool that I encourage everyone to look into. However, data science work is largely domain knowledge, and I have the least domain knowledge of anyone. If I ask Andrea, our marketing manager, the right set of questions, she can do 80% of the work in Google Analytics or KISSmetrics. Even Ed Bellis knows how to write a SQL query.

I am the only data scientist at Risk I/O, but our data science operation is closer to 3 people: 100% of my time, a quarter of the CEO’s, CTO’s, and marketing’s time, and another 10-20% of a security architect and a developer.
All this to say, it’s important to look around before expanding data science, and to recognize the importance of specialized domain knowledge.
Take Only What You Need

Not all data is good data. Not all good data is useful data. The New York Times and BusinessWeek view of data science is that you collect a slew of data, unleash a hipster from Brooklyn on it, and voila, insight!

I disagree. It is much leaner (in the Deming sense of cutting out useless movements) to first ask the right questions, then collect the right data, and then generate the right answer.

Here’s how that works at Risk I/O. We attempt to solve the contextual problem of which vulnerabilities put an enterprise most at risk. We have access through partner channels and public data to every kind of security data under the sun - yet, when deciding what data partnerships to pursue or which data to use, we have a very strict set of criteria that filters out the noise for us BEFORE we get into the hard work.

Here’s an example of just ONE data source integration I did at the end of last year: [168x167 system of equations per CVE live stream row echelon reduction].

Making quality decisions before you start the process is fundamental to the quality control methods pioneered by Taguchi and Toyota. The same applies to data cleaning. So we only take active attacks, active breaches, or data that we can turn into the two. We don’t care about IP reputation data, malware analysis, or data that overlaps with a public source. That’s because we can’t afford to row reduce your live stream only to find out it’s useless.
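
The screening rule above can be written down as a small filter that runs before any cleaning starts. This is a minimal Python sketch, assuming each candidate feed can be described by a few labels. The `DataSource` fields, the kind names, and the example feeds are all hypothetical, not Risk I/O's actual taxonomy or partners.

```python
from dataclasses import dataclass

# Hypothetical labels for what a candidate feed contains.
ACCEPTED_KINDS = {"active_attacks", "active_breaches"}


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str                              # e.g. "active_attacks", "ip_reputation"
    convertible_to_accepted: bool = False  # can it be turned into attack or breach data?
    overlaps_public_source: bool = False   # does a public source already cover it?


def worth_integrating(source: DataSource) -> bool:
    """Apply the screening criteria before any cleaning work is spent on a feed."""
    if source.overlaps_public_source:
        return False
    return source.kind in ACCEPTED_KINDS or source.convertible_to_accepted


# Invented candidates, for illustration only.
candidates = [
    DataSource("feed_a", "active_breaches"),
    DataSource("feed_b", "ip_reputation"),
    DataSource("feed_c", "malware_analysis"),
    DataSource("feed_d", "exploit_telemetry", convertible_to_accepted=True),
    DataSource("feed_e", "active_attacks", overlaps_public_source=True),
]

selected = [s.name for s in candidates if worth_integrating(s)]
print(selected)
```

With these invented candidates, only `feed_a` and `feed_d` pass. The point is that the check is cheap and happens before the expensive integration work.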
Transparency, et al.

There are huge wins from model simplicity too.
A. Fix What Matters story - refer to www.risk.io/data-driven-security
Wins:
1. Transparency
2. Ease of implementation
3. Feedback loop on tools, data.
[Figure: Probability a vuln having property X has observed breaches. Bars: Random Vuln, CVSS 10, CVSS 9, CVSS 8, CVSS 6, CVSS 7, CVSS 5, CVSS 4, Has Patch. Axis: 0.000 to 0.040.]
[Figure: Probability a vuln having property X has observed breaches. Bars: Random Vuln, CVSS 10, Exploit DB, Metasploit, MSP+EDB. Axis: 0.0 to 0.3.]
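
The quantity on these charts is a simple conditional probability: of the vulnerabilities that have property X, what share have observed breaches? Here is a minimal Python sketch of that calculation, assuming each record carries a set of properties and a breach flag. The record layout and the sample rows are made up for illustration and are not Risk I/O data.

```python
from collections import defaultdict


def breach_probability_by_property(vulns):
    """Return {property: share of vulns with that property that have an observed breach}."""
    totals = defaultdict(int)
    breached = defaultdict(int)
    for vuln in vulns:
        for prop in vuln["properties"]:
            totals[prop] += 1
            if vuln["breach_observed"]:
                breached[prop] += 1
    return {prop: breached[prop] / totals[prop] for prop in totals}


# Made-up records purely for illustration.
sample = [
    {"properties": {"cvss_10", "metasploit"}, "breach_observed": True},
    {"properties": {"cvss_10"}, "breach_observed": False},
    {"properties": {"cvss_10"}, "breach_observed": False},
    {"properties": {"cvss_10", "metasploit"}, "breach_observed": False},
    {"properties": {"has_patch"}, "breach_observed": False},
]

probabilities = breach_probability_by_property(sample)

# The "random vuln" baseline is the share of all vulns with an observed breach.
baseline = sum(v["breach_observed"] for v in sample) / len(sample)

for prop, p in sorted(probabilities.items()):
    print(f"{prop}: {p:.2f}")
print(f"random vuln: {baseline:.2f}")
```

Comparing each property's share against the random-vuln baseline is what makes a chart like this readable: a property is only interesting if it beats picking a vulnerability at random.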
Know What You’re After

We have recently been working on a model for risk assessment, the technical documentation for which some of you have seen and which we’ll be releasing shortly. This is a lot more involved than finding one risk factor on a vulnerability - but we’ve structured the effort in a similar manner. We collected a subset of data we knew would be relevant ahead of time. We used the tools and the expertise at our disposal to explore the data. Most of my work in creating the model was done on paper, messing around with algebraic equations until they were simple enough to be understood easily without losing their value.
Holler!
@mroytman
www.risk.io

db.risk.io
