1. Automate to enhance human capabilities, not replace them, drawing inspiration from Iron Man's suit rather than Ultron.
2. Make it easy for users to explore patterns in data to understand security issues at both the local and systemic level.
3. Prioritize an open approach that encourages collaboration and community improvement over proprietary solutions.
3. Please don’t worry about trying to absorb everything on the slides.
Instead, view this talk as field notes for the deck, which you can refer back to later.
11. - What is the system?
- What is its purpose?
- Where is it hosted?
- Who is the business owner?
- Who is the technical / support owner?
- Who are the admins?
- Who owns the unnamed and named accounts?
- What access and activity logs are available?
- What controls cover the system?
- Is there an on-call support team?
- Who is the vendor point of contact?
- What are their SLAs for incident response?
- What data is in the system?
- What regulatory reporting requirements do we need to be aware of?
- What is the business impact if the system needs taking offline?
- What evidence do we need to capture?
- What is the threat model for how a threat actor could have got to the system?
- What is the threat model for where a threat actor can go next?
As you press the button to call up the elevator, your brain starts to cycle through all
the things your team will need to find out.
But it’s cold comfort that you’ve dealt with incidents like this before.
12. Because the first thing you did when you started at Digicorp was ask everyone about
the firm’s most critical applications.
And this is the first time you’ve heard anyone talk about ‘Sunways’.
Past experience tells you that you’re going to need to pull a picture together with
only a few pieces to go on.
13. You know where the rest of them are likely to be though.
In all their fragmented, partial, and inaccurate glory.
14. And so it begins.
The blur of phone calls, emails, and messages.
24. Due to these control gaps and failures,
this threat actor
exploited these vulnerabilities,
to gain access to this IT system,
(which generates £xx amount of revenue
and represents a vital trade secret),
compromising its confidentiality, integrity and availability,
which led the business to experience these impacts,
realising these risks,
leading to a total loss amount of $n.
And as you think about the story your presentation told (in that hastily put together
slide deck), part of you is just glad that the last few days are over.
25. Relief that you managed to join the dots.
But another part of you is frustrated.
You know the mental model the team’s just built up will fade from corporate
memory.
More slowly for some people, sooner for others.
27. Until eventually it returns to the fragmented state you found it in 8 days ago.
28. @CxOSidekick
It’s a privilege to be back at the Ground Truth track.
A huge thank you to Gabe (@gdbassett) and Urban (@UrbanSec) for curating this
amazing space at BSides, where we can share ideas at the intersection of data science
and security.
29. Building an enterprise security
knowledge graph to fuel better
decisions, faster
In the next 45 minutes, we’re going to look at how knowledge graphs can help
security teams address the problems we’ve just touched on in our make-believe
incident scenario.
And also how we can flip the script on ‘thinking in lists’, to reap the rewards of
thinking - and operating - in graphs.
30. Hypothesis
This talk is the product of 9 months’ work in a ‘live’ operational environment, testing a
hypothesis, which runs as follows.
31. To solve the problems we face, we need to be able to join
all the component parts that relate to security,
across business and technical dimensions,
in a scalable knowledge graph,
so it’s easy to capture, link, visualise, contextualise, share,
interrogate and update information in seconds,
for Executive, Management and Operations stakeholders.
[Read slide]
32. To solve the problems we face, we need to be able to join
all the component parts that relate to security,
across business and technical dimensions,
in a scalable knowledge graph,
so it’s easy to capture, link, visualise, contextualise, share,
interrogate and update information in seconds,
for Exec, Management and Operations stakeholders.
At its core, this hypothesis focuses on a ‘user need’, which extends far beyond
Incident Response and Security Operations.
35. “I process billions of dollars of transactions per week.
Each quarter my control functions in fraud, compliance
and security bring me a report about what went wrong
and what needs to improve. With instant payments,
PSD2 [revised payments service directive] and all the
fintech we’re adopting, I need to run my business
today, using today’s information.”
- CIO, Global Financial
It affects colleagues in many other areas as well.
As per this quote from a CIO:
Each quarter my control functions bring me a report about what went wrong,
but I need to run my business today, using today’s information.
37. “Organisations function by making decisions, about Why, What and
How, so it’s startling how bad most organisations are at it, and how
easily organisations that get good at decision making find it to
outpace and outmanoeuvre their competition.
Data driven decisions need the right data (scope), correct data
(accuracy), appropriate processing and presentation, and a proper
insertion point into the decision making process.”
Chris Swan, CTO
But even if we solve that problem, we’ve only won half the battle.
Because as per this fantastic blog by Chris Swan, once we’ve mined valuable insights,
they then need ‘a proper insertion point into the decision making process’.
[http://blog.thestateofme.com/2019/07/05/making-better-decisions/]
39. Identifying the right lens for presenting it across the various different levels of the
business.
40. The challenge is not ‘explaining technical
things to people who are non-technical’.
It’s communicating technical data in a way
that’s relevant to the context, concerns,
priorities and accountability of an audience,
whose primary focus is on business targets,
money and time.
Adjusting that lens to provide the right amount of zoom based on the audience's
context, concerns, priorities and accountability.
41. And then of course finding time for people to consume it.
42. This is a problem
Action
Trust
Communication
Insight
Context
Analysis
Transformation
Platform
Pipeline
API
Data
Sensor
This is not a simple problem.
It’s also not a problem that’s unique to the domain of cybersecurity ...
43. Human Interaction,
Data Science and
Data Engineering
This is a problem
… as data engineers and data scientists know only too well, regardless of the
industry they work in.
44. So in this talk, we’re not just going to look at building knowledge graphs.
45. We’re also going to look at how we can create and deliver ‘context-relevant’ and
‘stakeholder appropriate’ interactions with the information they link together - and
which needs updating continuously.
46. Even a stopped clock keeps
the right time twice a day
Is our approach valid?
Is our implementation practical?
Can the concepts we’re working with transfer to other organizations?
47. Please share your opinions and questions with us during and after the talk on Twitter.
48. @DinisCruz
@im_geeg
@CxOSidekick
For all things philosophical and technical, you can ‘@’ Dinis, whose vision started us
down this road.
He also wrote most of the code we’re open sourcing today.
GG has been operationalizing a lot of what we’ll look at and is the best person to go
to for a programmer’s and detection engineer’s perspective on day-to-day usage of
the stack.
And for questions about knowledge graph ontology and user needs, feel free to point
them at me.
49. Side Note
A note of reflection before a probably ill-advised live demo attempt.
50. http://iang.org/papers/market_for_silver_bullets.html
Years ago, the paper “A market for Silver Bullets” was published.
It described a dynamic in which neither buyers nor sellers in cybersecurity had the
information they needed to know what effective solutions looked like for their
problems.
I’d argue this still largely holds true today.
51. “Best practices look at what everyone else is doing, crunch numbers - and
come up with what everyone else is doing. Using the same method, one would
conclude that best practices for nutrition mandates a diet high in fat, cholesterol
and sugar, with the average male being 35 pounds overweight.” - Ben Rothke
I have shown that any deviation from best practices is costly, including towards a
presumed direction of greater security. Only the most profitable of security
measures will produce enough benefit to overcome the cost of breaching the
equilibrium. As the cost of breaching the equilibrium is proportional to the number
of community members, the larger the community, the greater the opportunities for
security are foregone, and the more the vulnerability.
Herding is a Nash equilibrium as well as being rational behaviour; if one player
chooses a new silver bullet, other players do not have a better strategy than sticking
to the set of best practices, and even the player that changes is strictly worse off as
they invite extraordinary costs in the event of a breach. This approach lowers the
more significant extraordinary costs and accepts direct costs as unavoidable and to
be absorbed.
http://iang.org/papers/market_for_silver_bullets.html
If recognising that no one has ‘right answers’ is one step towards surviving in this
industry … the other point the paper asks us to acknowledge is that:
“Any deviation from best practices will incur costs where individual members
go it alone.”
52. Our team are big believers in the value of open source and creative commons.
The continued perpetration of bad API output, 2-dimensional dashboards, endless
.xml joins ...
...and mirror mazes of macros and pivot tables makes it clear we need to collaborate if
we’re to breach this current equilibrium and move from lists to graphs.
54. Side Effects
Yes, as with anything that requires process change, the shift to thinking and operating
in a hyperlinked way does have side effects in cost, time and effort.
55. The 5 stages of graphs
WTF are you talking about?
Ok, I’ll admit, this is intriguing.
I am DONE with this.
Hm, maybe if I just...
There is no graph.
It can also be a remarkably frustrating journey to navigate.
56. And no, not everything you see here will be immediately transferable (or applicable
at all!) to your business.
57. The goal today is not to suggest ‘this is how things should be done’, just to share one
possible path.
So please treat this talk like a meal :)
Eat what you like. Leave what you don’t. Let us know what dishes confuse you. And
what you’d like to see added to the buffet.
59. … as we cast our minds back to the imaginary incident we walked through earlier.
60. The early blur of phone calls, emails, and messages ...
61. … the problem of keeping everyone on the same page ...
62. … and the dots we had to find, then join up.
63. Here’s a different version of how our story could have unfolded.
Let’s imagine that when we joined Digicorp, they’d been building a security
knowledge graph for about 9 months, mostly using readily available data sets.
For example: HR data, application user lists, alerts from endpoint technology and a
few cloud systems ... and good old manual data entry.
64. See demo video here:
https://youtu.be/LjCtbpXQA9U?t=4696
66. Before we go behind the scenes to look at the technology stack that supports a team
in operating this way ...
67. … let’s run through the main design principles that informed our choices about what
we built, and how we built it.
68. Automate like Iron Man not Ultron
We believe code should help us move faster in partnership.
That is to say, the primary function of the tech stack is not to automate away pain
points.
We want it to be like an Iron Man suit, which any member of the team can use to
manage new problems quickly and efficiently as they arise.
Then, once knowledge and analytics have been baked into a process that is stable
and re-usable, we can add in automation.
69. The resulting system felt a lot like Iron Man's suit: One person
could do the work of many, and we could do our jobs better thanks
to the fact that we had an assistant taking care of the busy
work. Learning did not stop because it was a collaborative effort.
The automation took care of the boring stuff and the late-night
work, and we could focus on the creative work of optimizing
and enhancing the system for our customers.
https://queue.acm.org/detail.cfm?id=2841313
This approach draws inspiration from an article titled ‘Automation Should Be Like
Iron Man, Not Ultron’.
The goal is to enable us to focus on the creative work of continuously optimising and
enhancing the data system, which allows us to move faster and do more than we
otherwise would be able to.
[https://queue.acm.org/detail.cfm?id=2841313]
70. Explore patterns
Second, we want to make it easy for anyone who joins our team (and eventually the
wider business) to find and understand patterns in data, both at local and systemic
level.
Rather than providing set ways of looking at something, we want people to be able to
benefit from our knowledge base and way of thinking, and evolve it with theirs.
71. 1. What was the service that was exposed?
2. Does it contain sensitive or regulated data, and if so, what?
3. How long were the creds exposed?
4. Who had access to the JIRA ticket?
5. Who placed the creds in the JIRA ticket?
6. How does that team overall manage credentials?
7. What immediate follow up can minimise exposure?
8. Which stakeholders need to be informed of the exposure?
9. Who will rotate the credentials?
10. Is there any potential business impact from rotating the creds (e.g. to scripts or automation)?
11. Is two factor authentication used by the account?
12. Is two factor authentication available on the service?
13. How many people share this credential?
14. Is the username and password likely to have been available to people who left the company?
15. Are there any access logs available for the service?
16. Are other credentials shared on this service?
17. When was the last access list review performed to see what users were valid in the system?
18. Are all owners of service and admin accounts or generic accounts known and documented?
19. Who is the system's technical owner?
20. Who is the system's management owner?
21. Are the same credentials used for any other system?
22. When was the password last changed (if that info is available) so that we can be confident leavers do not have this info?
23. Will creds continue to be shared and if so, what is the justification?
24. Is the shared credential for an admin account?
25. How old is the oldest non-valid user in the system (i.e. when was their leaving date)?
To do that, we want to be able to take a list of questions like this - which exists in
someone’s head, and may have taken months or years to build and refine in terms of
content, the order the questions are asked, and so on.
72. (Diagram) Mission life-cycle: Diagnose, Diagnose, Diagnose → Solve, Solve → Mop Up & Close
Layers: Ingredients → Recipes → Micro-service runbooks
Example runbooks: Account Take-Over, Stale Accounts, JML Visibility Gap, Account Security, Leaver Account Removal, Access / Account Correlation, 2FA Monitoring, JML Assurance, No Threat Model Coverage
And build them into
- lists of ingredients
- examples of recipes, and eventually
- a library of ‘microservice’ runbooks, which can be taken and joined up in
different patterns to suit the scenario at hand.
This slide provides an abstract visualisation of this process.
From the top, to deliver a specific ‘Mission’ (e.g. an Incident Response scenario like
the one we looked at earlier), you will need different data points as Ingredients
across its life-cycle.
These Ingredients will be combined in different Recipes to uncover knowledge,
answer questions and complete tasks as you move from left to right.
And in future, the Recipes may be re-usable as run-books, in other situations where a
similar pattern of problem needs solving.
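The Ingredients → Recipes → runbook layering above can be sketched in plain Python. This is an abstract illustration only: the ingredient names, the runbook name and the returned values are all invented for the example, not part of the actual stack.

```python
# Abstract sketch of the Ingredients -> Recipes -> runbook layering.
# All names and values here are invented for illustration.

# Ingredients: individual data points we know how to fetch for a system
ingredients = {
    "system_owner": lambda system: f"owner-of-{system}",
    "admin_list":   lambda system: [f"admin1@{system}", f"admin2@{system}"],
    "access_logs":  lambda system: f"logs://{system}",
}

def run_recipe(steps, system):
    """A recipe combines ingredients in a fixed order for a given scenario."""
    return {step: ingredients[step](system) for step in steps}

# A reusable 'runbook' is just a named recipe we can re-apply to new scenarios
account_takeover_runbook = ["system_owner", "admin_list", "access_logs"]
result = run_recipe(account_takeover_runbook, "sunways")
```

The point of the layering is that the same ingredients can be recombined into different recipes as the scenario demands.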
73. Design for data curve balls
Finally, there is a heavy focus in the data system we’ve built on the ETL phase of
‘Transformation’.
74. Insert witty quote here about how much
time Data Scientists spend complaining
about cleaning data.
Curveballs are the rule, not the exception when consuming and correlating data sets.
Messiness is a feature not a bug.
Especially, it seems, when you need to consume and correlate data at short notice.
So we designed heavily for that.
75. The data system
So with that in mind, let’s take a tour through ‘the data system’.
78. Doesn’t cost us a load to set up
Built on commodity SaaS components
Easy to customise
Extensible via APIs
Cheap to run (‘zero idle cost’)
Won’t incur large TCO debt for future teams
More on that in a moment.
But when we began building this, we had a shoestring budget.
We needed a system that we could choose to scale, as we wanted to scale it.
Sure, it may look weird at first glance.
But sometimes you have to sail with the ship you have, rather than the one you want.
And if I’m honest, of all the data systems I’ve seen, built and worked with to try and
solve analytics-related problems in security, this is by far the most elegant.
80. So, a few definitions in terms of how we think about the data system components.
81. Our graph data store and ontology management system
This is where we
1. Create and update nodes (aka Issue Types)
2. Link them via edges (aka Issue Links)
3. Track node lifecycle phases (aka Workflows)
When we say ‘JIRA’, that’s shorthand for our graph data store and ontology
management system.
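The mapping above - nodes as Issue Types, edges as Issue Links, lifecycle phases as Workflows - can be sketched as an in-memory model. This is a toy illustration of the shape of the data, not the real JIRA schema; the node types, link names and statuses are invented examples.

```python
# Minimal in-memory sketch of the JIRA-as-graph mapping described above.
# Node types, link names and workflow states are illustrative, not the real schema.

nodes = {}   # issue key -> node (a JIRA issue of a given Issue Type)
edges = []   # issue links between two issue keys

def create_node(key, issue_type, summary, status="Open"):
    """Create a node, i.e. a JIRA issue of a given Issue Type."""
    nodes[key] = {"type": issue_type, "summary": summary, "status": status}

def link(source, target, link_type):
    """Create an edge, i.e. a JIRA Issue Link between two issues."""
    edges.append({"from": source, "to": target, "type": link_type})

def transition(key, new_status):
    """Move a node through its lifecycle (a JIRA Workflow transition)."""
    nodes[key]["status"] = new_status

# Example: a person node linked to a system they administer
create_node("PERSON-1", "Person", "Alice Example")
create_node("SYS-1", "IT System", "Sunways")
link("PERSON-1", "SYS-1", "is admin of")
transition("SYS-1", "Live")
```

In the real system these operations go through JIRA’s API rather than a Python dict, but the graph semantics are the same.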
83. Cheap index for querying
Where we store JIRA data to make it easy to index and query via
Slack / Jupyter notebooks.
(ELK is an open source tech stack for storing, indexing, querying
and visualizing data, which is made up of 3 open source tools:
Elasticsearch, Logstash, and Kibana.)
Elk is our friendly neighbourhood index, where we store JIRA data so it’s easy to
search and visualize via Slack.
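The kind of query ELK answers for us can be sketched in pure Python, treating tickets as indexed JSON documents. The field names and ticket data here are illustrative; in practice this would be an Elasticsearch query issued via Slack or a notebook.

```python
# Sketch of a filter query over indexed tickets, of the kind ELK serves for us.
# Field names and documents are invented for illustration.

indexed_tickets = [
    {"key": "INC-1", "issue_type": "Security Incident", "status": "Open"},
    {"key": "INC-2", "issue_type": "Security Event",    "status": "Closed"},
    {"key": "INC-3", "issue_type": "Security Incident", "status": "Closed"},
]

def search(docs, **filters):
    """Return documents matching all field=value filters (a 'terms'-style query)."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]

open_incidents = search(indexed_tickets, issue_type="Security Incident", status="Open")
```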
84. (Screenshot) Ticket and Ticket Fields views: a histogram of created tickets, and a list of available fields.
It’s also a good place for us to analyse and visualise trends relating to nodes and
edges, albeit more in terms of how people are using the data system, than analytics
in operational scenarios.
85. Easy command line interface, and message bus
1. A ‘non-technical’ command line tool, which can be used to create,
search for, visualize, share and update information in JIRA
2. An existing part of a company’s communications fabric, that
enables security to create, tune and automate feedback loops with
anyone who is part of the Slack workspace (colleagues, vendors)
Slack is both a command line tool, for us, as well as the communications fabric that
our company runs on.
This lets us automate all kinds of feedback loops via a medium that all our colleagues
are already familiar with, and are engaged with a huge amount of the time.
89. Programming interface for ingredients lists and recipes
The programming interface we use to explore, transform and
manipulate data, so that ‘ingredients books’ and ‘recipe books’ can
be created for various scenarios that teams face repeatedly, which
involve capture, linking, visualisation, interrogation and updating of
data.
Finally, Jupyter.
This is our more advanced interface for creating and working with both ingredients
books and recipe books.
90. (Screenshot) An ingredients book in Jupyter: notebook title, data fields for filtering, ‘ingredients’ code, and 3D results.
Here’s an example of an ingredients book, designed to enable easy exploration of
relationships between all our Asset, Vulnerability and Risk data.
91. Tech stack layers and their function (non-technical user → more technical user):
- Management Reporting (consume insight): GSuite slides / sheets → Jupyter
- User Interface (create, visualize, share, update): Slack
- GS Bot (API broker and orchestration layer): Lambdas
- Database (store / link / index data): JIRA, ELK
- Platform (compute): AWS
- Sensors (create data): manually entered qualitative data, machine generated alerts
I think about these various components as providing a choice of interfaces, either for
users like me, who are non-technical, or those like my colleague George, who are
highly technical.
92. (Same tech stack layer diagram as the previous slide.)
Then we have GS Bot in the middle, acting as the API broker that makes all these
various interactions possible.
93. Say ‘use case’ again. I dare you.
That’s the tech.
But what about the problem set this solves for people?
94. Modes and Triggers:
Mode:
- Add - gather more knowledge
- Explore - see where the data takes us
- Interrogate - ask specific questions of knowledge, or lack of it
Trigger:
- Crisis - unforeseen circumstances, urgent
- Ad hoc - random circumstances, some urgency
- Periodic - expected & recurring over short-cycle timeframes
- Cyclical - expected & recurring over long-cycle timeframes
Here is a frame of reference for thinking about the ‘modes’ people are in and the
‘triggers’ they have when they need to interact with data or information.
(A big ‘thanks’ to Russ Thomas - https://twitter.com/MrMeritology - for sharing the
triggers part of this with me years ago!)
95. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
User Context
Add
Add / Explore /
Interrogate
Add / Explore
Explore /
Interrogate
We can map our modes to different parts of the data system, where they fit best, and
consider what interface is best under what triggering condition.
96. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
Create ticket
in Jira
Commands
into Slack to
create and
update tickets
Data from ELK
can be queried
by Slack
Inputs
API into JIRA Data from a .CSV is
transformed in
Jupyter
Data is backed up
and indexed in ELK
Bulk create / update
from Jupyter
This helps us consider routes for inputs into our knowledge graph.
97. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
API from
JIRA
Alerts into Slack
channel for relevant
stakeholders
Slide decks
auto-generated to
a time sequence
Slide decks then
sent to relevant
stakeholders
Tickets updated in
JIRA based on alert
interaction
Outputs
As well as ‘outputs’ that support feedback loops in their various different forms.
100. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Tickets created
from Slack in
Incident Channel
2. Tickets synced to JIRA
3. JIRA data synced to ELK
4. ELK queried from Slack
… we can now do things like this.
When we were ‘graphing manually on the fly’ via Slack (trigger = crisis; mode =
interrogate), this is what was happening in the background.
101. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Get .CSVs that
contain data we
want to input
into knowledge
graph
2. Data transformed
to become ‘graph
ready’ in a Jupyter
‘recipe book’
3. Data synced into
JIRA from Jupyter
4. Data is indexed
in ELK
5. Graph can be
explored in Slack
For ‘batch graphing’, where we’re importing .csv data in bulk (mode = add; trigger =
periodic / cyclical activity), the process looks more like this.
(Apologies that this reads right to left).
102. Here’s an example of a Jupyter recipe book, with an imaginary data set showing how
we can parse messy data into a graph ontology.
The nice thing is that once you’ve worked with a few data sets from your applications, this
is a process that goes quickly from ‘manual’ to ‘mostly automated’, and you can run it
in batch when you want.
103. Business context
This representation of our current ontology (which we’ll cover in more detail in the
next section of the talk), reflects the kind of information you can get if you pull a user
list from your HR system, and 2 applications.
(The greyed out boxes represent information which will need to come from
somewhere else, as those data sets don’t contain it).
104. Tech ecosystem context
Even with a few datasets like this, you can immediately build context between
business and technology dimensions.
105. Decision making context
And it’s a short jump from here to using the parsing process to identify inaccurate,
incomplete and incongruous data.
For example, who owns that active generic account in that SaaS system, which has a
.com email domain which isn’t yours, and seems to have last logged in 3 years ago?
Why is there disagreement between System X and your HR database, about whether
this employee still works for you?
Etc., etc.
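A consistency check of this kind can be sketched as follows, comparing an HR extract against an application’s user list. The records, field names and statuses are invented for illustration.

```python
# Sketch of a parsing-time consistency check: flagging disagreements between
# an HR extract and an application's user list. Data and fields are invented.

hr_records = {
    "alice@corp.example": {"status": "employed"},
    "bob@corp.example":   {"status": "left"},
}
app_users = ["alice@corp.example", "bob@corp.example", "mystery@other.example"]

def find_incongruities(hr, users):
    """Flag app users who have left, or who are unknown to HR entirely."""
    issues = []
    for user in users:
        record = hr.get(user)
        if record is None:
            issues.append((user, "not in HR system"))
        elif record["status"] == "left":
            issues.append((user, "marked as leaver in HR"))
    return issues

flags = find_incongruities(hr_records, app_users)
```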
106. Security context
These sound like facts we might want to capture, and present to colleagues or
management for a decision - in light of the vulnerability this creates.
107. Reporting line up to CEO for
engineering team, by role
Once you have this data in a graph, here’s an example of a ‘recipe book’ you can use
to connect roles to technical assets.
Here we’ve mapped the role reporting line for a team up to the CEO, so we can
better understand the stakeholder landscape.
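The reporting-line recipe amounts to walking ‘reports to’ edges from a role up to the CEO. A minimal sketch, with an invented org structure:

```python
# Sketch of the reporting-line recipe: walking 'reports to' edges from a role
# up to the CEO. The org structure is invented for illustration.

reports_to = {
    "SRE": "Head of Engineering",
    "Head of Engineering": "CTO",
    "CTO": "CEO",
}

def reporting_line(role):
    """Return the chain of roles from `role` up to the top of the graph."""
    chain = [role]
    while chain[-1] in reports_to:
        chain.append(reports_to[chain[-1]])
    return chain

line = reporting_line("SRE")
```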
108. Asset and
what it does
Roles and reporting
lines
People who
own the roles
And here’s another example with a slightly different narrative.
109. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Alert fires in a
detection tool
7. Detection tool
dashboard is
updated via API
2. API sends alert
and formats it as
ticket
3. Data sync’d
to ELK
6. User action in
Slack updates
JIRA ticket 5. Lambda triggered
to request user
feedback
4. Python in a notebook,
or lambda recognises
alert condition
Moving on from data input to mostly automated feedback loops, here’s a workflow
that combines automation, with ‘user interaction in the loop’, to reduce security alert
triage and investigation overhead.
Aka: Ryan Huber’s ‘distributed security alerting’, with a twist of JIRA.
110. See demo video here:
https://www.dropbox.com/s/povg5kaa72kv2v1
/BSIDES_DEMO_Dist_Alerting.mov?dl=0
Here is video of the user experience.
111. What this enables - in the long run - is micro-population analysis.
112. Context for what
population does
IT System that represents a
business asset
Risks due to facts and
vulns affecting IT System
Mitigating detections
enabling risk acceptance
Micro population of
users
Don’t worry about the details on these slides.
The main takeaway is that we’re connecting data through a graph for a specific set of
users, so we can use data to better understand their reality.
Here’s the scenario:
- We have a shared email account, which multiple users need access to, to
perform a particular business function
- This creates vulnerability, and to mitigate that, there are a set of detections
that need to be put in place
113. Not just detections
vs an account
Without distributed alerting, and our graph, our security team would usually just see
detections against the account.
Not that helpful, as the account is generic, and has multiple people using it.
114. A window into the
pattern of life for a
single user
With distributed alerting, we can build up patterns of life for individuals in those
groups, by having them acknowledge ownership of an alert, via Slack.
This is because when they click on the interactive button in Slack, that
acknowledgement is associated with the identifiers we have through that system
(e.g. email address).
We can then extrapolate that out into groups of employees, and compare this across
our entire organisation.
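The per-person profiling behind a shared account boils down to grouping acknowledgements by the identifier the Slack interaction gives us. A minimal sketch, with invented alert events:

```python
# Sketch of micro-population analysis: alert acknowledgements from Slack carry
# a user identifier (email), letting us build per-person activity profiles
# behind a shared account. Events are invented for illustration.
from collections import Counter

ack_events = [
    {"alert": "login-from-new-ip", "acknowledged_by": "alice@corp.example"},
    {"alert": "login-from-new-ip", "acknowledged_by": "alice@corp.example"},
    {"alert": "mass-download",     "acknowledged_by": "bob@corp.example"},
]

def pattern_of_life(events):
    """Count acknowledged alerts per individual, not per shared account."""
    return Counter(e["acknowledged_by"] for e in events)

profile = pattern_of_life(ack_events)
```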
115. The aim here is not - repeat not - to be a 1984-esque security team.
Rather, we want to use data to gain better understanding of a business process, so
that if we need to add or evolve controls, we give ourselves every opportunity to
minimise the friction we introduce (or avoid it altogether).
116. AWS
GSuite slides, sheets
Jupyter
JIRA
Slack
JIRA user
Machine generated
alerts
1. Person manually
creates data point
2. Msg is sent to
relevant triage
stakeholders
3. Triage stakeholder
does quick Slack search
4. Data point is viewed in Jupyter for full context
5. Data is linked to other relevant nodes and a Risk memo is written up
6. Data is visualised and reviewed
7. Automated slide deck is sent to risk owners in Slack
Elk
This last workflow is ‘in progress’ for us at the moment.
In essence:
- Someone reports a vulnerability to the security team by manually entering
data in JIRA;
- The Risk team are alerted by a message in Slack;
- The data is evaluated in context of the asset concerned, triaged and linked, in
Slack;
- The risk team then evaluate how this changes exposure to impact for the
stakeholders who own the asset; and
- If there is a significant change, stakeholders are sent an updated risk memo
based on changes to their risk landscape, with a request to accept the risk by
clicking on an interactive button in Slack - or to request a meeting to discuss
the risk.
117. Use decision tracking as a real
world proxy for the abstract
concept of risk appetite
The goal here is to use the feedback loops from these risk memos to start to gather
patterns of decisions.
Then we can analyse those decisions as a proxy for risk appetite at different levels
and business units.
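The aggregation behind that analysis can be sketched as follows: collect the accept / mitigate responses from risk memos and summarise them per business unit. The records and the ‘fraction accepted’ metric are invented for illustration.

```python
# Sketch of decision tracking as a risk-appetite proxy: aggregating the
# accept / mitigate responses from risk memos by business unit. Data invented.
from collections import defaultdict

decisions = [
    {"business_unit": "Payments", "decision": "accept"},
    {"business_unit": "Payments", "decision": "mitigate"},
    {"business_unit": "Retail",   "decision": "accept"},
    {"business_unit": "Retail",   "decision": "accept"},
]

def appetite_by_unit(records):
    """Fraction of risks accepted per business unit - a rough appetite signal."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["business_unit"]] += 1
        accepted[r["business_unit"]] += r["decision"] == "accept"
    return {bu: accepted[bu] / totals[bu] for bu in totals}

appetite = appetite_by_unit(decisions)
```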
119. https://github.com/owasp-sbot
We’ve created a fake mini-company’s worth of data, complete with people, roles,
applications, devices, alerts, incident data and so on.
So there’s plenty to play around with.
The integrations with Slack and JIRA aren’t out of the box yet, but watch this space.
120. We’re now going to switch gears from technology, to ontology.
Ontology
121. An ontology is a set of concepts and categories in a domain, which shows their
properties and the relations between them.
122. The one we’ve arrived at in our knowledge graph has evolved a lot over time.
This section of the talk gives an overview of that process - and shares some of the
learnings along the way.
124. This is a flow diagram of our Incident Response process about 8 months ago.
125. The highlighted areas show the different types of JIRA tickets we would raise across
this workflow.
126. https://pbx-group-security.com/blog/2018/09/15/incident-handling-processes-at-photobox/
While this created varying amounts of administrative overhead during an incident
(depending on its scale), the detail it enabled us to capture was invaluable.
Both for post incident reviews (to look at what went well and what needed
improving), as well as for capturing knowledge about the business, applications
various teams used, data pipelines, and so on.
We wrote a blog about this, and more details on this are available through the link.
[https://pbx-group-security.com/blog/2018/09/15/incident-handling-processes-at-ph
otobox/]
127. Key
= Security Event
= Security Incident
= Investigation Thread
= Incident Task
Early on, we used graph visualisations of incidents to tell us things like
- How many questions we had to ask to complete an investigation thread
- How investigation threads related to each other; and
- Whether we had successfully completed all the incident tasks, or not
And we’d then think about how we could have asked fewer questions to get to
answers that helped us fix things better, faster and cheaper.
132. The incident
IR & Threat Mgmt
After realising how expensive it was to refactor node and edge ontologies in JIRA (at
this point we didn’t have Jupyter and were entering data points manually and
individually), we switched to prototyping in PlantUML.
This made it cheaper to discard mistakes and re-build the graph differently, but the
results weren’t helping simplify our picture of the landscape.
132
133. Almost everything was linking to everything.
And while we developed some key components of the overall ontology during this
phase, (e.g., linking projects and money to the closure of vulnerabilities and
reduction of risk), overall things were getting more confusing.
133
134. “Reality is complicated!” @DinizCruz
When people would say “That looks complicated!”, Dinis would buoy our spirits by
pointing out that the complexity we were creating was a reflection of a complicated
reality.
134
135. But that didn’t change the fact our system of nodes and edges was increasingly hard
to navigate, even if you were working with it constantly.
We had multiple Projects with multiple Issues Types in JIRA (these are the Issue Types
for just one Project) ...
135
136. … and that was before you got to the problem of deciding what edge to use, to link
what nodes.
136
137. While our graphs lacked nothing in terms of freedom of expression, the consequence
was inconsistency.
This made it hard to navigate the graph and ask it questions, with confidence that
you were seeing ‘all’ the data.
137
138. The result was confusion in our own team, let alone when we tried to use the data
we had to communicate with the rest of the business.
To a large extent, this was because our graph had become removed from operational
reality.
Our nodes and edges reflected concepts we were trying to mould together - which
were abstract to anyone outside our team.
138
139. From: “How could we….?”
To: “We need to….!”
Forcing functions are funny things.
And just as Incident Response had been the trigger for us to work in graphs with
practical and beneficial results at operational level ...
139
140. … budget season helped us make an evolutionary jump in a more strategic direction.
140
141. Over time, I’ve focused less on efforts to understand ‘risk’ and more on
mapping ‘the investment decision’.
Because data that tells me where there’s no line item for security against apps or
data sets reflects a risk decision - conscious or unconscious.
If I can get data that surfaces this, I can take the technical data I have, and explain
the possible consequences of there being no budget to fix issue X, Y, Z.
- CISO, Investment Mgmt.
One of the many challenges security teams face at budget season is articulating ‘what
won’t be done’, either based on the investment that the business is prepared to
make, or the security team’s ability to operationalise a given budget.
141
142. Form follows function
We began focusing on the function of the data we had in our knowledge graph to
solve this problem ...
142
143. … our need for fact-based narratives ...
… and the common themes of questions that were coming our way ...
143
144. … which required us to put data into business context, without requiring a lot of
translation.
144
145. And so in classic ‘2 choice presentation style’ we stole an idea from a friend at a
management consultancy, who once said:
“There are only 2 presentations you give to management:
- Cloudy day, sunny day (in which things are bad but if they do XYZ things get
better); and
- Sunny day, cloudy day (things seem good, but won’t stay that way)”
145
146. This Vulnerability (e.g. control gap)
which relates to this IT System
means if credible Threat Actors we face
target us using these Techniques
we cannot protect against them
Security is blocked from solving this
due to this Fact
This exposes the business to this Risk
which these Stakeholders are accountable for
To address this Risk
you need to make a Funding Decision
so this Security Function
can develop this Pack of Analytics
which will close this Control Gap
This will require these Resources
And will use this Project Workflow
And we started to develop narratives like this one.
146
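As a rough illustration (in Python, with invented instances; the node types are the ones on the slide), a narrative like this can be held as an ordered chain of verb / node-type steps and rendered back as a human-readable sentence:

```python
# Sketch: a narrative as an ordered chain of (verb, node-type, instance)
# steps. The instances are hypothetical; the node types come from the slide.
narrative = [
    ("This", "Vulnerability", "no MFA on admin portal"),
    ("relates to this", "IT System", "Sunways"),
    ("exposes the business to this", "Risk", "account takeover"),
    ("requires this", "Funding Decision", "FY budget line 12"),
]

def render(chain):
    # Join each step into one storyline sentence.
    return "; ".join(f"{verb} {ntype} [{inst}]" for verb, ntype, inst in chain)

story = render(narrative)
```

Keeping the verbs and node types in a fixed order is what makes the storyline repeatable across incidents.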
147. Need for optimised project to close detection gaps
The SOC
As we transferred these narratives into our graphs, they got simpler and clearer ...
147
148. This Incident Fact
provides evidence of these Vulnerabilities
which relate to this IT System
which was exploited by Threat Actor
using these Techniques
This realized this Risk
causing these Impacts
affecting these Teams
To address this Risk
this Security Capability
can deliver this Outcome
This requires this Funding
for this Project
which will follow this Project Workflow
using these Resources
at this Cost
If the project is not funded
these Stakeholders
need to accept this Risk
… even when our plot lines got more complicated.
148
149. Control capability gap for risk acceptance
Technology Oversight Team
Once we were confident the storyline was easy to track ...
149
150. … we cycled back to JIRA and began implementing the ontology we’d trialled in
PlantUML.
150
151. 1. This IT System
2. Is exposed to this threat vector
3. Which is covered by this security technology
4. Which has these vulnerabilities
5. Which link to these risk themes
4. Which are fixed by this project
3. For which the security program has these identified process gaps
5. Which use these capabilities
While our nodes and edges often didn’t exactly correspond to a human readable
version of the storylines we were telling in the graphs, that mattered less and less.
151
152. 1. This security technology
2. Has these detection models
3. And these playbooks
4. Covering this IT system
5. Which is managed by this team
6. Reporting to this person
Because the nouns and verbs that we needed to make the graph ‘human readable’
were emerging through the shape of the graph.
152
153. 1. This security technology
2. Has these vulnerabilities at the management layer
3. Which are fixed by these Project key results
4. Which are delivered by these tasks
And the story lines were working as we presented them to stakeholders.
153
154. The Entropy Crushing Committee
So began the era of the great refactoring ...
… and the informal creation of the ‘Entropy Crushing Committee’, (hi James, if you’re
watching).
We started standardising and formalising our nodes, our edges and the relationships
that could exist between them in the graph.
154
156. The ability to enter arbitrary data
vs. A rigorous structure
We chose a rigorous graph structure over the ability to enter arbitrary data.
156
157. A logical narrative of nouns (nodes) and verbs (edges)
that make it easy and cheap to ask ‘expensive questions’ across the graph
with human readable, granular, and repeatable outputs
and a clear picture of what possible outputs should (probably) look like.
We focused on creating human-readable narratives with predictable paths and
expected patterns through the graph ...
157
158. Incident generates questions and facts
Connects to IT system about which little is documented
And which has no threat model information
… which had the added benefit that it made it easy to see when desired data was
missing from the graph.
Knowing what you don’t know can be very valuable, e.g. during an incident when you
may need to phone a friend in your team and ask them to do an emergency threat
model.
158
159. Thankfully this choice fitted hand in glove with the way JIRA allows you to organise
data.
159
160. Project A group of related nodes
Issue Type A distinct node type
Workflow Lifecycle phases of a node
Links Edges between nodes
The translation of how JIRA organises information into graph-speak goes roughly as
follows.
160
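A minimal sketch of that translation, assuming simplified issue dicts (the field names here loosely mimic what JIRA’s REST API returns, but are not its real schema):

```python
# Sketch of the JIRA-to-graph translation above, on invented issue data.
issues = [
    {"key": "IR-1",  "project": "IR",  "issuetype": "Security Incident",
     "status": "Open",
     "links": [{"type": "relates to", "outward": "SYS-7"}]},
    {"key": "SYS-7", "project": "SYS", "issuetype": "IT System",
     "status": "Live", "links": []},
]

# Project -> group of related nodes; Issue Type -> node type;
# Workflow status -> lifecycle phase of a node; Links -> edges.
nodes = {i["key"]: {"group": i["project"],
                    "type": i["issuetype"],
                    "phase": i["status"]} for i in issues}
edges = [(i["key"], lk["type"], lk["outward"])
         for i in issues for lk in i["links"]]
```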
161. Happily from an administrative perspective, this structure also supports innovation
and experimentation in node and edge relationships, while controlling the impact of
that across the graph.
161
162. ‘Change’ as a feature, not a bug
This is important, because change to the ontology is a feature of knowledge graphs,
not a bug.
162
163. And until it becomes cheap to mass-refactor your knowledge graphs, I would highly
recommend avoiding the pain involved in doing so.
163
164. Example : Incident Response Project
Here are 2 examples, starting with the Incident Response Project, of where we
missed opportunities to limit the blast radius of experiments.
164
165. 1. This system or human
reported event
2. Needs handling as a
Security Incident
3. Causing these
threads of activity
(e.g. prepare,
identify, contain, etc)
4. And these
specific individual
actions / questions
to answer
5. Which
generate this
evidence
Nodes
(Issue Types)
This is a generic version of what an incident graph can look like.
165
171. One of the things I failed to capitalise on early enough was investigating the
metadata people were adding to Issue Types.
171
172. Here’s the metadata captured in our ‘Security Incident’ Issue Type.
Various fields were added over time as it became necessary for us to tag stuff, and
capture details we wanted to be able to either search for, or organise by, across
incident tickets.
172
173. Are there (or could there be) other Issue Types, which are also using (or could
use) these fields … or variations of them?
This is just a different view of all these fields.
What we should have done earlier was look at these fields and ask the question:
“Are there other Issue Types in other Projects that are duplicating these, or
which could benefit from them?”
173
174. And if so, where does it make more sense to create new nodes and edges, vs
using a metadata field?
Then, we should have thought through the benefits and trade-offs of creating new
nodes and edges rather than metadata fields, and asked what the relevant nouns and
verbs needed to be to ensure high utility for different teams.
174
175. Incident Response Dimensions: Business, Security, Technology
Information we’d want to capture and link: Business Unit, Team, Partners, IT Assets,
IT Systems, Attack Surface, Threat Actor, Playbook, Data Types, Security Controls,
Vulnerabilities, Impacts / Costs, Risks
Had we looked at the metadata we were adding across different Projects and Issue
Types, we might have begun to identify the common narratives that different people
were trying to glue together independently.
175
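A rough sketch of the field-overlap check we should have run earlier (the field inventory below is invented for illustration):

```python
# Sketch: spotting metadata fields duplicated across Issue Types,
# as candidates for promotion to shared nodes/edges or shared fields.
from collections import Counter

fields_by_issue_type = {
    ("IR", "Security Incident"): {"Business Unit", "IT System", "Data Types"},
    ("RT", "Red Team Finding"):  {"IT System", "Security Control"},
    ("VM", "Vulnerability"):     {"IT System", "Data Types", "CVSS"},
}

# Count how many Issue Types use each field name.
counts = Counter(f for fields in fields_by_issue_type.values() for f in fields)
shared = {f for f, n in counts.items() if n > 1}
```

Anything in `shared` is a field two or more teams are maintaining independently - the common narratives being glued together in parallel.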
176. Example : Red Team Project
The second example of lessons learned is from our Red Team Project.
176
178. 1. Prove if credible Threat Actor X can compromise Business Asset Y using
techniques up to Z level of sophistication.
2. By simulating these tasks.
3. Which need these tools.
4. These technical exploits, control gaps and / or control failures were discovered.
5. The ability to exploit these at Z level of sophistication without prevention or
detection delivers this proof point towards the Goal
… and here’s an example of the kind of narrative it supports.
178
179. 1. We want to prove if Business Asset Y can be compromised from Attack Surface Z
2. We want these specific proof points
3. They should be made up of these tasks
4. And only use these tools
The ontology didn’t start like that though.
Originally, its structure reflected the way we ran early Red Teams.
We’d define a set of proof points we wanted; then we defined the tasks we’d run to
meet them, and the tools that could be used.
It was a lot more prescriptive, but it was a very structured way to gather evidence,
and get the business comfortable with Red Teaming in production on a regular basis.
179
180. 1. We want to prove if credible Threat Actor X can compromise Business Asset Y
using techniques up to Z level of sophistication
2. These tasks
3. Found this vulnerability
4. Which suggests the following from an attacker’s eye-view
5. Specifically about these controls
Over time, everyone got more comfortable with free-form scope.
We’d set the goal, and the Red Teamers we worked with would think creatively
within our structure.
This led to the introduction of Security Controls into the ontology - so that we could
highlight where a Red Team proof point demonstrated a control failure, or a control
strength.
180
181. 1. We want to prove if credible Threat Actor X can compromise Business Asset Y
using techniques up to #Z level of sophistication
2. These tasks and tools found this vulnerability
3. Which affects this asset
4. Which is part of this IT System
5. This suggests the following from an attacker’s eye-view
6. About these controls
7. Which also provide coverage of this IT System
Then once the Blue Team were fully involved in the end-to-end tests and evaluating
findings, the concepts of control coverage across IT systems and IT assets were
introduced.
181
182. At a certain point, it was obvious that ‘Security Controls’ and ‘IT Systems / Assets’
shouldn’t live in the Red Team Project.
182
183. Unfortunately, we’d developed the control ontology in isolation in this project, and
we hadn’t taken the time - as we were doing it - to see how applicable the structure
was to other Projects.
This meant we missed some major opportunities to evolve the control ontology to
make our data richer across all projects - for example in relation to how Regulators
articulated controls compared to Red Team operatives.
When we changed it, we had to do a lot of refactoring.
183
185. ‘Missing’ detail at a lower level of abstraction is different from a gap in the
model that means something can’t be represented.
The focus at the moment is leaving the detail behind, (as that more or less exists,
even if it’s in a state of moderate chaos).
Our time is now spent thinking much more about the fundamental building blocks of
the enterprise security ontology - and finding the fewest number of relationships
between them to answer expensive questions.
185
186. These are the key building blocks we’re working with as Projects in JIRA.
186
187. The nodes and edges within them look like this...
187
197. Project 1
Project 2
Project 3
By way of a few guiding principles, here are some things I’ve found helpful to avoid
re-factoring.
First, there can be many ways to describe the relationships between nodes in
different projects, but there should be just a few ways to describe node relationships
within projects.
197
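A minimal sketch of that principle as a validation rule (project keys and edge types are illustrative):

```python
# Sketch: few, fixed edge types within a project; free-form across projects.
# The allow-list below is invented for illustration.
ALLOWED_WITHIN = {"IR": {"causes", "generates"}}

def edge_ok(src_project, dst_project, edge_type):
    """Return True if this edge respects the within-project allow-list."""
    if src_project == dst_project:
        return edge_type in ALLOWED_WITHIN.get(src_project, set())
    return True  # cross-project edges can be described many ways
```

Running a check like this over new links is one cheap way to stop ontology entropy creeping back in.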
198. For example, this is OK, as the ‘People’ node lives outside the Incident Response
Project.
198
200. Next, be careful of node to edge paths that create unstable narratives.
200
201. Here’s an example.
Let’s say a threat actor in an incident uses a specific vector (e.g. malware), which
exploits a vulnerability to cause an impact.
At this point, it’s clear what happened.
201
202. But as we get more incidents that use this vector, and as we experience more
impacts, it soon becomes impossible to know what incident caused what impact
using this vector.
You really need mutually exclusive relationships between context specific and general
purpose nodes (e.g. an incident task, vs a generic list of business impacts) to build
strong narratives.
202
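A small sketch of how provenance gets lost through a shared generic node (all names illustrative):

```python
# Sketch of the 'unstable narrative' problem: two incidents share one
# generic 'malware' vector node, which links to a generic impact list.
edges = [
    ("INC-1", "uses", "malware"),
    ("INC-2", "uses", "malware"),
    ("malware", "causes", "payroll outage"),
    ("malware", "causes", "data loss"),
]

def impacts_of(incident):
    vectors = [d for s, v, d in edges if s == incident and v == "uses"]
    return {d for s, v, d in edges if v == "causes" and s in vectors}

# Both incidents now appear to 'cause' both impacts - we can no longer
# tell which incident caused which impact through this vector.
ambiguous = impacts_of("INC-1") == impacts_of("INC-2")
```

Context-specific nodes (one impact node per incident) keep the paths mutually exclusive and the narrative stable.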
203. Finally, look for node-to-edge joins that create narratives with the fewest number of
touch points between them.
203
205. 1. ACME Metals
2. Finance
3. Ability to pay employees
4. Payroll
5. Joe Bloggs
6. SVP
The edges in this graph can tell us a lot about ‘Joe Bloggs’.
205
206. Link 1
If he reports an incident, we can create one link to ask questions across this graph
(e.g., about what role he has, the team and function this rolls up into, etc.)
206
207. Link 1 Link 2
If the incident concerns an application - again, one link lets us ask questions across
the graph of that application without associating the Security Incident with all the
individual components.
207
208. Link 1
Link 3
Link 2 Link 4
This example shows a project of work (Green ontology), which uncovers a
vulnerability in a device, used by Joe and his colleagues, which requires a project of
work to fix.
With just 4 links, there are a lot of narratives you can now navigate here.
208
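A minimal sketch of the point: with one new edge from the incident, a simple traversal recovers the rest of Joe’s context (the data mirrors the slide; edge labels are illustrative):

```python
# Sketch: one link from a new incident node reaches the whole org context.
edges = [
    ("Joe Bloggs", "has role", "SVP"),
    ("Joe Bloggs", "works in", "Payroll"),
    ("Payroll", "supports", "Ability to pay employees"),
    ("Payroll", "part of", "Finance"),
    ("Finance", "part of", "ACME Metals"),
    ("INC-9", "reported by", "Joe Bloggs"),  # Link 1: the only new edge
]

def reachable(start):
    """All nodes reachable from `start` following edges forward."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, _, d in edges:
            if s == node and d not in seen:
                seen.add(d)
                frontier.append(d)
    return seen

context = reachable("INC-9")
```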
211. The migration
Threat Model Team
… and here’s the result once a threat model has been run.
Sometimes, things really do just get a bit complicated.
211
213. “Data captures. Information tables. Knowledge graphs. Understanding maps.
Wisdom filters. And if that’s right … if traditionally defenders think in tables and
attackers think in graphs, then the future is owned by cartographers who can
navigate maps, and refine them by filtering to reach worthy destinations.”
@dantiumpro
I’ve been thinking a lot about this quote recently, and how to put graphs in context.
[https://twitter.com/Dantiumpro]
213
214. Patterns of play
One of the reasons we rely so much on generic best practices in Security is that there
is no widely shared knowledge base that helps us identify what pattern of play is best
for the business we serve (e.g., based on its resources, our available funding, the
technology and threat landscape, etc.)
214
215. Despite the allure of the frameworks consultants sell us for ‘what good looks like’,
there is no single repeatable pattern.
215
216. It’s more like playing 50 games of chess, where changes to the pattern on one
board also have a knock-on effect across many other boards ...
216
217. … as we desperately try to tailor our strategy and operating model to deliver stage
appropriate results ...
217
218. … and build the boat while we’re rowing it.
218
219. Maps give movement choices based on position
This makes it hard to understand, in a given moment, what the best choice we have
is.
This is because we lack a picture of the landscape.
As every visit to SFO reminds me, a short geographical distance that does not account
for hills may not be the smartest route.
219
220. Simon Wardley has written a lot on maps and patterns of play.
For example, this picture illustrates that when something is in a phase of genesis, the
focus should be on agile practices that reduce the cost of change, whereas when
something is commoditised, the focus should be on reducing deviation.
220
221. Hunt: 1. Hypothesize, 2. Collect data, 3. Analyze, 4. Validate
SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Vuln: 1. Discover, 2. Triage, 3. Remediate, 4. Monitor
CSIRT: 1. Prepare, 2. Detect, 3. Manage, 4. Learn
Intel: 1. Collect data, 2. Process, 3. Use, 4. Share
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
With apologies to @dextercasey
When we think about the inputs and outputs (i.e. the feedback loops) within and
between security controls (let alone the business), and we consider the analytics
pathways we need to build ...
221
222. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
… perhaps we can start combining graphs and maps to understand where we need
to put our focus.
222
223. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
For example, if the internal feedback loop your SOC has looks like this ...
223
224. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… and your Red Team ...
224
225. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… looks like this ...
225
226. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… and the data feedback loop between these two controls involves this ...
226
227. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… then maybe the smart place to invest is here.
227
228. We have a bunch of ideas on this that we haven’t had time to work on, so if anyone
likes graphs and maps, please get in touch!
229. Quantifying exposure to loss (the FAIR model)
1. Will a credible threat actor target Acme Inc. in the next <defined time period>?
Factors:
- Credible Threat Actors
- Their motivations that would lead them to target your Business Assets
- Frequency of contact with threat actors across Attack Surfaces
2. If yes, will the threat actor defeat Acme Inc.’s controls?
Factors:
- Threat Actor sophistication
- The tactics, tools and processes they have access to
- Control capabilities across relevant Attack Surfaces
- Likely Attack Paths and weaknesses across them
3. If yes, will a loss event occur, and if yes, what is the forecast amount?
Factors:
- Speed to recover
- Speed to detect and respond
- Loss amount over time for impact to system or data availability, confidentiality
and integrity
The other thing I’m excited about is building the FAIR model into our graph ontology.
[For more info, see https://www.fairinstitute.org]
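As a rough sketch of where this could go, here is a toy Monte Carlo pass over the three FAIR questions above. Every distribution and parameter is an illustrative assumption, not calibrated data:

```python
# Toy Monte Carlo over the FAIR chain: contact frequency ->
# probability controls are defeated -> loss magnitude.
import random

random.seed(7)  # reproducible for the sketch

def simulate_year():
    contacts = random.randint(0, 5)        # assumed threat actor contacts
    losses = 0.0
    for _ in range(contacts):
        if random.random() < 0.3:          # assumed P(controls defeated)
            losses += random.lognormvariate(11, 1)  # assumed loss size
    return losses

runs = [simulate_year() for _ in range(10_000)]
annualised_loss_expectancy = sum(runs) / len(runs)
```

In a real build, each parameter would be driven by nodes in the graph (Threat Actors, Attack Surfaces, Control capabilities) rather than hard-coded guesses.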
230. Types of loss
Loss due to lack of visibility
You did not have the data you needed to make a risk
decision (aka: Knightian uncertainty)
Mis-prioritisation loss
You had the data, but overlooked its priority in decision making
This is especially useful in helping us quantify knowns and unknowns, through the
lens of Knightian uncertainty vs mis-prioritisation.
231. Protection vs Investment
- Too much protection, too little investment: Impossible.
- Too much protection, just right investment: Find and reduce control friction.
- Too much protection, too much investment: Reduce spend. Find and reduce control friction.
- Just right protection, too little investment: Impossible.
- Just right protection, just right investment: Target.
- Just right protection, too much investment: Deliver efficiency gains to reduce spend.
- Too little protection, too little investment: Build aligned strategy and efficient operations engine, raise spend.
- Too little protection, just right investment: Optimise control design, delivery and operationalisation.
- Too little protection, too much investment: Reduce spend. Solve gaps / failures in strategic and / or operational process.
I hope that’s been helpful.
We face a really tough challenge in this industry: to hit a moving target that is
context-dependent on multiple other factors, where what ‘just right’ looks like can
change very quickly.
231
232. Perhaps some of what we’ve shared can help us all escape a common enemy :)
232