1. Automate to enhance human capabilities, not replace them, drawing inspiration from Iron Man's suit rather than Ultron.
2. Make it easy for users to explore patterns in data to understand security issues at both the local and systemic level.
3. Prioritize an open approach that encourages collaboration and community improvement over proprietary solutions.
3. Please don’t worry about trying to absorb everything on the slides.
Instead, view this talk as field notes for the deck, which you can refer back to later.
11. - What is the system?
- What is its purpose?
- Where is it hosted?
- Who is the business owner?
- Who is the technical / support owner?
- Who are the admins?
- Who owns the unnamed and named accounts?
- What access and activity logs are available?
- What controls cover the system?
- Is there an on-call support team?
- Who is the vendor point of contact?
- What are their SLAs for incident response?
- What data is in the system?
- What regulatory reporting requirements do we need to be aware of?
- What is the business impact if the system needs taking offline?
- What evidence do we need to capture?
- What is the threat model for how a threat actor could have got to the system?
- What is the threat model for where a threat actor can go next?
As you press the button to call up the elevator, your brain starts to cycle through all
the things your team will need to find out.
But it’s cold comfort that you’ve dealt with incidents like this before.
12. Because the first thing you did when you started at Digicorp was ask everyone about
the firm’s most critical applications.
And this is the first time you’ve heard anyone talk about ‘Sunways’.
Past experience tells you that you’re going to need to pull a picture together with
only a few pieces to go on.
13. You know where the rest of them are likely to be though.
In all their fragmented, partial, and inaccurate glory.
14. And so it begins.
The blur of phone calls, emails, and messages.
24. Due to these control gaps and failures,
this threat actor
exploited these vulnerabilities,
to gain access to this IT system,
(which generates £xx amount of revenue
and represents a vital trade secret),
compromising its confidentiality, integrity and availability,
which led the business to experience these impacts,
realising these risks,
leading to a total loss amount of $n.
And as you think about the story your presentation told (in that hastily put together
slide deck), part of you is just glad that the last few days are over.
25. Relief that you managed to join the dots.
But another part of you is frustrated.
You know the mental model the team’s just built up will fade from corporate
memory.
More slowly for some people, sooner for others.
27. Until eventually it returns to the fragmented state you found it in 8 days ago.
28. @CxOSidekick
It’s a privilege to be back at the Ground Truth track.
A huge thank you to Gabe (@gdbassett) and Urban (@UrbanSec) for curating this
amazing space at BSides, where we can share ideas at the intersection of data science
and security.
29. Building an enterprise security
knowledge graph to fuel better
decisions, faster
In the next 45 minutes, we’re going to look at how knowledge graphs can help
security teams address the problems we’ve just touched on in our make-believe
incident scenario.
And also how we can flip the script on ‘thinking in lists’, to reap the rewards of
thinking - and operating - in graphs.
30. Hypothesis
This talk is the product of 9 months’ work in a ‘live’ operational environment, testing a
hypothesis, which runs as follows.
31. To solve the problems we face, we need to be able to join
all the component parts that relate to security,
across business and technical dimensions,
in a scalable knowledge graph,
so it’s easy to capture, link, visualise, contextualise, share,
interrogate and update information in seconds,
for Executive, Management and Operations stakeholders.
[Read slide]
32. To solve the problems we face, we need to be able to join
all the component parts that relate to security,
across business and technical dimensions,
in a scalable knowledge graph,
so it’s easy to capture, link, visualise, contextualise, share,
interrogate and update information in seconds,
for Exec, Management and Operations stakeholders.
At its core, this hypothesis focuses on a ‘user need’, which extends far beyond
Incident Response and Security Operations.
35. “I process billions of dollars of transactions per week.
Each quarter my control functions in fraud, compliance
and security bring me a report about what went wrong
and what needs to improve. With instant payments,
PSD2 [revised payments service directive] and all the
fintech we’re adopting, I need to run my business
today, using today’s information.”
- CIO, Global Financial
It affects colleagues in many other areas as well.
As per this quote from a CIO:
Each quarter my control functions bring me a report about what went wrong,
but I need to run my business today, using today’s information.
37. “Organisations function by making decisions, about Why, What and
How, so it’s startling how bad most organisations are at it, and how
easily organisations that get good at decision making find it to
outpace and outmanoeuvre their competition.
Data driven decisions need the right data (scope), correct data
(accuracy), appropriate processing and presentation, and a proper
insertion point into the decision making process.”
Chris Swan, CTO
But even if we solve that problem, we’ve only won half the battle.
Because as per this fantastic blog by Chris Swan, once we’ve mined valuable insights,
they then need ‘a proper insertion point into the decision making process’.
[http://blog.thestateofme.com/2019/07/05/making-better-decisions/]
39. Identifying the right lens for presenting it across the various different levels of the
business.
40. The challenge is not ‘explaining technical
things to people who are non-technical’.
It’s communicating technical data in a way
that’s relevant to the context, concerns,
priorities and accountability of an audience,
whose primary focus is on business targets,
money and time.
Adjusting that lens to provide the right amount of zoom based on the audience's
context, concerns, priorities and accountability.
41. And then of course finding time for people to consume it.
42. This is a problem
Action
Trust
Communication
Insight
Context
Analysis
Transformation
Platform
Pipeline
API
Data
Sensor
This is not a simple problem.
It’s also not a problem that’s unique to the domain of cybersecurity ...
43. Human Interaction,
Data Science and
Data Engineering
This is a problem
… as data engineers and data scientists know only too well, regardless of the
industry they work in.
44. So in this talk, we’re not just going to look at building knowledge graphs.
45. We’re also going to look at how we can create and deliver ‘context-relevant’ and
‘stakeholder appropriate’ interactions with the information they link together - and
which needs updating continuously.
46. Even a stopped clock keeps
the right time twice a day
Is our approach valid?
Is our implementation practical?
Can the concepts we’re working with transfer to other organizations?
47. Please share your opinions and questions with us during and after the talk on Twitter.
48. @DinisCruz
@im_geeg
@CxOSidekick
For all things philosophical and technical, you can ‘@’ Dinis, whose vision started us
down this road.
He also wrote most of the code we’re open sourcing today.
GG has been operationalizing a lot of what we’ll look at and is the best person to go
to for a programmer’s and detection engineer’s perspective on day-to-day usage of
the stack.
And for questions about knowledge graph ontology and user needs, feel free to point
them at me.
49. Side Note
A note of reflection before a probably ill-advised live demo attempt.
50. http://iang.org/papers/market_for_silver_bullets.html
Years ago, the paper “A market for Silver Bullets” was published.
It described a dynamic in which neither buyers nor sellers in cybersecurity had the
information they needed to know what effective solutions looked like for their
problems.
I’d argue this still largely holds true today.
51. “Best practices look at what everyone else is doing, crunch numbers - and
come up with what everyone else is doing. Using the same method, one would
conclude that best practices for nutrition mandates a diet high in fat, cholesterol
and sugar, with the average male being 35 pounds overweight.” - Ben Rothke
I have shown that any deviation from best practices is costly, including towards a
presumed direction of greater security. Only the most profitable of security
measures will produce enough benefit to overcome the cost of breaching the
equilibrium. As the cost of breaching the equilibrium is proportional to the number
of community members, the larger the community, the greater the opportunities for
security are foregone, and the more the vulnerability.
Herding is a Nash equilibrium as well as being rational behaviour; if one player
chooses a new silver bullet, other players do not have a better strategy than sticking
to the set of best practices, and even the player that changes is strictly worse off as
they invite extraordinary costs in the event of a breach. This approach lowers the
more significant extraordinary costs and accepts direct costs as unavoidable and to
be absorbed.
http://iang.org/papers/market_for_silver_bullets.html
If recognising that no one has ‘right answers’ is one step towards surviving in this
industry … the other point the paper asks us to acknowledge is that:
“Any deviation from best practices will incur costs where individual members
go it alone.”
52. Our team are big believers in the value of open source and creative commons.
The continued perpetration of bad API output, 2-dimensional dashboards, endless
.xml joins ...
...and mirror mazes of macros and pivot tables makes it clear we need to collaborate if
we’re to breach this current equilibrium and move from lists to graphs.
54. Side Effects
Yes, as with anything that requires process change, the shift to thinking and operating
in a hyperlinked way does have side effects in cost, time and effort.
55. The 5 stages of graphs
WTF are you talking about?
Ok, I’ll admit, this is intriguing.
I am DONE with this.
Hm, maybe if I just...
There is no graph.
It can also be a remarkably frustrating journey to navigate.
56. And no, not everything you see here will be immediately transferable (or applicable
at all!) to your business.
57. The goal today is not to suggest ‘this is how things should be done’, just to share one
possible path.
So please treat this talk like a meal :)
Eat what you like. Leave what you don’t. Let us know what dishes confuse you. And
what you’d like to see added to the buffet.
59. … as we cast our minds back to the imaginary incident we walked through earlier.
60. The early blur of phone calls, emails, and messages ...
61. … the problem of keeping everyone on the same page ...
62. … and the dots we had to find, then join up.
63. Here’s a different version of how our story could have unfolded.
Let’s imagine that when we joined Digicorp, they’d been building a security
knowledge graph for about 9 months, mostly using readily available data sets.
For example: HR data, application user lists, alerts from endpoint technology and a
few cloud systems ... and good old manual data entry.
64. See demo video here:
https://youtu.be/LjCtbpXQA9U?t=4696
66. Before we go behind the scenes to look at the technology stack that supports a team
in operating this way ...
67. … let’s run through the main design principles that informed our choices about what
we built, and how we built it.
68. Automate like Iron Man not Ultron
We believe code should help us move faster in partnership.
That is to say, the primary function of the tech stack is not to automate away pain
points.
We want it to be like an Iron Man suit, which any member of the team can use to
manage new problems quickly and efficiently as they arise.
Then, once knowledge and analytics have been baked into a process that is stable
and re-usable, we can add in automation.
69. The resulting system felt a lot like Iron Man's suit: One person
could do the work of many, and we could do our jobs better thanks
to the fact that we had an assistant taking care of the busy
work. Learning did not stop because it was a collaborative effort.
The automation took care of the boring stuff and the late-night
work, and we could focus on the creative work of optimizing
and enhancing the system for our customers.
https://queue.acm.org/detail.cfm?id=2841313
This approach draws inspiration from an article titled ‘Automation Should Be Like
Iron Man, Not Ultron’.
The goal is to enable us to focus on the creative work of continuously optimising and
enhancing the data system, which allows us to move faster and do more than we
otherwise would be able to.
[https://queue.acm.org/detail.cfm?id=2841313]
70. Explore patterns
Second, we want to make it easy for anyone who joins our team (and eventually the
wider business) to find and understand patterns in data, both at local and systemic
level.
Rather than providing set ways of looking at something, we want people to be able to
benefit from our knowledge base and way of thinking, and evolve it with theirs.
71. 1. What was the service that was exposed?
2. Does it contain sensitive or regulated data, and if so, what?
3. How long were the creds exposed?
4. Who had access to the JIRA ticket?
5. Who placed the creds in the JIRA ticket?
6. How does that team overall manage credentials?
7. What immediate follow up can minimise exposure?
8. Which stakeholders need to be informed of the exposure?
9. Who will rotate the credentials?
10. Is there any potential business impact from rotating the creds (e.g. to scripts or automation)?
11. Is two factor authentication used by the account?
12. Is two factor authentication available on the service?
13. How many people share this credential?
14. Is the username and password likely to have been available to people who left the company?
15. Are there any access logs available for the service?
16. Are other credentials shared on this service?
17. When was the last access list review performed to see what users were valid in the system?
18. Are all owners of service and admin accounts or generic accounts known and documented?
19. Who is the system's technical owner?
20. Who is the system's management owner?
21. Are the same credentials used for any other system?
22. When was the password last changed (if that info is available) so that we can be confident leavers do not have this info?
23. Will creds continue to be shared and if so, what is the justification?
24. Is the shared credential for an admin account?
25. How old is the oldest non-valid user in the system (i.e. when was their leaving date)?
To do that, we want to be able to take a list of questions like this - which exists in
someone’s head, and may have taken months or years to build and refine in terms of
content, the order the questions are asked, and so on.
72. (Diagram) Mission life-cycle: Diagnose, Diagnose, Diagnose → Solve, Solve → Mop Up & Close
Layers: Ingredients → Recipes → Micro-service runbooks
Example runbooks: Account Take-Over, Stale Accounts, JML Visibility Gap, Account Security, Leaver Account Removal, Access / Account Correlation, 2FA Monitoring, JML Assurance, No Threat Model Coverage
And build them into
- lists of ingredients
- examples of recipes, and eventually
- a library of ‘microservice’ runbooks, which can be taken and joined up in
different patterns to suit the scenario at hand.
This slide provides an abstract visualisation of this process.
From the top, to deliver a specific ‘Mission’ (e.g. an Incident Response scenario like
the one we looked at earlier), you will need different data points as Ingredients
across its life-cycle.
These Ingredients will be combined in different Recipes to uncover knowledge,
answer questions and complete tasks as you move from left to right.
And in future, the Recipes may be re-usable as run-books, in other situations where a
similar pattern of problem needs solving.
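The Ingredients → Recipes → runbook layering above can be sketched in plain Python. This is an abstract illustration only: the ingredient names, the runbook name and the returned values are all invented for the example, not part of the actual stack.

```python
# Abstract sketch of the Ingredients -> Recipes -> runbook layering.
# All names and values here are invented for illustration.

# Ingredients: individual data points we know how to fetch for a system
ingredients = {
    "system_owner": lambda system: f"owner-of-{system}",
    "admin_list":   lambda system: [f"admin1@{system}", f"admin2@{system}"],
    "access_logs":  lambda system: f"logs://{system}",
}

def run_recipe(steps, system):
    """A recipe combines ingredients in a fixed order for a given scenario."""
    return {step: ingredients[step](system) for step in steps}

# A reusable 'runbook' is just a named recipe we can re-apply to new scenarios
account_takeover_runbook = ["system_owner", "admin_list", "access_logs"]
result = run_recipe(account_takeover_runbook, "sunways")
```

The point of the layering is that the same ingredients can be recombined into different recipes as the scenario demands.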
73. Design for data curve balls
Finally, there is a heavy focus in the data system we’ve built on the ETL phase of
‘Transformation’.
74. Insert witty quote here about how much
time Data Scientists spend complaining
about cleaning data.
Curveballs are the rule, not the exception when consuming and correlating data sets.
Messiness is a feature not a bug.
Especially, it seems, when you need to consume and correlate data at short notice.
So we designed heavily for that.
75. The data system
So with that in mind, let’s take a tour through ‘the data system’.
78. Doesn’t cost us a load to set up
Built on commodity SaaS components
Easy to customise
Extensible via APIs
Cheap to run (‘zero idle cost’)
Won’t incur large TCO debt for future teams
More on that in a moment.
But when we began building this, we had a shoestring budget.
We needed a system that we could choose to scale, as we wanted to scale it.
Sure, it may look weird at first glance.
But sometimes you have to sail with the ship you have, rather than the one you want.
And if I’m honest, of all the data systems I’ve seen, built and worked with to try and
solve analytics-related problems in security, this is by far the most elegant.
80. So, a few definitions in terms of how we think about the data system components.
81. Our graph data store and ontology management system
This is where we
1. Create and update nodes (aka Issue Types)
2. Link them via edges (aka Issue Links)
3. Track node lifecycle phases (aka Workflows)
When we say ‘JIRA’, that’s shorthand for our graph data store and ontology
management system.
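The mapping above - nodes as Issue Types, edges as Issue Links, lifecycle phases as Workflows - can be sketched as an in-memory model. This is a toy illustration of the shape of the data, not the real JIRA schema; the node types, link names and statuses are invented examples.

```python
# Minimal in-memory sketch of the JIRA-as-graph mapping described above.
# Node types, link names and workflow states are illustrative, not the real schema.

nodes = {}   # issue key -> node (a JIRA issue of a given Issue Type)
edges = []   # issue links between two issue keys

def create_node(key, issue_type, summary, status="Open"):
    """Create a node, i.e. a JIRA issue of a given Issue Type."""
    nodes[key] = {"type": issue_type, "summary": summary, "status": status}

def link(source, target, link_type):
    """Create an edge, i.e. a JIRA Issue Link between two issues."""
    edges.append({"from": source, "to": target, "type": link_type})

def transition(key, new_status):
    """Move a node through its lifecycle (a JIRA Workflow transition)."""
    nodes[key]["status"] = new_status

# Example: a person node linked to a system they administer
create_node("PERSON-1", "Person", "Alice Example")
create_node("SYS-1", "IT System", "Sunways")
link("PERSON-1", "SYS-1", "is admin of")
transition("SYS-1", "Live")
```

In the real system these operations go through JIRA’s API rather than a Python dict, but the graph semantics are the same.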
83. Cheap index for querying
Where we store JIRA data to make it easy to index and query via
Slack / Jupyter notebooks.
(ELK is an open source tech stack for storing, indexing, querying
and visualizing data, which is made up of 3 open source tools:
Elasticsearch, Logstash, and Kibana.)
Elk is our friendly neighbourhood index, where we store JIRA data so it’s easy to
search and visualize via Slack.
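The kind of query ELK answers for us can be sketched in pure Python, treating tickets as indexed JSON documents. The field names and ticket data here are illustrative; in practice this would be an Elasticsearch query issued via Slack or a notebook.

```python
# Sketch of a filter query over indexed tickets, of the kind ELK serves for us.
# Field names and documents are invented for illustration.

indexed_tickets = [
    {"key": "INC-1", "issue_type": "Security Incident", "status": "Open"},
    {"key": "INC-2", "issue_type": "Security Event",    "status": "Closed"},
    {"key": "INC-3", "issue_type": "Security Incident", "status": "Closed"},
]

def search(docs, **filters):
    """Return documents matching all field=value filters (a 'terms'-style query)."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]

open_incidents = search(indexed_tickets, issue_type="Security Incident", status="Open")
```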
84. (Screenshot) Ticket and Ticket Fields views: a histogram of created tickets, and a list of available fields.
It’s also a good place for us to analyse and visualise trends relating to nodes and
edges, albeit more in terms of how people are using the data system, than analytics
in operational scenarios.
85. Easy command line interface, and message bus
1. A ‘non-technical’ command line tool, which can be used to create,
search for, visualize, share and update information in JIRA
2. An existing part of a company’s communications fabric, that
enables security to create, tune and automate feedback loops with
anyone who is part of the Slack workspace (colleagues, vendors)
Slack is both a command line tool, for us, as well as the communications fabric that
our company runs on.
This lets us automate all kinds of feedback loops via a medium that all our colleagues
are already familiar with, and are engaged with a huge amount of the time.
89. Programming interface for ingredients lists and recipes
The programming interface we use to explore, transform and
manipulate data, so that ‘ingredients books’ and ‘recipe books’ can
be created for various scenarios that teams face repeatedly, which
involve capture, linking, visualisation, interrogation and updating of
data.
Finally, Jupyter.
This is our more advanced interface for creating and working with both ingredients
books and recipe books.
90. (Screenshot) An ingredients book in Jupyter: notebook title, data fields for filtering, ‘ingredients’ code, and 3D results.
Here’s an example of an ingredients book, designed to enable easy exploration of
relationships between all our Asset, Vulnerability and Risk data.
91. Tech stack layers and their function (non-technical user → more technical user):
- Management Reporting (consume insight): GSuite slides / sheets → Jupyter
- User Interface (create, visualize, share, update): Slack
- GS Bot (API broker and orchestration layer): Lambdas
- Database (store / link / index data): JIRA, ELK
- Platform (compute): AWS
- Sensors (create data): manually entered qualitative data, machine generated alerts
I think about these various components as providing a choice of interfaces, either for
users like me, who are non-technical, or those like my colleague George, who are
highly technical.
92. (Same tech stack layer diagram as the previous slide.)
Then we have GS Bot in the middle, acting as the API broker that makes all these
various interactions possible.
93. Say ‘use case’ again. I dare you.
That’s the tech.
But what about the problem set this solves for people?
94. Modes and Triggers:
Mode:
- Add - gather more knowledge
- Explore - see where the data takes us
- Interrogate - ask specific questions of knowledge, or lack of it
Trigger:
- Crisis - unforeseen circumstances, urgent
- Ad hoc - random circumstances, some urgency
- Periodic - expected & recurring over short-cycle timeframes
- Cyclical - expected & recurring over long-cycle timeframes
Here is a frame of reference for thinking about the ‘modes’ people are in and the
‘triggers’ they have when they need to interact with data or information.
(A big ‘thanks’ to Russ Thomas - https://twitter.com/MrMeritology - for sharing the
triggers part of this with me years ago!)
95. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
User Context
Add
Add / Explore /
Interrogate
Add / Explore
Explore /
Interrogate
We can map our modes to different parts of the data system, where they fit best, and
consider what interface is best under what triggering condition.
96. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
Create ticket
in Jira
Commands
into Slack to
create and
update tickets
Data from ELK
can be queried
by Slack
Inputs
API into JIRA Data from a .CSV is
transformed in
Jupyter
Data is backed up
and indexed in ELK
Bulk create / update
from Jupyter
This helps us consider routes for inputs into our knowledge graph.
97. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Automated machine
generated alerts
API from
JIRA
Alerts into Slack
channel for relevant
stakeholders
Slide decks
auto-generated to
a time sequence
Slide decks then
sent to relevant
stakeholders
Tickets updated in
JIRA based on alert
interaction
Outputs
As well as ‘outputs’ that support feedback loops in their various different forms.
100. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Tickets created
from Slack in
Incident Channel
2. Tickets synced to JIRA
3. JIRA data synced to ELK
4. ELK queried from Slack
… we can now do things like this.
When we were ‘graphing manually on the fly’ via Slack (trigger = crisis; mode =
interrogate), this is what was happening in the background.
101. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Get .CSVs that
contain data we
want to input
into knowledge
graph
2. Data transformed
to become ‘graph
ready’ in a Jupyter
‘recipe book’
3. Data synced into
JIRA from Jupyter
4. Data is indexed
in ELK
5. Graph can be
explored in Slack
For ‘batch graphing’, where we’re importing .csv data in bulk (mode = add; trigger =
periodic / cyclical activity), the process looks more like this.
(Apologies that this reads right to left).
102. Here’s an example of a Jupyter recipe book, with an imaginary data set showing how
we can parse messy data into a graph ontology.
The nice thing is that once you’ve worked with a few data sets from your applications, this
is a process that goes quickly from ‘manual’ to ‘mostly automated’, and you can run it
in batch when you want.
103. Business context
This representation of our current ontology (which we’ll cover in more detail in the
next section of the talk), reflects the kind of information you can get if you pull a user
list from your HR system, and 2 applications.
(The greyed out boxes represent information which will need to come from
somewhere else, as those data sets don’t contain it).
104. Tech ecosystem context
Even with a few datasets like this, you can immediately build context between
business and technology dimensions.
105. Decision making context
And it’s a short jump from here to using the parsing process to identify inaccurate,
incomplete and incongruous data.
For example, who owns that active generic account in that SaaS system, which has a
.com email domain which isn’t yours, and seems to have last logged in 3 years ago?
Why is there disagreement between System X and your HR database, about whether
this employee still works for you?
Etc., etc.
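A consistency check of this kind can be sketched as follows, comparing an HR extract against an application’s user list. The records, field names and statuses are invented for illustration.

```python
# Sketch of a parsing-time consistency check: flagging disagreements between
# an HR extract and an application's user list. Data and fields are invented.

hr_records = {
    "alice@corp.example": {"status": "employed"},
    "bob@corp.example":   {"status": "left"},
}
app_users = ["alice@corp.example", "bob@corp.example", "mystery@other.example"]

def find_incongruities(hr, users):
    """Flag app users who have left, or who are unknown to HR entirely."""
    issues = []
    for user in users:
        record = hr.get(user)
        if record is None:
            issues.append((user, "not in HR system"))
        elif record["status"] == "left":
            issues.append((user, "marked as leaver in HR"))
    return issues

flags = find_incongruities(hr_records, app_users)
```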
106. Security context
These sound like facts we might want to capture, and present to colleagues or
management for a decision - in light of the vulnerability this creates.
107. Reporting line up to CEO for
engineering team, by role
Once you have this data in a graph, here’s an example of a ‘recipe book’ you can use
to connect roles to technical assets.
Here we’ve mapped the role reporting line for a team up to the CEO, so we can
better understand the stakeholder landscape.
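The reporting-line recipe amounts to walking ‘reports to’ edges from a role up to the CEO. A minimal sketch, with an invented org structure:

```python
# Sketch of the reporting-line recipe: walking 'reports to' edges from a role
# up to the CEO. The org structure is invented for illustration.

reports_to = {
    "SRE": "Head of Engineering",
    "Head of Engineering": "CTO",
    "CTO": "CEO",
}

def reporting_line(role):
    """Return the chain of roles from `role` up to the top of the graph."""
    chain = [role]
    while chain[-1] in reports_to:
        chain.append(reports_to[chain[-1]])
    return chain

line = reporting_line("SRE")
```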
108. Asset and
what it does
Roles and reporting
lines
People who
own the roles
And here’s another example with a slightly different narrative.
109. AWS
GSuite slides, sheets
Jupyter
Elk
JIRA
Slack
JIRA user
Machine generated
alerts
1. Alert fires in a
detection tool
7. Detection tool
dashboard is
updated via API
2. API sends alert
and formats it as
ticket
3. Data sync’d
to ELK
6. User action in
Slack updates
JIRA ticket 5. Lambda triggered
to request user
feedback
4. Python in a notebook,
or lambda recognises
alert condition
Moving on from data input to mostly automated feedback loops, here’s a workflow
that combines automation, with ‘user interaction in the loop’, to reduce security alert
triage and investigation overhead.
Aka: Ryan Huber’s ‘distributed security alerting’, with a twist of JIRA.
110. See demo video here:
https://www.dropbox.com/s/povg5kaa72kv2v1
/BSIDES_DEMO_Dist_Alerting.mov?dl=0
Here is video of the user experience.
111. What this enables - in the long run - is micro-population analysis.
112. Context for what
population does
IT System that represents a
business asset
Risks due to facts and
vulns affecting IT System
Mitigating detections
enabling risk acceptance
Micro population of
users
Don’t worry about the details on these slides.
The main takeaway is that we’re connecting data through a graph for a specific set of
users, so we can use data to better understand their reality.
Here’s the scenario:
- We have a shared email account, which multiple users need access to, to
perform a particular business function
- This creates vulnerability, and to mitigate that, there are a set of detections
that need to be put in place
113. Not just detections
vs an account
Without distributed alerting, and our graph, our security team would usually just see
detections against the account.
Not that helpful, as the account is generic, and has multiple people using it.
114. A window into the
pattern of life for a
single user
With distributed alerting, we can build up patterns of life for individuals in those
groups, by having them acknowledge ownership of an alert, via Slack.
This is because when they click on the interactive button in Slack, that
acknowledgement is associated with the identifiers we have through that system
(e.g. email address).
We can then extrapolate that out into groups of employees, and compare this across
our entire organisation.
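The per-person profiling behind a shared account boils down to grouping acknowledgements by the identifier the Slack interaction gives us. A minimal sketch, with invented alert events:

```python
# Sketch of micro-population analysis: alert acknowledgements from Slack carry
# a user identifier (email), letting us build per-person activity profiles
# behind a shared account. Events are invented for illustration.
from collections import Counter

ack_events = [
    {"alert": "login-from-new-ip", "acknowledged_by": "alice@corp.example"},
    {"alert": "login-from-new-ip", "acknowledged_by": "alice@corp.example"},
    {"alert": "mass-download",     "acknowledged_by": "bob@corp.example"},
]

def pattern_of_life(events):
    """Count acknowledged alerts per individual, not per shared account."""
    return Counter(e["acknowledged_by"] for e in events)

profile = pattern_of_life(ack_events)
```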
115. The aim here is not - repeat not - to be a 1984-esque security team.
Rather, we want to use data to gain better understanding of a business process, so
that if we need to add or evolve controls, we give ourselves every opportunity to
minimise the friction we introduce (or avoid it altogether).
116. AWS
GSuite slides, sheets
Jupyter
JIRA
Slack
JIRA user
Machine generated
alerts
1. Person manually
creates data point
2. Msg is sent to
relevant triage
stakeholders
3. Triage stakeholder
does quick Slack search
4. Data point is viewed in Jupyter for full context
5. Data is linked to other relevant nodes and a Risk memo is written up
6. Data is visualised and reviewed
7. Automated slide deck is sent to risk owners in Slack
Elk
This last workflow is ‘in progress’ for us at the moment.
In essence:
- Someone reports a vulnerability to the security team by manually entering
data in JIRA;
- The Risk team are alerted by a message in Slack;
- The data is evaluated in context of the asset concerned, triaged and linked, in
Slack;
- The risk team then evaluate how this changes exposure to impact for the
stakeholders who own the asset; and
- If there is a significant change, stakeholders are sent an updated risk memo
based on changes to their risk landscape, with a request to accept the risk by
clicking on an interactive button in Slack - or to request a meeting to discuss
the risk.
117. Use decision tracking as a real
world proxy for the abstract
concept of risk appetite
The goal here is to use the feedback loops from these risk memos to start to gather
patterns of decisions.
Then we can analyse those decisions as a proxy for risk appetite at different levels
and business units.
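The aggregation behind that analysis can be sketched as follows: collect the accept / mitigate responses from risk memos and summarise them per business unit. The records and the ‘fraction accepted’ metric are invented for illustration.

```python
# Sketch of decision tracking as a risk-appetite proxy: aggregating the
# accept / mitigate responses from risk memos by business unit. Data invented.
from collections import defaultdict

decisions = [
    {"business_unit": "Payments", "decision": "accept"},
    {"business_unit": "Payments", "decision": "mitigate"},
    {"business_unit": "Retail",   "decision": "accept"},
    {"business_unit": "Retail",   "decision": "accept"},
]

def appetite_by_unit(records):
    """Fraction of risks accepted per business unit - a rough appetite signal."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["business_unit"]] += 1
        accepted[r["business_unit"]] += r["decision"] == "accept"
    return {bu: accepted[bu] / totals[bu] for bu in totals}

appetite = appetite_by_unit(decisions)
```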
119. https://github.com/owasp-sbot
We’ve created a fake mini-company’s worth of data, complete with people, roles,
applications, devices, alerts, incident data and so on.
So there’s plenty to play around with.
The integrations with Slack and JIRA aren’t out of the box yet, but watch this space.
120. We’re now going to switch gears from technology, to ontology.
Ontology
121. An ontology is a set of concepts and categories in a domain, which shows their
properties and the relations between them.
122. The one we’ve arrived at in our knowledge graph has evolved a lot over time.
This section of the talk gives an overview of that process - and shares some of the
learnings along the way.
124. This is a flow diagram of our Incident Response process about 8 months ago.
125. The highlighted areas show the different types of JIRA tickets we would raise across
this workflow.
126. https://pbx-group-security.com/blog/2018/09/15/incident-handling-processes-at-photobox/
While this created varying amounts of administrative overhead during an incident
(depending on its scale), the detail it enabled us to capture was invaluable.
Both for post incident reviews (to look at what went well and what needed
improving), as well as for capturing knowledge about the business, applications
various teams used, data pipelines, and so on.
We wrote a blog about this, and more details on this are available through the link.
[https://pbx-group-security.com/blog/2018/09/15/incident-handling-processes-at-ph
otobox/]
127. Key
= Security Event
= Security Incident
= Investigation Thread
= Incident Task
Early on, we used graph visualisations of incidents to tell us things like
- How many questions we had to ask to complete an investigation thread
- How investigation threads related to each other; and
- Whether we had successfully completed all the incident tasks, or not
And we’d then think about how we could have asked fewer questions to get to
answers that helped us fix things better, faster and cheaper.
132. The incident
IR & Threat Mgmt
After realising how expensive it was to refactor node and edge ontologies in JIRA (at
this point we didn’t have Jupyter and were entering data points manually and
individually), we switched to prototyping in PlantUML.
This made it cheaper to discard mistakes and re-build the graph differently, but the
results weren’t helping simplify our picture of the landscape.
132
133. Almost everything was linking to everything.
And while we developed some key components of the overall ontology during this
phase, (e.g., linking projects and money to the closure of vulnerabilities and
reduction of risk), overall things were getting more confusing.
133
134. “Reality is complicated!” @DinizCruz
When people would say “That looks complicated!”, Dinis would buoy our spirits by
pointing out that the complexity we were creating was a reflection of a complicated
reality.
134
135. But that didn’t change the fact our system of nodes and edges was increasingly hard
to navigate, even if you were working with it constantly.
We had multiple Projects with multiple Issues Types in JIRA (these are the Issue Types
for just one Project) ...
135
136. … and that was before you got to the problem of deciding what edge to use, to link
what nodes.
136
137. While our graphs lacked nothing in terms of freedom of expression, the consequence
was inconsistency.
This made it hard to navigate the graph and ask it questions, with confidence that
you were seeing ‘all’ the data.
137
138. The result was confusion in our own team, let alone when we tried to use the data
we had to communicate with the rest of the business.
To a large extent, this was because our graph had become removed from operational
reality.
Our nodes and edges reflected concepts we were trying to mould together - which
were abstract to anyone outside our team.
138
139. From: “How could we….?”
To: “We need to….!”
Forcing functions are funny things.
And just as Incident Response had been the trigger for us to work in graphs with
practical and beneficial results at operational level ...
139
140. … budget season helped us make an evolutionary jump in a more strategic direction.
140
141. Over time, I’ve focused less on efforts to understand ‘risk’ and more on
mapping ‘the investment decision’.
Because data that tells me where there’s no line item for security against apps or
data sets reflects a risk decision - conscious or unconscious.
If I can get data that surfaces this, I can take the technical data I have, and explain
the possible consequences of there being no budget to fix issue X, Y, Z.
- CISO, Investment Mgmt.
One of the many challenges security teams face at budget season is articulating ‘what
won’t be done’, either based on the investment that the business is prepared to
make, or the security team’s ability to operationalise a given budget.
141
142. Form follows function
We began focusing on the function of the data we had in our knowledge graph to
solve this problem ...
142
143. … our need for fact-based narratives ...
… and the common themes of questions that were coming our way ...
143
144. … which required us to put data into business context, without requiring a lot of
translation.
144
145. And so in classic ‘2 choice presentation style’ we stole an idea from a friend at a
management consultancy, who once said:
“There are only 2 presentations you give to management:
- Cloudy day, sunny day (in which things are bad but if they do XYZ things get
better); and
- Sunny day, cloudy day (things seem good, but won’t stay that way)”
145
146. This Vulnerability (e.g. control gap)
which relates to this IT System
means if credible Threat Actors we face
target us using these Techniques
we cannot protect against them
Security is blocked from solving this
due to this Fact
This exposes the business to this Risk
which these Stakeholders are accountable for
To address this Risk
you need to make a Funding Decision
so this Security Function
can develop this Pack of Analytics
which will close this Control Gap
This will require these Resources
And will use this Project Workflow
And we started to develop narratives like this one.
146
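As a rough illustration (in Python, with invented instances; the node types are the ones on the slide), a narrative like this can be held as an ordered chain of verb / node-type steps and rendered back as a human-readable sentence:

```python
# Sketch: a narrative as an ordered chain of (verb, node-type, instance)
# steps. The instances are hypothetical; the node types come from the slide.
narrative = [
    ("This", "Vulnerability", "no MFA on admin portal"),
    ("relates to this", "IT System", "Sunways"),
    ("exposes the business to this", "Risk", "account takeover"),
    ("requires this", "Funding Decision", "FY budget line 12"),
]

def render(chain):
    # Join each step into one storyline sentence.
    return "; ".join(f"{verb} {ntype} [{inst}]" for verb, ntype, inst in chain)

story = render(narrative)
```

Keeping the verbs and node types in a fixed order is what makes the storyline repeatable across incidents.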
147. Need for optimised project to close detection gaps
The SOC
As we transferred these narratives into our graphs, they got simpler and clearer ...
147
148. This Incident Fact
provides evidence of these Vulnerabilities
which relate to this IT System
which was exploited by Threat Actor
using these Techniques
This realized this Risk
causing these Impacts
affecting these Teams
To address this Risk
this Security Capability
can deliver this Outcome
This requires this Funding
for this Project
which will follow this Project Workflow
using these Resources
at this Cost
If the project is not funded
these Stakeholders
need to accept this Risk
… even when our plot lines got more complicated.
148
149. Control capability gap for risk acceptance
Technology Oversight Team
Once we were confident the storyline was easy to track ...
149
150. … we cycled back to JIRA and began implementing the ontology we’d trialled in
PlantUML.
150
151. 1. This IT System
2. Is exposed to this threat vector
3. Which is covered by this security technology
4. Which has these vulnerabilities
5. Which link to these risk themes
4. Which are fixed by this project
3. For which the security program has these identified process gaps
5. Which use these capabilities
While our nodes and edges often didn’t exactly correspond to a human readable
version of the storylines we were telling in the graphs, that mattered less and less.
151
152. 1. This security technology
2. Has these detection models
3. And these playbooks
4. Covering this IT system
5. Which is managed by this team
6. Reporting to this person
Because the nouns and verbs that we needed to make the graph ‘human readable’
were emerging through the shape of the graph.
152
153. 1. This security technology
2. Has these vulnerabilities at the management layer
3. Which are fixed by these Project key results
4. Which are delivered by these tasks
And the story lines were working as we presented them to stakeholders.
153
154. The Entropy Crushing Committee
So began the era of the great refactoring ...
… and the informal creation of the ‘Entropy Crushing Committee’, (hi James, if you’re
watching).
We started standardising and formalising our nodes, our edges and the relationships
that could exist between them in the graph.
154
156. The ability to enter arbitrary data
vs. A rigorous structure
We chose a rigorous graph structure over the ability to enter arbitrary data.
156
157. A logical narrative of nouns (nodes) and verbs (edges)
that make it easy and cheap to ask ‘expensive questions’ across the graph
with human readable, granular, and repeatable outputs
and a clear picture of what possible outputs should (probably) look like.
We focused on creating human-readable narratives with predictable paths and
expected patterns through the graph ...
157
158. Incident generates questions and facts
Connects to IT system about which little is documented
And which has no threat model information
… which had the added benefit that it made it easy to see when desired data was
missing from the graph.
Knowing what you don’t know can be very valuable, e.g. during an incident when you
may need to phone a friend in your team and ask them to do an emergency threat
model.
158
159. Thankfully this choice fitted hand in glove with the way JIRA allows you to organise
data.
159
160. Project A group of related nodes
Issue Type A distinct node type
Workflow Lifecycle phases of a node
Links Edges between nodes
The translation of how JIRA organises information into graph-speak goes roughly as
follows.
160
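A minimal sketch of that translation, assuming simplified issue dicts (the field names here loosely mimic what JIRA’s REST API returns, but are not its real schema):

```python
# Sketch of the JIRA-to-graph translation above, on invented issue data.
issues = [
    {"key": "IR-1",  "project": "IR",  "issuetype": "Security Incident",
     "status": "Open",
     "links": [{"type": "relates to", "outward": "SYS-7"}]},
    {"key": "SYS-7", "project": "SYS", "issuetype": "IT System",
     "status": "Live", "links": []},
]

# Project -> group of related nodes; Issue Type -> node type;
# Workflow status -> lifecycle phase of a node; Links -> edges.
nodes = {i["key"]: {"group": i["project"],
                    "type": i["issuetype"],
                    "phase": i["status"]} for i in issues}
edges = [(i["key"], lk["type"], lk["outward"])
         for i in issues for lk in i["links"]]
```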
161. Happily from an administrative perspective, this structure also supports innovation
and experimentation in node and edge relationships, while controlling the impact of
that across the graph.
161
162. ‘Change’ as a feature, not a bug
This is important, because change to the ontology is a feature of knowledge graphs,
not a bug.
162
163. And until it becomes cheap to mass-refactor your knowledge graphs, I would highly
recommend avoiding the pain involved in doing so.
163
164. Example : Incident Response Project
Here are 2 examples, starting with the Incident Response Project, of where we
missed opportunities to limit the blast radius of experiments.
164
165. 1. This system or human
reported event
2. Needs handling as a
Security Incident
3. Causing these
threads of activity
(e.g. prepare,
identify, contain, etc)
4. And these
specific individual
actions / questions
to answer
5. Which
generate this
evidence
Nodes
(Issue Types)
This is a generic version of what an incident graph can look like.
165
171. One of the things I failed to capitalise on early enough was investigating the
metadata people were adding to Issue Types.
171
172. Here’s the metadata captured in our ‘Security Incident’ Issue Type.
Various fields were added over time as it became necessary for us to tag stuff, and
capture details we wanted to be able to either search for, or organise by, across
incident tickets.
172
173. Are there (or could there be) other Issue Types, which are also using (or could
use) these fields … or variations of them?
This is just a different view of all these fields.
What we should have done earlier was look at these fields and ask the question:
“Are there other Issue Types in other Projects that are duplicating these, or
which could benefit from them?”
173
174. And if so, where does it make more sense to create new nodes and edges, vs
using a metadata field?
Then, we should have thought through the benefits and trade-offs of creating new
nodes and edges rather than metadata fields, and asked what the relevant nouns and
verbs needed to be to ensure high utility for different teams.
174
175. Incident Response Dimensions: Business, Security, Technology
Information we’d want to capture and link: Business Unit, Team, Partners, IT Assets,
IT Systems, Attack Surface, Threat Actor, Playbook, Data Types, Security Controls,
Vulnerabilities, Impacts / Costs, Risks
Had we looked at the metadata we were adding across different Projects and Issue
Types, we might have begun to identify the common narratives that different people
were trying to glue together independently.
175
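A rough sketch of the field-overlap check we should have run earlier (the field inventory below is invented for illustration):

```python
# Sketch: spotting metadata fields duplicated across Issue Types,
# as candidates for promotion to shared nodes/edges or shared fields.
from collections import Counter

fields_by_issue_type = {
    ("IR", "Security Incident"): {"Business Unit", "IT System", "Data Types"},
    ("RT", "Red Team Finding"):  {"IT System", "Security Control"},
    ("VM", "Vulnerability"):     {"IT System", "Data Types", "CVSS"},
}

# Count how many Issue Types use each field name.
counts = Counter(f for fields in fields_by_issue_type.values() for f in fields)
shared = {f for f, n in counts.items() if n > 1}
```

Anything in `shared` is a field two or more teams are maintaining independently - the common narratives being glued together in parallel.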
176. Example : Red Team Project
The second example of lessons learned is from our Red Team Project.
176
178. 1. Prove if credible Threat Actor X can compromise Business Asset Y using
techniques up to Z level of sophistication.
2. By simulating these tasks.
3. Which need these tools.
4. These technical exploits, control gaps and / or control failures were discovered.
5. The ability to exploit these at Z level of sophistication without prevention or
detection delivers this proof point towards the Goal
… and here’s an example of the kind of narrative it supports.
178
179. 1. We want to prove if Business Asset Y can be compromised from Attack Surface Z
2. We want these specific proof points
3. They should be made up of these tasks
4. And only use these tools
The ontology didn’t start like that though.
Originally, its structure reflected the way we ran early Red Teams.
We’d define a set of proof points we wanted; then we defined the tasks we’d run to
meet them, and the tools that could be used.
It was a lot more prescriptive, but it was a very structured way to gather evidence,
and get the business comfortable with Red Teaming in production on a regular basis.
179
180. 1. We want to prove if credible Threat Actor X can compromise Business Asset Y
using techniques up to Z level of sophistication
2. These tasks
3. Found this vulnerability
4. Which suggests the following from an attacker’s eye-view
5. Specifically about these controls
Over time, everyone got more comfortable with free-form scope.
We’d set the goal, and the Red Teamers we worked with would think creatively
within our structure.
This led to the introduction of Security Controls into the ontology - so that we could
highlight where a Red Team proof point demonstrated a control failure, or a control
strength.
180
181. 1. We want to prove if credible Threat Actor X can compromise Business Asset Y
using techniques up to #Z level of sophistication
2. These tasks and tools found this vulnerability
3. Which affects this asset
4. Which is part of this IT System
5. This suggests the following from an attacker’s eye-view
6. About these controls
7. Which also provide coverage of this IT System
Then once the Blue Team were fully involved in the end-to-end tests and evaluating
findings, the concepts of control coverage across IT systems and IT assets were
introduced.
181
182. At a certain point, it was obvious that ‘Security Controls’ and ‘IT Systems / Assets’
shouldn’t live in the Red Team Project.
182
183. Unfortunately, we’d developed the control ontology in isolation in this project, and
we hadn’t taken the time - as we were doing it - to see how applicable the structure
was to other Projects.
This meant we missed some major opportunities to evolve the control ontology to
make our data richer across all projects - for example in relation to how Regulators
articulated controls compared to Red Team operatives.
When we changed it, we had to do a lot of refactoring.
183
185. ‘Missing’ detail at a lower level of abstraction is different from a gap in the
model that means something can’t be represented.
The focus at the moment is leaving the detail behind, (as that more or less exists,
even if it’s in a state of moderate chaos).
Our time is now spent thinking much more about the fundamental building blocks of
the enterprise security ontology - and finding the fewest number of relationships
between them to answer expensive questions.
185
186. These are the key building blocks we’re working with as Projects in JIRA.
186
187. The nodes and edges within them look like this...
187
197. Project 1
Project 2
Project 3
By way of a few guiding principles, here are some things I’ve found helpful to avoid
re-factoring.
First, there can be many ways to describe the relationships between nodes in
different projects, but there should be just a few ways to describe node relationships
within projects.
197
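A minimal sketch of that principle as a validation rule (project keys and edge types are illustrative):

```python
# Sketch: few, fixed edge types within a project; free-form across projects.
# The allow-list below is invented for illustration.
ALLOWED_WITHIN = {"IR": {"causes", "generates"}}

def edge_ok(src_project, dst_project, edge_type):
    """Return True if this edge respects the within-project allow-list."""
    if src_project == dst_project:
        return edge_type in ALLOWED_WITHIN.get(src_project, set())
    return True  # cross-project edges can be described many ways
```

Running a check like this over new links is one cheap way to stop ontology entropy creeping back in.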
198. For example, this is OK, as the ‘People’ node lives outside the Incident Response
Project.
198
200. Next, be careful of node to edge paths that create unstable narratives.
200
201. Here’s an example.
Let’s say a threat actor in an incident uses a specific vector (e.g. malware), which
exploits a vulnerability to cause an impact.
At this point, it’s clear what happened.
201
202. But as we get more incidents that use this vector, and as we experience more
impacts, it soon becomes impossible to know what incident caused what impact
using this vector.
You really need mutually exclusive relationships between context specific and general
purpose nodes (e.g. an incident task, vs a generic list of business impacts) to build
strong narratives.
202
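A small sketch of how provenance gets lost through a shared generic node (all names illustrative):

```python
# Sketch of the 'unstable narrative' problem: two incidents share one
# generic 'malware' vector node, which links to a generic impact list.
edges = [
    ("INC-1", "uses", "malware"),
    ("INC-2", "uses", "malware"),
    ("malware", "causes", "payroll outage"),
    ("malware", "causes", "data loss"),
]

def impacts_of(incident):
    vectors = [d for s, v, d in edges if s == incident and v == "uses"]
    return {d for s, v, d in edges if v == "causes" and s in vectors}

# Both incidents now appear to 'cause' both impacts - we can no longer
# tell which incident caused which impact through this vector.
ambiguous = impacts_of("INC-1") == impacts_of("INC-2")
```

Context-specific nodes (one impact node per incident) keep the paths mutually exclusive and the narrative stable.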
203. Finally, look for node-to-edge joins that create narratives with the fewest number of
touch points between them.
203
205. 1. ACME Metals
2. Finance
3. Ability to pay employees
4. Payroll
5. Joe Bloggs
6. SVP
The edges in this graph can tell us a lot about ‘Joe Bloggs’.
205
206. Link 1
If he reports an incident, we can create one link to ask questions across this graph
(e.g., about what role he has, the team and function this rolls up into, etc.)
206
207. Link 1 Link 2
If the incident concerns an application - again, one link lets us ask questions across
the graph of that application without associating the Security Incident with all the
individual components.
207
208. Link 1
Link 3
Link 2 Link 4
This example shows a project of work (Green ontology), which uncovers a
vulnerability in a device, used by Joe and his colleagues, which requires a project of
work to fix.
With just 4 links, there are a lot of narratives you can now navigate here.
208
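A minimal sketch of the point: with one new edge from the incident, a simple traversal recovers the rest of Joe’s context (the data mirrors the slide; edge labels are illustrative):

```python
# Sketch: one link from a new incident node reaches the whole org context.
edges = [
    ("Joe Bloggs", "has role", "SVP"),
    ("Joe Bloggs", "works in", "Payroll"),
    ("Payroll", "supports", "Ability to pay employees"),
    ("Payroll", "part of", "Finance"),
    ("Finance", "part of", "ACME Metals"),
    ("INC-9", "reported by", "Joe Bloggs"),  # Link 1: the only new edge
]

def reachable(start):
    """All nodes reachable from `start` following edges forward."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, _, d in edges:
            if s == node and d not in seen:
                seen.add(d)
                frontier.append(d)
    return seen

context = reachable("INC-9")
```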
211. The migration
Threat Model Team
… and here’s the result once a threat model has been run.
Sometimes, things really do just get a bit complicated.
211
213. “Data captures. Information tables. Knowledge graphs. Understanding maps.
Wisdom filters. And if that’s right … if traditionally defenders think in tables and
attackers think in graphs, then the future is owned by cartographers who can
navigate maps, and refine them by filtering to reach worthy destinations.”
@dantiumpro
I’ve been thinking a lot about this quote recently, and how to put graphs in context.
[https://twitter.com/Dantiumpro]
213
214. Patterns of play
One of the reasons we rely so much on generic best practices in Security is that there
is no widely shared knowledge base that helps us identify what pattern of play is best
for the business we serve (e.g., based on its resources, our available funding, the
technology and threat landscape, etc.)
214
215. Despite the allure of the frameworks consultants sell us for ‘what good looks like’,
there is no single repeatable pattern.
215
216. It’s more like playing 50 games of chess, where changes to the pattern on one
board also have a knock-on effect across many other boards ...
216
217. … as we desperately try to tailor our strategy and operating model to deliver stage
appropriate results ...
217
218. … and build the boat while we’re rowing it.
218
219. Maps give movement choices based on position
This makes it hard to understand, in a given moment, what the best choice we have
is.
This is because we lack a picture of the landscape.
As every visit to SFO reminds me, a short geographical distance that does not account
for hills may not be the smartest route.
219
220. Simon Wardley has written a lot on maps and patterns of play.
For example, this picture illustrates that when something is in a phase of genesis, the
focus should be on agile practices that reduce the cost of change, whereas when
something is commoditised, the focus should be on reducing deviation.
220
221. Hunt: 1. Hypothesize, 2. Collect data, 3. Analyze, 4. Validate
SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Vuln: 1. Discover, 2. Triage, 3. Remediate, 4. Monitor
CSIRT: 1. Prepare, 2. Detect, 3. Manage, 4. Learn
Intel: 1. Collect data, 2. Process, 3. Use, 4. Share
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
With apologies to @dextercasey
When we think about the inputs and outputs (i.e. the feedback loops) within and
between security controls (let alone the business), and we consider the analytics
pathways we need to build ...
221
222. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
… perhaps we can start combining graphs and maps to understand where we need
to put our focus.
222
223. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
For example, if the internal feedback loop your SOC has looks like this ...
223
224. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… and your Red Team ...
224
225. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… looks like this ...
225
226. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… and the data feedback loop between these two controls involves this ...
226
227. SOC: 1. Collect data, 2. Analyze, 3. Validate, 4. Escalate
Red Team: 1. Scope, 2. Att&ck, 3. Triage, 4. Share
… then maybe the smart place to invest is here.
227
228. We have a bunch of ideas on this that we haven’t had time to work on, so if anyone
likes graphs and maps, please get in touch!
229. Quantifying exposure to loss (the FAIR model)
1. Will a credible threat actor target Acme Inc. in the next <defined time period>?
Factors:
- Credible Threat Actors
- Their motivations that would lead them to target your Business Assets
- Frequency of contact with threat actors across Attack Surfaces
2. If yes, will the threat actor defeat Acme Inc.’s controls?
Factors:
- Threat Actor sophistication
- The tactics, tools and processes they have access to
- Control capabilities across relevant Attack Surfaces
- Likely Attack Paths and weaknesses across them
3. If yes, will a loss event occur, and if yes, what is the forecast amount?
Factors:
- Speed to recover
- Speed to detect and respond
- Loss amount over time for impact to system or data availability, confidentiality
and integrity
The other thing I’m excited about is building the FAIR model into our graph ontology.
[For more info, see https://www.fairinstitute.org]
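As a rough sketch of where this could go, here is a toy Monte Carlo pass over the three FAIR questions above. Every distribution and parameter is an illustrative assumption, not calibrated data:

```python
# Toy Monte Carlo over the FAIR chain: contact frequency ->
# probability controls are defeated -> loss magnitude.
import random

random.seed(7)  # reproducible for the sketch

def simulate_year():
    contacts = random.randint(0, 5)        # assumed threat actor contacts
    losses = 0.0
    for _ in range(contacts):
        if random.random() < 0.3:          # assumed P(controls defeated)
            losses += random.lognormvariate(11, 1)  # assumed loss size
    return losses

runs = [simulate_year() for _ in range(10_000)]
annualised_loss_expectancy = sum(runs) / len(runs)
```

In a real build, each parameter would be driven by nodes in the graph (Threat Actors, Attack Surfaces, Control capabilities) rather than hard-coded guesses.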
230. Types of loss
Loss due to lack of visibility
You did not have the data you needed to make a risk
decision (aka: Knightian uncertainty)
Mis-prioritisation loss
You had the data, but overlooked its priority in decision making
This is especially useful in helping us quantify knowns and unknowns, through the
lens of Knightian uncertainty vs mis-prioritisation.
231. Protection vs Investment
- Too much protection, too little investment: Impossible.
- Too much protection, just right investment: Find and reduce control friction.
- Too much protection, too much investment: Reduce spend. Find and reduce control friction.
- Just right protection, too little investment: Impossible.
- Just right protection, just right investment: Target.
- Just right protection, too much investment: Deliver efficiency gains to reduce spend.
- Too little protection, too little investment: Build aligned strategy and efficient operations engine, raise spend.
- Too little protection, just right investment: Optimise control design, delivery and operationalisation.
- Too little protection, too much investment: Reduce spend. Solve gaps / failures in strategic and / or operational process.
I hope that’s been helpful.
We face a really tough challenge in this industry: to hit a moving target that is
context-dependent on multiple other factors, where what ‘just right’ looks like can
change very quickly.
231
232. Perhaps some of what we’ve shared can help us all escape a common enemy :)
232