Unlike business intelligence (BI), threat intelligence (TI) processing needs more focus on agility, flexibility, scalability, and shareability. Especially when you have a petabyte of data, you never want an investment that can do only one thing.
In this talk we share a real case of building a generic entity database that supports real-time intelligence queries for any kind of data producer and consumer, and show how to construct a machine-learning pipeline on top of the entity database that adapts to an ever-changing reality.
13. Requirements for Fact table
• A flexible data schema design on the fact table
– To store metadata for various types of entities
– To store values for various types of entities
• Efficient writes of attributes for a specific entity into petabytes of data
• Users can also specify a timestamp for attributes (versioning)
• Efficient reads of entities by given keys from petabytes of data
– Other criteria include timestamp, attribute name, and attribute value
• Reliability, availability, operations, etc.
– An AWS managed service is better
– AWS takes care of these "-ilities" for us
19. An efficient write of attributes for a specific entity into petabytes of data
// InputAttributes API
POST /metadatastore/v1/entities/aaa…/attributes/src_1
{
  "entity_type": "file",
  "attributes": [
    {
      "name": "original_filename",
      "value": "wd.sys"          # no timestamp given: defaults to current time
    },
    {
      "name": "download_from",
      "value": "https://www.microsoft.com/zh-tw/",
      "timestamp": 1470280448   # 10-digit epoch seconds
    }
  ]
}
[Diagram: InputAttributes API (POST /metadatastore/v1/entities/abc…/attributes/src_1) writing into Fact table partitions p1…pn]
1. Calculate a hash based on entity ID (partition key)
2. Split items based on timestamp (sort key)
3. Insert item #1
4. Insert item #2
5. Items in one partition are sorted by timestamp (sort key)
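The write path above can be sketched in plain Python. This is a mental model only: the number of partitions, the SHA-1 hashing scheme, and the in-memory store are illustrative assumptions, since DynamoDB handles partitioning and sorting internally.

```python
import bisect
import hashlib
import time

NUM_PARTITIONS = 4  # illustrative; DynamoDB manages partitioning itself

# partition number -> list of (sort_key, (name, value)), kept sorted
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def put_attribute(entity_id, name, value, timestamp=None):
    """Insert one attribute item, mimicking the Fact table write path."""
    # 1. Calculate a hash based on entity ID (partition key)
    digest = hashlib.sha1(entity_id.encode()).hexdigest()
    partition = int(digest, 16) % NUM_PARTITIONS
    # 2. Split items based on timestamp (sort key);
    #    a missing timestamp defaults to "now", as in the API example
    sort_key = timestamp if timestamp is not None else int(time.time())
    # 3./4. Insert items; 5. each partition stays sorted by sort key
    bisect.insort(partitions[partition], (sort_key, (name, value)))
    return partition, sort_key

p1, _ = put_attribute("aaa", "download_from",
                      "https://www.microsoft.com/zh-tw/", 1470280448)
p2, _ = put_attribute("aaa", "original_filename", "wd.sys")
assert p1 == p2  # same entity ID -> same partition
```

Because the entity ID alone decides the partition, all attributes of one entity land together, which is what makes the later read-by-key efficient.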
21. An efficient read of entities by given keys from petabytes of data
// GetEntityList API
GET /metadatastore/v1/entities
{
  "entities": [
    {
      "key": "aaa...",
      "type": "file"
    },
    {
      "key": "abc…",
      "type": "file"
    }
  ],
  "attributes": [
    {
      "source": "src_1",
      "name": "original_filename"
    },
    {
      "source": "src_1",
      "name": "download_from"
    }
  ]
}
// pseudo code for the DynamoDB query
resultSet = // …
for entityKey in entityKeys
  SELECT original_filename, download_from
  FROM fact_table
  WHERE
    # find items in range
    partitionKey = entityKey
    AND sortKey <= timestamp
  FILTER
    # filter by source
    original_filename.src = "src_1"
    AND download_from.src = "src_1"
  // …
  rs_tmp = // results from the DynamoDB SDK
  for r_tmp in rs_tmp
    for f in needFields
      if r_tmp.exists(f) then
        resultSet.add(r_tmp[f])
        needFields.remove(f)
      fi
    done
  done
done
return resultSet
[Diagram: GetEntityList API reading items from Fact table partition pn]
1. Transform the API call into DynamoDB queries
2. Get items by partitionKey and sortKey
3. Project attributes from items
4. Filter items by source
5. Arrange DynamoDB results into API results
6. Return API results
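The projection and filtering steps above can be sketched in Python over items already fetched from the Fact table. The item shape and function name are illustrative assumptions, not the deck's actual implementation.

```python
# Minimal sketch of the read path (steps 2-6), assuming each fetched item
# is a dict {"name", "value", "src", "sort_key"}; names are illustrative.

def get_entity_attributes(items, wanted, source, max_timestamp):
    """Project the newest value per wanted attribute, filtered by source."""
    need = set(wanted)
    result = {}
    # 2. Get items by partitionKey and sortKey (range condition)
    in_range = [i for i in items if i["sort_key"] <= max_timestamp]
    # Walk newest-first so the first hit per attribute wins
    for item in sorted(in_range, key=lambda i: i["sort_key"], reverse=True):
        # 4. Filter items by source
        if item["src"] != source:
            continue
        # 3. Project attributes from items
        if item["name"] in need:
            result[item["name"]] = item["value"]
            need.discard(item["name"])
        if not need:  # stop once every requested field is resolved
            break
    # 5./6. Arrange and return API results
    return result

items = [
    {"name": "original_filename", "value": "wd.sys",
     "src": "src_1", "sort_key": 1470280000},
    {"name": "download_from", "value": "https://www.microsoft.com/zh-tw/",
     "src": "src_1", "sort_key": 1470280448},
    {"name": "download_from", "value": "other-source-value",
     "src": "src_2", "sort_key": 1470280500},
]
print(get_entity_attributes(
    items, ["original_filename", "download_from"], "src_1", 1470290000))
```

Walking newest-first and removing satisfied fields mirrors the `needFields` early-exit in the pseudocode: versioned history can be long, but the query stops as soon as the latest value of each requested attribute is found.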
26. Propagation Output to Dimension table, e.g. Elasticsearch
• For input into ES
– ES document (JSON)
– For searching entities by extra coordinates
• Primary column
– To identify the same document in ES for updates
– Maps to _id in Elasticsearch
{
  "_id": "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18",
  "rescan_decision": "Friday Intel...",
  "dump_mip2scan": "NEW VERISIGN DDOS ...",
  "file_census_external": 91232,
  "received_timestamp": "2016-09-23T16:19:35"
}
output:
  engine: ...
  columns:
    - name: _id
      rule: .key
      primary: yes      # marks this column as the primary column
      entity_type: file
      examples: ...
      summary: File SHA1
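The propagation step can be sketched as follows. The real system uses jq as the rule language; this stand-in evaluates only plain field paths such as `.key`, and the helper names are assumptions for illustration.

```python
# Simplified stand-in for the jq-based propagation rules: only plain
# field paths such as ".key" are handled; column shapes follow the
# config example above.

def apply_rule(rule, item):
    """Evaluate a jq-style field path (".a.b") against a dict."""
    value = item
    for part in rule.lstrip(".").split("."):
        value = value[part]
    return value

def propagate(columns, fact_item):
    """Build one ES document from a Fact-table item via column rules."""
    doc = {}
    for col in columns:
        doc[col["name"]] = apply_rule(col["rule"], fact_item)
    return doc

columns = [
    # primary column: becomes _id, identifying the ES document for updates
    {"name": "_id", "rule": ".key", "primary": True},
]
fact_item = {"key": "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18"}
doc = propagate(columns, fact_item)
assert doc["_id"] == "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18"
```

Because the primary column always yields the same `_id` for the same entity key, repeated propagation of the same entity updates the existing ES document instead of creating duplicates.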
31. Requirements for Dimension table
• Users can choose among various technologies
– To fulfill their different needs
– Currently only AWS Elasticsearch is supported
• Why does a search service come first?
– A search service helps users find the entities of interest by various coordinates, not only entity keys
• Reliability, availability, operations, etc.
– Efficient reads from terabytes of data
– AWS Elasticsearch is a managed service
– Its support is better than AWS CloudSearch (Apache Solr)
– Kibana is built in
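"Searching by extra coordinates" might look like the following query body against the dimension index. This is a hypothetical example: the field names are taken from the ES document shown on slide 26, and the thresholds are made up.

```python
import json

# Hypothetical Elasticsearch Query DSL body: find file entities by
# coordinates other than the entity key (field names follow the ES
# document example on slide 26; values are illustrative).
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"file_census_external": {"gte": 90000}}},
                {"range": {"received_timestamp": {"gte": "2016-09-01T00:00:00"}}},
            ]
        }
    }
}
print(json.dumps(query))
```

A key-value fact table cannot answer "which files were first seen after September with a high external census count"; the dimension table exists precisely for such coordinate-based queries.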
38. Reimport
• Blue/green strategy to alter the dimension schema
– Works in all scenarios
– Other alternatives are lighter-weight but more restrictive
• such as the Elasticsearch reindex API
• Two-phase reimport process
– Phase 1: reimport via Map/Reduce
– Phase 2: reimport via a 2nd propagator
• New components to enable reimport
– Logstash: keeps the DynamoDB Stream journal log
– Map/Reduce: runs the propagation logic in batch
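The two phases overlap in time, so their outputs must be reconciled before the new index goes live. A minimal sketch of that merge, assuming documents carry a `received_timestamp` and that "newest wins" is the conflict rule (both are assumptions for illustration):

```python
# Sketch of merging the two reimport phases; keying by _id and letting
# the newest received_timestamp win is an assumed conflict rule.

def merge_phases(batch_docs, stream_docs):
    """Combine Map/Reduce output (phase 1) with 2nd-propagator output
    (phase 2), keyed by _id; the newer document wins."""
    merged = {d["_id"]: d for d in batch_docs}
    for doc in stream_docs:
        current = merged.get(doc["_id"])
        if current is None or doc["received_timestamp"] > current["received_timestamp"]:
            merged[doc["_id"]] = doc
    return list(merged.values())

batch = [{"_id": "a", "received_timestamp": 1},
         {"_id": "b", "received_timestamp": 2}]
stream = [{"_id": "b", "received_timestamp": 3},
          {"_id": "c", "received_timestamp": 1}]
docs = merge_phases(batch, stream)
assert {d["_id"]: d["received_timestamp"] for d in docs} == {"a": 1, "b": 3, "c": 1}
```

In practice the primary column (`_id`) already makes ES writes idempotent, so replaying phase-2 documents after the phase-1 batch achieves the same effect.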
39. Overview of reimport
[Diagram: Metadata Store, Logstash, S3, AWS EMR, AWS Elasticsearch, Admin]
1. Data keeps coming in
2. Streams also connect to Logstash
3. Dump journal logs to S3
4. Start the reimport
5. Launch AWS EMR for the reimport MR jobs
6. Run MR jobs (input: journal logs; output: ES docs)
42. Metadata Store
• Entity model storage design on DynamoDB
• Optimized for fast reads by keys, with versioning
• Configurable propagation via jq as the ETL language
• Scalable implementation with the Kinesis Client Library
• Elasticsearch as the pilot dimension table engine
• Supports Alter Table for development needs
• Common serving-layer architectural pattern
• Fact → dimension tables by automatic propagation