Unlike business intelligence (BI), threat intelligence (TI) processing needs more focus on agility, flexibility, scalability, and shareability. Especially when you have a petabyte of data, you never want an investment that can do only one thing.
In this talk we share a real case of building a generic entity database that supports real-time intelligence queries for any kind of data producer and consumer, and show how to construct a machine-learning pipeline on top of the entity database that adapts to an ever-changing reality.
13. Requirements for Fact table
• A flexible data schema design on the fact table
– To store metadata for various types of entities
– To store values for various types of entities
• Efficient writes of attributes for a specific entity into petabytes of data
• Users can also specify a timestamp for attributes (versioning)
• Efficient reads of entities by given keys from petabytes of data
– Other criteria include timestamp, attribute name, and attribute value
• Reliability, availability, operations, etc.
– An AWS managed service is better
– AWS takes care of these "-ilities" for us
19. An efficient write of attributes for a specific entity into petabytes of data
// InputAttributes API
POST /metadatastore/v1/entities/aaa…/attributes/src_1
{
  "entity_type": "file",
  "attributes": [
    {
      "name": "original_filename",
      "value": "wd.sys"          # no timestamp given: defaults to current time
    },
    {
      "name": "download_from",
      "value": "https://www.microsoft.com/zh-tw/",
      "timestamp": 1470280448   # 10-digit epoch seconds
    }
  ]
}
[Diagram: InputAttributes API (POST /metadatastore/v1/entities/abc…/attributes/src_1) writing into Fact table partitions p1…pn]
1. Calculate a hash based on entity ID (partition key)
2. Split items based on timestamp (sort key)
3. Insert item #1
4. Insert item #2
5. Items in one partition are sorted by timestamp (sort key)
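The write path above can be sketched in plain Python. This is a mental model only: the number of partitions, the SHA-1 hashing scheme, and the in-memory store are illustrative assumptions, since DynamoDB handles partitioning and sorting internally.

```python
import bisect
import hashlib
import time

NUM_PARTITIONS = 4  # illustrative; DynamoDB manages partitioning itself

# partition number -> list of (sort_key, (name, value)), kept sorted
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def put_attribute(entity_id, name, value, timestamp=None):
    """Insert one attribute item, mimicking the Fact table write path."""
    # 1. Calculate a hash based on entity ID (partition key)
    digest = hashlib.sha1(entity_id.encode()).hexdigest()
    partition = int(digest, 16) % NUM_PARTITIONS
    # 2. Split items based on timestamp (sort key);
    #    a missing timestamp defaults to "now", as in the API example
    sort_key = timestamp if timestamp is not None else int(time.time())
    # 3./4. Insert items; 5. each partition stays sorted by sort key
    bisect.insort(partitions[partition], (sort_key, (name, value)))
    return partition, sort_key

p1, _ = put_attribute("aaa", "download_from",
                      "https://www.microsoft.com/zh-tw/", 1470280448)
p2, _ = put_attribute("aaa", "original_filename", "wd.sys")
assert p1 == p2  # same entity ID -> same partition
```

Because the entity ID alone decides the partition, all attributes of one entity land together, which is what makes the later read-by-key efficient.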
21. An efficient read of entities by given keys from petabytes of data
// GetEntityList API
GET /metadatastore/v1/entities
{
  "entities": [
    {
      "key": "aaa...",
      "type": "file"
    },
    {
      "key": "abc…",
      "type": "file"
    }
  ],
  "attributes": [
    {
      "source": "src_1",
      "name": "original_filename"
    },
    {
      "source": "src_1",
      "name": "download_from"
    }
  ]
}
// pseudo code for the DynamoDB query
resultSet = // …
for entityKey in entityKeys
  SELECT original_filename, download_from
  FROM fact_table
  WHERE
    # find items in range
    partitionKey = entityKey
    AND sortKey <= timestamp
  FILTER
    # filter by source
    original_filename.src = "src_1"
    AND download_from.src = "src_1"
  // …
  rs_tmp = // results from the DynamoDB SDK
  for r_tmp in rs_tmp
    for f in needFields
      if r_tmp.exists(f) then
        resultSet.add(r_tmp[f])
        needFields.remove(f)
      fi
    done
  done
done
return resultSet
[Diagram: GetEntityList API reading items from Fact table partition pn]
1. Transform the API call into DynamoDB queries
2. Get items by partitionKey and sortKey
3. Project attributes from items
4. Filter items by source
5. Arrange DynamoDB results into API results
6. Return API results
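The projection and filtering steps above can be sketched in Python over items already fetched from the Fact table. The item shape and function name are illustrative assumptions, not the deck's actual implementation.

```python
# Minimal sketch of the read path (steps 2-6), assuming each fetched item
# is a dict {"name", "value", "src", "sort_key"}; names are illustrative.

def get_entity_attributes(items, wanted, source, max_timestamp):
    """Project the newest value per wanted attribute, filtered by source."""
    need = set(wanted)
    result = {}
    # 2. Get items by partitionKey and sortKey (range condition)
    in_range = [i for i in items if i["sort_key"] <= max_timestamp]
    # Walk newest-first so the first hit per attribute wins
    for item in sorted(in_range, key=lambda i: i["sort_key"], reverse=True):
        # 4. Filter items by source
        if item["src"] != source:
            continue
        # 3. Project attributes from items
        if item["name"] in need:
            result[item["name"]] = item["value"]
            need.discard(item["name"])
        if not need:  # stop once every requested field is resolved
            break
    # 5./6. Arrange and return API results
    return result

items = [
    {"name": "original_filename", "value": "wd.sys",
     "src": "src_1", "sort_key": 1470280000},
    {"name": "download_from", "value": "https://www.microsoft.com/zh-tw/",
     "src": "src_1", "sort_key": 1470280448},
    {"name": "download_from", "value": "other-source-value",
     "src": "src_2", "sort_key": 1470280500},
]
print(get_entity_attributes(
    items, ["original_filename", "download_from"], "src_1", 1470290000))
```

Walking newest-first and removing satisfied fields mirrors the `needFields` early-exit in the pseudocode: versioned history can be long, but the query stops as soon as the latest value of each requested attribute is found.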
26. Propagation Output to Dimension table, e.g. Elasticsearch
• For input into ES
– ES document (JSON)
– For searching entities by extra coordinates
• Primary column
– To identify the same document in ES for updates
– Maps to _id in Elasticsearch
{
  "_id": "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18",
  "rescan_decision": "Friday Intel...",
  "dump_mip2scan": "NEW VERISIGN DDOS ...",
  "file_census_external": 91232,
  "received_timestamp": "2016-09-23T16:19:35"
}
output:
  engine: ...
  columns:
    - name: _id
      rule: .key
      primary: yes      # marks this column as the primary column
      entity_type: file
      examples: ...
      summary: File SHA1
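The propagation step can be sketched as follows. The real system uses jq as the rule language; this stand-in evaluates only plain field paths such as `.key`, and the helper names are assumptions for illustration.

```python
# Simplified stand-in for the jq-based propagation rules: only plain
# field paths such as ".key" are handled; column shapes follow the
# config example above.

def apply_rule(rule, item):
    """Evaluate a jq-style field path (".a.b") against a dict."""
    value = item
    for part in rule.lstrip(".").split("."):
        value = value[part]
    return value

def propagate(columns, fact_item):
    """Build one ES document from a Fact-table item via column rules."""
    doc = {}
    for col in columns:
        doc[col["name"]] = apply_rule(col["rule"], fact_item)
    return doc

columns = [
    # primary column: becomes _id, identifying the ES document for updates
    {"name": "_id", "rule": ".key", "primary": True},
]
fact_item = {"key": "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18"}
doc = propagate(columns, fact_item)
assert doc["_id"] == "b8e76297d0bfbf889c38cff3e80c1e14de9f7a18"
```

Because the primary column always yields the same `_id` for the same entity key, repeated propagation of the same entity updates the existing ES document instead of creating duplicates.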
31. Requirements for Dimension table
• Users can choose among various technologies
– To fulfill their different needs
– Currently only AWS Elasticsearch is supported
• Why does a search service come first?
– A search service helps users find the entities of interest by various coordinates, not only entity keys
• Reliability, availability, operations, etc.
– Efficient reads from terabytes of data
– AWS Elasticsearch is a managed service
– Its support is better than AWS CloudSearch (Apache Solr)
– Kibana is built in
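"Searching by extra coordinates" might look like the following query body against the dimension index. This is a hypothetical example: the field names are taken from the ES document shown on slide 26, and the thresholds are made up.

```python
import json

# Hypothetical Elasticsearch Query DSL body: find file entities by
# coordinates other than the entity key (field names follow the ES
# document example on slide 26; values are illustrative).
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"file_census_external": {"gte": 90000}}},
                {"range": {"received_timestamp": {"gte": "2016-09-01T00:00:00"}}},
            ]
        }
    }
}
print(json.dumps(query))
```

A key-value fact table cannot answer "which files were first seen after September with a high external census count"; the dimension table exists precisely for such coordinate-based queries.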
38. Reimport
• Blue/green strategy to alter the dimension schema
– Works in all scenarios
– Other alternatives are lighter-weight but more restrictive
• such as the Elasticsearch reindex API
• Two-phase reimport process
– Phase 1: reimport via Map/Reduce
– Phase 2: reimport via a 2nd propagator
• New components to enable reimport
– Logstash: keeps the DynamoDB Stream journal log
– Map/Reduce: runs the propagation logic in batch
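The two phases overlap in time, so their outputs must be reconciled before the new index goes live. A minimal sketch of that merge, assuming documents carry a `received_timestamp` and that "newest wins" is the conflict rule (both are assumptions for illustration):

```python
# Sketch of merging the two reimport phases; keying by _id and letting
# the newest received_timestamp win is an assumed conflict rule.

def merge_phases(batch_docs, stream_docs):
    """Combine Map/Reduce output (phase 1) with 2nd-propagator output
    (phase 2), keyed by _id; the newer document wins."""
    merged = {d["_id"]: d for d in batch_docs}
    for doc in stream_docs:
        current = merged.get(doc["_id"])
        if current is None or doc["received_timestamp"] > current["received_timestamp"]:
            merged[doc["_id"]] = doc
    return list(merged.values())

batch = [{"_id": "a", "received_timestamp": 1},
         {"_id": "b", "received_timestamp": 2}]
stream = [{"_id": "b", "received_timestamp": 3},
          {"_id": "c", "received_timestamp": 1}]
docs = merge_phases(batch, stream)
assert {d["_id"]: d["received_timestamp"] for d in docs} == {"a": 1, "b": 3, "c": 1}
```

In practice the primary column (`_id`) already makes ES writes idempotent, so replaying phase-2 documents after the phase-1 batch achieves the same effect.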
39. Overview of reimport
[Diagram: Metadata Store, Logstash, S3, AWS EMR, AWS Elasticsearch, Admin]
1. Data keeps coming in
2. Streams also connect to Logstash
3. Dump journal logs to S3
4. Start the reimport
5. Launch AWS EMR for the reimport MR jobs
6. Run MR jobs (input: journal logs; output: ES docs)
42. Metadata Store
• Entity model storage design on DynamoDB
• Optimized for fast reads by keys, with versioning
• Configurable propagation via jq as the ETL language
• Scalable implementation with the Kinesis Client Library
• Elasticsearch as the pilot dimension table engine
• Supports Alter Table for development needs
• Common serving-layer architectural pattern
• Fact → dimension tables by automatic propagation