Scott Miao
2013/12/14
Who am I

• RD, SPN, Trend Micro
• 3 years in the Hadoop ecosystem
• Expertise in HDFS/MR/HBase
• @takeshi.miao
THREATCONNECT
Product 1 Product 2

Product 3

…

IP, domain, URL, filename, process, file hash, virus detection, registry key, etc.

[Diagram] Data sources: Sandbox, APT KB, Threat Connect, Virus DB, TE Family Writeup, File Detection, Threat Web, Web Reputation

Processes and correlates different data sources

Most relevant threat report with actionable intelligence on a single portal
A GRAPH
The problems
• Storing a large volume of Graph data
• Accessing a large volume of Graph data
• Processing a large volume of Graph data
Big Data
STORE
Property Graph Model (1/3)

https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
Property Graph Model (2/3)
• A property graph has these elements
– a set of vertices
• each vertex has a unique identifier.
• each vertex has a set of outgoing edges.
• each vertex has a set of incoming edges.
• each vertex has a collection of properties defined by a map from key to value.
– a set of edges
• each edge has a unique identifier.
• each edge has an outgoing tail vertex.
• each edge has an incoming head vertex.
• each edge has a label that denotes the type of relationship between its two vertices.
• each edge has a collection of properties defined by a map from key to value.
Property Graph Model (3/3)
The domain model for
Property Graph Model
The relational model for
Property Graph Model
Massively scalable?

Active community?

Analyzable?
The winner is…
• We use HBase as a Graph Storage
– Google BigTable and PageRank
– HBaseCon2012

Yeah
We are NO. 1 !!
Use HBase to store Graph data (1/3)
• Schema design
– Table: vertex
‘<vertex-id>@<entity-type>’, ‘property:<property-key>@<property-value-type>’,
<property-value>

– Table: edge
‘<vertex1-row-key>--><label>--><vertex2-row-key>’,
‘property:<property-key>@<property-value-type>’, <property-value>
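The two row-key layouts above are plain string concatenations, so they can be sketched directly. A minimal sketch; the class and method names are illustrative, not taken from the HGraph code base:

```java
// Helpers that build row keys following the schema design above.
// Class and method names are illustrative only.
public class RowKeys {

    // vertex row key: '<vertex-id>@<entity-type>'
    public static String vertexRowKey(String vertexId, String entityType) {
        return vertexId + "@" + entityType;
    }

    // edge row key: '<vertex1-row-key>--><label>--><vertex2-row-key>'
    public static String edgeRowKey(String v1RowKey, String label, String v2RowKey) {
        return v1RowKey + "-->" + label + "-->" + v2RowKey;
    }

    public static void main(String[] args) {
        String v1 = vertexRowKey("myapps-ups.com", "domain");
        String v2 = vertexRowKey("http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe", "url");
        // prints the edge row key used in the sample on the next slide
        System.out.println(edgeRowKey(v1, "host", v2));
    }
}
```

Because the edge row key embeds both vertex row keys, a single rowkey-prefix scan on the edge table returns all outgoing edges of a vertex.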
Use HBase to store Graph data (2/3)
• Sample
– Table: vertex
‘myapps-ups.com@domain’, ‘property:ip@String’, ‘…’
‘myapps-ups.com@domain’, ‘property:asn@String’, ‘…’
…
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:path@String’, ‘…’
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:parameter@String’, ‘…’

– Table: edge
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’,
‘property:property1’, ‘…’
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’,
‘property:property2’, ‘…’
Use HBase to store Graph data (3/3)
• Tables
– create 'test.vertex', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
– create 'test.edge', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
It’s not me,
actually…

ACCESS
[Diagram] Data flow: 1. Put data (Data Sources → HBase); 2. Get data (HBase → Clients); 3. Process data (Algorithms over HBase)
Put Data
• HBase schema design is simple and human-readable
• It is easy to write your own dumping tool as needed
– MR/Pig/Completebulkload
– A cron job can clean up broken-edge data
– TTL can also help retire old data

• We already have a lot of practice with this task
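The broken-edge cleanup mentioned above first needs to split an edge row key back into its parts, so the job can check that both vertex rows still exist in the vertex table. A sketch of that parsing step, under the schema design shown earlier; the class name is illustrative, and the real HGraph tooling may differ:

```java
// Split an edge row key '<vertex1-row-key>--><label>--><vertex2-row-key>'
// into [tailVertexRowKey, label, headVertexRowKey], so a cleanup job can
// verify both endpoints still exist. Name is illustrative only.
public class EdgeKeyParser {

    public static String[] parse(String edgeRowKey) {
        // limit 3: the head vertex row key may itself contain arbitrary text
        String[] parts = edgeRowKey.split("-->", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("not an edge row key: " + edgeRowKey);
        }
        return parts;
    }
}
```

If either endpoint's Get against the vertex table comes back empty, the edge row is broken and can be deleted by the cron job.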
Get Data (1/2)
• A Graph API
• A better semantics for manipulating Graph
data
– As a wrapper for HBase Client API
– Rather than use HBase Client API directly

• Simple to Use
Vertex vertex = this.graph.getVertex("40012");
Vertex subVertex = null;
Iterable<Edge> edges =
    vertex.getEdges(Direction.OUT, "knows", "foo", "bar");
for (Edge edge : edges) {
  // an outgoing edge's adjacent vertex is its head (IN) vertex
  subVertex = edge.getVertex(Direction.IN);
  ...
}
Get Data (2/2)
• We implement the Blueprints API
– It provides interfaces as a spec. for users to implement
– Currently the basic query methods are implemented
– We can get benefits from it
• Support from other libraries, if we implement more of the Blueprints API
– http://www.tinkerpop.com/
– RESTful server, graph algorithms, dataflow, etc.
PROCESS
• Thanks to the human-readable HBase schema design and its naturally random-accessible data
– Write your own MR
– Write your own Pig/UDFs

• Ex. PageRank
– http://zh.wikipedia.org/wiki/Pagerank
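The per-vertex update that such a PageRank MR job computes can be shown in-memory on a tiny graph. A minimal sketch, assuming a simple adjacency-list input and ignoring dangling-vertex mass; it is not the HGraph implementation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory PageRank: each iteration gives every vertex a base
// (1 - d) / n, then spreads each vertex's current rank evenly across its
// outgoing edges, scaled by the damping factor d. Illustrative only.
public class PageRank {

    public static Map<String, Double> compute(Map<String, List<String>> graph,
                                              int iterations, double damping) {
        int n = graph.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : graph.keySet()) rank.put(v, 1.0 / n);

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : graph.keySet()) next.put(v, (1 - damping) / n);
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                List<String> outs = e.getValue();
                if (outs.isEmpty()) continue; // dangling vertex: simplification
                double share = rank.get(e.getKey()) / outs.size();
                for (String dst : outs) {
                    next.merge(dst, damping * share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("a", Arrays.asList("b"));
        g.put("b", Arrays.asList("a"));
        System.out.println(compute(g, 20, 0.85));
    }
}
```

In the MR version, one iteration becomes one job: the map phase emits each vertex's rank share to its out-neighbours read from the edge table, and the reduce phase sums the shares per vertex.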
HGraph
• An open-source project on GitHub
– https://github.com/takeshimiao/HGraph

• A partial implementation released from our internal pilot project
– Follows the HBase schema design
– Reads data via the Blueprints API
– Processes data with PageRank

• Download or ‘git clone’ it
– Use ‘mvn clean package’
– Run on a unix-like OS
• Using Windows may cause some errors
There is another project

http://thinkaurelius.github.io/faunus/

http://thinkaurelius.github.io/titan/
OBSERVATIONS
YARN
• It seems to turn Hadoop into a de-facto big data platform
– Loosens the tight binding to the MR framework and accommodates other frameworks

• A bunch of data-processing frameworks are migrating to it
http://hortonworks.com/hadoop/yarn/
SQL-on-Hadoop
• Impala vs. Hive (Stinger and Tez)
– Impala seems more mature than Hive
Hive is built on top of a batch-processing framework (even MRv2), but Impala goes its own way!!
Todd Lipcon
Committer/PMC member on the Apache Thrift, HBase, and Hadoop projects

• YARN !!
– Hive Stinger and Tez are based on YARN (HDP2)
– Impala also plans to migrate to YARN (CDH5)
– Even HBase !! (HOYA)
HBase is a popular noSQL
• From what I saw in Europe/CA/China, HBase is the most popular noSQL solution if you have already adopted Hadoop
• Other noSQLs will not free you from OPS pain points
• So the best way is to pick the right tool and play it well
http://www.slideshare.net/Hadoop_Summit/what-is-the-point-of-hadoop?from_search=1 #p34
Attack on graph

