Scott Miao
2013/12/14
Attack on graph
This talk describes how Trend Micro SPN uses HBase to solve its graph model problem and uses PageRank to process the graph data for predictive purposes. A partial implementation of our graph solution, named HGraph, is also available on GitHub for anyone interested in this topic.

Published in: Technology

Attack on graph

  1. 1. Scott Miao 2013/12/14
  2. 2. Who am I
     • RD, SPN, Trend Micro
     • 3 years in the Hadoop ecosystem
     • Expertise in HDFS/MR/HBase
     • @takeshi.miao
  3. 3. THREATCONNECT
  4. 4. [Diagram: Threat Connect]
     • Product 1, Product 2, Product 3, …
     • IP, domain, URL, filename, process, file hash, virus detection, registry key, etc.
     • Diagram labels: Sandbox, APT KB, Threat Connect, Virus DB, TE, Family Writeup, File Detection, Threat Web, Web Reputation
     • Processes and correlates different data sources
     • Most relevant threat report with actionable intelligence on a single portal
  5. 5. A GRAPH
  6. 6. The problems
     • Store large volumes of graph data
     • Access large volumes of graph data
     • Process large volumes of graph data
  7. 7. Big Data
  8. 8. STORE
  9. 9. Property Graph Model (1/3) https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
  10. 10. Property Graph Model (2/3)
      • A property graph has these elements
        – a set of vertices
          • each vertex has a unique identifier.
          • each vertex has a set of outgoing edges.
          • each vertex has a set of incoming edges.
          • each vertex has a collection of properties defined by a map from key to value.
        – a set of edges
          • each edge has a unique identifier.
          • each edge has an outgoing tail vertex.
          • each edge has an incoming head vertex.
          • each edge has a label that denotes the type of relationship between its two vertices.
          • each edge has a collection of properties defined by a map from key to value.
  11. 11. Property Graph Model (3/3)
  12. 12. The domain model for Property Graph Model
  13. 13. The relational model for Property Graph Model
  14. 14. Massively scalable? Active community? Analyzable?
  15. 15. The winner is…
      • We use HBase as our graph storage
        – Google BigTable and PageRank
        – HBaseCon2012
      Yeah, we are No. 1!!
  16. 16. Use HBase to store Graph data (1/3)
      • Schema design
        – Table: vertex
          '<vertex-id>@<entity-type>', 'property:<property-key>@<property-value-type>', <property-value>
        – Table: edge
          '<vertex1-row-key>--><label>--><vertex2-row-key>', 'property:<property-key>@<property-value-type>', <property-value>
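The row-key and column layout above can be sketched in plain Java. This is only an illustration of the string format shown on the slide: the class `GraphRowKeys` and all of its method names are hypothetical and are not taken from HGraph, and the sketch assumes that labels and vertex keys never contain the `-->` separator.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HGraph) that builds and splits keys in the slide's format. */
final class GraphRowKeys {
    static final String TYPE_SEPARATOR = "@";
    static final String EDGE_SEPARATOR = "-->";
    static final String FAMILY = "property";

    /** Vertex row key: <vertex-id>@<entity-type>. */
    static String vertexKey(String vertexId, String entityType) {
        return vertexId + TYPE_SEPARATOR + entityType;
    }

    /** Edge row key: <vertex1-row-key>--><label>--><vertex2-row-key>. */
    static String edgeKey(String fromVertexKey, String label, String toVertexKey) {
        return fromVertexKey + EDGE_SEPARATOR + label + EDGE_SEPARATOR + toVertexKey;
    }

    /** Column name: property:<property-key>@<property-value-type>. */
    static String column(String propertyKey, String valueType) {
        return FAMILY + ":" + propertyKey + TYPE_SEPARATOR + valueType;
    }

    /** Entity type is taken after the LAST '@', because ids such as URLs may contain '@'. */
    static String entityType(String vertexKey) {
        int at = vertexKey.lastIndexOf(TYPE_SEPARATOR);
        if (at < 0) {
            throw new IllegalArgumentException("no entity type in: " + vertexKey);
        }
        return vertexKey.substring(at + 1);
    }

    /** Splits an edge key into {from, label, to}; assumes no part contains "-->". */
    static String[] splitEdgeKey(String edgeKey) {
        String[] parts = edgeKey.split(Pattern.quote(EDGE_SEPARATOR), -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed edge key: " + edgeKey);
        }
        return parts;
    }

    public static void main(String[] args) {
        String domain = vertexKey("myapps-ups.com", "domain");
        String url = vertexKey("http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe", "url");
        System.out.println(edgeKey(domain, "host", url));
        System.out.println(column("ip", "String"));
    }
}
```

Because the source vertex key leads the edge row key, all outgoing edges of a vertex sort next to each other, which is what makes this layout convenient for range scans.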
  17. 17. Use HBase to store Graph data (2/3)
      • Sample
        – Table: vertex
          'myapps-ups.com@domain', 'property:ip@String', '…'
          'myapps-ups.com@domain', 'property:asn@String', '…'
          …
          'http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:path@String', '…'
          'http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:parameter@String', '…'
        – Table: edge
          'myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:property1', '…'
          'myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:property2', '…'
  18. 18. Use HBase to store Graph data (3/3)
      • Tables
        – create 'test.vertex', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
        – create 'test.edge', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
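The TTL value in these table definitions is given in seconds, so '7776000' corresponds to 90 days. A small sketch of that arithmetic follows; the class `TtlSketch` and its methods are hypothetical, and `isExpired` only approximates the idea of a cell outliving its TTL rather than reproducing HBase's internal logic.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the TTL arithmetic; not HBase code. */
final class TtlSketch {
    /** Converts a retention period in days to the seconds value used for a column family TTL. */
    static long ttlSeconds(long days) {
        return Duration.ofDays(days).getSeconds();
    }

    /** True when a cell written at cellTimestampMillis is older than the TTL at nowMillis. */
    static boolean isExpired(long cellTimestampMillis, long nowMillis, long ttlSeconds) {
        return nowMillis - cellTimestampMillis > ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(ttlSeconds(90)); // 7776000
    }
}
```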
  19. 19. It’s not me, actually… ACCESS
  20. 20. [Diagram: data flow around HBase] 1. Put data (Data Sources), 2. Get data (Clients), 3. Process data (Algorithms)
  21. 21. Put Data
      • The HBase schema design is simple and human-readable
      • It is easy to write your own dumping tool as needed
        – MR/Pig/Completebulkload
        – A cron job can clean up broken-edge data
        – TTL can also help to retire old data
      • We already have a lot of practice with this task
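The broken-edge cleanup mentioned above could look roughly like the following in-memory sketch. It assumes that a "broken" edge means an edge row whose source or target vertex row no longer exists, which the slides do not define; the class `BrokenEdgeFinder` is hypothetical, and a real job would scan the HBase tables (for example with MR) instead of working on Java collections.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: finds edge row keys whose endpoint vertex rows are missing. */
final class BrokenEdgeFinder {
    private static final String EDGE_SEPARATOR = "-->";

    static List<String> findBrokenEdges(Set<String> vertexRowKeys, Collection<String> edgeRowKeys) {
        List<String> broken = new ArrayList<>();
        for (String edgeKey : edgeRowKeys) {
            // Expected shape: <vertex1-row-key>--><label>--><vertex2-row-key>
            String[] parts = edgeKey.split(Pattern.quote(EDGE_SEPARATOR), -1);
            boolean wellFormed = parts.length == 3;
            if (!wellFormed
                    || !vertexRowKeys.contains(parts[0])
                    || !vertexRowKeys.contains(parts[2])) {
                broken.add(edgeKey); // candidate for deletion by the cleanup job
            }
        }
        return broken;
    }
}
```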
  22. 22. Get Data (1/2)
      • A Graph API
      • Better semantics for manipulating graph data
        – A wrapper around the HBase Client API
        – Rather than using the HBase Client API directly
      • Simple to use

        // Vertex, Edge and Direction come from the Blueprints API (com.tinkerpop.blueprints)
        // Look up a vertex by id, then walk its outgoing edges that carry one of the given labels
        Vertex vertex = this.graph.getVertex("40012");
        Vertex subVertex = null;
        Iterable<Edge> edges = vertex.getEdges(Direction.OUT, "knows", "foo", "bar");
        for (Edge edge : edges) {
            subVertex = edge.getVertex(Direction.OUT);
            ...
        }
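One way a wrapper like this could serve an outgoing-edge query is a prefix scan over the sorted edge row keys. The sketch below simulates that with a `TreeSet` standing in for the HBase edge table; it is an assumption about how such a lookup might work, not a description of HGraph's actual implementation, and the class `EdgePrefixScan` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch: outgoing-edge lookup as a prefix scan over sorted row keys. */
final class EdgePrefixScan {
    private static final String EDGE_SEPARATOR = "-->";

    // Stand-in for the HBase edge table, which also keeps row keys in sorted order.
    private final NavigableSet<String> edgeRows = new TreeSet<>();

    void putEdge(String fromVertexKey, String label, String toVertexKey) {
        edgeRows.add(fromVertexKey + EDGE_SEPARATOR + label + EDGE_SEPARATOR + toVertexKey);
    }

    /** Returns the target vertex keys of all edges leaving fromVertexKey with the given label. */
    List<String> outNeighbors(String fromVertexKey, String label) {
        String prefix = fromVertexKey + EDGE_SEPARATOR + label + EDGE_SEPARATOR;
        List<String> targets = new ArrayList<>();
        // Rows sharing the prefix are contiguous, so start at the prefix and stop at the first miss.
        for (String row : edgeRows.tailSet(prefix, true)) {
            if (!row.startsWith(prefix)) {
                break;
            }
            targets.add(row.substring(prefix.length()));
        }
        return targets;
    }
}
```

With real HBase, the same idea would map to a scan that starts at the prefix and stops once row keys no longer share it.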
  23. 23. Get Data (2/2)
      • We implement the Blueprints API
        – It provides interfaces as a spec for users to implement
        – Currently the basic query methods are implemented
        – We can benefit from it
      • Other libraries become usable if we implement more of the Blueprints API
        – http://www.tinkerpop.com/
        – RESTful server, graph algorithms, dataflow, etc.
  24. 24. PROCESS
  25. 25. • Thanks to the human-readable HBase schema design and HBase's natural random access
        – Write your own MR
        – Write your own Pig/UDFs
      • Ex. PageRank
        – http://zh.wikipedia.org/wiki/Pagerank
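For readers unfamiliar with PageRank, here is a minimal single-machine sketch of the iterative algorithm in plain Java. It is not the MR job shipped with HGraph; the class `PageRankSketch`, the damping factor of 0.85 and the even redistribution of rank from vertices without outgoing edges are all choices made for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory PageRank sketch; a cluster version would run as MR over HBase. */
final class PageRankSketch {
    static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                        double damping, int iterations) {
        // Collect every vertex that appears as a source or as a target.
        Set<String> vertices = new TreeSet<>(outLinks.keySet());
        for (List<String> targets : outLinks.values()) {
            vertices.addAll(targets);
        }
        int n = vertices.size();
        Map<String, Double> rank = new HashMap<>();
        if (n == 0) {
            return rank;
        }
        for (String v : vertices) {
            rank.put(v, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : vertices) {
                next.put(v, 0.0);
            }
            double danglingMass = 0.0; // rank held by vertices with no outgoing edges
            for (String v : vertices) {
                List<String> targets = outLinks.getOrDefault(v, Collections.emptyList());
                if (targets.isEmpty()) {
                    danglingMass += rank.get(v);
                    continue;
                }
                double share = rank.get(v) / targets.size();
                for (String t : targets) {
                    next.merge(t, share, Double::sum);
                }
            }
            for (String v : vertices) {
                double incoming = next.get(v) + danglingMass / n;
                next.put(v, (1.0 - damping) / n + damping * incoming);
            }
            rank = next;
        }
        return rank;
    }
}
```

In the schema above, the out-link map would come from the edge table: each edge row key already contains both the source and the target vertex key.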
  26. 26. HGraph
      • An open project hosted on GitHub
        – https://github.com/takeshimiao/HGraph
      • A partial implementation released from our internal pilot project
        – Follows the HBase schema design
        – Reads data via the Blueprints API
        – Processes data with PageRank
      • Download or 'git clone' it
        – Build with 'mvn clean package'
        – Run on a Unix-like OS
          • Running on Windows may produce some errors
  27. 27. There is another project
      http://thinkaurelius.github.io/faunus/
      http://thinkaurelius.github.io/titan/
  28. 28. OBSERVATIONS
  29. 29. YARN
      • It seems to be turning Hadoop into a de facto big data platform
        – It loosens the coupling to the MR framework and accommodates other frameworks
      • A number of data processing frameworks are migrating to it
  30. 30. http://hortonworks.com/hadoop/yarn/
  31. 31. SQL-on-Hadoop
      • Impala vs. Hive (Stinger and Tez)
        – Impala seems more mature than Hive
        – Hive is built on top of a batch processing framework (even MRv2), but Impala goes its own way!!
        – Todd Lipcon, committer/PMC member on the Apache Thrift, HBase, and Hadoop projects
      • YARN!!
        – Hive Stinger and Tez are based on YARN (HDP2)
        – Impala also plans to migrate to YARN (CDH5)
        – Even HBase!! (HOYA)
  32. 32. HBase is a popular NoSQL
      • From what I saw in Europe/CA/China, HBase is the most popular NoSQL solution if you have already adopted Hadoop
      • Other NoSQL stores will not save you from ops pain points
      • So the best way is to pick the right tool for you and use it well
  33. 33. http://www.slideshare.net/Hadoop_Summit/what-is-the-point-of-hadoop?from_search=1 #p34