A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase

HBaseCon
A Graph Service for Global Web
Entities Traversal and Reputation
Evaluation Based on HBase
Chris Huang, Scott Miao
2014/5/5
Who are we
• Scott Miao
• Developer, SPN, Trend Micro
• Worked on hadoop ecosystem since 2011
• Expertise in HDFS/MR/HBase
• Contributor for HBase/HDFS
• @takeshi.miao
• Chris Huang
• RD Manager, SPN, Trend Micro
• Hadoop Architect
• Worked on hadoop ecosystem since 2009
• Contributor for Bigtop
• @chenhsiu48
Our blog ‘Dumbo in TW’: http://dumbointaiwan.blogspot.tw/
Challenges We Faced
New Unique Malware Discovered
http://www.av-test.org/en/statistics/malware/
Social Engineering vs. Cyber Attacks
Trend Micro Defense Strategy
Layer of Protection
Exposure
Infection
Dynamic
Web, Emails
File Signatures
File Behaviors
See The Threat Entity
Connectivity
Connectivity From Different Data Sources
Threat
Connect
Sand-
box
File
Detecti
on
Threat
Web
Web
Reputa
tionFamily
Write-
up
TE
Virus
DB
APT KB
ThreatWeb: Threat Entities as a Graph
Threat Entities Relation Graph
F
D
I
DD
D
I
I
D
DF
F
D
I
I
I
E
E
E
E
E
E
I
I
ID
D
D
D
F
D
D F
I
I
F
D
E
File
IP
Domain
Email
Most Entity Reputations are Unknown
F
D
I
DD
D
I
I
D
DF
F
D
I
I
I
E
E
E
E
E
E
I
I
ID
D
D
D
F
D
D F
I
I
F
D
E
File
IP
Domain
Email
Security Solution Dilemma – Long Tail
Prevalence
Entities
Known good/bad
Traditional heuristic detection
Big Datacan help!
How can we detect the rest
effectively?
Inspired by PageRank
• Too many un-visited pages!
• Users browse pages through
links
• Let users’ clicks (BIG DATA) tell
us the rankings of those un-
visited pages!
Revised PageRank Algorithm
• Too many un-rated threat
entities!
• Malware activities interact with
threat entitles
• Let malware’s behaviors (BIG
DATA) tell us the reputations
of those un-rated threat entities!
F
D
I
DD
D
I
I
D
DF
F
D
II
I
E
E
E
E E
E
I
I
ID
D
D
D
F
D
D F
I
The Graph Problem
The Problems
• Store large size of Graph data
• Access large size of Graph data
• Process large size of Graph data
Data volume
• Dump ~450MB (150 bytes * 3,000,000 records) data
into Graph per day
– Extract from 3GB of data
• Keep it for 3 month
– ~450MB * 90 = ~40,500MB = ~39GB
– With Snappy compression
– ~20 - 22GB
• Dataset
– ~40,000,000 vertices and ~100,000,000 edges
• Data query volume about hundreds of thousands per
day
BIG
GRAPH
!!
http://news.gamme.com.tw/544510
Store
Property Graph Model
Name: Jack
Sex: Male
Marriage: true
Name: Merry
Sex: Female
Marriage: true
marries
Name: Vivian
Sex: Female
Marriage: false
Name: John
Sex: Male
Marriage: true
Date: 2010/5/5
Name: Emily
Sex: Female
Marriage: true
From a soap opera…
https://github.com/tinkerpop/blueprints/wiki/Property-Graph-
Model
• We use HBase as a Graph Storage
– Google BigTable and PageRank
– HBaseCon2012
• Storing and manipulating graphs in HBase
The winner is…
Massive
scalable ?
Active
community ?
Analyzable ?
Use HBase to store Graph data (1/3)
• Tables
– create 'vertex', {NAME => 'property',
BLOOMFILTER => 'ROW', COMPRESSION
=> ‘SNAPPY', TTL => '7776000'}
– create 'edge', {NAME => 'property',
BLOOMFILTER => 'ROW', COMPRESSION
=> ‘SNAPPY', TTL => '7776000'}
Use HBase to store Graph data (2/3)
• Schema design
– Table: vertex
– Table: edge
‘<vertex-id>||<entity-type>’, ‘property:<property-key>@<property-value-type>’,
<property-value>
‘<vertex1-row-key>--><label>--><vertex2-row-key>’,
‘property:<property-key>@<property-value-type>’, <property-value>
Use HBase to store Graph data (3/3)
• Sample
– Table: vertex
– Table: edge
‘myapps-ups.com||domain’, ‘property:ip@String’, ‘…’
‘myapps-ups.com||domain’, ‘property:asn@String’, ‘…’
…
‘track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:path@String’, ‘…’
‘track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:parameter@String’, ‘…’
‘myapps-ups.com||domain-->host-->track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’,
‘property:property1’, ‘…’
‘myapps-ups.com||domain-->host-->track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’,
‘property:property2’, ‘…’
Keep your rowkey length short
• With long rowkey length
– It does not impact your query performance
– But it does impact your algorithm MR
• OutOfMemoryException
• Use something like HASH function to keep
your rowkey length short
– Use the hash value as rowkey
– Put the original value into a property
Overall Architecture
Src
table
snapshot
Clone
table
Clone
table
Graph
table
snapshot
Clone
table
Clone
table
HBase
Source
C
Source
B
source
A
HDFS
1. Collect data
2. reprocess
& dump data
3. Get Data
Client
4. Process Data
Algorithms
Intermediate data
On HDFS
Src
table
snapsho
t
Clone
table
Clone
table
Graph
table
snapsho
t
Clone
table
Clone
table
HBase
Source
C
Source
B
source
A
HDFS
reprocess &
dump data
Client
Algorithms
Intermediate data
On HDFS
Preprocess and Dump Data
• HBase schema design is simple and human-
readable
• It is easy to write your dumping tool if needed
– MR/Pig/Completebulkload
– Can write cron-job to clean up the broken-edge
data
– TTL can also help to retire old data
• We already have a lot practices for these
tasks
Access
Src
table
snapsho
t
Clone
table
Clone
table
Graph
table
snapsho
t
Clone
table
Clone
table
HBase
Source
C
Source
B
source
A
HDFS
Get Data
Client
Algorithms
Intermediate data
On HDFS
Get Data (1/2)
• A Graph API
• A better semantic for manipulating Graph data
– As a wrapper for HBase Client API
– Rather than use HBase Client API directly
• A malware exploring sample
Vertex vertex = this.graph.getVertex(“malware");
Vertex subVertex = null;
Iterable<Edge> edges =
vertex.getEdges(Direction.OUT, “connect", “infect", “trigger");
for(Edge edge : edges) {
subVertex = edge.getVertex(Direction.OUT);
...
}
Get Data (2/2)
• We implement blueprints API
– It provides interfaces as spec. for users to impl.
• 824 stars, 173 forks on github
– We can get more benefits from it
• plug-and-play different Blueprints-enabled graph
backends
– Traversal language, RESTful server, dataflow, etc
– http://www.tinkerpop.com/
– Currently basic query methods are implemented
Clients
• Real time Client
– Client systems
• they need associated Graph data for a specific entity via RESTful API
– Usually retrieve two levels of graph data
– Quick responsiveness supported by HBase
• With rowkey random access and appropriate schema design
• HTable.get(),Scan.setStartRow(), Scan.setStopRow()
• Batch client
– Threat experts
– Pick one entity and how many levels interested in, generate a graph file
format used by tools
• To visualize and navigate what whether users interested in
• Graph Exploring Tools
– Threat experts
– Find out sub-graphs by given criteria
• E.g. How many levels or associated vertices
Malware Exploring Performance (1/3)
• one request
– Use Malware exploring sample again
– 1 vertex with 2 levels associated instances (2 ~ 9 vertices)
• Dataset
– 42,133,610 vertices and 108,355,774 edges
• Total requests
– 31,764 requests * 100 clients = 3,176,400
Vertex vertex = this.graph.getVertex(“malware");
Vertex subVertex = null;
Iterable<Edge> edges =
vertex.getEdges(Direction.OUT, “connect", “infect", “trigger");
for(Edge edge : edges) {
subVertex = edge.getVertex(Direction.OUT);
...
}
Malware Exploring Performance (2/3)
Malware Exploring Performance (3/3)
• Some statistics
– Mean: 51.61 ms
– Standard Deviation: 653.57 ms
– Empirical rule: 68%, 95%, 99.7%
• 99.7% of requests below 2.1 seconds
• But response time variances still happen
– Use Cache layer between client and HBase
– Warm-up after new data come in
Process
Src
table
snapshot
Clone
table
Clone
table
Graph
table
snapshot
Clone
table
Clone
table
HBase
Source
C
Source
B
source
A
HDFS
Client
Process Data
Algorithms
Intermediate data
On HDFS
• Human-readable HBase schema design
– Write your own MR
– Write your own Pig/UDFs
• So we can write the algorithms to further
process our graph data
– To predict unknown reputation by known threats
– E.g. a revised PageRank algorithm
Data process flow
Src
table
snapshot
Clone
table Data on HDFS
Algorithms
(MR, Pig UDF)
Clone
table
Processed completed
Graph
table
snapshot
Clone
table
Clone
table
Intermediate data on HDFS
HBase
1. Dump daily
data
2. Take
snapshot
3. Clone
snapshot
4. Process data
iteratively
(takes hours)
4.1 generate
Intermediate
data
5. Process
complete 6. Dump
processed data
with timerange
A customized TableInputFormat (1/2)
• One Mapper for one region by default
– Each Mapper process too much data
• OutOfMemoryException
• Too long to process
– Use small split region size ?
• Will overload your HBase cluster !!
• Before: about ~40 Mappers
• After: about ~500 Mappers
A customized TableInputFormat (2/2)
Clone
Graph
Table
MR - Pick
Candidates
Combination
Candidates
list file
<encodedRegionName>t<startKey>t<endKey>
…
d3d1749f3486e850b263c7ecb2424dd3tstartKey_1tendKey_1
d3d1749f3486e850b263c7ecb2424dd3tstartKey_2tendKey_2
d3d1749f3486e850b263c7ecb2424dd3tstartKey_3tendKey_3
Cd91c08d656a19bdb180e0b7f8896575tstartKey_4tendKey_4
Cd91c08d656a19bdb180e0b7f8896575tstartKey_5tendKey_5
…
CustTableInp
utFormat
MR - algorithm
2. Scan table3. Output candidates
4. Load candidates
5. Run Algo. with
more Mappers
1. Run MR
HGraph
• A project is open and put on github
– https://github.com/trendmicro/HGraph
• A partial impl. released from our internal project
– Follow HBase schema design
– Read data via Blueprints API
– Process data with our pagerank default impl.
• Download or ‘git clone’ it
– Use ‘mvn clean package’
– Run on unix-like OS
• Use windows may encounter some errors
PageRank
Result
Experiment Result
• Testing Dataset
– 42,133,610 vertices and 108,355,774 edges
– 1 vertex usually associates 2 ~ 9 vertices
– 4.13% of the vertices are known bad
– 0.09% of the vertices are known good
– The rests are unknown
• Result
– Runs 34hrs for running 23 iterations.
– 1,291 unknown vertices are ranked out
– Top 200 has 99% accuracy (explain later)
Suspicious DGA Discovered
• 3nkp***cq-----x.esf.sinkdns.org
– 196 domains from Domain Generated Algorithms
https://www.virustotal.com/en/url/871004bd9a0fe27e61b0519ceb8457528ea00da0e7ffdc44d98e759ab3e3caa1/analysis/
Untested But Highly Malware Related IP
• 67.*.*.132
– Categorized as “Computers / Internet”, not tested
https://www.virustotal.com/en/ip-address/67.*.*.132/information/
Prevalence
Entities
Knowngood/bad
Traditionalheuristicdetection
IntelligencefromBigDataAnalyt
(90%)
Security in Old Days
Cannot Protect What
You Cannot See
Next Generation Security
Unleash the Power of Data
Discover What We Don’t Know
Q&A
Backups
Property Graph Model (2/2)
Property Graph Model Definition
• A property graph has these elements
– a set of vertices
• each vertex has a unique identifier.
• each vertex has a set of outgoing edges.
• each vertex has a set of incoming edges.
• each vertex has a collection of properties defined by a map from key to value.
– a set of edges
• each edge has a unique identifier.
• each edge has an outgoing tail vertex.
• each edge has an incoming head vertex.
• each edge has a label that denotes the type of relationship between its two vertices.
• each edge has a collection of properties defined by a map from key to value.
About regions
• Keep reasonable amount of regions for each
regionserver
• Notice your splitted regions from one table
– Dump data daily, cause regions splitting
– Make sure your regions scattered evenly on each
regionserver
<hbase.regionserver.global.memstore.upperLimit> / <hbase.hregion.memstore.flush.size> =
<active-regions-per-rs>
e.g. (10G * 0.4) / 128MB = 32 active regions HBase Sizing Notes by Lars George
1 of 54

Recommended

Data Evolution in HBase by
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBaseHBaseCon
5K views38 slides
A Survey of HBase Application Archetypes by
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
20K views60 slides
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures by
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresCloudera, Inc.
4.1K views16 slides
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon by
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponHBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponCloudera, Inc.
6K views17 slides
HBaseCon 2013: ETL for Apache HBase by
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseCloudera, Inc.
6.9K views25 slides
Apache HBase Application Archetypes by
Apache HBase Application ArchetypesApache HBase Application Archetypes
Apache HBase Application ArchetypesCloudera, Inc.
5.2K views62 slides

More Related Content

What's hot

HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... by
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
4.7K views11 slides
Apache HBase™ by
Apache HBase™Apache HBase™
Apache HBase™Prashant Gupta
3.8K views58 slides
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ... by
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...Cloudera, Inc.
3.6K views13 slides
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase by
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
8.8K views49 slides
Introduction To HBase by
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
87.9K views18 slides
Hbase status quo apache-con europe - nov 2012 by
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012Chris Huang
657 views40 slides

What's hot(20)

HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... by Cloudera, Inc.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
Cloudera, Inc.4.7K views
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ... by Cloudera, Inc.
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
Cloudera, Inc.3.6K views
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase by HBaseCon
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon8.8K views
Introduction To HBase by Anil Gupta
Introduction To HBaseIntroduction To HBase
Introduction To HBase
Anil Gupta87.9K views
Hbase status quo apache-con europe - nov 2012 by Chris Huang
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012
Chris Huang657 views
Apache HBase - Introduction & Use Cases by Data Con LA
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
Data Con LA10.1K views
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... by Data Con LA
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Data Con LA593 views
Hoodie - DataEngConf 2017 by Vinoth Chandar
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar1.2K views
Introduction To Hadoop Ecosystem by InSemble
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
InSemble2.8K views
Big Data Fundamentals in the Emerging New Data World by Jongwook Woo
Big Data Fundamentals in the Emerging New Data WorldBig Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data World
Jongwook Woo3.3K views
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys by Cloudera, Inc.
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at ExplorysHBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys
Cloudera, Inc.3.7K views
Cisco connect toronto 2015 big data sean mc keown by Cisco Canada
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
Cisco Canada2K views
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB by HBaseCon
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon5.6K views
Hadoop User Group - Status Apache Drill by MapR Technologies
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
MapR Technologies1.6K views
Architecting Applications with Hadoop by markgrover
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover765 views
Getting Started with Hadoop by Cloudera, Inc.
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.3.5K views
Hadoop World 2011: Advanced HBase Schema Design by Cloudera, Inc.
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
Cloudera, Inc.17.9K views

Viewers also liked

HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving by
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon
2.6K views54 slides
Real-time HBase: Lessons from the Cloud by
Real-time HBase: Lessons from the CloudReal-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the CloudHBaseCon
4.5K views35 slides
HBaseCon 2015: HBase Operations in a Flurry by
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon
4.1K views22 slides
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase by
HBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBaseHBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBaseHBaseCon
3.2K views30 slides
Rolling Out Apache HBase for Mobile Offerings at Visa by
Rolling Out Apache HBase for Mobile Offerings at Visa Rolling Out Apache HBase for Mobile Offerings at Visa
Rolling Out Apache HBase for Mobile Offerings at Visa HBaseCon
2.6K views39 slides
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace by
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon
4.5K views26 slides

Viewers also liked(20)

HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving by HBaseCon
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon2.6K views
Real-time HBase: Lessons from the Cloud by HBaseCon
Real-time HBase: Lessons from the CloudReal-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the Cloud
HBaseCon4.5K views
HBaseCon 2015: HBase Operations in a Flurry by HBaseCon
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon4.1K views
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase by HBaseCon
HBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBaseHBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
HBaseCon3.2K views
Rolling Out Apache HBase for Mobile Offerings at Visa by HBaseCon
Rolling Out Apache HBase for Mobile Offerings at Visa Rolling Out Apache HBase for Mobile Offerings at Visa
Rolling Out Apache HBase for Mobile Offerings at Visa
HBaseCon2.6K views
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace by HBaseCon
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon4.5K views
Update on OpenTSDB and AsyncHBase by HBaseCon
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
HBaseCon2.6K views
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems by Cloudera, Inc.
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Cloudera, Inc.6.1K views
HBase Data Modeling and Access Patterns with Kite SDK by HBaseCon
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
HBaseCon4.7K views
Digital Library Collection Management using HBase by HBaseCon
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBase
HBaseCon3.1K views
HBase at Bloomberg: High Availability Needs for the Financial Industry by HBaseCon
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon6.7K views
Content Identification using HBase by HBaseCon
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
HBaseCon3.8K views
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS by HBaseCon
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon4K views
Apache HBase in the Enterprise Data Hub at Cerner by HBaseCon
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at Cerner
HBaseCon2.1K views
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster by Cloudera, Inc.
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Cloudera, Inc.7.5K views
Apache HBase Improvements and Practices at Xiaomi by HBaseCon
Apache HBase Improvements and Practices at XiaomiApache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at Xiaomi
HBaseCon4.8K views
HBaseCon 2015: HBase @ CyberAgent by HBaseCon
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgent
HBaseCon5.5K views
HBaseCon 2013: Full-Text Indexing for Apache HBase by Cloudera, Inc.
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
Cloudera, Inc.7.3K views
HBaseCon 2015: Analyzing HBase Data with Apache Hive by HBaseCon
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon7.9K views
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural... by Cloudera, Inc.
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
Cloudera, Inc.8.8K views

Similar to A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase

Attack on graph by
Attack on graphAttack on graph
Attack on graphScott Miao
1.1K views40 slides
מיכאל by
מיכאלמיכאל
מיכאלsqlserver.co.il
1K views18 slides
Modern Big Data Analytics Tools: An Overview by
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
8.9K views67 slides
Big Data Processing by
Big Data ProcessingBig Data Processing
Big Data ProcessingMichael Ming Lei
185 views74 slides
Data Infrastructure for a World of Music by
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
1.5K views37 slides
Sept 17 2013 - THUG - HBase a Technical Introduction by
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
8.1K views60 slides

Similar to A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase(20)

Attack on graph by Scott Miao
Attack on graphAttack on graph
Attack on graph
Scott Miao1.1K views
Modern Big Data Analytics Tools: An Overview by Great Wide Open
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
Great Wide Open8.9K views
Data Infrastructure for a World of Music by Lars Albertsson
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
Lars Albertsson1.5K views
Sept 17 2013 - THUG - HBase a Technical Introduction by Adam Muise
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
Adam Muise8.1K views
Michael stack -the state of apache h base by hdhappy001
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
hdhappy001965 views
HBase and Hadoop at Urban Airship by dave_revell
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
dave_revell2.1K views
Big Data and NoSQL for Database and BI Pros by Andrew Brust
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
Andrew Brust3.3K views
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory... by Cloudera, Inc.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Cloudera, Inc.4.1K views
Hypercubes In Hbase by George Ang
Hypercubes In HbaseHypercubes In Hbase
Hypercubes In Hbase
George Ang1.2K views
Hive @ Hadoop day seattle_2010 by nzhang
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang3K views
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010 by Jonathan Seidman
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Jonathan Seidman10.6K views
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) by VMware Tanzu
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu4.9K views
Big Data in the Microsoft Platform by Jesus Rodriguez
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
Jesus Rodriguez5.2K views
Prabhakar_Hadoop_2 years Experience by PRABHAKAR T
Prabhakar_Hadoop_2 years ExperiencePrabhakar_Hadoop_2 years Experience
Prabhakar_Hadoop_2 years Experience
PRABHAKAR T119 views
Prabhakar_Hadoop_2 years Experience by PRABHAKAR T
Prabhakar_Hadoop_2 years ExperiencePrabhakar_Hadoop_2 years Experience
Prabhakar_Hadoop_2 years Experience
PRABHAKAR T63 views
Analyzing Large-Scale User Data with Hadoop and HBase by WibiData
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBase
WibiData 683 views

More from HBaseCon

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes by
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
3.9K views36 slides
hbaseconasia2017: HBase on Beam by
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on BeamHBaseCon
1.3K views26 slides
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei by
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
1.4K views21 slides
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest by
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
936 views42 slides
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程 by
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
1.1K views21 slides
hbaseconasia2017: Apache HBase at Netease by
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at NeteaseHBaseCon
754 views27 slides

More from HBaseCon(20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes by HBaseCon
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
HBaseCon3.9K views
hbaseconasia2017: HBase on Beam by HBaseCon
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
HBaseCon1.3K views
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei by HBaseCon
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
HBaseCon1.4K views
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest by HBaseCon
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon936 views
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程 by HBaseCon
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
HBaseCon1.1K views
hbaseconasia2017: Apache HBase at Netease by HBaseCon
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
HBaseCon754 views
hbaseconasia2017: HBase在Hulu的使用和实践 by HBaseCon
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
HBaseCon878 views
hbaseconasia2017: 基于HBase的企业级大数据平台 by HBaseCon
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
HBaseCon701 views
hbaseconasia2017: HBase at JD.com by HBaseCon
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
HBaseCon828 views
hbaseconasia2017: Large scale data near-line loading method and architecture by HBaseCon
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon598 views
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei by HBaseCon
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
HBaseCon683 views
hbaseconasia2017: HBase Practice At XiaoMi by HBaseCon
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
HBaseCon1.8K views
hbaseconasia2017: hbase-2.0.0 by HBaseCon
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
HBaseCon1.8K views
HBaseCon2017 Democratizing HBase by HBaseCon
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
HBaseCon897 views
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest by HBaseCon
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon646 views
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase by HBaseCon
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon608 views
HBaseCon2017 Transactions in HBase by HBaseCon
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
HBaseCon1.8K views
HBaseCon2017 Highly-Available HBase by HBaseCon
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
HBaseCon1.1K views
HBaseCon2017 Apache HBase at Didi by HBaseCon
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
HBaseCon996 views
HBaseCon2017 gohbase: Pure Go HBase Client by HBaseCon
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon1.7K views

Recently uploaded

20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... by
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...sparkfabrik
8 views46 slides
MS PowerPoint.pptx by
MS PowerPoint.pptxMS PowerPoint.pptx
MS PowerPoint.pptxLitty Sylus
7 views14 slides
Generic or specific? Making sensible software design decisions by
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsBert Jan Schrijver
6 views60 slides
Page Object Model by
Page Object ModelPage Object Model
Page Object Modelartembondar5
6 views5 slides
ShortStory_qlora.pptx by
ShortStory_qlora.pptxShortStory_qlora.pptx
ShortStory_qlora.pptxpranathikrishna22
5 views10 slides
nintendo_64.pptx by
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptxpaiga02016
5 views7 slides

Recently uploaded(20)

20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... by sparkfabrik
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
sparkfabrik8 views
Generic or specific? Making sensible software design decisions by Bert Jan Schrijver
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
predicting-m3-devopsconMunich-2023-v2.pptx by Tier1 app
predicting-m3-devopsconMunich-2023-v2.pptxpredicting-m3-devopsconMunich-2023-v2.pptx
predicting-m3-devopsconMunich-2023-v2.pptx
Tier1 app9 views
AI and Ml presentation .pptx by FayazAli87
AI and Ml presentation .pptxAI and Ml presentation .pptx
AI and Ml presentation .pptx
FayazAli8713 views
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... by Lisi Hocke
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Lisi Hocke35 views
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by Neo4j
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptx
Neo4j17 views
Airline Booking Software by SharmiMehta
Airline Booking SoftwareAirline Booking Software
Airline Booking Software
SharmiMehta7 views
Top-5-production-devconMunich-2023.pptx by Tier1 app
Top-5-production-devconMunich-2023.pptxTop-5-production-devconMunich-2023.pptx
Top-5-production-devconMunich-2023.pptx
Tier1 app8 views
Introduction to Git Source Control by John Valentino
Introduction to Git Source ControlIntroduction to Git Source Control
Introduction to Git Source Control
John Valentino6 views
JioEngage_Presentation.pptx by admin125455
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptx
admin1254556 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 views
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action by Márton Kodok
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionGen Apps on Google Cloud PaLM2 and Codey APIs in Action
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action
Márton Kodok15 views
Understanding HTML terminology by artembondar5
Understanding HTML terminologyUnderstanding HTML terminology
Understanding HTML terminology
artembondar56 views

A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase

  • 1. A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase Chris Huang, Scott Miao 2014/5/5
  • 2. Who are we • Scott Miao • Developer, SPN, Trend Micro • Worked on hadoop ecosystem since 2011 • Expertise in HDFS/MR/HBase • Contributor for HBase/HDFS • @takeshi.miao • Chris Huang • RD Manager, SPN, Trend Micro • Hadoop Architect • Worked on hadoop ecosystem since 2009 • Contributor for Bigtop • @chenhsiu48 Our blog ‘Dumbo in TW’: http://dumbointaiwan.blogspot.tw/
  • 4. New Unique Malware Discovered http://www.av-test.org/en/statistics/malware/
  • 5. Social Engineering vs. Cyber Attacks
  • 7. Layer of Protection Exposure Infection Dynamic Web, Emails File Signatures File Behaviors
  • 8. See The Threat Entity Connectivity
  • 9. Connectivity From Different Data Sources Threat Connect Sand- box File Detecti on Threat Web Web Reputa tionFamily Write- up TE Virus DB APT KB
  • 11. Threat Entities Relation Graph F D I DD D I I D DF F D I I I E E E E E E I I ID D D D F D D F I I F D E File IP Domain Email
  • 12. Most Entity Reputations are Unknown F D I DD D I I D DF F D I I I E E E E E E I I ID D D D F D D F I I F D E File IP Domain Email
  • 13. Security Solution Dilemma – Long Tail Prevalence Entities Known good/bad Traditional heuristic detection Big Datacan help! How can we detect the rest effectively?
  • 14. Inspired by PageRank • Too many un-visited pages! • Users browse pages through links • Let users’ clicks (BIG DATA) tell us the rankings of those un- visited pages!
  • 15. Revised PageRank Algorithm • Too many un-rated threat entities! • Malware activities interact with threat entitles • Let malware’s behaviors (BIG DATA) tell us the reputations of those un-rated threat entities! F D I DD D I I D DF F D II I E E E E E E I I ID D D D F D D F I
  • 17. The Problems • Store large size of Graph data • Access large size of Graph data • Process large size of Graph data
  • 18. Data volume • Dump ~450MB (150 bytes * 3,000,000 records) data into Graph per day – Extract from 3GB of data • Keep it for 3 month – ~450MB * 90 = ~40,500MB = ~39GB – With Snappy compression – ~20 - 22GB • Dataset – ~40,000,000 vertices and ~100,000,000 edges • Data query volume about hundreds of thousands per day
  • 20. Store
  • 21. Property Graph Model Name: Jack Sex: Male Marriage: true Name: Merry Sex: Female Marriage: true marries Name: Vivian Sex: Female Marriage: false Name: John Sex: Male Marriage: true Date: 2010/5/5 Name: Emily Sex: Female Marriage: true From a soap opera… https://github.com/tinkerpop/blueprints/wiki/Property-Graph- Model
  • 22. • We use HBase as a Graph Storage – Google BigTable and PageRank – HBaseCon2012 • Storing and manipulating graphs in HBase The winner is… Massive scalable ? Active community ? Analyzable ?
  • 23. Use HBase to store Graph data (1/3) • Tables – create 'vertex', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => ‘SNAPPY', TTL => '7776000'} – create 'edge', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => ‘SNAPPY', TTL => '7776000'}
  • 24. Use HBase to store Graph data (2/3) • Schema design – Table: vertex – Table: edge ‘<vertex-id>||<entity-type>’, ‘property:<property-key>@<property-value-type>’, <property-value> ‘<vertex1-row-key>--><label>--><vertex2-row-key>’, ‘property:<property-key>@<property-value-type>’, <property-value>
  • 25. Use HBase to store Graph data (3/3) • Sample – Table: vertex – Table: edge ‘myapps-ups.com||domain’, ‘property:ip@String’, ‘…’ ‘myapps-ups.com||domain’, ‘property:asn@String’, ‘…’ … ‘track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:path@String’, ‘…’ ‘track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:parameter@String’, ‘…’ ‘myapps-ups.com||domain-->host-->track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:property1’, ‘…’ ‘myapps-ups.com||domain-->host-->track.muapps-ups.com/InvoiceA1423AC.JPG.exe||url’, ‘property:property2’, ‘…’
  • 26. Keep your rowkey length short • With long rowkey length – It does not impact your query performance – But it does impact your algorithm MR • OutOfMemoryException • Use something like HASH function to keep your rowkey length short – Use the hash value as rowkey – Put the original value into a property
  • 27. Overall Architecture Src table snapshot Clone table Clone table Graph table snapshot Clone table Clone table HBase Source C Source B source A HDFS 1. Collect data 2. reprocess & dump data 3. Get Data Client 4. Process Data Algorithms Intermediate data On HDFS
  • 29. Preprocess and Dump Data • HBase schema design is simple and human- readable • It is easy to write your dumping tool if needed – MR/Pig/Completebulkload – Can write cron-job to clean up the broken-edge data – TTL can also help to retire old data • We already have a lot practices for these tasks
  • 32. Get Data (1/2) • A Graph API • A better semantic for manipulating Graph data – As a wrapper for HBase Client API – Rather than use HBase Client API directly • A malware exploring sample Vertex vertex = this.graph.getVertex(“malware"); Vertex subVertex = null; Iterable<Edge> edges = vertex.getEdges(Direction.OUT, “connect", “infect", “trigger"); for(Edge edge : edges) { subVertex = edge.getVertex(Direction.OUT); ... }
  • 33. Get Data (2/2) • We implement blueprints API – It provides interfaces as spec. for users to impl. • 824 stars, 173 forks on github – We can get more benefits from it • plug-and-play different Blueprints-enabled graph backends – Traversal language, RESTful server, dataflow, etc – http://www.tinkerpop.com/ – Currently basic query methods are implemented
  • 34. Clients • Real time Client – Client systems • they need associated Graph data for a specific entity via RESTful API – Usually retrieve two levels of graph data – Quick responsiveness supported by HBase • With rowkey random access and appropriate schema design • HTable.get(),Scan.setStartRow(), Scan.setStopRow() • Batch client – Threat experts – Pick one entity and how many levels interested in, generate a graph file format used by tools • To visualize and navigate what whether users interested in • Graph Exploring Tools – Threat experts – Find out sub-graphs by given criteria • E.g. How many levels or associated vertices
  • 35. Malware Exploring Performance (1/3) • one request – Use Malware exploring sample again – 1 vertex with 2 levels associated instances (2 ~ 9 vertices) • Dataset – 42,133,610 vertices and 108,355,774 edges • Total requests – 31,764 requests * 100 clients = 3,176,400 Vertex vertex = this.graph.getVertex(“malware"); Vertex subVertex = null; Iterable<Edge> edges = vertex.getEdges(Direction.OUT, “connect", “infect", “trigger"); for(Edge edge : edges) { subVertex = edge.getVertex(Direction.OUT); ... }
  • 37. Malware Exploring Performance (3/3) • Some statistics – Mean: 51.61 ms – Standard Deviation: 653.57 ms – Empirical rule: 68%, 95%, 99.7% • 99.7% of requests below 2.1 seconds • But response time variances still happen – Use Cache layer between client and HBase – Warm-up after new data come in
  • 40. • Human-readable HBase schema design – Write your own MR – Write your own Pig/UDFs • So we can write the algorithms to further process our graph data – To predict unknown reputation by known threats – E.g. a revised PageRank algorithm
  • 41. Data process flow Src table snapshot Clone table Data on HDFS Algorithms (MR, Pig UDF) Clone table Processed completed Graph table snapshot Clone table Clone table Intermediate data on HDFS HBase 1. Dump daily data 2. Take snapshot 3. Clone snapshot 4. Process data iteratively (takes hours) 4.1 generate Intermediate data 5. Process complete 6. Dump processed data with timerange
  • 42. A customized TableInputFormat (1/2) • One Mapper for one region by default – Each Mapper process too much data • OutOfMemoryException • Too long to process – Use small split region size ? • Will overload your HBase cluster !! • Before: about ~40 Mappers • After: about ~500 Mappers
  • 43. A customized TableInputFormat (2/2) Clone Graph Table MR - Pick Candidates Combination Candidates list file <encodedRegionName>t<startKey>t<endKey> … d3d1749f3486e850b263c7ecb2424dd3tstartKey_1tendKey_1 d3d1749f3486e850b263c7ecb2424dd3tstartKey_2tendKey_2 d3d1749f3486e850b263c7ecb2424dd3tstartKey_3tendKey_3 Cd91c08d656a19bdb180e0b7f8896575tstartKey_4tendKey_4 Cd91c08d656a19bdb180e0b7f8896575tstartKey_5tendKey_5 … CustTableInp utFormat MR - algorithm 2. Scan table3. Output candidates 4. Load candidates 5. Run Algo. with more Mappers 1. Run MR
  • 44. HGraph • A project is open and put on github – https://github.com/trendmicro/HGraph • A partial impl. released from our internal project – Follow HBase schema design – Read data via Blueprints API – Process data with our pagerank default impl. • Download or ‘git clone’ it – Use ‘mvn clean package’ – Run on unix-like OS • Use windows may encounter some errors
  • 46. Experiment Result • Testing Dataset – 42,133,610 vertices and 108,355,774 edges – 1 vertex usually associates 2 ~ 9 vertices – 4.13% of the vertices are known bad – 0.09% of the vertices are known good – The rests are unknown • Result – Runs 34hrs for running 23 iterations. – 1,291 unknown vertices are ranked out – Top 200 has 99% accuracy (explain later)
  • 47. Suspicious DGA Discovered • 3nkp***cq-----x.esf.sinkdns.org – 196 domains from Domain Generated Algorithms https://www.virustotal.com/en/url/871004bd9a0fe27e61b0519ceb8457528ea00da0e7ffdc44d98e759ab3e3caa1/analysis/
  • 48. Untested But Highly Malware Related IP • 67.*.*.132 – Categorized as “Computers / Internet”, not tested https://www.virustotal.com/en/ip-address/67.*.*.132/information/
  • 49. Prevalence Entities Knowngood/bad Traditionalheuristicdetection IntelligencefromBigDataAnalyt (90%) Security in Old Days Cannot Protect What You Cannot See Next Generation Security Unleash the Power of Data Discover What We Don’t Know
  • 50. Q&A
  • 53. Property Graph Model Definition • A property graph has these elements – a set of vertices • each vertex has a unique identifier. • each vertex has a set of outgoing edges. • each vertex has a set of incoming edges. • each vertex has a collection of properties defined by a map from key to value. – a set of edges • each edge has a unique identifier. • each edge has an outgoing tail vertex. • each edge has an incoming head vertex. • each edge has a label that denotes the type of relationship between its two vertices. • each edge has a collection of properties defined by a map from key to value.
  • 54. About regions • Keep reasonable amount of regions for each regionserver • Notice your splitted regions from one table – Dump data daily, cause regions splitting – Make sure your regions scattered evenly on each regionserver <hbase.regionserver.global.memstore.upperLimit> / <hbase.hregion.memstore.flush.size> = <active-regions-per-rs> e.g. (10G * 0.4) / 128MB = 32 active regions HBase Sizing Notes by Lars George