[263] s2graph large-scale-graph-database-with-hbase-2

S2Graph: A large-scale graph database with HBase
daumkakao

2
Reference
1. HBase Conference 2015
a. http://www.slideshare.net/HBaseCon/use-cases-session-5
b. https://vimeo.com/128203919
2. Deview 2015
3. ApacheCon Big Data Europe
a. http://sched.co/3ztM

3
Our Social Graph
[Figure: a social graph in which users and items are connected by labeled edges (Message, Write, Read, Coupon, Present, Friend, Group, Emoticon, Eat, View, Play, Style, Advertise, Search, Listen, Like, Comment). Each edge carries properties such as length, price, size, rating, count, level, share, keyword, and affinity.]

4
Our Social Graph
[Figure: the same graph with concrete property values: Message (length: 9), Write (length: 3), Friend edges with affinity from 1 to 9, Play (level: 6), Style (share: 3), Advertise (ctr: 0.32), Search (keyword: "HBase"), Listen (count: 6). Target vertices include Message ID 201, Ad ID 603, Music ID, Item ID 13, Post ID 97, and Game ID 1984.]

5
Technical Challenges
1. A large social graph that is constantly changing
a. Scale (more than):
- social network: 10 billion edges, 200 million vertices, 50 million updates on existing edges
- user activities: over 1 billion new edges per day

6
Technical Challenges (cont)
2. Low latency for breadth-first search traversal on connected data
a. Performance requirements
- peak graph-traversal queries per second: 20,000
- response time: 100 ms

7
3. Realtime update capabilities for viral effects
[Figure: Person A posts, Person B comments, Person C shares, Person D mentions. Each hop between them is labeled "Fast".]

8
4. Support for Dynamic Ranking logic
a. Push strategy: hard to change the data ranking logic dynamically.
b. Pull strategy: lets users try out various data ranking logics.

9
Before
Each app server must know each DB's sharding logic.
Highly interconnected architecture
[Figure: the Messaging, SNS, and Blog app servers each connect directly to the separate stores for friend relationships, SNS feeds, blog user activities, and messaging.]

10
After
[Figure: the SNS, Blog, and Messaging apps run as stateless app servers and all talk to a single S2Graph DB.]

12
What is S2Graph?
Storage-as-a-Service + Graph API = Realtime Breadth First Search

13
Example: Messenger Data Model
[Figure: a user participates in chat rooms (Participates edges); each chat room contains messages (Contains edges).]
Recent messages in my chat rooms.
SELECT b.* FROM user_chat_rooms a, chat_room_messages b WHERE a.user_id = 1 AND a.chat_room_id = b.chat_room_id AND b.created_at >= yesterday

14
Example: Messenger Data Model
[Figure: a user participates in chat rooms (Participates edges); each chat room contains messages (Contains edges).]
Recent messages in my chat rooms.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'
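The request body above can also be assembled programmatically. The following is a minimal sketch in Python that only builds and serializes the JSON; the helper name `build_query` and its default arguments are illustrative assumptions, not part of S2Graph.

```python
import json

def build_query(src_id, steps, service_name="s2graph", column_name="user_id"):
    """Assemble a body shaped like the getEdges request shown on the slide.

    build_query is a hypothetical helper; only the field names come from the slide.
    """
    return {
        "srcVertices": [
            {"serviceName": service_name, "columnName": column_name, "id": src_id}
        ],
        # One inner list per step; each entry describes one label to traverse.
        "steps": [[dict(param) for param in step] for step in steps],
    }

query = build_query(1, [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10,
      "where": "created_at >= yesterday"}],
])
body = json.dumps(query, indent=2)
print(body)
```

The printed string can be passed as the `-d` payload of the curl command.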

15
Example: News Feed (cont)
[Figure: a user's Friends edges lead to friends, whose create/like/share edges lead to Post 1, Post 2, and Post 3.]
Posts that my friends interacted with.
SELECT a.*, b.* FROM friends a, user_posts b WHERE a.user_id = b.user_id AND b.updated_at >= yesterday AND b.action_type IN ('create', 'like', 'share')

16
Example: News Feed (cont)
[Figure: a user's Friends edges lead to friends, whose create/like/share edges lead to Post 1, Post 2, and Post 3.]
Posts that my friends interacted with.
{
  "steps": [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}

17
Example: Recommendation (User-based CF) (cont)
[Figure: Similar Users edges (computed in batch) lead to similar users, whose user-product interaction edges (click/buy/like/share) lead to Product 1, Product 2, and Product 3.]
Products that similar users interacted with recently.
SELECT a.*, b.* FROM similar_users a, user_products b WHERE a.sim_user_id = b.user_id AND b.updated_at >= yesterday

18
Example: Recommendation (User-based CF) (cont)
[Figure: Similar Users edges (computed in batch) lead to similar users, whose interaction edges lead to Product 1, Product 2, and Product 3.]
Products that similar users interacted with recently.
{
  "filterOut": {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [[{"label": "user_products_interact"}]]
  },
  "steps": [
    [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
  ]
}

19
Example: Recommendation (Item-based CF) (cont)
[Figure: the user's interaction edges lead to products, whose Similar Products edges (computed in batch) lead to Product 1, Product 2, and Product 3.]
Products that are similar to the ones I am interested in.
SELECT a.*, b.* FROM similar_ a, user_products b WHERE a.sim_user_id = b.user_id AND b.updated_at >= yesterday

20
Example: Recommendation (Item-based CF) (cont)
[Figure: the user's interaction edges lead to products, whose Similar Products edges (computed in batch) lead to Product 1, Product 2, and Product 3.]
Products that are similar to the ones I am interested in.
{
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
  ]
}

21
Example: Recommendation (Content + Most popular) (cont)
[Figure: the user's liked products (Product 1, Product 2, Product 3) map to Category 1 and Category 2; each category keeps the TopK (k = 1) product per time unit (day), e.g. Product 10 and Product 20 for Today and Yesterday.]
Daily top product per category for the products that I liked.
SELECT c.*
FROM user_products a, product_categories b, category_daily_top_products c
WHERE a.user_id = 1 AND a.product_id = b.product_id AND b.category_id = c.category_id AND c.time BETWEEN yesterday AND today

22
Example: Recommendation (Content + Most popular) (cont)
[Figure: the user's liked products (Product 1, Product 2, Product 3) map to Category 1 and Category 2; each category keeps the TopK (k = 1) product per time unit (day), e.g. Product 10 and Product 20 for Today and Yesterday.]
Daily top product per category for the products that I liked.
{
  "steps": [
    [{"label": "product_cates", "direction": "out", "limit": 3}],
    [{"label": "category_products_topK", "direction": "out", "limit": 10}]
  ]
}

23
Example: Recommendation (Spreading Activation) (cont)
[Figure: the user's interactions spread through co-interacting users to Product 1, Product 2, and Product 3.]
Products interacted with by users who interacted with the products I interacted with.
SELECT b.product_id, count(*)
FROM user_products a, user_products b
WHERE a.user_id = 1
AND a.product_id = b.product_id
GROUP BY b.product_id

24
Example: Recommendation (Spreading Activation) (cont)
[Figure: the user's interactions spread through co-interacting users to Product 1, Product 2, and Product 3.]
Products interacted with by users who interacted with the products I interacted with.
{
  "steps": [
    [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
  ]
}
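To make the spreading-activation idea concrete, here is a small in-memory sketch in Python. It is not how S2Graph executes the query; the toy `interactions` list and the choice to exclude products the user has already seen are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: (user, product) interaction edges.
interactions = [
    ("me", "p1"), ("me", "p2"),
    ("u2", "p1"), ("u2", "p3"),
    ("u3", "p2"), ("u3", "p3"), ("u3", "p4"),
]

def spreading_activation(user, edges):
    """Follow user -> products -> co-interacting users -> their products."""
    my_products = {p for u, p in edges if u == user}
    co_users = {u for u, p in edges if p in my_products and u != user}
    # Count how many co-interacting users touched each new product.
    counts = Counter(
        p for u, p in edges if u in co_users and p not in my_products
    )
    return counts.most_common()

result = spreading_activation("me", interactions)
print(result)  # [('p3', 2), ('p4', 1)]
```

Products reached through more co-interacting users rank higher, which is the count(*) in the SQL version.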

25
Realization
1. These examples resemble graphs.
2. An object is a vertex; a relationship is an edge.
3. Necessary API: breadth-first search on a large-scale graph.

26
S2Graph API: Vertex
Vertex:
1. insert, delete, getVertex
2. vertex id: whatever the user provides (string/int/long)
[Figure: a vertex record]
ID      1231-123
Prop1   Val1
Prop2   Val2
...     ...

27
S2Graph API: Edge
Edges:
1. insert, delete, update, getEdge (like CRUD in an RDBMS)
2. Edge reference: (from, to, label, direction)
3. Multiple props on an edge.
4. All edges are ordered (details follow).
[Figure: an edge record]
Edge Reference   1, 101, "friend", "out"
Prop1            Val1
Prop2            Val2
...              ...
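The edge operations listed above can be pictured as a map keyed by the edge reference. This is a minimal in-memory sketch in Python; the class names `EdgeRef` and `EdgeStore` are hypothetical and say nothing about how S2Graph stores edges in HBase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeRef:
    """Edge reference from the slide: (from, to, label, direction)."""
    src: object
    tgt: object
    label: str
    direction: str

class EdgeStore:
    """Hypothetical in-memory stand-in for insert/update/delete/getEdge."""

    def __init__(self):
        self._edges = {}

    def insert(self, ref, props):
        self._edges[ref] = dict(props)

    def update(self, ref, props):
        # Merge new property values into an existing edge.
        self._edges[ref].update(props)

    def delete(self, ref):
        self._edges.pop(ref, None)

    def get_edge(self, ref):
        return self._edges.get(ref)

store = EdgeStore()
ref = EdgeRef(1, 101, "friend", "out")
store.insert(ref, {"affinity": 3})
store.update(ref, {"affinity": 9})
print(store.get_edge(ref))  # {'affinity': 9}
```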

28
S2Graph API: Query
Query: getEdges, countEdges, removeEdges
class Query {
  // Defines a breadth-first search
  List[VertexId] startVertices;
  List[Step] steps;
}
class Step {
  // Defines one breadth (one hop)
  List[QueryParam] queryParams;
}
class QueryParam {
  // Defines which edges to traverse for the current breadth
  String label;
  String direction;
  Map options;
}
[Figure: a Query consists of Step 1 and Step 2; each Step contains one or more QueryParams.]
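A rough way to read the Query, Step, and QueryParam structure is as a loop that expands a frontier once per step. The sketch below mirrors the three classes in Python and runs them against a hypothetical in-memory adjacency map; the `run` function and the use of `options["limit"]` are assumptions for illustration, not S2Graph's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class QueryParam:
    label: str
    direction: str = "out"
    options: dict = field(default_factory=dict)

@dataclass
class Step:
    query_params: list

@dataclass
class Query:
    start_vertices: list
    steps: list

def run(query, adjacency):
    """Breadth-first traversal: each step expands the current frontier once."""
    frontier = list(query.start_vertices)
    for step in query.steps:
        next_frontier = []
        for vertex in frontier:
            for param in step.query_params:
                key = (vertex, param.label, param.direction)
                neighbors = adjacency.get(key, [])
                limit = param.options.get("limit", len(neighbors))
                next_frontier.extend(neighbors[:limit])
        frontier = next_frontier  # no de-duplication in this sketch
    return frontier

# Hypothetical adjacency data keyed by (vertex, label, direction).
adjacency = {
    (1, "friend", "out"): [101, 102, 103],
    (101, "friend", "out"): [201, 202],
    (102, "friend", "out"): [203],
}
q = Query([1], [
    Step([QueryParam("friend", "out", {"limit": 2})]),
    Step([QueryParam("friend", "out", {"limit": 10})]),
])
result = run(q, adjacency)
print(result)  # [201, 202, 203]
```

The first step keeps only two of the three friends because of its limit, so vertex 103 is never expanded.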

29
S2Graph API: indices
[Figure: vertex 1 has "friend" out-edges to vertices 101 (Name: a), 102 (Name: b), and 103 (Name: c). The edges are stored in one row, ordered (DESC) by Name:]
Row key           Degree   Q1      Q2      Q3
1-friend-out-PK   3        c-103   b-102   a-101
Indices:
1. addIndex, createIndex
2. Edges are automatically kept ordered for multiple indices.
3. Supports int/long/float/string data types.
class Index {
  // Defines how to order edges.
  String indexName;
  List[Prop] indexProps;
}
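The idea of keeping edges ordered at write time, so that reads need no sorting, can be sketched with a sorted list. This Python example is an assumption-laden illustration: the class `OrderedEdges` is hypothetical, and S2Graph itself relies on HBase's sorted column qualifiers rather than an in-memory list.

```python
import bisect

class OrderedEdges:
    """Edges of one (vertex, label, direction, index) row, sorted on insert."""

    def __init__(self, index_props):
        self.index_props = index_props
        self._rows = []  # ascending list of (index values, target id)

    def insert(self, target, props):
        key = tuple(props[name] for name in self.index_props)
        bisect.insort(self._rows, (key, target))

    @property
    def degree(self):
        return len(self._rows)

    def top(self, limit):
        # Walk the sorted list backwards to read in descending order.
        return [target for _, target in self._rows[::-1][:limit]]

row = OrderedEdges(["name"])
row.insert(101, {"name": "a"})
row.insert(103, {"name": "c"})
row.insert(102, {"name": "b"})
print(row.degree, row.top(3))  # 3 [103, 102, 101]
```

The result matches the row on the slide: degree 3, then c-103, b-102, a-101.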

30
What is S2Graph
Storage-as-a-Service + Graph API = Realtime Breadth First Search
S2Graph is not:
- No support for global computation (unlike Apache Giraph or GraphX).
- No support for graph algorithms such as PageRank or shortest path.

31
Why S2Graph: Push vs Pull. Feeds with Push
1. Only the timestamp can be used for scoring.
2. Hard to change the scoring function dynamically.
[Figure: each Post or Like is written (fanned out) to the feed queue of every friend.]
Write: # of friends
Read: O(1) for friends
Storage: AVG(# of friends) * total user activity
Query: O(1)
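The push costs above follow from fan-out on write: every activity is copied once per friend, and a read is a single queue lookup. A minimal Python sketch, with hypothetical names and toy data:

```python
from collections import defaultdict, deque

friends = {"A": ["B", "C", "D"]}   # hypothetical friend lists
feed_queues = defaultdict(deque)   # one precomputed feed queue per user

def publish(author, post, timestamp):
    """Fan-out on write: copy the post into each friend's feed queue."""
    writes = 0
    for friend in friends.get(author, []):
        feed_queues[friend].appendleft((timestamp, post))
        writes += 1
    return writes  # equals the author's number of friends

def read_feed(user, limit=10):
    """Reading is one queue lookup; order is fixed by arrival time."""
    return [post for _, post in list(feed_queues[user])[:limit]]

w1 = publish("A", "post-1", 1)
w2 = publish("A", "post-2", 2)
print(w1 + w2, read_feed("B"))  # 6 ['post-2', 'post-1']
```

Because the queue content is materialized at write time, changing the ranking later would require rewriting every queue.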

32
Why S2Graph: Push vs Pull. Feeds with Pull
1. Different weights for different action types: Like = 0.8, Click = 0.1, ...
2. The client can change scoring dynamically.
[Figure: Post and Like edges are stored once and read through the Friends edges at query time.]
Write: O(1)
Read: None
Storage: total user activity
Query: O(1) for friends + O(# of friends)
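With pull, the scoring weights are just a query-time parameter. The Python sketch below uses hypothetical toy data and a hypothetical `pull_feed` function to show the same stored activities ranked two different ways; only the Like = 0.8 and Click = 0.1 weights come from the slide.

```python
friends = {"me": ["B", "C"]}   # hypothetical friend list
activities = {                 # hypothetical per-user activity edges
    "B": [("post-1", "like"), ("post-2", "click")],
    "C": [("post-1", "click"), ("post-3", "like")],
}

def pull_feed(user, weights, limit=10):
    """Read friends, then their activities, and score at query time."""
    scores = {}
    for friend in friends.get(user, []):
        for post, action in activities.get(friend, []):
            scores[post] = scores.get(post, 0.0) + weights.get(action, 0.0)
    # Highest score first; ties broken by post id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))[:limit]

by_likes = pull_feed("me", {"like": 0.8, "click": 0.1})
by_clicks = pull_feed("me", {"like": 0.1, "click": 0.8})
print(by_likes)   # ['post-1', 'post-3', 'post-2']
print(by_clicks)  # ['post-1', 'post-2', 'post-3']
```

Nothing is rewritten between the two calls; only the weights passed by the client differ.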

33
Why S2Graph: S2Graph Supports Pull + Push
Pull >> push only if
1. fast response time: 10 ~ 100 ms
2. throughput: 10K ~ 20K QPS
S2Graph provides linear scalability in
1. the number of machines.
2. the BFS search space (how many edges a single query traverses).
More detail follows in the benchmark section.

34
Why S2Graph: Simplify Data Flow
[Figure: S2Graph exposes a Write API + Query DSL (open sourced) and emits a WAL log. Apache Spark (batch computing layer) computes user/item similarity, TopK counters, and other results. The S2Graph Bulk Loader (will be open sourced soon) loads data into S2Graph.]

35
Why S2Graph: Built in A/B test
1. Register a query template: each query template has an impressionId.
2. Insert click/impression events into S2Graph as edge inserts.

36
Why S2Graph: Just Insert Edge
1. user activity history.
2. friends feed.
3. user-item based collaborative filtering.
4. topK ranking (most popular, segmented most popular).
and many more.
Just think of your service as a graph model.

38
Details: previous talk at HBaseCon 2015
1. https://vimeo.com/128203919
2. http://www.slideshare.net/HBaseCon/use-cases-session-5

40
HBase Table Configuration
1. setDurability(Durability.ASYNC_WAL)
2. setCompressionType(Compression.Algorithm.LZ4)
3. setBloomFilterType(BloomType.ROW)
4. setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)
5. setBlockSize(32768)
6. setBlockCacheEnabled(true)
7. pre-split by (Int.MaxValue / regionCount); regionCount = 120 when the table was created (on 20 region servers).
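The pre-split rule in item 7 divides the 32-bit integer range into equal slices. The Python sketch below computes the resulting split points; how S2Graph encodes these values into HBase row-key bytes is not shown on the slide, so the function `split_points` is only an illustrative assumption about the arithmetic.

```python
INT_MAX = 2**31 - 1  # largest 32-bit signed integer

def split_points(region_count):
    """Evenly spaced split keys: N regions need N - 1 split points."""
    step = INT_MAX // region_count
    return [step * i for i in range(1, region_count)]

splits = split_points(120)
print(len(splits), splits[0], splits[-1])  # 119 17895697 2129587943
```

With 120 regions on 20 region servers, each server starts with 6 regions.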

41
HBase Cluster Configuration
• each machine: 8core, 32G memory, SSD
• hfile.block.cache.size: 0.6
• hbase.hregion.memstore.flush.size: 128MB
• otherwise use default value from CDH 5.3.1
• s2graph rest server: 4core, 16G memory

42
Performance
1. Total # of edges: 100,000,000,000 (100,000,000 rows x 1,000 columns)
2. Test environment
a. ZooKeeper servers: 3
b. HBase Master servers: 2
c. HBase Region servers: 20
d. App server: 4 cores, 16 GB RAM
e. Write traffic: 5K / second

43
Performance
2. Linear scalability
- Benchmark query: src.out("friend").limit(100).out("friend").limit(10)
- Total concurrency: 20 * # of app servers
# of app servers   QPS (queries per second)   Latency (ms)
1                  464                        43
2                  885                        45
4                  1,763                      45
8                  3,491                      46
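One way to check the linear-scalability claim is to divide the reported QPS by the number of app servers. The Python sketch below does that arithmetic, assuming the chart labels pair 464, 885, 1,763, and 3,491 QPS with 1, 2, 4, and 8 app servers respectively.

```python
servers = [1, 2, 4, 8]
qps = [464, 885, 1763, 3491]  # assumed pairing with the server counts above

per_server = [q / n for q, n in zip(qps, servers)]
# Efficiency relative to the single-server throughput (1.0 = perfectly linear).
efficiency = [p / per_server[0] for p in per_server]

for n, p, e in zip(servers, per_server, efficiency):
    print(f"{n} server(s): {p:.1f} QPS each, {e:.0%} of ideal")
```

Under that assumption, per-server throughput stays within roughly 6% of the single-server figure at 8 servers.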
QPS

Performance
3. Varying width of traversal (tested with a single server)
- Benchmark query: src.out("friend").limit(x).out("friend").limit(10)
- Total concurrency = 20 * 1 (# of app servers)
Limit on first step   QPS     Latency (ms)
20                    1,821   11
40                    1,023   19
80                    570     35
200                   237     84
400                   122     164
800                   61      327
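The QPS and latency figures can be cross-checked with Little's law: with a fixed number of concurrent requests, throughput is approximately concurrency divided by latency. The Python sketch below applies this with the stated concurrency of 20, assuming the chart labels pair latencies of 11, 19, 35, 84, 164, and 327 ms with 1,821, 1,023, 570, 237, 122, and 61 QPS for first-step limits of 20 through 800.

```python
concurrency = 20
limits = [20, 40, 80, 200, 400, 800]
latency_ms = [11, 19, 35, 84, 164, 327]        # assumed pairing with limits
measured_qps = [1821, 1023, 570, 237, 122, 61]  # assumed pairing with limits

# Little's law: throughput = concurrency / latency (latency in seconds).
predicted_qps = [concurrency / (ms / 1000) for ms in latency_ms]

for limit, measured, predicted in zip(limits, measured_qps, predicted_qps):
    print(f"limit {limit}: measured {measured}, predicted {predicted:.0f}")
```

Under that pairing the predicted and measured values agree to within a few percent, which suggests latency grows in proportion to the number of edges traversed.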

45
Performance
4. Different query paths (different I/O patterns)
- Every query touches 1,000 edges.
- Each step's limit is shown on the x axis.
- Performance can be estimated from a query's search space.
Limits on path             QPS     Latency (ms)
10 -> 100                  695     14
100 -> 10                  435.3   23
10 -> 10 -> 10             274.4   36
2 -> 5 -> 10 -> 10         292.1   34
2 -> 5 -> 2 -> 5 -> 10     307.5   32

46
Performance
5. Write throughput per operation on a single app server
Insert operation
[Figure: latency (axis 0 to 5) versus requests per second (axis ticks 8000, 16000, 800000).]

47
Performance
6. Write throughput per operation on a single app server
Update (increment/update/delete) operation
[Figure: latency (axis 0 to 8) versus requests per second (axis ticks 2000, 4000, 6000).]

48
Stats
1. HBase cluster per IDC (2 IDCs)
- 3 ZooKeeper servers
- 2 HBase Masters
- (20 + 40) HBase slaves
2. App servers per IDC
- 10 servers for write only
- 30 servers for query only
3. Real traffic
- read: 10K ~ 20K requests per second
- currently mostly 2-step queries with limit 100 on the first step.
- write: over 5K ~ 10K requests per second

51
Now Available as Open Source
- https://github.com/daumkakao/s2graph
- Looking for contributors and mentors
Contact
- Doyoung Yoon : shom83@gmail.com
