GDM 2011 Talk

  • 2,971 views
Uploaded on

GDM 2011 keynote slides

GDM 2011 keynote slides

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
2,971
On Slideshare
0
From Embeds
0
Number of Embeds
2

Actions

Shares
Downloads
26
Comments
0
Likes
3

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Horton: online query execution on large distributed graphs
    Sameh Elnikety
    Microsoft Research
  • 2. People
    Team
    Interns
    Mohamed Sarwat (UMN), Jiaqing Du (EPFL)
  • 3. Large Graphs Are Important
    Large graphs appear in many application domains
    Examples: social networks, knowledge representation, …
  • 4. Example : Social Network
    Scale
    LinkedIn
    70 million users
    Facebook
    500 million users
    65 Billion photos
    Queries
    Find Alice’s friends
    Find Alice’s photos with friends
    Rich graph
    Types, attributes
  • 5. Objective: Management & Querying
    Manage and query large graphs online
    Today
    Expensive hardware (limited)
    Cluster of machines (ad-hoc)
    Tomorrow
    Growth area
    Analogy: provide system support
    Analogy: DBMS manages application data
  • 6. Key Design Decisions
    Interactive queries
    Graph stored in random access memory
    Graph too large for one machine
    Distributed problem
    Partitioning: slice graph into pieces
    Replication: for fault tolerance
    Long time to build graph
    Support updates
  • 7. Three Research Problems
    How to partition the graph
    Static and dynamic techniques
    E.g., Evolving set partitioning
    How to query the graph
    Data model & query language
    Execution engine
    Query optimizations
    How to update the graph
    Multi-version concurrency control
    E.g., Generalized Snapshot Isolation
  • 8. Concurrency Control
    • Generalized Snapshot Isolation – GSI
    • 9. Replicas agree
    On which update transactions commit
    On their commit order
    • Example (bank database)
    T1: set balance = $1000
    T2: set balance = $2000
    Replica1: see T1 then T2  balance = $2000
    Replica2: see T2 then T1  balance = $1000
  • 10. Replication
    Single
    Graph Server
    Replica 1
    Load
    Balancer
    Replica 2
    Replica 3
  • 11. Read Tx
    Replica 1
    Load
    Balancer
    Replica 2
    T
    Replica 3
  • 12. Update Tx
    Replica 1
    Load
    Balancer
    Replica 2
    T
    ws
    ws
    Replica 3
  • 13. Replication Middleware
    Tx A
    Replica 1
    Tx A
    A  B  C
    durability
    OFF
    A  B  C
    Replication MW
    (global ordering)
    Tx B
    Replica 2
    Tx B
    durability
    A  B  C
    A  B  C
    A  B  C
    durability
    OFF
    A  B  C
    Replica 3
    Tx C
    Tx C
    A  B  C
    A  B  C
    durability
    OFF
  • 14. Partitioning – Key Ideas
    Objective
    Maximize temporal and spatial locality
    Static
    Linear time
    Evolving-set partitioning
    Streaming algorithm
    During loading
    Dynamic
    Incremental
    Affinity propagation
  • 15. Three Research Problems
    How to partition the graph
    Static and dynamic techniques
    E.g., Evolving set partitioning
    How to query the graph
    Data model & query language
    Execution engine
    Query optimizations
    How to update the graph
    Multi-version concurrency control
    E.g., Generalized Snapshot Isolation
  • 16. Data Model
    Node
    ID, type, attributes
    Edge
    Connects two nodes
    Direction, type, attributes
    App
    Horton
  • 17. Query Language
    Trade-off
    Expressiveness vs. efficiency
    Regular language reachability
    Query is a regular expression
    Sequence of node and edge predicates
    Example
    Alice’s photos
    Photo, tags, Alice
    Node: type=photo , edge: type=tags , node: type=person, name=Alice
    Result:matching paths
  • 18. Query Language - Operators
    Projection
    Alice’s photos
    Photo, tags, Alice
    Node: type=photo , edge: type=tags , node: type=person, name=Alice
    SELECT photoFROM photo, tags, Alice
    OR
    (Photo | video), tags, Alice
    Kleenestar
    Alice org chart
    Alice, (manages, person)*
  • 19. Example App: CodeBook - Graph
  • 20. Example App: CodeBook - Queries
    Person, FileOwner>, TFSFile,FileOwner<, Person
    Person, DiscussionOwner>, Discussion, DiscussionOwner<, Person
    Person, WorkItemOwner>, TFSWorkItem, WorkItemOwner< ,Person
    Person, Manages<, Person, Manages>, Person
    Person, WorkItemOwner>, TFSWorkItem, Mentions>, TFSFile, Mentions>, TFSWorkItem, WorkItemOwner<, Person
    Person, WorkItemOwner>, TFSWorkItem, Mentions>, TFSFile, FileOwner<, Person
    Person, FileOwner>, TFSFile, Mentions>, TFSWorkItem, Mentions>, TFSFile,   FileOwner<, Person
  • 21. Query Language - Future Extensions
    From: path
    Sequence of predicates
    To: sub-graph
    Graph pattern
    Sub-graph isomorphism
  • 22. Talk Outline
    • Graph query language
    • 23. Graph query execution engine
    • 24. Graph query optimizer
    • 25. Initial experimental results
    • 26. Demo
  • Execution Engine
    Build a FSM from input query
    Optimize FSM
    Executes FSM using distributed BFS
  • 27. Centralized Query Execution
    Photo
    Tags
    Alice
    S2
    S0
    S1
    S3
    Alice, Tags, Photo
    Breadth First Search
    Answer Paths:
    Alice, Tags, Photo1Alice, Tags, Photo8
  • 28. Distributed Execution Engine
  • 29. Distributed Query Execution
    Alice, Tags, Photo, Tags, Hillary
    Partition 1
    Partition 2
  • 30. Distributed Query Execution
    Alice, Tags, Photo, Tags, Hillary
    FSM
    Partition 1
    Partition 2
    S0
    Partition 1
    Step 1
    Alice
    Alice
    S1
    Tags
    S2
    Step 2
    Photo1
    Photo8
    Photo
    S3
    Tags
    S4
    Step 3
    Hillary
    Hillary
    Partition 2
    S5
  • 31. Graph Query Optimization
  • 32. Graph Query Optimization
    Input
    Graph statistics + query plan
    Output
    Optimized query plan (execution sequence)
    Action
    Enumerate execution plans
    Evaluate their costs using graph statistics
    Find the plan with minimum cost
  • 33. Graph Query Optimization Techniques
    Predicate ordering
    Derived edges
    Fragment reuse
  • 34. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
    Execute right to left
    Different predicate orders can result in different execution costs.
  • 35. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
  • 36. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
  • 37. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
  • 38. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
    Total cost = 14
  • 39. Predicate Ordering
    Find Mike’s photo that is also tagged by at least one of his friends
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute left to right
    Execute right to left
    Different predicate orders can result in different execution costs.
    Total cost = 7
    Total cost = 14
  • 40. How to Decide Predicate Ordering?
    • Enumerate execution sequences of predicates
    • 41. Estimate their costs using graph statistics
    • 42. Find the sequence with minimum cost
  • Cost Estimation using Graph Statistics
    Graph Statistics
    Left to right
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    EstimatedCost = ???
  • 43. Cost Estimation using Graph Statistics
    Graph Statistics
    Left to right
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    EstimatedCost = 1 [find Mike]
  • 44. Cost Estimation using Graph Statistics
    Graph Statistics
    Left to right
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    EstimatedCost = 1 [find Mike]
    + (1* 2.2) [find Mike, Tagged, Photo]
  • 45. Cost Estimation using Graph Statistics
    Graph Statistics
    Left to right
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    EstimatedCost = 1 [find Mike]
    + (1* 2.2) [find Mike, Tagged, Photo]
    + (2.2 * 1.6) [find Mike, Tagged, Photo, Tagged, Person]
    + (2.2 * 1.6 * 1.2) [find Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike]
    = 11
  • 46. Plan Enumeration
    Find Mike’s photo that is also tagged by at least one of his friends
    Plan1
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Plan2
    Mike, FriendOf, Person, Tagged, Photo, Tagged, Mike
    Plan3
    (Mike, FriendOf, Person) ⋈ (Person, Tagged, Photo, Tagged, Mike)
    Plan4
    (Mike, FriendOf, Person, Tagged, Photo) ⋈ (Photo, Tagged, Mike)
    .
    .
    .
    .
    .
  • 47. Derived Edges
    Friends-photo-tag
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Cost = 7
    Mike, Tagged, Photo, Friends-photo-tag, Mike
    Cost = 5
  • 48. Query Fragment Reuse
    Query1
    Mike, Tagged, Photo
    Query2
    Person, FriendOf, Mike
    Query3
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
  • 49. Query Fragment Reuse
    Query1
    Mike, Tagged, Photo
    Query2
    Person, FriendOf, Mike
    Query3
    Mike, Tagged, Photo, Tagged, Person, FriendOf, Mike
    Execute Query3 based on cached results of Query1 and Query2:
    [Query1], Tagged, Person, FriendOf, Mike
    [Query2], Tagged, Photo, Tagged, Mike
    [Query1]⋈[Photo, Tagged, Person] ⋈[Query2]
  • 50. Summary of Query Optimization
    Predicate ordering
    Derived edges
    Reuse of query fragments
  • 51. Experimental Evaluation
    Graph
    Codebook application
    1 Million nodes
    10 million edges
    Machines
    7 machines
    Intel Core 2 Duo 2.26 GHz , 4 GB ram
    Graph partitioned on 6 machines
  • 52. Initial Results
  • 53. Summary
    • Graph query language
    Regular language expression
    • Graph query execution engine
    Distributed breadth first-search
    • Graph query optimizer
    Predicate ordering, derived edges, query fragment reuse
    • Implementation
    Initial experimental results
    Demo
  • 54. Questions?
    .