Gremlins for Amundsen
08/13/2020
Content
Gremlin Introduction
Amundsen Gremlin Overview
Lessons Learned
Upstream Plan
Gremlin Introduction
● Graph traversal language
● Apache TinkerPop
● Vertices and edges
● Widely supported
● Sample queries
    g.V().hasLabel('airport').has('code', 'DFW')
    g.V().has(Table, 'key', table_uri).outE().inV().hasLabel(Column).as_('column')
● Curious to know more? See Practical Gremlin
● Cypher equivalent
    MATCH (:Airport {code: 'DFW'})
    MATCH (:Table {key: $table_uri})-[:COLUMN]->(column:Column)
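To make the second sample query concrete, here is a minimal sketch of what the traversal does, using a toy in-memory graph instead of a real TinkerPop server. `ToyGraph`, `columns_of`, and the sample vertices are hypothetical names invented for illustration; they are not part of gremlin-python or Amundsen.

```python
from typing import Dict, List, NamedTuple


class Vertex(NamedTuple):
    id: str
    label: str
    properties: Dict[str, str]


class Edge(NamedTuple):
    label: str
    out_v: str  # id of the vertex the edge leaves
    in_v: str   # id of the vertex the edge enters


class ToyGraph:
    """Hypothetical stand-in for a graph database, for illustration only."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.edges: List[Edge] = []

    def add_vertex(self, vertex_id: str, label: str, **properties: str) -> None:
        self.vertices[vertex_id] = Vertex(vertex_id, label, properties)

    def add_edge(self, label: str, out_v: str, in_v: str) -> None:
        self.edges.append(Edge(label, out_v, in_v))

    def has(self, label: str, key: str, value: str) -> List[Vertex]:
        # Rough analogue of g.V().has(label, key, value): scan every vertex.
        return [v for v in self.vertices.values()
                if v.label == label and v.properties.get(key) == value]

    def out(self, start: List[Vertex], in_label: str) -> List[Vertex]:
        # Rough analogue of .outE().inV().hasLabel(in_label).
        start_ids = {v.id for v in start}
        targets = [self.vertices[e.in_v] for e in self.edges
                   if e.out_v in start_ids]
        return [v for v in targets if v.label == in_label]


def columns_of(graph: ToyGraph, table_uri: str) -> List[str]:
    """Names of the Column vertices reachable from the matching Table."""
    tables = graph.has('Table', 'key', table_uri)
    return sorted(v.properties['name'] for v in graph.out(tables, 'Column'))


graph = ToyGraph()
graph.add_vertex('t1', 'Table', key='db://schema/users')
graph.add_vertex('c1', 'Column', name='id')
graph.add_vertex('c2', 'Column', name='email')
graph.add_vertex('d1', 'Description', text='User accounts')
graph.add_edge('COLUMN', 't1', 'c1')
graph.add_edge('COLUMN', 't1', 'c2')
graph.add_edge('DESCRIPTION', 't1', 'd1')

print(columns_of(graph, 'db://schema/users'))
```

The `hasLabel(Column)` step matters here: the table also has an outgoing edge to a Description vertex, and the label filter is what drops it from the result.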
Amundsen Gremlin Overview
● Why build this?
○ Hosted graph
○ Online backups
○ Proxy is platform-agnostic*
● Amundsen
[Architecture diagram. Components shown: Metadata Sources (Postgres, Hive, Redshift, ..., Presto, Github Source File), Databuilder Crawler, AWS Neptune, Elastic Search, Metadata Service, Search Service, Frontend Service, Gremlin shared code]
● Gremlin shared code
● Metadata service
○ Gremlin proxy
● Abstract proxy tests
○ Construct one case, test against every* proxy
    def test_rt_table(self) -> None:
        expected = Fixtures.next_table()
        self.get_proxy().put_table(table=expected)
        actual: Table = self.get_proxy().get_table(table_uri=expected.key)
        self.assertEqual(expected, actual)
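One way the "construct one case, test against every proxy" idea can be laid out is with a shared test mixin and one small subclass per proxy. The sketch below is an assumption about the shape of such a suite, not the actual Amundsen code: `Table`, `DictProxy`, `ListProxy`, `AbstractProxyTest`, and `run_suite` are hypothetical names, and the two proxies are in-memory stand-ins.

```python
import unittest
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Table:
    key: str
    name: str


class Proxy(ABC):
    @abstractmethod
    def put_table(self, table: Table) -> None: ...

    @abstractmethod
    def get_table(self, table_uri: str) -> Table: ...


class DictProxy(Proxy):
    """Hypothetical proxy backed by a dict."""

    def __init__(self) -> None:
        self._tables: Dict[str, Table] = {}

    def put_table(self, table: Table) -> None:
        self._tables[table.key] = table

    def get_table(self, table_uri: str) -> Table:
        return self._tables[table_uri]


class ListProxy(Proxy):
    """Hypothetical proxy backed by a list, to show a second backend."""

    def __init__(self) -> None:
        self._tables: List[Table] = []

    def put_table(self, table: Table) -> None:
        # Replace any existing table with the same key, then append.
        self._tables = [t for t in self._tables if t.key != table.key]
        self._tables.append(table)

    def get_table(self, table_uri: str) -> Table:
        return next(t for t in self._tables if t.key == table_uri)


class AbstractProxyTest:
    """Mixin holding the shared cases; not a TestCase, so it is never
    collected or run on its own."""

    def get_proxy(self) -> Proxy:
        raise NotImplementedError

    def test_rt_table(self) -> None:
        expected = Table(key='db://schema/users', name='users')
        self.get_proxy().put_table(table=expected)
        actual = self.get_proxy().get_table(table_uri=expected.key)
        self.assertEqual(expected, actual)


class DictProxyTest(AbstractProxyTest, unittest.TestCase):
    def setUp(self) -> None:
        self._proxy = DictProxy()

    def get_proxy(self) -> Proxy:
        return self._proxy


class ListProxyTest(AbstractProxyTest, unittest.TestCase):
    def setUp(self) -> None:
        self._proxy = ListProxy()

    def get_proxy(self) -> Proxy:
        return self._proxy


def run_suite() -> unittest.TestResult:
    """Run the shared case once per concrete proxy."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (DictProxyTest, ListProxyTest):
        suite.addTests(loader.loadTestsFromTestCase(case))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Each new proxy then only needs a subclass that supplies `get_proxy()`; every round-trip case written in the mixin runs against it automatically.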
● Databuilder
Lessons Learned
● Failed experiments
○ Transactional gremlin for writes:
■ V only once - prefer V(id)
● g.V(id1).as_('one').V(id2).addE(label).from_('one')
■ Smaller traversals are better
■ Minimize coalesce() in write
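The "smaller traversals" point can be sketched as splitting a large set of edge writes into small batches, with each edge addressed by vertex id in the `V(id)` style shown above. This is a minimal illustration under stated assumptions: `EdgeSpec`, `chunked`, and `render_add_edge` are hypothetical helpers, the batch size is arbitrary, and the rendered text is for display only. Real gremlin-python code would build the traversal with the library's steps rather than with strings.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical edge description: (from_id, label, to_id)
EdgeSpec = Tuple[str, str, str]


def chunked(items: Sequence[EdgeSpec], size: int) -> Iterator[List[EdgeSpec]]:
    """Yield consecutive batches of at most `size` edges."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def render_add_edge(edge: EdgeSpec) -> str:
    """Render one edge write in the V(id) form from the slide (display only)."""
    from_id, label, to_id = edge
    return (f"g.V({from_id!r}).as_('one')"
            f".V({to_id!r}).addE({label!r}).from_('one')")


edges: List[EdgeSpec] = [
    ('t1', 'COLUMN', 'c1'),
    ('t1', 'COLUMN', 'c2'),
    ('t1', 'COLUMN', 'c3'),
    ('t2', 'COLUMN', 'c4'),
    ('t2', 'COLUMN', 'c5'),
]

# Each batch would be submitted as its own small write.
for batch in chunked(edges, size=2):
    for edge in batch:
        print(render_add_edge(edge))
```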
● Failed experiments (cont)
○ SessionedClient
○ AWS Lambda write from Kinesis
Upstream Plan
TODAY
Internal refactoring
Consolidation of Gremlin code into a new shared amundsen-gremlin repository. Databuilder and the metadata service will use the shared code.
Approx. August 17
Stabilization
Improve stability and performance of the existing Gremlin code.
Approx. August 7
Ship to Amundsen
Clean up Square-specific bits of amundsen-gremlin and publish it. Publish the proxy and proxy tests that use amundsen-gremlin.
Approx. August 21
Thank you
Kudos to the rest of the Privacy Engineering team at Square who worked on this: Dan Simms, Alyssa Ransbury, Sarah Harvey, and Kat Hawthorne.
Questions?
See also: amundsen-io issue 526

Amundsen gremlin proxy design
