Scaling Tribal Knowledge
CHRIS WILLIAMS / JOHN BODLEY / FEB 8, 2017 / BIG DATA APPLICATION MEETUP
The problem
tribal knowledge |ˈtrībəl ˈnäləj |
noun
Tribal knowledge is any unwritten information that is not commonly
known by others within a company
As Airbnb grows so do the challenges around the volume,
complexity, and obscurity of data
In a large and complex organization, with a sea of data
resources, users struggle to find the right data
Data is often siloed and lacks context
I’m a recovering Data Scientist who wants to democratize
data; automate common workflows, surface relevant
information, and provide context
Tables in our Hive data warehouse
100k
Data resources
Beyond the data warehouse
> 6,000
Superset charts and
dashboards
> 5,000
Experiments and
metrics
> 4,000
Tableau dashboards
and workbooks
> 1,000
Knowledge posts
With many more data sources
and data types to love
and most importantly…
> 3,000 Airbnb employees
Portland
San Francisco
Los Angeles
Toronto
New York
Miami
Sao Paulo
Dublin
London
Paris
Barcelona
Berlin
Milan
Copenhagen
New Delhi
Seoul
Beijing
Tokyo
Sydney
Singapore
Washington, DC
> 20
Offices around the world
The mandate
To democratize data and empower Airbnb employees to be data-informed
by aiding with data exploration, discovery, and trust
The concept
Search…
It should be fairly evident what we feed into the search indices
But are we missing something?
The relevancy of relationships
Nodes and relationships have equal standing
[Figure: nodes joined by relationships labeled "created" and "consumed"]
The graph
[Figure: graph of nodes joined by relationships labeled "created", "consumed", and "associated"]
The construction
Databases
5
APIs
3
Airflow DAG
1
We leverage all these data resources to build a graph comprising
nodes and relationships
The Airflow DAG is run every day and the output is stored in Hive
We gather over 10,000 thumbnails from the Tableau API,
Knowledge Repo database, and Superset screenshots
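A minimal sketch of the idea above, assuming each upstream source can be reduced to (user, action, resource) records. The record fields and the Node and Relationship classes are hypothetical names chosen for illustration, not the Dataportal's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # e.g. "User", "Table", "Dashboard"
    id: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    type: str  # e.g. "CREATED", "CONSUMED", "ASSOCIATED"
    target: Node


def build_graph(records):
    """Collapse raw records into de-duplicated sets of nodes and relationships."""
    nodes, relationships = set(), set()
    for rec in records:
        user = Node("User", rec["user"])
        resource = Node(rec["resource_type"], rec["resource_id"])
        nodes.update((user, resource))
        relationships.add(Relationship(user, rec["action"].upper(), resource))
    return nodes, relationships


# Hypothetical records, as they might arrive from different sources.
records = [
    {"user": "jane_doe", "action": "created",
     "resource_type": "Dashboard", "resource_id": "123"},
    {"user": "john_roe", "action": "consumed",
     "resource_type": "Dashboard", "resource_id": "123"},
    {"user": "john_roe", "action": "consumed",
     "resource_type": "Dashboard", "resource_id": "123"},  # repeat view
]
nodes, relationships = build_graph(records)
```

Because nodes and relationships are held in sets of frozen dataclasses, repeated events collapse into a single relationship, which is one simple way to keep a daily rebuild idempotent.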
The winding data path
Airflow: data transfer
Python: graph datastore
py2neo: Python Neo4j driver
Neo4j: graph database
GraphAware: Neo4j/Elasticsearch plugin
Elasticsearch: search engine
Flask: Python web framework
Hive: data warehouse
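As a rough illustration of the Python-to-Neo4j hop in this path, the sketch below turns one relationship row into a parameterized Cypher MERGE statement. The function name, the allowed relationship types, and the `$param` placeholder style are assumptions; the resulting pair would be handed to a Neo4j driver such as py2neo, which is not invoked here.

```python
# Relationship types cannot be passed as Cypher parameters, so they are
# checked against an allow-list before being formatted into the query.
ALLOWED_TYPES = {"CREATED", "CONSUMED", "ASSOCIATED"}


def merge_relationship(source_id, rel_type, target_id):
    """Return (cypher, params) that idempotently upserts one relationship."""
    if rel_type not in ALLOWED_TYPES:
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    cypher = (
        "MERGE (s:Entity {id: $source_id}) "
        "MERGE (t:Entity {id: $target_id}) "
        f"MERGE (s)-[:{rel_type}]->(t)"
    )
    return cypher, {"source_id": source_id, "target_id": target_id}


cypher, params = merge_relationship("jane_doe", "CREATED", "core_data.dim_users")
```

MERGE is used rather than CREATE so that re-running the daily load does not duplicate nodes or relationships that already exist.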
Why we chose Neo4j for our database
The main reasons
Logical
Given our data is represented as a graph, it is logical to use a
graph database to store the data
Nimble
Performance wins over relational databases
when dealing with connected data
Popular
It is the world’s leading
graph database and
the community edition
is free
Integrative
It integrates well with
Python and
Elasticsearch
The Neo4j and Elasticsearch symbiotic relationship
Courtesy of two GraphAware plugins
Neo4j plugin
Provides bi-directional integration which transparently and asynchronously replicates data from
Neo4j to Elasticsearch
Elasticsearch plugin
Enables Elasticsearch to consult with the Neo4j database during a search query to enrich the
search rankings by leveraging the graph topology
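To make the idea of topology-aware ranking concrete, here is a small sketch that blends a text relevance score with a graph signal (how many users consume a resource). This is not the GraphAware plugin's algorithm; the function, the weighting, and the assumption that text scores are normalized to the range 0 to 1 are all illustrative.

```python
def rerank(hits, consumers, graph_weight=0.5):
    """Blend each hit's text score with a graph signal and sort descending.

    hits:      list of (resource_id, text_score) pairs from the search engine
    consumers: dict mapping resource_id -> number of distinct consuming users
    """
    # Normalize consumer counts by the largest count (avoid dividing by zero).
    max_consumers = max(consumers.values(), default=0) or 1
    scored = []
    for resource_id, text_score in hits:
        graph_score = consumers.get(resource_id, 0) / max_consumers
        blended = (1 - graph_weight) * text_score + graph_weight * graph_score
        scored.append((resource_id, blended))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical hits: dash_a matches the text slightly better,
# but dash_b is consumed by far more people.
hits = [("dash_a", 0.9), ("dash_b", 0.8)]
consumers = {"dash_a": 2, "dash_b": 40}
ranked = rerank(hits, consumers)
```

With these inputs the widely consumed dashboard moves ahead of the one with the marginally better text match, which is the kind of effect the graph topology is meant to contribute.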
The schema
[Figure: node s joined to node t by a "created" relationship]
(s:Entity)-[r:CREATED]->(t:Entity)
Node label hierarchy
:Entity
  :Org
    :Group
    :User
  :Superset
    :Slice
    :Dashboard
  :Hive
    :Schema
    :Table
(:Entity:Org:User {id: 'jane_doe'})
(:Entity:Hive:Table {id: 'core_data.dim_users'})
(:Entity:Superset:Dashboard {id: 123})
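The label hierarchy can be sketched as a parent lookup from which the full label path of any node is derived. The PARENT table mirrors the labels on the slide; the helper names are hypothetical, and using repr() to render the id is for display only, not a safe way to build production queries.

```python
# Parent of each label; None marks the root of the hierarchy.
PARENT = {
    "Entity": None,
    "Org": "Entity", "Group": "Org", "User": "Org",
    "Superset": "Entity", "Slice": "Superset", "Dashboard": "Superset",
    "Hive": "Entity", "Schema": "Hive", "Table": "Hive",
}


def labels_for(leaf):
    """Return the label path from the root down to `leaf`."""
    path = []
    while leaf is not None:
        path.append(leaf)
        leaf = PARENT[leaf]
    return list(reversed(path))


def node_pattern(leaf, node_id):
    """Render a Cypher-style node pattern carrying every label on the path."""
    labels = "".join(f":{label}" for label in labels_for(leaf))
    return f"({labels} {{id: {node_id!r}}})"


table_pattern = node_pattern("Table", "core_data.dim_users")
dashboard_pattern = node_pattern("Dashboard", 123)
```

Giving every node its ancestors' labels means a query can match at any level: all :Entity nodes, all :Hive nodes, or only :Table nodes.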
Efficient data retrieval and uniqueness
Restrictions and workarounds with the Neo4j schema
Indexes
Neo4j provides indexes for efficient data retrieval, similar to an RDBMS; however, they are only
defined for a single label
Uniqueness Constraints
Ensure that property values are unique across all nodes with a specific single label
GraphAware UUID plugin
Transparently assigns a globally unique UUID property to newly created elements;
the property cannot be changed or deleted
(:Entity {uuid: '<UUID>'})
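One reading of the workaround is that, since indexes and uniqueness constraints apply to a single label, they are declared on the shared :Entity label that every node carries. The sketch below only generates the schema statements; the helper name and the choice of properties are assumptions, and the statement syntax is the Neo4j 3.x form, which newer releases have changed.

```python
def schema_statements(label="Entity", unique_props=("uuid",), indexed_props=("id",)):
    """Return Cypher schema statements for one label (Neo4j 3.x syntax)."""
    statements = [
        f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"
        for prop in unique_props
    ]
    statements += [f"CREATE INDEX ON :{label}({prop})" for prop in indexed_props]
    return statements


statements = schema_statements()
```

Declaring the constraint once on :Entity covers users, tables, and dashboards alike, whereas a constraint on :Table alone would say nothing about a :Dashboard with the same property value.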
The web app
Designing the user experience and interface of
a data tool should not be an afterthought
User personas
Daphne Data: technical data power user; the epitome of a tribal knowledge holder
Manager Mel: less data literate; needs to keep tabs on her team’s resources
Nathan New: new employee or new team; has no idea what’s going on
Designing for data exploration, discovery, and trust
Search
Resource details & meta-data
User data
Group data
Company data
Google-esque search filters
Resource details & meta-data
Context, context, & context
Description, external link, social
Meta-data & consumption
Surface relationships; everything’s a link to promote exploration
Column details & value distributions
Table lineage
Enrich meta-data on the fly
User details & meta-data
What they make, what they consume
Former employees also hold tribal knowledge
Group overview
Pinterest-like curation
Basic organization functionality
Curated + Popular content
Thumbnails for maximum context
Pinning flow from resource page
Edit mode / draggable grid
Employees can feel disconnected
from Company-level metrics
The technology stack
Application +
dependencies
DOM Testing
eslint
enzyme
mocha
chai
Application
state
Styling
khan/aphrodite
The challenges
Complex
dependencies
An umbrella data tool is
vulnerable to changes
in upstream resource
dependencies
Data-dense design
Balancing simplicity and
functionality is hard;
most internal design
resources are not made
for data-rich apps
Graph merging
Non-trivial Git-like
merging of (daily or real-time)
graph updates
Graph flickering
Transient relationships
should not create
“flickering” artifacts
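A sketch of one way to approach the merging and flickering challenges, assuming each run produces a full snapshot of relationships. A relationship missing from the latest snapshot is kept for a grace period before being dropped, so a transient upstream gap does not make it disappear and reappear. The function, its parameters, and the grace-period policy are hypothetical, not the approach the team settled on.

```python
def merge_snapshot(current, snapshot, missing_counts, grace_runs=2):
    """Merge a fresh snapshot of relationships into the served graph.

    current:        set of relationships currently served
    snapshot:       set of relationships seen in the latest run
    missing_counts: dict of consecutive runs each relationship has been absent
    grace_runs:     absences tolerated before a relationship is dropped
    """
    merged = set(snapshot)
    for rel in current - snapshot:
        missing_counts[rel] = missing_counts.get(rel, 0) + 1
        if missing_counts[rel] <= grace_runs:
            merged.add(rel)          # keep it: the absence may be transient
        else:
            del missing_counts[rel]  # absent too long: treat as a real deletion
    for rel in snapshot:
        missing_counts.pop(rel, None)  # seen again, so reset its counter
    return merged


# Hypothetical relationships as (source, type, target) tuples.
stable = ("jane_doe", "CREATED", "dash_9")
flaky = ("jane_doe", "CONSUMED", "dash_123")
counts = {}
run1 = merge_snapshot({stable, flaky}, {stable}, counts)  # flaky absent once
run2 = merge_snapshot(run1, {stable}, counts)             # absent twice
run3 = merge_snapshot(run2, {stable}, counts)             # absent three times
```

The grace period trades freshness for stability: real deletions take a few runs to show up, but a single failed upstream extract no longer removes relationships from the graph.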
The future
New resource types
A/B tests, logging
schemas, SQL queries,
etc.
Certified content
Use certification to build
trust and enable users to
filter through a sea of
stale content
Alerts & recommendations
Move from active
exploration to deliver
relevant updates and
content suggestions
Gamification
Provide content
producers with a sense
of value
The team
The Dataportal team
Analytics & Experimentation Products
John Bodley
Software Engineer
Eli Brumbaugh
Experience Designer
Jeff Feng
Product Manager
Michelle Thomas
Software Engineer
Chris Williams
Data Visualization
Thank you
john.bodley@airbnb.com
chris.williams@airbnb.com
