The document discusses scaling tribal knowledge at Airbnb by building a graph database of the company's data resources. It describes collecting metadata on thousands of charts, dashboards, experiments, and other data assets from various systems into a Neo4j graph database using Airflow, then indexing the graph in Elasticsearch for fast search. This allows employees to explore relationships between data resources and find what they need.
Relevant Search Leveraging Knowledge Graphs with Neo4j (GraphAware)
This talk presents Neo4j as a viable tool in a relevant search ecosystem, demonstrating that it offers not only a suitable model for representing complex data such as text, user models, business goals, and context information, but also efficient ways of navigating this data in real time. Moreover, at an early stage in the "search improvement process", Neo4j can help relevance engineers identify the salient features describing the content, the user, or the search query; these features can later be fed to the search engine through extraction and enrichment.
The talk also demonstrates how the graph model can support all the components of relevant search, and concludes with a complete end-to-end infrastructure for providing relevant search in a real use case, showing how it integrates with other tools such as Elasticsearch, Apache Kafka, Stanford NLP, OpenNLP, and Apache Spark.
Machine Learning Powered by Graphs - Alessandro Negro (GraphAware)
Graph-based machine learning is becoming a very important trend in Artificial Intelligence, transcending many other techniques. The world's largest companies are promoting this trend. For instance, Google's Expander platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data, with nodes corresponding to objects or concepts and edges connecting concepts that share similarities.
Using graphs as the basic representation of data for machine learning has several advantages: (i) the data is already modelled for further analysis, explicitly representing connections and relationships between things and concepts; (ii) graphs can easily combine multiple sources into a single graph representation and learn over them, creating Knowledge Graphs; (iii) many machine learning algorithms exploit graphs to improve computational performance and result quality.
The presentation illustrates these advantages with applications such as recommendation engines and natural language processing that use machine learning over a graph. Concrete scenarios, models, and an end-to-end infrastructure are discussed.
Graph-Powered Machine Learning - Meetup Paris - March 5, 2018
Graph-based machine learning is becoming an important trend in artificial intelligence, transcending many other techniques. Using graphs as a basic representation of data has multiple advantages:
- the data is already modeled for further analysis;
- graphs can easily combine multiple sources into a single graph representation and learn over them, creating Knowledge Graphs;
- graphs improve computation performance and result quality.
The talk will present these advantages along with applications in the context of recommendation engines and natural language processing.
Speaker: Dr. Vlasta Kus (@VlastaKus) is a Data Scientist at GraphAware, specializing in graph-based Natural Language Processing and related topics, including deep learning techniques. He speaks English, Czech and some French and currently lives in Prague.
How Boston Scientific Improves Manufacturing Quality Using Graph Analytics (GraphAware)
Tracking end-of-line manufacturing issues to their source can be a daunting task. Boston Scientific, in partnership with GraphAware, has used the Neo4j platform to build a manufacturing quality tool that offers dramatic improvements to the time, quality, and quantity of investigations. In this talk we will review a manufacturing value stream in a graph and discuss the analysis methods available, which result in striking increases in business efficiency for this unique application. We will also present how the system was implemented within the existing data architecture and then scaled from a laptop investigational tool to an enterprise-grade solution with Neo4j Server.
*Talk at GraphConnect NYC 2018*
The Business Case for Semantic Web Ontology & Knowledge Graph (Cambridge Semantics)
In this webinar, Mark Wallace, Ontologist & Developer at Semantic Arts, and Thomas Cook, Director of Sales for AnzoGraph DB at Cambridge Semantics, explore the benefits of building a Semantic Knowledge Graph with RDF*, wrapping up with an airline data demo that illustrates the value of schema, inference, and reasoning.
How Graph Databases efficiently store, manage and query connected data at s... (jexp)
Graph databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query, and consistency approaches that are used behind the scenes. We'll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing, and more, and see which papers and research database developers take inspiration from.
GraphDB Cloud: Enterprise Ready RDF Database on Demand (Ontotext)
GraphDB Cloud is an enterprise-grade RDF graph database providing high-performance querying over large volumes of RDF data. In this webinar, Ontotext demonstrates how to instantly create and deploy a fully managed graph database, then import and query data with the (OpenRDF) GraphDB Workbench, and finally explore and visualize data with the built-in visualization tools.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (Neo4j)
Neo4j-Databridge is a fully-featured ETL tool specifically built for Neo4j, and designed for usability, expressive power and high performance. It has been created to help solve the most common problems faced by large enterprises when importing data into Neo4j - data locality, multiple data sources and formats, performance when loading very large data sets, bespoke data conversions, inclusion of non-tabular data, filtering, merging and de-duplication...
In this webinar, we'll take a quick tour of the main features of Neo4j-Databridge and see how it can help solve these problems and make importing your data into Neo4j quick and easy.
Our data is getting not just more complex but also more connected. In order not to lose sight of this web of information, but instead to use it as a source of new insights and opportunities, technologies such as graph databases can help.
For both analytical and transactional use cases, they allow efficient storage, retrieval, and processing of networked data without loss of detail. In this talk, we want to get to know existing tools and techniques for graph data processing.
Should a Graph Database Be in Your Next Data Warehouse Stack? (Cambridge Semantics)
In this webinar, AnzoGraph’s graph database guru Barry Zane (former co-founder of Netezza) and data governance author Steve Sarsfield talk about how graph databases fit into the data warehouse modernization trend. They also explore how certain workloads can be better served with an analytical graph database and how today’s technology stacks offer new paradigms for deployment like the cloud, containers and graph analytics.
When it comes to dealing with large, complex, and disparate data sets, traditional database technologies are unable to keep pace with the rich analytics necessary to power today's data-driven applications. Graph analytics databases are becoming the underlying infrastructure for AI and machine learning. These databases allow users to ask complex questions across complex data, which is not always practical or even possible at scale using other approaches. They also enable faster insights against massive data sets when combined with pattern recognition, statistical analysis, and AI/machine learning. And in the case of standards-based graph databases, they connect with popular visualization tools like Graphileon, allowing users to easily explore their data stores and quickly build compelling graph-based applications.
Relational databases were conceived to digitize paper forms and automate well-structured business processes, and still have their uses. But, oftentimes with RDBMS, performance degrades with the increasing number and levels of data relationships and data size.
A graph database like Neo4j naturally stores, manages, analyzes, and uses data within the context of connections, meaning Neo4j provides faster query performance and vastly improved flexibility in handling complex hierarchies compared to SQL.
This webinar explains why companies are shifting away from RDBMS towards graphs to unlock the business value in their data relationships.
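To make the contrast concrete, here is a minimal sketch, not taken from the webinar, of a multi-hop query in Cypher issued through py2neo; the Employee/MANAGES model, server address, and credentials are hypothetical, and the equivalent SQL self-join chain appears only in the comments.

```python
# Hypothetical example: find every report under a manager, at any depth.
# In SQL this needs one self-join per level of the hierarchy:
#   SELECT e2.name FROM employee e1
#   JOIN employee e2 ON e2.manager_id = e1.id
#   JOIN employee e3 ON e3.manager_id = e2.id ...   -- one join per level
# In Cypher, depth is part of the pattern, not the query shape.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

rows = graph.run(
    """
    MATCH (boss:Employee {name: $name})-[:MANAGES*1..]->(report:Employee)
    RETURN DISTINCT report.name AS name
    """,
    name="Ada",
).data()

for row in rows:
    print(row["name"])
```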
Graphs and Machine Learning have long been a focus for Franz Inc. and currently we are collaborating with a number of companies to deliver the ability to understand possible future events based on a company's internal as well as externally available data. By combining machine learning, semantic technologies, big data, graph databases and dynamic visualizations we will discuss the core components of a Cognitive Computing platform.
We discuss example Cognitive Computing platforms from Ecommerce, fraud detection and healthcare that combine structured/unstructured data, knowledge, linked open data, predictive analytics, and machine learning to enhance corporate decision making.
What you need to know to start an AI company (Mo Patel)
An overview of why AI and Deep Learning are hot now, an overview of Machine Intelligence startups, the key ingredients for an AI startup, and how AI startups can compete with big tech companies, including areas to focus on for differentiation.
This webinar focuses on the use case of graph databases in Network & IT Management. It is designed for people who work in network management at telecom companies, or professionals in industries that handle and rely on complex networks.
We'll start with an overview of Neo4j and graph thinking within networks, explaining how networks are naturally modelled as graphs. We'll explain how graph databases help mitigate some of the major challenges network and security managers face on a daily basis, including intrusions and other cyber crimes, performance optimization, outage simulations, fraud prevention, and more.
While the Rio 2016 Olympics are winding down and the final medals are being handed out, we thought we would share a bit of work that was done recently by Rik Van Bruggen to explore a really interesting dataset in Neo4j.
Based on an original public dataset by the UK newspaper The Guardian, Rik completed the medallist dataset to contain over 30,000 Olympians between 1896 and 2012. He created a graph model, loaded the data, and wrote a bunch of example queries that yielded some very interesting results. Join us for this 30 minute webinar where we’ll take you through this great Olympian graph and take the data for a spin yourself afterwards.
This talk traces the trajectory schema.org has taken, starting with a history that is less a retrospective than a narrative. I'll follow this narrative to the fortunately timed emergence of JSON-LD, which provides a flexible, standards-based serialization of the vocabulary.
This, I'll explain, helped fuel the popularity of schema.org, which in turn has caused a demand for more schemas, growing the vocabulary and its capabilities. I'll make the case that schema.org has started to resemble exactly what everyone involved in the initiative declared it shouldn't be: an ontology of everything.
Whether or not that be the case, I'll say, the utility of having a relatively simple, well thought-out, well-understood and very broad vocabulary available has made schema.org (along with JSON-LD) a go-to tool for linked data modelers.
Finally, with a look at the many ways Google, in particular, has made use of schema.org, I'll explore to what extent its utility extends past being a convenient starting point for back-of-the-napkin knowledge graph development, or whether it's making a significant contribution to realizing the promise of a web of data.
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea... (Impetus Technologies)
Traditional databases and batch ETL operations have not been able to serve the growing data volumes and the need for fast and continuous data processing.
How can modern enterprises provide their business users real-time access to the most up-to-date and complete data?
In our upcoming webinar, our experts will talk about how real-time CDC improves data availability and fast data processing through incremental updates in the big data lake, without modifying or slowing down source systems. Join this session to learn:
- What CDC is and how it impacts business
- The various methods for CDC in the enterprise data warehouse
- The key factors to consider while building a next-gen CDC architecture:
  - batch vs. real-time approaches
  - moving from just capturing and storing to capturing, enriching, transforming, and storing
  - avoiding stopgap silos to state-through processing
- Implementation of CDC through a live demo and use case
You can view the webinar here - https://www.streamanalytix.com/webinar/planning-your-next-gen-change-data-capture-cdc-architecture-in-2019/
For more information visit - https://www.streamanalytix.com
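As a toy illustration of the capture-and-apply idea described above (and not of StreamAnalytix itself), the sketch below applies a stream of hypothetical change events incrementally to an in-memory table, mirroring the upsert/delete logic a CDC pipeline performs against a data lake target.

```python
# A toy CDC sketch: apply insert/update/delete events incrementally to a
# target "table" keyed by primary key, instead of reloading it in batch.
# The event shape (op, key, data) is a hypothetical simplification.
from typing import Dict, List


def apply_cdc_events(table: Dict[str, dict], events: List[dict]) -> Dict[str, dict]:
    """Apply change events to an in-memory table keyed by primary key."""
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            # Upsert: enrichment/transformation could happen here before storing
            table[key] = {**table.get(key, {}), **event["data"]}
        elif event["op"] == "delete":
            table.pop(key, None)
    return table


if __name__ == "__main__":
    target = {"42": {"name": "old name", "city": "Rennes"}}
    changes = [
        {"op": "update", "key": "42", "data": {"name": "new name"}},
        {"op": "insert", "key": "43", "data": {"name": "fresh row"}},
        {"op": "delete", "key": "42", "data": {}},
    ]
    print(apply_cdc_events(target, changes))  # only key "43" remains
```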
The Data World Distilled
Understanding how the data world works in the Big Data era
I created this slide deck as a learning tool for new employees; I figured I would post it in case it can help others understand the data space.
This slide deck covers:
- Big Data
- Data Warehouses
- ETL/Data Integration
- Business Intelligence and Analytics
- Data Quality
- Data Testing
- Data Governance
It provides a brief description along with key vendors in the space.
Data Discovery at Databricks with Amundsen (Databricks)
Databricks used to rely on a static, manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity and trust by programmatically surfacing the most relevant datasets and SQL analytics dashboards, along with their important information, internally at Databricks.
We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to surface metadata, including:
- Surface the most popular tables used within Databricks
- Support fuzzy search and facet search for datasets
- Surface rich metadata on datasets:
  - lineage information (downstream tables, upstream tables, downstream jobs, downstream users)
  - dataset owner
  - dataset frequent users
  - Delta extended metadata (e.g. change history)
  - the ETL job that generates the dataset
  - column stats on numeric-type columns
  - dashboards that use the given dataset
  - use of the Databricks data tab to show sample data
- Surface metadata on dashboards, including create time, last update time, tables used, etc.
Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.
Apache CarbonData+Spark to realize data convergence and Unified high performa... (Tech Triveni)
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
Tapping into Scientific Data with Hadoop and Flink (Michael Häusler)
At ResearchGate, we constantly analyze scientific data to connect the world of science and make research open to all. It can be tricky to set up a process to continuously deliver improved versions of algorithms that tap into more than 100 million publications and corresponding bibliographic metadata. In this talk, we illustrate some (big) data engineering challenges of running data pipelines and incorporating results into the live databases that power our user-facing features every day. We show how Apache Flink helps us to improve performance, robustness, ease of maintenance - and most importantly - have more fun while building big data pipelines.
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
In this webinar we discuss the primary use cases for Graph Databases and explore the properties of Neo4j that make those use cases possible.
We cover the high-level steps of modeling, importing, and querying your data using Cypher and give an overview of the transition from RDBMS to Graph.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially those in a rapidly evolving organization.
OVH-Change Data Capture in production with Apache Flink - Meetup Rennes 2019-... (Yann Pauly)
Drive the business with your KPIs: that is what we aimed to do at OVH. As an 18-year-old company and a fairly large cloud provider, we encountered several issues during this long journey to set up change data capture and a data-driven culture.
Getting data from thousands of tables into one place and keeping it all up to date was not possible without a strong streaming engine like Apache Flink.
We will present our current production pipeline with its pros and cons: from data collection made directly from the binary logs of the databases, to continuous writing into Apache Hive in a Kerberized, cloud-based Apache Hadoop cluster. We will describe how we handle schema transcription, event lifecycles, stream partitioning, and ordering of events using watermarks and window aggregation, all in a transactional way, up to data availability on the user side.
Finally, we will introduce our cloud-only production infrastructure, its operation, and its monitoring.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home ("WFH"), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
12-14. Data resources beyond the data warehouse
- > 6,000 Superset charts and dashboards
- > 5,000 experiments and metrics
- > 4,000 Tableau dashboards and workbooks
- > 1,000 Knowledge posts
20. > 20 offices around the world
Portland, San Francisco, Los Angeles, Toronto, New York, Miami, Sao Paulo, Dublin, London, Paris, Barcelona, Berlin, Milan, Copenhagen, New Delhi, Seoul, Beijing, Tokyo, Sydney, Singapore, Washington, DC
41. 5 databases, 3 APIs, 1 Airflow DAG
We leverage all of these data resources to build a graph comprising nodes and relationships. The Airflow DAG runs every day and the output is stored in Hive.
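A minimal sketch of what such a daily DAG could look like, using Airflow 2-style imports; the DAG id, task names, and extraction logic are hypothetical placeholders rather than Airbnb's actual pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metadata(**context):
    """Placeholder: pull charts, dashboards, tables, users, etc. from the
    source databases and APIs and emit node/relationship rows."""
    ...


def write_to_hive(**context):
    """Placeholder: write the node and relationship tables to the Hive warehouse."""
    ...


with DAG(
    dag_id="dataportal_graph_build",      # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",           # the DAG runs every day
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    load = PythonOperator(task_id="write_to_hive", python_callable=write_to_hive)
    extract >> load
```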
43. We gather over 10,000 thumbnails from the Tableau API, Knowledge Repo database, and Superset screenshots.
44. The winding data path
- Airflow: data transfer
- Python: graph datastore
- py2neo: Python Neo4j driver
- Neo4j: graph database
- GraphAware: Neo4j/Elasticsearch plugin
- Elasticsearch: search engine
- Flask: Python web framework
- Hive: data warehouse
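As a rough sketch of the Hive-to-Neo4j step, assuming the graph tables have been exported to CSV files with hypothetical names and columns, the nodes and relationships could be pushed into Neo4j with py2neo like this:

```python
import csv

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# nodes.csv columns (hypothetical): uuid, label, name
with open("nodes.csv", newline="") as f:
    for row in csv.DictReader(f):
        graph.run(
            "MERGE (n:Resource {uuid: $uuid}) SET n.name = $name, n.kind = $label",
            uuid=row["uuid"], name=row["name"], label=row["label"],
        )

# relationships.csv columns (hypothetical): src_uuid, dst_uuid
with open("relationships.csv", newline="") as f:
    for row in csv.DictReader(f):
        graph.run(
            "MATCH (a:Resource {uuid: $src}), (b:Resource {uuid: $dst}) "
            "MERGE (a)-[:RELATES_TO]->(b)",
            src=row["src_uuid"], dst=row["dst_uuid"],
        )
```

In practice each resource type would likely get its own label and relationship type; a single generic label is used here only to keep the sketch short.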
59-63. Why we choose Neo4j for our database: the main reasons
- Logical: given our data is represented as a graph, it is logical to use a graph database to store it
- Nimble: performance wins when dealing with connected data versus relational databases
- Popular: it is the world's leading graph database and the community edition is free
- Integrative: it integrates well with Python and Elasticsearch
64-66. The Neo4j and Elasticsearch symbiotic relationship, courtesy of two GraphAware plugins
- Neo4j plugin: provides bi-directional integration that transparently and asynchronously replicates data from Neo4j to Elasticsearch
- Elasticsearch plugin: enables Elasticsearch to consult the Neo4j database during a search query and enrich the search rankings by leveraging the graph topology
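The snippet below is a conceptual sketch of that symbiosis rather than the GraphAware plugin API: it searches Elasticsearch first and then consults Neo4j for graph context on each hit; the index name, field names, and connection details are hypothetical.

```python
# Conceptual sketch only -- not the GraphAware plugin API.
# Search Elasticsearch, then enrich each hit with graph topology from Neo4j.
from elasticsearch import Elasticsearch
from py2neo import Graph

es = Elasticsearch("http://localhost:9200")
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

hits = es.search(index="dataportal", query={"match": {"name": "bookings"}})["hits"]["hits"]

for hit in hits:
    uuid = hit["_source"]["uuid"]
    # How many resources link to this hit; a signal that could be folded
    # back into the ranking score.
    degree = graph.run(
        "MATCH (n:Resource {uuid: $uuid})<--(m) RETURN count(m) AS degree",
        uuid=uuid,
    ).evaluate()
    print(hit["_source"]["name"], hit["_score"], degree)
```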
78-80. Efficient data retrieval and uniqueness: restrictions and workarounds with the Neo4j schema
- Indexes: Neo4j provides indexes for efficient data retrieval, similar to an RDBMS; however, they are only defined for a single label
- Uniqueness constraints: ensure that properties are unique across all nodes with a specific single label
- GraphAware UUID plugin: transparently assigns a globally unique UUID property to newly created elements, which cannot be changed or deleted
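A minimal sketch of this schema setup using py2neo and Neo4j 3.x-era Cypher; the label set is hypothetical, and the point is simply that each index and constraint must be declared once per label.

```python
# Declare per-label indexes and uniqueness constraints (Neo4j 3.x syntax).
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

labels = ["Table", "Dashboard", "Chart", "User"]  # hypothetical label set
for label in labels:
    # Index for fast lookup by name within one label
    graph.run(f"CREATE INDEX ON :{label}(name)")
    # Uniqueness constraint on the uuid property, per label
    graph.run(f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.uuid IS UNIQUE")
```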
84-85. Designing the user experience and interface of a data tool should not be an afterthought
86. User personas
- Daphne Data: technical data power user; the epitome of a tribal knowledge holder
- Manager Mel: less data literate; needs to keep tabs on her team's resources
- Nathan New: new employee or new team; has no idea what's going on
87. Designing for data exploration, discovery, and trust
Search | Resource details & meta-data | Company data | User data | Group data
97. Search | Resource details & meta-data | Company data | User data | Group data
- Surface relationships; everything is a link, to promote exploration
- Meta-data & consumption: description, external link, social
98-99. Search | Resource details & meta-data | Company data | User data | Group data
- Column details & value distributions
- Table lineage
- Enrich meta-data on the fly
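A minimal sketch of how a Flask endpoint behind such a resource page might assemble details and neighbours from Neo4j on the fly; the route, labels, and property names are hypothetical.

```python
# Hypothetical Flask endpoint: return a resource's metadata plus its graph
# neighbours so the UI can render "everything as a link".
from flask import Flask, jsonify
from py2neo import Graph

app = Flask(__name__)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))


@app.route("/api/resource/<uuid>")
def resource_details(uuid):
    records = graph.run(
        """
        MATCH (n:Resource {uuid: $uuid})
        OPTIONAL MATCH (n)-[r]-(m:Resource)
        RETURN n.name AS name, n.description AS description,
               collect({name: m.name, uuid: m.uuid, rel: type(r)}) AS neighbours
        """,
        uuid=uuid,
    ).data()
    if not records:
        return jsonify({"error": "not found"}), 404
    return jsonify(records[0])


if __name__ == "__main__":
    app.run(debug=True)
```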
120-122. The challenges
- Complex dependencies: an umbrella data tool is vulnerable to changes in upstream resource dependencies
- Data-dense design: balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps
- Graph merging: non-trivial, Git-like merging of (daily or real-time) graph updates
- Graph flickering: transient relationships should not create "flickering" artifacts
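For the graph-merging challenge, one common pattern (a sketch, not necessarily the team's approach) is to make the daily load idempotent with MERGE and to stamp each relationship with the batch it was last seen in, so stale edges can be pruned afterwards:

```python
# Idempotent daily merge sketch: MERGE avoids duplicates on re-runs, and a
# last_seen stamp lets a follow-up pass remove edges absent from the new batch.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))


def upsert_edge(src_uuid: str, dst_uuid: str, batch_date: str) -> None:
    """Create or refresh an edge and record the batch it was last seen in."""
    graph.run(
        """
        MERGE (a:Resource {uuid: $src})
        MERGE (b:Resource {uuid: $dst})
        MERGE (a)-[r:RELATES_TO]->(b)
        SET r.last_seen = $batch_date
        """,
        src=src_uuid, dst=dst_uuid, batch_date=batch_date,
    )


def prune_stale_edges(batch_date: str) -> None:
    """Drop relationships that did not appear in the latest batch."""
    graph.run(
        "MATCH ()-[r:RELATES_TO]->() WHERE r.last_seen < $batch_date DELETE r",
        batch_date=batch_date,
    )
```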
126-128. The future
- New resource types: A/B tests, logging schemas, SQL queries, etc.
- Certified content: use certification to build trust and enable users to filter through a sea of stale content
- Alerts & recommendations: move from active exploration to delivering relevant updates and content suggestions
- Game-ification: provide content producers with a sense of value
130-131. The Dataportal team (Analytics & Experimentation Products)
- John Bodley, Software Engineer
- Eli Brumbaugh, Experience Designer
- Jeff Feng, Product Manager
- Michelle Thomas, Software Engineer
- Chris Williams, Data Visualization