
2017-01-08-scaling tribalknowledge

The Airbnb Dataportal is an internal data resource search engine.

Published in: Data & Analytics

  1. 1. Scaling Tribal Knowledge CHRIS WILLIAMS / JOHN BODLEY / FEB 8, 2017 / BIG DATA APPLICATION MEETUP
  2. 2. The problem
  3. 3. tribal knowledge |ˈtrībəl ˈnäləj| noun: any unwritten information that is not commonly known by others within a company
  4. 4. As Airbnb grows, so do the challenges around the volume, complexity, and obscurity of data
  5. 5. In a large and complex organization, with a sea of data resources, users struggle to find the right data
  6. 6. Data is often siloed and lacks context
  7. 7. I’m a recovering Data Scientist who wants to democratize data; automate common workflows, surface relevant information, and provide context
  8. 8. 100k tables in our Hive data warehouse
  9. 9. Data resources Beyond the data warehouse
  13. 13. Data resources beyond the data warehouse: > 6,000 Superset charts and dashboards; > 5,000 experiments and metrics; > 4,000 Tableau dashboards and workbooks; > 1,000 Knowledge posts
  14. 14. With many more data sources and data types to love
  16. 16. and most importantly…
  17. 17. > 3,000 Airbnb employees
  18. 18. > 20 offices around the world: Portland; San Francisco; Los Angeles; Toronto; New York; Miami; Sao Paulo; Dublin; London; Paris; Barcelona; Berlin; Milan; Copenhagen; New Delhi; Seoul; Beijing; Tokyo; Sydney; Singapore; Washington, DC
  19. 19. The mandate
  20. 20. To democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust
  21. 21. The concept
  22. 22. Search…
  23. 23. It should be fairly evident what we feed into the search indices
  24. 24. But are we missing something?
  25. 25. The relevancy of relationships: nodes and relationships have equal standing (e.g. created, consumed)
  26. 26. The graph: nodes connected by created, consumed, and associated relationships
  33. 33. The construction
  34. 34. 5 databases, 3 APIs, 1 Airflow DAG
  35. 35. We leverage all these data resources to build a graph comprising nodes and relationships. The Airflow DAG runs every day and its output is stored in Hive.
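As a rough illustration of slide 35, the sketch below turns flat source records into a set of nodes and a set of typed relationships. The record fields (`actor`, `action`, `resource`) and the sample rows are hypothetical and not taken from the talk; the real pipeline runs as an Airflow DAG and writes to Hive.

```python
from collections import namedtuple

# Hypothetical record shape; the talk does not describe the real source schemas.
Event = namedtuple("Event", ["actor", "action", "resource"])


def build_graph(events):
    """Collapse raw events into a set of nodes and a set of typed relationships."""
    nodes = set()
    relationships = set()
    for event in events:
        nodes.add(event.actor)
        nodes.add(event.resource)
        # Relationship type is upper-cased, e.g. "created" becomes "CREATED".
        relationships.add((event.actor, event.action.upper(), event.resource))
    return nodes, relationships


events = [
    Event("jane_doe", "created", "core_data.dim_users"),
    Event("john_roe", "consumed", "core_data.dim_users"),
    Event("john_roe", "consumed", "core_data.dim_users"),  # duplicate collapses
]
nodes, relationships = build_graph(events)
```

Using sets means repeated observations of the same relationship collapse into a single edge, which is one simple way to keep a daily rebuild idempotent.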
  36. 36. We gather over 10,000 thumbnails from the Tableau API, Knowledge Repo database, and Superset screenshots
  37. 37. The winding data path: Airflow (data transfer); Python (graph datastore); py2neo (Python Neo4j driver); Neo4j (graph database); GraphAware (Neo4j/Elasticsearch plugin); Elasticsearch (search engine); Flask (Python web framework); Hive (data warehouse)
  56. 56. Why we chose Neo4j for our database (the main reasons). Logical: given our data is represented as a graph, it is logical to use a graph database to store it. Nimble: performance wins when dealing with connected data versus relational databases. Popular: it is the world’s leading graph database and the community edition is free. Integrative: it integrates well with Python and Elasticsearch.
  59. 59. The Neo4j and Elasticsearch symbiotic relationship, courtesy of two GraphAware plugins. Neo4j plugin: provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch. Elasticsearch plugin: enables Elasticsearch to consult the Neo4j database during a search query to enrich the search rankings by leveraging the graph topology.
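One way to picture "enriching the search rankings by leveraging the graph topology" is a re-ranking step that blends the text relevance score with a graph-derived signal. The sketch below is not the GraphAware plugin's API; the `consumer_counts` signal, the logarithmic damping, and the `weight` parameter are assumptions chosen purely for illustration.

```python
import math


def rerank(hits, consumer_counts, weight=0.5):
    """Re-order (doc_id, text_score) hits using a graph-derived popularity signal.

    consumer_counts maps doc_id to the number of incoming CONSUMED relationships
    (a hypothetical signal; the talk does not say which topology features are used).
    """
    def combined(hit):
        doc_id, text_score = hit
        # log1p damps very popular resources so they do not swamp text relevance.
        return text_score + weight * math.log1p(consumer_counts.get(doc_id, 0))

    return sorted(hits, key=combined, reverse=True)


hits = [("table_a", 2.0), ("table_b", 1.8)]
consumer_counts = {"table_a": 0, "table_b": 50}
ranked = rerank(hits, consumer_counts)
```

Here `table_b` has a slightly lower text score but many consumers, so it moves ahead of `table_a`; with no graph signal the original text ordering is preserved.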
  60. 60. The schema
  61. 61. s -[created]-> t
  62. 62. (s:Entity)-[r:CREATED]->(t:Entity)
  63. 63. Node label hierarchy: :Entity > :Org > (:Group, :User); :Entity > :Superset > (:Slice, :Dashboard); :Entity > :Hive > (:Schema, :Table)
  64. 64. (:Entity:Org:User {id: 'jane_doe'})
  65. 65. (:Entity:Hive:Table {id: 'core_data.dim_users'})
  66. 66. (:Entity:Superset:Dashboard {id: 123})
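The label hierarchy on slides 63 to 66 can be written down as a small lookup that expands a leaf label into its full label path. This is only a sketch: the `PARENT` mapping mirrors the slide, while the function names and the use of a `$id` query parameter are illustrative assumptions.

```python
# Parent of each label, following the hierarchy shown on the slide.
PARENT = {
    "Org": "Entity", "Superset": "Entity", "Hive": "Entity",
    "Group": "Org", "User": "Org",
    "Slice": "Superset", "Dashboard": "Superset",
    "Schema": "Hive", "Table": "Hive",
}


def label_path(leaf):
    """Return the labels from Entity down to the leaf, e.g. Entity, Hive, Table."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def node_pattern(leaf, var="n"):
    """Render a Cypher node pattern; the id is supplied as a query parameter."""
    labels = "".join(":" + label for label in label_path(leaf))
    return "({}{} {{id: $id}})".format(var, labels)


pattern = node_pattern("Table")  # "(n:Entity:Hive:Table {id: $id})"
```

Passing the identifier as a parameter rather than interpolating it into the string avoids quoting problems with ids such as `core_data.dim_users`.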
  70. 70. Efficient data retrieval and uniqueness: restrictions and workarounds with the Neo4j schema. Indexes: Neo4j provides indexes for efficient data retrieval, similar to an RDBMS; however, they are only defined for a single label. Uniqueness constraints: ensure that property values are unique across all nodes with a specific single label. GraphAware UUID plugin: transparently assigns a globally unique UUID property to newly created elements, which cannot be changed or deleted.
  71. 71. (:Entity {uuid: '<UUID>'})
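Because indexes and uniqueness constraints apply to a single label, giving every node the shared `:Entity` label plus a globally unique `uuid` lets one lookup cover all resource types. The in-memory class below only mimics that idea in plain Python; it is not the GraphAware UUID plugin, and `EntityStore` and its methods are hypothetical names.

```python
import uuid


class EntityStore:
    """Toy store with a single uuid-keyed index shared by all node types."""

    def __init__(self):
        self._by_uuid = {}

    def create(self, labels, properties):
        # The uuid is assigned at creation time; a caller-supplied uuid is ignored.
        node = {"labels": ("Entity",) + tuple(labels), "uuid": str(uuid.uuid4())}
        node.update({k: v for k, v in properties.items() if k != "uuid"})
        self._by_uuid[node["uuid"]] = node
        return node["uuid"]

    def set_property(self, node_uuid, key, value):
        # Mirrors the "cannot be changed or deleted" rule from the slide.
        if key == "uuid":
            raise ValueError("uuid cannot be changed or deleted")
        self._by_uuid[node_uuid][key] = value

    def get(self, node_uuid):
        return self._by_uuid[node_uuid]


store = EntityStore()
table_uuid = store.create(["Hive", "Table"], {"id": "core_data.dim_users"})
user_uuid = store.create(["Org", "User"], {"id": "jane_doe"})
```

A table and a user live under different label paths, yet both are retrievable through the same `uuid` key, which is the property the single-label workaround relies on.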
  72. 72. The web app
  74. 74. Designing the user experience and interface of a data tool should not be an afterthought
  76. 76. User personas. Daphne Data: technical data power user; the epitome of a tribal knowledge holder. Manager Mel: less data literate; needs to keep tabs on her team’s resources. Nathan New: new employee or new team; has no idea what’s going on.
  77. 77. Designing for data exploration, discovery, and trust: Search; Resource details & meta-data; User data; Group data; Company data
  83. 83. Google-esque search filters; resource details & meta-data; context, context, & context
  87. 87. Description, external link, social; meta-data & consumption; surface relationships, everything’s a link to promote exploration
  88. 88. Column details & value distributions; table lineage; enrich meta-data on the fly
  95. 95. User details & meta-data; what they make, what they consume
  96. 96. Former employees also hold tribal knowledge
  100. 100. Group overview; Pinterest-like curation; basic organization functionality
  102. 102. Curated + popular content; thumbnails for maximum context
  103. 103. Pinning flow from resource page; edit mode / draggable grid
  105. 105. Employees can feel disconnected from company-level metrics
  106. 106. The technology stack (application + dependencies): DOM; testing (eslint, enzyme, mocha, chai); application state; styling (khan/aphrodite)
  107. 107. The challenges
  112. 112. The challenges. Complex dependencies: an umbrella data tool is vulnerable to changes in upstream resource dependencies. Data-dense design: balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps. Graph flickering: transient relationships should not create “flickering” artifacts. Graph merging: non-trivial Git-like merging of (daily or real-time) graph updates.
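One possible way to avoid flickering when merging daily snapshots is to drop a relationship only after it has been missing for several consecutive runs. The sketch below illustrates that idea; the grace period, the function name, and the data shapes are assumptions, not the approach the Dataportal team describes.

```python
def merge_snapshot(current, snapshot, grace_runs=2):
    """Merge a new snapshot of relationships into the current graph state.

    current maps each relationship to the number of consecutive runs it has
    been missing; snapshot is the set of relationships seen in the latest run.
    """
    merged = {}
    for rel in snapshot:
        merged[rel] = 0  # seen in this run, so reset the missing counter
    for rel, missed in current.items():
        if rel in snapshot:
            continue
        if missed + 1 <= grace_runs:
            merged[rel] = missed + 1  # keep it for now to avoid flicker
        # otherwise it has been absent too long and is dropped
    return merged


rel = ("jane_doe", "CONSUMED", "core_data.dim_users")
state = merge_snapshot({}, {rel})
state = merge_snapshot(state, set())  # missing once: kept
state = merge_snapshot(state, set())  # missing twice: still kept
kept_after_two = rel in state
state = merge_snapshot(state, set())  # missing three times: dropped
```

The trade-off is staleness: a relationship that has really gone away lingers for `grace_runs` runs before it disappears.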
  113. 113. The future
  118. 118. The future. New resource types: A/B tests, logging schemas, SQL queries, etc. Certified content: use certification to build trust and enable users to filter through a sea of stale content. Gamification: provide content producers with a sense of value. Alerts & recommendations: move from active exploration to delivering relevant updates and content suggestions.
  119. 119. The team
  120. 120. The Dataportal team (Analytics & Experimentation Products): John Bodley, Software Engineer; Eli Brumbaugh, Experience Designer; Jeff Feng, Product Manager; Michelle Thomas, Software Engineer; Chris Williams, Data Visualization
  122. 122. Thank you john.bodley@airbnb.com chris.williams@airbnb.com
