The document discusses scaling tribal knowledge at Airbnb by building a graph database of the company's data resources. It describes collecting metadata on thousands of charts, dashboards, experiments, and other data assets from various systems into a Neo4j graph database using Airflow, then indexing the graph in Elasticsearch for fast search. This allows employees to explore relationships between data resources and find what they need.
Relevant Search Leveraging Knowledge Graphs with Neo4j (GraphAware)
This talk presents Neo4j as a viable tool in a relevant search ecosystem, demonstrating that it offers not only a suitable model for representing complex data such as text, user models, business goals, and context information, but also efficient ways of navigating this data in real time. Moreover, at an early stage in the "search improvement process", Neo4j can help relevance engineers identify the salient features describing the content, the user, or the search query; these features can later be fed to the search engine through extraction and enrichment.
The talk also demonstrates how the graph model can support all the components of relevant search, and concludes with a complete end-to-end infrastructure for providing relevant search in a real use case, showing how it integrates with other tools such as Elasticsearch, Apache Kafka, Stanford NLP, OpenNLP, and Apache Spark.
Machine Learning Powered by Graphs - Alessandro Negro (GraphAware)
Graph-based machine learning is becoming a very important trend in Artificial Intelligence, transcending many other techniques. The world's largest companies are promoting this trend. For instance, Google's Expander platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data, with nodes corresponding to objects or concepts and edges connecting concepts that share similarities.
Using graphs as the basic representation of data for machine learning has several advantages: (i) the data is already modelled for further analysis, explicitly representing connections and relationships between things and concepts; (ii) graphs can easily combine multiple sources into a single graph representation and learn over them, creating Knowledge Graphs; (iii) many machine learning algorithms exploit graphs to improve computational performance and result quality.
The presentation illustrates these advantages with applications such as recommendation engines and natural language processing that use machine learning over a graph. Concrete scenarios, models, and an end-to-end infrastructure are discussed.
Graph-Powered Machine Learning - Meetup Paris - March 5, 2018
Graph-based machine learning is becoming an important trend in artificial intelligence, transcending many other techniques. Using graphs as a basic representation of data has multiple advantages:
- the data is already modeled for further analysis;
- graphs can easily combine multiple sources into a single graph representation and learn over them, creating Knowledge Graphs;
- graphs improve computation performance and result quality.
The talk will present these advantages along with applications in the context of recommendation engines and natural language processing.
Speaker: Dr. Vlasta Kus (@VlastaKus) is a Data Scientist at GraphAware, specializing in graph-based Natural Language Processing and related topics, including deep learning techniques. He speaks English, Czech and some French and currently lives in Prague.
How Boston Scientific Improves Manufacturing Quality Using Graph Analytics (GraphAware)
Tracking end-of-line manufacturing issues to their source can be a daunting task. Boston Scientific, in partnership with GraphAware, has used the Neo4j platform to build a manufacturing quality tool that offers dramatic improvements to the time, quality, and quantity of investigations. In this talk we will review a manufacturing value stream in a graph and discuss the analysis methods available, which result in striking increases in business efficiency for this unique application. We will also present how the system was implemented within the existing data architecture and then scaled from a laptop investigational tool to an enterprise-grade solution with Neo4j Server.
*Talk at GraphConnect NYC 2018*
The Business Case for Semantic Web Ontology & Knowledge Graph (Cambridge Semantics)
In this webinar, Mark Wallace, Ontologist & Developer at Semantic Arts, and Thomas Cook, Director of Sales for AnzoGraph DB at Cambridge Semantics, explore the benefits of building a Semantic Knowledge Graph with RDF*, wrapping up with an airline data demo that illustrates the value of schema, inference, and reasoning.
How Graph Databases efficiently store, manage and query connected data at s... (jexp)
Graph databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query, and consistency approaches that are used behind the scenes. We'll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing, and more, and see which papers and research database developers take inspiration from.
GraphDB Cloud: Enterprise Ready RDF Database on Demand (Ontotext)
GraphDB Cloud is an enterprise-grade RDF graph database providing high-performance querying over large volumes of RDF data. In this webinar, Ontotext demonstrates how to instantly create and deploy a fully managed graph database, then import and query data with the (OpenRDF) GraphDB Workbench, and finally explore and visualize data with the built-in visualization tools.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (Neo4j)
Neo4j-Databridge is a fully-featured ETL tool specifically built for Neo4j, and designed for usability, expressive power and high performance. It has been created to help solve the most common problems faced by large enterprises when importing data into Neo4j - data locality, multiple data sources and formats, performance when loading very large data sets, bespoke data conversions, inclusion of non-tabular data, filtering, merging and de-duplication...
In this webinar, we'll take a quick tour of the main features of Neo4j-Databridge and see how it can help solve these problems and make importing your data into Neo4j quick and easy.
Our data is getting not just more complex but also more connected. In order not to lose sight of this web of information, but instead to use it as a source of new insights and opportunities, technologies such as graph databases can help.
For both analytical and transactional use cases, they allow efficient storage, retrieval, and processing of networked data without loss of detail. In this talk, we want to get to know existing tools and techniques for graph data processing.
Should a Graph Database Be in Your Next Data Warehouse Stack? (Cambridge Semantics)
In this webinar, AnzoGraph’s graph database guru Barry Zane (former co-founder of Netezza) and data governance author Steve Sarsfield talk about how graph databases fit into the data warehouse modernization trend. They also explore how certain workloads can be better served with an analytical graph database and how today’s technology stacks offer new paradigms for deployment like the cloud, containers and graph analytics.
When it comes to dealing with large, complex, and disparate data sets, traditional database technologies are unable to keep pace with the rich analytics necessary to power today's data-driven applications. Graph analytics databases are becoming the underlying infrastructure for AI and machine learning. These databases allow users to ask complex questions across complex data, which is not always practical or even possible at scale using other approaches. They also enable faster insights against massive data sets when combined with pattern recognition, statistical analysis, and AI/machine learning. And in the case of standards-based graph databases, they connect with popular visualization tools like Graphileon, allowing users to easily explore their data stores and quickly build compelling graph-based applications.
Relational databases were conceived to digitize paper forms and automate well-structured business processes, and still have their uses. But, oftentimes with RDBMS, performance degrades with the increasing number and levels of data relationships and data size.
A graph database like Neo4j naturally stores, manages, analyzes, and uses data within the context of connections, meaning Neo4j provides faster query performance and vastly improved flexibility in handling complex hierarchies compared to SQL.
This webinar explains why companies are shifting away from RDBMS towards graphs to unlock the business value in their data relationships.
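To make the contrast concrete, here is a minimal sketch, not taken from the webinar, of a multi-hop query in Cypher issued through py2neo; the Employee/MANAGES model, server address, and credentials are hypothetical, and the equivalent SQL self-join chain appears only in the comments.

```python
# Hypothetical example: find every report under a manager, at any depth.
# In SQL this needs one self-join per level of the hierarchy:
#   SELECT e2.name FROM employee e1
#   JOIN employee e2 ON e2.manager_id = e1.id
#   JOIN employee e3 ON e3.manager_id = e2.id ...   -- one join per level
# In Cypher, depth is part of the pattern, not the query shape.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

rows = graph.run(
    """
    MATCH (boss:Employee {name: $name})-[:MANAGES*1..]->(report:Employee)
    RETURN DISTINCT report.name AS name
    """,
    name="Ada",
).data()

for row in rows:
    print(row["name"])
```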
Graphs and Machine Learning have long been a focus for Franz Inc. and currently we are collaborating with a number of companies to deliver the ability to understand possible future events based on a company's internal as well as externally available data. By combining machine learning, semantic technologies, big data, graph databases and dynamic visualizations we will discuss the core components of a Cognitive Computing platform.
We discuss example Cognitive Computing platforms from Ecommerce, fraud detection and healthcare that combine structured/unstructured data, knowledge, linked open data, predictive analytics, and machine learning to enhance corporate decision making.
What you need to know to start an AI company (Mo Patel)
An overview of why AI and Deep Learning are hot now, an overview of Machine Intelligence startups, the key ingredients for an AI startup, and how AI startups can compete with big tech companies, including areas to focus on for differentiation.
This webinar focuses on the use case of graph databases in Network & IT Management. It is designed for people who work in network management at telecom companies, or professionals in industries that handle and rely on complex networks.
We'll start with an overview of Neo4j and graph thinking within networks, explaining how networks are naturally modelled as graphs. We'll explain how graph databases help mitigate some of the major challenges network and security managers face on a daily basis, including intrusions and other cyber crimes, performance optimization, outage simulations, fraud prevention, and more.
While the Rio 2016 Olympics are winding down and the final medals are being handed out, we thought we would share a bit of work that was done recently by Rik Van Bruggen to explore a really interesting dataset in Neo4j.
Based on an original public dataset by the UK newspaper The Guardian, Rik completed the medallist dataset to contain over 30,000 Olympians between 1896 and 2012. He created a graph model, loaded the data, and wrote a bunch of example queries that yielded some very interesting results. Join us for this 30 minute webinar where we’ll take you through this great Olympian graph and take the data for a spin yourself afterwards.
This talk traces the trajectory schema.org has taken, starting with a history that is less a retrospective than a narrative. I'll follow this narrative to the fortunately timed emergence of JSON-LD, which provides a flexible, standards-based serialization of the vocabulary.
This, I'll explain, helped fuel the popularity of schema.org, which in turn has caused a demand for more schemas, growing the vocabulary and its capabilities. I'll make the case that schema.org has started to resemble exactly what everyone involved in the initiative declared it shouldn't be: an ontology of everything.
Whether or not that be the case, I'll say, the utility of having a relatively simple, well thought-out, well-understood and very broad vocabulary available has made schema.org (along with JSON-LD) a go-to tool for linked data modelers.
Finally, with a look at the many ways Google, in particular, has made use of schema.org, I'll explore to what extent its utility extends past being a convenient starting point for back-of-the-napkin knowledge graph development, or whether it's making a significant contribution to realizing the promise of a web of data.
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea... (Impetus Technologies)
Traditional databases and batch ETL operations have not been able to serve the growing data volumes and the need for fast and continuous data processing.
How can modern enterprises provide their business users real-time access to the most up-to-date and complete data?
In our upcoming webinar, our experts will talk about how real-time CDC improves data availability and fast data processing through incremental updates in the big data lake, without modifying or slowing down source systems. Join this session to learn:
- What CDC is and how it impacts business
- The various methods for CDC in the enterprise data warehouse
- The key factors to consider while building a next-gen CDC architecture:
  - batch vs. real-time approaches
  - moving from just capturing and storing to capturing, enriching, transforming, and storing
  - avoiding stopgap silos to state-through processing
- Implementation of CDC through a live demo and use case
You can view the webinar here - https://www.streamanalytix.com/webinar/planning-your-next-gen-change-data-capture-cdc-architecture-in-2019/
For more information visit - https://www.streamanalytix.com
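As a toy illustration of the capture-and-apply idea described above (and not of StreamAnalytix itself), the sketch below applies a stream of hypothetical change events incrementally to an in-memory table, mirroring the upsert/delete logic a CDC pipeline performs against a data lake target.

```python
# A toy CDC sketch: apply insert/update/delete events incrementally to a
# target "table" keyed by primary key, instead of reloading it in batch.
# The event shape (op, key, data) is a hypothetical simplification.
from typing import Dict, List


def apply_cdc_events(table: Dict[str, dict], events: List[dict]) -> Dict[str, dict]:
    """Apply change events to an in-memory table keyed by primary key."""
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            # Upsert: enrichment/transformation could happen here before storing
            table[key] = {**table.get(key, {}), **event["data"]}
        elif event["op"] == "delete":
            table.pop(key, None)
    return table


if __name__ == "__main__":
    target = {"42": {"name": "old name", "city": "Rennes"}}
    changes = [
        {"op": "update", "key": "42", "data": {"name": "new name"}},
        {"op": "insert", "key": "43", "data": {"name": "fresh row"}},
        {"op": "delete", "key": "42", "data": {}},
    ]
    print(apply_cdc_events(target, changes))  # only key "43" remains
```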
The Data World Distilled
Understanding how the data world works in the Big Data era
I created this slide deck as a learning tool for new employees; I figured I would post it in case it can help others understand the data space.
This slide deck covers:
- Big Data
- Data Warehouses
- ETL/Data Integration
- Business Intelligence and Analytics
- Data Quality
- Data Testing
- Data Governance
It provides a brief description along with key vendors in the space.
Data Discovery at Databricks with Amundsen (Databricks)
Databricks used to rely on a static, manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity and trust by programmatically surfacing the most relevant datasets and SQL analytics dashboards, along with their important information, internally at Databricks.
We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to surface metadata, including:
- Surface the most popular tables used within Databricks
- Support fuzzy search and facet search for datasets
- Surface rich metadata on datasets:
  - lineage information (downstream tables, upstream tables, downstream jobs, downstream users)
  - dataset owner
  - dataset frequent users
  - Delta extended metadata (e.g. change history)
  - the ETL job that generates the dataset
  - column stats on numeric-type columns
  - dashboards that use the given dataset
  - use of the Databricks data tab to show sample data
- Surface metadata on dashboards, including create time, last update time, tables used, etc.
Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.
Apache CarbonData+Spark to realize data convergence and Unified high performa... (Tech Triveni)
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
Tapping into Scientific Data with Hadoop and Flink (Michael Häusler)
At ResearchGate, we constantly analyze scientific data to connect the world of science and make research open to all. It can be tricky to set up a process to continuously deliver improved versions of algorithms that tap into more than 100 million publications and corresponding bibliographic metadata. In this talk, we illustrate some (big) data engineering challenges of running data pipelines and incorporating results into the live databases that power our user-facing features every day. We show how Apache Flink helps us to improve performance, robustness, ease of maintenance - and most importantly - have more fun while building big data pipelines.
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
In this webinar we discuss the primary use cases for Graph Databases and explore the properties of Neo4j that make those use cases possible.
We cover the high-level steps of modeling, importing, and querying your data using Cypher and give an overview of the transition from RDBMS to Graph.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially those in a rapidly evolving organization.
OVH-Change Data Capture in production with Apache Flink - Meetup Rennes 2019-... (Yann Pauly)
Drive the business with your KPIs: that is what we aimed to do at OVH. As an 18-year-old company and a fairly large cloud provider, we encountered several issues during this long journey to set up change data capture and a data-driven culture.
Getting data from thousands of tables into one place and keeping it all up to date was not possible without a strong streaming engine like Apache Flink.
We will present our current production pipeline with its pros and cons: from data collection made directly from the binary logs of the databases, to continuous writing into Apache Hive in a Kerberized, cloud-based Apache Hadoop cluster. We will describe how we handle schema transcription, event lifecycles, stream partitioning, and ordering of events using watermarks and window aggregation, all in a transactional way, up to data availability on the user side.
Finally, we will introduce our cloud-only production infrastructure, its operation, and its monitoring.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home ("WFH"), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
12-14. Data resources beyond the data warehouse
- > 6,000 Superset charts and dashboards
- > 5,000 experiments and metrics
- > 4,000 Tableau dashboards and workbooks
- > 1,000 Knowledge posts
20. > 20 offices around the world
Portland, San Francisco, Los Angeles, Toronto, New York, Miami, Sao Paulo, Dublin, London, Paris, Barcelona, Berlin, Milan, Copenhagen, New Delhi, Seoul, Beijing, Tokyo, Sydney, Singapore, Washington, DC
41. 5 databases, 3 APIs, 1 Airflow DAG
We leverage all of these data resources to build a graph comprising nodes and relationships. The Airflow DAG runs every day and the output is stored in Hive.
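A minimal sketch of what such a daily DAG could look like, using Airflow 2-style imports; the DAG id, task names, and extraction logic are hypothetical placeholders rather than Airbnb's actual pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metadata(**context):
    """Placeholder: pull charts, dashboards, tables, users, etc. from the
    source databases and APIs and emit node/relationship rows."""
    ...


def write_to_hive(**context):
    """Placeholder: write the node and relationship tables to the Hive warehouse."""
    ...


with DAG(
    dag_id="dataportal_graph_build",      # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",           # the DAG runs every day
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    load = PythonOperator(task_id="write_to_hive", python_callable=write_to_hive)
    extract >> load
```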
43. We gather over 10,000 thumbnails from the Tableau API, Knowledge Repo database, and Superset screenshots.
44. The winding data path
- Airflow: data transfer
- Python: graph datastore
- py2neo: Python Neo4j driver
- Neo4j: graph database
- GraphAware: Neo4j/Elasticsearch plugin
- Elasticsearch: search engine
- Flask: Python web framework
- Hive: data warehouse
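As a rough sketch of the Hive-to-Neo4j step, assuming the graph tables have been exported to CSV files with hypothetical names and columns, the nodes and relationships could be pushed into Neo4j with py2neo like this:

```python
import csv

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# nodes.csv columns (hypothetical): uuid, label, name
with open("nodes.csv", newline="") as f:
    for row in csv.DictReader(f):
        graph.run(
            "MERGE (n:Resource {uuid: $uuid}) SET n.name = $name, n.kind = $label",
            uuid=row["uuid"], name=row["name"], label=row["label"],
        )

# relationships.csv columns (hypothetical): src_uuid, dst_uuid
with open("relationships.csv", newline="") as f:
    for row in csv.DictReader(f):
        graph.run(
            "MATCH (a:Resource {uuid: $src}), (b:Resource {uuid: $dst}) "
            "MERGE (a)-[:RELATES_TO]->(b)",
            src=row["src_uuid"], dst=row["dst_uuid"],
        )
```

In practice each resource type would likely get its own label and relationship type; a single generic label is used here only to keep the sketch short.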
59-63. Why we choose Neo4j for our database: the main reasons
- Logical: given our data is represented as a graph, it is logical to use a graph database to store it
- Nimble: performance wins when dealing with connected data versus relational databases
- Popular: it is the world's leading graph database and the community edition is free
- Integrative: it integrates well with Python and Elasticsearch
64-66. The Neo4j and Elasticsearch symbiotic relationship, courtesy of two GraphAware plugins
- Neo4j plugin: provides bi-directional integration that transparently and asynchronously replicates data from Neo4j to Elasticsearch
- Elasticsearch plugin: enables Elasticsearch to consult the Neo4j database during a search query and enrich the search rankings by leveraging the graph topology
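The snippet below is a conceptual sketch of that symbiosis rather than the GraphAware plugin API: it searches Elasticsearch first and then consults Neo4j for graph context on each hit; the index name, field names, and connection details are hypothetical.

```python
# Conceptual sketch only -- not the GraphAware plugin API.
# Search Elasticsearch, then enrich each hit with graph topology from Neo4j.
from elasticsearch import Elasticsearch
from py2neo import Graph

es = Elasticsearch("http://localhost:9200")
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

hits = es.search(index="dataportal", query={"match": {"name": "bookings"}})["hits"]["hits"]

for hit in hits:
    uuid = hit["_source"]["uuid"]
    # How many resources link to this hit; a signal that could be folded
    # back into the ranking score.
    degree = graph.run(
        "MATCH (n:Resource {uuid: $uuid})<--(m) RETURN count(m) AS degree",
        uuid=uuid,
    ).evaluate()
    print(hit["_source"]["name"], hit["_score"], degree)
```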
78-80. Efficient data retrieval and uniqueness: restrictions and workarounds with the Neo4j schema
- Indexes: Neo4j provides indexes for efficient data retrieval, similar to an RDBMS; however, they are only defined for a single label
- Uniqueness constraints: ensure that properties are unique across all nodes with a specific single label
- GraphAware UUID plugin: transparently assigns a globally unique UUID property to newly created elements, which cannot be changed or deleted
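A minimal sketch of this schema setup using py2neo and Neo4j 3.x-era Cypher; the label set is hypothetical, and the point is simply that each index and constraint must be declared once per label.

```python
# Declare per-label indexes and uniqueness constraints (Neo4j 3.x syntax).
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

labels = ["Table", "Dashboard", "Chart", "User"]  # hypothetical label set
for label in labels:
    # Index for fast lookup by name within one label
    graph.run(f"CREATE INDEX ON :{label}(name)")
    # Uniqueness constraint on the uuid property, per label
    graph.run(f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.uuid IS UNIQUE")
```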
84-85. Designing the user experience and interface of a data tool should not be an afterthought
86. User personas
- Daphne Data: technical data power user; the epitome of a tribal knowledge holder
- Manager Mel: less data literate; needs to keep tabs on her team's resources
- Nathan New: new employee or new team; has no idea what's going on
87. Designing for data exploration, discovery, and trust
Search | Resource details & meta-data | Company data | User data | Group data
97. Search | Resource details & meta-data | Company data | User data | Group data
- Surface relationships; everything is a link, to promote exploration
- Meta-data & consumption: description, external link, social
98-99. Search | Resource details & meta-data | Company data | User data | Group data
- Column details & value distributions
- Table lineage
- Enrich meta-data on the fly
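A minimal sketch of how a Flask endpoint behind such a resource page might assemble details and neighbours from Neo4j on the fly; the route, labels, and property names are hypothetical.

```python
# Hypothetical Flask endpoint: return a resource's metadata plus its graph
# neighbours so the UI can render "everything as a link".
from flask import Flask, jsonify
from py2neo import Graph

app = Flask(__name__)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))


@app.route("/api/resource/<uuid>")
def resource_details(uuid):
    records = graph.run(
        """
        MATCH (n:Resource {uuid: $uuid})
        OPTIONAL MATCH (n)-[r]-(m:Resource)
        RETURN n.name AS name, n.description AS description,
               collect({name: m.name, uuid: m.uuid, rel: type(r)}) AS neighbours
        """,
        uuid=uuid,
    ).data()
    if not records:
        return jsonify({"error": "not found"}), 404
    return jsonify(records[0])


if __name__ == "__main__":
    app.run(debug=True)
```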
120-122. The challenges
- Complex dependencies: an umbrella data tool is vulnerable to changes in upstream resource dependencies
- Data-dense design: balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps
- Graph merging: non-trivial, Git-like merging of (daily or real-time) graph updates
- Graph flickering: transient relationships should not create "flickering" artifacts
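For the graph-merging challenge, one common pattern (a sketch, not necessarily the team's approach) is to make the daily load idempotent with MERGE and to stamp each relationship with the batch it was last seen in, so stale edges can be pruned afterwards:

```python
# Idempotent daily merge sketch: MERGE avoids duplicates on re-runs, and a
# last_seen stamp lets a follow-up pass remove edges absent from the new batch.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))


def upsert_edge(src_uuid: str, dst_uuid: str, batch_date: str) -> None:
    """Create or refresh an edge and record the batch it was last seen in."""
    graph.run(
        """
        MERGE (a:Resource {uuid: $src})
        MERGE (b:Resource {uuid: $dst})
        MERGE (a)-[r:RELATES_TO]->(b)
        SET r.last_seen = $batch_date
        """,
        src=src_uuid, dst=dst_uuid, batch_date=batch_date,
    )


def prune_stale_edges(batch_date: str) -> None:
    """Drop relationships that did not appear in the latest batch."""
    graph.run(
        "MATCH ()-[r:RELATES_TO]->() WHERE r.last_seen < $batch_date DELETE r",
        batch_date=batch_date,
    )
```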
126-128. The future
- New resource types: A/B tests, logging schemas, SQL queries, etc.
- Certified content: use certification to build trust and enable users to filter through a sea of stale content
- Alerts & recommendations: move from active exploration to delivering relevant updates and content suggestions
- Game-ification: provide content producers with a sense of value
130-131. The Dataportal team (Analytics & Experimentation Products)
- John Bodley, Software Engineer
- Eli Brumbaugh, Experience Designer
- Jeff Feng, Product Manager
- Michelle Thomas, Software Engineer
- Chris Williams, Data Visualization