This document provides an overview of MetaQL, which allows composing queries across NoSQL, SQL, SPARQL, and Spark databases using a domain model. Key points include:
- MetaQL uses a domain model to define concepts and compose typed queries in code that can execute across different databases.
- This separates concerns and improves developer efficiency over managing schemas and databases separately.
- Examples demonstrate MetaQL queries in graph, path, select, and aggregation formats across SQL, NoSQL, and RDF implementations.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have more recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed: a lot of manual "data wrangling" is still needed to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in the data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. It covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
Optimizing the Data Supply Chain for Data Science – Vital.AI
As we move from the Data Warehouse to the Data Supply Chain, we open our perspective to include the full life cycle of data, from raw material to data product.
To produce data products with the most value, in an efficient and cost-effective manner, quality control processes must be put in place at each link in the chain, driven by the requirements of data scientists. With such processes in place, the burden on data scientists to cleanse data – typically 80% of their effort – can be greatly reduced.
Data Models – including schema, metadata, rules, and provenance – play a crucial role in ensuring an effective Data Supply Chain.
Each Data Supply Chain link must be defined with firm boundaries and clear lines of team responsibility, with Data Models providing the natural borders.
In this talk we will discuss the processes that must be put into place at each link in the Data Supply Chain including perspectives on:
* The definition of Data Supply Chain vs. Data Warehouse
* Tools to create, manage, utilize, and share Data Models
* Tracking Data Provenance
* ETL processes, driven by Data Models
* Collaborative processes across Data Science teams
* Visualization of Data and Data Flow across the Data Supply Chain
* Apache Hadoop and Apache Spark as enabling technologies
* Data Science
* Cross-Organizational Collaboration
* Security
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment – Paris Sud University
Today, we are experiencing an unprecedented production of resources, published as Linked Open Data (LOD, for short). This is leading to the creation of knowledge graphs (KGs) containing billions of RDF (Resource Description Framework) triples, such as DBpedia, YAGO and Wikidata on the academic side, and the Google Knowledge Graph or Microsoft’s Satori graph on the commercial side. These KGs contain millions of entities (such as people, proteins, or books), and millions of facts about them. This knowledge is typically expressed as RDF triples of the form ⟨Macron, presidentOf, France⟩. Some KGs provide an ontology expressed in OWL2 (Web Ontology Language), which describes the vocabulary (the classes and properties) for the RDF facts. However, to exploit and benefit from the richness of this available data and knowledge, several problems have to be faced, namely data linking, data fusion and knowledge discovery, when data is of big volume, heterogeneous and evolving. In this tutorial we will first give an overview of existing data linking and key discovery approaches. Then, we will discuss the problem of identity crisis caused by misuse of the owl:sameAs predicate and give some possible solutions. We will finish by highlighting some current challenges in this research area.
Semantics for Big Data Integration and Analysis – Craig Knoblock
Much of the focus on big data has been on the problem of processing very large sources. There is an equally hard problem of how to normalize, integrate, and transform the data from many sources into the format required to run large-scale analysis and visualization tools. We have previously developed an approach to semi-automatically mapping diverse sources into a shared domain ontology so that they can be quickly combined. In this paper we describe our approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets.
Data Enthusiasts London: Scalable and Interoperable data services. Applied to... – Andy Petrella
Data science requires many skills, many people, and a lot of time before results can be accessed. Moreover, those results can no longer be static. And with Big Data in the picture, the whole tool chain needs to change.
In this talk Data Fellas introduces Shar3, a toolkit aiming to bridge the gaps in building an interactive distributed data processing pipeline, or loop!
The talk then covers current problems in genomics, including data types, processing, and discovery, by introducing the GA4GH initiative and its implementation using Shar3.
The Apache Solr Semantic Knowledge Graph – Trey Grainger
What if instead of a query returning documents, you could alternatively return other keywords most related to the query: i.e. given a search for "data science", return me back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields) allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and will walk you through how to get up and running with an example dataset to explore the meaningful relationships hidden within your data.
How Graph Databases efficiently store, manage and query connected data at s... – jexp
Graph Databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query and consistency approaches that are used behind the scenes. We’ll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing and more, and see which papers and research database developers take inspiration from.
Geophy CTO Sander Mulders presented their Metadata platform at our March meetup at Skills Matter's CodeNode. The talk was about how Geophy use Linked Data approaches to accelerate and improve the accuracy of real estate tasks such as valuations.
Sander talked about the thousands of data sources used, how they use RDF for data integration, and how to construct features and metadata-driven services using components such as Apache Kafka and Stardog.
Family tree of data – provenance and neo4j – M. David Allen
Discusses data provenance and how it can be implemented in neo4j, as well as many lessons learned about the relative strengths and weaknesses of relational and graph databases.
Applied Machine learning using H2O, python and R Workshop – Avkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Prerequisites: Basic knowledge of R/Python and general ML concepts
Note: This is a bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate.
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding a linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
An Introduction to Graph: Database, Analytics, and Cloud Services – Jean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser... – Andy Petrella
Distributed Data Science…
* A genomics use case
* Spark Notebook
* Interactive Distributed Data Science
Distributed Data Science… Pipeline
* Pipeline: productizing Data Science
* Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
* Why Micro Services?
* Painful points:
* Data science is Discontiguous
* Context Lost in Translation
* Solution: Data Fellas’ Agile Data Science Toolkit
Large Scale Graph Analytics with RDF and LPG Parallel Processing – Cambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
Forces and Threats in a Data Warehouse (and why metadata and architecture is ... – Stefan Urbanek
This keynote looks at some very common forces and threats that cause suffering in a data warehouse, shows examples of why the concepts are still relevant despite having all the high-end technology, and provides suggestions for starting with architecture and metadata.
Leveraging mesos as the ultimate distributed data science platform – Andy Petrella
Keynote at the first @MesosCon #Europe on what Data Science is, what the new challenges and needs are, and how we address them at Data Fellas with the Spark Notebook and Shar3.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes – MongoDB
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect, Matt Kalan, sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
PoolParty Semantic Suite is Semantic Web Company’s platform for enterprise information integration based on Linked Data principles. PoolParty consists of several components that process and manage RDF based data sets. These components have consistency requirements towards the data they work on.
Also, users have requirements towards the quality of the data they manage. We want to express constraints for both in a standard way throughout PoolParty components. SKOS-based PoolParty Thesaurus project data requires both consistency and quality.
Introduction to Designing and Building Big Data Applications – Cloudera, Inc.
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
QuerySurge Slide Deck for Big Data Testing Webinar – RTTS
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why Testing is pivotal to the success of your Big Data Strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
Building and deploying LLM applications with Apache Airflow – Kaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integration and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Graph Databases in the Microsoft Ecosystem – Marco Parenzan
With SQL Server and Cosmos DB we now have graph databases broadly available, after decades of study in database theory and a niche open-source presence with Neo4j. And then there are services like Microsoft Graph and Azure Digital Twins that give us vertical implementations of graphs. So let's take a walk around graphs in the Microsoft ecosystem.
"Oslo" is the codename for Microsoft's forthcoming modeling platform. Modeling is used across a wide range of domains; it allows more people to participate in application design and lets developers write applications at a much higher level of abstraction.
MongoDB Evenings Toronto - Monolithic to Microservices with MongoDB – MongoDB
Monolithic to Microservices with MongoDB: Building Highly Available Services
Shawn McCarthy, Senior Solutions Architect, MongoDB
MongoDB Evenings Toronto
Infusion Offices
September 27, 2016
Tutorial Workgroup - Model versioning and collaboration – PascalDesmarets1
Hackolade Studio has native integration with Git repositories to provide state-of-the-art collaboration, versioning, branching, conflict resolution, peer review workflows, change tracking and traceability. Most importantly, it allows co-locating data models and schemas with application code, and further integrating with DevOps CI/CD pipelines as part of our vision for Metadata-as-Code.
Co-located application code and data models provide the single source-of-truth for business and technical stakeholders.
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading – Paco Nathan
Presentation to the Boulder/Denver BigData meetup 2013-09-25 http://www.meetup.com/Boulder-Denver-Big-Data/events/131047972/
Overview of Enterprise Data Workflows with Cascading; code samples in Cascading, Cascalog, Scalding; Lingual and Pattern Examples; An Evolution of Cluster Computing based on Apache Mesos, with use cases
Big Data Advanced Analytics on Microsoft Azure – Mark Tabladillo
This presentation provides a survey of the advanced analytics strengths of Microsoft Azure from an enterprise perspective (with these organizations being the bulk of big data users) based on the Team Data Science Process. The talk also covers the range of analytics and advanced analytics solutions available for developers using data science and artificial intelligence from Microsoft Azure.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas – Data Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
4. MetaQL
Leverage Domain Model (Schema)
Compose Queries in Code: Typed
Execute Queries on Databases, Interchangeably
Minimize TCO: Separation of Concerns, Developer Efficiency
Query Framework: Executable JVM Code! (Groovy Closure)
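To make this concrete, here is a minimal sketch of the idea using the GRAPH/segment constructs shown later in this deck. The Person class, the endpoint profiles, and the query() method name are illustrative assumptions, not part of the original slides.

// Illustrative sketch only: Person, the endpoint profiles, and query() are assumptions.
def personQuery = {
    GRAPH {
        value segments: [VitalSegment.withId('app-data')]
        ARC {
            node_constraint { Person.props().name.contains_i("lennon") }
        }
    }
}

// The same typed Groovy closure can be handed to interchangeable backends:
def sqlService   = VitalService.getService(profile: "sql-endpoint")     // hypothetical profile
def sparkService = VitalService.getService(profile: "spark-endpoint")   // hypothetical profile

[sqlService, sparkService].each { service ->
    service.query(personQuery).each { println it }   // MetaQL passed as an input closure
}

The point is that the query is composed once, against the domain model, and the choice of backing database becomes a deployment concern rather than a code change.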
5. MetaQL Origin
Across many data-driven application implementations, a desire for:
Reusable Processes, Tools: Stop re-inventing the wheel.
Tools to manage “schema” across an application & organization.
Tools to combine Semantic Web, NOSQL, and Hadoop/Spark.
Team Collaboration: Human labor is usually the limiting factor.
12. Internet of Things: Batch and Stream Processing
Components (from the slide diagram): Amazon Echo, Amazon Echo Service, haley-app webservice (Vert.X), Vital Prime, Database, DataScript, Hadoop - HDFS, Apache Spark (Streaming, MLLIB, NLP, GraphX), Aspen Datawarehouse, Analytics Layer, Serving Layer, Haley Device (Raspberry Pi), Voice to Text API, External APIs…
Cognitive Application: NLP and Inference to process User request; Query Knowledge in DB.
Streaming Prediction Models: “Should I really have more Coffee?”
28. volume, velocity, variety
polyglot persistence = multiple database technologies
…but we also have very many data models.
many databases, many data models, changing rapidly.
too many moving parts for a developer to reasonably manage!
need fewer APIs to learn!
29. what happens when changes occur?
Roles (each with their own tasks affected by change): Infrastructure, DevOps, Data Scientists, Business + Domain Experts, Developers
30. what changes?
Data Model Changes
New Data Sources
Infrastructure Change
Switch Databases
New Prediction Models / Features
New Service APIs…
Many Interdependencies…
Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.
31. total cost of ownership
How much code changes when we modify our data model to include new sources?
How do we minimize it by decoupling dependencies?
What happens when we switch database technologies?
32. Domain Model as “Contract”
Infrastructure, DevOps, Data Scientists, Business + Domain Experts, Developers – all sharing the Domain Model.
Everyone to agree on (or at least be aware of) the definition of Domain Concepts.
Use semantics to map “views”.
34. Infrastructure / DevOps
Database Types:
• Key/Value
• Document
• RDF Graph
• NOSQL
• Relational
• Timeseries
ACID vs. BASE
Optimizing Query Generation
Tuning Secondary Indices
Update MetaQL DSL for new DB features
CAP Theorem
36. Domain Model Implementation
Combine: SQL-style Schema with Hadoop Data Serialization Schema (Avro, Thrift, Protocol Buffers, Kryo, Parquet), and add Semantics: the “Meaning” of objects.
Not a table “person”, but define the concept of Person to be used throughout an application.
The implementation decides how to store “Person” data in its database.
37. Domain Model Implementation
Domain Model definition resolves:
RDF vs Property Graph model
Object Relational Impedance Mismatch
Use OWL to capture the Domain Model:
SubClasses
SubProperties
Multiple Inheritance
Marginal technology performance gains are hugely outweighed by human productivity gains, and a wider choice of tools.
Compromise across modeling paradigms.
38. Domain Model Implementation
Example: Healthcare Application:
URI<Person123> IS_A:
• Patient
• BillableAccount
• InsuredEntity
Same URI across three domain concepts: Diagnostics Records, Billing System, Insurance System.
Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java 8). The trait is used as a semantic marker to link to the Domain Model.
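As a rough illustration of the trait approach (plain Groovy, not the actual generated VitalSigns code), the pattern looks like this:

// Sketch only: plain Groovy traits standing in for generated domain-model markers.
trait Patient         { String patientId }
trait BillableAccount { BigDecimal balance }
trait InsuredEntity   { String policyNumber }

// One entity, one URI, playing three domain roles across the
// diagnostics, billing, and insurance systems.
class PersonRecord implements Patient, BillableAccount, InsuredEntity {
    String uri
}

def p = new PersonRecord(uri: "URI<Person123>")
p.patientId    = "pat-42"
p.balance      = 150.00
p.policyNumber = "POL-7"

assert p instanceof Patient && p instanceof BillableAccount && p instanceof InsuredEntity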
39. Domain Model - Core Classes
Core classes: Node, Edge, HyperNode, HyperEdge
Properties: URI, Primary Type, Types
Edges/HyperEdges: Source URI, Destination URI
Edge kinds: Peer, Taxonomy
Class Instances contain Properties.
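A bare-bones Groovy sketch of that core-class shape (the real VitalSigns classes carry much richer property metadata; the field and class names here simply mirror the list above):

// Sketch only: illustrative shape of the core classes, not the VitalSigns implementation.
abstract class GraphObject {
    String uri
    String primaryType          // most specific type
    List<String> types = []     // all types, including superclasses
    Map properties = [:]        // class instances contain Properties
}

class Node      extends GraphObject { }
class HyperNode extends GraphObject { }

class Edge extends GraphObject {
    String sourceURI            // edges connect by URI, not by object reference
    String destinationURI
}

class HyperEdge extends Edge { }

def john = new Node(uri: "urn:musician:john", primaryType: "Musician")
def band = new Node(uri: "urn:group:thebeatles", primaryType: "MusicGroup")
def hasMember = new Edge(uri: "urn:edge:1",
        sourceURI: band.uri, destinationURI: john.uri, primaryType: "Edge_hasMember")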
41. VitalSigns: Domain Model Dev Kit
$ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl
$ ls domain-groovy-jar
enron-dataset-groovy-1.0.0.jar
$ ls domain-json-schema
enron-dataset-1.0.0.js
OWL can be compiled into JVM code statically (create an artifact for Maven), or done dynamically at runtime.
43. Development with the Domain Model
VitalSigns vs = VitalSigns.get()
Musician john = new Musician().generateURI("john")
john.name = "John Lennon"
john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")
MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
thebeatles.name = "The Beatles"
// try to assign the wrong property, throws an exception
try { thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
} catch(Exception ex) { println ex } // no such property exception
vs.addToCache( thebeatles.addEdge_hasMember(john) )
// use cache to resolve queries
thebeatles.getMembers().each{ println it.name }
// use database to resolve queries
thebeatles.getMembers(ServiceWide).each{ println it.name }
Implicit MetaQL Queries
44. VitalService API
• Open/Close Endpoint
• Create/Remove Segment
• Create/Read/Update/Delete Object
• Queries (MetaQL as input closure)
• Service Operations (MetaQL as input closure)
• callFunction (DataScript)
• init Transaction/Commit/Rollback
A “Segment” is a Database (container of objects)
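A hedged sketch of that API surface in use; method names such as createSegment, save, get, delete, initTransaction and commit are assumptions based on the bullet list above, not verified signatures, and Customer is an illustrative domain class:

// Sketch only: method names and the Customer class are assumptions.
def service = VitalService.getService(profile: "app-db")        // open an endpoint

def segment = VitalSegment.withId("customers")                   // a Segment is a database (container of objects)
service.createSegment(segment)                                   // assumed name for "Create Segment"

def customer = new Customer().generateURI("cust-1001")           // illustrative domain class
customer.name = "Acme Corp"

def tx = service.initTransaction()                                // assumed names for transaction handling
service.save(segment, customer)                                   // Create/Update object
service.commit(tx)

def fetched = service.get(URIProperty.withString(customer.URI))   // Read object by URI (assumed call)
service.delete(fetched)                                            // Delete object (assumed call)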
45. MetaQL
VitalSigns: Domain Model Manager
• MetaQL DSL
• Prediction Model DSL
• Pipeline Transformation DSL (ETL)
(in development)
A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy). It’s an ongoing effort.
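To make the host-language point concrete, here is a tiny self-contained Groovy builder (a generic illustration, not the real MetaQL classes) showing the closure-delegation pattern that nested blocks like GRAPH { ARC { ... } } rely on:

// Minimal generic example of a nested-closure DSL in Groovy.
class ArcBuilder {
    Map spec = [constraints: [], arcs: []]
    void node_constraint(Closure c) { spec.constraints << c }
    void ARC(Closure body) {
        def child = new ArcBuilder()
        body.delegate = child
        body.resolveStrategy = Closure.DELEGATE_FIRST
        body()
        spec.arcs << child.spec
    }
}

Map GRAPH(Closure body) {
    def root = new ArcBuilder()
    body.delegate = root
    body.resolveStrategy = Closure.DELEGATE_FIRST
    body()
    return root.spec
}

def querySpec = GRAPH {
    ARC {
        node_constraint { "name contains happy" }      // placeholder constraint
        ARC { node_constraint { "nested arc" } }       // nested arc
    }
}
println querySpec.arcs.size()   // 1 top-level arc, with one nested arc inside it

The delegate-first resolution is what lets method calls inside each block land on the right builder, which is why an entire query can be an ordinary, executable Groovy closure.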
50. GRAPH query (2)
GRAPH {
value segments: [VitalSegment.withId('wordnet')]
value inlineObjects: true
ARC {
node_bind { "node1" }
node_constraint { SynsetNode.expandSubclasses(true) }
node_constraint { SynsetNode.props().name.contains_i("happy") }
ARC {
edge_bind { "edge" }
node_bind { "node2" }
}
}
}
Code iterating over Results can use bind names to reference objects in each solution: node1, edge, node2.
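One hedged sketch of consuming those bound results; the service/query method names and the solution accessor are assumptions, while the bind names node1, edge and node2 come from the query above:

// Sketch only: query() and the solution accessor are assumed; bind names are from the GRAPH query above.
def service = VitalService.getService(profile: "wordnet-db")    // hypothetical profile

service.query(graphQuery).each { solution ->                     // graphQuery = the GRAPH closure above
    def synset  = solution."node1"     // node matched by the SynsetNode constraints
    def edge    = solution."edge"      // edge bound in the inner ARC
    def related = solution."node2"     // node on the far side of that edge
    println "${synset.name} -> ${related.name} via ${edge.primaryType}"
}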
51. PATH query
def forward = true
def reverse = false
PATH {
value segments: segments
value maxdepth: 5
value rootURIs: [URIProperty.withString(inputURI)]
if( forward ) {
ARC {
value direction: 'forward'
// accept any edge: edge_constraint { }
// accept any node: node_constraint { }
}
}
if( reverse ) {
ARC {
value direction: 'reverse'
// accept any edge: edge_constraint { }
// accept any node: node_constraint { }
}
}
}
52. AGGREGATION query
SUM Product.props().cost
AVERAGE Person.props().birthday
COUNT_DISTINCT Document.props().active
FIRST { DISTINCT Document.props().title, expandProperty : false, order: Order.ASC }
Part of a SELECT query
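By analogy with the GRAPH and PATH blocks, a full SELECT query carrying one of these aggregates might look roughly like this; the SELECT block shape and the segment id are assumptions, not confirmed by the slides:

// Assumed shape only; the exact SELECT grammar may differ from this sketch.
def avgCost = SELECT {
    value segments: [VitalSegment.withId('catalog')]     // hypothetical segment id
    node_constraint { Product.expandSubclasses(true) }
    AVERAGE Product.props().cost                          // aggregate from the list above
}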
60. Spark-SQL / Dataframe
Segment RDD: (URI, P, V); Property RDD: (K, V)
Experimenting with: the new DataFrame optimizer (Catalyst), the new DataFrame DSL for query generation, and GraphX for isolated graph-query cases.
We can generate "bad" queries and let the optimizer fix them, with Spark partitioning the RDDs, as long as Spark is aware of the schema.
65. implementation
DSL Documentation to be posted:
http://www.metaql.org/
VitalSigns, VitalService, MetaQL
https://dashboard.vital.ai/
Vital AI github: https://github.com/vital-ai/
Sample Code
Spark Code: Aspen, Aspen-Datawarehouse
Documentation Coming!
66. closing thoughts
Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data.
“Domain Model as Contract” provides a framework for consistent interpretation of Data across an application.
MetaQL provides a framework for the consistent access and query of Data across an application.
Context: Data-Driven Applications / Cognitive Applications
67. Thank You!
Marc C. Hadfield, Founder
Vital AI
http://vital.ai
marc@vital.ai
917.463.4776
68. Pipeline DSL (ETL)
PIPELINE { // Workflow
PIPE { // a Workflow Component with dependencies
TRANSFORM { // Joins across Datasets
IF (RULE { } ) // Boolean, Query, Construct, …
THEN { RULE { } }
ELSE { RULE { } }
}
PIPE { … } // dependent PIPE
} // Output Dataset
PIPE { …
}
}
Influenced by Spark Pipeline and Google Dataflow Pipeline
70. Multiple Endpoints
def service1 = VitalService.getService(profile: "kv-users")
def service2 = VitalService.getService(profile: "posts-db")
def service3 = VitalService.getService(profile: "friendgraph-db")
// given user URI:user123@email.org
// get user object from service1
// find friends of user in friendgraph via service3
// find posts of friends in posts-db
// update service1 with cache of user-to-friends-postings
// send postings of friends to user in UI
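A hedged sketch of what those commented steps might look like in code; every method name (get, query, save), the Post domain class, the authorURI property, and the GRAPH/SELECT shapes are assumptions layered on the constructs shown earlier in the deck:

// Sketch only: all method names, constraints, and domain classes below are assumptions.
def userURI = URIProperty.withString("URI:user123@email.org")

// get user object from service1 (the kv-users store)
def user = service1.get(userURI)

// find friends of the user in the friend graph via service3
def friends = service3.query {
    GRAPH {
        value segments: [VitalSegment.withId('friendgraph')]
        ARC {
            node_bind { "friend" }
            // root at the user and follow friendship edges (constraints elided here)
        }
    }
}.collect { it."friend" }

// find posts of those friends in posts-db via service2
def postings = friends.collectMany { friend ->
    service2.query {
        SELECT {
            value segments: [VitalSegment.withId('posts')]
            node_constraint { Post.props().authorURI.equalTo(friend.URI) }
        }
    }
}

// update service1 with a cache of the user's friends' postings, then send to the UI
service1.save(VitalSegment.withId('kv-users'), postings)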