Slides from the AIMS (http://aims.fao.org/) webinar of 21 September 2017 by Martin Kaltenböck and Timea Turdean (Semantic Web Company): Text Mining in PoolParty Semantic Suite (https://www.poolparty.biz)
Open core summit: Observability for data pipelines with OpenLineage (Julien Le Dem)
This document discusses Open Lineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how Open Lineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on the data source, schema, owners, usage, and changes. The document outlines how Marquez implements the Open Lineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework to integrate various data platforms and extensions like Datakin's lineage analysis tools.
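To make the model concrete, here is a minimal sketch of an OpenLineage run event built as a Python dictionary and POSTed to a locally running Marquez instance. The job and dataset names are invented, and the endpoint assumes Marquez's default local development port; this is an illustration of the event shape, not a complete integration.

```python
import json
import uuid
from datetime import datetime, timezone

import requests  # pip install requests

# One OpenLineage run event: a job run starts, reads one dataset,
# writes another. Job/dataset names below are invented examples.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",       # identifies the instrumentation
    "run": {"runId": str(uuid.uuid4())},                  # unique per run
    "job": {"namespace": "my_team", "name": "daily_orders_etl"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.orders"}],
    "outputs": [{
        "namespace": "s3://warehouse",
        "name": "orders_summary",
        # facets attach extensible metadata, e.g. the output schema
        "facets": {"schema": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
            "fields": [{"name": "order_id", "type": "BIGINT"}],
        }},
    }],
}

# Assumes a Marquez instance on its default local port.
resp = requests.post("http://localhost:5000/api/v1/lineage",
                     data=json.dumps(event),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
```

Emitting a matching COMPLETE event with the same runId is what lets the backend stitch individual runs into a lineage graph.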
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS... (Amazon Web Services)
AWS hosts a variety of public data sets that anyone can access for free. Previously, large data sets such as satellite imagery or genomic data have required hours or days to locate, download, customize, and analyze. When data is made publicly available on AWS, anyone can analyze any volume of data without downloading or storing it themselves. In this session, the AWS Open Data Team shares tips and tricks, patterns and anti-patterns, and tools to help you effectively stage your data for analysis in the cloud.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Apache Calcite (a tutorial given at BOSS '21) (Julian Hyde)
The document provides instructions for setting up the environment and code for the BOSS '21 Copenhagen tutorial on Apache Calcite.
Setup involves two steps:
1. Clone the GitHub repository containing the sample code and dependencies.
2. Compile the project.
It also outlines the draft schedule for the tutorial, which covers a Calcite introduction, a demonstration of SQL queries on CSV files, setting up the coding environment, using Lucene for indexing, and coding exercises that build parts of the logical and physical query plans in Calcite.
The tutorial is led by Stamatis Zampetakis of Cloudera and Julian Hyde of Google, both committers to the Apache Calcite project.
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
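As a taste of the API described above, here is a minimal PySpark sketch that reads from an existing source and computes an aggregate in memory. The HDFS path and column names are placeholders; on a YARN or Mesos cluster you would submit the same script via spark-submit with the appropriate --master setting.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a Spark session for this application.
spark = (SparkSession.builder
         .appName("spark-intro")
         .getOrCreate())

# Hypothetical dataset on HDFS; any supported source works here.
events = spark.read.json("hdfs:///data/events")

# Filter and aggregate in memory across the cluster.
daily_counts = (events
                .filter(F.col("status") == "ok")
                .groupBy("event_date")
                .count())

daily_counts.show()
spark.stop()
```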
This document discusses data product architectures and provides examples of different architectures for data products, including the lambda architecture, analyst architecture, recommender architecture, and partisan discourse architecture. It also discusses common design principles for data product architectures, such as using microservices with stateful backend services and database-backed APIs. Key aspects of data product architectures include handling training data and models, making predictions via APIs, updating models and annotations, and designing flexible systems that can incorporate new models and data.
Achieving Lakehouse Models with Spark 3.0 (Databricks)
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star Schemas & Kimball is one of those things that isn’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, share our experiences, and cover the topics below (a sketch of the staging-table pattern follows the list).
- What are we storing?
- Multi Source – Multi Channel Problem
- Data Representation and Nested Schema Evolution
- Performance Trade-Offs with Various Formats
- Anti-patterns used (String FTW)
- Data Manipulation using UDFs
- Writer Worries and How to Wipe Them Away (Staging Tables FTW)
- Datalake Replication Lag Tracking
- Performance Time!
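Here is a minimal PySpark sketch of the staging-table upsert pattern the topic list alludes to: new arrivals land in a staging Delta table first and are then merged into the main table, so concurrent readers never see partial writes. The paths and the profile_id key are invented for illustration; this is not Adobe's actual pipeline.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

# Hypothetical staging location where a batch of new records landed.
staging = spark.read.format("delta").load("/mnt/staging/profiles")

# The main Delta table that downstream consumers query.
target = DeltaTable.forPath(spark, "/mnt/lake/profiles")

# Atomic upsert: update matching rows, insert the rest.
(target.alias("t")
 .merge(staging.alias("s"), "t.profile_id = s.profile_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```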
Accelerate Your ML Pipeline with AutoML and MLflow (Databricks)
Building ML models is a time-consuming endeavor that requires a thorough understanding of feature engineering, selecting useful features, choosing an appropriate algorithm, and performing hyperparameter tuning. Extensive experimentation is required to arrive at a robust and performant model. Additionally, keeping track of the models that have been developed and deployed may be complex. Solving these challenges is key to successfully implementing end-to-end ML pipelines at scale.
In this talk, we will present a seamless integration of automated machine learning within a Databricks notebook, thus providing a truly unified analytics lifecycle for data scientists and business users with improved speed and efficiency. Specifically, we will show an app that generates and executes a Databricks notebook to train an ML model with H2O’s Driverless AI automatically. The resulting model will be automatically tracked and managed with MLflow. Furthermore, we will show several deployment options to score new data on a Databricks cluster or with an external REST server, all within the app.
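To illustrate the tracking half of this workflow, here is a small sketch that logs a parameter, a metric, and a model with MLflow. It swaps in scikit-learn for brevity rather than the H2O Driverless AI integration the talk demonstrates; the run name and parameter values are arbitrary.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X, y)

    # Everything logged here is queryable later in the MLflow UI/API.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("cv_accuracy", cross_val_score(model, X, y, cv=5).mean())
    mlflow.sklearn.log_model(model, "model")  # stores the model artifact
```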
Agile Data Engineering - Intro to Data Vault Modeling (2016) (Kent Graziano)
The document provides an introduction to Data Vault data modeling and discusses how it enables agile data warehousing. It describes the core structures of a Data Vault model including hubs, links, and satellites. It explains how the Data Vault approach provides benefits such as model agility, productivity, and extensibility. The document also summarizes the key changes in the Data Vault 2.0 methodology.
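The three core structures are easiest to see as DDL. Below is a minimal sketch executed through Python's sqlite3 module: a hub holds a business key, a link relates hubs, and a satellite holds descriptive attributes over time. The table and column names are invented, and production Data Vaults typically use hash keys and additional metadata columns.

```python
import sqlite3

# Hub = business key; Link = relationship between hubs;
# Satellite = descriptive attributes tracked over time.
ddl = """
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_id   TEXT NOT NULL,      -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_id      TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_customer_order (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    order_hk      TEXT REFERENCES hub_order(order_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    PRIMARY KEY (customer_hk, load_date)  -- full history kept per load
);
"""

with sqlite3.connect(":memory:") as conn:
    conn.executescript(ddl)
```

Extensibility follows directly from this shape: adding a new source or attribute means adding a satellite or link, never altering an existing hub.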
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale, and as cloud computing offers cheap computing and data storage resources at scale, data platforms have to match in their ability to process, analyze, and visualize data at scale, at speed, and with ease — this involves paradigm shifts in processing and storage, and in the programming frameworks that give developers access to these platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools, such as Koalas, help data scientists do exploratory data analysis at scale in a language and framework they are familiar with; we will also touch on emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... (Julian Hyde)
The revolution has happened. We are living in the age of the deconstructed database. Modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest; somewhat surprisingly, the lingua franca for data remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Data engineers build massive data storage systems and develop architectures like databases and data processing systems. They install continuous pipelines to move data between these large data "pools" and allow data scientists to access relevant data sets. Data engineers require technical skills in databases, SQL, data modeling, ETL, programming languages, data warehousing and newer technologies like NoSQL, Hadoop and machine learning. They are responsible for designing, implementing, testing and maintaining scalable data systems, ensuring business requirements are met, researching new data sources, cleaning and analyzing data, and collaborating with other teams. The role continues to evolve with new database and development technologies.
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than... (Databricks)
Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history (>10M profiles, >100M job posts, job proposals and hiring decisions, >10B messages, plus transaction and feedback data). Besides sheer quantity, our data is also contextually very rich. We have client and contractor data for the entire job funnel – from finding jobs to getting the job done.
For various machine learning applications including search and recommendations and labor marketplace optimization (rate, supply and demand), we heavily relied on a Greenplum-based data warehouse solution for data processing and ad-hoc ML pipelines (weka, scikit-learn, R) for offline model development and online model scoring.
In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks Model Scoring, scoring over structured streams, and microservices, and 4) orchestrating and streamlining all these processes using Apache Airflow and a CI/CD workflow customized to our Data Science product engineering needs. The focus of this talk is on how we were able to leverage the Databricks service offering to reduce DevOps overhead and costs, complete the entire modernization with moderate effort, and adopt a collaborative notebook-based solution for all our data scientists to develop models, reuse features, and share results. We will share the core lessons learned and pitfalls we encountered during this journey.
Lake Database, Database Templates and Map Data in Azure Synapse Analytics (Erwin de Kreuk)
Database templates in Synapse Analytics are blueprints that organizations can use to plan, architect, and design solutions.
How can we use these database templates in day-to-day business to speed up and automate this process?
The Map Data tool can help us with that.
Introduction to Data Engineer and Data Pipeline at Credit OK (Kriangkrai Chaonithi)
The document discusses the role of data engineers and data pipelines. It begins with an introduction to big data and why data volumes are increasing. It then covers what data engineers do, including building data architectures, working with cloud infrastructure, and programming for data ingestion, transformation, and loading. The document also explains data pipelines, describing extract, transform, load (ETL) processes and batch versus streaming data. It provides an example of Credit OK's data pipeline architecture on Google Cloud Platform that extracts raw data from various sources, cleanses and loads it into BigQuery, then distributes processed data to various applications. It emphasizes the importance of data engineers in processing and managing large, complex data sets.
Graph databases are a type of NoSQL database that is optimized for storing and querying connected data and relationships. A graph database represents data in graphs consisting of nodes and edges, where the nodes represent entities and the edges represent relationships between the entities. Graph databases are well-suited for applications that involve complex relationships and connected data, such as social networks, knowledge graphs, and recommendation systems. They allow for flexible querying of relationships and connections via graph traversal operations.
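Here is a small sketch of such a traversal using Neo4j's official Python driver against a hypothetical social graph; the connection details, credentials, and schema (Person nodes, FRIEND relationships) are assumptions for illustration. Variable-length relationship patterns like this are the kind of query that is awkward to express in SQL but natural in a graph database.

```python
from neo4j import GraphDatabase  # pip install neo4j

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "secret"))

with driver.session() as session:
    # Friends-of-friends who are not already direct friends:
    # a classic recommendation traversal.
    result = session.run(
        """
        MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(fof:Person)
        WHERE NOT (me)-[:FRIEND]-(fof) AND me <> fof
        RETURN DISTINCT fof.name AS suggestion
        """,
        name="Alice",
    )
    for record in result:
        print(record["suggestion"])

driver.close()
```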
Summary introduction to data engineering (Novita Sari)
Data engineering involves designing, building, and maintaining data warehouses to transform raw data into queryable forms that enable analytics. A core task of data engineers is Extract, Transform, and Load (ETL) processes - extracting data from sources, transforming it through processes like filtering and aggregation, and loading it into destinations. Data engineers help divide systems into transactional (OLTP) and analytical (OLAP) databases, with OLTP providing source data to data warehouses analyzed through OLAP systems. While similar, data engineers focus more on infrastructure and ETL processes, while data scientists focus more on analysis, modeling, and insights.
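The ETL flow from an OLTP source to an analytical store can be shown end to end in a few lines. Below is a toy sketch using only the standard library, with two in-memory SQLite databases standing in for the transactional system and the warehouse; the table and column names are invented.

```python
import sqlite3

# --- OLTP source: row-oriented, transaction-friendly ---
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 'EU', 120.0, 'paid'), (2, 'EU', 80.0, 'refunded'),
        (3, 'US', 200.0, 'paid');
""")

# Extract + transform: filter to paid orders, aggregate per region.
rows = oltp.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE status = 'paid' GROUP BY region"
).fetchall()

# --- Load into the analytical (OLAP-style) destination ---
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE revenue_by_region (region TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO revenue_by_region VALUES (?, ?)", rows)

print(warehouse.execute("SELECT * FROM revenue_by_region").fetchall())
```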
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
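One concrete monitoring hook is the query's progress report. The sketch below assumes that polling lastProgress is enough for the demonstration; real deployments would typically use a streaming query listener or external alerting, and the built-in rate source stands in for Kafka or files.

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-monitor").getOrCreate()

# "rate" generates rows on a schedule; a stand-in for a real source.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (stream.writeStream
         .format("memory")          # sink choice is illustrative only
         .queryName("rates")
         .outputMode("append")
         .start())

time.sleep(10)  # let a few micro-batches complete

# A stream falling behind shows up as processed < input rates,
# or as growing batch durations.
progress = query.lastProgress
if progress is not None:
    print("input rows/s:", progress["inputRowsPerSecond"])
    print("processed rows/s:", progress["processedRowsPerSecond"])

query.stop()
spark.stop()
```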
This document discusses how graphs and Neo4j can be used for various use cases in banking. It provides an agenda for the discussion including introductions to graphs and Neo4j, banking data overviews, and specific use cases like fraud detection, risk analysis, knowledge graphs, and customer 360 views. Examples are given for how graph databases could help with each use case, with a fraud detection demo. Additional potential uses include identity and access management and regulatory compliance.
Volvo Cars - Retrieving Safety Insights using Graphs (GraphSummit Stockholm 2... (Neo4j)
Volvo Cars has developed a representation of map attributes as a graph in Neo4j. By including real-time car data, they are able to collect insights about possible accident causes based on road infrastructure.
The document introduces Visual DataVault, a modeling language for visually expressing Data Vault models. It aims to generate DDL from models and support Microsoft Office. The language defines basic entities like hubs, links, satellites and reference tables. It also covers query assistant tables, computed structures, exploration links and business vault tables to enhance the raw data vault. Some remarks note it focuses on logical not physical modeling and more features are planned.
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017 (Caserta)
Over the past eight or nine years, applying DevOps practices to various areas of technology within business has grown in popularity and produced demonstrable results. These principles are particularly fruitful when applied to a data analytics environment. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains. Bob also outlines why DevOps and disruption management go hand in hand.
Topics include:
- The benefits of a DevOps approach, with an emphasis on improving quality and efficiency of data analytics
- Why the push for a DevOps practice needs to come from the C-suite and how it can be integrated into all levels of business
- An overview of the best tools for developers, data analysts, and everyone in between, based on the business’s existing data ecosystem
- The challenges that come with transforming into an analytics-driven company and how to overcome them
- Practical use cases from Caserta clients
This presentation was originally given by Bob at the 2017 Strata Data Conference in New York City.
Timeseries - data visualization in Grafana (OCoderFest)
This document discusses using Grafana to visualize time series data stored in InfluxDB. It begins with an introduction to the speaker and agenda. It then discusses why Grafana is useful for quality assurance, anomaly detection, and monitoring analytics. It provides an overview of the monitoring process involving collecting metrics via StatsD and storing them in InfluxDB. Details are given about InfluxDB's purpose, structure, querying, downsampling and retention policies. Telegraf is described as an agent for collecting and processing metrics to send to InfluxDB. StatsD is explained as a protocol for incrementally reporting counters and gauges. Finally, Grafana's purpose, structure, data sources and dashboard creation are outlined, with examples shown in a demonstration.
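The StatsD side of this pipeline is tiny in application code. Below is a minimal sketch of emitting the three basic metric types from Python using the statsd PyPI package, assuming a StatsD-compatible listener such as Telegraf on the default UDP port; Telegraf would then forward aggregated metrics to InfluxDB for Grafana to chart.

```python
import statsd  # pip install statsd

# Telegraf (or any statsd daemon) listens on UDP 8125 by default.
client = statsd.StatsClient("localhost", 8125)

client.incr("api.requests")           # counter: one more request served
client.timing("api.latency_ms", 42)   # timer: one observed latency sample
client.gauge("queue.depth", 17)       # gauge: current absolute value
```

Because the transport is fire-and-forget UDP, instrumentation like this adds effectively no latency or failure modes to the application being measured.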
Simple Knowledge Organisation System (SKOS) as the core of Enterprise Knowled... (Andreas Blumauer)
Enterprises use knowledge graphs for more agile information management. Taxonomies form an essential part of knowledge graphs. When based on Semantic Web standards, parts of graphs can be reused more efficiently. SKOS, as a standard for taxonomies, plays a crucial role in this information architecture.
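To show how small the SKOS core is, here is a sketch that builds a two-concept taxonomy with rdflib; the base URI and concept labels are invented. Concepts, preferred/alternative labels, and broader/narrower links are essentially all a basic SKOS taxonomy needs.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/taxonomy/")  # hypothetical base URI

g = Graph()
g.bind("skos", SKOS)

energy = EX["energy"]
solar = EX["solar-energy"]

g.add((energy, RDF.type, SKOS.Concept))
g.add((energy, SKOS.prefLabel, Literal("Energy", lang="en")))

g.add((solar, RDF.type, SKOS.Concept))
g.add((solar, SKOS.prefLabel, Literal("Solar energy", lang="en")))
g.add((solar, SKOS.altLabel, Literal("Photovoltaics", lang="en")))
g.add((solar, SKOS.broader, energy))   # hierarchy: solar is narrower

print(g.serialize(format="turtle"))
```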
The document summarizes plans for the LockSchuppenAg company to renovate a rail roundhouse in Dresden, Germany into a future lab and coworking space called NooPolis. Key aspects include investing 5-7 million euros in the project, creating an open organization with open venture capital and a peer academy. NooPolis will use virtual currency called KayGroschen and include a stock exchange, wiki-based constitution and governance, and serve as a testbed for new technologies and economic models. The goal is to build NooPolis and host a SingularSummitEurope conference in September/October 2009 to discuss issues around the future and technological singularity.
This document summarizes an Open GLAM workshop held at the Creative Commons Global Summit on September 15, 2011. The workshop discussed creating a global network to open up content and data from galleries, libraries, archives, and museums. Key topics included developing principles for open content and data, identifying incentives and barriers, and brainstorming next steps like creating an Open GLAM mailing list and wiki. The overall goal is to enable cultural heritage institutions to share digital public domain works and metadata.
Museums in Flanders are contributing their artwork data to Wikidata to make it available to a broader audience. The data, including persistent identifiers and links to external authorities, was uploaded under a CC0 license. This provides benefits like low costs, increased reach through Wikipedia, and placing the works in a wider context. Museums can now get back an RDF export of the data and see their works integrated in the linked open data cloud. Next steps include adding more detail to artist biographies and correcting any errors or duplicates in the data.
Linked Open Data Publications through Wikidata & Persistent Identification... (PACKED vzw)
In order for museums to truly reap the benefits of publishing their collections online in a sustainable way, PACKED vzw presents the results of its Linked open data project as a best practice guide for the Flemish heritage sector.
The document discusses the Datahub project, which aims to create a shared datahub architecture for museums in Flanders to store and provide access to their collection data. The goals of the project are to lower barriers for museums to connect their data to modern technologies, make data more flexible and reusable, and improve accessibility of museum data. The project will develop an open source datahub framework and deploy a reference implementation in three Flemish art museums in Phase 1 from 2016-2017. Phase 2 will expand the community and integrate four contemporary art museums. The datahub will store collection metadata and make it available through APIs and other technologies to enable new uses of the data.
Why we share more than ever. The potential of open and reusable collection data (Antje Schmidt)
The document discusses the Museum für Kunst und Gewerbe's approach to making its collection data openly available and reusable. Some key points:
- The museum launched a website in 2015 providing open access to around 3,000 digitized objects using open licenses like CC0 to waive all copyright restrictions.
- Making the data openly available online has led to increased usage, with over 20,000 downloads and shares of collection content.
- Opening the collection has shifted the museum's perception of itself from sole owner and gatekeeper of the collection to a facilitator and collaborator, with online audiences now seen as co-authors.
- New applications and reuses of the open collection data have emerged.
The document discusses making web content machine readable through linked open data and APIs in order to increase discoverability. It provides examples of how metadata from documents and databases can be extracted and linked together in semantic graphs to allow for complex queries across multiple sources. By making content and metadata accessible via APIs, cultural institutions like libraries, archives and museums are able to publish their collections as linked open data and have their resources incorporated and linked to by other semantic web applications and databases. This improves discovery of materials while also providing opportunities for new types of applications to be built by developers using the data.
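Here is a short sketch of the kind of cross-collection query this enables, run against the public Wikidata SPARQL endpoint with the SPARQLWrapper library; an application would query a museum's own endpoint the same way. The user-agent string is an invented courtesy identifier.

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

# Public linked-data endpoint; polite clients identify themselves.
sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="lod-demo/0.1 (example)")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?painting ?paintingLabel WHERE {
      ?painting wdt:P31 wd:Q3305213 ;     # instance of: painting
                wdt:P170 wd:Q5598 .       # creator: Rembrandt
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["paintingLabel"]["value"])
```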
Online Collections: interpretation vs automation - Part 1 (Brian Gomez)
The document discusses balancing automation and interpretation when presenting online collections. It defines automation as mass publishing of collection data and interpretation as telling artifact stories and contexts. It presents the Virtual Exhibit and PastPerfect-Online tools for these purposes. It also provides ideas for using a combination of tools to tell stories through an Artifact of the Day blog linked from the website and social media to recruit help creating interpretive content.
Maps of The World's Most Important Museum Clusters (Stipo)
Stipo's research on the world's most important museum clusters. The maps are at the same scale to enable comparison of size. Also included: a list of museum cluster websites and logos.
museumplein, amsterdam, museumsquartier, wien, museum mile, new york, national mall, washington, millennium park, chicago, exhibition road, london, louvre, paris, hermitage, st petersburg, federation square, melbourne, balboa park, san diego, skeppsholmen, stockholm, kunstberg, brussels, varosliget, városliget, budapest, kitanomaru park, tokyo, museumsufer, frankfurt, kunstareal, muenchen, munchen, munic, münchen, museum, museum clusters, museum cluster, cultural clusters, cultural cluster
This document summarizes an interview with Merete Sanderhoff, a project researcher at the Danish National Gallery about openness and sharing of cultural works. The interview discusses how providing open access to works and removing restrictions on use can increase awareness and engagement with cultural works. It notes that this approach allows others to build upon and spread knowledge of the works. While it may mean losing some control and potential revenue, the benefits of a larger audience and community of supporters who help promote the works are seen as outweighing these concerns.
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages (Michael Nelson)
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
The document discusses the Datahub project, which aims to create an open source framework for cultural institutions to manage and share their collection data. The project involves designing a universal datahub architecture, developing an open source software package, and deploying a reference implementation across seven Flemish museums. The goals are to make data more accessible and reusable, enrich it through links to external authorities, and lower the barriers for museums to connect their data to modern technologies. When complete, the datahub will aggregate collection information from multiple institutions into rich, standardized formats.
The document discusses a digital project at the Frick Art Reference Library to document art collections from the Gilded Age in New York. It received grant funding to digitize 70 catalogs from private art collections between the late 19th and early 20th centuries. The project coordinator discusses selecting materials, coordinating across departments, and creating an online exhibition using Google Open Gallery to showcase themes, subjects, and history from the materials.
This is a very basic workshop to introduce novice users to Omeka with an eye towards providing hands-on experience to decide whether it can serve their own research needs.
The document outlines the Museum of New Media's plans to implement a collection management system to catalog and provide access to their growing net art collection. They will use the open source software xDams and the metadata standard LIDO to catalog over 50 net artworks and related materials. The project will establish numbering schemes, file naming conventions, and internal vocabularies. It details staff roles and a six month timeline to have the collection cataloged in xDams before a planned net art retrospective exhibition.
1. The document discusses Open GLAM, which aims to make digital copies and metadata from galleries, libraries, archives, and museums (GLAM institutions) openly available through Creative Commons licenses.
2. It proposes creating an Open GLAM coalition to encourage GLAM institutions to openly share public domain works and metadata.
3. Short term plans include growing an Open GLAM mailing list and wiki, holding meetings and events, and hiring an Open GLAM evangelist to build relationships between activists and GLAM institutions.
"Long Tail" Web strategy paper for Smithsonian American Art MuseumMichael Edson
This document outlines a web strategy for the Smithsonian American Art Museum. It proposes adopting a "Long Tail" strategy to build audiences by publishing as much online content as possible, making that content highly findable, and involving customers in the process. The strategy emphasizes publishing content in smaller "microcontent" pieces across a wide range of topics to reach niche interests. It also stresses improving findability through design and search optimization. Finally, it recommends learning more about current customers and involving them in contributing content to strengthen relationships and grow audiences.
Similar to Text Mining in PoolParty Semantic Suite
Benefiting from Semantic AI along the data life cycle (Martin Kaltenböck)
Slides of a one-hour session by Martin Kaltenböck (CFO and Managing Partner of Semantic Web Company / PoolParty Software Ltd) on 19 March 2019 in Boston, US, at Enterprise Data World 2019, titled: Benefiting from Semantic AI along the data life cycle.
Knowledge Graph Implementation into Drupal Content Management System (CMS) fo... (Martin Kaltenböck)
Slides of the presentation by Martin Kaltenböck (Managing Partner, Semantic Web Company, SWC, https://www.semantic-web.com) at Taxonomy Boot Camp London 2017 on 17 October 2017, titled: Knowledge Graph Implementation into Drupal Content Management System (CMS) for the UN Climate Technology Centre and Network (CTCN)
The Climate Tagger - a tagging and recommender service for climate informatio... (Martin Kaltenböck)
The Climate Tagger - a tagging and recommender service for climate information based on PoolParty Semantic Suite - slides of the talk by Sukaina Bharwani (Stockholm Environment Institute, SEI Oxford) and Martin Kaltenböck (Semantic Web Company, SWC Vienna) at the Taxonomy Boot Camp London 2016 (TBC London) on 19 October 2016.
ODI Node Vienna: Best-practice examples of open innovation by means of open data (Martin Kaltenböck)
Talk given at the Data Pioneers workshop on 10 October 2016 at the BMVIT on open innovation and open data (open innovation by means of open data), by Elmar Kiesling (TU Wien) and Martin Kaltenböck (SWC) for the ODI (Open Data Institute) Node Vienna.
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement... (Martin Kaltenböck)
The document outlines a workshop on quality assessment and improvements for open data portals, including results from requirements elicitation, proposed data quality metrics, and plans for the ADEQUATe project, which aims to improve data quality through assessment, algorithms, linked data principles, and community involvement. The workshop agenda covers user requirements, best practices, and an open discussion on data quality issues.
PoolParty Semantic Suite - LT-Innovate Industry Summit 2016 - Brussels (Martin Kaltenböck)
This document provides an overview of Semantic Web Company (SWC) and their PoolParty Semantic Suite product. It discusses SWC's background, customers, and partners. It then describes the key components and functionalities of PoolParty, including maintaining vocabularies, entity extraction, linked data integration, and advanced features like custom ontologies and corpus analysis. The document explains how PoolParty can integrate with databases like MarkLogic and Virtuoso, as well as content management systems like Drupal. Overall, the document aims to introduce SWC and PoolParty and demonstrate how their semantic technologies can provide benefits for tasks like data integration, search, and knowledge management.
Presentation of the Big Data Europe project at the EIP Water Conference 2016 ... (Martin Kaltenböck)
Presentation of the Big Data Europe project (http://www.big-data-europe.eu) at the EIP Water Conference 2016 in Leeuwarden, The Netherlands. Taking place on 09/02/2016 at the Wetsus Campus in Leeuwarden, the Netherlands in the course of an ICT4Water workshop.
The European Innovation Partnership on Water Online Marketplace (Martin Kaltenböck)
Presentation about 'The European Innovation Partnership (EIP) on Water Online Marketplace (http://www.eip-water.eu)' given on 09.02.2016 in the course of the EIP Water annual conference in Leeuwarden, The Netherlands.
PoolParty Semantic Suite: Management Briefing and Functional Overview (Martin Kaltenböck)
Slides for the presentation of PoolParty Semantic Suite on 12 November 2015 at KNVI Congres 2015 in Utrecht, the Netherlands (see: http://congres.knvi.info/), by Martin Kaltenböck in the Big Data & Linked Data session.
PoolParty Semantic Suite - Solutions for Sustainable Development (Martin Kaltenböck)
Presentation of the webinar: PoolParty for Sustainable Development - the Climate Tagger - taking place on 5 November 2015. More information and other presentations to be found here: http://bit.ly/1NpTcGT.
Recording of the webinar: https://www.youtube.com/watch?v=3GxtFfLL1ps.
Climate Technology Transfer supported through Linked Data: A Proof of Concept ... (Martin Kaltenböck)
Presentation: Climate Technology Transfer supported through Linked Data: A Proof of Concept for The Climate Technology Centre and Network (CTCN) - by Eelco Kruizinga (DNV GL) and Martin Kaltenböck (SWC) at the Linked Data Netherlands Conference on 29 September 2015 in Hilversum, NL.
Introduction to the Big Data Europe project at the CMG-AE event 'Big Data: Strategies, Technologies and Benefits'
19th of May 2015, Expat Center der Wirtschaftsagentur, Vienna, Austria
See: http://www.big-data-europe.eu
Einführung Linked Open Data (LOD) - Introduction to Linked Open Data (LOD) (Martin Kaltenböck)
Presentation by Martin Kaltenböck (SWC) at the Science Days of the Academy of Sciences on 3 December 2014 on the introduction, basics, and benefits of Linked Open Data (LOD), including the best practice: Linked Open Data Pilot Austria (LOD Pilot AT - http://linkeddata.gv.at).
Talk at the Semantic Web MeetUp Vienna on 16 October 2014, Top 24 in the Arkadenhof of Vienna City Hall, on the beta launch of the Linked Open Data Pilot Austria (LOD Pilot AT).
Open Data Portal (ODP) Austria - presentation at opendata.ch 2014 in ... (Martin Kaltenböck)
Slides for the talk by Martin Kaltenböck on 18 September 2014 at the annual Open Data CH conference in Zurich, Switzerland, on the Open Data Portal (ODP) Austria (http://www.opendataportal.at) and the Linked Open Data (LOD) Pilot Austria.
Linked Open Data Pilot Project Austria - LOD Pilot AT (Martin Kaltenböck)
Slide set from the Open Data Support training for the Austrian administration on 15 September 2014, organized by the Austrian Federal Chancellery. The LOD Pilot Austria implements a digital base-data infrastructure as Linked Open Data for Austria, based on the open data of data.gv.at (the national open data portal) and open.wien.gv.at (the data portal of the City of Vienna). In the project, 30-50 base datasets (industry sectors, economic branches, municipality codes, etc.) are published as Linked Open Data at linked.data.gv.at and made available for reuse. The project was financially supported by netidee, the Internet Foundation Austria.
Easy SPARQLing for the Building Performance Professional (Martin Kaltenböck)
Slides of Martin Kaltenböck's (SWC) presentation at the SEMANTiCS2014 conference in Leipzig on 5 September 2014 about the 'Tool for Building Energy Performance Scenarios' of GBPN (Global Buildings Performance Network, http://gbpn.org), which provides a prediction tool for building performance worldwide by making use of Linked Open Data (LOD).
Slides for the presentation of PoolParty Semantic Suite (http://www.poolparty.biz) at the PiLOD conference (http://www.pilod.nl) in Hilversum, the Netherlands, on 25 June 2014 by Martin Kaltenböck - as part of the presentation of Linked Open Data at Wolters Kluwer, together with Christian Dirschl of WKD.
Presentation by M. Kaltenböck of Semantic Web Company given at the Linked Open Data MeetUp Mannheim on 23 February 2014 on 'Semantic Information Management using PoolParty 4', explaining PoolParty Semantic Suite, its features, applications, and real-world use cases.
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationMartin Kaltenböck
Using DBpedia for Thesaurus Creation and Management as well as Linked Open Data (LOD) Integration with PoolParty Semantic Suite (http://www.poolparty.biz) at Semantic Web Company (SWC, http://www.semantic-web.at).
Codeless Generative AI Pipelines (GenAI with Milvus)
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
1. Martin Kaltenböck, CFO, Semantic Web Company
Timea Turdean, Technical Consultant, SWC
POOLPARTY SEMANTIC SUITE
AIMS Webinar, 21 September 2017
2. PoolParty Drupal Integration
Agenda
▸ Introduction: Semantic Web Company (SWC)
▸ Introduction: PoolParty Semantic Suite
▸ Using PoolParty for Text & Data Mining
▹ Text Mining for continuous knowledge graph modelling
▹ Entity linking and data integration
▹ Classification and semantic annotation / tagging
▸ DEMO(s) of text mining capability of PoolParty
▸ Customer Success Stories
▹ REEEP ClimateTagger
▹ healthdirect Australia
▹ CTCN Semantic Search
▹ EIP Water Matchmaking
▸ Q&A Session
4. INTRODUCING SEMANTIC WEB COMPANY
Semantic Web Company (SWC)
▸ Founded in 2004
▸ Based in Vienna
▸ Privately held
▸ 40+ employees, experts in text mining & linked data
▸ ~15-20% revenue growth / year
▸ 2.5 million euro in funding for R&D
▸ SWC named to KMWorld’s 2017 ‘100 Companies That Matter in Knowledge Management’
▸ Organising the SEMANTiCS conference series for 13 years
▸ https://www.semantic-web.com
5. INTRODUCING POOLPARTY
PoolParty Semantic Suite
▸ First release in 2009
▸ Current version 6.0
▸ W3C standards compliant
▸ Over 200 installations worldwide
▸ 50% of revenue is reinvested into PoolParty development
▸ PoolParty available on-premises or as a cloud service
▸ KMWorld listed PoolParty as a Trend-Setting Product in 2015, 2016 and 2017
▸ https://www.poolparty.biz/
6. SELECTED CUSTOMER REFERENCES AND PARTNERS
[World map: SWC headquarters]
Customer References
● Credit Suisse
● Boehringer Ingelheim
● Roche
● adidas
● The Pokémon Company
● Canadian Broadcasting Corporation
● Harvard Business School
● Wolters Kluwer
● Talend
● HealthStream
● TC Media
● Techtarget
● Seek
● Alliander N.V.
● Pearson - Always Learning
● Education Services Australia
● American Physical Society
● Healthdirect Australia
● World Bank Group
● Inter-American Development Bank
● Renewable Energy Partnership
● Wood MacKenzie
● Oxford University Press
● International Atomic Energy Agency
● Norwegian Directorate of Immigration
● Ministry of Finance (AT)
● Council of the E.U.
● Australian National Data Service
Partners
● Accenture
● EPAM Systems
● Enterprise Knowledge
● Mekon Intelligent Content Solutions
● B-S-S Business Software Solutions
● MarkLogic
● Wolters Kluwer
● Digirati
● Quark
[Map regions: US East, US West, AUS/NZL, UK]
8. TECHNICAL CORE COMPONENTS
[Architecture diagram: unstructured, semi-structured, and structured data feed into three core components on top of an RDF graph database - the Taxonomy & Ontology Server, the Entity Extractor & Text Mining, and Data Integration & Data Linking (UnifiedViews) - with PoolParty GraphSearch on top. Annotations: text mining identifies new candidate concepts to be included in a controlled vocabulary; controlled vocabularies serve as the basis for highly precise entity extraction; the Entity Extractor informs all incoming data streams about their semantics and links them; schema mapping is based on ontologies.
Sample text shown for extraction: 'Bain Capital is a venture capital company based in Boston, MA. Since inception it has invested in hundreds of companies including AMC Entertainment, Brookstone, and Burger King. The company was co-founded by Mitt Romney.']
11. ‘Elevator Pitch’
▸ Built as a ‘Semantic Middleware’
▸ Outstanding user-friendliness
▸ Fully standards-compliant
▸ Highly precise entity extraction
▸ Comprehensive API
▸ Excellent maintainability of extraction models
▸ Integrated with leading search engines & graph databases
▸ Integrated with leading content management platforms
▸ Product configuration options for growing requirements
▸ Highly experienced partner and service teams
12. Product Overview
All products are available as cloud services or for on-premise installation.
> PoolParty Feature & Price Matrix
Editions: PoolParty Basic Server | PoolParty Advanced Server | PoolParty Enterprise Server | PoolParty Semantic Integrator
Features (the matrix assigns each to one or more editions):
- SKOS Taxonomy Management
- Multiple Projects
- Taxonomy REST API
- Import/Export (incl. Excel)
- Rollback and History
- Ontologies and Custom Schemes
- Quality Management & Reports
- Advanced Corpus Management
- Vocabulary Mapping, Linked Data Mapping
- Linked Data Enrichment, Frontend, and SPARQL Endpoint
- Entity Extractor, Extractor API
- Auto-populate project from DBpedia
- Export to Remote Repository
- Workflow Management
- SKOS-XL (optional)
- Integration with graph databases
- Integration with search engines
- Data linking & mapping
- Data transformation pipelines with UnifiedViews
- Graph Search Server
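To make the Extractor API line in the matrix concrete, here is a minimal sketch of how a client could call an entity-extraction service over HTTP. The endpoint URL, parameter names, and response shape are invented for illustration and are not PoolParty's documented API.

```python
import requests

# Hypothetical endpoint and parameters -- illustrative only,
# not PoolParty's documented Extractor API.
EXTRACTOR_URL = "https://example.org/extractor/api/extract"

def extract_entities(text, project_id, language="en"):
    """Send raw text to an entity-extraction service and return matched concepts."""
    response = requests.post(
        EXTRACTOR_URL,
        data={"text": text, "projectId": project_id, "language": language},
        timeout=30,
    )
    response.raise_for_status()
    # Assume the service returns JSON like:
    # {"concepts": [{"uri": ..., "prefLabel": ..., "score": ...}]}
    return response.json().get("concepts", [])

if __name__ == "__main__":
    sample = ("Bain Capital is a venture capital company based in Boston, MA. "
              "The company was co-founded by Mitt Romney.")
    for concept in extract_entities(sample, project_id="my-thesaurus"):
        print(concept["uri"], concept["prefLabel"], concept["score"])
```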
16. Metadata and semantic data
The Peggy Guggenheim Collection is a modern art museum on the Grand Canal in the Dorsoduro sestiere of Venice, Italy. It is one of the most visited attractions in Venice. The collection is housed in the Palazzo Venier dei Leoni, an 18th-century palace, which was the home of the American heiress Peggy Guggenheim for three decades. She began displaying her private collection of modern artworks to the public seasonally in 1951. After her death in 1979, it passed to the Solomon R. Guggenheim Foundation, which eventually opened the collection year-round.
17. Metadata and semantic data
(The same Peggy Guggenheim text as on slide 16, now with entities annotated.)
[Annotations: the strings 'Peggy Guggenheim', 'Peggy Guggenheim Collection', 'Venice', and 'Canale Grande' are linked via skos:prefLabel to resources such as http://my.com/resource/328832, http://my.com/docs/45367, and http://my.com/docs/52345.]
18. Metadata and semantic data
(The same text as on slide 16.)
[Annotations now form a small graph: the strings 'Peggy Guggenheim', 'Peggy Guggenheim Collection', 'Venice', 'museum', and 'Canale Grande' carry skos:prefLabel links to resources (http://my.com/docs/45367, http://my.com/docs/52345, http://my.com/resource/62545, http://my.com/resource/328832) and an image (http://www.mycom.com/images/90546089), connected by relations such as 'has landmark', 'named after', 'hosted in', and 'has'.]
19. Metadata and semantic data
(The same text as on slide 16.)
[Full annotation graph: http://my.com/docs/328832 has rdf:type schema:Article, dct:title 'Peggy Guggenheim Collection', dct:creator http://my.com/people/32 (skos:prefLabel 'Mike Miller', skos:altLabel 'Michael Miller'), schema:image http://my.com/img/99.jpg, and skos:subject links to SKOS concepts labelled 'Peggy Guggenheim Collection Venice', 'museum' (with skos:broader), and 'Canale Grande', each carrying skos:prefLabel / skos:altLabel and schema:image attributes.]
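As a rough sketch of slide 19's annotation graph expressed as actual triples, the following uses Python's rdflib. The my.com URIs and labels are taken from the slide; the concept URI is invented, and dcterms:subject stands in for the slide's skos:subject (a property that is not part of the final SKOS recommendation).

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS, DCTERMS

SCHEMA = Namespace("http://schema.org/")

g = Graph()
doc = URIRef("http://my.com/docs/328832")
creator = URIRef("http://my.com/people/32")

# The document as a 'thing': type, title, creator, image
g.add((doc, RDF.type, SCHEMA.Article))
g.add((doc, DCTERMS.title, Literal("Peggy Guggenheim Collection")))
g.add((doc, DCTERMS.creator, creator))
g.add((doc, SCHEMA.image, URIRef("http://my.com/img/99.jpg")))

# The creator with preferred and alternative names
g.add((creator, SKOS.prefLabel, Literal("Mike Miller")))
g.add((creator, SKOS.altLabel, Literal("Michael Miller")))

# A subject concept links the document into the knowledge graph
# (concept URI invented for the sketch)
concept = URIRef("http://my.com/resource/pgc")
g.add((doc, DCTERMS.subject, concept))
g.add((concept, SKOS.prefLabel, Literal("Peggy Guggenheim Collection")))
g.add((concept, SKOS.broader, URIRef("http://my.com/resource/museum")))

print(g.serialize(format="turtle"))
```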
20. Resolving Language Problems
“While most people can deal with linguistic features as synonyms, homographs, polyhierarchies, and even with far more peculiar characteristics of natural languages, machines often struggle with automatic sense-making because of the lack of a semantic knowledge model that can be used programmatically.”
22. PoolParty Extractor
Uses several components of a knowledge model:
▸ Taxonomies based on the SKOS standard
▸ Ontologies based on RDF Schema or OWL
▸ Word form dictionaries
▸ Blacklists and stop word lists
▸ Disambiguation settings
▸ Domain-specific reference document corpus
▸ Statistical language model
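A minimal sketch of the first component, a SKOS taxonomy fragment, built with rdflib; the example.org URIs and the medical-devices scheme are invented for the example:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/concepts/")

g = Graph()
scheme = EX["medical-devices"]

# A tiny SKOS taxonomy of the kind the Extractor consumes: a broader
# concept and a narrower one, with preferred and alternative labels.
g.add((scheme, RDF.type, SKOS.ConceptScheme))

equipment = EX["diagnostic-equipment"]
g.add((equipment, RDF.type, SKOS.Concept))
g.add((equipment, SKOS.prefLabel, Literal("Diagnostic Equipment", lang="en")))
g.add((equipment, SKOS.inScheme, scheme))

scope = EX["ophthalmoscope"]
g.add((scope, RDF.type, SKOS.Concept))
g.add((scope, SKOS.prefLabel, Literal("Ophthalmoscope", lang="en")))
g.add((scope, SKOS.altLabel, Literal("funduscope", lang="en")))  # word form / synonym
g.add((scope, SKOS.broader, equipment))
g.add((scope, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```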
23. PoolParty’s SKOS editor
[Screenshot of the SKOS editor showing concepts such as 'A series' and 'A5 platform'. Sample text: 'The Audi Q3 is a compact crossover SUV made by Audi. It is based on the PQ35 platform of Volkswagen.']
25. ‘Setting the rules’ for text mining & entity extraction via thesaurus
Example: 'Proper use of a funduscope requires a bit of practice and familiarity with the functions of your device.' Extracted concepts: Diagnostic Equipment, Ophthalmoscope.
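A toy illustration of how such thesaurus 'rules' can drive extraction: matching against preferred and alternative labels, filtered by stop word and blacklist entries. The URIs, labels, and matching logic are invented simplifications, far simpler than PoolParty's actual extractor.

```python
import re

# Toy, dictionary-based concept extraction driven by thesaurus labels.
CONCEPTS = {
    "http://example.org/concepts/ophthalmoscope": {
        "prefLabel": "Ophthalmoscope",
        "altLabels": ["funduscope", "fundoscope"],   # word forms / synonyms
        "broader": "http://example.org/concepts/diagnostic-equipment",
    },
}
STOP_WORDS = {"a", "an", "the", "of", "with"}   # never match stop words
BLACKLIST = {"device"}                           # terms never to annotate

def extract(text):
    """Return (surface form, concept URI) pairs found in the text."""
    hits = []
    for uri, model in CONCEPTS.items():
        for label in [model["prefLabel"], *model["altLabels"]]:
            low = label.lower()
            if low in STOP_WORDS or low in BLACKLIST:
                continue
            if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE):
                hits.append((label, uri))
                # a real extractor would also report broader concepts,
                # e.g. Diagnostic Equipment for Ophthalmoscope
    return hits

print(extract("Proper use of a funduscope requires a bit of practice."))
# -> [('funduscope', 'http://example.org/concepts/ophthalmoscope')]
```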
28. Corpus analysis results in a network of concepts and terms
'I need support to continuously extend our taxonomy / controlled vocabulary!'
[Diagram: a reference corpus (websites; PDF, Word, ...; abstracts from DBpedia; RSS feeds) is analysed against skos:Concept entries, yielding a network of concepts and candidate terms (Term 1 through Term 8).]
The analysis delivers:
- Relevant terms and phrases
- Relevancy of concepts
- Co-occurrence between concepts and terms
- Co-occurrence between terms and terms
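To illustrate the co-occurrence statistics listed above, here is a toy corpus analysis; the documents and concept labels are invented, and a real corpus analysis would add relevancy scoring and phrase detection:

```python
from collections import Counter

# Toy corpus analysis: count concept-term co-occurrence within documents.
documents = [
    "solar energy storage improves grid stability",
    "wind energy and solar power reduce emissions",
    "battery storage supports wind power on the grid",
]
concept_labels = {"solar", "wind"}   # labels from the controlled vocabulary

concept_term = Counter()
for doc in documents:
    tokens = set(doc.split())
    for c in tokens & concept_labels:
        for t in tokens - concept_labels:
            concept_term[(c, t)] += 1

# Terms that frequently co-occur with a concept are candidates for
# synonyms, related concepts, or new narrower concepts in the taxonomy.
for (concept, term), count in concept_term.most_common(5):
    print(f"{concept!r} co-occurs with {term!r} {count}x")
```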
31. PoolParty as a supervised learning system
[Diagram of roles and components: roles (Content Manager, Integrator, Taxonomist/Ontologist) and components (Thesaurus Server, Extractor, PowerTagging, Index, Reference Corpus, CMS) connected by relations such as 'uses API', 'is user of', 'is basis of', 'annotates', 'enriches', 'extends', and 'analyzes'.]
33. PoolParty Semantic Integrator - at a glance
https://youtu.be/l_LppfS3wxk
[Diagram: unstructured and structured data flow through the Semantic Integrator (ETL / monitoring / scheduling) into Deep Data Analytics and Semantic Search.]
39. Use Cases: Text Mining & Linked Data
▸ Climate Tagger (PDF)
Streamline and catalogue data and information resources
▸ healthdirect Australia (PDF)
Semantic Search based on the Australian Health Thesaurus
▸ CTCN Semantic Search
Integrating thousands of documents from several sources on climate technology
▸ European Innovation Partnership (EIP) on Water
Online Marketplace including semantic Matchmaking
40. Climate Tagger
Helps organizations in the climate and development arenas catalogue, categorize, contextualize, and connect data and information resources. Climate Tagger is backed by the expansive Climate Compatible Development Thesaurus.
http://www.climatetagger.net
42. EIP Water Matchmaking
Controlled vocabularies enable accurate matchmaking between supply and demand for water innovation in Europe. Matchmaking is based upon the EIP Water Innovation Thesaurus (GEMET-based).
http://www.eip-water.eu
43. CTCN Semantic Search
Helps organisations in the climate technology field explore and find relevant content from thousands of Drupal nodes and several sources, using PoolParty, PowerTagging, and sOnr webMining. CTCN is backed by the CTCN Climate Technology Thesaurus.
https://www.ctc-n.org/semantic-search
44. healthdirect Australia
Integrated views and semantic search over more than 100 trusted sources. Harmonization of various metadata systems through the use of a central vocabulary hub: the Australian Health Thesaurus.
http://www.healthdirect.gov.au
45. SUMMARY: WHY TAXONOMISTS AND INFORMATION ARCHITECTS LIKE POOLPARTY
Read more
Different project stakeholders expect specific qualities from a semantic technology platform:
'I am a taxonomist. I need a tool that provides convenient functionalities and intuitive user interfaces for my daily work.'
'I am an information architect. Enterprise metadata management deserves scalable technologies, which provide semantic services on top of rich APIs based on standards.'
Welcome - 3’ - Martin & Timea
SWC & PP - 10’ max - Martin
Using PP - 10 - Timea
Demos - 12’ - Timea
Customer Stories - 10’ (max) - Martin
TOTAL = 45’ plus Q&A
At the core of each application built upon a semantic information architecture, we clearly distinguish between the content layer, the metadata layer, the semantic layer, and the navigation logic on top.
Metadata layer: not actionable on its own.
Semantic layer: adds meaning to the metadata. Strings become 'things' that can be linked with each other and enriched with more data.
By adding a semantic layer that contains facts, one can quickly find the document when searching for 'museum', even if the word 'museum' does not appear in the text.
A semantic layer is a network (or graph) of things, including their relations and attributes such as their various names. This layer serves as glue that links all information available about a certain business object ('thing' or 'resource') scattered across various repositories and data silos, in order to create a complete picture of it.
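A small runnable illustration of the 'museum' example, assuming invented example.org URIs: the document is tagged only with the 'Peggy Guggenheim Collection' concept, yet a SPARQL query that walks skos:broader finds it when searching for 'museum'.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS, DCTERMS

EX = Namespace("http://example.org/")
g = Graph()

# Facts in the semantic layer (names and URIs invented for the sketch):
# the document is tagged with the 'Peggy Guggenheim Collection' concept,
# and that concept is narrower than 'museum'.
doc, pgc, museum = EX.doc1, EX.pgc, EX.museum
g.add((doc, DCTERMS.subject, pgc))
g.add((pgc, SKOS.prefLabel, Literal("Peggy Guggenheim Collection", lang="en")))
g.add((pgc, SKOS.broader, museum))
g.add((museum, SKOS.prefLabel, Literal("museum", lang="en")))

# Searching for 'museum' finds the document although the word never
# occurs in its text: the query walks skos:broader in the graph.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?doc WHERE {
  ?doc dct:subject/skos:broader* ?concept .
  ?concept skos:prefLabel ?label .
  FILTER (LCASE(STR(?label)) = "museum")
}
"""
for row in g.query(query):
    print(row.doc)   # -> http://example.org/doc1
```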
* Disambiguation: 'Jaguar is owned by Tata Motors' - Jaguar is a homograph.
* Synonyms: different terms that refer to the same thing.
* A polyhierarchy describes an entity or concept as a child of at least two parent concepts.
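In SKOS terms, these three phenomena look roughly as follows (concept URIs are invented for the sketch):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/concepts/")
g = Graph()

# Synonyms: one concept, several labels that refer to the same thing.
g.add((EX["jaguar-car"], RDF.type, SKOS.Concept))
g.add((EX["jaguar-car"], SKOS.prefLabel, Literal("Jaguar", lang="en")))
g.add((EX["jaguar-car"], SKOS.altLabel, Literal("Jaguar Cars", lang="en")))

# Homograph: the same string is the label of a different concept.
g.add((EX["jaguar-animal"], SKOS.prefLabel, Literal("Jaguar", lang="en")))

# Polyhierarchy: one concept with two parent concepts.
g.add((EX["jaguar-car"], SKOS.broader, EX["british-brands"]))
g.add((EX["jaguar-car"], SKOS.broader, EX["tata-motors-subsidiaries"]))

print(g.serialize(format="turtle"))
```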
Taxonomies based on the SKOS standard
‘Setting the rules’ for text mining & entity extraction via thesaurus
Ontologies based on RDF Schema or OWL
Word form dictionaries
Blacklists and stop word lists
Annotated concepts are then compared with the surroundings of the potentially ambiguous extracted entities in a given text; see the sketch below.
Domain-specific reference document corpus
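A toy version of that comparison, with invented context-word sets per concept; PoolParty derives such context from the knowledge model and reference corpus rather than from hard-coded sets:

```python
# Toy word-overlap disambiguation: choose the concept whose annotated
# context best matches the surroundings of an ambiguous mention.
# Concept URIs and context word sets are invented for the sketch.
CONTEXTS = {
    "http://example.org/concepts/jaguar-car":    {"tata", "motors", "vehicle", "brand"},
    "http://example.org/concepts/jaguar-animal": {"cat", "predator", "amazon", "wildlife"},
}

def disambiguate(window):
    """Return the concept URI with the largest context-word overlap."""
    return max(CONTEXTS, key=lambda uri: len(CONTEXTS[uri] & window))

window = {"jaguar", "is", "owned", "by", "tata", "motors"}
print(disambiguate(window))
# -> http://example.org/concepts/jaguar-car
```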
Show drupal.poolparty.biz/PoolParty
Show drupal.poolparty.biz