Due to the decentralised and autonomous architecture of the Web of Data, data replication and local deployment of SPARQL endpoints are inevitable. Nowadays, it is common to have multiple copies of the same dataset accessible through various SPARQL endpoints, which leads to the problem of selecting the optimal data source for a user query based on the properties of the data and the requirements of the user or the application. Quality of Service (QoS) parameters can play a pivotal role in selecting the optimal data source according to the user's requirements. QoS parameters have been widely studied in the context of web service selection. However, to the best of our knowledge, the potential of associating QoS parameters with SPARQL endpoints for optimal data source selection has not been investigated.
In this paper, we define various QoS parameters associated with SPARQL endpoints and present a semantic model for QoS parameters
and their evaluation. We present a monitoring service for the SPARQL
endpoint which automatically evaluates the QoS metrics of any given
SPARQL endpoint. We demonstrate the utility of our monitoring service
by implementing an extension of the SPARQL query language, which
caters for user requirements based on QoS parameters and selects the
optimal data source for a particular user query over federated sources.
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Databricks
Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very hard topic. Spark HBase Connector (SHC) provides feature-rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key value store and complex relational SQL queries, and enables users to perform complex data analytics on top of HBase using Spark.
SHC implements the standard Spark data source APIs, and leverages the Spark catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes the maintenance very easy, while achieving a good tradeoff between performance and simplicity. Also, SHC has integrated natively with Phoenix data types. With SHC, Spark can execute batch jobs to read/write data from/into Phoenix tables. Phoenix can also read/write data from/into HBase tables created by SHC. For example, users can run a complex SQL query on top of an HBase table created by Phoenix inside Spark, perform a table join against a DataFrame which reads the data from a Hive table, or integrate with Spark Streaming to implement a more complicated system.
This session will demonstrate how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multi-HBase clusters and how Spark reads/writes data from/into Phoenix tables with SHC, etc. It will also benefit people who use Spark and other data sources (besides HBase) as it inspires them with ideas of how to support high performance data source access at the Spark DataFrame level.
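To make the DataFrame-level access concrete, here is a minimal PySpark sketch of reading an HBase table through SHC. It assumes the SHC jar is on the Spark classpath; the data source class name and the "catalog" option key follow common SHC examples and should be treated as assumptions, and the table and column names are made up for illustration.

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shc-demo").getOrCreate()

    # Catalog mapping an HBase table ("Contacts") onto a DataFrame schema.
    catalog = json.dumps({
        "table": {"namespace": "default", "name": "Contacts"},
        "rowkey": "key",
        "columns": {
            "id":   {"cf": "rowkey",   "col": "key",  "type": "string"},
            "name": {"cf": "personal", "col": "name", "type": "string"},
        },
    })

    df = (spark.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")  # SHC data source
          .load())

    # Projections and filters like these are candidates for SHC's column
    # pruning and predicate pushdown.
    df.select("id", "name").filter(df.id == "row42").show()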
Many organizations today process various types of data in different formats, and most often this data is free-form. As the number of consumers of this data grows, it becomes imperative that this free-flowing data adhere to a schema. A schema gives data consumers a clear expectation of the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate with and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes, and so on.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka and Apache Storm.
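As a rough illustration of the register/consume flow described above, here is a hedged Python sketch that talks to a schema registry's REST API with the requests library. The base URL and endpoint paths are hypothetical placeholders, not the exact Hortonworks Schema Registry API.

    import json
    import requests

    REGISTRY = "http://schema-registry.example.com:9090/api/v1"  # hypothetical base URL

    truck_schema = {
        "type": "record",
        "name": "Truck",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "speed", "type": "double"}],
    }

    # Producer side: register a new version of the "truck" schema (hypothetical path).
    resp = requests.post(f"{REGISTRY}/schemas/truck/versions",
                         json={"schemaText": json.dumps(truck_schema)})
    resp.raise_for_status()
    print("registered:", resp.json())

    # Consumer side: fetch the latest version before deserializing incoming
    # records, so a schema change upstream does not break the consumer.
    latest = requests.get(f"{REGISTRY}/schemas/truck/versions/latest").json()
    print("latest schema:", latest)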
GDPR Community Showcase for Apache Ranger and Apache AtlasDataWorks Summit
The communities for Apache Atlas and Apache Ranger, which are foundational components for Security and Governance across the Hadoop stack, have spawned a robust industry ecosystem of tools and platforms. Such industry solutions build upon the extensibility offered via open and robust APIs and integration patterns to provide innovative “better together” capabilities. In this talk, we will showcase how the ecosystem of solutions being built by different vendors provides value-added capabilities to address the key aspects of securing and governing your data lakes based on the Apache Ranger and Apache Atlas frameworks. The talk will feature multiple ecosystem demonstrations, including how to identify, map, and classify personal data, harvest and maintain metadata, track and map the movement of data through your enterprise, and enforce appropriate controls to monitor access and usage of personal data.
Come hear from community partners:
-Balaji Ganesan from Privacera will showcase how Privacera integrates with and leverages Apache Ranger and Apache Atlas features to help with GDPR compliance
-Greg Goldsmith and Jordan Martz from Attunity will showcase how Attunity’s solutions integrate into Apache Atlas to provide robust chain of custody and classifications required for GDPR
-Somil Kulkarni from IBM will demonstrate how IBM Information Governance Catalog integrates with Apache Atlas to exchange metadata and build a connected solution for GDPR compliance that harnesses both open source community enhancements and IBM’s innovations in the governance space.
Speakers
Ali Bajwa, Principal Solutions Engineer, Hortonworks
Srikanth Venkat, Senior Director Product Management, Hortonworks
Migration from Oracle to PostgreSQL: NEED vs REALITYAshnikbiz
Some of the largest organizations in the world today are cutting costs by innovating in their database layer. Migrating workloads from legacy systems to an enterprise open source database technology like Postgres is a preferred choice for many.
Apache Pulsar at Tencent Game: Adoption, Operational Quality Optimization Exp...StreamNative
After nearly 10 years of development of Tencent Game big data, the daily data transmission volume can reach 1.7 trillion. As a key component of the big data platform, the MQ system is critical to providing real-time service operational quality assurance, and it must support applications such as real-time game operations, real-time index data analysis, and real-time personalized recommendation. With the fast growth of the gaming business and the continuous expansion of data, the challenge of assuring real-time service operational quality keeps increasing.
In this presentation, we will introduce the development history of Tencent Game big data technology and our practical experience optimizing operational service quality for Apache Pulsar in Tencent Game real-time service scenarios.
Stream processing has become the de facto standard for building real-time ETL and stream analytics applications. We see batch workloads move into stream processing to act on the data and derive insights faster. With the explosion of data carrying "perishable insights", such as IoT and machine-generated data, stream processing plus predictive analytics is driving tremendous business value. This is evidenced by the proliferation of stream processing frameworks, from the proven and still-evolving Apache Storm to newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming.
Today, users have to choose among these frameworks, understand the benefits of each, learn new APIs, and operationalize their applications. To create value faster, we are introducing a new open source tool, Streamline: a self-service framework that makes it easy to build streaming applications and deploy them across whichever frameworks/engines users prefer. It simplifies integration with machine learning models for scoring and classification of data for predictive analytics, and it provides an elegant way to build analytics dashboards that derive business insights from streaming data and let business users consume them easily.
In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline capabilities to show how it simplifies building real-time streaming analytics applications.
Speaker:
Priyank Shah, Staff Software Engineer, Hortonworks
Apache Hive is an enterprise data warehouse built on top of Hadoop. Hive supports Insert/Update/Delete SQL statements with transactional semantics and read operations that run at snapshot isolation. This talk will describe the intended use cases, the architecture of the implementation, new features such as the SQL MERGE statement, and recent improvements. The talk will also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL. This API is used by Apache NiFi, Storm and Flume to stream data directly into Hive tables and make it visible to readers in near real time.
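As a sketch of the transactional semantics mentioned above, the following Python snippet issues a Hive MERGE (upsert) through PyHive; it assumes an ACID-enabled (transactional, ORC, bucketed) target table, and the host name and table names are placeholders.

    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl")
    cur = conn.cursor()

    # Upsert staged rows into the ACID table in a single transactional statement.
    cur.execute("""
        MERGE INTO customers AS t
        USING customers_staging AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET email = s.email
        WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email)
    """)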
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
When interacting with analytics dashboards, two key requirements for a smooth user experience are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, but they are not optimized for ingesting streaming data and making it available for queries in real time. Their long query latencies also make these systems sub-optimal choices for powering interactive dashboards and BI use cases.
In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency real-time data ingestion and fast, sub-second, ad hoc data exploration queries.
Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases.
Agenda -
1) Introduction and Ideal Use cases for Druid
2) Data Architecture
3) Streaming Ingestion with Kafka
4) Demo using Druid, Kafka and Superset.
5) Recent Improvements in Druid moving from lambda architecture to Exactly once Ingestion
6) Future Work
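For a flavour of the sub-second ad hoc queries discussed above, here is a small Python sketch posting a query to Druid's SQL endpoint with the requests library. The broker address and the wikipedia datasource are placeholders borrowed from the usual Druid quickstart setup.

    import requests

    BROKER = "http://druid-broker.example.com:8082"  # placeholder broker address

    sql = """
        SELECT channel, COUNT(*) AS edits
        FROM wikipedia
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 10
    """

    # Druid's SQL API accepts the query as JSON and returns one JSON object per row.
    resp = requests.post(f"{BROKER}/druid/v2/sql", json={"query": sql})
    resp.raise_for_status()
    for row in resp.json():
        print(row)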
A Journey to Reactive Function ProgrammingAhmed Soliman
A gentle introduction to functional reactive programming that highlights the Reactive Manifesto and ends with a demo in RxJS: https://github.com/AhmedSoliman/rxjs-test-cat-scope
Many enterprises are implementing Hadoop projects to manage and process large datasets. A big question is how to configure Hadoop clusters to connect to an enterprise directory containing 100k+ users and groups for access management. Several large enterprises have complex directory servers for managing users and groups, and many advanced features have recently been added to Hadoop user management in order to support various complex directory server structures.
In this session attendees will learn about setting up Hadoop nodes with users from Active Directory for executing Hadoop jobs, setting up authentication for enterprise users, and setting up authorization for users and groups using Apache Ranger. Attendees will also learn about the common challenges faced in enterprise environments while interacting with Active Directory, including filtering out the users to be brought into Hadoop from Active Directory, restricting access to a set of users from Active Directory, handling users from nested group structures, etc.
Speakers
Sailaja Polavarapu, staff Software Engineer, Hortonworks
Velmurugan Periasamy, Director - Engineering, Hortonworks
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer; thereby providing a consistent approach for managing and monitoring the services. This also provided a natural extension point for operators and the community to bring in their own add-on services and “plug-in” the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been the strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack, a cluster can have only a single instance of a service, and a host can have only a single instance of a component. Taking into consideration the various use case scenarios that cannot be enabled due to these limitations, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that wouldn’t have been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug their services into a stack and lets them define their own upgrade story for these custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will also discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases, supporting multiple instances of a service in a cluster and multiple instances of a component on a host, as well as future-proofing the Ambari architecture to leverage some of the advancements happening in the Hadoop community such as YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
Hortonworks technical workshop operations with ambariHortonworks
Ambari continues on its journey of provisioning, monitoring and managing enterprise Hadoop deployments. With 2.0, Apache Ambari brings a host of new capabilities including updated metric collections; Kerberos setup automation and developer views for Big Data developers. In this Hortonworks Technical Workshop session we will provide an in-depth look into Apache Ambari 2.0 and showcase security setup automation using Ambari 2.0. View the recording at https://www.brighttalk.com/webcast/9573/155575. View the github demo work at https://github.com/abajwa-hw/ambari-workshops/blob/master/blueprints-demo-security.md. Recorded May 28, 2015.
Learn how Hortonworks Data Flow (HDF), powered by Apache Nifi, enables organizations to harness IoAT data streams to drive business and operational insights. We will use the session to provide an overview of HDF, including detailed hands-on lab to build HDF pipelines for capture and analysis of streaming data.
Recording and labs available at:
http://hortonworks.com/partners/learn/#hdf
Apache Spark 2.0 set the architectural foundations of structure in Spark, unified high-level APIs, structured streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Apache Spark 2.3 & 2.4 has made similar strides too. In this talk, we want to highlight some of the new features and enhancements, such as:
• Apache Spark and Kubernetes
• Native Vectorized ORC and SQL Cache Readers
• Pandas UDFs for PySpark
• Continuous Stream Processing
• Barrier Execution
• Avro/Image Data Source
• Higher-order Functions
Speaker: Robert Hryniewicz, AI Evangelist, Hortonworks
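As a small illustration of one of the items above, the Pandas UDFs introduced in Spark 2.3, here is a minimal PySpark sketch of a scalar Pandas UDF (PyArrow must be installed); the column and function names are made up for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

    @pandas_udf("double", PandasUDFType.SCALAR)
    def celsius_to_fahrenheit(c):
        # Receives and returns a whole pandas Series, avoiding per-row overhead.
        return c * 9.0 / 5.0 + 32.0

    df = spark.createDataFrame([(0.0,), (21.5,), (100.0,)], ["celsius"])
    df.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()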
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopHortonworks
For the first time, Hortonworks Data Platform ships with Apache Storm for processing stream data in Hadoop.
In this presentation, Himanshu Bari, Hortonworks senior product manager, and Taylor Goetz, Hortonworks engineer and committer to Apache Storm, cover Storm and stream processing in HDP 2.1:
+ Key requirements of a streaming solution and common use cases
+ An overview of Apache Storm
+ Q & A
Akka is a runtime framework for building resilient, distributed applications in Java or Scala. In this webinar, Konrad Malawski discusses the roadmap and features of the upcoming Akka 2.4.0 and reveals three upcoming enhancements that enterprises will receive in the latest certified, tested build of Typesafe Reactive Platform.
Akka Split Brain Resolver (SBR)
Akka SBR provides advanced recovery scenarios in Akka Clusters, improving on the safety of Akka’s automatic resolution to avoid cascading partitioning.
Akka Support for Docker and NAT
Run Akka Clusters in Docker containers or NAT with complete hostname and port visibility on Java 6+ and Akka 2.3.11+
Akka Long-Term Support
Receive Akka 2.4 support for Java 6, Java 7, and Scala 2.10
Slides from http://www.meetup.com/Reactive-Systems-Hamburg/events/232887060
Barys and Simon talked about Akka Cluster. Cluster Sharding makes it possible to transparently distribute work in an Akka cluster, with automatic balancing, migration of workers and automatic restarts in case of errors. Cluster PubSub offers the publish/subscribe pattern. Akka Distributed Data offers eventually consistent data structures across the cluster that can be used to keep the cluster's state.
They talked about the Akka modules and explained how they interplay. Finally, they shared what Risk.Ident has learned from running a reactive application based on Akka Cluster in production for almost a year.
The SWA Country Stories capture best practices from partners around the world. They include partners' experiences in using the SWA partnership to advance the cause of water, sanitation and hygiene in their countries and in implementing the commitments countries made at the SWA High Level Meetings. For more information: sanitationandwaterforall.org
Drawing on FANSA's experience of engaging with SWA, Ramisetty Murali from Fresh Water Action Network South Asia (FANSA) gave a presentation on the topic of "Learning and achievements of SWA Global platform and its relevance to achieving Hygiene and Sanitation Development in India".
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference InformationKai Schlegel
Presentation given by Kai Schlegel at the 5th International Workshop on Data Engineering meets the Semantic Web (DESWeb), held in conjunction with ICDE 2014, Chicago, IL, USA, March 31, 2014.
A practical introduction to Oracle NoSQL Database - OOW2014Anuj Sahni
Not familiar with Oracle NoSQL Database yet? This great product introduction session discusses the primary functionality included with the product as well as integration with other Oracle products. It includes a live demo that illustrates installation and configuration as well as data modeling and sample NoSQL application development.
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsEnrico Daga
Presented at #SALAD2015
The heterogeneity of methods and technologies for publishing open data is still an obstacle to developing distributed systems on the Web. On the one hand, Web APIs, the most popular approach to offering data services, implement REST principles, which focus on addressing loose coupling and interoperability issues. On the other hand, Linked Data, available through SPARQL endpoints, focuses on data integration between distributed data sources. We propose BASIL, an approach to building Web APIs on top of SPARQL endpoints, in order to benefit from the advantages of both the Web API and Linked Data approaches. Compared to similar solutions, BASIL aims at minimising the learning curve for users in order to promote its adoption. The main feature of BASIL is a simple API that does not introduce new specifications, formalisms or technologies for users from either the Web API or Linked Data communities.
The Query Service is the new platform solution for querying a variety of data sources. The goal of the Query Service is to let administrators configure a metadata description of a data source that end users can then use without detailed knowledge of the underlying data source. This session explains how to configure Query Service data sources and use them with the RESTful API or component collection.
Capital One: Using Cassandra In Building A Reporting PlatformDataStax Academy
As a leader in the financial industry, Capital One applications generate huge amounts of data that require fast and accurate handling, storage and analysis. We are transforming how we report operational data to our internal users so that they can make quick and precise business decisions to serve our customers. As part of this transformation, we are building a new Go-based data processing framework that will enable us to transfer data from multiple data stores (RDBMS, files, etc.) to a single NoSQL database - Cassandra. This new NoSQL store will act as a reporting database that will receive data on a near real-time basis and serve the data through scorecards and reports. We would like to share our experience in defining this fast data platform and the methodologies used to model financial data in Cassandra.
A Practical Guide To End-to-End Tracing In Event Driven ArchitecturesHostedbyConfluent
"Can you determine how a given event came to be? Is it an aggregation, a combination of multiple events with different sources? What are its origins?
As event driven architectures become more sophisticated, with features such as stateful stream processing, data joining, and multi-cluster flows, it becomes harder to trace the path of an event, its origins and touch points. At the same time, it also becomes more important.
Using code examples and usage scenarios we will dive into the tracing capabilities of OpenTelemetry for Kafka clients, including those using the Consumer/Producer and Kafka Streams libraries, as well as the Connect and ksqlDB platforms. This will culminate in an end-to-end tracing pipeline demonstration.
This talk will cover the following topics:
- Distributed tracing concepts, including context propagation and the OpenTelemetry implementation stack
- OpenTelemetry’s Kafka instrumentation, what is supported out of the box, code examples, edge cases, challenges and solutions
- A demonstration of an end-to-end tracing implementation
In this session, you will gain an understanding of the importance of end-to-end traceability, and several tools & examples for improving observability in your own distributed event driven applications."
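To give a feel for the client-side instrumentation discussed above, here is a hedged Python sketch that traces a kafka-python producer with OpenTelemetry's Kafka instrumentation package; the broker address and topic are placeholders, and the setup is deliberately minimal (spans are just printed to stdout).

    from kafka import KafkaProducer
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
    from opentelemetry.instrumentation.kafka import KafkaInstrumentor

    # Minimal tracer setup that prints finished spans to stdout.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    # Auto-instrument kafka-python: each send() gets a span and the trace context
    # is injected into the message headers so downstream consumers can continue it.
    KafkaInstrumentor().instrument()

    producer = KafkaProducer(bootstrap_servers="kafka.example.com:9092")
    producer.send("orders", b'{"order_id": 42}')
    producer.flush()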
SAP FIORI COEP Pune - pavan golesar (ppt)Pavan Golesar
Hi,
This material is not for commercial purpose, Disclaimer: Copyright content included.
For learning purpose only.
sapparamount@gmail.com
Pavan Golesar
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...DataStax
Apache Cassandra makes it possible to execute millions of operations per second in a scalable fashion. Harnessing the power of C* leaves many developers pondering the following:
- Is my data model appropriate and not going to end up as wide partition(s) causing heap pressure and other issues?
- How do I tune my connection pool configuration? What are the optimal settings for my environment ?
- What is my C* cluster capacity in terms of number of IOPs for a given 95th and 99th latency?
- How do I perf-test my data access layer?
In this talk, Vinay Chella, Cloud Data Architect @ Netflix, will share the open source tools, techniques and platform (NDBench) that Netflix uses to perf-test its C* fleet with simulations of millions of operations per second.
About the Speaker
Vinay Chella Cloud Data Architect, NETFLIX Inc
Vinay Chella is a Cloud Data Architect at Netflix with a deep understanding of Cassandra and other RDBMSs. As an engineer and architect, he works extensively on data modeling, performance tuning and guiding best practices for various persistence stores, helping various teams @ Netflix build next-generation data access layers.
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.
To fill this gap some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system.
We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical considerations, such as balancing SQL/RDBMS semantics against the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
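A minimal plain-Python sketch of just one of the techniques above, skipping rank updates for vertices whose rank has already converged; it is a toy heuristic for illustration, not the STICD algorithm, and it assumes a graph without dangling nodes.

    def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
        """graph: dict mapping each vertex to the list of its out-neighbours."""
        n = len(graph)
        rank = {v: 1.0 / n for v in graph}
        converged = set()
        incoming = {v: [] for v in graph}           # precompute in-neighbours once
        for u, outs in graph.items():
            for v in outs:
                incoming[v].append(u)
        for _ in range(max_iter):
            new_rank = dict(rank)
            for v in graph:
                if v in converged:                  # skip vertices that have settled
                    continue
                s = sum(rank[u] / len(graph[u]) for u in incoming[v])
                new_rank[v] = (1.0 - d) / n + d * s
                if abs(new_rank[v] - rank[v]) < tol:
                    converged.add(v)
            rank = new_rank
            if len(converged) == n:
                break
        return rank

    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))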
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
The affect of service quality and online reviews on customer loyalty in the E...
How good is your SPARQL endpoint? A QoS-Aware SPARQL Endpoint Monitoring and Data Source Selection Mechanism for Federated SPARQL Queries
1. How good is your SPARQL endpoint?
A QoS-aware SPARQL endpoint monitoring and data source
selection mechanism for federated SPARQL queries
Ali Intizar and Alessandra Mileo
5. Linked Open Data and SPARQL Endpoints
• Linked Data
• LOD cloud
• SPARQL Endpoints
  • Both public and private
  • Allow easy access to linked data using SPARQL queries
  • Provide a querying interface
• Open Data Management Tools
  • Datahub
  • LOD Stats
• SPARQL Endpoint Description
  • Vocabulary for Interlinking Datasets (VoID)
  • Service Description (SD)
14. Ranking of the SPARQL Endpoints
• Multiple SPARQL endpoints can represent the same dataset
• Which one is the best for me?
• Ranking of the SPARQL endpoints
  • Based on QoI/QoS parameters
19. QoS Parameters for SPARQL Endpoints
For QoS-based ranking of the SPARQL endpoints:
• Identification of the various QoS parameters associated with the SPARQL endpoints
• Semantic representation of the identified QoS parameters
• Extension of the existing SPARQL endpoint description vocabularies (VoID/SD) to associate QoS parameters
• Evaluation techniques for the QoS metrics
• Continuous monitoring of the SPARQL endpoints to generate QoS profiles
21. QoS Parameters for SPARQL Endpoints
• Performance
  • Response Time
  • Execution Time
  • Throughput
  • Error Rate
• Data Quality
  • Accuracy
  • Data Consistency
  • Completeness
  • Freshness
23. QoS Parameters for SPARQL Endpoints
• Interoperability
  • SPARQL Version
  • Additional Features
  • Restricted Features
• Availability
  • UpTime
  • DownTime
  • MeanUpTime
  • MTTR
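As an illustration of how the availability figures above could be derived from a monitoring log, here is a small Python sketch; the probe format (one boolean per fixed-interval check) is an assumption for illustration, and MTTR is approximated as the mean duration of an outage.

    def availability_metrics(probes, interval_s=60):
        """probes: chronological list of booleans, True = endpoint answered the probe."""
        up = sum(probes) * interval_s
        down = (len(probes) - sum(probes)) * interval_s
        # An outage is a maximal run of failed probes.
        outages, run = [], 0
        for ok in probes:
            if ok:
                if run:
                    outages.append(run * interval_s)
                run = 0
            else:
                run += 1
        if run:
            outages.append(run * interval_s)
        mttr = sum(outages) / len(outages) if outages else 0.0
        total = up + down
        return {"uptime_s": up, "downtime_s": down,
                "availability": up / total if total else 0.0, "mttr_s": mttr}

    print(availability_metrics([True, True, False, False, True, True, False, True]))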
28. QoS Parameters for SPARQL Endpoints
For QoS-based ranking of the SPARQL endpoints:
• Identification of the various QoS parameters associated with the SPARQL endpoints
• Semantic representation of the identified QoS parameters
• Extension of the existing SPARQL endpoint description vocabularies (VoID/SD) to associate QoS parameters
• Evaluation techniques for the QoS metrics
• Continuous monitoring of the SPARQL endpoints to generate QoS profiles
29. QoS Parameters for SPARQL Endpoints
• Semantic Description of SPARQL Endpoint (VoID/SD)
• QoS Profile of SPARQL Endpoints
[Diagram: a SPARQL Endpoint is linked through a "has" property to a QoSProfile; QoSProfileEndpoint, QoSProfileDefault and QoSProfileUser are subclasses of QoSProfile, i.e. a QoS profile can be (1) endpoint-provided, (2) default, or (3) user-defined.]
33. QoS Parameters for SPARQL Endpoints
[Diagram: the QoS vocabulary. A QoSProfile contains QoSParameters. Each QoSParameter has a Name (hasName), a QoSWeight (hasWeight), a QoSCategory (hasCategory), a Tendency (hasTendency) and a QoSMetric (hasMetric), and can be linked to external definitions via sameAs. A QoSMetric is measured in a QoSUnit (isMeasuredIn) and has a Value (hasValue), which is either a NumericValue or a TextValue. Metrics are subdivided into NumericMetric (ExactNumeric, or IntervalNumeric with start and end) and NonNumericMetric (BooleanMetric with Yes/No, LinguisticMetric with text values, and GradingMetric with Low/Mid/High grades).]
45. QoS Parameters for SPARQL Endpoints
For QoS-based ranking of the SPARQL endpoints
• Identification of the various QoS parameters associated with
the SPARQL endpoints
• Semantic representation of the identified QoS parameters
• Extension of the existing SPARQL endpoint description
vocabularies (VoID/SD) to associate QoS parameters
• Evaluation techniques for the QoS metrics
• Continuous monitoring of the SPARQL endpoints to generate
QoS profiles
46. Evaluation of the QoS Parameters
• Performance
• Response Time
Q1. SELECT ?p WHERE { <s> ?p <o> }
Q2. SELECT ?o WHERE { <s1> <p1> ?o .
                      <s2> <p2> ?o }
47. Evaluation of the QoS Parameters
• Performance
• Response Time
• Execution Time
Q1. SELECT ?p WHERE { <s> ?p <o> }
Q2. SELECT ?o WHERE { <s1> <p1> ?o .
                      <s2> <p2> ?o }
Q3. SELECT * WHERE { ?s ?p ?o } LIMIT 1000
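A minimal measurement sketch for these two metrics, assuming a hypothetical endpoint URL and using Q1 and Q3 from the slides; the paper's exact measurement protocol may differ.

import time
import requests

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint URL
Q1 = "SELECT ?p WHERE { <http://example.org/s> ?p <http://example.org/o> }"
Q3 = "SELECT * WHERE { ?s ?p ?o } LIMIT 1000"

def timed_query(endpoint, query):
    """Return the wall-clock time (seconds) needed to obtain the full result."""
    start = time.perf_counter()
    r = requests.get(endpoint, params={"query": query},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=60)
    r.raise_for_status()
    _ = r.json()                         # force the result body to be read and parsed
    return time.perf_counter() - start

response_time = timed_query(ENDPOINT, Q1)    # cheap lookup query
execution_time = timed_query(ENDPOINT, Q3)   # larger result set (LIMIT 1000)
print(f"ResponseTime={response_time:.3f}s ExecutionTime={execution_time:.3f}s")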
48. Evaluation of the QoS Parameters
• Performance
• Response Time
• Execution Time
• Throughput
Repeated execution of Q1.
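Throughput can be approximated by repeatedly issuing Q1 for a fixed window and counting completed queries per second; the sketch below is illustrative, with a hypothetical endpoint URL and an arbitrary window length.

import time
import requests

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint URL
Q1 = "SELECT ?p WHERE { <http://example.org/s> ?p <http://example.org/o> }"

def throughput(endpoint, query, window=60.0):
    """Completed queries per second over a fixed measurement window."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < window:
        try:
            r = requests.get(endpoint, params={"query": query}, timeout=30)
            if r.ok:
                completed += 1
        except requests.RequestException:
            pass                         # failures are covered by the error-rate metric
    return completed / window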
49. Evaluation of the QoS Parameters
• Performance
• Response Time
• Execution Time
• Throughput
• Error Rate
Measured by counting the errors returned by the SPARQL
endpoint during the execution of the queries
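A possible error-rate counter along these lines, again with a hypothetical endpoint URL and Q1 as the probe query.

import requests

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint URL
Q1 = "SELECT ?p WHERE { <http://example.org/s> ?p <http://example.org/o> }"

def error_rate(endpoint, query, runs=100):
    """Fraction of query executions for which the endpoint returned an error."""
    errors = 0
    for _ in range(runs):
        try:
            r = requests.get(endpoint, params={"query": query}, timeout=30)
            if not r.ok:                       # HTTP-level error reported by the endpoint
                errors += 1
        except requests.RequestException:      # timeouts, connection failures
            errors += 1
    return errors / runs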
50. Evaluation of the QoS Parameters
• Interoperability
• SPARQL Version
• Additional Features
• Restricted Features
• SPARQL 1.1 test data set
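One way such a check could be scripted is sketched below; the feature probes are illustrative stand-ins for the SPARQL 1.1 test data set, and the endpoint URL is hypothetical.

import requests

# Small SPARQL 1.1-specific queries; an endpoint limited to SPARQL 1.0 will reject them.
FEATURE_TESTS = {
    "BIND":       "SELECT ?x WHERE { BIND(1 AS ?x) }",
    "Aggregates": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "VALUES":     "SELECT ?x WHERE { VALUES ?x { 1 2 } }",
    "Subqueries": "SELECT ?s WHERE { { SELECT ?s WHERE { ?s ?p ?o } LIMIT 1 } }",
}

def supported_features(endpoint):
    """Return the subset of probed SPARQL 1.1 features the endpoint appears to support."""
    supported = []
    for name, query in FEATURE_TESTS.items():
        try:
            r = requests.get(endpoint, params={"query": query},
                             headers={"Accept": "application/sparql-results+json"},
                             timeout=30)
            if r.ok:
                supported.append(name)
        except requests.RequestException:
            pass
    return supported

print(supported_features("http://example.org/sparql"))   # hypothetical endpoint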
51. Evaluation of the QoS Parameters
• Availability
• UpTime
• DownTime
• MeanUpTime
• MTTR
• We rely on the service provider for the initial UpTime
• Periodic execution of query Q1 to monitor availability
• A DownTime counter is started whenever Q1 fails
• MeanUpTime is calculated as the percentage of time the
SPARQL endpoint was available since its initial UpTime
• Mean Time To Recover (MTTR) is calculated as the average
time taken by the SPARQL endpoint to recover after a failure
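A simplified monitoring loop for these availability metrics; the endpoint URL, probe interval and number of probes are illustrative assumptions, and an outage still ongoing at the end of the window is not counted towards MTTR.

import time
import requests

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint URL
Q1 = "SELECT ?p WHERE { <http://example.org/s> ?p <http://example.org/o> }"

def probe(endpoint, query, timeout=30):
    """Return True if the endpoint answers the probe query successfully."""
    try:
        r = requests.get(endpoint, params={"query": query},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=timeout)
        return r.ok
    except requests.RequestException:
        return False

def monitor(endpoint, interval=300, probes=12):
    """Periodically run Q1; derive MeanUpTime (%) and MTTR (seconds)."""
    up = 0
    downtimes, current_down = [], 0.0
    for _ in range(probes):
        if probe(endpoint, Q1):
            up += 1
            if current_down:                   # endpoint just recovered
                downtimes.append(current_down)
                current_down = 0.0
        else:
            current_down += interval           # accumulate DownTime
        time.sleep(interval)
    mean_uptime = 100.0 * up / probes
    mttr = sum(downtimes) / len(downtimes) if downtimes else 0.0
    return mean_uptime, mttr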
52. Evaluation of the QoS Parameters
• Licensing
• PDDL
• ODC-By
• ODC-ODbL
• CC0 1.0 Universal
Q6.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?license WHERE {
?ds a void:Dataset .
?ds dcterms:license ?license .
}
53. Evaluation of the QoS Parameters
• Dataset Description
• Vocabulary of Interlinked Datasets (VoID)
• Service Description
Q4.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?ds WHERE {
?ds a void:Dataset .
?ds void:sparqlEndpoint <SPARQLEndpointURI>
}
54. Evaluation of the QoS Parameters
• ResultSet
• Size Limit
• Result Format
Q5.
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?format WHERE {
?s a sd:Service .
?s sd:endpoint <endpointURI> .
?s sd:resultFormat ?format . }
55. Evaluation of the QoS Parameters
• Data Quality
• Accuracy
• Data Consistency
• Completeness
• Freshness
Data quality is an overlap between quality of
information (QoI) and quality of service (QoS)
56. QoS Parameters for SPARQL Endpoints
For QoS-based ranking of the SPARQL endpoints
• Identification of the various QoS parameters associated with
the SPARQL endpoints
• Semantic representation of the identified QoS parameters
• Extension of the existing SPARQL endpoint description
vocabularies (VoID/SD) to associate QoS parameters
• Evaluation techniques for the QoS metrics
• Continuous monitoring of the SPARQL endpoints to generate
QoS profiles
59. Federated SPARQL Queries
• The SPARQL 1.1 federated query extension provides the SERVICE keyword
• Allows remote execution of SPARQL queries on several
endpoints
[Figure: a Federated SPARQL Query Engine with Source Selection, Indexing/Caching, Optimiser and Query Execution components, issuing SPARQL queries to multiple SPARQL endpoints.]
60. Federated SPARQL Queries
• Problem of data source selection
• Automated discovery of the SPARQL endpoints and execution of
any federated query over them
61. Federated SPARQL Queries
• Problem of data source selection
• Automated discovery of the SPARQL endpoints and execution of
any federated query over them
• Candidate Data Sources:
"Given a user's query Q and a set of n data sources
DS = { ds_i | i = 1..n },
we define the set of candidate data sources as
DS_c = { ds_c,j | j = 1..m }
that can potentially contribute to answering query Q, where
DS_c ⊆ DS and 1 ≤ m ≤ n."
62. Federated SPARQL Queries
• Problem of data source selection
• Automated discovery of the SPARQL endpoints and execution of
any federated query over them
• QoS Aware Data Sources:
"Given a set of candidate data sources DS_c,
we define the set of QoS aware data sources
DS_qos = { ds_qos,k | k = 1..l }
as the set of optimal data sources that can potentially contribute to
answering query Q and are compliant with the QoS requirements
stated in the query, where
DS_qos ⊆ DS_c and 1 ≤ l ≤ m ≤ n."
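A toy illustration of this selection step, assuming each candidate source's QoS profile has already been flattened into a plain dictionary of metric values; in the actual system these values would come from the RDF QoS profiles produced by the monitoring service.

def select_qos_aware(candidates, requirements):
    """
    candidates:   {endpoint_uri: {metric_name: value}} for the candidate sources DS_c
    requirements: list of (metric_name, predicate) pairs, e.g. ("ResponseTime", lambda v: v < 10)
    Returns DS_qos, the candidates whose profiles satisfy every requirement.
    """
    selected = {}
    for endpoint, profile in candidates.items():
        if all(name in profile and pred(profile[name]) for name, pred in requirements):
            selected[endpoint] = profile
    return selected

# Example usage with hypothetical endpoints and metric values:
candidates = {
    "http://host-a/sparql": {"ResponseTime": 4.2, "MeanUpTime": 97.0},
    "http://host-b/sparql": {"ResponseTime": 18.5, "MeanUpTime": 91.0},
}
requirements = [("ResponseTime", lambda v: v < 10), ("MeanUpTime", lambda v: v > 80)]
print(select_qos_aware(candidates, requirements))   # only host-a qualifies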
66. SPARQL Extension with QoS
• QoS requirements can be described as part of the SPARQL
query
• We introduce a new QOSREQ keyword in the SPARQL
query language
• The QOSREQ operator is applied to the triple pattern or BGP
immediately preceding the operator
• Comma-separated values of multiple QoS parameters within the
QOSREQ operator
• Comparison operators compare the user-defined QoS
requirements with the QoS profile of the SPARQL endpoint
67. SPARQL Extension with QoS
• QoS requirements can be described as part of the SPARQL
query
SELECT ?drug ?keggUrl ?chebiImage
WHERE {
  ?drug rdf:type drugbank:drugs .
  QOSREQ[ qs:ResponseTime < 10 , qs:SizeLimit > 10000 ]
  ?drug drugbank:keggCompoundId ?keggDrug .
  ?keggDrug bio2rdf:url ?keggUrl .
  {
    ?drug drugbank:genericName ?drugBankName .
    ?chebiDrug purl:title ?drugBankName .
  }
  QOSREQ[ qs:DatasetDescription = 'VoID' , qs:MeanUpTime > 80 ]
  ?chebiDrug chebi:image ?chebiImage . }
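As an illustration of how such an extended query might be pre-processed, the sketch below (a hypothetical example, not the authors' implementation) strips QOSREQ[...] blocks from the query text and turns them into constraint checks that can be evaluated against an endpoint's QoS profile.

import re
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq,
       "<=": operator.le, ">=": operator.ge}

QOSREQ_RE = re.compile(r"QOSREQ\[(.*?)\]", re.S)

def extract_constraints(query):
    """Return (plain SPARQL query, list of (parameter, op, value) constraints)."""
    constraints = []
    for block in QOSREQ_RE.findall(query):
        for part in block.split(","):
            m = re.match(r"\s*qs:(\w+)\s*(<=|>=|<|>|=)\s*'?([\w.]+)'?", part)
            if m:
                constraints.append(m.groups())
    return QOSREQ_RE.sub("", query), constraints

def satisfies(profile, constraints):
    """profile: dict of QoS metric values for one endpoint (e.g. from its QoS profile)."""
    for name, op, value in constraints:
        actual = profile.get(name)
        if actual is None:
            return False
        try:
            ok = OPS[op](float(actual), float(value))
        except ValueError:                 # non-numeric metrics, e.g. 'VoID'
            ok = OPS[op](str(actual), str(value))
        if not ok:
            return False
    return True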
75. Experimental Evaluation
• FedBench Benchmark
• A benchmark suite for federated SPARQL query evaluation
• Provides various data sets from Life Sciences, Linked
Data and Cross Domains
• 25 queries to evaluate the performance
• Testbed
• Datasets are deployed as SPARQL endpoints
• Multiple copies of the datasets to create a higher number of
candidate data sources
• Human intervention to create fluctuations
• Monitoring of the SPARQL endpoints for more than 2 months
• QoS profile generation and updates of the QoS metric values based
on continuous monitoring
78. Conclusion
• Identification and semantic representation of the QoS
parameters of the SPARQL endpoints
• QoS metrics evaluation mechanism
• A monitoring service for QoS evaluation
• SPARQL extension for expressing users' QoS requirements within
the query language
• QoS-aware federated SPARQL query evaluation
79. Future Work
• QoS monitoring over public SPARQL endpoints &
integration with SPARQLES
• Sophisticated mechanisms for Quality of Information
evaluation
• Taking QoS requirements as well as preferences into account
(Hard and Soft Constraints)
• QoS aggregated values