This document summarizes research on trade-offs in data integration systems. It discusses three main contributions:
1. A method to estimate response freshness using existing data summaries, which estimated freshness with a 6% error rate.
2. A maintenance process to maximize consistency under latency constraints by prioritizing cache entries involved in the current query and maintaining entries that are stale or slowly changing. This outperformed baseline policies.
3. An extension of the maintenance policy to consider both latency and space constraints, including cache replacement policies. This outperformed state-of-the-art replacement policies when implemented in CSPARQL.
The document concludes that balancing latency and consistency in data integration is challenging due to their trade-off relationship, and discusses directions for future work.
2. Agenda
• Introduction to Trade-offs in Integration Systems
• Requirements and Research Questions
• Contributions
• Conclusions and Future Work
3. Introduction
What is data integration?
• "Combining data from different distributed sources"¹.
Why is it important?
• Most queries require integrating data from various sources.
Why is it challenging?
• Sources are autonomous and distributed.
• Distributing the query among sources to provide the response has performance, scalability, and availability problems.
• Caching solves the above problems but leads to inconsistencies.
• Maintaining the cache increases latency.
1. https://en.wikipedia.org/wiki/Data_integration
5. Data integration
Data integration approaches:
• Data warehouse (DW)
• Low latency
• Low consistency
[Figure: latency vs. consistency quadrants. The data warehouse sits at low latency and low consistency, mediator systems at high latency and high consistency, and the ideal case at low latency and high consistency.]
7. Data Market: Lowest latency with a consistency threshold
Minimize cost (financial and latency) as long as consistency stays above a threshold.
Example dialogue from the slide:
User: "Find me emails of 'The North Face' customers."
Data market: "My existing data can provide you a response with 60% freshness."
User: "No, I want the fastest response with at least 80% freshness."
Data market: "To provide 80% freshness you need to wait 30 sec and pay $60."
User: "Ok."
Data market: "Here is the response."
(A toy version of this negotiation is sketched below.)
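The negotiation above boils down to picking the cheapest offer whose estimated freshness meets the user's threshold. The sketch below is not from the original work; `Offer` and `cheapest_acceptable` are assumed names, and the offers simply mirror the slide's numbers.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Offer:
    freshness: float   # estimated fraction of fresh data in the response (0..1)
    latency_s: float   # waiting time before the response is ready, in seconds
    price: float       # financial cost in dollars

def cheapest_acceptable(offers: List[Offer], min_freshness: float) -> Optional[Offer]:
    # Pick the offer minimizing (price, latency) among those meeting the freshness threshold.
    ok = [o for o in offers if o.freshness >= min_freshness]
    return min(ok, key=lambda o: (o.price, o.latency_s)) if ok else None

# The slide's negotiation: the free, immediate answer is only 60% fresh, so a
# threshold of 0.8 forces the user to accept the 30 s / $60 offer.
offers = [Offer(freshness=0.6, latency_s=0.0, price=0.0),
          Offer(freshness=0.8, latency_s=30.0, price=60.0)]
print(cheapest_acceptable(offers, 0.8))
```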
8. Research Question 1
How to optimally maintain data when consistency is restricted and latency must be minimized?
9. Summary of contribution 1
A method to estimate the response freshness using existing data summaries (JIST2014, ISWC2014).
• Extend summarization techniques to trace freshness.
• Indexing, histograms, and QTree
• Use the summary to estimate the response freshness.
Evaluation
• We managed to estimate the freshness of a query with a 6% error rate.
Future work
• Use more advanced summarizations to lower the error rate.
(An illustrative freshness-estimation sketch follows below.)
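As a rough illustration of contribution 1 (not the exact algorithm from the papers), the sketch below assumes a simple histogram-style summary that records, per bucket, how many cached values are believed to still be fresh; a response's freshness is then estimated as the fresh fraction of the buckets the query touches. `HistogramSummary`, `Bucket`, and `estimate_freshness` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class Bucket:
    total: int   # cached values summarized by this bucket
    fresh: int   # how many of them are believed to be unchanged at the source

class HistogramSummary:
    # Toy summary: one bucket per key range, tracking fresh vs. total entries.
    def __init__(self, buckets: Dict[str, Bucket]):
        self.buckets = buckets

    def estimate_freshness(self, touched: Iterable[str]) -> float:
        # Estimate response freshness as the fresh fraction of the buckets the query touches.
        total = fresh = 0
        for key in touched:
            bucket = self.buckets[key]
            total += bucket.total
            fresh += bucket.fresh
        return fresh / total if total else 1.0

# A query touching buckets "a" and "b" is estimated to be 140/180 (roughly 78%) fresh.
summary = HistogramSummary({"a": Bucket(100, 90), "b": Bucket(80, 50), "c": Bucket(40, 40)})
print(summary.estimate_freshness(["a", "b"]))
```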
10. Data integration
Data integration approaches:
• Data warehouse (DW)
• Low latency
• Low consistency
• Mediator systems (MS)
• High latency
• High consistency
[Figure: latency vs. consistency quadrants, contrasting the data warehouse (low latency, low consistency) and mediator systems (high latency, high consistency) with the ideal case (low latency, high consistency).]
12. Mediator system: Highest consistency with a latency threshold
[Diagram: the engine joins an RDF stream from a generator with background data exposed through a SPARQL endpoint.]
13. Mediator system: Highest consistency with a latency threshold
[Diagram: the same join, but the background data (SPARQL endpoint) is now accessed through a local view.]
14. Mediator system: Highest consistency with a latency threshold
[Diagram: a maintenance process refreshes the local view as its freshness decreases; this is a cost/quality trade-off.]
15. Research Question 2
How to optimally maintain data when latency is restricted and consistency must be maximized?
16. Summary of contribution 2
A maintenance process to maximize consistency under a latency constraint (WWW2015, ICWE2015).
• Query driven: maintain cache entries that are involved in the current evaluation.
• Freshness driven: maintain cache entries that
  • are stale
  • change less frequently
  • affect future evaluations
Evaluation
• The proposed approach outperforms a set of baseline policies.
This work has already been followed up:
• Queries with FILTER clauses (ICWE2016)
• Queries with complex join patterns (ISWC2016)
(A minimal maintenance-scheduling sketch follows below.)
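A minimal sketch of the combined query-driven and freshness-driven idea, under assumed names (`CacheEntry`, `refresh_within_budget`) and a simplified scoring formula: only entries involved in the current evaluation are considered, stale entries are preferred, slowly changing entries are favoured because a refresh stays valid longer, and refreshing stops once the latency budget is spent. This is an illustration, not the policy evaluated in the papers.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CacheEntry:
    key: str
    last_refresh: float   # time of the last refresh (seconds since epoch)
    change_rate: float    # expected changes per second at the source
    future_use: float     # expected number of future evaluations that read this entry
    refresh_cost: float   # estimated seconds needed to re-fetch this entry

def refresh_within_budget(involved: List[CacheEntry],
                          budget_s: float,
                          fetch: Callable[[str], None],
                          now: float = None) -> List[str]:
    # Query driven: only entries involved in the current evaluation are passed in.
    now = time.time() if now is None else now

    def score(e: CacheEntry) -> float:
        staleness = (now - e.last_refresh) * e.change_rate   # expected missed changes
        durability = 1.0 / (1.0 + e.change_rate)             # slowly changing entries stay fresh longer
        return staleness * durability * (1.0 + e.future_use)

    refreshed, spent = [], 0.0
    for entry in sorted(involved, key=score, reverse=True):
        if spent + entry.refresh_cost > budget_s:
            continue          # skip entries that would break the latency budget
        fetch(entry.key)      # re-fetch the entry from the remote source
        entry.last_refresh = now
        spent += entry.refresh_cost
        refreshed.append(entry.key)
    return refreshed
```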
17. Data integration
Data integration approaches:
• Data warehouse (DW)
• Low latency
• Low consistency
• Mediator systems (MS)
• High latency
• High consistency
Integration in a real system
[Figure: the latency vs. consistency quadrants repeated, positioning integration in a real system relative to the data warehouse, mediator systems, and the ideal case.]
18. Contributing the proposed policies to CSPARQL
• So far we assumed all the data required to provide the response exists in the local cache but needs to be maintained.
• What if the required data does not fit in the local cache?
[Diagram: a local cache of entries in front of the SERVICE provider.]
19. Research Question 3
How to take into account a space constraint while optimizing data integration with regard to latency or consistency constraints?
20. Summary of contribution 3
• An extension of the maintenance policy (contribution 2) to take into account both latency and space constraints.
• Fetching policies to cope with cache incompleteness
• A freshness-based cache replacement policy
• An implementation in CSPARQL
• Evaluation
  • The proposed replacement policy outperforms state-of-the-art replacement policies.
• Future work
  • Investigating more complex queries (e.g., with multiple SERVICE clauses, complex join patterns)
(A toy replacement-policy sketch follows below.)
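The replacement idea can be sketched as follows (illustrative only; the class names, the Poisson-style freshness estimate, and the reuse signal are assumptions rather than the policy implemented in CSPARQL): when the cache is full and a new entry arrives, evict the entry whose cached value is expected to contribute least to future consistency.

```python
import math
import time
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Cached:
    value: object
    last_refresh: float
    change_rate: float   # expected changes per second at the source
    hits: int = 0        # crude reuse signal

class FreshnessCache:
    # Toy freshness-based replacement: evict the entry expected to be the most stale and least reused.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: Dict[str, Cached] = {}

    def _utility(self, e: Cached, now: float) -> float:
        # Probability the cached value is still fresh, assuming Poisson changes at the source.
        p_fresh = math.exp(-e.change_rate * (now - e.last_refresh))
        return p_fresh * (1 + e.hits)

    def get(self, key: str) -> Optional[object]:
        entry = self.entries.get(key)
        if entry is not None:
            entry.hits += 1
            return entry.value
        return None   # miss: the caller fetches from the SERVICE provider and calls put()

    def put(self, key: str, value: object, change_rate: float) -> None:
        now = time.time()
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self._utility(self.entries[k], now))
            del self.entries[victim]
        self.entries[key] = Cached(value, now, change_rate)
```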
21. Conclusions
An ideal integration engine (low latency and high consistency) is not possible because these two dimensions are in a trade-off.
Contributions:
• Optimizing response latency with a consistency threshold has been studied in the context of the Data Marketplace.
• A maintenance policy to optimize response consistency with a latency threshold in the context of knowledge-based event processing.
• Introduction of space constraints to integrate my approach in CSPARQL.
[Figure: the latency vs. consistency quadrants summarizing where the data warehouse, mediator systems, and the ideal case sit.]
23. Data Integration
[Diagram: two cache-maintenance settings.
1. Maintaining the cache based on the latency constraint of the query (event detection): a data stream and a data source feed the cache; as freshness decreases, the maintenance process refreshes entries subject to the latency constraint.
2. Maintaining the cache based on the consistency constraint of the query (data market): several data sources feed the cache; as freshness decreases, the maintenance process refreshes entries subject to the consistency constraint.]
Soheila.dehghanzdeh@insight-centre.org, Unit for Reasoning and Querying
24. Mediator system: Highest consistency with a latency threshold
Query: find Twitter users that have been mentioned more than 5 times in the last minute and are followed by more than 1000 users.
[Diagram: a stream processor joins the Twitter mention stream (mentions of #X, #Y, and #Z) with follower counts from the Twitter Follower API; the cached counts (#X has 1007 followers, #Y has 2000, #Z has 500) can drift from the source (#X drops to 998 followers, #Z rises to 600).]
Result:
User | Mentioned | Followed by
#X | 7 | 1007
#Y | 6 | 2000
(A toy version of this windowed join is sketched below.)
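Conceptually, the engine counts mentions inside a sliding window and joins the result with follower counts held in the local view. The toy sketch below mimics that with plain Python collections; the thresholds and example numbers come from the slide, while the function and its signature are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

def windowed_join(mentions: List[Tuple[float, str]],   # (timestamp, user) mention events
                  followers: Dict[str, int],           # local view of the Twitter Follower API
                  now: float,
                  window_s: float = 60.0,
                  min_mentions: int = 5,
                  min_followers: int = 1000) -> Dict[str, Tuple[int, int]]:
    # Count mentions inside the sliding window, then join with the cached follower counts.
    counts = Counter(user for ts, user in mentions if now - ts <= window_s)
    return {user: (count, followers[user])
            for user, count in counts.items()
            if count > min_mentions and followers.get(user, 0) > min_followers}

# With the slide's numbers, #X (7 mentions, 1007 followers) and #Y (6 mentions, 2000 followers)
# qualify, while #Z (500 followers) does not. If #X's follower count silently drops to 998 at the
# source, the cached answer is inconsistent until the local view is refreshed.
```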
25. Contributing the proposed policies to CSPARQL
Requirements:
• A local cache R
• Fetch SERVICE from R
• Maintain R
• ESPER external time
The modified engine is available on GitHub.
[Diagram: time-stamped entries in the local cache in front of the SERVICE provider.]
26. Workloads with significant improvements with the proposed policy
We hypothesize that WSJ-WBM is more influential if:
• Hypothesis 1: the BKG data changes more slowly
• Hypothesis 2: the BKG data changes with more diversity in change rate
• Hypothesis 3: there is a negative correlation between the streaming rate and the change rate
• Hypothesis 4: the total number of possible events (i.e., caching space) is larger
The time overhead of WSJ-WBM is negligible.
27. Experiment setup
A data generator to produce various workloads with:
• Various change rate distributions within an interval (random or normal distribution)
• Various streaming rates: the inter-arrival time of elements follows a Poisson distribution with various lambda values
(An illustrative generator of this kind is sketched below.)
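A workload of this shape can be simulated roughly as follows; this is an assumed illustration rather than the generator used in the experiments. Inter-arrival times are drawn from an exponential distribution (i.e., a Poisson arrival process with rate lambda), and each background entry gets a change rate drawn uniformly or normally within the given interval.

```python
import random
from typing import List, Tuple

def generate_workload(n_events: int,
                      lam: float,                          # Poisson rate: expected stream elements per second
                      n_entries: int,
                      change_interval: Tuple[float, float],
                      normal_changes: bool = False,
                      seed: int = 42) -> Tuple[List[float], List[float]]:
    # Return (event timestamps, per-entry change rates) for one synthetic workload.
    rnd = random.Random(seed)
    t, timestamps = 0.0, []
    for _ in range(n_events):
        t += rnd.expovariate(lam)   # exponential inter-arrival times give Poisson arrivals
        timestamps.append(t)
    lo, hi = change_interval
    if normal_changes:
        mu, sigma = (lo + hi) / 2, (hi - lo) / 6
        rates = [min(max(rnd.gauss(mu, sigma), lo), hi) for _ in range(n_entries)]
    else:
        rates = [rnd.uniform(lo, hi) for _ in range(n_entries)]
    return timestamps, rates

# Example: 1000 stream elements at about 5 elements/second, 200 background entries whose
# change rates are drawn uniformly from [0.01, 1.0] changes per second.
events, change_rates = generate_workload(1000, lam=5.0, n_entries=200, change_interval=(0.01, 1.0))
```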
31. Hypothesis 4: the total number of possible events (i.e., caching space) is larger
32. The time overhead of WSJ-WBM is negligible
[Plot: overhead for local vs. remote background data.]
33. Combining RDF Streams and Remotely Stored Background Data
We move to an approximate setting, and we introduce a local view to store part of the data involved in the query processing, and update part of it to capture the dynamicity.
34. A query-driven maintenance process
SELECT * WHERE { WINDOW(S, ω, β) PW . SERVICE(BKG) PS }
[Diagram: the WINDOW clause feeds the join, while the SERVICE clause is answered from the Local View, which is refreshed by a Proposer, Ranker, and Maintainer pipeline; candidate policies include RND, LRU, WBM, CWSJ, WSJ, GNR, and FRP.]
Introduction to trade-offs
Given that not everybody knows my work, I give a summary of my work first, then …
I want to briefly explain what I did in previous years, and particularly I focus on what I did in the last year.
What is the problem?
Talk about the trade-off.
The problem I investigate is ….
First I studied this problem in the data warehouse setting.
Users want to keep low latency but with a reasonable amount of consistency => consistency constraint
Fetch those that can increase the response latency more
Users aim to minimize the cost as long as the quality of data is reasonable
A threshold for consistency
Cost is the result of fetching requests => latency
Infochimps and Microsoft Azure Marketplace.
You are charged for the amount of requested freshness in your response.
What was the contribution on DW?
We want to keep the high consistency but with a reasonable latency => latency constraint
The less we maintain, the faster we can process queries. But how much less? How to minimize the maintenance?
Extension: to consider all users from the stream; if a user doesn't exist in the local view, we fetch it and replace one of the existing entries in the local view with it.
2-3 slides introducing WSJ and WBM
Limitation of ICWE => we assume that the local view always contains all the elements needed to compute the current answer
I also did some other experiments with CSPARQL; I measured the overhead and it is less than 1%.
The clock of CSPARQL should consider the timestamp carried by the streaming data.
The time overhead of WSJ-WBM is negligible
????
Put the size of the cache in these 2 plots
Introduce overhead percentage…
Local vs. remote? Why does remote have less overhead than local???
In real use cases, background data is located in different locations, and it is not possible to replicate it on the engine machine:
Limitations on the amount of data that can be retrieved over time
Data changes on the source and changes are not pushed to the engine
RSP-QL captures:
the dynamicity of graph -> time-varying graph
the remote location -> SERVICE clause
We move to an approximate setting, and we introduce a cache to store part of the data involved in the query processing, and update part of it to capture the dynamicity
Ranker -> introduce the symbols of the next slide
Introduce the axes first, then the lines a little at a time