This document summarizes a lecture on data integration. It covers the core challenge of providing uniform access to multiple autonomous, heterogeneous data sources, common solutions such as data warehousing and the virtual integration approach, and key concepts (wrappers, mediated schemas, query reformulation, and optimization) at a high level. Research projects on data integration and current industry solutions are also mentioned.
3. What is Data Integration
• Providing
– Uniform (same query interface to all sources)
– Access to (queries; eventually updates too)
– Multiple (we want many, but 2 is hard too)
– Autonomous (DBA doesn’t report to you)
– Heterogeneous (data models are different)
– Structured (or at least semi-structured)
– Data Sources (not only databases).
5. Motivation(s)
• Enterprise data integration; web-site construction.
• WWW:
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
• Science and culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky.
– Environment: Puget Sound Regional Synthesis Model
– Culture: uniform access to all cultural databases
produced by countries in Europe.
7. Current Solutions
• Mostly ad-hoc programming: create a
special solution for every case; pay
consultants a lot of money.
• Data warehousing: load all the data
periodically into a warehouse.
– 6-18 months lead time
– Separates operational DBMS from decision
support DBMS. (not only a solution to data
integration).
– Performance is good; data may not be fresh.
– Need to clean and scrub your data.
9. The Virtual Integration Architecture
• Leave the data in the sources.
• When a query comes in:
– Determine the relevant sources to the query
– Break down the query into sub-queries for the
sources.
– Get the answers from the sources, and combine
them appropriately.
• Data is fresh.
• Challenge: performance.
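The steps above can be sketched as a tiny mediator. This is a minimal, illustrative sketch only: the two in-memory lists stand in for remote autonomous sources, and the predicate stands in for the sub-queries sent to them.

```python
def query_source(source, predicate):
    """Stand-in for a remote call: evaluate a sub-query at one source."""
    return [row for row in source if predicate(row)]

def mediator_query(sources, predicate):
    """Fan the query out to the relevant sources and union the answers."""
    answers = []
    for source in sources:
        answers.extend(query_source(source, predicate))
    return answers

# Two autonomous "sources" holding (title, year) movie tuples.
s1 = [("Casablanca", 1942), ("Airplane!", 1980)]
s2 = [("Alien", 1979), ("Amelie", 2001)]

# Data stays at the sources and is fetched fresh at query time.
recent = mediator_query([s1, s2], lambda row: row[1] > 1975)
# recent contains the three post-1975 movies
```

Note that every user query pays the cost of contacting the sources, which is exactly why performance is the main challenge of this architecture.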
13. Dimensions to Consider
• How many sources are we accessing?
• How autonomous are they?
• Meta-data about sources?
• Is the data structured?
• Queries or also updates?
• Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
15. Wrapper Programs
• Task: to communicate with the data sources
and do format translations.
• They are built w.r.t. a specific source.
• They can sit either at the source or at the
mediator.
• Often hard to build (very little science).
• Can be “intelligent”: perform source-
specific optimizations.
16. Example
Transform:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
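A wrapper performing this slide's translation could look as follows. This is a toy sketch: it uses fragile regular expressions tied to this one source's HTML layout, which illustrates why wrappers are hard to build and source-specific.

```python
import re

def wrap_book(html):
    """Toy wrapper: translate one source's HTML fragment into the
    mediator's XML book format. Tied to this specific source layout."""
    title = re.search(r"<b>\s*(.*?)\s*</b>", html).group(1)
    authors = re.findall(r"<i>\s*(.*?)\s*</i>", html)
    # Publisher and year follow the last </i>, separated by a comma.
    tail = re.search(r"</i>\s*([^<,]+),\s*(\d{4})", html)
    publisher, year = tail.group(1).strip(), tail.group(2)
    xml = ["<book>", f"  <title> {title} </title>"]
    xml += [f"  <author> {a} </author>" for a in authors]
    xml += [f"  <publisher> {publisher} </publisher>",
            f"  <year> {year} </year>", "</book>"]
    return "\n".join(xml)

html = ("<b> Introduction to DB </b> <i> Phil Bernstein </i> "
        "<i> Eric Newcomer </i> Addison Wesley, 1999")
print(wrap_book(html))
```

A slightly "intelligent" wrapper might additionally push selections down to the source before translating, as the previous slide suggests.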
17. Data Source Catalog
• Contains all meta-information about the
sources:
– Logical source contents (books, new cars).
– Source capabilities (can answer SQL queries)
– Source completeness (has all books).
– Physical properties of source and network.
– Statistics about the data (like in an RDBMS)
– Source reliability
– Mirror sources
– Update frequency.
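A catalog of this kind can be sketched as a simple keyed structure. The field names and entries below are hypothetical, chosen only to mirror the bullets above.

```python
# Hypothetical catalog: one metadata record per source.
catalog = {
    "S1": {"contents": "books", "capabilities": ["sql"],
           "complete": True, "update_freq": "daily"},
    "S2": {"contents": "new cars", "capabilities": ["keyword"],
           "complete": False, "update_freq": "hourly"},
}

def sources_for(topic):
    """Prune: keep only sources whose logical contents match the query."""
    return [name for name, meta in catalog.items()
            if meta["contents"] == topic]
```

The system consults such records at query time, e.g. to skip sources whose contents are irrelevant or whose capabilities cannot handle the sub-query.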
18. Content Descriptions
• User queries refer to the mediated schema.
• Data is stored in the sources in a local
schema.
• Content descriptions provide the semantic
mappings between the different schemas.
• Data integration system uses the
descriptions to translate user queries into
queries on the sources.
19. Desiderata from Source Descriptions
• Expressive power: distinguish between
sources with closely related data. Hence, be
able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data
sources.
• Reformulation: be able to reformulate a user
query into a query on the sources efficiently
and effectively.
20. Reformulation Problem
• Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
• Find:
– A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers to Q given the
sources.
21. Approaches to Specifying Source Descriptions
• Global-as-view: express the mediated
schema relations as a set of views over the
data source relations
• Local-as-view: express the source relations
as views over the mediated schema.
• Can be combined with no additional cost.
22. Global-as-View
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select * from S1 [S1(title,dir,year,genre)]
union
select * from S2 [S2(title, dir,year,genre)]
union [S3(title,dir), S4(title,year,genre)]
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
where S3.title=S4.title
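The GAV definition above can be tried out directly in SQLite, since it is plain SQL; the sample rows inserted below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE S1(title, dir, year, genre);
CREATE TABLE S2(title, dir, year, genre);
CREATE TABLE S3(title, dir);
CREATE TABLE S4(title, year, genre);

-- Mediated relation Movie, expressed as a view over the sources.
CREATE VIEW Movie AS
  SELECT * FROM S1
  UNION
  SELECT * FROM S2
  UNION
  SELECT S3.title, S3.dir, S4.year, S4.genre
  FROM S3, S4 WHERE S3.title = S4.title;
""")
cur.execute("INSERT INTO S1 VALUES ('Alien', 'Scott', 1979, 'SciFi')")
cur.execute("INSERT INTO S3 VALUES ('Heat', 'Mann')")
cur.execute("INSERT INTO S4 VALUES ('Heat', 1995, 'Crime')")

# A query over the mediated schema is answered by view unfolding.
rows = cur.execute("SELECT title FROM Movie ORDER BY title").fetchall()
# rows == [('Alien',), ('Heat',)]
```

Running a query against `Movie` here is exactly the "view unfolding" the next slides mention: the DBMS substitutes the view body into the user query.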
23. Global-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS [S1(title,dir,year)]
select title, dir, year, NULL
from S1
union [S2(title, dir,genre)]
select title, dir, NULL, genre
from S2
24. Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4
But what if we want to find which cinemas are playing
comedies?
25. Global-as-View Summary
• Query reformulation boils down to view
unfolding.
• Very easy conceptually.
• Can build hierarchies of mediated schemas.
• You sometimes lose information. Not
always natural.
• Adding sources is hard. Need to consider all
other sources that are available.
26. Local-as-View: example 1
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create Source S1 AS
select * from Movie
Create Source S3 AS [S3(title, dir)]
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
27. Local-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4 AS
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
Now if we want to find which cinemas are playing
comedies, there is hope!
28. Local-as-View Summary
• Very flexible. You have the power of the
entire query language to define the contents
of the source.
• Hence, can easily distinguish between
contents of closely related sources.
• Adding sources is easy: they’re independent
of each other.
• Query reformulation: answering queries
using views!
29. The General Problem
• Given a set of views V1,…,Vn, and a query
Q, can we answer Q using only the answers to
V1,…,Vn?
• Many, many papers on this problem.
• The best-performing algorithm: the MiniCon
Algorithm (Pottinger & Levy, 2000).
• Great survey on the topic: (Halevy, 2001).
30. Local Completeness Information
• If sources are incomplete, we need to look
at each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for
years after 1960, or for American directors.
• Question: given a set of local completeness
statements, is a query Q’ a complete answer
to Q?
31. Example
• Movie(title, director, year) (complete after
1960).
• Show(title, theater, city, hour)
• Query: find movies (and directors) playing
in Seattle:
Select m.title, m.director
From Movie m, Show s
Where m.title=s.title AND city=“Seattle”
• Complete or not?
32. Example #2
• Movie(title, director, year), Oscar(title, year)
• Query: find directors whose movies won
Oscars after 1965:
select m.director
from Movie m, Oscar o
where m.title=o.title AND m.year=o.year
AND o.year > 1965.
• Complete or not?
33. Query Optimization
• Very related to query reformulation!
• Goal of the optimizer: find a physical plan
with minimal cost.
• Key components in optimization:
– Search space of plans
– Search strategy
– Cost model
34. Optimization in Distributed DBMS
• A distributed database (2-minute tutorial):
– Data is distributed over multiple nodes, but the
nodes are uniform.
– Query execution can be distributed to sites.
– Communication costs are significant.
• Consequences for optimization:
– Optimizer needs to decide locality
– Need to exploit independent parallelism.
– Need operators that reduce communication
costs (semi-joins).
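The semi-join idea above can be sketched in a few lines: instead of shipping a whole relation between sites, ship only the join-column values one way, filter at the other site, and send back just the matching tuples. All relation data below is invented.

```python
R = [("m1", "dirA"), ("m2", "dirB"), ("m3", "dirC")]   # Movie(title, dir) at site 1
S = [("cinema1", "m1"), ("cinema2", "m3")]             # Schedule(cinema, title) at site 2

# Step 1: project S's join column and send it to R's site (a small message).
titles_playing = {title for (_, title) in S}

# Step 2: semi-join at R's site -- only matching R tuples travel back.
R_reduced = [r for r in R if r[0] in titles_playing]

# Step 3: finish the join at S's site over the reduced R.
result = [(cinema, d) for (cinema, t1) in S
                      for (t2, d) in R_reduced if t1 == t2]
print(result)  # [('cinema1', 'dirA'), ('cinema2', 'dirC')]
```

The saving is in step 2: non-matching tuples of R never cross the network.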
35. DDBMS vs. Data Integration
• In a DDBMS, data is distributed over a set
of uniform sites with precise rules.
• In a data integration context:
– Data sources may provide only limited access
patterns to the data.
– Data sources may have additional query
capabilities.
– Cost of answering queries at sources unknown.
– Statistics about data unknown.
– Transfer rates unpredictable.
36. Modeling Source Capabilities
• Negative capabilities:
– A web site may require certain inputs (in an
HTML form).
– Need to consider only valid query execution
plans.
• Positive capabilities:
– A source may be an ODBC compliant system.
– Need to decide placement of operations
according to capabilities.
• Problem: how to describe and exploit
source capabilities.
37. Example #1: Access Patterns
Mediated schema relation: Cites(paper1, paper2)
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paper1
from Cites
Query: select paper1 from Cites where paper2="Hal00"
38. Example #1: Continued
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paper1
from Cites
Select p1
From S1, S2
Where S2.paper1=S1.paper1 AND S1.paper2="Hal00"
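A runnable sketch of executing that plan: S1 requires paper1 to be bound, so we first pull candidate paper1 values from S2, probe S1 with each, and keep those citing "Hal00". The citation data is invented for illustration.

```python
# Underlying Cites data that the two sources expose (toy data).
cites = {"p1": ["Hal00", "Lev96"], "p2": ["Ull97"], "p3": ["Hal00"]}

def s1(paper1):
    # Access pattern: paper1 must be given (bound).
    return [(paper1, p2) for p2 in cites.get(paper1, [])]

def s2():
    # No binding required; returns paper1 values only.
    return list(cites.keys())

# Plan: scan S2, probe S1 with each binding, select on paper2.
answers = sorted(p1 for p1 in s2()
                 for (_, p2) in s1(p1) if p2 == "Hal00")
print(answers)  # ['p1', 'p3']
```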
39. Example #2: Access Patterns
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paperID
from UW-Papers
Create Source S3 as
select paperID
from AwardPapers
given paperID
Query: select * from AwardPapers
40. Example #2: Solutions
• Can’t go directly to S3 because it requires a
binding.
• Can go to S1, get UW papers, and check if they’re
in S3.
• Can go to S1, get UW papers, feed them into S2,
and feed the results into S3.
• Can go to S1, feed results into S2, feed results into
S2 again, and then feed results into S3.
• Strictly speaking, we can’t a priori decide when to
stop.
• Need recursive query processing.
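The recursive plan can be sketched as a fixpoint loop: seed with UW papers from S2, repeatedly expand the set through citations via S1 (which needs its input bound), and probe S3 with every paper found. We cannot bound the number of rounds in advance, hence the loop. All identifiers are invented.

```python
# Toy extensions of the sources.
cites = {"uw1": ["x1"], "x1": ["x2"], "x2": [], "uw2": []}  # S1: given paper1
uw_papers = ["uw1", "uw2"]                                  # S2: no binding needed
award_papers = {"x2", "uw2"}                                # S3: probe with paperID

known = set(uw_papers)
frontier = set(uw_papers)
while frontier:                      # fixpoint: stop when no new papers appear
    new = {p2 for p1 in frontier for p2 in cites.get(p1, [])} - known
    known |= new
    frontier = new

# Probe S3 with every paper reached.
answers = sorted(p for p in known if p in award_papers)
print(answers)  # ['uw2', 'x2']
```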
41. Handling Positive Capabilities
• Characterizing positive capabilities:
– Schema independent (e.g., can always perform joins,
selections).
– Schema dependent: can join R and S, but not T.
– Given a query, tells you whether it can be handled.
• Key issue: how do you search for plans?
• Garlic approach (IBM): Given a query, STAR
rules determine which subqueries are executable
by the sources. Then proceed bottom-up as in
System-R.
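A hedged sketch of capability-based placement (not Garlic's actual STAR rules): per-source capability sets say which operations a source can execute, and anything it cannot handle is left to the mediator. The source names and flags are invented.

```python
# Invented capability descriptions: an ODBC source vs. a form-based web source.
capabilities = {"S_odbc": {"select", "join"}, "S_web": {"select"}}

def placement(source, needed_ops):
    """Split a query's operations into (run at source, run at mediator)."""
    at_source = needed_ops & capabilities[source]
    return at_source, needed_ops - at_source

# The web source can filter but not join, so the join runs at the mediator.
print(placement("S_web", {"select", "join"}))  # ({'select'}, {'join'})
```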
42. Matching Objects Across Sources
• How do I know that A. Halevy in source 1 is the
same as Alon Halevy in source 2?
• If there are uniform keys across sources, no
problem.
• If not:
– Domain specific solutions (e.g., maybe look at the
address, ssn).
– Use Information retrieval techniques (Cohen, 98).
Judge similarity as you would between documents.
– Use concordance tables. These are time-consuming to
build, but you can then sell them for lots of money.
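The information-retrieval option can be sketched with a simple token-overlap score (Whirl used TF-IDF cosine similarity; plain Jaccard is used here for brevity): treat each name as a bag of tokens and declare a match above a threshold. The threshold 0.3 is an invented illustration, not a recommended value.

```python
def jaccard(a, b):
    # Similarity of two names as bags of lowercase tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Shared token "halevy" pushes the pair over the threshold.
print(jaccard("A. Halevy", "Alon Halevy") > 0.3)   # True
# No shared tokens at all.
print(jaccard("A. Halevy", "J. Ullman") > 0.3)     # False
```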
43. Optimization and Execution
• Problem:
– Few and unreliable statistics about the data.
– Unexpected (possibly bursty) network transfer
rates.
– Generally, unpredictable environment.
• General solution: (research area)
– Adaptive query processing.
– Interleave optimization and execution. As you
get to know more about your data, you can
improve your plan.
45. Double Pipelined Join (Tukwila)
• Hash join:
– Partially pipelined: no output until the inner input is fully read.
– Asymmetric (inner vs. outer roles): optimization requires knowledge of source behavior.
• Double pipelined hash join:
– Outputs data immediately.
– Symmetric: requires less source knowledge to optimize.
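A minimal sketch of a symmetric (double pipelined) hash join: both inputs build hash tables, and each arriving tuple immediately probes the other side's table, so results stream out without waiting for either input to finish. The interleaving and data below are invented; Tukwila's actual operator handles asynchronous arrival and memory overflow.

```python
from collections import defaultdict
from itertools import zip_longest

def symmetric_hash_join(left, right, key):
    h_left, h_right = defaultdict(list), defaultdict(list)
    out = []
    # zip_longest crudely simulates tuples arriving from two streams.
    for ltup, rtup in zip_longest(left, right):
        if ltup is not None:
            h_left[key(ltup)].append(ltup)          # build own table...
            out += [(ltup, m) for m in h_right[key(ltup)]]  # ...probe the other
        if rtup is not None:
            h_right[key(rtup)].append(rtup)
            out += [(m, rtup) for m in h_left[key(rtup)]]
    return out

movies   = [("m1", "dirA"), ("m2", "dirB")]   # (title, dir)
schedule = [("m2", "c1"), ("m1", "c2")]       # (title, cinema)
print(symmetric_hash_join(movies, schedule, key=lambda t: t[0]))
```

Each result tuple is emitted as soon as its second half arrives, regardless of which input it came from.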
46. Piazza: A Peer-Data Management System
• Goal: enable users to share data across local or wide-area
networks in an ad-hoc, highly dynamic distributed architecture.
• Peers share data and mediated views.
• Peers act as both clients and servers.
• Rich semantic relationships between peers.
• Ad-hoc collaborations (peers join and leave at will).
47. Extending the Vision to Data Sharing
[Diagram: an ad-hoc data-sharing network for emergency response. Peers include
a 911 Dispatch Center (9DC); Fire Services (FS) over the Portland Fire District
(PFD) and Vancouver Fire District (VFD) with Stations 3, 12, 19, and 32;
Hospitals (H) including First Hospital (FH) and Lakeview Hospital (LH) under
Medical Aid (MA); an Earthquake Command Center (ECC) with Search & Rescue (SR)
and Emergency Workers (EW); and the Washington State National Guard.]
48. The Structure Mapping Problem
• Types of structures:
– Database schemas, XML DTDs, ontologies, …
• Input:
– Two (or more) structures, S1 and S2
– (perhaps) Data instances for S1 and S2
– Background knowledge
• Output:
– A mapping between S1 and S2
• Should enable translating between data instances.
49. Semantic Mappings between Schemas
• Source schemas = XML DTDs
[Diagram: two house-listing DTD trees. One tree: house → location, contact.
The other: house → address, num-baths (full-baths, half-baths), and
contact-info (agent-name, agent-phone). Arrows mark a 1-1 mapping and a
non-1-1 mapping between elements of the two trees.]
50. Why Matching is Difficult
• Structures represent same entity differently
– different names => same entity:
• area & address => location
– same names => different entities:
• area => location or square-feet
• Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
• Schema, data and rules never fully capture semantics!
– not adequately documented, certainly not for machine
consumption.
• Often hard for humans (committees are formed!)
51. Desiderata from Proposed Solutions
• Accuracy, efficiency, ease of use.
• Realistic expectations:
– Unlikely to be fully automated. Need user in the loop.
• Some notion of semantics for mappings.
• Extensibility:
– Solution should exploit additional background
knowledge.
• “Memory”, knowledge reuse:
– System should exploit previous manual or
automatically generated matchings.
– Key idea behind LSD.
52. Learning for Mapping
• Context: generating semantic mappings between
a mediated schema and a large set of data source
schemas.
• Key idea: generate the first mappings manually,
and learn from them to generate the rest.
• Technique: multi-strategy learning (extensible!)
• L(earning) S(ource) D(escriptions) [SIGMOD 2001].
53. Data Integration (a simple PDMS)
Find houses with four bathrooms priced under $500,000
[Diagram: the query is posed over a mediated schema, which is connected to
three source schemas: realestate.com, homeseekers.com, and homes.com. Query
reformulation and optimization happen between the mediated schema and the
sources.]
Applications: WWW, enterprises, science projects.
Techniques: virtual data integration, warehousing, custom code.
54. Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com (mapped manually):
listed-price | contact-name | contact-phone  | office         | comments
$250K        | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
$320K        | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location

Rules learned from the manual mapping:
• If “office” occurs in the attribute name => office-phone.
• If “fantastic” & “great” occur frequently in data instances => description.

homes.com (to be mapped automatically):
sold-at | contact-agent  | extra-info
$350K   | (206) 634 9435 | Beautiful yard
$230K   | (617) 335 4243 | Close to Seattle
$190K   | (512) 342 1263 | Great lot
55. Multi-Strategy Learning
• Use a set of base learners:
– Name learner, Naïve Bayes, Whirl, XML learner
• And a set of recognizers:
– County name, zip code, phone numbers.
• Each base learner produces a prediction weighted
by confidence score.
• Combine base learners with a meta-learner, using
stacking.
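A hedged sketch of the combination step: each base learner emits a label with a confidence; the meta-learner is reduced here to fixed per-learner weights (LSD learned such weights via stacking; the weights and predictions below are invented for illustration).

```python
def combine(predictions, learner_weights):
    """Pick the label with the highest confidence-weighted total score."""
    scores = {}
    for learner, (label, conf) in predictions.items():
        scores[label] = scores.get(label, 0.0) + learner_weights[learner] * conf
    return max(scores, key=scores.get)

# Invented predictions for one homes.com column.
predictions = {
    "name_learner": ("office-phone", 0.6),  # attribute name contains "office"
    "naive_bayes":  ("agent-phone", 0.5),   # content looks like phone numbers
    "recognizer":   ("office-phone", 0.9),  # phone-number recognizer fired
}
weights = {"name_learner": 1.0, "naive_bayes": 0.8, "recognizer": 0.5}
print(combine(predictions, weights))  # 'office-phone'
```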
56. The Semantic Web
• How does it relate to data integration?
• How are we going to do it?
• Why should we do it? Do we need a killer
app or is the semantic web a killer app?