A wealth of information produced by individuals and organizations is expressed in natural language text. Text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools that discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task.
In this talk, I will first present our proposal to optimize information extraction programs. It consists of a holistic approach that focuses on: (i) optimizing all key aspects of the information extraction process collectively and in a coordinated manner, rather than focusing on individual subtasks in isolation; (ii) accurately predicting the execution time, recall, and precision for each information extraction execution plan; and (iii) using these predictions to choose the best execution plan to execute a given information extraction program.
Then, I will briefly present a principled, learning-based approach for ranking documents according to their potential usefulness for an extraction task. Our online learning-to-rank methods exploit the information collected during extraction, as we process new documents and the fine-grained characteristics of the useful documents are revealed. These methods then decide when the ranking model should be updated, significantly improving document ranking quality over time.
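As a rough illustration of point (iii) above, the sketch below picks an extraction execution plan from predicted time, recall, and precision under a time budget. The plan names, the F1-based scoring rule, and the budget constraint are illustrative assumptions, not the actual optimizer described in the talk.

```python
# Illustrative sketch only: choose an extraction execution plan from
# predicted time/recall/precision under a user-supplied time budget.
from dataclasses import dataclass

@dataclass
class PlanEstimate:
    name: str
    time_s: float      # predicted execution time (seconds)
    recall: float      # predicted recall in [0, 1]
    precision: float   # predicted precision in [0, 1]

def choose_plan(estimates, time_budget_s):
    """Return the feasible plan with the best predicted F1 score."""
    feasible = [p for p in estimates if p.time_s <= time_budget_s]
    if not feasible:
        return None
    def f1(p):
        return 2 * p.precision * p.recall / ((p.precision + p.recall) or 1)
    return max(feasible, key=f1)

plans = [
    PlanEstimate("scan-all-documents", time_s=3600, recall=0.95, precision=0.80),
    PlanEstimate("ranked-documents-first", time_s=600, recall=0.70, precision=0.85),
]
best = choose_plan(plans, time_budget_s=900)
print(best.name if best else "no feasible plan")
```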
This is joint work with Gonçalo Simões, INESC-ID and IST/University of Lisbon, and Pablo Barrio and Luis Gravano from Columbia University, NY.
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ... (Ian Foster)
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
Going Smart and Deep on Materials at ALCF (Ian Foster)
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Capturing and querying fine-grained provenance of preprocessing pipelines in ... (Paolo Missier)
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Heterogeneous Defect Prediction (ESEC/FSE 2015) (Sung Kim)
This document describes a technique called Heterogeneous Defect Prediction (HDP) that aims to perform cross-project defect prediction even when the source and target projects have different sets of metrics (i.e. heterogeneous metrics). HDP first selects informative metrics from the source and target projects, then matches metrics between the projects that have similar distributions. It uses the matched metrics to build a prediction model on the source project and apply it to the target project. The technique is evaluated on multiple public defect datasets and is shown to outperform whole-project defect prediction and other cross-project defect prediction baselines in most cases, demonstrating the potential of HDP to reuse existing defect data more widely.
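As an illustration of the metric-matching step, the following sketch pairs source and target metrics by distribution similarity using a two-sample Kolmogorov-Smirnov test. The choice of test and the 0.05 cutoff are assumptions for illustration and may not match the paper's exact configuration.

```python
# Minimal sketch of distribution-based metric matching between projects
# with heterogeneous metric sets: each source metric is paired with the
# most similarly distributed target metric, keeping only strong matches.
import numpy as np
from scipy.stats import ks_2samp

def match_metrics(source: dict, target: dict, p_cutoff: float = 0.05):
    """source/target map metric name -> 1-D array of observed values."""
    matches = []
    for s_name, s_vals in source.items():
        best = None
        for t_name, t_vals in target.items():
            _, p_value = ks_2samp(s_vals, t_vals)
            if best is None or p_value > best[1]:
                best = (t_name, p_value)
        if best and best[1] > p_cutoff:   # distributions are similar enough
            matches.append((s_name, best[0], best[1]))
    return matches

rng = np.random.default_rng(0)
src = {"loc": rng.lognormal(3, 1, 200), "complexity": rng.poisson(5, 200)}
tgt = {"methods": rng.poisson(5, 150), "coupling": rng.lognormal(3, 1, 150)}
print(match_metrics(src, tgt))
```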
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
Preserving the currency of analytics outcomes over time through selective re-... (Paolo Missier)
The document discusses techniques for preserving the accuracy of analytics results over time through selective recomputation as meta-knowledge and datasets change. It presents the ReComp project which aims to quantify the impact of changes to algorithms, data, and databases on prior analytics outcomes. The techniques developed include capturing workflow execution history and provenance, defining data difference functions, and estimating the effect of changes to determine what recomputation is needed. Open challenges include understanding change frequency and impact, and when re-running expensive simulations is necessary due to modifications in inputs.
The document discusses a project that used accelerometers to recognize gestures for a virtual environment. The project utilized Wii remotes and an Acceleglove to collect accelerometer data and recognize gestures through MATLAB. A C# program extracted accelerometer data from the Wii remotes for training and testing gesture recognition algorithms in MATLAB. Key algorithms used in MATLAB included dynamic time warping, affinity propagation, and random projection for gesture recognition. The digital glove also used an artificial neural network approach for gesture recognition.
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph... (Ian Foster)
The Advanced Photon Source (APS) at Argonne National Laboratory produces intense beams of x-rays for scientific research. Experimental data from the APS is growing dramatically due to improved detectors and a planned upgrade. This is creating data and computation challenges across the entire experimental process. Efforts are underway to accelerate the experimental feedback loop through automated data analysis, optimized data streaming, and computer-steered experiments to minimize data collection. The goal is to enable real-time insights and knowledge-driven experiments.
How might machine learning help advance solar PV research? (Anubhav Jain)
Machine learning techniques can help optimize solar PV systems in several ways:
1) Clear sky detection algorithms using ML were developed to more accurately classify sky conditions from irradiance data, improving degradation rate calculations.
2) Site-specific modeling of module voltages over time, validated with field data, allows more optimal string sizing compared to traditional worst-case assumptions.
3) ML and data-driven approaches may help optimize other aspects of solar plant design like climate zone definitions and extracting module parameters from production data.
A Framework and Infrastructure for Uncertainty Quantification and Management ... (aimsnist)
QuesTek Innovations presented a framework to incorporate materials genome initiatives (MGI) and artificial intelligence (AI) into their integrated computational materials engineering (ICME) practice. They discussed three key aspects: (1) MaGICMaT, a materials genome and ICME toolkit to manage data and property-structure-performance linkages, (2) an uncertainty quantification framework for CALPHAD modeling, and (3) a cloud-based platform to enable rapid development and deployment of ICME models with an HPC backend. The presentation provided details on their approaches for each aspect and highlighted opportunities to further enhance ICME with MGI and AI.
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
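MOA itself is a Java framework, so the fragment below is only a Python analogue of the test-then-train (prequential) evaluation loop such streaming frameworks use: each arriving example is first used for prediction and then for a model update. The model interface (predict, learn_one) is a hypothetical stand-in, not MOA's API.

```python
# Prequential ("test-then-train") evaluation sketch for a data stream.
def prequential_accuracy(stream, model):
    """stream yields (x, y) pairs; model exposes predict(x) and learn_one(x, y)."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)   # test first ...
        model.learn_one(x, y)                   # ... then train on the same example
        total += 1
    return correct / total if total else 0.0
```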
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
Overview of DuraMat software tool development (poster version) (Anubhav Jain)
This document provides an overview of software tools being developed by the DuraMat project to analyze photovoltaic systems. It summarizes six software tools that serve two main purposes: core functions for PV analysis and modeling operation/degradation, and tools for project planning and reducing levelized cost of energy (LCOE). The core function tools include PVAnalytics for data processing and a PV-Pro preprocessor. Tools for operation/degradation include PV-Pro, PVOps, PVArc, and pv-vision. Tools for project planning and LCOE include a simplified LCOE calculator and VocMax string length calculator. All tools are open source and designed for large PV data sets.
STKO - A revolutionary toolkit for OpenSees (openseesdays)
This document introduces STKO, a toolkit for OpenSees that allows for efficient storage and visualization of simulation results. STKO uses HDF5 to store complex data in a hierarchical format. It implements a new MPCO file format and recorder class to organize node and element results, including results on fibers. STKO provides pre- and post-processing tools, allowing users to import models, run simulations in OpenSees, and visualize results like fiber stresses and displacements. The toolkit aims to improve work with large, complex models and heterogeneous simulation outputs.
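This is not STKO's actual MPCO layout, but a minimal h5py sketch of the general idea of organizing per-step node and element results hierarchically in an HDF5 file, so large heterogeneous outputs stay organized and queryable. Dataset names and shapes are illustrative.

```python
# Minimal sketch: store and read back per-step simulation results in HDF5.
import numpy as np
import h5py

with h5py.File("results.h5", "w") as f:
    step = f.create_group("model/step_0001")
    step.create_dataset("nodes/displacement", data=np.zeros((100, 3)))
    step.create_dataset("elements/fiber_stress", data=np.zeros((250, 8)))
    step.attrs["time"] = 0.01

with h5py.File("results.h5", "r") as f:
    disp = f["model/step_0001/nodes/displacement"][:]
    print(disp.shape, f["model/step_0001"].attrs["time"])
```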
Automated Machine Learning Applied to Diverse Materials Design Problems (Anubhav Jain)
Anubhav Jain presented on developing standardized benchmark datasets and algorithms for automated machine learning in materials science. Matbench provides a diverse set of materials design problems for evaluating ML algorithms, including classification and regression tasks of varying sizes from experiments and DFT. Automatminer is a "black box" ML algorithm that uses genetic algorithms to automatically generate features, select models, and tune hyperparameters on a given dataset, performing comparably to specialized literature methods on small datasets but less well on large datasets. Standardized evaluations can help accelerate progress in automated ML for materials design.
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking (YoungSeok Yoon)
This document describes a longitudinal study of programmers' backtracking behavior. The study analyzed coding logs collected from 21 programmers over 1,460 total hours. It found that programmers backtrack frequently, with an average of 10.3 backtracking instances per hour. Backtracking instances varied in size from simple parameter changes to significant algorithmic changes. Nearly 40% of backtracking instances were performed manually, without tool support. The study provided evidence that programmers do exploratory programming by making changes, running code, and reverting changes. Selective backtracking, which is not supported by regular undo commands, accounted for around 10% of instances.
Exquiron migrated their cheminformatics platform from PipelinePilot to KNIME and ChemAxon technologies due to rising PipelinePilot costs. They worked with ChemAxon consultants to port complex workflows like hit expansion and dose-response reporting to KNIME. While the migration was successful, KNIME has higher memory requirements and slower loops than PipelinePilot. However, KNIME provides a modern reporting interface and faster fingerprint searches. With help from ChemAxon, Exquiron was able to optimize workflows and maintain project timelines during the platform migration.
Evaluating Machine Learning Algorithms for Materials Science using the Matben... (Anubhav Jain)
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
Architectural decisions in designing data and computation intensive systems can have a major impact on the ability of these systems to perform statistical and other complex calculations efficiently. The storage, processing, tools, and associated databases, coupled with the networking and compute infrastructure, make some kinds of computations easier and others harder. This talk will provide an introduction to the software and data systems components that are important for understanding how these choices impact data analysis uncertainties and costs, and thus for developing system and software designs best suited to statistical analyses.
Scientific Software: Sustainability, Skills & Sociology (Neil Chue Hong)
This document discusses the importance of software sustainability. It notes that software is everywhere, long-lived, and hard to define, which makes sustainability challenging. It emphasizes that software sustainability requires cultivating skills in developers and researchers, providing proper incentives, and recognizing that people are a key part of maintaining software over long periods of time.
Optimization of Continuous Queries in Federated Database and Stream Processin... (Zbigniew Jerzak)
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of the major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
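To make the placement decision concrete, here is a toy, greedy sketch of choosing an engine for each operator from estimated per-operator costs plus a cross-engine transfer penalty. The cost numbers and the greedy strategy are illustrative assumptions, not the paper's cost model.

```python
# Toy sketch: place each operator of a continuous query on the stream
# engine or the database engine, whichever has the lower estimated cost,
# charging a penalty whenever the plan crosses engine boundaries.
TRANSFER_COST = 2.0   # hypothetical per-tuple penalty for crossing engines

def place_operators(operators, cost_stream, cost_db):
    """operators: ordered operator names; cost_* map operator -> estimated cost."""
    placement, prev = {}, None
    for op in operators:
        c_s, c_d = cost_stream[op], cost_db[op]
        if prev == "stream":
            c_d += TRANSFER_COST
        elif prev == "db":
            c_s += TRANSFER_COST
        placement[op] = prev = "stream" if c_s <= c_d else "db"
    return placement

ops = ["filter", "window_join", "aggregate"]
print(place_operators(ops,
                      cost_stream={"filter": 1, "window_join": 8, "aggregate": 3},
                      cost_db={"filter": 2, "window_join": 4, "aggregate": 2}))
```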
This document discusses performance analysis and parallel computing. It defines performance metrics like speedup, efficiency, and scalability that are used to evaluate parallel programs. Sources of parallel overhead like synchronization, load imbalance, and communication are described. The document also discusses benchmarks used to evaluate parallel systems like PARSEC and Rodinia. It emphasizes that overall execution time captures a system's real performance and depends on factors like CPU time, I/O, memory access, and interactions between programs.
Delivered at the TAUS Machine Translation Showcase, June 6th 2014, Dublin, Ireland.
In this talk, we explain how machine translation systems can be developed for highly technical content types.
ReComp, the complete story: an invited talk at Cardiff University (Paolo Missier)
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
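A minimal sketch of the selection step, assuming a user-supplied impact function and threshold (the framework's difference and impact functions are application-specific plug-ins; the overlap-based impact below is a made-up example):

```python
# Illustrative sketch: re-run only the past executions whose estimated
# impact from a data change exceeds a threshold.
def select_for_recomputation(executions, change, impact_fn, threshold=0.1):
    """executions: identifiers of past runs (with cached provenance);
    impact_fn(run, change) -> estimated impact score in [0, 1]."""
    return [run for run in executions if impact_fn(run, change) >= threshold]

# Toy usage: only runs whose inputs overlap the changed records are re-run.
runs = {"run_1": {"geneA", "geneB"}, "run_2": {"geneC"}}
changed = {"geneA"}
impact = lambda run, ch: len(runs[run] & ch) / len(runs[run])
print(select_for_recomputation(runs, changed, impact))   # ['run_1']
```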
The experiment aimed to monitor energy and gas consumption on an aircraft parts production line. Sensors were installed to measure the consumption of autoclaves and machines. Apache Flink processed the streaming data and Orion stored it, while Knowage visualized trends and provided indicators. The system identified optimization rules; applying a new recipe reduced energy by 11.12%, gases by 12.92%, and production time by 8%, meeting the KPI targets. COVID-19 impacted sensor installation, but the rules still provided decision support.
Chronix is a domain specific time series database designed for anomaly detection in operational data. It is optimized for the needs of anomaly detection by supporting domain specific data types, analysis algorithms, data models, and query languages. It aims to address limitations of general purpose time series databases by exploiting characteristics of operational data through features like optional pre-computation of extras, timestamp compression, domain specific records and compression techniques, and multi-dimensional storage. An evaluation using data from five industry projects found that Chronix has significantly smaller memory and storage footprints and faster data retrieval and analysis times compared to other time series databases.
How to Improve Computer Vision with Geospatial Tools (Safe Software)
Can computer vision mimic human vision? Maybe – but we need the right tools to process the high volume of data required by machine learning algorithms. Integration tools like FME can be used to harness the power of geospatial and machine learning for object detection.
In this webinar, you will learn how to:
- Use the ML libraries exposed in FME for object detection on photos or with remote sensing data (with an OpenCV integration)
- Improve the detection results with geospatial analysis
- Deliver results to stakeholders with quality outputs like maps, images, or info shared directly to a destination system
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica... (Lionel Briand)
This document discusses testing dynamic behavior in executable software models for cyber-physical systems. It presents challenges for model-in-the-loop (MiL) testing due to large input spaces, expensive simulations, and lack of simple oracles. The document proposes using search-based testing to generate critical test cases by formulating it as a multi-objective optimization problem. It demonstrates the approach on an advanced driver assistance system and discusses improving performance with surrogate modeling.
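As a deliberately simplified illustration of search-based test generation, the sketch below runs a single-objective local search that keeps mutations bringing a simulated scenario closer to a critical condition. The real MiL approach is multi-objective (e.g., evolutionary search) and uses surrogate models to limit expensive simulations; the fitness function here is a placeholder.

```python
# Simplified single-objective search for a critical test input:
# mutate the candidate and keep it whenever the fitness improves.
import random

def search_critical_test(simulate, initial, iterations=200, step=0.1):
    """simulate(candidate) -> fitness (lower = closer to a critical scenario)."""
    best, best_fit = initial, simulate(initial)
    for _ in range(iterations):
        candidate = [v + random.gauss(0, step) for v in best]
        fit = simulate(candidate)
        if fit < best_fit:               # keep improvements only
            best, best_fit = candidate, fit
    return best, best_fit

# Toy usage: search for inputs whose components sum to a "critical" value of 3.0.
print(search_critical_test(lambda x: abs(sum(x) - 3.0), initial=[0.0, 0.0, 0.0]))
```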
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question: how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
This document summarizes a thesis on automating test routine creation through natural language processing. The author proposes using word embeddings and recommender systems to automatically generate test cases from requirements documents and link them together. The methodology involves representing text as word vectors, calculating similarity between requirements and test blocks, and applying association rule mining on test block sequences. An experiment on a space operations dataset showed the approach improved productivity in test creation and requirements tracing over manual methods. Future work could explore using deep learning models and collecting additional evaluation metrics from users.
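A minimal sketch of the similarity step, assuming each requirement and each reusable test block is represented by the average of its word vectors and ranked by cosine similarity. The toy interface stands in for real embeddings (e.g., word2vec) and is not the thesis's actual implementation.

```python
# Rank candidate test blocks for a requirement by embedding similarity.
import numpy as np

def text_vector(text, word_vectors, dim=3):
    """Average the word vectors of the words present in the vocabulary."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_test_blocks(requirement, blocks, word_vectors):
    """blocks: dict mapping test-block name -> descriptive text."""
    r = text_vector(requirement, word_vectors)
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    scores = [(name, cosine(r, text_vector(text, word_vectors)))
              for name, text in blocks.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```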
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat... (Databricks)
John Deere is a leading manufacturer of agricultural, construction, and forestry machinery, diesel engines, and drivetrains for a variety of applications ranging from lawn care to heavy equipment. The company collects large transient engineering datasets from John Deere test vehicles in the field and via telematic data-loggers.
Crash course on data streaming (with examples using Apache Flink) (Vincenzo Gulisano)
These are the slides I used for a crash course (4 hours) on data streaming. They contain both theory/research aspects and examples based on Apache Flink (DataStream API).
Design and Implementation of A Data Stream Management System (Erdi Olmezogullari)
This presentation is related to my Master's thesis at Ozyegin University. We focused on data mining over real streaming (not binary) data. The most popular data mining algorithm, Association Rule Mining (ARM), was implemented from scratch during this study. At the end of the thesis, we published four national/international papers in different conferences, such as Cloud Computing and Big Data.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
Similar to Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents (20)
Change Management in the Traditional and Semantic Web (INRIA-OAK)
Data has played such a crucial role in the development of modern society that relational databases, the most popular data management systems, have been called the "foundation of western civilization" for their massive adoption in business, government, and education, which allowed the necessary levels of productivity and standardization to be reached.
However, in order to become knowledge, and hence a real asset, data needs to be organized in a way that enables the retrieval of relevant information and supports data analysis. Another fundamental aspect of effective data management is full support for changes at both the data and metadata level.
A failure to manage data changes usually results in a dramatic diminishment of the data's usefulness. Data does evolve, in both its format and its content, due to changes in the modeled domain, error correction, and different required levels of granularity, in order to accommodate new information.
Change management has been covered extensively in the literature; we concentrate in particular on two enabling factors of the World Wide Web: XML (the W3C-endorsed and widely used markup language for defining semi-structured data) and ontologies (a major enabling factor of the Semantic Web vision), with their related metadata.
Concretely, we cover change management techniques for XML, and we then concentrate on the additional problems that arise when considering the evolution of semantically-equipped data (with the help of a use case featuring evolving ontologies and related mappings), which cannot be handled exclusively at the syntactic level, disregarding logical consequences.
A Network-Aware Approach for Searching As-You-Type in Social Media (INRIA-OAK)
We present a novel approach for as-you-type top-k keyword search over social media. We adopt a natural "network-aware" interpretation of information relevance, by which information produced by users who are closer to the seeker is considered more relevant. In practice, this query model poses new challenges for effectiveness and efficiency in online search, even when a complete query is given as input in one keystroke. This is mainly because it requires a joint exploration of the social space and classic IR indexes such as inverted lists. We describe a memory-efficient and incremental prefix-based retrieval algorithm, which also exhibits an anytime behavior, allowing it to output the most likely answer within any chosen running-time limit. We evaluate it through extensive experiments for several applications and search scenarios, including searching for posts in micro-blogging (Twitter and Tumblr), as well as searching for businesses based on reviews in Yelp. They show that our solution is effective in answering real-time as-you-type searches over social media.
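The scoring idea can be illustrated with a toy function that blends textual relevance with the seeker's proximity to the content author in the social graph; the decay function and the 0.7/0.3 weights below are arbitrary assumptions, not the paper's model.

```python
# Toy "network-aware" scoring: matching items from nearby authors rank higher.
def network_aware_score(text_score, hops_from_seeker, alpha=0.7):
    proximity = 1.0 / (1 + hops_from_seeker)      # closer author => higher weight
    return alpha * proximity + (1 - alpha) * text_score

candidates = [("post_by_friend", 0.6, 1), ("post_by_stranger", 0.9, 4)]
ranked = sorted(candidates,
                key=lambda c: network_aware_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])   # the friend's post wins despite a lower text score
```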
Data is incomplete when it contains missing/unknown information, or more generally when it is only partially available, e.g. because of restrictions on data access.
Incompleteness is receiving a renewed interest as it is naturally generated in data interoperation, a very common framework for today's data-centric applications. In this setting data is decentralized, needs to be integrated from several sources and exchanged between different applications. Incompleteness arises from the semantic and syntactic heterogeneity of different data sources.
Querying incomplete data is usually an expensive task. In this talk we survey the state of the art and recent developments on the tractability of querying incomplete data, under different possible interpretations of incompleteness.
In recent years, several important content providers such as Amazon, Musicbrainz, IMDb, Geonames, Google, and Twitter, have chosen to export their data through Web services. To unleash the potential of these sources for new intelligent applications, the data has to be combined across different APIs.
To this end, we have developed ANGIE, a framework that maps the knowledge provided by Web services dynamically into a local knowledge base. ANGIE represents Web services as views with binding patterns over the schema of the knowledge base. In this talk, I will focus on two problems related to our framework.
In the first part, the focus will be on the automatic integration of new Web services. I will present a novel algorithm for inferring the view definition of a given Web service in terms of the schema of the global knowledge base. The algorithm also generates a declarative script that can transform the call results into results of the view. Our experiments on real Web services show the viability of our approach.
The second part will address the evaluation of conjunctive queries under a budget of calls. Conjunctive queries may require an unbounded number of calls in order to compute the maximal answers. However, Web services typically allow only a fixed number of calls per session. Therefore, we have to prioritize query evaluation plans. We are working on distinguishing, among all plans that could return answers, those plans that actually will. Finally, I will show an application for this new notion of plans.
On building more human query answering systems (INRIA-OAK)
The underlying principle behind every query answering system is the existence of a query describing the information of interest. When this model is applied to non-expert users, two traditional issues become highly significant.
The first is that many queries are over-specified, leading to empty answers. We propose a principled, optimization-based interactive query relaxation framework for such queries. The framework dynamically computes and suggests alternative queries with fewer conditions to help the user arrive at a query with a non-empty answer, or at a query for which it is clear that, independently of the relaxations, the answer will always be empty.
The second issue is the user's lack of expertise to accurately describe the requirements of the elements of interest. The user may, however, know examples of elements that they would like to see in the results. We introduce a novel query paradigm in which queries are no longer specifications of what the user is searching for, but simply a sample of what the user knows to be of interest. We refer to this novel form of queries as Exemplar Queries.
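For the first issue, a naive sketch of query relaxation might look as follows: when a conjunctive query returns no answers, drop one condition at a time and suggest the variants that become non-empty. The actual framework selects relaxations with an optimization-based strategy rather than exhaustively; this only illustrates the basic idea.

```python
# Naive one-condition-at-a-time relaxation of an empty-answer query.
def suggest_relaxations(conditions, run_query):
    """conditions: list of predicates; run_query(conds) -> result rows."""
    if run_query(conditions):
        return []                                   # nothing to relax
    suggestions = []
    for i in range(len(conditions)):
        relaxed = conditions[:i] + conditions[i + 1:]
        if run_query(relaxed):                      # relaxed query is non-empty
            suggestions.append(relaxed)
    return suggestions
```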
Dynamically Optimizing Queries over Large Scale Data Platforms (INRIA-OAK)
Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
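The pilot-run idea can be sketched as follows: run the predicate, possibly an opaque UDF the optimizer cannot reason about, over a small sample of the data and use the observed pass rate as a selectivity estimate for cost-based planning. The sample size and data below are illustrative.

```python
# Sampling-based selectivity estimation ("pilot run") sketch.
import random

def estimate_selectivity(dataset, predicate, sample_size=1000, seed=42):
    random.seed(seed)
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    return sum(1 for row in sample if predicate(row)) / len(sample)

rows = [{"value": v} for v in range(100_000)]
sel = estimate_selectivity(rows, lambda r: r["value"] % 10 == 0)
print(f"estimated selectivity: {sel:.3f}")   # close to the true 0.1
```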
The document discusses the growth of RDF data volumes on the web. It notes that the Linked Open Data cloud currently contains over 325 datasets with more than 25 billion triples, and the size is doubling every year. It provides examples of popular datasets that are part of the Linked Open Data cloud.
This document summarizes the OAK project meeting in fall 2014. It discusses OAK's history and environment, members, current grants, main research themes, and software. Key areas of research include database techniques for the semantic web, efficient parallel processing of web data, crowdsourcing and graphs, parallel XML and provenance, and hybrid and redundant data stores. Upcoming events include PhD defenses in September and the INRIA activity report.
Nautilus is a SQL query development tool started in 2009 that is still under development. It helps users understand query results by answering questions about why tuples are or aren't received. It uses various algorithms like TedNaive, TedRevised, and NedExplain, which are implemented as open-source Java code in an Eclipse plugin. The tool has over 28,000 lines of code and 230 classes, and relies on dependencies like PostgreSQL, ZQL, and various Java libraries. It is developed by a team that has included past contributors and is currently worked on by Katerina Tzompanaki and Alexandre Constantin.
WaRG is a framework that allows the composition and execution of analytical RDF queries against a given RDF dataset. It contains 4924 lines of Java code and 1091 lines of Q scripts. The framework includes a GUI for visualizing the analytical schema, RDF dataset, and query results. It also allows building and executing analytical queries composed of a classifier query, measure query, and aggregate function. WaRG has dependencies on Kdb and Prefuse, and future work includes Postgres integration and improving intermediate result handling.
The document describes ViP2P, a system for distributed data management and querying in peer-to-peer networks using materialized views. It enables peers to pose queries over distributed data by rewriting queries using available views. The querying peer looks up view definitions, rewrites the query into a logical plan based on the views, generates an optimized physical plan, and executes it to return results. The system uses distributed hash tables, third-party libraries, and supports defining, indexing, materializing and querying distributed views across the peer-to-peer network.
The S4 project is a social semantic search engine that searches Twitter tweets for keywords defined in RDF semantics from DBPedia and ranks the results based on proximity and position of keywords. The code is written in Python and stores tweet data either through serialization or a PostgreSQL database. It retrieves tweets and RDF semantics, stores the parsed tweet data, and runs a search algorithm to return top results to users.
This document describes an RDF saturation algorithm implemented in Java within the RDFVS project. The saturation algorithm is responsible for finding and storing implicit RDF triples by reasoning over input RDF data and RDFS schema. It uses PostgreSQL for data storage and supports updating the saturated triples when new data or schema information is received. The algorithm has been implemented as modules with over 6,000 lines of code by a single contributor.
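A minimal sketch of the saturation idea, restricted to rdf:type propagation through rdfs:subClassOf and iterated to a fixpoint; the actual module covers the full RDFS rule set and stores triples in PostgreSQL.

```python
# Fixpoint saturation sketch: (x rdf:type C1) and (C1 rdfs:subClassOf C2)
# together imply (x rdf:type C2); repeat until no new triples appear.
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def saturate(triples):
    triples = set(triples)
    changed = True
    while changed:                      # fixpoint iteration
        changed = False
        inferred = {(s, TYPE, c2)
                    for (s, p, c1) in triples if p == TYPE
                    for (c1b, p2, c2) in triples
                    if p2 == SUBCLASS and c1b == c1}
        new = inferred - triples
        if new:
            triples |= new
            changed = True
    return triples

data = {(":alice", TYPE, ":Student"), (":Student", SUBCLASS, ":Person")}
print((":alice", TYPE, ":Person") in saturate(data))   # True
```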
The document describes an RDF data generator that can produce synthetic RDF data and schemas based on user-defined parameters. The generator consists of modules that control the generation of literals, data, schemas, distributions, and graphs. It allows configuring how subjects, predicates, and objects are generated by selecting sources and distributions. The generator codebase contains over 3,000 lines of Java code and relies on external Colt, Piccolo, and ViP2P libraries.
This document describes a project that provides methods for estimating the cardinality of conjunctive queries over RDF data. It discusses the key modules including an RDF loader, parser and importer to extract triples from RDF files and load them into a database. The database stores the triples and generates statistics tables. A cardinality estimator takes a conjunctive query and database statistics to output an estimate of the query's cardinality.
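A simplified sketch of such an estimate for a star-shaped conjunctive query, combining per-predicate triple counts under uniformity and independence assumptions; real estimators use finer statistics, and the numbers below are illustrative.

```python
# Cardinality estimate for triple patterns sharing the same subject variable.
def estimate_star_join(patterns, stats, subject_domain):
    """patterns: triple patterns joined on a common subject variable;
    stats[p] = number of triples with predicate p;
    subject_domain = number of distinct subjects in the data."""
    estimate = 1.0
    for _, p, _ in patterns:
        estimate *= stats[p]
    # each additional pattern joins on the subject: divide by the domain size
    estimate /= subject_domain ** (len(patterns) - 1)
    return estimate

stats = {":worksAt": 10_000, ":livesIn": 12_000}
print(estimate_star_join([("?x", ":worksAt", "?y"), ("?x", ":livesIn", "?z")],
                         stats, subject_domain=50_000))   # -> 2400.0
```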
This document describes an RDF query reformulation algorithm. The algorithm takes as input a conjunctive RDF query and RDF Schema, and outputs a union of equivalent conjunctive queries. The code is written in Java and located on a source code repository. It contains several packages for reasoning, rules, signatures, and utilities. Libraries used include Jena and Xerces. Papers related to the algorithm are cited. Future work includes adding more robust tests of reformulated query content.
This document describes an RDF loader that takes plain RDF triple files as input and loads them into a PostgreSQL database. There are two versions - an older Java version and a newer bash script version that is faster. The loader creates tables to store the triples and indexes, and can optionally create an encoded version of the triples table and statistics on the URIs in the triples. It works by extracting triples from the NT files and inserting them into the database tables. Key libraries used include PostgreSQL and Log4j. Potential next steps include handling longer URIs and rewriting the bash version in Java.
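This is not the project's actual loader, but a minimal Python sketch of the same task: reading an N-Triples file and bulk-inserting subject/predicate/object rows into a PostgreSQL triples table with psycopg2. The connection string and the naive line parser are assumptions.

```python
# Minimal N-Triples -> PostgreSQL loader sketch.
import psycopg2

def load_ntriples(path, dsn="dbname=rdf user=postgres"):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
        rows = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                s, p, o = line.rstrip(" .").split(None, 2)   # naive NT parsing
                rows.append((s, p, o))
        cur.executemany("INSERT INTO triples VALUES (%s, %s, %s)", rows)
    conn.close()
```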
This document describes a reuse-based optimization framework for Pig Latin queries. The framework identifies common subexpressions across a workload of Pig Latin scripts, selects the most efficient ones to execute based on a cost model, and reuses their results to compute the original script outputs. It analyzes query graphs to merge equivalent subexpressions and finds an optimal plan using a binary integer linear programming solver. The framework codebase contains classes for the optimizer, normalization rules, and decomposition of join operators.
PAXQuery is an execution engine for the XQuery query language that runs on the Apache Flink platform. It translates XML queries into algebraic trees and maps these trees to PACT plans in order to parallelize query execution. The code is organized into several modules with dependencies between them, and it has external dependencies on Apache Flink and other libraries. Future work includes finishing the development of the paxquery-xparser module and developing a web client interface.
This document summarizes two projects: ConjunctiveQueries, which parses RDF conjunctive query files and represents them as Java objects, and SQLTranslator, which translates the ConjunctiveQuery objects into SQL queries. It provides information on the owners, code locations, volumes, contributors, users, structures, dependencies, and diagram of how the two projects work together.
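To give an idea of the kind of translation involved (our own minimal sketch in Python, not the SQLTranslator code, which is written in Java; the triples(s, p, o) table layout and the example query are assumptions), a conjunctive query of triple patterns can be turned into a self-join over a triples table:

def conjunctive_query_to_sql(triple_patterns):
    """Translate triple patterns (terms starting with '?' are variables) into SQL
    over a triples(s, p, o) table: one table alias per pattern, equality joins
    for shared variables, constants as WHERE filters (simplified sketch)."""
    select, where, seen_vars = [], [], {}
    for i, pattern in enumerate(triple_patterns):
        alias = f"t{i}"
        for column, term in zip(("s", "p", "o"), pattern):
            if term.startswith("?"):
                if term in seen_vars:
                    where.append(f"{seen_vars[term]} = {alias}.{column}")
                else:
                    seen_vars[term] = f"{alias}.{column}"
                    select.append(f"{alias}.{column} AS {term[1:]}")
            else:
                where.append(f"{alias}.{column} = '{term}'")
    froms = ", ".join(f"triples t{i}" for i in range(len(triple_patterns)))
    sql = f"SELECT {', '.join(select)} FROM {froms}"
    return sql + (f" WHERE {' AND '.join(where)}" if where else "")

# Example (hypothetical query): ?x advises ?y and ?y attendsConference ?c
print(conjunctive_query_to_sql([("?x", "advises", "?y"),
                                ("?y", "attendsConference", "?c")]))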
2. Who am I?
• Assistant Professor at IST, University of Lisbon (http://www.ist.utl.pt)
• Researcher at INESC-ID (http://www.inesc-id.pt)
• Former (2000) PhD student at INRIA Rocquencourt / Univ. Versailles
• Main research interests:
  – Data Cleaning
    • Support for User Involvement in Data Cleaning, DaWaK 2011
    • Automatic optimization of the match operation (similarity join)
  – Medical Information Systems
    • A Machine Learning based Natural Language Interface for a Database of Medicines, poster paper, DILS 2014
    • ProntApp: A Mobile Question Answering System for Medicines
  – Information Extraction (the topic of this talk)
4. Agenda
• A Holistic Optimizer for Speeding up Information Extraction Programs
  – PVLDB'13, presented at VLDB'14
  – Joint work with Gonçalo Simões, INESC-ID and IST/Univ. Lisboa, and Luis Gravano, Columbia University
• A Learning-based Approach to Rank Documents (briefly)
  – EDBT'15
  – Joint work with Pablo Barrio, Columbia University, Gonçalo Simões, and Luis Gravano
5. Information Extraction Discovers Structured Information in Text Documents
Example input: "The eruption of Grímsvötn during 22-25 May 2011 brought back the memories of the eruptions of Eyjafjallajökull in 2010."
• Extract Natural Disaster: "eruption of Grímsvötn", "eruptions of Eyjafjallajökull"
• Extract Temporal Expressions: "22-25 May 2011", "2010"
• Select Temporal Expressions From 2011: "22-25 May 2011"
• Extract Occurs-In: ("eruption of Grímsvötn", "22-25 May 2011")
The extracted tuples enable much richer querying and analysis.
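To make the operator view concrete, here is a minimal sketch (toy regular-expression extractors, not a real IE system) of the Extract/Select steps that the slide illustrates over the example sentence:

import re

SENTENCE = ("The eruption of Grímsvötn during 22-25 May 2011 brought back "
            "the memories of the eruptions of Eyjafjallajökull in 2010.")

def extract_natural_disasters(text):
    # Toy entity extractor: "eruption(s) of <Name>".
    return re.findall(r"eruptions? of \w+", text)

def extract_temporal_expressions(text):
    # Toy entity extractor: date ranges like "22-25 May 2011" or bare years.
    return re.findall(r"\d{1,2}-\d{1,2} \w+ \d{4}|\b\d{4}\b", text)

def select_from_2011(temporal_expressions):
    # Selection operator: keep only the temporal expressions from 2011.
    return [t for t in temporal_expressions if "2011" in t]

def extract_occurs_in(disasters, temporal_expressions):
    # Toy relation extractor: pair each disaster with each selected date.
    return [(d, t) for d in disasters for t in temporal_expressions]

disasters = extract_natural_disasters(SENTENCE)
dates_2011 = select_from_2011(extract_temporal_expressions(SENTENCE))
print(extract_occurs_in(disasters, dates_2011))  # [('eruption of Grímsvötn', '22-25 May 2011'), ...]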
6. Information Extraction Specifications and Execution Plans are Graphs of Operators
(Figure: an IE specification and a naïve execution plan, both shown as graphs of operators.)
• Types of operator: EE (entity extractor) and RE (relation extractor)
• Each operator in a plan is labeled with a technique and an algorithm
7. Information Extraction is Challenging and Time-consuming
• Operates on a large set of features
• Relies on complex text analysis
• Often applied over large document collections
(Figure: example features. Lemmatization features: "eruption brought memories eruptions" becomes eruption, bring, memory, eruption. Character capitalization features: "Grímsvötn", "22-25", "brought", "2010" become Ccccccccc, NN.NN, ccccccc, NNNN, or Cc+, N+.N+, c+, N+ in compressed form. Dependency-parse features over "The eruption brought back the memories of the eruptions ... 2010": det, nsubj, prt, det, dobj, prep_of.)
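To make the feature-extraction step concrete, the sketch below shows the kinds of per-token features the slide lists, namely lemmas and character capitalization patterns; the lemma table and pattern rules are simplified assumptions, not the extractors used in the talk:

import re

# Toy lemma table; a real system would use a morphological analyzer.
LEMMAS = {"eruptions": "eruption", "brought": "bring", "memories": "memory"}

def capitalization_pattern(token: str, collapse: bool = True) -> str:
    """Map each character to C (upper), c (lower), N (digit), keep others.
    With collapse=True, runs are abbreviated with '+', e.g. 'Grímsvötn' -> 'Cc+'."""
    pattern = "".join(
        "C" if ch.isupper() else "c" if ch.islower() else "N" if ch.isdigit() else ch
        for ch in token
    )
    if collapse:
        pattern = re.sub(r"(.)\1+", r"\1+", pattern)
    return pattern

def token_features(token: str) -> dict:
    return {
        "lemma": LEMMAS.get(token.lower(), token.lower()),
        "cap_pattern": capitalization_pattern(token, collapse=False),
        "cap_pattern_short": capitalization_pattern(token, collapse=True),
    }

for tok in ["Grímsvötn", "22-25", "brought", "2010"]:
    print(tok, token_features(tok))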
8. IE Implementation Choices to the Rescue
• These implementation choices have an impact on the execution time and on the quality of the results (recall and precision)
  – Recall: fraction of the correct tuples that the plan produces
  – Precision: fraction of the produced tuples that are correct
• Choosing an algorithm for each operator: each operator (A, B, C, D) has alternative implementations (A1, A2, A3, B1, B2, C1, D1), e.g., a CRF-based entity extractor (EE-CRF) with Viterbi, VBS-10, or VBS-5 decoding, or an SVM-based relation extractor (RE-SVM)
• Choosing the operator execution order, e.g.:
  1. Execute A
  2. Find the sentences that produced results for A
  3. Execute B with those sentences only
• Choosing the document retrieval strategy: query for relevant documents (e.g., query for ["natural disaster"], [earthquake], [eruption], [tornado], ...) or scan the whole collection
(Figure: an example execution plan, A2, D1, C1, B1, over Query (210K docs).)
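As an illustration of the last two choices, here is a minimal sketch (with hypothetical extract_a, extract_b, join, and collection interfaces) of running one operator first and restricting the next operator to the sentences that produced results, under either document retrieval strategy:

def run_plan(doc_source, extract_a, extract_b, join):
    """Run operator A first, then run operator B only on the sentences
    where A produced results, and join their outputs (simplified sketch)."""
    tuples = []
    for doc in doc_source:
        a_results = {}  # sentence -> tuples produced by A
        for sentence in doc.sentences():
            found = extract_a(sentence)
            if found:
                a_results[sentence] = found
        for sentence, a_tuples in a_results.items():
            b_tuples = extract_b(sentence)  # B runs on far fewer sentences
            tuples.extend(join(a_tuples, b_tuples))
    return tuples

# Two document retrieval strategies for doc_source (hypothetical collection API):
#   scan the whole collection:      doc_source = collection.all_documents()
#   query for relevant documents:   doc_source = collection.search(
#       ['"natural disaster"', "earthquake", "eruption", "tornado"])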
9. State-of-the-art IE Optimization Systems
• SystemT and Cimple both use a cost-based optimizer to choose the execution order of operators
• SystemT also uses heuristics to choose among the different algorithms for an operator
• Neither system allows approximate results
(Figure: the three implementation choices, i.e., the algorithm for each operator, the operator execution order, and the document retrieval strategy (query for relevant documents vs. scan the whole collection), annotated with the systems that address them.)
10. State-of-the-art IE Optimization Systems (cont.)
• SQoUT addresses the choice of document retrieval strategy (query for relevant documents vs. scan the whole collection)
• An information extraction optimizer should make these implementation choices collectively!
11. Making Implementation Choices Collectively Leads to Faster Plans
Recall and precision constraints: Recall >= 50%, Precision >= 90%.
• Naive plan (A1, D1, C1, B1; scan of 1M docs): 13h 16m, recall 100%, precision 100%
• Operator-order optimization only (A1, D1, C1, B1; scan of 1M docs): 7h 30m, recall 93%, precision 100%
• Algorithm optimization only (A3, D1, C1, B2; scan of 1M docs): 9h 23m, recall 91%, precision 95%
• Document-retrieval optimization only (A1, D1, C1, B1; query, 153K docs): 2h 07m, recall 50%, precision 100%
• Direct combination plan (A3, D1, C1, B2; query, 153K docs): 57m, recall 39%, precision 98% (hard to control the expected recall and precision)
• Collectively optimized plan (A3, D1, C1, B2; query, 210K docs): 1h 31m, recall 50%, precision 96%
• By combining multiple choices we get a faster plan
• Only with collective optimization can we control the quality of the results
12. Holistic Optimizer for Information Extraction
• Inputs: a document collection, an IE specification, and recall and precision constraints (e.g., Precision >= 90%, Recall >= 50%)
• The optimizer enumerates candidate execution plans (e.g., A3, D1, C1, B2 and A2, D1, C1, B1, both over query-based retrieval)
• A document sample is used to estimate the predictor parameters for the recall and precision predictor
• The predictor estimates the execution time, recall, and precision of each candidate plan (e.g., 57m, recall 39%, precision 78% for one candidate; 1h 24m, recall 98%, precision 95% for another)
• The best plan satisfying the constraints is selected; in this example, A2, D1, C1, B1 over Query (210K docs), running in 43m with recall 50% and precision 95%
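A minimal sketch of this selection loop, with hypothetical enumerate_plans and predict helpers standing in for the plan enumeration and prediction components described above:

def choose_best_plan(ie_spec, collection_sample, min_recall, min_precision,
                     enumerate_plans, predict):
    """Pick the fastest candidate plan whose predicted recall and precision
    satisfy the constraints (sketch; enumerate_plans and predict are
    stand-ins for the enumeration and prediction components)."""
    best_plan, best_time = None, float("inf")
    for plan in enumerate_plans(ie_spec):
        estimate = predict(plan, collection_sample)  # dict with time, recall, precision
        if (estimate["recall"] >= min_recall
                and estimate["precision"] >= min_precision
                and estimate["time"] < best_time):
            best_plan, best_time = plan, estimate["time"]
    return best_plan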
13. Predictor Parameters Estimation
• Predictor parameters are determined from a document sample
• Three phases:
  1. Sampling, to select the subset of input documents to use in the estimation (stratified sampling)
  2. Extraction, to retrieve tuples from the sample using the plans produced by the enumeration step (dynamic programming)
  3. Estimation, to determine the optimizer parameters using the results of phase 2 (maximum-a-posteriori estimation)
(Figure: document sample → predictor parameters → recall and precision predictor.)
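As a simple illustration of phase 3, a per-operator parameter such as the probability that a sampled document yields a tuple can be estimated with a maximum-a-posteriori rule; the sketch below assumes a Beta prior and Laplace-style smoothing, which is an illustrative choice rather than the paper's exact model:

def map_estimate_yield_probability(successes, trials, alpha=1.0, beta=1.0):
    """MAP estimate of a Bernoulli parameter under a Beta(alpha + 1, beta + 1)
    prior: (successes + alpha) / (trials + alpha + beta).
    With alpha = beta = 1 this is the Laplace-smoothed frequency."""
    return (successes + alpha) / (trials + alpha + beta)

# Hypothetical example: operator A2 produced tuples in 37 of 200 sampled documents.
p_hat = map_estimate_yield_probability(successes=37, trials=200)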
14. Extraction Quality Prediction
• Based on two predictions for each operator:
  – the number of input tuples of the operator
  – the number of output tuples of the operator
(Figure: tuples t1, t2, t3, t4 flowing along alternative paths (Path 1, Path 2) through operators such as A2, D1, and C1, and through the alternative implementations A1, A2, A3 of operator A.)
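One way to read this: if we can predict, for each operator implementation, what fraction of its input tuples survives, we can propagate tuple counts along a plan and derive an expected output size. The sketch below uses hypothetical per-operator selectivities and is only an illustration of the idea, not the actual predictor:

def predict_output_tuples(input_tuples, selectivities):
    """Propagate a tuple count through a chain of operators, where
    selectivities[i] is the predicted fraction of input tuples that
    operator i keeps (illustrative model, not the paper's estimator)."""
    count = input_tuples
    for s in selectivities:
        count *= s
    return count

# Example: 10,000 candidate tuples flowing through the path A2 -> D1 -> C1.
expected_output = predict_output_tuples(10_000, [0.8, 0.9, 0.7])  # 5040 tuples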
15. Experimental Validation
• Impact of the sample size and prediction quality
• Impact of the parameter estimation strategy
• Impact of individual implementation choices
• Impact of precision constraints
• Comparison with the state-of-the-art techniques
The results presented next focus on the impact of individual implementation choices and on the comparison with the state-of-the-art techniques.
17. Experimental Setup
• Datasets, including researchers' homepages
• Information extraction programs, covering both dense and sparse relations:
  – Advises and ConferencesDates over researchers' homepages. E.g., Kedar Bellare's homepage ("PhD student at University of Massachusetts, Amherst under supervision of Andrew McCallum. Publications: Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment, EMNLP 2009, August 2009") yields Advises(Advisor: Andrew McCallum, Advisee: Kedar Bellare) and ConferencesDates(Conference: EMNLP 2009, Date: August 2009)
  – DiseaseOutbreaks. E.g., "The Haiti cholera outbreak between 2010 and 2013 was the worst epidemic of cholera in recent history" yields (Disease: cholera, Time Period: between 2010 and 2013)
  – OrganizationAffiliation. E.g., "Google co-founders Larry Page and Sergey Brin recently sat down with billionaire venture capitalist Vinod Khosla for a lengthy interview." yields (Person: Larry Page, Organization: Google) and (Person: Sergey Brin, Organization: Google)
18. Experimental Setup (cont.)
• IE optimization systems (baselines):
  – Cimple
  – SystemT
  – SQoUT
  – SQoUT-Boosted: a variation of SQoUT that uses the execution plans produced by Cimple and SystemT
• IE optimization techniques:
  – Optimized-Ret: implementation choice of the retrieval strategy
  – Optimized-Alg: implementation choice of the algorithms
  – Optimized-Order: implementation choice of the operator execution order
  – Holistic-MAP: uses maximum-a-posteriori estimation for the parameters
  – Holistic-ML: uses maximum-likelihood estimation for the parameters
19. Impact of Individual Implementation Choices
• Holistic-MAP obtains the fastest plan in every case
• Optimized-Ret becomes too slow when the desired recall increases
• Optimized-Order and Optimized-Alg consistently produce slow plans
• Holistic optimization outperforms individual optimization
(Results shown for the ConferencesDates relation.)
21. Comparison with State-of-the-Art Techniques
• For sparse relations, Holistic-MAP:
  – obtains faster plans
  – does not always match the recall constraints
  – can correct the errors with respect to the recall constraints as the target recall increases
(Results shown for the DiseaseOutbreaks relation.)
22. Conclusions: Holistic Optimizer
• IE programs can be optimized along multiple interrelated dimensions, namely the choice of:
  – algorithms for each extraction operator
  – operator execution order
  – document retrieval strategy
• We presented the first holistic optimization approach for IE, one that optimizes all of these dimensions collectively
  – It outperforms the state-of-the-art techniques and selects efficient plans that meet target recall and precision values
23. What Comes Next?
• Ranking models for IE (second part of the talk)
• Distributed executions for IE (ongoing work)
  – by determining document placement in a distributed file system
(Figure: documents processed in random order vs. ranked order.)
24. Agenda
• A Holistic Optimizer for Speeding up Information Extraction Programs
  – PVLDB'13, presented at VLDB'14
  – Joint work with Gonçalo Simões and Luis Gravano
• A Learning-based Approach to Rank Documents
  – EDBT'15
  – Joint work with Pablo Barrio, Gonçalo Simões, and Luis Gravano
25. Reducing IE Processing Time
• Only a small, topic-specific fraction of the entire collection is useful
  – Ex.: only 2% of the documents in a New York Times archive, mostly environment-related, are useful for Natural Disaster-Location with a state-of-the-art IE system
  ➢ We should focus extraction on these documents and ignore the rest
• Useful documents share distinctive words and phrases
  – Ex.: "earthquake", "storm", "richter", "volcano eruption" for Natural Disaster-Location
  ➢ We should learn to distinguish useful documents for an IE task
• The information extraction system labels documents as useful or not, for free
  ➢ We could use the result of an IE process to generate an ever-expanding training set for learning and further identifying useful documents
• A document is useful if it produces output for a given IE task
26. Existing Approaches: QXtract and FactCrawl
• QXtract and FactCrawl learn from a small document sample and so exhibit far-from-perfect recall
• FactCrawl re-ranks documents using the quality of the learned queries
• FactCrawl relies on queries derived from a small set of documents and does not adapt to newly processed documents
27. Our Approach
• Focus on the potentially useful documents for the extraction task at hand
  ➢ A learning-to-rank approach for document ranking
• The results of the extraction process form an ever-expanding training set
  ➢ An adaptive approach that updates the document ranking continuously
(Figure: the ranking model scores each document in the collection, f(d_i) = s_i, and documents are processed in score order s_1 ≥ s_2 ≥ s_3 ≥ ... ≥ s_n. Features are words and phrases, and attribute values. Processed documents yield new training instances and new words (e.g., "lava", "fissure") that feed back into learning and ranking.)
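A minimal sketch of the ranking side (a sparse linear scoring model over word features; the weights and feature extraction are illustrative, not the learned models from the paper):

from collections import Counter

def score(document_text, weights):
    """Score a document with a sparse linear model over word features
    (illustrative; the actual systems use richer features and learners)."""
    features = Counter(document_text.lower().split())
    return sum(weights.get(word, 0.0) * count for word, count in features.items())

def rank(documents, weights):
    """Return documents ordered by decreasing score, s_1 >= s_2 >= ..."""
    return sorted(documents, key=lambda d: score(d, weights), reverse=True)

# Hypothetical weights learned so far.
weights = {"earthquake": 2.1, "volcano": 1.7, "eruption": 1.5, "interview": -0.4}
ordered = rank(["Still recovering from an earthquake, Chile ...",
                "Google co-founders sat down for a lengthy interview ..."], weights)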
28. Ranking Documents Adaptively for IE
(Figure: the adaptive ranking loop over a document collection.)
• Learning: the model learns that "tornado" and "earthquake" are markers of useful documents
• Document processing and update detection: useful documents about volcanoes exist but have not yet been observed prominently in the IE process
  – Not useful: "… 'Aftermath' narrates the story of a man that goes missing …"
  – Useful: "… Still recovering from an earthquake, Chile is threatened by the eruption of Copahue volcano …", which yields <volcano, chile> (an earlier document had yielded <tornado, hawaii>)
• Online relearning: the new information can potentially help improve the ranking, so update! The model now learns that "volcano" and "eruption" are also markers of useful documents
• Ranking adaptation: the remaining documents are re-ranked with the updated model
29. Ranking Documents Adaptively for IE: Key Ideas
• To address efficiency, rely on online learning
  – Train the ranking model incrementally, one document at a time
• To handle large (and expanding) feature sets, rely on in-training feature selection
  – The learning-to-rank algorithm can efficiently identify the most discriminative features during the training of the document ranking model
• We propose two learning-to-rank techniques for IE that integrate online learning and in-training feature selection: BAgg-IE and RSVM-IE
• And two update detection techniques for document ranking adaptation: Top-K and Mod-C
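A minimal sketch of what an update-detection policy in this spirit might look like: trigger re-ranking when the set of top-weighted features changes (a Top-K-style rule in our simplified reading, not the paper's exact definition):

def should_update(previous_top_features, weights, k=20):
    """Trigger re-ranking when the set of k highest-weight features changes
    (a Top-K-style update-detection rule; simplified illustration)."""
    current_top = {f for f, _ in sorted(weights.items(),
                                        key=lambda item: abs(item[1]),
                                        reverse=True)[:k]}
    return current_top != previous_top_features, current_top

# After processing a batch of documents and updating the online learner:
new_weights = {"earthquake": 2.1, "tornado": 1.9, "volcano": 2.4, "eruption": 2.2}
changed, top = should_update({"earthquake", "tornado"}, new_weights, k=2)
# changed is True: "volcano" and "eruption" now outrank the old top features,
# so the unprocessed documents should be re-ranked with the updated model.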
30. Experimental Validation
1. Impact of the learning-to-rank approach
2. Impact of sampling strategies
3. Impact of adaptation
4. Impact of update detection
5. Scalability of our approach
6. Comparison with the state-of-the-art ranking strategies
The results presented next focus on the comparison with the state-of-the-art ranking strategies.
32. Experimental Setup
• Dataset: 1.8 million articles from 1987-2007
• Information extraction systems:
  – Complex extraction systems: CRFs, SVM kernels
  – Simple extraction systems: HMMs, text patterns
• Extraction tasks, covering both dense and sparse relations:
  – Disease-Outbreaks. E.g., "The Haiti cholera outbreak between 2010 and 2013 was the worst epidemic of cholera in recent history" yields (Disease: cholera, Time Period: between 2010 and 2013)
  – Person-Organization. E.g., "Google co-founders Larry Page and Sergey Brin recently sat down with billionaire venture capitalist Vinod Khosla for a lengthy interview." yields (Larry Page, Google) and (Sergey Brin, Google)
  – Person-Career. E.g., "'This is not a victimless crime,' said Jim Kendall, president of the Washington Association of Internet Service Providers." yields (Jim Kendall, president)
  – Man Made Disaster-Location. E.g., "A fire destroyed a Cargill Meat Solutions beef processing plant in Booneville." yields (fire, Booneville)
  – Other relations: Person-Charge, Election-Winner, Natural Disaster-Location
33. Recall Analysis
(Results for the Disease-Outbreaks relation, with Mod-C update detection, compared against our adaptive implementation of the state of the art.)
• Our techniques bring a significant improvement
• RSVM-IE performs best, as it prioritizes useful documents better, favoring adaptation
35. Conclusion: Document Ranking for Scalable Information Extraction
• Running an IE system over large text collections is computationally expensive
• We proposed a lightweight, adaptive approach and learning-based alternatives
  – Online learning algorithms with in-training feature selection: RSVM-IE, BAgg-IE
  – Update detection based on feature changes: Mod-C, Top-K
  – RSVM-IE + Mod-C performs best: useful documents are better prioritized, enabling richer, more efficient ranking adaptation
(Figure: a text collection feeds the IE system, which outputs tuples such as <tornado, Florida> and <volcano, Chile>.)
36. Future Work: Ranking at Different Granularities
• Few collections on the Web are relevant to an IE task
  – Prioritize them based on the number of useful documents they contain
• Few sentences in a text document output tuples for an IE task
  – Prioritize them based on usefulness and diversity
37. Before We Leave…
• Try REEL, our toolkit to easily develop and evaluate IE systems
  – Open source and freely available at http://reel.cs.columbia.edu
Thank you!