"PWE: Datalog & ASP for the Rest of Us" discusses using the Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics such as using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines, combining them with the Python ecosystem and allowing interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
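The rule-firing idea at the heart of the provenance discussion is easy to show in miniature. Below is a hedged Python sketch, not PWE's actual implementation: a naive Datalog evaluation of transitive closure that records a (rule, body, head) triple for every rule firing, so a derived fact can be explained afterwards. The edge facts, rule names, and the tc predicate are all illustrative.

```python
# Minimal sketch (not PWE's code): naive Datalog evaluation of transitive
# closure that records one "rule firing" per derivation step, the kind of
# provenance the talk uses to explain query results.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
facts = {("edge", e) for e in edges}
firings = []  # (rule_name, body_facts, derived_fact)

def step(facts):
    new = set()
    # tc(X, Y) :- edge(X, Y).
    for (x, y) in [f[1] for f in facts if f[0] == "edge"]:
        head = ("tc", (x, y))
        if head not in facts:
            new.add(head)
            firings.append(("tc_base", [("edge", (x, y))], head))
    # tc(X, Z) :- tc(X, Y), edge(Y, Z).
    for (x, y) in [f[1] for f in facts if f[0] == "tc"]:
        for (y2, z) in [f[1] for f in facts if f[0] == "edge"]:
            if y == y2:
                head = ("tc", (x, z))
                if head not in facts:
                    new.add(head)
                    firings.append(("tc_step", [("tc", (x, y)), ("edge", (y, z))], head))
    return new

while True:  # run to fixpoint, one derivation round at a time
    new = step(facts)
    if not new:
        break
    facts |= new

# Why is tc(a, d) true? The recorded firings give a one-step explanation,
# and following them recursively yields the full provenance graph.
for rule, body, head in firings:
    if head == ("tc", ("a", "d")):
        print(rule, body, "=>", head)
```

Tracking the round in which each fact is first derived (the "states" mentioned above) would only require tagging each entry of firings with the iteration number.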
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... - ijdpsjournal
The Science Information Network (SINET) is the Japanese academic backbone network for more than 800 universities and research institutions. SINET traffic is characteristically enormous and highly variable. In this paper, we present a task-decomposition based anomaly detection of the massive and high-volatility session data of SINET. Three main features are discussed: task scheduling, traffic discrimination, and histogramming. We adopt a task-decomposition based dynamic scheduling method to handle SINET's massive session data stream. In the experiment, we analysed SINET traffic from 2/27 to 3/8 and detected some anomalies by LSTM-based time-series data processing.
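The abstract gives no model details, so the following is only a plausible sketch of LSTM-based time-series anomaly detection, with Keras as an assumed stack: an LSTM learns to predict the next value of a traffic series, and points whose prediction error exceeds a threshold are flagged. Window size, layer width, and the 3-sigma threshold are all assumptions.

```python
# Hedged sketch of LSTM-based anomaly detection on a traffic series:
# forecast the next value, then flag points with large prediction error.
import numpy as np
import tensorflow as tf

series = np.sin(np.linspace(0, 60, 600)) + 0.05 * np.random.randn(600)
series[400] += 3.0  # injected anomaly standing in for a traffic spike

win = 20  # look-back window (an assumption)
X = np.stack([series[i:i + win] for i in range(len(series) - win)])[..., None]
y = series[win:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(win, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

errors = np.abs(model.predict(X, verbose=0).ravel() - y)
threshold = errors.mean() + 3 * errors.std()  # 3-sigma rule (an assumption)
print("anomalous indices:", np.where(errors > threshold)[0] + win)
```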
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... - acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment, such as a clustered setup, is employed for tackling such scenarios. The Apache Hadoop distribution is one such cluster framework, distributing voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
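To make the map/reduce framing concrete, here is a hedged pure-Python sketch of one Apriori counting pass expressed as mapper and reducer functions. It is illustrative, not the paper's Hadoop code, and it omits Apriori's candidate-pruning step for brevity: the mapper emits (itemset, 1) for every k-itemset in a transaction, and the reducer sums counts and applies the support threshold.

```python
# Illustrative map/reduce formulation of Apriori's counting pass.
from collections import defaultdict
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
min_support = 2

def mapper(transaction, k):
    # emit (candidate k-itemset, 1) for each candidate in the transaction
    for candidate in combinations(sorted(transaction), k):
        yield candidate, 1

def reducer(pairs):
    # sum per-candidate counts and keep only the frequent ones
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return {c: n for c, n in counts.items() if n >= min_support}

k = 1
while True:
    pairs = (p for t in transactions for p in mapper(t, k))
    frequent_k = reducer(pairs)
    if not frequent_k:
        break
    print(f"frequent {k}-itemsets:", frequent_k)
    k += 1
```

On Hadoop, the mapper and reducer would run as distributed tasks over database partitions, with one MapReduce job per level k.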
A New Architecture for Group Replication in Data Grid - Editor IJCATR
Nowadays, grid systems are a vital technology for running high-performance programs and solving large-scale problems in science, engineering and business. In grid systems, heterogeneous computational resources and data are shared between independent organizations that are scattered geographically. A data grid is a type of grid that relates computational and storage resources. Data replication is an efficient way to obtain high performance and high availability in a data grid by storing numerous replicas at different locations, e.g., grid sites. In this research, we propose a new architecture for dynamic group data replication. In our architecture, we add two components to the OptorSim architecture: a Group Replication Management (GRM) component and a Management of Popular Files Group (MPFG) component. OptorSim was developed by the European DataGrid project to evaluate replication algorithms. Using this architecture, groups of popular files are replicated to grid sites at the end of each predefined time interval.
A comparative study in dynamic job scheduling approaches in grid computing en... - ijgca
Grid computing is one of the most interesting research areas for present and future computing strategy and methodology. The dramatic increase in the complexity of scientific applications, and of some non-scientific applications, increases the need for distributed systems in general and grid computing specifically. One of the main challenges in a grid computing environment is how jobs (tasks) are handled. Job scheduling is the activity of scheduling the submitted jobs in the grid environment, and there are many approaches to it. This paper provides an experimental study of different approaches to grid computing job scheduling. The approaches covered are "4-Levels/RMFF" and our previously published approach "X-Levels/XD-Binary Tree". First, an introduction to grid computing and job scheduling techniques is provided. Then the existing approaches are described. After that, experiments and results give a practical evaluation of these approaches from different perspectives. The comparative study concludes that overall average task waiting time improves by approximately 30% with the X-Levels/XD-Binary Tree approach compared to the 4-Levels/RMFF approach.
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION - ijdms
Distributed databases and data replication are effective ways to increase the accessibility and reliability of unstructured, semi-structured and structured data in order to extract new knowledge. Replication offers better performance and greater availability of data. With the advent of Big Data, new storage and processing challenges are emerging. To meet them, Hadoop and DHTs compete in the storage domain, and MapReduce and others in distributed processing, each with their strengths and weaknesses. We propose an analysis of the circular and radial replication mechanisms of the CLOAK DHT and evaluate their performance through a comparative study of simulation data. The results show that radial replication is better for storage, while circular replication gives better search results.
International Journal of Engineering and Science Invention (IJESI) - inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Effective Sparse Matrix Representation for the GPU Architectures - IJCSEA Journal
General-purpose computation on the graphics processing unit (GPU) is prominent in the current high-performance computing era. Porting or accelerating data-parallel applications onto the GPU yields a baseline performance improvement because of the increased number of computational units, and better performance can be achieved with application-specific fine tuning for the architecture under consideration. One widely used, computation-intensive kernel is sparse matrix-vector multiplication (SpMV) in sparse-matrix based applications. Most existing sparse matrix data formats were developed for the central processing unit (CPU) or multi-core processors. This paper presents a new sparse matrix representation designed for the graphics processor architecture that, for the class of applications fitting the proposed format, gives 2x to 5x performance improvement over CSR (compressed sparse row format), 2x to 54x over COO (coordinate format), and 3x to 10x over the CSR vector format. It also gives 10% to 133% improvement in the memory transfer (of only the sparse matrix access information) between CPU and GPU. The paper details the new format and its requirements, with complete experimentation details and comparison results.
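A worked example clarifies the two baseline formats the paper compares against. The sketch below builds CSR arrays from COO triples and runs a scalar sparse matrix-vector multiply over them; the paper's proposed GPU format is not reproduced here, only the baselines.

```python
# COO (parallel row/col/value arrays) to CSR (values + column indices
# sorted by row, plus row pointers), then a scalar SpMV y = A @ x.
import numpy as np

# COO triples for a 3x4 matrix with 5 nonzeros
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 3])
vals = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

order = np.argsort(rows, kind="stable")       # group nonzeros by row
csr_vals, csr_cols = vals[order], cols[order]
row_ptr = np.zeros(3 + 1, dtype=int)          # row_ptr[r]:row_ptr[r+1] spans row r
np.add.at(row_ptr, rows + 1, 1)
row_ptr = np.cumsum(row_ptr)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.zeros(3)
for r in range(3):
    for i in range(row_ptr[r], row_ptr[r + 1]):
        y[r] += csr_vals[i] * x[csr_cols[i]]
print(y)  # [ 70.  60. 240.]
```

On a GPU, the layout of csr_vals and the assignment of rows to threads is exactly what format designs like the paper's aim to optimize.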
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE-DUPLICATION - cscpconf
Storage capacity and data volume have recently been growing in parallel, and at any instant the data may exceed the storage capacity. A good RDBMS should reduce redundancies as far as possible to maintain consistency and control storage cost. Moreover, a huge database with replicated copies wastes space that could be used for other purposes. The first aim should be to apply data de-duplication techniques in the RDBMS setting, checking access time complexity along with space complexity. Here, different data de-duplication approaches are discussed. Finally, based on the drawbacks of those approaches, a new approach involving the row id, column id and domain-key constraint of an RDBMS is theoretically illustrated. Though this model may appear tedious and unpromising, for a large database with many tables containing many lengthy fields it can be shown to reduce space complexity drastically at the same access speed.
Distributed Algorithm for Frequent Pattern Mining using Hadoop Map Reduce Fram... - idescitation
With the rapid growth of information technology and of many business applications, mining frequent patterns and finding associations among them requires handling large and distributed databases. As the FP-tree is considered the most compact data structure for holding data patterns in memory, there have been efforts to parallelize and distribute it to handle large databases; however, it incurs a lot of communication overhead during mining. In this paper, a parallel and distributed frequent pattern mining algorithm using the Hadoop MapReduce framework is proposed, which shows strong performance on large databases. The proposed algorithm partitions the database in such a way that it works independently at each local node, generating local frequent patterns by sharing a global frequent-pattern header table. The local frequent patterns are merged at the final stage. This reduces the total communication overhead during both structure construction and pattern mining. The itemset count is also taken into consideration, reducing processor idle time. The Hadoop MapReduce framework is used effectively in all steps of the algorithm. Experiments carried out on a PC cluster with 5 computing nodes show execution-time efficiency compared to other algorithms, and the experimental results show that the proposed algorithm efficiently handles scalability for very large databases.
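The partition-and-merge idea can be shown in a greatly simplified form. The hedged Python sketch below is not the paper's FP-tree code: it builds the shared header table of globally frequent items in one pass, lets each partition count patterns locally against that table, and merges the local counts at the end.

```python
# Simplified sketch of the partition/local-mine/merge structure.
from collections import Counter
from itertools import combinations

partitions = [
    [{"a", "b"}, {"a", "c"}],       # node 1's share of the database
    [{"a", "b", "c"}, {"b", "c"}],  # node 2's share
]
min_support = 2

# Pass 1 (global): the shared header table of frequent single items
item_counts = Counter(i for part in partitions for t in part for i in t)
header = {i for i, n in item_counts.items() if n >= min_support}

# Pass 2 (local, independent per node): count itemsets over header items
def mine_local(part):
    local = Counter()
    for t in part:
        kept = sorted(t & header)
        for k in range(1, len(kept) + 1):
            local.update(combinations(kept, k))
    return local

# Merge stage: sum local counts, then apply the global support threshold
merged = Counter()
for part in partitions:
    merged.update(mine_local(part))
print({p: n for p, n in merged.items() if n >= min_support})
```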
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
Multi-Granularity User-Friendly List Locking Protocol for XML Repetitive Data - Waqas Tariq
We propose a list data sharing model, which utilizes semantics expressed in the DTD for concurrency control of shared XML trees. In this model, tree-updating actions such as inserting and/or deleting subtrees are allowed only for repetitive parts. The proposed model guarantees that the resulting XML tree is valid even when tree update actions are applied concurrently. In addition, we propose a new multi-granularity locking mechanism called the list locking protocol. This protocol locks the (index) list of repetitive child nodes and thus allows updates to the descendants while a child node's subtree is being deleted or inserted. The protocol is expected to be more accessible and to produce fewer locking objects on XML data compared to other methods. Moreover, the prototype system shows that list locking is well suited to the user interface of shared XML clients by enabling/disabling the corresponding edit operation controls.
Implementation of p pic algorithm in map reduce to handle big data - eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Converting UML Class Diagrams into Temporal Object Relational DataBase - IJECEIAES
A number of active researchers and experts are engaged in developing and implementing new mechanisms and features in time-varying database management systems (TVDBMS) in response to the demands of the modern business environment. Time-varying data management has mostly been considered with either attribute or tuple timestamping schemas. Our main approach is to offer a better solution to the stated limitations of existing work by providing nonprocedural data definitions and temporal data queries through a conversion that is as technically complete as possible, allowing all conceptual details of the UML class specifications to be easily realized and shared from a conception and design point of view. This paper contributes a logical design schema represented by UML class diagrams, which are handled by stereotypes to express a temporal object-relational database with attribute timestamping.
Over time, machine learning inference workloads have become more and more demanding in terms of latency and throughput, with multiple models deployed in the same system. This scenario leaves large room for runtime and memory optimizations, which current systems fail to exploit because they treat ML models and tasks as black boxes.
In contrast, Pretzel adopts a white-box description of ML models, which allows the framework to perform optimizations over deployed models and running tasks, saving memory and increasing overall system performance. In this talk we will show the motivations behind Pretzel, its current design, and possible future developments.
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS - cscpconf
This paper presents a detailed survey of the scheduling and allocation techniques in High Level Synthesis (HLS) found in the research literature, along with the methodologies and techniques reported there for improving speed, (silicon) area and power in High Level Synthesis.
SCIENTIFIC WORKFLOW CLUSTERING BASED ON MOTIF DISCOVERY - ijcseit
In this paper, clustering of scientific workflows is investigated. It proposes encoding workflows as sets of embedded workflow motifs: common patterns of workflow steps and relationships are replaced with indices. Motifs are defined as small functional units that occur much more frequently than expected; they can reveal hidden relationships while keeping as much underlying information as possible. To obtain a good estimate of the distances between observed workflows, this work formulates scientific workflow clustering with set descriptors instead of vector-based descriptors, and uses k-means as a popular clustering algorithm. However, one of the biggest limitations of k-means is that the number of clusters, K, must be specified before the algorithm is applied. To address this problem, a method based on the shuffled frog leaping algorithm (SFLA) is proposed. Simulation results show that the proposed method is better than PSO and GA algorithms at selecting K.
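The switch from vector to set descriptors is the key move, so here is a hedged illustration: a tiny k-medoids-style loop over workflows represented as sets of motif indices, using Jaccard distance. The paper's SFLA-based selection of K is not reproduced; K is fixed, and the data and seeds are invented.

```python
# Clustering set descriptors with Jaccard distance (k-medoids style).
workflows = [
    {1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4},  # motif-index sets, one per workflow
    {7, 8}, {7, 8, 9}, {8, 9},
]

def jaccard_dist(a, b):
    return 1.0 - len(a & b) / len(a | b)

K = 2
medoids = [workflows[0], workflows[3]]  # seeds; K fixed for this sketch
for _ in range(10):
    clusters = [[] for _ in range(K)]
    for w in workflows:
        nearest = min(range(K), key=lambda i: jaccard_dist(w, medoids[i]))
        clusters[nearest].append(w)
    # the new medoid minimizes total distance within its cluster
    medoids = [
        min(c, key=lambda m: sum(jaccard_dist(m, w) for w in c)) if c else medoids[i]
        for i, c in enumerate(clusters)
    ]
print(clusters)
```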
Novel Database-Centric Framework for Incremental Information Extraction - ijsrd.com
Information extraction (IE) is an active research area that seeks techniques to uncover information from large collections of text. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents; in most cases this involves processing human-language texts by means of natural language processing (NLP). Recent activities in document processing, such as automatic annotation and content extraction, can be seen as information extraction. Many applications call for methods to automatically extract structured information from unstructured natural language text, and due to the inherent challenges of natural language processing, most existing methods tend to be domain specific. This project presents a new paradigm for information extraction: the intermediate output of each text-processing component is stored, so that only an improved component has to be re-deployed over the entire corpus. Extraction is then performed on both the previously processed data from the unchanged components and the updated data generated by the improved component. Such incremental extraction can yield a tremendous reduction in processing time. There is also a mechanism to generate extraction queries from both labeled and unlabeled data; query generation is critical so that casual users can specify their information needs without learning the query language.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
STORAGE GROWING FORECAST WITH BACULA BACKUP SOFTWARE CATALOG DATA MINING - csandit
Backup software information is a potential source for data mining: not only the unstructured stored data from all the backed-up servers, but also the backup job metadata, which is stored in a database known as the catalog. Mining this database, in particular, could be used to improve backup quality, automation and reliability, predict bottlenecks, identify risks and failure trends, and provide specific report information that cannot be fetched from closed-format proprietary backup software databases. Ignoring such a data mining project can be costly, with lots of unnecessary human intervention, uncoordinated work and pitfalls, such as backup service disruption caused by insufficient planning. The specific goal of this practical paper is to use Knowledge Discovery in Databases, time series, stochastic models and R scripts to predict backup storage data growth. This project could not be done with traditional closed-format proprietary solutions, since it is generally impossible to read their database data from third-party software because of deliberate vendor lock-in. Nevertheless, it is very feasible with Bacula, currently the third most popular backup software worldwide, and open source. The paper focuses on the backup storage demand prediction problem using the most popular prediction algorithms; among them, the Holt-Winters model had the highest success rate for the tested data sets.
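For concreteness, here is what the forecasting step with the best-performing model (Holt-Winters) might look like using statsmodels on synthetic data; in the paper the input series would be storage volumes mined from Bacula's catalog, and the synthetic trend and weekly seasonality below are assumptions.

```python
# Holt-Winters forecast of backup storage demand (synthetic stand-in data).
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

days = pd.date_range("2024-01-01", periods=120, freq="D")
volume = 100 + 0.5 * np.arange(120) + 10 * np.sin(2 * np.pi * np.arange(120) / 7)
series = pd.Series(volume, index=days)  # daily backup volume in GB

model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=7)
fit = model.fit()
print(fit.forecast(30))  # predicted demand for the next 30 days
```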
Towards an Infrastructure for Enabling Systematic Development and Research of... - Rafael Ferreira da Silva
Presentation held at the 17th IEEE eScience Conference
Scientific workflows have been used almost universally across scientific domains, and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs) have been developed to provide abstractions for creating and executing workflows conveniently, efficiently, and portably. While these efforts are all worthwhile, there are now hundreds of independent WMSs, many of which are moribund. As a result, the WMS landscape is segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. Consequently, many teams, small and large, still elect to build their own custom workflow solution rather than adopt, or build upon, existing WMSs. This current state of the WMS landscape negatively impacts workflow users, developers, and researchers. In this talk, I will provide a view of the state of the art and some of my previous research and technical contributions, and identify crucial research challenges in the workflow community.
An octa core processor with shared memory and message-passing - eSAT Journals
Abstract: In this era of fast, high-performance computing, efficient optimizations are needed both in the processor architecture and in the memory hierarchy. Every day, advances in communication and multimedia applications push the number of cores in the main processor upward: dual-core, quad-core, octa-core and so on. But to enhance the overall performance of a multiprocessor chip, there are stringent requirements on inter-core synchronization. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS III (microprocessor without interlocked pipelined stages) ISA, handles only integer instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared-memory core are interconnected using a 3x3 2-D mesh topology network-on-chip (NoC) with a virtual channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. To verify the functionality of the fully synthesized multi-core processor, a matrix multiplication operation is mapped onto it; the multiplications and additions for each element of the resultant matrix are partitioned and scheduled across the eight cores to get maximum throughput. All processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
This talk will examine issues of workflow execution, in particular using the Pegasus Workflow Management System, on distributed resources and how these resources can be provisioned ahead of the workflow execution. Pegasus was designed, implemented and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target cyberinfrastructure. To support these workflow abstractions Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing the data flow, job scheduling, fault recovery and adaptation of their applications. In some cases, it is beneficial to provision the resources ahead of the workflow execution, enabling the re-use of resources across workflow tasks. The talk will examine the benefits of resource provisioning for workflow execution.
Benchmarking open source deep learning frameworks - IJECEIAES
Deep Learning (DL) is one of the hottest fields in machine learning. To foster the growth of DL, several open source frameworks have appeared, providing implementations of the most common DL algorithms. These frameworks vary in the algorithms they support and in the quality of their implementations. The purpose of this work is to provide a qualitative and quantitative comparison among three such frameworks: TensorFlow, Theano and CNTK. To ensure that our study is as comprehensive as possible, we consider multiple benchmark datasets from different fields (image processing, NLP, etc.) and measure the performance of the frameworks' implementations of different DL algorithms. For most of our experiments, we find that CNTK's implementations are superior to the others under consideration.
Reconciling Conflicting Data Curation Actions: Transparency Through Argument... - Bertram Ludäscher
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program P_AF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
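The core move, solving an argumentation framework through a logic-program-style fixpoint, fits in a few lines. The sketch below is a minimal illustration with invented update names, not the paper's encoding: it computes the grounded (well-founded) extension, so uncontroversial updates come out accepted, defeated ones rejected, and mutual conflicts remain undecided for user review.

```python
# Grounded extension of an argumentation framework by naive fixpoint.
attacks = {("u2", "u1"), ("u3", "u2"), ("u4", "u5"), ("u5", "u4")}
args = {a for edge in attacks for a in edge}

accepted, rejected = set(), set()
changed = True
while changed:
    changed = False
    for a in args - accepted - rejected:
        attackers = {x for (x, y) in attacks if y == a}
        if attackers <= rejected:      # every attacker defeated: accept
            accepted.add(a); changed = True
        elif attackers & accepted:     # attacked by an accepted argument: reject
            rejected.add(a); changed = True

print("accepted:", accepted)                     # {'u3', 'u1'}
print("rejected:", rejected)                     # {'u2'}
print("undecided:", args - accepted - rejected)  # {'u4', 'u5'}: shown to users
```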
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion - Bertram Ludäscher
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion! - Bertram Ludäscher
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database Rules - Bertram Ludäscher
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11] Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? - Bertram Ludäscher
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Research Objects to Reproducible Science Tales - Bertram Ludäscher
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise - Bertram Ludäscher
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th... - Bertram Ludäscher
Bertram Ludäscher and Santiago Núñez-Corrales.
Presentation at "Research Synthesis in the Hierarchy of Hypotheses" Workshop, October 10-12, 2018.
Schloss Herrenhausen, Hannover, Germany.
Incremental Recomputation: Those who cannot remember the past are condemned ... - Bertram Ludäscher
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations - Bertram Ludäscher
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
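A toy trace shows why the annotations matter. The hedged sketch below is not the paper's answer-set encoding: under the default all-outputs-depend-on-all-inputs rule, the lineage of log.txt drags in artifacts it never used, and a single per-output annotation removes the false positives. All file and step names are invented.

```python
# Lineage inference from a trace, with and without dependency annotations.
trace = [
    ("split",   {"ins": ["raw.csv"],                 "outs": ["train.csv", "test.csv"]}),
    ("analyze", {"ins": ["train.csv", "config.yml"], "outs": ["model.bin", "log.txt"]}),
]
# Annotation: log.txt is produced from config.yml alone, not train.csv.
annotations = {("analyze", "log.txt"): ["config.yml"]}

def lineage(artifact, annotated):
    deps = set()
    for step, io in trace:
        if artifact in io["outs"]:
            ins = annotations.get((step, artifact), io["ins"]) if annotated else io["ins"]
            for i in ins:
                deps.add(i)
                deps |= lineage(i, annotated)  # follow producers transitively
    return deps

print(lineage("log.txt", annotated=False))  # {'train.csv', 'config.yml', 'raw.csv'}
print(lineage("log.txt", annotated=True))   # {'config.yml'}: false positives gone
```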
An ontology-driven framework for data transformation in scientific workflows - Bertram Ludäscher
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on "real science," that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate semantic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, appropriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
From Provenance Standards and Tools to Queries and Actionable ProvenanceBertram Ludäscher
Presentation given at AGU 2017, New Orleans.
Session IN42C: Research Integrity, Reproducible Science, and Quantifying Return on Investment in Data and Software Management II: Focus on Challenges with Provenance, Reuse, and Citation.
Title: From Provenance Standards and Tools to Queries and Actionable Provenance (Invited)
Abstract. The W3C PROV standard provides a minimal core for sharing retrospective provenance information for scientific workflows and scripts. PROV extensions such as DataONE’s ProvONE model are necessary for linking runtime observables in retrospective provenance records with conceptual-level prospective provenance information, i.e., workflow (or dataflow) graphs. Runtime provenance recorders, such as DataONE’s RunManager for R, or noWorkflow for Python capture retrospective provenance automatically. YesWorkflow (YW) is a toolkit that allows researchers to declare high-level prospective provenance models of scripts via simple inline comments (YW-annotations), revealing the computational modules and dataflow dependencies in the script. By combining and linking both forms of provenance, important queries and use cases can be supported that neither provenance model can afford on its own.
We present existing and emerging provenance tools developed for the DataONE and SKOPE (Synthesizing Knowledge of Past Environments) projects. We show how the different tools can be used individually and in combination to model, capture, share, query, and visualize provenance information. We also present challenges and opportunities for making provenance information more immediately actionable for the researchers who create it in the first place. We argue that such a shift towards “provenance-for-self” is necessary to accelerate the creation, sharing, and use of provenance in support of transparent, reproducible computational and data science.
Techniques to optimize the pagerank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Learn SQL from basic queries to Advanced queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
1. Possible Worlds Explorer (PWE):
Datalog & Answer Set Programming
for the Rest of Us
Sahil Gupta
Jessica Yi-Yun Cheng
Bertram Ludäscher
PHILADELPHIA LOGIC WEEK
June 3-7, 2019
2. Intros should come first (“provenance”)
• Memory Lane (& Quiz): SLD-CNF …
• [Cha88] Chan, D., Constructive Negation Based on the Completed Database. 5th ICLP, 1988
• … F-Logic (Datalog + OO) … Flip … Flora(-2) …
• … Statelog (Datalog + States) ….
• Scientific Workflow Design for Mere Mortals (“Kepler”)
• … Datalog as a Lingua Franca for Querying Provenance …
• … Declarative Debugging for Mere Mortals …
2
Fig. 4: Workflow W (top) vs Trace T (bottom): Traces are associated to workflows, guaranteeing structural consistency; workflow-level (firing or data) constraints induce temporal constraints ≤f and ≤d on traces.
…and in which way, they can be stateful; how they consume their inputs, produce their outputs; and so on. As a result, different systems use different models of provenance (MoPs), with different temporal semantics. Thus, instead of "hard-wiring" a fixed temporal semantics to a particular graph-based MoP, we again use logic constraints to obtain a "customizable" temporal semantics. (3) We illustrate this concept by providing firing constraints at the workflow level, which induce temporal constraints ≤f at the level of traces (cf. Figure 4). These temporal-constraint generating rules can be chosen to conform to the temporal axioms in [15], or to accom…
Fig. 1. A phylogenetics workflow implemented in the Kepler system. Kepler workflows are built from actors (boxes) that perform computational tasks. Users can select actors from component libraries (panel on the left) and connect them on the canvas to form a workflow graph (center/right). Connections specify dataflow between actors. Configuration parameters can also be provided (top center), e.g., the location of input data and the initial jumble seed value are given. A director (top left corner on the canvas) is a special component, specifying a model of computation and controlling its execution.
…become broadly adopted as a technology for assembling and automating analyses, these systems must provide scientists concrete and demonstrable advantages, both over general-purpose scripting languages and more focused scientific computing environments currently occupying the tool-integration niche.

Scientific workflow systems. Existing scientific workflow systems generally share a number of common goals and characteristics [17] that differentiate them from tool-integration approaches based on scripting languages and other platforms with tool-automation features. One of the most significant differences is that whereas scripting approaches are largely based on imperative languages, scientific workflow systems are typically based on dataflow languages [23,17] in which workflows are represented as directed graphs, with nodes denoting computational steps (or actors), and connections representing data dependencies (and data flow) between steps. Many systems (e.g., [3,27,29,33]) allow workflows to be created and edited using graphical interfaces (see Fig. 1 for an example in Kepler). The dataflow paradigm is well-suited for supporting modular workflow design and facilitating reuse of components [23,25,27,5]. Many workflow systems (e.g., [33,27]) further allow workflows to be used as actors in other workflows, thus providing workflow authors an abstraction mechanism for hiding implementation details and facilitating even more reuse.

One advantage of workflow systems that derives from this dataflow-orientation is the ease with which data produced by one actor can be routed to multiple downstream actors. While the flow of data to multiple receivers is often difficult to describe clearly in plain text, the dataflow approach makes explicit this detailed routing of data. For instance, in Fig. 1 it is clear that data can flow directly from Refine alignment only to Iterate over seeds. The result is that scientific workflows can be more declarative about the interactions between actors than scripts, where the flow of data between components is typically hidden within (often complex) code. The downside of this approach is that if taken too far, specifications of complex scientific workflows can become a confusing tangle of actors and wires unless the workflow specification language provides additional, more sophisticated means for declaring how data is to be routed (as comad does, see below as well as [30,6]).

Other notable advantages of scientific workflow systems over traditional approaches are their potential for transparently optimizing workflow performance and automatically recording data and process provenance. Unlike most scripting language implementations, scientific workflow systems often provide capabilities for executing workflow tasks concurrently where data dependencies between tasks allow, either in an "assembly-line" fashion with actors connected in a linear pipeline performing their tasks simultaneously, or in parallel with multiple such pipelines operating at the same time (e.g., over multiple input data sets or via explicit branches in the workflow specification) [43,34,30]. Many scientific workflow systems also can record, store, and query data and process dependencies that result during one or more workflow runs, enabling scientists to later investigate the data and processes used to derive results and to examine intermediate data products [38,31].

While these and other advantages of systems designed specifically to automate scientific workflows help to position these technologies as viable alternatives to traditional approaches based on scripting languages and the like, much is yet required to achieve the vision of putting workflow automation fully into the hands of "mere mortals" [17]. Much remains to be done to realize the vision of scientists untrained in programming and relatively ignorant of the details of information technology rapidly composing, deploying, executing, monitoring, and reviewing the results of scientific workflows without assistance from information-technology experts.

Contributions and paper outline. In this paper we describe key aspects of scientific workflow systems that can help broader-scale adoption of workflow technology by scientists, and demonstrate how these properties can be realized by a novel and generic workflow modeling paradigm that extends existing dataflow computation models. In Section 2, we present what we see as important desiderata for scientific workflow systems from a workflow modeling and design perspective. In Section 3, we describe our main contribution, the collection oriented modeling and design (comad) framework, for delivering on the expectations described in Section 2. Our framework is especially suited for cases where data is nested in structure and computational steps can be pipelined (which is often true, e.g., in bioinformatics). The comad framework provides an assembly-line style computation approach that…
PWE: Datalog & ASP for the Rest of Us
3. From past Provenance … to the Future!
• Time flies: “… for Mere Mortals” => “for the Rest of Us”
• If Datalog & ASP are so great, why don’t more people use it?
– MA: “one generation has to die …” ?
• Alt-answer: “Be a teacher!” (Tim Minchin) + Use Tools!
3PWE: Datalog & ASP for the Rest of Us
4. Human Cycles vs Machine Cycles
• Where is the semantics (e.g., in the “Semantic Web”)?
• Ask a DB-theory/LP person: …
– "A query is a question about a concept"
– Google it => 1 hit (Bing it => millions of “hits” ..)
• Datalog & ASP occupy a sweet spot …
– … between conceptual modeling & computational thinking
– … optimizing human cycles!
– cf. Brains & Brawns (Molham Aref’s keynote)
4PWE: Datalog & ASP for the Rest of Us
5. Motivation for PWE
• Datalog & ASP for a larger community?
– … meet users (novices) where they are!
– … plus: a “logic lab” for DBLP gurus, teachers, …
• Ideas:
– Wrap existing engines (dlv, clingo, … XSB … <yours> ..)
– Allow easy combination with the Python ecosystem!
• … meet users where they are!
– … inside of Jupyter (and deployed in the cloud…)
• It isn’t that hard
– … with the right people … :-)
PWE: Datalog & ASP for the Rest of Us 5
6. Partial Recall (Datalog 2.0 Vienna 2012)
PWE: Datalog & ASP for the Rest of Us 6
Pop Quiz: Why/how come tc(a,b) ?
• Why/how is (a,b) in the transitive closure tc of e?
• What about ?-tc(e,X) vs ?-tc(X,e)?
7. e/2 cycles => tc/2 SLD(NF) issues
Prolog’s SLD-NF resolution does not seem to work for declarative/naïve tc/2 rules
=> What’s happening anyways?
7PWE: Datalog & ASP for the Rest of Us
8. Explaining Derivations via Provenance
PWE: Datalog & ASP for the Rest of Us 8
[r1] tc(X,Y) :- e(X,Y)
[r2] tc(X,Y) :- e(X,Z), tc(Z,Y)
A firing [F] → (H) is called unfounded if all derivations of F require H as an assumption!
Here tc(a,b) has (at least) two different derivations, neither of which is unfounded.
However, [r2] → tc(c,b) is unfounded: the firing of r2 depends on tc(b,b), which can only be derived by already assuming the desired conclusion tc(c,b)!
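For reference, here is a minimal, self-contained clingo encoding consistent with the example discussed above (the EDB facts are an assumption, reconstructed from the slide's figure):

% assumed example EDB: edge a->b plus a 2-cycle between b and c
e(a,b). e(b,c). e(c,b).
tc(X,Y) :- e(X,Y).          % [r1]
tc(X,Y) :- e(X,Z), tc(Z,Y). % [r2]

clingo terminates on this program and returns the full transitive closure in its unique answer set, whereas Prolog's SLD(-NF) resolution can loop on recursive goals over the cyclic e/2 (the issue raised on slide 7).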
9. 9PWE: Datalog & ASP for the Rest of Us
Step 1: Capturing Rule Firings (“F-trick”)
• Capture rule firings and keep "witness info" (existential variables)
– no premature projections in the rule head, please!
• Example. Instead of a given rule …
tc(X,Y) :- e(X,Z), tc(Z,Y).
… we rather use these two rules, keeping witnesses Z around:
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).
tc(X,Y) :- fire2(X,Z,Y).
(Example rule firings shown as a table in the slide.)
This is the “secret sauce” in Orchestra, provenance polynomials (Val’s TaPP Keynote), …
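A runnable sketch of the F-trick on the same example (the EDB is assumed as before; the fire1/2 rule for [r1] is added here for symmetry):

e(a,b). e(b,c). e(c,b).          % assumed example EDB
fire1(X,Y)   :- e(X,Y).          % instrumented [r1]
tc(X,Y)      :- fire1(X,Y).
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y). % instrumented [r2]: witness Z is kept
tc(X,Y)      :- fire2(X,Z,Y).
#show tc/2.
#show fire1/2.
#show fire2/3.

The fire1/fire2 atoms in the answer set are exactly the rule firings, i.e., the raw material for the provenance graph built next.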
10. 10PWE: Datalog & ASP for the Rest of Us
Step 2: Graph Transformation (“G-trick”)
• Reify provenance atoms & firings in a labeled graph g/3
• Example for N = 2 subgoals and 1 head atom …
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y). % two in-edges
tc(X,Y) :- fire2(X,Z,Y). % one out-edge
… generates N+1 “reification rules” (Skolems are safe):
g( e(X,Z), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( tc(Z,Y), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( skfire2(X,Z,Y), out, tc(X,Y) ) :- fire2(X,Z,Y).
(Figure: example instance generated by these rules; the firing node fire2(a,b,d) has in-edges from e(a,b) and tc(b,d), and an out-edge to tc(a,d).)
This is the “secret sauce” in Frame-Logic, RDF, …
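Once firings are reified in g/3, extracting the provenance subgraph of a single goal atom is itself a small Datalog program. A sketch, assuming the g/3 facts produced by the reification rules above; goal/1, reach/1, and gPruned/3 are illustrative helper names, with tc(a,b) as an example goal:

goal(tc(a,b)).                              % assumed example goal
reach(A) :- goal(A).
reach(F) :- g(F, out, A), reach(A).         % firings producing a reached atom
reach(A) :- g(A, in, F), reach(F).          % atoms feeding a reached firing
gPruned(A,L,B) :- g(A,L,B), reach(A), reach(B).
#show gPruned/3.

This is the same graph-pruning idea used later (slide 22) to reproduce figures in Python.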
11. 11PWE: Datalog & ASP for the Rest of Us
Step 3: Using Statelog (“S-Trick”)
• Use Statelog to keep a record of firing rounds:
– Add a state (= stage) argument to provenance rules and graph relations
– EDB facts are derived in state 0.
– Subsequently: extract the earliest round for firings and IDB facts
• Example:
rin : firer(S1, X) :- B1(S, X1), … , Bn(S, Xn), next(S, S1).
rout : H(S, Y) :- firer(S, X).
(Figure: stateful provenance graph over the EDB e(a,b), e(b,c), e(c,b): r1 firings in round [1] derive tc(a,b) and tc(c,b), an r2 firing in round [2] derives tc(b,b), and further r2 firings appear in round [3].)
This is the “secret sauce” in Statelog, Datalog1S, …
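A concrete clingo rendering of the S-trick, sketched under some assumptions (the round bound maxs, the chain EDB, and the name firstRound/2 are illustrative only):

#const maxs = 10.                 % assumed bound on derivation rounds
next(S,S+1) :- S = 0..maxs-1.
e(a,b,0). e(b,c,0). e(c,b,0).     % EDB facts hold in state 0 (example)
e(X,Y,S1) :- e(X,Y,S), next(S,S1).                  % EDB persists across states
fire1(X,Y,S1)   :- e(X,Y,S), next(S,S1).            % stateful [r1] firings
fire2(X,Z,Y,S1) :- e(X,Z,S), tc(Z,Y,S), next(S,S1). % stateful [r2] firings
tc(X,Y,S) :- fire1(X,Y,S).
tc(X,Y,S) :- fire2(X,Z,Y,S).
% earliest round in which each tc fact is derived
firstRound(tc(X,Y),R) :- tc(X,Y,_), R = #min{ S : tc(X,Y,S) }.
#show firstRound/2.

firstRound/2 recovers exactly the round annotations [1], [2], … shown in the figures.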
12. 12PWE: Datalog & ASP for the Rest of Us
How long (does it take) Provenance!
• These definitions are recursive but well-founded
• The numbers can be easily obtained via Statelog
This is the “secret sauce” behind declarative profiling …
13. 13PWE: Datalog & ASP for the Rest of Us
Declarative Profiling
• Number of Facts:
derived(H) :- g(_, out, H).
derivedHeadCount(C) :- C = #count{ H : derived(H) }.
• Number of Firings:
firing(F) :- g(F, out, _).
firingCount(C) :- C = #count{ F : firing(F) }.
(Figure: derivation-round profiles of tc over the chain EDB e(a,b), e(b,c), e(c,d), e(d,e). (a) The right-recursive program needs four rounds, deriving tc(a,e) only in round [4]. (b) The doubly-recursive variant converges in fewer rounds, e.g., tc(a,c), tc(b,d), tc(c,e) already appear in round [2].)
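With the state argument from the S-trick, the same counting idiom yields a per-round profile. A sketch, assuming the stateful fire2/4 relation and the maxs constant from the S-trick sketch above (firingsInRound/2 is an illustrative name):

round(0..maxs).
firingsInRound(S,C) :- round(S), C = #count{ X,Z,Y : fire2(X,Z,Y,S) }.

Plotting firingsInRound/2 per program variant makes the difference between, e.g., the right-recursive and doubly-recursive tc programs immediately visible.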
14. … from a Vienna Datalog 2.0 paper
... but where is the code?
Can I reproduce the results?
Work with the examples?
Build on them? Extend them?
… try new things???
PWE: Datalog & ASP for the Rest of Us 14
15. ASP + PWE: Possible Worlds Explorer
15
https://github.com/idaks/PW-explorer
https://github.com/idaks/PWE-demos
PWE: Datalog & ASP for the Rest of Us
16. 16PWE: Datalog & ASP for the Rest of Us
PWE/Python visualization of input
graph e/2 (solid edges) and output
graph tc/2 (dashed edges)
… the F-trick (= firing rules =
provenance capture)
… compute via clingo
17. 17PWE: Datalog & ASP for the Rest of Us
… the G-trick (reify as a graph
using Skolem terms)
18. 18PWE: Datalog & ASP for the Rest of Us
… the S-trick
(Statelog encoding)
19. graph/4 = Firing + Graph + Statelog
• Et Voilà: Rule Firings captured, reified as a Graph,
derivations through States!
• It's all relational! :-)
19PWE: Datalog & ASP for the Rest of Us
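Putting the three tricks together, one plausible reading of the slide's graph/4 is sketched below (an illustration under the assumptions of the earlier sketches, not necessarily the exact PWE encoding):

% reify stateful firings into a labeled graph with a round argument
graph(e(X,Z),         in,  skfire2(X,Z,Y), S) :- fire2(X,Z,Y,S).
graph(tc(Z,Y),        in,  skfire2(X,Z,Y), S) :- fire2(X,Z,Y,S).
graph(skfire2(X,Z,Y), out, tc(X,Y),        S) :- fire2(X,Z,Y,S).

Every node, edge, and round then lives in a single relation, ready to be loaded into Pandas or Graphviz.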
20. Let’s mix in some Python and Graphviz
PWE: Datalog & ASP for the Rest of Us 20
21. …now reproducing a figure from [KLS12]!
PWE: Datalog & ASP for the Rest of Us 21
22. … and another one via graph pruning
… in Python
PWE: Datalog & ASP for the Rest of Us 22
23. Answer Set Programming: a superpower for “doing semantics”
• ASP = DB+LP+KR+SAT
• Reasoning spectrum: …queries … constraint solving
• … OWL/DL, FO, SQL, Datalog, ..., ASP, ...
• ... occupying a “sweet spot”
• ... but needs GTD extensions:
• PWE = ASP + Python + Jupyter
https://github.com/idaks/PWE-demos
23
PWE: Datalog & ASP for the Rest of Us
24. Datalog .. ASP: Hitting KR&R Sweet Spots
24
Variations on FOL + Recursion + Negation = S/I/W/P/…-Datalog … ASP …
Many Results from Theory
Getting Things Done with Jupyter notebooks & Python
RPQ: similar
Unique 3-valued Model vs Set of Stable Models
PWE: Datalog & ASP for the Rest of Us
25.
tc(X,Y) :- e(X,Y)  # (1)--e(X,Y)-->(2)
tc(X,Y) :-         # (1)--exists:Z-->(3)
  e(X,Z),          # (3)->(4)-e(X,Z)->(5)
  tc(Z,Y).         # (3)--X:=Z-->(1)
EDB: e(a,b), e(b,b)
(Figures: the game diagram of the tc rules with positions (1)-(5), and the instantiated move graph over the EDB nodes a, b.)
Flum, Kubierschky, Ludäscher: Total and partial well-founded Datalog coincide. ICDT 1997, Delphi, Greece.
25
Eureka moment:
1. query evaluation = evaluation game (argument about truth in a database)
2. provenance = winning strategies (justified/winning arguments)
PWE: Datalog & ASP for the Rest of Us
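The game view can be condensed into one textbook rule (a standard encoding, not taken from the slides): a position is won iff some move leads to a position that is not won; under the well-founded semantics, true/false/undefined atoms correspond to won/lost/drawn positions.

move(a,b). move(b,a). move(b,c).   % example move graph; c has no moves
win(X) :- move(X,Y), not win(Y).

Here c is lost (no moves), b is won (it can move to the lost c), and a is lost (its only move reaches the won b). In the tc evaluation game, the provenance of an answer is exactly such a winning strategy.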
26. Reproducing some TaPP’12 Graph Queries:
Datalog as a Lingua Franca for Querying … Provenance
PWE: Datalog & ASP for the Rest of Us 26
28. Visualized in PWE via Python under the hood!
PWE: Datalog & ASP for the Rest of Us 28
29. … for a few Python LOCs more …
(growing the target audience)
PWE: Datalog & ASP for the Rest of Us 29
30. … we get highlighting of the LCAs!
PWE: Datalog & ASP for the Rest of Us 30
31. “Boring” (ASCII) answer sets become
informative Timeline Visualization
(Here: IC Checking & Repair rules!)
PWE: Datalog & ASP for the Rest of Us 31
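For flavor, a generic IC-checking-and-repair pattern in ASP (an illustrative sketch, not the actual rules behind this slide): guess deletions, re-check the constraint, and minimize the repair; each answer set is then one possible world.

advisor(a,b). advisor(b,a).                   % assumed example data violating the IC
{ del(X,Y) } :- advisor(X,Y).                 % repair choice: delete tuples
advisorR(X,Y) :- advisor(X,Y), not del(X,Y).  % repaired relation
:- advisorR(X,Y), advisorR(Y,X), X != Y.      % IC: no mutual advising
#minimize{ 1,X,Y : del(X,Y) }.                % prefer minimal repairs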
32. … visualizing clusters of PWs (answer sets) …
PWE: Datalog & ASP for the Rest of Us 32
… easily plug in different
ranking/distance/similarity functions!
33. … to discover additional structure!
• … discover similar (here:
isomorphic) solutions
• … and display them!
PWE: Datalog & ASP for the Rest of Us 33
34. One more thing …
… time allowing!
PWE: Datalog & ASP for the Rest of Us 34
35. 35
(Euler diagrams shown in the slide.) The five RCC-5 base relations:
• Congruence: X == Y
• Inclusion: X > Y
• Inverse Inclusion: X < Y
• Overlap: X >< Y
• Disjointness: X ! Y
Origins:
Euler diagrams ...
... limited FO reasoning
... RCC-5++ reasoning
Application: Geo-Taxonomy Alignment
The secret sauce inside: Moved from FO reasoner to … qualitative reasoning
(RCC-5) to … Answer Set Programming (ASP) + some more secret sauce
Taxonomy Alignment Problem
PWE: Datalog & ASP for the Rest of Us
36. • Euler/X & LeanEuler projects
employ qualitative reasoning
(RCC-5), implemented in ASP to
align, merge taxonomies, debug
alignments, etc.
36
Reasoning with Incomplete Knowledge:
Exploring Possible Worlds
PWE: Datalog & ASP for the Rest of Us
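The possible-worlds core of such an alignment can be sketched in ASP as well (a toy illustration only; the actual Euler/X encoding adds RCC-5 composition axioms and much more): guess exactly one base relation per concept pair and constrain the guess with user-supplied articulations. The predicate and taxon names below are all hypothetical.

rel(eq). rel(lt). rel(gt). rel(ov). rel(dj).       % ==, <, >, ><, !
tax1(x1). tax2(y1). tax2(y2).                      % assumed example taxa
1 { art(X,Y,R) : rel(R) } 1 :- tax1(X), tax2(Y).   % exactly one relation per pair
:- not art(x1,y1,gt).                              % assumed articulation: x1 > y1

Each answer set is one possible world of the alignment; an inconsistent set of articulations yields no answer sets at all.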
37. Summary & Conclusions
• Possible Worlds Explorer (PWE):
– loosely coupling (= wrapping) Datalog & ASP systems
• DLV, clingo, …, XSB, … , <you-name-it>
– … with Python
– … and Jupyter notebooks
=> where the users are!
=> leveraging Python, Pandas, … analytics and visualization!
• Datalog & ASP for the rest of us!
– … and for LP / DB-Theory gurus :-)
• Work in progress
– join or fork: https://github.com/idaks/PW-explorer
– or talk to us to get started: ludaesch@Illinois.edu
PWE: Datalog & ASP for the Rest of Us 37
39. Some Partial Provenance ...
• [Cha88] Chan, D.: Constructive Negation Based on the Completed Database. 5th ICLP, Seattle, 1988
• [KLW95] Kifer, M., Lausen, G., Wu, J.: Logical Foundations of Object-Oriented and Frame-Based Languages. JACM 42(4), 1995, 741–843.
• [LLM98] Lausen, G., Ludäscher, B., May, W.: On Active Deductive Databases: The Statelog Approach. Transactions and Change in Logic Databases, LNCS 1472, 1998, 69–106.
• [MBZL09] McPhillips, T., Bowers, S., Zinn, D., Ludäscher, B.: Scientific Workflow Design for Mere Mortals. Future Generation Computer Systems 25(5), 2009, 541–551.
• [DKBL12] Dey, S., Köhler, S., Bowers, S., Ludäscher, B.: Datalog as a Lingua Franca for Provenance Querying and Reasoning. TaPP, Boston, 2012.
• [KLS12] Köhler, S., Ludäscher, B., Smaragdakis, Y.: Declarative Datalog Debugging for Mere Mortals. Datalog 2.0: Datalog in Academia & Industry, LNCS 7494, Vienna, 2012, 111–122.
• [CFSY17] Cheng, Y.-Y., Franz, N., Schneider, J., Yu, S., Rodenhausen, T., Ludäscher, B.: Agreeing to disagree: Reconciling conflicting taxonomic views using a logic-based approach. Association for Information Science and Technology 54(1), 2017, 46–56.
• [GCL19] Gupta, S., Cheng, Y.-Y., Ludäscher, B.: Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us. Datalog 2.0: 3rd Workshop on the Resurgence of Datalog in Academia & Industry, Philadelphia, 2019, 44–55.
PWE: Datalog & ASP for the Rest of Us 39