The document describes scientific workflows for big data and the challenges they present. It discusses Prof. Shiyong Lu's work on developing the VIEW system for designing, executing, and analyzing scientific workflows. The VIEW system provides a runtime environment for workflows, supports their execution on servers or clouds, and enables efficient storage, querying and visualization of workflow provenance data.
1. Scientific Workflows for Big Data
Prof. Shiyong Lu
Big Data Research Laboratory
Department of Computer Science
Wayne State University
shiyong@wayne.edu
3. Big Data Challenges
Looking for a needle in a haystack.
For Big Data, data management and movement is a frequent challenge
…between facilities, archives, researchers…
Many files, large data volumes.
With security, reliability, performance…
Ian Foster: Father of Grid Computing
4. Big Data Challenges
Looking for a needle in a haystack:
Capture
Curation
Storage
Search
Sharing
Analysis
Visualization
5. Big Data Science
Large Hadron Collider (LHC)
15 PB/year
173 TB/day
500 MB/sec
Higgs discovery is “only possible because of the extraordinary achievements of … grid computing”
—Rolf Heuer, CERN DG
6. Data flows at Argonne National Lab
Data management challenges.
[Diagram: estimated Argonne data flows in TB/day among external data sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, data analysis, and long-term storage.]
Credit: Ian Foster
7. Big Data demands new CS research
For example, existing clustering algorithms are typically cubic in N, and when N is too big, they do not work! - Jim Gray
8. What is Big Data?
• Definition of Big Data: “…refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.” (from the nsf.gov website)
9. Big Data Challenges
• Challenges of Big Data: “national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.” (from the nsf.gov website)
13. Introduction
Data Intensive Science
From computation intensive to data intensive.
A new research cycle – from data capture and data curation to data analysis and data visualization.
“In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” (“Beyond the Data Deluge”, Science, Vol. 323, No. 5919, pp. 1297–1298, 2009.)
14. Introduction
Scientific Workflow
A formal specification of a scientific process.
Represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.
Applications: Bioinformatics, Oceanography, Neuroinformatics, Astronomy, etc.
15. Introduction
Scientific Workflow Management System (SWFMS)
Supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow.
Existing SWFMSs:
• Taverna
• Kepler
• Pegasus
• VisTrails
• VIEW
• …
18. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
19. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
on a dedicated VIEW server
20. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
on a dedicated VIEW server
in a Cloud computing environment
21. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
on a dedicated VIEW server
in a Cloud computing environment
Supports efficient collection, storage, querying, and visualization of workflow provenance
22. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
on a dedicated VIEW server
in a Cloud computing environment
Supports efficient collection, storage, querying, and visualization of workflow provenance
Is currently used in several bioinformatics applications, including genomic recombination and gene conversion data analysis
29. An Example Workflow in VIEW
FiberFlow
Transforms large-scale neuroimaging data to knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical intelligence in neural diseases.
30. VIEW: A Prototypical SWFMS
Minimum complexity for users, but sophisticated techniques behind the scenes.
Provides a clear and simple abstraction for manipulating and coordinating resources.
Service-oriented architecture.
Intuitive, user-friendly GUI.
33. A Reference Architecture for SWFMSs
Other advantages of VIEW:
VIEW workflows can be executed in other systems (specifications are not tied to a particular SWFMS).
Use of open standards (Web Services, XML) promotes collaboration, interoperability, and extensibility of the system.
Workflow and data models implemented in VIEW are specifically geared towards heavy scientific data.
36. Workflow Engine
The Workflow Engine is the heart of the system:
Workflow orchestration.
Workflow execution.
Coordination of other subsystems.
The Workflow Engine in VIEW:
Dataflow based.
Pure workflow composition.
Workflow constructs.
37. SWL
Example of our proposed scientific workflow specification language (SWL).
39. Workflow Execution
Workflow Execution
Primitive workflow.
Unary-construct-based workflow.
Graph-based workflow.
• A workflow graph is a composition of workflows by binary constructs.
• Optimistic scheduling.
41. Data Product Manager
Data Product Manager:
Solid data model.
Scalable data storage.
Convenient data access.
Data independence.
The Data Product Manager is based on the collectional data model.
42. DPM Architecture
Architecture of the Data Product Manager.
[Diagram: the Data Product Manager comprises a Data Access Layer (main server and master), a Data Mapping Layer (nodes with node databases), and a Data Storage Layer holding data sets in relational databases and file repositories.]
43. DPL
Example of the XML description of a collectional data product.
44. Data Storage
VIEW supports two ways of storage:
A collection can be stored in a table containing a set of its key/value pairs, whose values are references to existing collections.
A collection can be expanded and stored in two tables.
• The Group By operator.
• The Compress operator.
45. Data Typing
A data product is:
a Collection,
or a List,
or an Empty.
The List type:
Introduced in the workflow engine.
Each element is a data product.
Heterogeneous.
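As a rough sketch (not VIEW's actual implementation), this data-product typing can be modeled as a tagged union; the class and field names below are illustrative only:

from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Collection:
    """Nested key/value collection; values are relations or sub-collections."""
    pairs: dict

@dataclass
class ProductList:
    """Heterogeneous list whose elements are themselves data products."""
    items: list

# A data product is a collection, a list, or empty (modeled here as None).
DataProduct = Optional[Union[Collection, ProductList]]

# Scalars stand in for primitive data products in this sketch.
matrix = ProductList([ProductList([1, 2, 3]), ProductList([4, 5, 6])])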
46. Collectional Data Querying
Operators are implemented in primitive workflows:
Arithmetic operators.
Boolean operators.
Collectional operators.
List operators.
Queries are implemented in workflow compositions.
47. Example
Given a table Reference<Student, Company, GradTime>, find the total number of students offered a job by each company in each graduation year; sort the result in descending GradTime and ascending Company order.
SQL query:
SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC;
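To illustrate the point that queries are built by composing operator workflows, here is an equivalent in-memory pipeline in Python; the Reference rows below are hypothetical:

from itertools import groupby

# Hypothetical Reference table as (Student, Company, GradTime) tuples.
reference = [
    ("alice", "Acme", 2008), ("bob", "Acme", 2008),
    ("alice", "Initech", 2009), ("carol", "Acme", 2009),
]

# GROUP BY Company, GradTime and COUNT(DISTINCT Student).
key = lambda row: (row[1], row[2])
groups = groupby(sorted(reference, key=key), key=key)
counts = [(company, grad, len({r[0] for r in rows}))
          for (company, grad), rows in ((k, list(g)) for k, g in groups)]

# ORDER BY GradTime DESC, Company ASC.
counts.sort(key=lambda r: (-r[1], r[0]))
print(counts)  # [('Acme', 2009, 1), ('Initech', 2009, 1), ('Acme', 2008, 2)]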
49. Key Requirements for Workflow Modeling
R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and
hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.
50. A Scientific Workflow Model
Workflows are the basic and the only operands for workflow composition.
[Diagram: workflows W1 and W2, each with input ports i1…ik and an output port o1, are composed into workflow W3.]
Task components (e.g. Web services) are constructed into primitive workflows (a.k.a. tasks), which are the basic building blocks of scientific workflows.
51. A Scientific Workflow Model
A workflow construct is a mapping from a set of workflows to a workflow:
Unary workflow constructs
Binary workflow constructs
…
A construct C takes a set of workflows W1, …, Wn as input, and composes them into Wc as the output workflow.
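A minimal sketch of this idea in Python, treating a workflow as a function from input data products to an output product, and a construct as a higher-order function (the names here are hypothetical, not SWL):

from typing import Callable

Workflow = Callable[..., object]

def sequence(w1: Workflow, w2: Workflow) -> Workflow:
    """A binary construct: compose two workflows by feeding w1's output
    into w2, yielding a new workflow Wc."""
    def wc(*inputs):
        return w2(w1(*inputs))
    return wc

add_one: Workflow = lambda x: x + 1
double: Workflow = lambda x: 2 * x
w3 = sequence(add_one, double)
print(w3(5))  # 12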
52. A Scientific Workflow Model
Our proposed scientific workflow model consists of the following two layers:
The logical layer contains the workflow interface that models the input ports and output ports of a workflow.
The physical layer contains the workflow body that models the physical implementation of the workflow:
• Primitive workflows.
• Graph-based workflows.
• Unary-construct-based workflows.
54. The Map Construct
The Map construct enables the parallel processing of a collection of data products based on a workflow that can only process a single data product.
Example: applying the Map construct M to a workflow W1 that multiplies the numbers on its input ports i1…ik yields workflow W2; running W2 on the collection [[1,2],[3,6],[4,7]] processes the three pairs in parallel and produces [2, 18, 28].
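A minimal Python sketch of a Map construct under these assumptions (a workflow is a function over one data product; parallelism via a thread pool):

from concurrent.futures import ThreadPoolExecutor

def map_construct(w):
    """Lift a single-product workflow w into a workflow over a whole
    collection, processing the elements in parallel."""
    def mapped(collection):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda item: w(*item), collection))
    return mapped

multiply = lambda a, b: a * b        # plays the role of W1 in the example
w2 = map_construct(multiply)         # W2 = Map(W1)
print(w2([[1, 2], [3, 6], [4, 7]]))  # [2, 18, 28]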
55. The Reduce Construct
The Reduce construct enables the aggregation of a list of data products to a single data product based on a workflow that aggregates a limited (two or more) number of input data products.
Example: applying the Reduce construct R to the Add workflow with initial value 0 yields workflow W3; running W3 on [3,5,9] chains Add three times (0+3=3, 3+5=8, 8+9=17) and produces 17.
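The Reduce construct behaves like a left fold; a sketch, assuming the same function-as-workflow convention as above:

from functools import reduce

def reduce_construct(w, initial):
    """Aggregate a list of data products with a two-input workflow w,
    threading an accumulator through the list."""
    return lambda items: reduce(w, items, initial)

add = lambda a, b: a + b
w3 = reduce_construct(add, 0)  # W3 = Reduce(Add), initial value 0
print(w3([3, 5, 9]))           # 0+3=3, 3+5=8, 8+9=17 -> prints 17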
56. The Tree Construct
The Tree construct:
Enables parallel aggregation of a collection of data products.
Aggregates a collection pairwise as a binary tree until one single aggregated product is generated.
The Tree construct can be applied to associative workflows.
Example: applying the Tree construct T to the Add workflow yields workflow W4; running W4 on [0,3,5,9] first computes 0+3=3 and 5+9=14 in parallel, then 3+14=17.
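A sketch of pairwise tree aggregation; with an associative workflow such as Add, each level's pairs could run in parallel:

def tree_construct(w):
    """Aggregate a collection pairwise, level by level, until a single
    product remains (an odd leftover passes through unchanged)."""
    def aggregated(items):
        while len(items) > 1:
            pairs = [items[i:i + 2] for i in range(0, len(items), 2)]
            items = [w(*p) if len(p) == 2 else p[0] for p in pairs]
        return items[0]
    return aggregated

add = lambda a, b: a + b
w4 = tree_construct(add)
print(w4([0, 3, 5, 9]))  # level 1: 3 and 14; level 2: 17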
57. The Conditional Construct
The Conditional construct enables the conditional execution of a workflow based on a condition on one of the inputs.
Example: with predicate p = (PI1 < PI2) and inputs (2, 3), p evaluates to true, so the enclosed Projection workflow executes; with predicate p = (PI1 >= PI2) on the same inputs, p evaluates to false and the workflow fails.
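A sketch of the Conditional construct; how VIEW actually signals failure is not specified here, so an exception stands in:

def conditional_construct(predicate, w):
    """Execute w only when the predicate holds on the inputs;
    otherwise the workflow fails."""
    def guarded(*inputs):
        if not predicate(*inputs):
            raise RuntimeError("predicate false: workflow fails")
        return w(*inputs)
    return guarded

projection = lambda a, b: a  # stands in for the Projection workflow
w5 = conditional_construct(lambda a, b: a < b, projection)
print(w5(2, 3))  # predicate true -> 2
w6 = conditional_construct(lambda a, b: a >= b, projection)
# w6(2, 3) would raise: predicate false -> Fail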
58. The Loop Construct
The Loop construct enables cyclic executions of a workflow.
The output of the workflow is repetitively returned (fed back) to a specified input port until the predicate evaluates to true.
Example: with predicate p = (PI1 > 100), starting from 0 and adding 1 on each iteration, the Add workflow executes repeatedly (outputs 1, 2, …) until the output reaches 101 and p evaluates to true.
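A sketch of the Loop construct's feedback behavior (choosing the feedback port by index is an assumption of this sketch):

def loop_construct(predicate, w, feedback_port=0):
    """Run w repeatedly, feeding its output back to one input port,
    until the predicate on the output evaluates to true."""
    def looped(*inputs):
        args = list(inputs)
        while True:
            out = w(*args)
            if predicate(out):
                return out
            args[feedback_port] = out
    return looped

add = lambda a, b: a + b
w7 = loop_construct(lambda v: v > 100, add)
print(w7(0, 1))  # outputs 1, 2, ... fed back until the result is 101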
59. The Curry Construct
The Curry construct allows users to fix one of the input ports with a specified argument and thus reduce the number of input ports.
By applying multiple Curry constructs, a workflow that takes multiple arguments can be translated into a chain of workflows each with a single argument.
Example: applying the Curry construct U to the Add workflow with the constant 4 fixed on one input port yields workflow W8 with a single input port; feeding W8 the value 1 produces 5.
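In Python, the Curry construct corresponds closely to functools.partial; a sketch of the slide's example:

from functools import partial

add = lambda a, b: a + b
w8 = partial(add, b=4)  # fix one input port of Add to 4
print(w8(1))            # 5

# Chaining Curry turns a multi-input workflow into single-argument stages.
add3 = lambda a, b, c: a + b + c
stage = partial(partial(add3, b=2), c=3)
print(stage(1))         # 6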
60. Workflow Composition
Example of the composition of Map and Map constructs:
A workflow W9 that increases all the numbers in a nested list by 1. Applying two nested Map constructs to an Add workflow with 1 fixed on one input, and running the result on [[1,2,3],[4,5,6]], increments every element in parallel, producing [[2,3,4],[5,6,7]].
61. Workflow Composition
Example of the composition of Map and Reduce constructs:
A workflow W11 for the parallel summation of each row in a matrix. Applying a Map construct to a Reduce-based Add workflow, and running the result on [[1,2,3],[4,5,6]], sums each row in parallel, producing [6, 15].
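Both compositions (slide 60's Map over Map and slide 61's Map over Reduce) can be sketched with ordinary higher-order functions; a simplification, since this map works on plain values rather than port-addressed data products:

from functools import reduce

def map_simple(w):
    return lambda collection: [w(x) for x in collection]

inc = lambda x: x + 1
add = lambda a, b: a + b

w9 = map_simple(map_simple(inc))       # Map(Map(Add 1))
print(w9([[1, 2, 3], [4, 5, 6]]))      # [[2, 3, 4], [5, 6, 7]]

w11 = map_simple(lambda row: reduce(add, row, 0))  # Map(Reduce(Add))
print(w11([[1, 2, 3], [4, 5, 6]]))     # [6, 15]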
62. Workflow Composition
Example of a more complicated workflow composition:
A workflow to calculate the greatest common divisor.
[Diagram: workflows W13–W17 combine a Loop construct with predicate p = (PI(2) == 0) around Split, Modulus, Merge, and G2W steps, together with Map, Curry, and Projection constructs.]
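The dataflow around the Loop construct implements Euclid's algorithm; a plain-Python reconstruction (an interpretation of the diagram, not VIEW code):

def gcd_workflow(a, b):
    """Loop until the second component is 0 (the predicate PI(2) == 0),
    each iteration splitting the pair, taking the modulus, and merging."""
    while b != 0:
        a, b = b, a % b  # Split -> Modulus -> Merge feedback
    return a             # project the first port as the result

print(gcd_workflow(48, 18))  # 6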
63. A Collectional Data Model
A collectional data model:
Supports collection-oriented datasets.
• Scientists often work with collection-oriented datasets, such as arrays, lists, tables or file collections.
• A collection-oriented data model enables data parallelism in scientific workflows.
Supports nested data structures.
• Scientific data is often hierarchically organized.
• Scientific workflow tasks often produce collections of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections.
Provides well-defined operators and their arbitrary compositions to manipulate and query scientific data collections.
64. A Collectional Data Model
A relation is a pair <R, r> where R is a schema of the relation and r is an instance of that schema.
A relation schema can be defined as an unordered tuple <c1 : d1, c2 : d2, …, cn : dn> where c1, c2, …, cn are column names and d1, d2, …, dn are domain names.
A relation instance is a table with rows (called tuples) and named columns (called attributes).
65. A Collectional Data Model
A collection schema is a pair <K, V>:
K, the key, is a pair k : d where k is the key name and d is the domain name.
V, the value, is either a relation schema or a collection schema.
A collection instance is a set of key-value pairs (pi, qi) (i ∈ {1, …, m}):
Each pi is a scalar value.
Each qi is either a relation instance or a collection instance.
66. A Collectional Data Model
An example:
Parameters<Model : String, Experiments : Integer, <Concentration : Double, Degree : Integer>>
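One plain-Python way to picture an instance of this schema; the concrete values are illustrative, echoing the operator examples on the following slides:

parameters = {
    # Model (key) -> Experiments (key) -> relation of (Concentration, Degree)
    "m1": {1: [(7.0, 15), (7.1, 15)]},
    "m2": {1: [(7.1, 15)], 2: [(7.0, 30), (7.1, 30)]},
}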
67. The Collectional Operators
We extend the relational operators to the collectional operators, of which collections are the only operands.
Six primitive operators: union, set difference, selection, projection, Cartesian product and renaming.
The set of the collections is closed under these operators.
A relation can be defined as a collection whose height and cardinality are equal to 1. The collectional operators then reduce to the relational operators.
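A minimal sketch of two of these operators over flat collections, using the Model/Result data from the next two slides (the representation and semantics are simplified):

def union(c1, c2):
    """Union of two union-compatible collections (dicts of row sets)."""
    return {k: c1.get(k, set()) | c2.get(k, set())
            for k in c1.keys() | c2.keys()}

def difference(c1, c2):
    """Set difference: keep pairs of c1 whose key is absent from c2."""
    return {k: v for k, v in c1.items() if k not in c2}

a = {"m1": {26}, "m2": {32}}
b = {"m2": {32}, "m3": {31}}
print(union(a, b))       # {'m1': {26}, 'm2': {32}, 'm3': {31}} (key order may vary)
print(difference(a, b))  # {'m1': {26}}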
68. The Collectional Operators
The union and the set difference operators can only be applied on union-compatible collections.
[Example: two union-compatible collections with key Model and value Result, one holding (m1, 26) and (m2, 32), the other holding (m2, 32) and (m3, 31).]
69. The Collectional Operators
Example of the union operator and the set difference operator.
[Tables: the union of the two collections above contains (m1, 26), (m2, 32) and (m3, 31); their set difference contains (m1, 26).]
70. The Collectional Operators
Example of the Cartesian product operator and the renaming operator.
[Tables: after renaming the two collections to M1 and M2, the Cartesian product pairs every tuple of M1 with every tuple of M2, yielding columns M1.Model, M1.Result, M2.Model, and M2.Result.]
71. The Collectional Operators
Example of the selection operator.
[Table: the selection returns the sub-collection for Model m2, containing Experiment 1 with rows such as (Concentration 7.1, Degree 15), ….]
72. The Collectional Operators
Example of the projection operator.
[Tables: projecting onto the Concentration and Degree columns keeps those columns in each nested relation, e.g. (7.0, 15) and (7.1, 15) under Experiment 1, and (7.0, 30) and (7.1, 30) under Experiment 2.]
73. Key Features of VIEW
F1: VIEW features the first uniform workflow model, in which workflows are the only building blocks. In VIEW, tasks are primitive workflows, and workflow constructs do not discriminate workflows from tasks. Such a model greatly simplifies workflow design: a workflow designer only needs to compose complex workflows from simpler ones, without the need to first encapsulate workflows into tasks (or vice versa) during the composition process.
74. F2: VIEW has powerful workflow composition capabilities: workflow constructs are fully compositional with one another, to arbitrary levels. This often results in VIEW workflows that are more concise and efficient to execute, and that can be hard to model in other workflow systems.
75. F3: VIEW features a pure dataflow-based workflow language, SWL, including the dataflow counterparts of controlflow-style constructs such as conditional and loop. Existing workflow languages often require both controlflow and dataflow constructs, resulting in complex or even obscure semantics and non-trivial workflow design.
76. F4: VIEW supports the cloud MapReduce programming model not only at the job level, but also at the workflow level. One can therefore apply the Map and Reduce constructs to an arbitrary workflow an arbitrary number of times. As a result, VIEW can process nested lists of data products in parallel using multiple runs of a workflow.
77. F5: VIEW features a collectional data model that supports not only traditional primitive data types, such as integer, float, double, boolean, char, and string, but also files, relations, and hierarchical collections (hierarchical key-value pairs), to support parallel processing of data collections.
78. F6: VIEW supports a high-level graph-based provenance query language, OPQL. In most cases, users can formulate lineage queries easily without the need to write recursive queries or know the underlying database schema.
79. F7: VIEW features the first service-oriented architecture that conforms to the reference architecture for scientific workflow management systems (SWFMSs). This architecture greatly facilitates interoperability and subsystem reusability in the community. It also provides a generic infrastructure upon which a domain-specific scientific workflow application system (SWFAS) can be easily developed, with custom interfaces for various platforms and devices.
80. Conclusions and Future Work
A scientific workflow composition model.
A collectional data model.
A prototypical SWFMS.
Future work:
Formalization of the scientific workflow algebra and collectional algebra.
• Completeness.
• Integration.
Collaborative scientific workflow composition.
• Concurrent design and composition.
• Concurrent execution.