Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat..., by Alexandru Iosup
Data are pouring in, and defining and providing data-processing services at massive scale, in short, Big Data services, could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic life utility, for both SMEs and the European people. Although the burgeoning datacenter industry, of which the Netherlands is a top player in Europe, promises to enable Big Data services, the architectures and even the infrastructure for these services still lag behind in performance, efficiency, and sophistication, and are built as monoliths reminiscent of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex stacks of middleware currently in use for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) toward a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC|12 (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A workload model for MapReduce. MSc thesis at TU Delft. Jun 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotă, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next innovative industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data provided by scientific instruments such as CERN's LHC; collecting data from large-scale sensor networks; grabbing, indexing, and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; and anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns, and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e., assemblages of Clouds, Grids, or Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
A Biological Internet: Building Eywa from a Social Web of Things with a Little Fog, Stream processing and Linked Data.
Keynote at the Web Science Summer School 2017.
http://www.webscience.org/2017/04/19/shenzhen-web-science-summer-school-2017/
Astronomy is a collaborative science, but, like many other disciplines, it has also become highly specialized. Improving the sharing, discovery, and accessibility of resources will enable astronomers to benefit greatly from each other's highly specialized know-how. Some initiatives led by scientists and publishers complement traditional paper publishing with assets published in more interactive digital formats. Among the main goals of these efforts are improving the reproducibility and clarity of scientific outcomes, going beyond the static PDF file, and fostering re-use, which translates into more efficient exploitation of available digital resources.
Under the grid computing paradigm, large sets of heterogeneous resources can be aggregated and shared. Grid development and acceptance hinge on proving that grids reliably support real applications, and on creating adequate benchmarks to quantify this support. However, applications of grids (and clouds) are just beginning to emerge, and traditional benchmarks have yet to prove representative in grid environments. To address this chicken-and-egg problem, we propose a middle-way approach: create and run synthetic grid workloads comprised of applications representative for today's grids (and clouds). For this purpose, we have designed and implemented GrenchMark, a framework for synthetic workload generation and submission. The framework greatly facilitates synthetic workload modeling, comes with over 35 synthetic and real applications, and is extensible and flexible. We show how the framework can be used for grid system analysis, functionality testing in grid environments, and for comparing different grid settings, and present the results obtained with GrenchMark in our multi-cluster grid, the DAS.
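As a rough illustration of what synthetic workload generation involves (a toy sketch, not GrenchMark's actual implementation; the application names and parameters are invented), a generator can draw job inter-arrival times from an exponential distribution and assign each job an application from a configurable pool:

```python
import random

def synthetic_workload(n_jobs, mean_interarrival_s, apps, seed=0):
    """Generate a toy synthetic workload: jobs arrive with exponentially
    distributed inter-arrival times and run a randomly chosen application
    from a configurable pool."""
    rng = random.Random(seed)
    t = 0.0
    jobs = []
    for i in range(n_jobs):
        t += rng.expovariate(1.0 / mean_interarrival_s)  # mean gap in seconds
        jobs.append({"id": i, "submit_at": round(t, 2), "app": rng.choice(apps)})
    return jobs

workload = synthetic_workload(5, 30.0, ["cpu-burn", "io-scan", "mpi-ring"])
for job in workload:
    print(job)
```

Fixing the random seed makes a generated workload reproducible, which matters when the same workload must be replayed across different grid settings for comparison.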
Making Small Data BIG (UT Austin, March 2016), by Kerstin Lehnert
Presentation given at the Texas Advanced Computing Center. It describes the potential of re-using small data for new science, the achievements so far, and the challenges of making small data re-usable.
RDA Fourth Plenary Keynote - Prof. Christine L. Borgman, Professor and Presidential Chair in Information Studies at UCLA: "Data, Data, Everywhere, Nor Any Drop to Drink." Tuesday 23rd Sept 2014, Amsterdam, the Netherlands
https://rd-alliance.org/plenary-meetings/fourth-plenary/plenary4-programme.html
Facilitating Web Science Collaboration through Semantic Markup, by James Hendler
These are the slides that accompanied the paper "Dominic DiFranzo, John S. Erickson, Marie Joan Kristine T. Gloria, Joanne S. Luciano, Deborah McGuinness, & James Hendler, The Web Observatory Extension: Facilitating Web Science Collaboration through Semantic Markup, Proc. WWW 2014 (Web Science Track), Seoul, Korea, 2014." They describe an extension to schema.org that can be used for sharing Web-related datasets and projects.
The science performed in Astronomy is digital science, from observing proposals to final publication, including data and software used: each of the elements and actions involved in the scientific output could be recorded in electronic form.
Even so, the final outcome of an experiment can still be difficult to reproduce. An exhaustive documentation process can be long and tedious, access to all the resources involved must be granted, and even then the repeatability of results is not guaranteed. At the same time, we have access to a wealth of files, observational data, and publications that could be used more efficiently, given better visibility of scientific production, avoiding duplication of effort and reinvention.
FANS Finding Auburn's New Students Presentation 2016, by AuburnClubs
FANS promotes Auburn University to future students and engages alumni to be the best recruiters to the university. This presentation was shared with Auburn Club leaders at the 2016 Club Leadership Conference.
Accounting Best Practices and Silent Auctions, by AuburnClubs
This presentation was given at the 2016 Club Leadership Conference and contains vital information concerning accounting practices and silent auction procedures for Auburn Clubs.
John Piper: Cuando no deseo a Dios (When I Don't Desire God), by eltropicalbecemi
What do you do when you cannot feel the joy of having God in your heart? What do you do when you discover the good news that God wants you to delight in his presence, yet that is not how you feel? In this book, John Piper argues that joy in God is far more complex than it appears at first glance.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. It features high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some of our activities, including a Coursera course slated for Spring 2012.
Getting the most out of your containerized database, by Claus Matzinger
Microservice environments with databases often grow into complex architectures behind the scenes, to the point where requirements can't be met. This talk will show how to run a scalable stack with persistent data storage based on Docker, and how that leads to fewer grey hairs on the Ops team.
Unified Data API for Distributed Cloud Analytics and AI, by Alluxio, Inc.
Alluxio Day x APAC Modern Data Stack
September 22, 2022
For more on Alluxio Day: https://www.alluxio.io/alluxio-day/
For more Alluxio events: https://alluxio.io/events/
Speaker: Bin Fan (Founding Member & VP of Open Source, Alluxio)
Alluxio (www.alluxio.io) is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. It enables distributed compute engines like Spark and Presto, and machine learning frameworks like TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, and Azure) while actively leveraging an in-memory cache to accelerate data access. Originally developed in the UC Berkeley AMPLab as the research project "Tachyon", Alluxio has more than 1,200 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1,000 nodes.
This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Trending cloud object storage systems provide more cost-effective and scalable storage solutions, but also different semantics and performance implications compared to HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio to access the cloud solves these problems, because data is retrieved and cached in Alluxio instead of being fetched repeatedly from the underlying object storage.
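The caching idea described above can be illustrated with a minimal read-through cache (a toy sketch; Alluxio's real API, tiered storage, and distributed cache management are far richer, and the paths and data here are invented):

```python
class ReadThroughCache:
    """Toy read-through cache: serve reads from memory when possible,
    otherwise fetch from the (slow, remote) backing store and cache
    the result so repeated reads avoid the round trip."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # dict standing in for S3/HDFS
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        self.misses += 1
        data = self.backing_store[path]  # simulated remote fetch
        self.cache[path] = data
        return data

store = {"s3://bucket/part-0": b"row data"}
fs = ReadThroughCache(store)
fs.read("s3://bucket/part-0")  # miss: fetched from the backing store
fs.read("s3://bucket/part-0")  # hit: served from the cache
print(fs.hits, fs.misses)      # prints: 1 1
```

The point mirrors the talk's argument: once data lands in the cache layer, cross-job re-reads no longer pay object-storage latency.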
OSDC 2017 - Claus Matzinger - An Open Machine Data Analysis Stack with Docker..., by NETWAYS
Predictive analytics, Internet of Things, Industry 4.0 - everybody has heard them at least once, but what do real installations look like? How can containerized Microservices help deployment and increase productivity? Claus from Crate.io will answer any and all of these questions and show real world examples with a stack based on Raspberry Pis, Grafana, Docker, and Rust.
This is a presentation I delivered at CodeMash 2.0.1.0 dealing with lessons learned while building an application for handling the post-processing of scientific data using the Windows Azure platform.
Accelerating data-intensive science by outsourcing the mundane, by Ian Foster
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
We present a system to support generalized SQL workload analysis and management for multi-tenant and multi-database platforms. Workload analysis applications are becoming more sophisticated to support database administration, model user behavior, audit security, and route queries, but the methods rely on specialized feature engineering, and therefore must be carefully implemented and reimplemented for each SQL dialect, database system, and application. Meanwhile, the size and complexity of workloads are increasing as systems centralize in the cloud. We model workload analysis and management tasks as variations on query labeling, and propose a system design that can support general query labeling routines across multiple applications and database backends. The design relies on the use of learned vector embeddings for SQL queries as a replacement for application-specific syntactic features, reducing custom code and allowing the use of off-the-shelf machine learning algorithms for labeling. The key hypothesis, for which we provide evidence in this paper, is that these learned features can outperform conventional feature engineering on representative machine learning tasks. We present the design of a database-agnostic workload management and analytics service, describe potential applications, and show that separating workload representation from labeling tasks affords new capabilities and can outperform existing solutions for representative tasks, including workload sampling for index recommendation and user labeling for security audits.
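To make the labeling workflow concrete, here is a toy sketch: a bag-of-tokens vector stands in for the learned embedding (the paper's point is precisely that a learned vector replaces such hand-built features), and a nearest-neighbor rule assigns labels in that vector space. All queries and labels are invented for illustration:

```python
from collections import Counter
import math

def embed(sql):
    """Toy stand-in for a learned query embedding: a bag-of-tokens vector."""
    return Counter(sql.lower().replace(",", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label(query, labeled_queries):
    """Nearest-neighbor query labeling in embedding space."""
    q = embed(query)
    return max(labeled_queries, key=lambda item: cosine(q, embed(item[0])))[1]

history = [
    ("SELECT name, salary FROM employees WHERE dept = 'hr'", "analyst"),
    ("UPDATE users SET pw = ? WHERE id = ?", "admin"),
]
print(label("SELECT name FROM employees", history))  # prints: analyst
```

Because the labeling routine only sees vectors, swapping the toy `embed` for a learned embedding leaves the rest of the pipeline unchanged, which is the separation of representation from task that the abstract argues for.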
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new “delivery vector” for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over “raw” tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
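The Upload-Query-Share workflow can be sketched in miniature with SQLite standing in for the hosted service (the table, columns, and values are invented for illustration; SQLShare itself is a multi-user web service, not a local library):

```python
import sqlite3

# In-memory database standing in for the hosted query-as-a-service backend.
db = sqlite3.connect(":memory:")

# Upload: load "raw" tabular rows as-is, no schema design beyond column names.
db.execute("CREATE TABLE casts (station TEXT, depth REAL, oxygen REAL)")
db.executemany("INSERT INTO casts VALUES (?, ?, ?)",
               [("P4", 10.0, 6.1), ("P4", 50.0, 3.2), ("P8", 10.0, 5.8)])

# Query: direct, full SQL over the uploaded table.
rows = db.execute(
    "SELECT station, AVG(oxygen) FROM casts GROUP BY station").fetchall()

# Share: publish the query as a named view that others can build on.
db.execute("CREATE VIEW mean_oxygen AS "
           "SELECT station, AVG(oxygen) AS avg_o2 FROM casts GROUP BY station")
print(rows)  # per-station oxygen averages
```

The view is the unit of sharing: a collaborator queries `mean_oxygen` without ever seeing the raw rows or the SQL behind it.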
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024, by Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Key Trends Shaping the Future of Infrastructure.pdf, by Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do..., by UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
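Active learning with uncertainty sampling, the general technique behind the session's "train smarter, not harder" theme, can be sketched as follows (a generic toy example, not UiPath's API; the scoring model and documents are invented):

```python
def uncertainty_sample(pool, predict_proba, batch_size=2):
    """Pool-based active learning step: rank unlabeled documents by how
    unsure the model is (probability closest to 0.5) and pick the top few
    for human labeling, so each annotation teaches the model the most."""
    scored = [(abs(predict_proba(doc) - 0.5), doc) for doc in pool]
    scored.sort(key=lambda pair: pair[0])  # most uncertain first
    return [doc for _, doc in scored[:batch_size]]

# Toy "model": confident on invoices, unsure about everything else.
def predict_proba(doc):
    return 0.95 if "invoice" in doc else 0.55

pool = ["invoice #42", "meeting notes", "invoice #7", "shipping label"]
picked = uncertainty_sample(pool, predict_proba)
print(picked)  # prints: ['meeting notes', 'shipping label']
```

The confidently classified invoices are skipped; the annotator's time goes to the documents the model cannot yet handle, which is why active learning accelerates training.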
UiPath Test Automation using UiPath Test Suite series, part 4, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Visual Data Analytics in the Cloud for Exploratory Science
1. Visual Data Analytics in the Cloud
for Exploratory Science
Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW
2. 3/12/09 Bill Howe, UW (VisTrails + GridFields)
Data acquisition is no longer the bottleneck
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
3.
[chart: two dimensions, # of bytes and # of apps, with Biology, Oceanography, and Astronomy projects plotted along them: LSST, SDSS, Galaxy, BioMart, GEO, IOOS, OOI, LANL, HIVPathway Commons, PanSTARRS]
4.
This Talk
# of Bytes: MapReduce for Scientific Viz
# of Apps: Other VDA Projects
5.
Converging Requirements
Vis DB
6.
Why Vis Needs DB
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
Current Research Topics in Vis:
• “Query-driven Visualization”
• “In Situ Visualization”
• “Remote Visualization”
8.
Why DB Needs Vis (2)
“What does the salt wedge look like?”
9.
Thesis
We can no longer afford to build separate
visualization and data management systems
Data is increasingly destined for the cloud
First Attack: Implement Vis primitives in an
existing “cloud” DM system
10.
Core Vis Algorithms in MapReduce
Scalar/Volume Rendering
Isosurface Extraction
Mesh Simplification
11.
Some distributed algorithm…
Map
(Shuffle)
Reduce
12.
CluE Cluster
410 nodes
Dual Intel Xeon 2.8GHz, hyperthreading
8GB main memory each
Hadoop, no access to OS
Google-provided, IBM-maintained, NSF-funded
24.
Roadmap
# of Bytes: MapReduce for Scientific Viz
# of Apps: Other VDA projects
Azure Ocean
SQLShare
Automating Mashups
25.
[John Delaney, University of Washington]
26.
Azure Ocean
COVE for Visualization + Trident for Processing + Azure for Data
27.
SQLShare: Query Services
for Ad Hoc Research Data
28.
Ad Hoc Research Data
5/18/10 Garret Cole, eScience Institute
FASTA format
Spreadsheets
Tabular data
29.
Problem
“I spend 90% of my time handling
data rather than doing science”
-- Robin Kodner, Postdoc, Armbrust Lab
30.
An observation about “handling data”
How often does each RNA hit appear inside my
annotated surface group?
SELECT hit, COUNT(*) as cnt FROM tigrfamannotation_surface
GROUP BY hit ORDER BY cnt DESC
31.
Discovery: SQL Does not Terrify Scientists
33.
Technology used in 1st Gen Component Stack
34.
SQLShare Redux
Conventional wisdom says “Scientists won’t write SQL”
We don’t believe it!
Instead, we implicate difficulty in
installation
configuration
schema design
performance tuning
data ingest
over-reliance on GUIs
Critical need for visualization
Clear role for Tableau!
We are asking “What kind of platform will
make SQL useful for scientific inquiry?”
36.
Why Mashups?
Jim Gray: # of datasets scales as N²
Each pairwise comparison generates a new dataset
Corollary: # of apps scales as N²
Every pairwise comparison motivates a new mashup
To keep up, we need to
entrain new programmers,
make existing programmers more productive,
or both
39.
Why Mashups?
The time of one's data fitting into a 15-page research paper is past.
Datasets are too large and complex to be conveyed with a handful of static images
Prediction: succinct, targeted, interactive web apps will become the
currency of scientific communication
with the public
with policy makers
with colleagues in other disciplines
with peers
with students (K12 - grad)
41.
Conclusions
Converging requirements for DB and Vis
At high scale:
A Vis library in MapReduce
At high complexity:
Azure Ocean
Data + Workflow + Vis
“Client + Cloud”,“Computational mobility”
SQLShare
Ad Hoc data -- “anything goes”
Visualization critical
(semi-)automated mashups
“Show me what’s interesting”
42.
Acknowledgments
http://escience.washington.edu
47.
Azure Ocean
COVE for Visualization + Trident for Processing + Azure for Data
48. COVE
Research into new interfaces for cross-disciplinary ocean science
Extensive instrument and cable layout for creating experiments
Flexible terrain and image engine for visualizing site
True 3D/4D science dataset visualization
Field tested in RSN observatory layout and on ocean expeditions
Cross platform and extensible with Python and workflow systems
49.
Trident
Microsoft Research scientific workflow system
Visual programming environment for connecting tasks
Science-specific task libraries, including one for ocean sciences
Automated provenance capture, monitoring, and fault tolerance
Runs on local system, Windows server, or HPC cluster
Cross platform with Silverlight and web service interface
50.
Azure
Microsoft's cloud computing platform
Provides storage and computing as pay-as-you-go services
From a development standpoint, the system looks like provisioned VMs
SQL, table, and blob (file system) storage models are included
Access to storage via RESTful HTTP interface
51.
Azure Ocean
COVE + Trident + Azure provides visual analytics to scientists
Any component (Visualization, Computing, or Data) can be provisioned locally, on a server, or in the cloud
When on the same machine, system APIs are leveraged for speed
When distributed, communication is through HTTP and RESTful APIs
Flexible platform for the diverse ocean science needs
53.
MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
Processes input key/value pair
Produces set of intermediate pairs
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
slide source: Google, Inc.
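The two signatures above can be exercised with a toy in-memory runner. This is illustrative Python, not Hadoop itself; `map_fn`, `reduce_fn`, and `run_mapreduce` are names chosen for the sketch (here a word count, the canonical example):

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # Processes one input key/value pair; emits intermediate pairs.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, values):
    # Combines all intermediate values for one key into merged output.
    return [sum(values)]

def run_mapreduce(inputs, mapper, reducer):
    # "Shuffle" phase: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for k, v in inputs:
        for out_key, out_value in mapper(k, v):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct key.
    return {k: reducer(k, vs) for k, vs in groups.items()}
```

For example, `run_mapreduce([(0, "a b a"), (1, "b")], map_fn, reduce_fn)` groups the emitted pairs by word and sums them per key.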
55.
Isosurface Example
[vis movie]
Key idea: Zooplankton correlated with temperature
57.
Example Query: Climatology
[animation: Average Surface Salinity by Month (psu), Columbia River Plume 1999-2006; Feb and May panels showing Washington, Oregon, and the Columbia River]
58.
UW + Utah CluE Program
Goals
10+-year “climatologies” at interactive speeds
…with provenance, reproducibility, collaboration …on a
shared-nothing, commodity platform
In general: Explore the intersection of scientific
databases and scientific visualization, at scale
Methods
“Cloud-Enable” two projects
GridFields: Query algebra for mesh data
VisTrails: Scientific workflow and provenance
60.
Converging Requirements
Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: Millions of tuples per result
Vis DB
61.
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
62.
Core Vis Algorithms in MapReduce
Scalar/Volume Rendering
Map: Rasterization
Reduce: Compositing, blending
Isosurface Extraction
Map: Isosurface Extraction
Reduce: Combine like isovalues
Mesh Simplification
Map: Bin vertices
Reduce: Collapse binned triangles
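The mesh-simplification pairing above (map: bin vertices; reduce: collapse binned triangles) is essentially vertex clustering. A minimal sketch under that reading, with illustrative function names and uniform grid binning (triangles would then be re-indexed against the cluster representatives, with degenerate ones dropped):

```python
from collections import defaultdict

def bin_vertex(vertex, cell=1.0):
    # Map step: assign each vertex to a coarse grid cell.
    x, y, z = vertex
    return (int(x // cell), int(y // cell), int(z // cell))

def cluster_vertices(vertices, cell=1.0):
    # Shuffle: group vertices by grid cell.
    groups = defaultdict(list)
    for v in vertices:
        groups[bin_vertex(v, cell)].append(v)
    # Reduce step: collapse each cell's vertices to their centroid.
    return {c: tuple(sum(axis) / len(vs) for axis in zip(*vs))
            for c, vs in groups.items()}
```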
66.
“Query-Driven Visualization”
Vis perspective:
query = subsetting
DB perspective:
query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.
67.
Why Cloud?
“Cloud”?
Software as a Service (SaaS)
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud
68.
Shared Nothing Parallel Databases
Teradata
Greenplum
Netezza
Aster Data Systems
Datallegro
Vertica
MonetDB
Microsoft
Recently commercialized as “Vectorwise”
69.
Taxonomy of Parallel Architectures
[diagram labels: "Easiest to program, but $$$$"; "Scales to 1000s of nodes"]
70. screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
VisTrails
71.
Version Tree
72.
Collaboration
Bill Howe @ UW computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ UW adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
Howe et al., eScience 2008
73.
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
74.
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
75.
Hadoop in VisTrails
Wrap Hadoop Streaming/HDFS Operations
Plug "PreProcess" into the actual Vis pipeline
76.
Hadoop in VisTrails
Provenance and Monitoring
77.
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
78.
All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical → Analytical → Computational → X-informatics
79.
Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()

join [o.item = i.item]
 ├─ select [date = today()] over scan [Order o]
 └─ scan [Item i]

Find all orders from today, along with the items ordered
80.
Example System: Teradata
AMP = unit of parallelism
81.
Example System: Teradata
AMP 1, AMP 2, AMP 3: scan [Order o] → select [date = today()] → hash h(item), redistributing rows to AMP 4, AMP 5, AMP 6
82.
Example System: Teradata
AMP 1, AMP 2, AMP 3: scan [Item i] → hash h(item), redistributing rows to AMP 4, AMP 5, AMP 6
83.
Example System: Teradata
AMP 4, AMP 5, AMP 6: each joins locally on o.item = i.item; AMP k contains all orders and all lines where hash(item) = k
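The repartitioned join in the Teradata slides above can be condensed into a toy hash-partitioned join. This is an illustrative sketch, not Teradata; `n` plays the role of the number of AMPs, and the dict-of-fields record layout is assumed:

```python
def hash_partition(rows, key, n):
    # Redistribute rows so equal join keys land in the same partition,
    # like hashing h(item) to route rows to AMPs.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def partitioned_join(orders, items, n=3):
    # Each partition joins locally on o.item = i.item; union the results.
    results = []
    for o_part, i_part in zip(hash_partition(orders, "item", n),
                              hash_partition(items, "item", n)):
        lookup = {}
        for i in i_part:
            lookup.setdefault(i["item"], []).append(i)
        for o in o_part:
            for i in lookup.get(o["item"], []):
                results.append({**i, **o})
    return results
```

Because both inputs are partitioned by the same hash, matching keys always meet in the same partition, so no cross-partition communication is needed during the join itself.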
84.
Workflow Execution Plans
Need execution plans spanning client/server/cloud
85.
Example: Isosurface Browsing
86.
Example: Isosurface Browsing
Plan A: Subset each timestep (tstep 0, tstep 1, tstep 2, tstep 3)
87.
Example: Isosurface Browsing
Plan B: Build an index, e.g., an Interval Tree (Cignoni 97); then for each timestep (tstep 0-3): Subset → Isosurface → Render
88.
Example: Isosurface Browsing
Plan C: Build a spatial index to support panning
Plan D: Build a multi-resolution index to support zoom
…and so on
Why not precompute all appropriate indexes?
Some will (partially) reside on client
Storage is not as cheap as we pretend
Need a flexible system where
a “query result” can be explored interactively, and
we prepare for similar queries
similarity defined by natural “browsing patterns” in visualization
systems
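The index behind Plan B can be stood in for by a brute-force range filter (a real implementation would use the interval tree cited above; the names here are illustrative): a cell can contribute to the isosurface for value v iff its scalar range [lo, hi] contains v.

```python
class IsoIndex:
    def __init__(self, cells):
        # cells: list of (cell_id, lo, hi) scalar ranges, one per mesh cell.
        self.cells = cells

    def active_cells(self, isovalue):
        # A cell intersects the isosurface iff lo <= isovalue <= hi;
        # an interval tree answers the same query in O(log n + k).
        return [cid for cid, lo, hi in self.cells if lo <= isovalue <= hi]
```

Precomputing such per-timestep ranges is exactly the kind of "similar query" preparation the slide argues for: one pass over the data supports interactive browsing over many isovalues.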
90.
Why MapReduce/Hadoop?
Popular
AWS Elastic MapReduce
100s of startups
# of downloads
# of blog posts
Free as in Speech
Free as in Beer
Flexible, Lightweight
Scalable
Fault-tolerant
98.
As a GridField Expression
[GridField expression tree: H0 : (x, y, b) ⊗ V0 : (σ); bind(0, surf); apply(0, z = (surf − b) * σ); C]
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
99.
As a SQL Query
SELECT hpos, vpos, AVG(salt)
FROM ocean
GROUP BY hpos, vpos
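The SQL above, like the Regrid with equijoin and avg() on the previous slide, is a group-by average. The same climatology in plain Python, assuming an illustrative (hpos, vpos, salt) record layout:

```python
from collections import defaultdict

def climatology(records):
    # records: iterable of (hpos, vpos, salt) tuples.
    # Returns average salinity per (hpos, vpos) position,
    # i.e. GROUP BY hpos, vpos with AVG(salt).
    sums = defaultdict(lambda: [0.0, 0])
    for hpos, vpos, salt in records:
        acc = sums[(hpos, vpos)]
        acc[0] += salt
        acc[1] += 1
    return {pos: total / count for pos, (total, count) in sums.items()}
```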
100.
Scientific Workflow Systems
Value proposition: More time on science, less time on code
How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
Provenance
Visual programming
Caching
Integration with domain-specific tools
Scheduling
101.
Related Vis Work
Parallel visualization systems
ParaView, VisIt
Query-Driven Visualization
[Bethel et al 2006,2008,2009]
FastBit Index
[Shoshani et al 2007]
DB Vis systems
Tableau
102.
Feeding the Pipeline
source: Ken Moreland
missing step?
104.
Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review
105.
Remote Visualization
Reduce and render remotely, transfer images
++ transfers less data
-- specialized hardware, high load
Reduce remotely, transfer data/geometry, render locally
++ uses local graphics pipeline
-- transfers more data
107.
Scientific Vis System Roundup
General
ParaView [KitWare, Los Alamos, Sandia]
VisIt [LLNL]
Specialized
SALSA, particles, Quinn, UW
VISUS, streaming/progressive, Jones, LLNL
SAGE,
Hyperwall, tiled display, NASA
Editor's Notes
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
“Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.”
“tens of terabytes of data per day” -- the genome center at Washington University
Increase data collection exponentially with flowcam
Analytics and Visualization are mutually dependent
Scalability
Fault-tolerance
Exploit shared-nothing, commodity clusters
In general: Move computation to the data
Data is ending up in the cloud; we need to figure out how to use it.
Visualization is a more efficient way to query data -- you can browse and explore.
But you need to be able to switch back and forth between interactive browsing and symbolic querying
What exactly is Ad Hoc Research data?
It is data that can come in any size, shape, or form, where the data is heterogeneous in its structure, format, quality, and more.
(granted we had a minute for Bill (clearly Bill) to describe this new eScience movement)
We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.
Essentially, we want to remove the speed-bump of data handling from the scientists.
To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?
For just about every question we have heard a scientist ask, we have found an equivalent SQL statement.
If we could just turn their questions into SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.
This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
One claim we are trying to prove with this project is that scientists are not afraid to learn a bit of SQL
In our first-generation deployment, we used an ASP.NET front end on the Windows Azure cloud to host our web service, and Amazon's EC2 cloud as the backend to host our Microsoft SQL Server database.
Data products are the currency of scientific and statistical communication with the public
Ex: Obama map
Ex: Mars Rover pictures generate 218M hits in 24 hrs
But: Datasets are growing too big and too complex to view through a few static images
Scientists want to create interactive visualizations that allow others to explore their results
Ex: Nasa 3D with Photosynth
Ex: CAMERA
Ex:
On the order of hundreds of points. Manual browsing.
Ex: Nasa 3D with Photosynth
Ex: CAMERA
Ex:
Data-intensive science
This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
Need to consider private clouds
Not just renting hardware: general-purpose data processing
The goal here is to make shared-nothing architectures easier to program.
We only wrap the interface for Hadoop Streaming in VisTrails, with additional support for HDFS operations to upload/download data/libraries for the job.
Hadoop Streaming is plugged into a local VTK rendering pipeline that grabs data from the cloud and generates an animation on the VisTrails Spreadsheet.
Users can specify their own Python Source as mapper/reducer. In this case, a VTK script is specified in the mapper. Also, VTK libraries are shipped along with the code to the computing node. This uses the underlying –cacheArchive of Hadoop streaming.
By default, Hadoop logs are output to the standard output of the VisTrails app. Jobs are killed by terminating the program and running an extra command returned by Hadoop. However, one can plug a HadoopTrackerCell into the end of the pipeline to have log messages monitored on the VisTrails Spreadsheet. There are also buttons to kill the job or show the Job Tracker, which automatically connects through CluE's specific proxy to see additional logs/error messages of jobs.
Need to assign workflows to resources for execution in a heterogeneous compute environment. Parts of this workflow can be compiled into Hadoop jobs, parts should be run locally so that they exploit hardware acceleration.
But this is not just computation placement -- there are different execution plans, similar to relational execution plans.
Gridfields expressions can be algebraically optimized, for example.
We can’t just precompute the indexes, since they may reside on
Upper left: Average
Sweeping through the velocity fields quickly exposed the location of the “upstream” salt flux -- where salty water made its way back upstream.