The document discusses the formation of a new partnership between the University of Washington and Carnegie Mellon University, the eScience Institute. The partnership will receive $1 million per year in funding from the state of Washington and $1.5 million from the Gordon and Betty Moore Foundation. The goal of the institute is to keep the universities competitive by positioning them at the forefront of modern data-intensive science techniques and technologies, such as sensors, databases, and data mining.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.
We have made much progress over the past decade toward effectively
harnessing the collective power of IT resources distributed across the
globe. In fields such as high-energy physics, astronomy, and climate,
thousands benefit daily from tools that manage and analyze large
quantities of data produced and consumed by large collaborative teams.
But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately
most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled
with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?
Consumers and businesses face similar challenges, and industry has
responded by moving IT out of homes and offices to so-called cloud providers (e.g., Gmail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.
I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards
large-scale delivery of these capabilities.
A Biological Internet: Building Eywa from a Social Web of Things with a Little Fog, Stream processing and Linked Data.
Keynote at the Web Science Summer School 2017.
http://www.webscience.org/2017/04/19/shenzhen-web-science-summer-school-2017/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The science performed in Astronomy is digital science, from observing proposals to final publication, including the data and software used: every element and action involved in the scientific output could be recorded in electronic form.
Even so, the final outcome of an experiment is often still difficult to reproduce. An exhaustive documentation process can be long and tedious, access to all the resources must be granted, and even then the repeatability of results is not guaranteed. At the same time, we have access to a wealth of files, observational data, and publications that could be used more efficiently if the scientific production were more visible, avoiding duplication of effort and reinvention.
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011, in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
Data-intensive applications on cloud computing resources: Applications in lif... (Ola Spjuth)
Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
Astronomy is a collaborative science, but like many other disciplines it has also become highly specialized. Improved sharing, discovery, and access to resources will enable astronomers to benefit greatly from each other's highly specialized know-how. Some initiatives led by scientists and publishers complement traditional paper publishing with assets published in more interactive digital formats. Among the main goals of these efforts are improving the reproducibility and clarity of the scientific outcome, going beyond the static PDF file, and fostering re-use, which translates into more efficient exploitation of available digital resources.
Using the Open Science Data Cloud for Data Science Research (Robert Grossman)
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Data Tribology: Overcoming Data Friction with Cloud Automation (Ian Foster)
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
IPython Notebooks have given us a substantial improvement in the documentation of scripts, as well as their inspection and greater re-use. IPython Notebooks also provide access to different programming languages (Fortran, IDL, R, Shell, ...) in a single script, which, together with their web-based access, makes them an ideal tool for collaborative work (multi-language, multi-user, multi-platform, etc.). I will describe the kinds of things that can be done with IPython Notebooks, from collaborative development of multi-language code, through the re-use of tutorials and interactive visualization of results, to the distribution of more modular code and the final publication of a verifiable and reproducible digital experiment: the prelude to executable papers.
Digital Science: Reproducibility and Visibility in Astronomy (Jose Enrique Ruiz)
The science done in Astronomy is digital science, from observing proposals to final publication, including the data and software used: every element and action involved in scientific output could be recorded in electronic form. Even so, the final outcome of an experiment is often still difficult to reproduce. The procedure can be long, tedious, and not easily accessible or understandable, even to the author. At the same time, we have a rich infrastructure of files, observational data, and publications. These could be used more efficiently if we achieved greater visibility of the scientific production, avoiding duplication of effort and reinvention.
Reproducibility is a cornerstone of the scientific method, and extraction of relevant information from the current and future data flood is key in Astronomy. The AMIGA group (Analysis of the interstellar Medium of Isolated GAlaxies, IAA-CSIC, http://amiga.iaa.es) faces these two challenges in the European project "Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science", which aims to preserve scientific methodology in scalable semantic repositories that facilitate its discovery, access, inspection, exploitation, and distribution. These repositories store experiments as "Research Objects" whose main constituents are digital scientific workflows. Research Objects provide a comprehensive view and clear scientific interpretation of the experiment, as well as automation of the method, going beyond the usual pipelines that normally end at data processing.
The quantitative leap in the volume and complexity of the next generation of archives will require analysis and data mining tasks to live closer to the data, in distributed computing and storage environments; but these tasks should also be modular enough to allow customization by scientists, and easily accessible to foster their dissemination in the community. Astronomy is a collaborative science, but like many other disciplines it has also become highly specialized. Sharing, preservation, discovery, and much simplified access to resources in the composition of scientific workflows will enable astronomers to benefit greatly from each other's highly specialized know-how; they constitute a way to push Astronomy to share and publish not only results and data, but also processes and methodologies.
We will show how the use of scientific workflows can improve the reproducibility of experiments, enable more efficient exploitation of astronomical archives, and increase the visibility and re-use of the scientific methodology.
Accelerating data-intensive science by outsourcing the mundane (Ian Foster)
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
Large Scale On-Demand Image Processing For Disaster Relief (Robert Grossman)
This is a status update (as of Feb 22, 2010) of a new Open Cloud Consortium project that will provide on-demand, large scale image processing to assist with disaster relief efforts.
A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Laboratory and the University of Washington.
A 25-minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; and an overview of some of our activities, including a Coursera course slated for Spring 2012.
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
Making It Your Own: Transitioning Into a New Electronic Resources Role (Alana Nuth)
Making it Your Own: Transitioning into a New Electronic Resources Role
Presented by Kelly Blanchat and Alana Verminski
ER&L Conference 2015
Austin, TX
Kelly Blanchat
Electronic Resources Librarian
Queens College, CUNY
kelly.blanchat@qc.cuny.edu
@kellyblanchat
Alana Verminski
Reference and Instruction Librarian
St. Mary's College of Maryland Library
alana.verminski@gmail.com
www.alanaverminski.com
Abstract:
Transitioning into a new role is challenging, especially one as vast and nuanced as electronic resources. With a new position come endless opportunities, along with unknown or unexpected situations. Using entertaining anecdotes, the presenters will share strategies to revamp legacy workflows, assess current practices, and make a new position your own.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
Talk at the JISC Repositories conference, intended for repository managers or research managers, on some of the issues involved. The talk originally had to be given unaided because of a technology problem!
Thoughts on Knowledge Graphs & Deeper Provenance (Paul Groth)
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform, emphasizing requirements emerging from the physical, life, and social sciences.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences (Ian Foster)
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Computational infrastructure is becoming a vast interconnected fabric of formal methods, including a major shift from 2D grids to 3D graphs in machine learning architectures.
The implication is systems-level digital science at unprecedented scale, enabling discovery in a diverse range of scientific disciplines.
Scott Edmunds' slides for class 8 of the HKU Data Curation course (module MLIM7350, Faculty of Education), covering science data, medical data and ethics, and the FAIR data principles.
Python's Role in the Future of Data Analysis (Peter Wang)
Why is "big data" a challenge, and what roles do high-level languages like Python have to play in this space?
The video of this talk is at: https://vimeo.com/79826022
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ... (Safe Software)
University and college campuses are complex environments. A campus comprises many physical sub-systems, such as buildings, outdoor spaces, utilities, and transportation, which are maintained by several divisions using multiple IT tools and different formats. Campus-wide analytics requires bringing all these data elements and formats (CAD, GIS, BIM) together to create a comprehensive common operating picture. In this presentation we will demonstrate how FME is a key technology for creating a campus-wide data warehouse.
We present a system to support generalized SQL workload analysis and management for multi-tenant and multi-database platforms. Workload analysis applications are becoming more sophisticated to support database administration, model user behavior, audit security, and route queries, but the methods rely on specialized feature engineering, and therefore must be carefully implemented and reimplemented for each SQL dialect, database system, and application. Meanwhile, the size and complexity of workloads are increasing as systems centralize in the cloud. We model workload analysis and management tasks as variations on query labeling, and propose a system design that can support general query labeling routines across multiple applications and database backends. The design relies on the use of learned vector embeddings for SQL queries as a replacement for application-specific syntactic features, reducing custom code and allowing the use of off-the-shelf machine learning algorithms for labeling. The key hypothesis, for which we provide evidence in this paper, is that these learned features can outperform conventional feature engineering on representative machine learning tasks. We present the design of a database-agnostic workload management and analytics service, describe potential applications, and show that separating workload representation from labeling tasks affords new capabilities and can outperform existing solutions for representative tasks, including workload sampling for index recommendation and user labeling for security audits.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
A New Partnership for Cross-Scale, Cross-Domain eScience
1. A New Partnership for eScience
Bill Howe, UW
Ed Lazowska, UW
Garth Gibson, CMU
Christos Faloutsos, CMU
Peter Lee, CMU (DARPA)
Chris Mentzel, Moore
6. The University of Washington eScience Institute
Rationale: The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich. Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, and cluster/cloud computing. If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive.
Mission: Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them.
Strategy: Bootstrap a cadre of Research Scientists; add faculty in key fields; build out a “consultancy” of students and non-research staff.
7. Staff and Funding
Funding:
$1M/year direct appropriation from the WA State Legislature
$1.5M from the Gordon and Betty Moore Foundation (joint with CMU)
Multiple proposals outstanding
Staffing:
Dave Beck, Research Scientist: biosciences and software engineering
Jeff Gardner, Research Scientist: astrophysics and HPC
Bill Howe, Research Scientist: databases, visualization, DISC
Ed Lazowska, Director
Erik Lundberg (50%), Operations Director
Mette Peters, Health Sciences Liaison
Chance Reschke, Research Engineer: large-scale computing platforms
…plus a senior faculty search underway
…plus a “consultancy” of students and professional staff
8. All science is reducing to a database problem
Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, in support of many hypotheses)
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase data collection exponentially with FlowCam!”
9. The Long Tail
The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB).
[Figure: data volume vs. rank, from big-science projects -- CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), CARMEN (~50 TB) -- down through ocean modelers, seismologists, and microbiologists to spreadsheet users.]
Researchers with growing data management challenges but limited resources for cyberinfrastructure:
• No dedicated IT staff
• Over-reliance on inadequate but familiar tools
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
12. What Does Scalable Mean?
Operationally:
In the past: “Works even if data doesn’t fit in main memory”
Now: “Can make use of 1000s of cheap computers”
Formally:
In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations.
Soon: logarithmic time and linear space. If you have N data items, you must do no more than N log(N) operations.
Soon, you’ll only get one pass at the data -- so you better make that one pass count.
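To make the one-pass constraint concrete, here is a minimal Python sketch (illustrative, not from the slides; the file name and statistics are hypothetical) that computes several aggregates in a single streaming pass, touching each record exactly once and using constant memory:

count = 0
total = 0.0
minimum = float("inf")
maximum = float("-inf")
with open("records.txt") as f:                  # hypothetical input: one number per line
    for x in (float(line) for line in f):       # each record is visited exactly once
        count += 1
        total += x
        minimum = min(minimum, x)
        maximum = max(maximum, x)
if count:
    print(count, total / count, minimum, maximum)   # count, mean, min, max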
13. A Goal: Cross-Scale Solutions
Gracefully scale up: from files to databases to clusters to clouds; from MB to GB to TB to PB.
“Gracefully” means logical data independence: no expensive ETL migration projects.
“Gracefully” also means everyone can use it: hackers / computational scientists, lab/field scientists, the public, K-12, legislators.
14. Data models, operations, and services offered by each class of system:

System | Data Model | Operations | Services
GPL | * | * | None for free
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling
Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling
DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
15. MapReduce
Many tasks process big data and produce big data. We want to use hundreds or thousands of CPUs... but this needs to be easy.
Parallel databases exist, but they require DBAs and $$$$, and do not easily scale to thousands of computers.
MapReduce is a lightweight framework, providing:
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
16. A complete DryadLINQ program (slide source: Christophe Poulain, MSR)

public class LogEntry {
  public string user, ip;
  public string page;
  public LogEntry(string line) {
    string[] fields = line.Split(' ');
    this.user = fields[8];
    this.ip = fields[9];
    this.page = fields[5];
  }
}

public class UserPageCount {
  public string user, page;
  public int count;
  public UserPageCount(string usr, string page, int cnt) {
    this.user = usr;
    this.page = page;
    this.count = cnt;
  }
}

PartitionedTable<string> logs =
  PartitionedTable.Get<string>(@"file:…logfile.pt");

var logentries =
  from line in logs
  where !line.StartsWith("#")
  select new LogEntry(line);

var user =
  from access in logentries
  where access.user.EndsWith(@"ulfar")
  select access;

var accesses =
  from access in user
  group access by access.page into pages
  select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses =
  from access in accesses
  where access.page.EndsWith(".htm")
  orderby access.count descending
  select access;

htmAccesses.ToPartitionedTable(@"file:…results.pt");
17. Relational Databases
Pre-relational DBMS brittleness: if your data changed, your application often broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
[Figure: layers of data independence -- physical data independence between files-and-pointers and relations, logical data independence between relations and views.]
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E. F. Codd, 1970
Key idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
18. Relational Databases
Databases are especially, but not exclusively, effective at “Needle in Haystack” problems: extracting small results from big datasets.
They transparently provide “old style” scalability: your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriate:

CREATE INDEX seq_idx ON sequence(seq);

SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';

*almost
19. Key Idea: Data Independence
[Figure: physical data independence separates files-and-pointers from relations; logical data independence separates relations from views.]
With data independence, a request is a short declarative query:

SELECT *
FROM my_sequences

SELECT seq
FROM ncbi_sequences
WHERE seq = 'GATTACGATATTA';

Without it, the application must navigate files and pointers by hand:

f = fopen('table_file');
fseek(f, 10030440, SEEK_SET);
while (true) {
  fread(&buf, 1, 8192, f);
  if (buf == GATTACGATATTA) {
    . . .
20. Key Idea: An Algebra of Tables
[Figure: a query plan tree composing select, project, and join operators.]
Other operators: aggregate, union, difference, cross product.
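These operators compose like ordinary algebraic expressions. Below is a minimal Python sketch (illustrative only; the tables and column names are made up) of select, project, and join over lists of dictionaries:

def select(rows, pred):                   # keep rows satisfying a predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):                  # keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def join(r1, r2, key):                    # natural join on a shared column
    return [{**a, **b} for a in r1 for b in r2 if a[key] == b[key]]

people = [{"name": "Ann", "dept": 1}, {"name": "Bo", "dept": 2}]
depts = [{"dept": 1, "dname": "Ocean"}, {"dept": 2, "dname": "Astro"}]
print(project(select(join(people, depts, "dept"),
                     lambda r: r["dname"] == "Ocean"),
              ["name"]))                  # -> [{'name': 'Ann'}]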
21. Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*z
Two operations instead of five, and no division operator. The same idea works with the Relational Algebra!
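As a sanity check on the arithmetic analogy (an illustrative sketch, not from the slides), the two forms can be compared directly in Python:

def n_original(z):
    return ((z * 2) + ((z * 3) + 0)) / 1   # five operations, including a division

def n_optimized(z):
    return (2 + 3) * z                     # two operations (and the constant sum folds away)

# The rewrite never changes the result:
assert all(n_original(z) == n_optimized(z) for z in range(-1000, 1000))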
22. Shared Nothing Parallel Databases
Teradata
Greenplum
Netezza
Aster Data Systems
DataAllegro (acquired by Microsoft)
Vertica
MonetDB (recently commercialized as “Vectorwise”)
24. N-body Astrophysics Simulation
• 15 years in development
• 10^9 particles
• Gravity
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn's lab; work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska.
34. Data explosion, again
Data growth is outpacing Moore's Law. Why? The cost of acquisition has dropped through the floor, and every pairwise comparison of datasets generates a new dataset -- N^2 growth (for example, 1,000 datasets admit roughly 500,000 pairwise comparisons).
So: scalable analysis is necessary. But: scalable analysis is hard.
35. It’s not just the size…
Corollary: the number of applications also scales as N^2, since every pairwise comparison motivates a new application.
To keep up, we need to entrain new programmers, make existing programmers more productive, or both.
38. Zooplankton and Temperature
<Vis movie>
39. Why Visualization?
High bandwidth of the human visual cortex
Query-writing presumes a precise goal
Try this in SQL: “What does the salt wedge look like?”
40. Data Product Ensembles
source: Antonio Baptista, Center for Coastal Margin Observation and Prediction
41. Example: Find matching sequences
Given a set of sequences, find all sequences equal to “GATTACGATATTA”.
42. Example System: Teradata
AMP (Access Module Processor) = unit of parallelism
43. Example System: Teradata
Find all orders from today, along with the items ordered:
SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()
[Plan diagram: scan Order o, select date = today(); scan Item i; join on o.item = i.item]
44. Example System: Teradata
[Diagram: AMPs 1-3 each scan their local partition of Order o, apply select date = today(), then hash each surviving row by h(item) and send it to the AMP that owns that hash value]
45. Example System: Teradata
[Diagram: AMPs 1-3 each scan their local partition of Item i, hash each row by h(item), and redistribute it the same way]
46. Example System: Teradata
[Diagram: AMPs 1-3 each run join on o.item = i.item; AMP k now holds all orders and all lines where hash(item) = k, so every join runs locally]
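A minimal sketch of the idea behind slides 43-46 (hypothetical data; three simulated AMPs): because both inputs are partitioned by the same hash of the join key, matching rows always land on the same AMP, so each partition pair can be joined independently and the results simply unioned.

NUM_AMPS = 3

def partition(rel, key):
    """Distribute tuples to AMPs by hashing the join key."""
    parts = [[] for _ in range(NUM_AMPS)]
    for t in rel:
        parts[hash(t[key]) % NUM_AMPS].append(t)
    return parts

def local_join(r, s, key):
    return [{**x, **y} for x in r for y in s if x[key] == y[key]]

orders = [{"item": i % 5, "order_id": i} for i in range(10)]
items  = [{"item": i, "name": f"item-{i}"} for i in range(5)]

o_parts, i_parts = partition(orders, "item"), partition(items, "item")

# Each AMP joins only its own partitions; results are unioned.
result = []
for amp in range(NUM_AMPS):
    result.extend(local_join(o_parts[amp], i_parts[amp], "item"))

assert len(result) == len(orders)  # every order matches exactly one item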
47. MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map: processes an input key/value pair and produces a set of intermediate pairs
reduce: combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
48. Example: Document Processing
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
49. Example: Word length histogram
(The same abridged Declaration of Independence text as above.)
How many “big”, “medium”, and “small” words are used?
50. Example: Word length histogram
(Declaration text as above, with every word color-coded by length:)
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
51. Example: Word length histogram
(Declaration text as above.)
Split the document into chunks and process each chunk on a different computer (Chunk 1, Chunk 2).
52. Example: Word length histogram
(Declaration text as above, divided between two map tasks.)
Each map task emits (key, value) pairs for its chunk (the counts sum to each task's word total):
Map Task 1 (204 words): (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map Task 2 (190 words): (yellow, 20), (red, 71), (blue, 93), (pink, 6)
53. Example: Word length histogram
“Shuffle step”: group the intermediate pairs by key across both map outputs:
(yellow, 17), (yellow, 20)
(red, 77), (red, 71)
(blue, 93), (blue, 107)
(pink, 6), (pink, 3)
Reduce tasks sum the values for each key:
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
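Putting the three phases together, a minimal runnable Python sketch of this pipeline (not from the talk; the chunking, function names, and sample text are illustrative; the buckets follow the legend on slide 50):

from collections import defaultdict

def bucket(word):
    n = len(word)
    if n >= 10: return "yellow"  # big: 10+ letters
    if n >= 5:  return "red"     # medium: 5..9 letters
    if n >= 2:  return "blue"    # small: 2..4 letters
    return "pink"                # tiny: 1 letter

def map_task(chunk):
    """Tag each word with its size bucket, emitting (bucket, 1) pairs."""
    return [(bucket(w), 1) for w in chunk.split()]

def shuffle(pairs):
    """Group all intermediate values by key across the map outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Sum the per-chunk counts for one bucket."""
    return key, sum(values)

chunks = ["We hold these truths to be self-evident",
          "that all men are created equal and independent"]
intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
histogram = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(histogram)  # {'blue': 9, 'red': 4, 'yellow': 2}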
54. New Example: What does this do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
slide source: Google, Inc.
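For reference, this is the classic word-count example: it counts the occurrences of each word across the input documents. A minimal runnable Python transcription (the driver and names are illustrative, not Google's code):

from collections import defaultdict

def map_fn(doc_name, contents):
    """For each word in the document, emit (word, 1)."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum the 1s emitted for this word."""
    return word, sum(counts)

docs = {"doc1": "to be or not to be"}
groups = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        groups[key].append(value)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}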
55. Relational Database Management Systems (RDBMS)
Before RDBMS: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1979
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
56. MapReduce is a Nascent Database Engine
[Diagram mapping database-engine components (access methods and scheduling, query language, query optimizer) onto the Hadoop stack, with Pig Latin as the query language]
Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
57. MapReduce and Hadoop
MR introduced by Google
Published paper in OSDI 2004
MR: a high-level programming model and implementation for large-scale parallel data processing
Hadoop: open-source MR implementation, used by Yahoo!, Facebook, and the New York Times
58. A Query Language for MR: Pig Latin
High-level, SQL-like dataflow language for MR.
Goal: a sweet spot between SQL and MR; applies SQL-like, high-level language constructs to accomplish low-level MR programming.
Operators:
• LOAD
• STORE
• FILTER
• FOREACH … GENERATE
• GROUP
Binary operators:
• JOIN
• COGROUP
• UNION
Other support:
• UDFs
• COUNT
• SUM
• AVG
• MIN/MAX
Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
59. New Task: k-mer Similarity
Given a set of database sequences and a set of query sequences, return the top N similar pairs, where similarity is defined as the number of common k-mers.
60. Pig Latin program
D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));
Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));
R = JOIN Kd BY kmer, Kq BY kmer;
G = GROUP R BY (qid, did);
C = FOREACH G GENERATE group.qid, group.did, COUNT(R) AS score;
T = FILTER C BY score > 4;
STORE T INTO 'seqs.txt';
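Here FASTA() and kmers() are user-defined functions. A minimal Python sketch of the semantics I assume for kmers(), plus the score the JOIN/GROUP/COUNT pipeline computes for one (query, database) pair:

from collections import Counter

def kmers(k, seq):
    """All overlapping length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def score(qseq, dseq, k=7):
    """One match per pair of equal k-mer occurrences, since the JOIN
    produces one row per matching (Kd, Kq) pair."""
    cq, cd = Counter(kmers(k, qseq)), Counter(kmers(k, dseq))
    return sum(cq[m] * cd[m] for m in cq.keys() & cd.keys())

print(kmers(3, "GATTAC"))                 # ['GAT', 'ATT', 'TTA', 'TAC']
print(score("GATTACGA", "TTACGAT", k=4))  # 3 shared 4-mer occurrences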
61. New Task: Alignment
Goal: align reads to a reference genome.
RMAP alignment implemented in Hadoop: Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009.
Overview:
Map: split reads and reference into k-mers
Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
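A rough, simplified sketch of the map phase (not CloudBurst's actual code or record format): seeds are emitted keyed by k-mer, so the shuffle brings matching read and reference positions together for the reduce phase to extend.

def map_seeds(source, seq_id, sequence, k=8):
    """Emit (k-mer, (source, seq_id, position)) pairs.
    source tags the sequence as a 'read' or the 'reference'."""
    for pos in range(len(sequence) - k + 1):
        yield sequence[pos:pos + k], (source, seq_id, pos)

# After the shuffle, each reduce group holds every occurrence of one
# k-mer; read/reference pairs in the same group are alignment seeds
# to extend end-to-end.
pairs = list(map_seeds("read", "read_1", "GATTACGATATTA"))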
62. MapReduce Overhead
[Figure: runtimes at different cluster sizes, showing the overhead of parallel processing]
63. Elastic MapReduce
Custom Jar: Java
Streaming: any language that can read/write stdin/stdout
Pig: simple data flow language
Hive: SQL
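For the Streaming option, the mapper and reducer are ordinary executables that read stdin and write tab-separated key/value lines to stdout; the framework sorts the reducer's input by key. A minimal word-count pair in Python (illustrative file names):

# streaming_mapper.py: emit "word<TAB>1" for every word on stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# streaming_reducer.py: input arrives sorted by key; sum counts per word.
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")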
64. Taxonomy of Parallel Architectures
[Diagram: shared-memory designs are easiest to program, but $$$$; shared-nothing designs scale to 1000s of computers]
Editor's Notes
My name is Bill Howe. I’m not Ed Lazowska.
In all fields of science, data is starting to come in faster than it can be analyzed, so we need to advance and proliferate computational technologies in sensor networking, databases and data mining, visualization, machine learning, and cluster/cloud computing.
And if we don’t, we see UW losing its competitive edge.
The mission of the eScience Institute is to prevent that from happening
So by the animation loophole, there we go.
Funding! We have $1M from the state, and we just got a nice award from the Moore Foundation, and several proposals outstanding.
People! We have a fantastic team: Dave Beck in Biosciences, Jeff Gardner in Astrophysics and HPC, myself in Databases, Ed and Erik, Mette Peters in Health Sciences, and Chance Reshke in large-scale computing platforms.
And there’s our URL: escience.washington
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
The long tail of eScience -- huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists.
They rely on spreadsheets, email, and maybe a shared file system.
Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources.
However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel?
Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
Armbrust Lab combines lab-based and field-based studies to address basic questions about the function of marine ecosystems.
Asterisk/underlined (*) indicates custom software developed in the Armbrust Lab.
Blue: Traditional tools for “basement” bioinformatics -- individual scientists
Orange: Increased centralization, economies of scale, shared resources. Deployed in the Armbrust Lab
Yellow: Third-party tools developed for scalable bioinformatics
Purple: Emerging tools under evaluation for convenient petascale bioinformatics. Through a collaboration with the eScience Institute (under funding review by Moore Foundation!)
Thanks to advances in sensors, sequencing instruments, and algorithms, the field of bioinformatics is moving away from "single-task" software that operates on datasets that fit on a single computer in favor of flexible, "multi-purpose" frameworks that can operate on datasets that span clusters of computers.
In our lab, we have deployed a variety of flexible tools, and have developed our own software to streamline our scientific process and reduce the overall "time to insight". (Maybe talk about WebBlast and PPlacer here.)
Observing that the amount of data collected is doubling every year (outpacing even Moore's Law!), we are also collaborating with the UW eScience Institute to explore ways we can harness emerging technologies for massively parallel data analysis involving hundreds or thousands of machines. Some of these frameworks involve "cloud computing" -- the use of computational infrastructure provided, inexpensively, by "big players" in software and computing: Amazon, Microsoft, Google. [Maybe more on the eScience Institute?]
Dial down the expressiveness but dial up the programming and execution services
It turns out that you can express a wide variety of computations using only a handful of operators.
Two nodes slower than one, four nodes slower than 8 -- shows overhead of providing parallel processing
Data products are the currency of scientific and statistical communication with the public
Ex: Obama map
Ex: Mars Rover pictures generate 218M hits in 24 hrs
But: Datasets are growing too big and too complex to view through a few static images
Scientists want to create interactive visualizations that allow others to explore their results
Ex: Nasa 3D with Photosynth
Ex: CAMERA
Ex:
On the order of hundreds of points. Manual browsing.
This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
Visualization is a more efficient way to query data -- you can browse and explore.
But you need to be able to switch back and forth between interactive browsing and symbolic querying
Climatology is long-term average
Want to know the makeup of the text by word length. For example, we’d like to know how many words have greater than 10 characters. We’d also like to know how many words have between 5 and 9 characters, between 2 and 4 and those with just 1 character.
Map will read in text and tag each word as a different color depending on the length of the word.
Motivating Map task and intuition behind map…. Think of map as a group by.
Distribution of word lengths
It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
So these two different views of the world, RDBMS and MapReduce, are not really different at all -- just different feature sets along a continuum of data processing.
As evidence:
Teradata
Greenplum
Netezza
Aster Data Systems
Dataupia
Vertica
MonetDB
Hadoop implementation based on details in the MR 2004 paper.
You don't have to write separate map and reduce functions; Pig will take care of that for you, as well as optimize for you.
This is by no means an exhaustive list of operators
The goal here is to make Shared Nothing Architectures easier to program.