The Open Science Data Cloud is a petabyte-scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Architectures for Data Commons (XLDB 15 Lightning Talk), by Robert Grossman
These are the slides from a 5-minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
The Matsu Project - Open Source Software for Processing Satellite Imagery Data, by Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Adversarial Analytics - 2013 Strata & Hadoop World Talk, by Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980's and 90's was spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we motivate that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Comparing Big Data and Simulation Applications and Implications for Software ..., by Geoffrey Fox
At eScience in the Cloud 2014, Redmond, WA, April 30, 2014.
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers, and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote far more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data-intensive applications and to deduce the needed runtimes and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC clusters are presented.
Time to Science/Time to Results: Transforming Research in the Cloud, by Amazon Web Services
This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how scientific researchers are using the cloud to solve complex science, engineering, and business problems that require high-bandwidth, low-latency networking and very high compute capabilities. You will hear how leveraging the cloud reduces the costs and time to conduct large-scale, worldwide collaborative research. Researchers can access computational power, data storage, supercomputing resources, and data-sharing capabilities in a cost-efficient manner without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers in small schools or distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, compute-optimized C3 instances, and more to increase collaboration. This session will provide best practices and insight from the UC Berkeley AMP Lab on the services used to connect disparate sets of data to drive meaningful new insight and impact.
Materials Data Facility: Streamlined and automated data sharing, discovery, ..., by Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro pattern); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) view.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
Multiple regression, COVID mobility, and COVID-19 policy recommendation, by Kan Yuenyong
Multiple regression analysis of COVID-19 policy is a contemporary agenda. This work demonstrates how to use Python for data wrangling and R for statistical analysis, in a form suitable for publication in a standard academic journal. The model examines whether lockdown policy is relevant to controlling the COVID-19 outbreak.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Big data visualization frameworks and applications at Kitware
Marcus Hanwell, Technical Leader at Kitware, Inc.
March 27th, 2014
Kitware develops permissively licensed open source frameworks and applications for scientific data applications, and related areas. Some of the frameworks developed by our High Performance Computing and Visualization group address current challenges in big data visualization and analysis in a number of application domains including geospatial visualization, social media, finance, chemistry, biological (phylogenetics), and climate. The frameworks used to develop solutions in these areas will be described, along with the applications and the nature of the underlying data. These solutions focus on shared frameworks providing data storage, indexing, retrieval, client-server delivery models, server-side serial and parallel data reduction, analysis, and diagnostics. Additionally, they provide mechanisms that enable server-side or client-side rendering based on the capabilities and configuration of the system.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Practical Methods for Identifying Anomalies That Matter in Large Datasets, by Robert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ..., by Otávio Carvalho
Work presented in partial fulfillment of the requirements for the degree of Bachelor in Computer Science, Federal University of Rio Grande do Sul, Brazil.
Hadoop, streaming, terabytes, machine learning, batch, etc. These issues underpin the deployment of a Big Data architecture in production. But fundamental as they are, what about actually using the data? From the user's point of view, the questions become: what is in my data? What numerical model is relevant to my business questions? Does my model deliver the expected value? How do I share that value?
These questions share a common root: how do you go from Hadoop, or from the data lake, to a useful and usable working environment without drowning?
That is what we propose to explore in this session.
We will walk through different use cases (exploration, interpretation, and communication of results) and discover the architectures and tools at our disposal that open up almost unlimited horizons: Superset, Tableau, and Power BI to navigate the data; notebooks (Jupyter, Zeppelin, R) to analyze it; and D3.js to create custom visualizations in the browser.
For this, there is no need to dive into an ocean of data. A modest subset, a sample that fits in a laptop's memory, is enough in most cases. This is the territory of data science, the data lab, and above all data visualization.
Working from concrete use cases, we will illustrate these different steps and see together how to get the information out of your data lake and onto your screen.
Modernizing upstream workflows with AWS Storage - John Mallory, Amazon Web Services
Modernizing and transforming exploration and production workflows with AWS Storage services:
Accelerating seismic data retrieval, getting better data protection and reliability, and providing a common AWS data platform for compute- and graphics-intensive processing, simulation, and visualization workloads.
Capturing and processing streaming sensor data from remote oil rigs with Snowball Edge
Providing a Data Lake foundation for a next generation Digital Oilfield IoT analytics platform with Amazon S3
Speaker: John Mallory - AWS Storage Business Development Manager
Large Infrastructure Monitoring at CERN, by Matthias Braeger at Big Data Spain 2015
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
The title of this talk is a crass attempt to be catchy and topical, by referring to the recent victory of Watson in Jeopardy.
My point (perhaps confusingly) is not that new computer capabilities are a bad thing. On the contrary, these capabilities represent a tremendous opportunity for science. The challenge that I speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I suggest, is to get computation out of the lab—to outsource it to third party providers.
Abstract follows:
We have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of vast quantities of data. But we now face a far greater challenge. Exploding data volumes and new research methodologies mean that many more--ultimately most?--researchers will soon require similar capabilities. How can we possibly supply information technology (IT) at this scale, given constrained budgets? Must every lab become filled with computers, and every researcher an IT specialist?
I propose that the answer is to take a leaf from industry, which is slashing both the costs and complexity of consumer and business IT by moving it out of homes and offices to so-called cloud providers. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity, empowering investigators with new capabilities and freeing them to focus on their research.
I describe work we are doing to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date, and suggest a path towards large-scale delivery of these capabilities. I also suggest that these developments are part of a larger "revolution in scientific affairs," as profound in its implications as the much-discussed "revolution in military affairs" resulting from more capable, low-cost IT. I conclude with some thoughts on how researchers, educators, and institutions may want to prepare for this revolution.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio, by Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data kept in separate storage, such as object storage or a remote data center. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What Analytics Zoo is and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem..., by Ian Foster
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.
We have made much progress over the past decade toward effectively harnessing the collective power of IT resources distributed across the globe. In fields such as high-energy physics, astronomy, and climate, thousands benefit daily from tools that manage and analyze large quantities of data produced and consumed by large collaborative teams.
But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?
Consumers and businesses face similar challenges, and industry has responded by moving IT out of homes and offices to so-called cloud providers (e.g., GMail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.
I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards large-scale delivery of these capabilities.
This talk was given at a workshop entitled "Cybersecurity Engagement in a Research Environment" at Rady School of Management at UCSD. The workshop was organized by Michael Corn, the UCSD CISO. It tries to provoke discussion around the cybersecurity features and requirements of international science collaborations, as well as more generally, federated cyberinfrastructure systems.
AWS Summit Berlin 2013 - Big Data Analytics, by AWS Germany
Learn more about the tools, techniques, and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute, and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types that enable high-performance analytics.
Similar to Using the Open Science Data Cloud for Data Science Research
Some Frameworks for Improving Analytic Operations at Your Company, by Robert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed, by Robert Grossman
There are two cultures in data science and analytics: those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
What is Data Commons and How Can Your Organization Build One?, by Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Using the Open Science Data Cloud for Data Science Research
1. Using the Open Science Data Cloud for Data Science Research
Robert Grossman, University of Chicago, Open Cloud Consortium
June 17, 2013
2. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
3. Part 1: What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
5. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
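In other words, the criterion is weak scaling: grow the data and the hardware together and the wall-clock time should stay roughly constant. A minimal sketch of that check in Python (the 50 TB/hour per-rack throughput is a hypothetical figure, not from the slides):

```python
# Hypothetical illustration of the "big-data scalable" criterion (weak scaling):
# doubling the data *and* the racks should leave wall-clock time roughly unchanged.

def wall_clock_hours(data_tb: float, racks: int, tb_per_rack_hour: float = 50.0) -> float:
    """Idealized runtime: each rack scans its own share of the data in parallel."""
    return (data_tb / racks) / tb_per_rack_hour

base = wall_clock_hours(data_tb=1000, racks=10)    # 1 PB spread over 10 racks
scaled = wall_clock_hours(data_tb=2000, racks=20)  # 2 PB spread over 20 racks

assert abs(base - scaled) < 1e-9                   # same time, more data
print(f"{base:.1f} h on 10 racks vs {scaled:.1f} h on 20 racks")
```

If doubling both the data and the racks noticeably increases the runtime, the algorithm or the infrastructure is not big-data scalable in this sense.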
6. Commercial Cloud Service Provider (CSP)
A 15 MW data center: 100,000 servers, 1 PB of DRAM, 100's of PB of disk, and ~1 Tbps of egress bandwidth on the data center network.
Automatic provisioning and infrastructure management; monitoring, network security and forensics; accounting and billing; a customer-facing portal.
About 25 operators for a 15 MW commercial cloud.
7. OSDC's vote for a datascope: a (boutique) data center-scale facility with a big-data scalable analytic infrastructure.
8. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
9. Some Examples of Big Data Science
Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | one
Astronomy - LSST | 10 years | 12 PB/year** | one
Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. ... This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
12. There Are Two Essential Characteristics of a Cloud
1. Self service
2. Scale
• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
• Clouds can be designed to be secure and compliant.
15. Types of Clouds
• Public clouds - Amazon
• Private clouds - run internally by universities or companies
• Community clouds - run by organizations (either formally or informally), such as the Open Cloud Consortium
16. Amazon Web Services (AWS)? vs. community clouds, science clouds, etc.
AWS: scale; the simplicity of a credit card; a wide variety of offerings.
Community and science clouds: lower cost (at medium scale); data too important for a commercial cloud; computing over scientific data is a core competency; can support any required governance / security.
The OCC supports AWS interop and bursting when permissible.
17. Science Clouds
 | NFP Science Clouds | Commercial Clouds
POV | Democratize access to data; integrate data to make discoveries; long-term archive. | As long as you pay the bill; as long as the business model holds.
Data & storage | Data-intensive computing & HP storage | Internet-style scale-out and object-based storage
Flows | Large & small data flows | Lots of small web flows
Streams | Streaming processing required | N/A
Accounting | Essential | Essential
Lock-in | Moving environments between CSPs is essential | Lock-in is good
Interop | Critical, but difficult | Customers will drive it to some degree
18. Essential Services for a Science CSP
• Support for data-intensive computing
• Support for big data flows
• Account management, authentication, and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, and event reporting
• Access to large amounts of public data
• High-performance storage
• Simple data export and import services
19. Datascope - Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services.
20. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
21. Datascope - Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services, backed by a Cloud Service Operations Center (CSOC).
23. Foundations of data science: data; analytic infrastructure; models and algorithms; general and discipline-specific software applications and tools; and established best practices and strategies for data science in general and for discipline-specific data science in particular.
25. Theory to Big Data Spectrum
No data: mathematical theorems.
Small data (GB): traditional statistical modeling.
Medium data (TB): (semi-)automating statistical modeling.
Big data (PB): simple counts and statistics over big data.
The OSDC datascope: a 0.5-2.0 MW facility.
26. Part 4: The Open Science Data Cloud - www.opensciencedatacloud.org
27. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
29. Tukey
• Tukey (based in part on Horizon).
• We have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.
30. Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef.
• Version 0.1.
31. UDR
• UDT is a high-performance network transport protocol.
• UDR = rsync + UDT.
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
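As a rough illustration of how such a transfer might be scripted (not taken from the slides): the sketch below assumes a udr binary is installed and wraps rsync in the documented `udr rsync ...` form; the host and paths are placeholders, not OSDC endpoints.

```python
# Hypothetical wrapper for a UDR-style transfer (rsync semantics over the UDT protocol).
# Assumes a udr binary on PATH that accepts the "udr rsync ..." form; the host and
# paths below are illustrative placeholders.
import subprocess

def sync_dataset(remote: str, local: str) -> None:
    """Mirror a remote dataset locally; rsync only moves files that changed."""
    subprocess.run(["udr", "rsync", "-av", remote, local], check=True)

if __name__ == "__main__":
    sync_dataset("user@data.example.org:/public/earth_science/", "/data/mirror/earth_science/")
```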
32. Open Science Data Cloud Services
• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services.
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
33. www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: the Biomedical Commons Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
34. OCC Members & Partners
• Companies: Cisco, Yahoo!, Intel, ...
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, ...
• Federal agencies and labs: NASA
• International partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, ...
• Partners: National Lambda Rail
35. Third-party open source software + open source software developed by the OCC and open standards (Tukey, Yates) + data center + data with permissions + authorization of users' access to data + policies, procedures, controls, etc. + governance and legal agreements + sustainability model
37. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
39. OSDC Public Data Sets
• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities data
40. Part 6: OSDC Working Groups. Just look around you.
42. Matsu Architecture
• Hadoop HDFS holds the Level 0, Level 1, and Level 2 images.
• The Matsu MapReduce-based tiling service uses MapReduce to process Level n data into Level n+1 data and to partition images for the different zoom levels.
• A NoSQL database stores images at different zoom layers, suitable for an OGC Web Mapping Server, along with WMS tiles and derived data products, and is served through the Matsu Web Map Tile Service (WMTS).
• Analytic services: NoSQL-based, streaming, and MapReduce-based analytic services.
• Presentation services, a Web Coverage Processing Service (WCPS), and workflow services.
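To make the tiling step concrete, here is a small, self-contained sketch of the partition-images-into-tiles idea in plain Python. It is not the Matsu implementation; the record format and 256-pixel tile size are assumptions for illustration, and in Matsu the map and reduce functions would run under Hadoop MapReduce rather than in a single process.

```python
# Hypothetical MapReduce-style tiling step (plain Python, not the Matsu code):
# the mapper assigns each pixel to the tile that will contain it, and the
# reducer assembles one output tile per (tile_x, tile_y) key.
from collections import defaultdict

TILE = 256  # pixels per tile edge (assumed)

def map_pixels(record):
    """record: (x, y, value) in source pixel coordinates."""
    x, y, value = record
    key = (x // TILE, y // TILE)              # which tile this pixel falls into
    yield key, (x % TILE, y % TILE, value)

def reduce_tile(key, pixels):
    """Assemble a dense TILE x TILE array for one tile key."""
    tile = [[0] * TILE for _ in range(TILE)]
    for px, py, value in pixels:
        tile[py][px] = value
    return key, tile

def run(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_pixels(record):
            groups[key].append(value)
    return dict(reduce_tile(k, v) for k, v in groups.items())

tiles = run([(10, 20, 1), (300, 20, 2)])   # two pixels, two different tiles
print(sorted(tiles.keys()))                # [(0, 0), (1, 0)]
```

A second, similar job could then produce the next zoom level by resampling each tile, which is one way the slide's "Level n to Level n+1" step can be expressed in the same map/reduce form.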
46. Analyzing Data from The Cancer Genome Atlas (TCGA)
Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up, and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.
With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze and the pipelines that you want to use.
3. Begin analysis.
47. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1,000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
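The arithmetic behind these bullets is easy to check; a small sketch (the 10:1 compression ratio is implied by the slide's 100 PB figure rather than stated explicitly):

```python
# Back-of-the-envelope check of the one-million-genomes numbers on this slide.
genomes = 1_000_000
tb_per_patient = 1                          # tumor + normal samples, per the slide

raw_pb = genomes * tb_per_patient / 1000    # 1,000,000 TB = 1,000 PB = 1 EB
compressed_pb = raw_pb / 10                 # ~10:1 compression implied by the slide
sequencing_cost = genomes * 1000            # at $1000 per genome

print(f"raw: {raw_pb:,.0f} PB (~1 EB)")
print(f"compressed: {compressed_pb:,.0f} PB")
print(f"sequencing cost: ${sequencing_cost / 1e9:,.1f}B")
```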
48. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.
49. Biomedical Commons Cloud (BCC) Working Group
Example: the Open Cloud Consortium's Biomedical Commons Cloud (BCC) connects Medical Research Centers A, B, and C and Hospital D across a cloud for public data, a cloud for controlled genomic data, and a cloud for EMR and PHI data.
50. Resource | Who uses it | Who operates it
Open Science Data Cloud (OSDC) | Pan-science data for researchers | Open Cloud Consortium (OCC), supported by university OCC members
Biomedical Commons Cloud (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group, supported by OCC university members
Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago, supported by the OCC
51. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide-area version of this project.
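A minimal sketch of the underlying idea, with everything hypothetical except the goal stated on the slide: detect tasks that are progressing much more slowly than their peers and ask the network to give their flows more bandwidth. The progress-rate record format, the 0.5 threshold, and the controller hook below are illustrative assumptions, not the working group's design.

```python
# Hypothetical straggler detection for a MapReduce job; the OpenFlow action is a stub.
from statistics import median

def find_stragglers(tasks, threshold=0.5):
    """tasks: dict of task_id -> progress rate (fraction of work done per second).
    A task is flagged if its rate falls below `threshold` times the median rate."""
    cutoff = threshold * median(tasks.values())
    return [task_id for task_id, rate in tasks.items() if rate < cutoff]

def boost_bandwidth(task_id):
    """Placeholder: here an OpenFlow controller would install a higher-priority
    flow rule for the straggler's shuffle traffic."""
    print(f"requesting extra bandwidth for {task_id}")

rates = {"map-01": 0.010, "map-02": 0.011, "map-03": 0.002, "map-04": 0.009}
for task in find_stragglers(rates):
    boost_bandwidth(task)        # flags map-03 in this toy example
```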
52. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
54. • Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
• Question 2. What data can we add to the OSDC to facilitate data-intensive, cross-disciplinary discoveries?
• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
• Question 4. What tools and applications can we add to the OSDC to facilitate data-intensive, cross-disciplinary discoveries?
• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
• Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
56. Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.