Here, I describe how the Texas Advanced Computing Center has shifted its focus from traditional modeling and simulation towards fully embracing big data analytics performed by users with diverse technical backgrounds.
QCon São Paulo: Real-Time Analytics with Spark Streaming (Paco Nathan)
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk includes demos of streaming apps with machine-learning examples, and considers public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
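To make that trade-off concrete: the talk's examples aren't reproduced here, but the following minimal Count-Min Sketch in Python (parameters chosen purely for illustration) shows how a probabilistic data structure answers frequency queries from a small, fixed memory footprint at the cost of a bounded overestimate.

import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed memory (width * depth counters)."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, idx in enumerate(self._hashes(item)):
            self.table[row][idx] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows bounds the error.
        return min(self.table[row][idx] for row, idx in enumerate(self._hashes(item)))

cms = CountMinSketch()
for word in ["spark", "storm", "spark", "kafka", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))  # -> 3 (an upper bound, never an undercount)

Width and depth fix the error bound and its probability; the table never grows with the stream, which is where the orders-of-magnitude memory savings come from.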
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
With Dask and Numba, you can write NumPy-like and Pandas-like code and have it run very fast on multi-core systems as well as at scale on many-node clusters.
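A minimal sketch of that claim (assuming dask, numba, and numpy are installed; not code from the talk):

import numpy as np
import dask.array as da
from numba import njit

# Dask: NumPy-like API over chunked arrays, scheduled across cores or a cluster.
x = da.random.random((8000, 8000), chunks=(1000, 1000))
print(x.mean().compute())  # builds a lazy task graph, then executes it in parallel

# Numba: JIT-compiles a plain-Python numeric loop to machine code.
@njit
def rolling_sum(a, w):
    out = np.empty(a.size - w + 1)
    s = a[:w].sum()
    out[0] = s
    for i in range(1, out.size):
        s += a[i + w - 1] - a[i - 1]  # slide the window in O(1) per step
        out[i] = s
    return out

print(rolling_sum(np.arange(10.0), 3))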
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine; in other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
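A minimal sketch of the D-Stream model using the Python API mentioned below (assuming a Spark 1.2+ installation and a text source on localhost:9999; illustrative, not the talk's demo code):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches (D-Streams)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()

The same map/reduce logic would run unchanged as a batch job, which is the point of the unified engine.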
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Big Data Modeling Challenges and Machine Learning with No Code (Liana Ye)
Presented at SF Bay ACM (202001015) by Karthik Chinnusamy
What are the Big Data modeling challenges in today's field? With a few best-practice recommendations and machine learning approaches, I will use KNIME to show the modeling advantages for Big Data around the following themes:
* Performance: Good data models can help us quickly query the required data and reduce I/O throughput.
* Cost: Good data models can significantly reduce unnecessary data redundancy, reuse computing results, and reduce the storage and computing costs for the big data system.
* Efficiency: Good data models can greatly improve user experience and increase the efficiency of data utilization.
* Quality: Good data models make data statistics more consistent and reduce the possibility of computing errors.
I will also describe tools for Sources, Ingestion, Exploration, Modeling and Machine Learning.
Digital Science: Reproducibility and Visibility in Astronomy (Jose Enrique Ruiz)
The science done in Astronomy is digital science, from observing proposals to final publication, including the data and software used: each of the elements and actions involved in scientific output can be recorded in electronic form. Yet this does not prevent the final outcome of an experiment from being difficult to reproduce; the procedure can be long, tedious, and not easily accessible or understandable, even to the author. At the same time, we have a rich infrastructure of files, observational data, and publications, which could be used more efficiently if we achieved greater visibility of scientific production, avoiding duplication of effort and reinvention.
Reproducibility is a cornerstone of the scientific method, and extraction of relevant information from the current and future data flood is key in Astronomy. The AMIGA group (Analysis of the interstellar Medium of Isolated GAlaxies, IAA-CSIC, http://amiga.iaa.es) faces these two challenges in the European project "Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science", which enables the preservation of methodology in scalable semantic repositories to facilitate its discovery, access, inspection, exploitation, and distribution. These repositories store experiments as "Research Objects" whose main constituents are digital scientific workflows. These provide a comprehensive view and clear scientific interpretation of the experiment, as well as automation of the method, going beyond the usual pipelines that normally end at data processing.
The quantitative leap in volume and complexity of the next generation of archives will require analysis and data mining tasks to live closer to the data, in distributed computing and storage environments; but these tasks should also be modular enough to allow customization by scientists, and easily accessible to foster their dissemination among the community. Astronomy is a collaborative science, but like many other disciplines it has also become highly specialized. Sharing, preservation, discovery, and much-simplified access to resources in the composition of scientific workflows will enable astronomers to benefit greatly from each other's highly specialized know-how; they constitute a way to push Astronomy to share and publish not only results and data, but also processes and methodologies.
We will show how the use of scientific workflows can improve the reproducibility of experiments, enable more efficient exploitation of astronomical archives, and increase the visibility and reuse of scientific methodology.
Hadoop or Spark: is it an either-or proposition? (Slim Baltagi)
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
How Apache Spark fits into the Big Data landscape (Paco Nathan)
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (Paco Nathan)
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
IPython Notebooks have given us a substantial improvement in the documentation of scripts, as well as in their inspection and re-use. IPython Notebooks also provide access to different programming languages (Fortran, IDL, R, Shell, ...) within the same script, which, together with their web-based access, makes them an ideal tool for collaborative work (multi-language, multi-user, multi-platform, etc.). I will describe the kinds of things that can be done with IPython Notebooks, from collaborative development of multi-language code, through the reuse of tutorials and interactive visualization of results, to the distribution of more modular code and the final publication of a verifiable, reproducible digital experiment: the preamble to executable papers.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... (Geoffrey Fox)
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink, and Heron, as well as MPI and Asynchronous Many-Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions with multiple polymorphic implementations.
Building a distributed search system with Hadoop and Lucene (Mirko Calvaresi)
This work analyses the problems arising from the so-called Big Data scenario, which can be defined as the technological challenge of managing and administering quantities of information of global dimension, on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), with an exponential growth rate.
We'll explore a technological and algorithmic approach to handling and computing these amounts of data, which exceed the computational limits of a traditional architecture based on real-time request processing. In particular, we'll analyze an open source technology, Apache Hadoop, which implements the approach known as map and reduce.
We'll also describe how to distribute a cluster of commodity servers to create a virtual file system, and how to use this environment to populate a centralized search index (built with another open source technology, Apache Lucene).
The practical implementation is a web-based application that offers the user a unified search interface over a collection of technical papers.
The goal is to demonstrate that a performant search system can be obtained by pre-processing the data using the map-and-reduce paradigm, in order to obtain a real-time response that is independent of the underlying amount of data.
Finally, we'll compare this solution to different approaches based on clustering or NoSQL solutions, with the aim of describing the characteristics of concrete scenarios that suggest the adoption of those technologies.
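The thesis code isn't shown here, but the pre-processing idea can be sketched in a few lines of Python: a map phase emits (term, document) pairs and a reduce phase collapses them into postings lists, which is exactly the shape of work Hadoop distributes and an index like Lucene then serves.

from collections import defaultdict

def mapper(doc_id, text):
    # Map: emit one (term, doc_id) pair per token occurrence.
    for term in text.lower().split():
        yield term, doc_id

def reducer(term, doc_ids):
    # Reduce: collapse a term's postings into a sorted, de-duplicated list.
    return term, sorted(set(doc_ids))

docs = {1: "hadoop distributes map and reduce tasks",
        2: "lucene builds the search index",
        3: "hadoop feeds the lucene index"}

# Shuffle phase: group mapper output by key, as the framework would between phases.
groups = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in mapper(doc_id, text):
        groups[term].append(d)

index = dict(reducer(t, ds) for t, ds in groups.items())
print(index["hadoop"])  # -> [1, 3]: lookup cost no longer depends on corpus size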
Microservices, containers, and machine learning (Paco Nathan)
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) gets containerized and used to crawl and parse email archives. These produce JSON data sets; we then run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations of two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
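Exsto's actual implementation isn't reproduced here; as a toy illustration of the TextRank side, this Python sketch ranks words by PageRank over a co-occurrence graph (the window size and sample text are made up for the example):

import itertools
import networkx as nx

def textrank_keywords(tokens, window=3, top_n=5):
    """Rank words by PageRank over a co-occurrence graph (TextRank-style)."""
    graph = nx.Graph()
    for i in range(len(tokens) - window + 1):
        # Connect every pair of words that co-occur within the sliding window.
        for a, b in itertools.combinations(tokens[i:i + window], 2):
            if a != b:
                graph.add_edge(a, b)
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]

text = ("spark streaming extends the spark engine so streaming jobs "
        "reuse the same engine as batch jobs").split()
print(textrank_keywords(text))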
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
High Performance Processing of Streaming Data (Geoffrey Fox)
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack: SLAM (Simultaneous Localization and Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
FOSS4G NA St. Louis 2018
https://2018.foss4g-na.org/session/drag-and-drop-open-source-geotools-etl-apache-nifi
Loading your GeoTools SimpleFeatures into your cloud or database never seems as easy as it should be. There's Twitter streams, FTP servers, inboxes, dropboxes, and all sorts of other data that you need to parse, convert to SimpleFeatures, and then ingest into your GeoTools datastore. Luckily, GeoMesa has teamed with HortonWorks DataFlow to provide a fully open source tool to ease the pain of ingesting data into ANY GeoTools datastore. It's as easy as drag and drop! Literally!
In this talk Constantin and Andrew will introduce Apache NiFi -- an open source project dedicated to making dataflow easy. We'll show how real-time streaming data such as satellite AIS can be ingested and managed in real-time using ingest pipelines with your web browser without having to compile any code.
Packaging computational biology tools for broad distribution and ease-of-reuse (Matthew Vaughn)
A typical instance of computational biology software is composed of interpreted code, compiled binaries, shared libraries, and shell scripts, sometimes mixed with web services or databases, running in the context of a complex operating system atop increasingly sophisticated physical resources. How can we expect computations to be sharable and reproducible, and how can we hope to train people to use such resources? This talk describes how the Texas Advanced Computing Center enables distribution and use of scientific software via various approaches, including Jupyter notebooks, GitHub repositories, computation-oriented web service APIs, virtual machine images, and container technologies such as Docker, and how these approaches complement one another for training and education.
Values & Vision - Cloud Sandboxes for BIG Earth Sciences (Terradue)
Terradue is now engaging in the race to develop innovative solutions that aim at better preparing people for the new digital, data-intensive science era.
How to Create the Google for Earth Data (XLDB 2015, Stanford) (Rainer Sternfeld)
This talk was built around the example of the NOAA Big Data Project, in which Planet OS is a partner with Amazon Web Services. The aim of the NOAA Big Data Project is to bring NOAA's atmospheric and oceanic data to the cloud and make it discoverable and machine-readable.
This presentation outlines some of the organizational and technical challenges that exist in the project, as well as potential solutions and ideas to approach this set of challenges.
The aim of the EU FP7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced "lark"), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. The LarKC platform is available at larkc.sourceforge.net. This talk is part of a tutorial for early users of the LarKC platform, and introduces the platform and the project in general.
Data-intensive applications on cloud computing resources: Applications in lif... (Ola Spjuth)
Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A... (BigData_Europe)
Slides for the keynote talk at Big Data Europe workshop nr 3, held 11.9.2017 in Amsterdam, co-located with the SEMANTiCS2017 conference, by Ron Dekker, Director of CESSDA: "European Open Science Agenda: where we are and where we are going?"
Designing HPC & Deep Learning Middleware for Exascale Systems (inside-BigData.com)
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Openstack - An introduction/Installation - Presented at Dr Dobb's conference... (Rahul Krishna Upadhyaya)
This slide deck was presented at the Dr. Dobb's Conference in Bangalore. It covers:
* An introduction to OpenStack in general
* Projects under OpenStack
* Contributing to OpenStack
It was presented jointly by CB Ananth and Rahul at the Dr. Dobb's Conference, Bangalore, on 12 Apr 2014.
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...) (VMware Tanzu)
Business Track presented by Animesh Singh, Lead Architect and Strategist at IBM.
Bringing the world's best IaaS to the world's best PaaS: in this talk, IBM and Rackspace share their experiences of running Cloud Foundry on OpenStack. The talk focuses on how Cloud Foundry and OpenStack complement each other, how they technically integrate using the Cloud Provider Interface (CPI), how OpenStack setup can be automated for Cloud Foundry deployments, and some best practices for configuring a scalable environment.
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ... (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to the Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Using the Open Science Data Cloud for Data Science Research (Robert Grossman)
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Turn Data Into Actionable Insights - StampedeCon 2016 (StampedeCon)
At Monsanto, emerging technologies such as IoT, advanced imaging, and geo-spatial platforms, along with molecular breeding, ancestry, and genomics data sets, have made us rethink how we approach developing, deploying, scaling, and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics at scale and to integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; deployed models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical, and batch analytics at scale; and integrated analytics with our core product platforms to turn data into actionable insights.
On-Demand Cloud Computing for Life Sciences Research and Education (Matthew Vaughn)
The Jetstream cloud is a collaboration between CyVerse partners TACC and the University of Arizona, the University of Chicago, Johns Hopkins University, and Indiana University to bring the flexibility and ease-of-use of CyVerse Atmosphere to the entire community of science, at a much larger scale. Jetstream is a cloud resource operated as part of XSEDE, and built from two independent OpenStack clusters, each capable of supporting thousands of virtual machines and data volumes. The clusters are integrated via the user-friendly "Atmosphere" interface developed by CyVerse, with authentication enabled by Globus, and, unlike the CyVerse cloud, they also offer full access to OpenStack web service APIs. Jetstream features a diverse catalog of virtual machine templates. One can launch a personal Galaxy server, do advanced biostatistics, use Matlab, or experiment with new technologies like Docker, all on Jetstream. This talk highlights the unique capabilities of Jetstream and provides information about how researchers from all over can access it.
How Cyverse.org enables scalable data discoverability and re-use (Matthew Vaughn)
Cyverse.org designs, builds, and operates an innovative, integrated life sciences cyberinfrastructure. It provides data management and analysis capabilities with point and click, cloud, API, and command-line interfaces that engage users of any computing proficiency and is based on an extensible platform that integrates local and national-scale HPC, storage, and cloud resources. Cyverse directly supports thousands of users who store and access over 2PB of research data, use millions of compute hours annually, and participate in the platform's improvement, plus a secondary user community from partner projects that have built atop it. Cyverse is organized around "Data Store" and "App Catalog" services, each of which enables users to upload digital research assets that can be kept private, shared, or made public. Recently, Cyverse has been transitioning from passively enabling digital sharing towards active facilitation. It is partnering with repositories like NCBI SRA to enable direct submission from Cyverse applications, adopting commonly-used ontologies, enabling import/export of virtual machine images, developing metadata-driven persistent landing pages for data sets, and providing DOI (and other identifier) services. These new features are expected to further catalyze growth of an interoperable, interconnected network of shared research infrastructure across the biological sciences.
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing (Matthew Vaughn)
Intro slides from AKES workshop at ISMB2016. This workshop addresses the challenges and requirements for working effectively on cloud computing and high performance computing resources, discusses the key principles that should guide responsible scientific computation and collaboration, and using hands-on sessions presents practical solutions using emergent software tools that are becoming widely adopted in the global scientific community. Specifically, we will look at using “containers” to bundle software applications and their full execution environment in a portable way. We will look at managing and sharing data across distributed resources. And finally, we will tackle how to orchestrate job execution across systems and capture metadata on the results (and the process) so that parameters and methodologies are not lost.
Arabidopsis Information Portal: A Community-Extensible Platform for Open Data (Matthew Vaughn)
Araport is an innovative model organism database resource that offers users the ability to bring their own visualizations, data sets, algorithms, and genome browser tracks and share them with their colleagues.
Arabidopsis Information Portal overview from Plant Biology Europe 2014 (Matthew Vaughn)
An overview of the design, technical decisions, and implementation of the Arabidopsis Information Portal community-extensible data sharing and analytics platform.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
This PDF is about schizophrenia.
For more details, visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Seminar of U.V. Spectroscopy, by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
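The quantitative basis behind that measurement, not stated in the abstract, is the Beer-Lambert law; a worked example with illustrative numbers:

% Beer-Lambert law: absorbance grows with molar absorptivity, path length, concentration.
A = \log_{10}\frac{I_0}{I} = \varepsilon\, l\, c
% Example: \varepsilon = 1.5 \times 10^{4}\ \mathrm{L\,mol^{-1}\,cm^{-1}},\ l = 1\ \mathrm{cm},\ c = 2 \times 10^{-5}\ \mathrm{mol\,L^{-1}}
% \Rightarrow A = 1.5 \times 10^{4} \cdot 1 \cdot 2 \times 10^{-5} = 0.30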
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly seek natural and preventative health solutions, this industry is expanding quickly. Innovations in product formulation and the use of cutting-edge technology for customized nutrition are further driving market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment across categories including vitamins, minerals, probiotics, and herbal supplements.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
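For context on the resolution claim, the standard diffraction limit gives the smallest angle a telescope can resolve; the numbers below are illustrative, not taken from the paper:

% Rayleigh diffraction limit for aperture D at wavelength \lambda:
\theta \approx 1.22\,\frac{\lambda}{D}
% For one 8.4 m LBT mirror at \lambda = 550\ \mathrm{nm}:
% \theta \approx 1.22 \times \frac{5.5 \times 10^{-7}\ \mathrm{m}}{8.4\ \mathrm{m}} \approx 8.0 \times 10^{-8}\ \mathrm{rad} \approx 0.016''
% Adaptive optics is what lets a visible-light instrument approach this limit through the atmosphere.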
This presentation gives a brief overview of the structural and functional attributes of nucleotides, the structure and function of genetic materials, and the impact of UV rays and pH upon them.
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
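For readers new to the topic, the CHSH form of Bell's inequality that the loophole-free experiments test is standard background (not specific to this talk):

% CHSH combination of correlations at detector settings (a, a') and (b, b'):
S = E(a,b) - E(a,b') + E(a',b) + E(a',b')
% Local hidden-variable (deterministic) models obey |S| \le 2,
% while quantum mechanics reaches |S| = 2\sqrt{2} \approx 2.83,
% and it is this violation that the 2020-era experiments confirmed.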
Scaling People, Not Just Systems, to Take On Big Data Challenges
1. SCALING PEOPLE, NOT JUST SYSTEMS, TO TAKE ON BIG DATA CHALLENGES
Matthew Vaughn @mattdotvaughn
Director, Life Sciences Computing, TACC
Co-PI Cyverse, Araport, Jetstream Cloud
2. TACC AT A GLANCE
Personnel: 160 full-time staff (~70 PhD)
Facilities: 10 MW data center capacity; new office facility to be completed by start of 2016
Systems and Services: a billion compute hours per year; 5 billion files, 50 petabytes of data, hundreds of public datasets
Capacity & Services: HPC, HTC, visualization, large-scale data storage, cloud computing; consulting, curation and analysis, code optimization, portals and gateways, web service APIs, training and outreach
3. EXTREME SCALE SUPERCOMPUTING
Stampede: #7 HPC system in the world for computation; 500k CPU cores, 9.7 PF
Lonestar 5: Texas-focused Cray XC40; 30,000 Intel Haswell cores, 1.25 PF
Wrangler: 0.6 PB usable DSSD flash storage with a 1 TB/s read rate, plus 10 PB Lustre
Maverick: 132 fat nodes with dual 10-core Ivy Bridge CPUs plus NVIDIA Kepler K40 GPGPUs
Chameleon & Jetstream Cloud: 1,400 nodes running OpenStack
Disk and Tape Storage: 100+ PB storage in a HIPAA-aligned data center
Coming Soon: Hikari: green computing system partnership with NEDO and NTT; 10k Haswell cores; HVDC and (partial) solar power; support for the container ecosystem
4. TACC’S STAMPEDE SUPPORTS AN INCREDIBLE AMOUNT & DIVERSITY OF RESEARCH
Through <3 years of production operations:
• Over *2 billion* processor hours delivered to end users
• 6+ million successful jobs
• About 10,000 students, faculty, and staff use Stampede directly
• Over 30,000 more use it indirectly via portals and services
• Peer-reviewed requests for time (via XSEDE) run ~500% of available hours
• Stampede alone supports nearly 2,500 funded projects across the United States and abroad
6. Plasma Physics
Researchers at the Institute for Fusion Studies are using Stampede to model plasma flow and turbulence both in fusion power-generating tokamaks and in the ionosphere (with applications to GPS).
7. Finding our Ancestors
Genomics analysis performed on Stampede compared the genomes of nine ancient Europeans that bridge the transition to agriculture in Europe between 8,000 and 7,000 years ago, revealing that most present-day Europeans derive from at least three highly differentiated populations.
9. THE EVOLUTION OF A CYBERINFRASTRUCTURE
Once upon a time, most of us built garage-style clusters… HPC systems have grown up since then and become much more powerful and sophisticated.
We built a very successful center based on 3 pillars in our first 7 years:
• Simulation & HPC
• Visualization
• People (Consulting & Algorithms)
But a curious thing happened while HPC was running a victory lap…
10. NOT JUST ONE DATA TSUNAMI BUT THOUSANDS OF THEM
11. DNA SEQUENCING COSTS DECREASED FROM $1B/GENOME IN 2000 TO $500 IN 2015
12. DOZENS OF DISRUPTIVE NEW MEASUREMENT TECHNOLOGIES EMERGED, AND THEY ALL GENERATE GBs OR TBs OF DATA
14. KEY CHALLENGES EMERGING FROM THE BIG DATA REVOLUTION
• Data generation occurs at higher volume and is geographically dispersed
• More researchers are asking questions that are too big for local-scale computing
• Research teams are larger, more distributed, and technologically heterogeneous
• Workflows have increasingly complex hardware and software requirements
• Data and software sharing, while expected, is still too hard
Science and engineering hinges on managing and analyzing large data, so TACC's strategy embraces data and its analysis right alongside computing.
15. DIVERSE DISTRIBUTED RESEARCH TEAMS
Mike: computing novice; works remotely at a partner site
Eliza: masters specific analysis skills; readily adopts new tech
Roshan: computationally experienced; focused on interpretation
Paulo: staff computational expert; supports multiple projects
Nikolaidas Group: mostly experimentalists; strict data sharing & access
16. THEIR NEEDS (30,000 FT. VIEW)
• Store, organize, and share primary data
• Iteratively perform primary analyses
• Store, organize, and share derived data products
• Iteratively generate and explore hypotheses
• Share analytical code with the scientific public
• Integrate results from new experiments
• Publish data alongside plots, visualizations, and analytical tools
17. THEIR NEEDS (500 FT. VIEW)
• Data lifecycle management
• Fine-grained permission management
• Discoverability
• Version control
• Domesticating promising new analysis codes based on often-immature technology
• Doing reproducible computational science
• Adopting efficient analytical methods
18. WORKFLOWS NOW TECHNICALLY COMPLICATED
LANGUAGES: Python 2 & 3, R, Julia, Perl, Matlab, Java, Scala, Clojure, etc., .NET, C++, Swift, Haskell, Go, Javascript
FRAMEWORKS:
• MapReduce: Hadoop, Storm, Pachyderm
• Event & Streaming: Kinesis, Azure Stream Analytics, Camel, Streambase
• Deep/Machine Learning: Watson, Azure BI, Tensorflow
• In-memory parsing: Kognito, Apache Spark
• New data warehouse: Snowflake
• Containers: Docker, Rocket, MESOS, Kubernetes
• Cloud: AWS, GCE, OpenStack, VMWare
HARDWARE:
• Rise of many-core computing means 50-100 threads/node
• Xeon / Xeon Phi, GPU, OpenPower, ARM
• Multi-level memory architectures
• Hierarchical storage architectures
• FPGAs
20. HOW CAN WE HELP RESEARCHERS WITH SUCH DIVERSE NEEDS AND BACKGROUNDS?
21. BUILD A MASSIVE STORAGE CLOUD NEXT TO INNOVATIVE, POWERFUL, USABLE COMPUTERS AT THE END OF FAST INTERNET PIPES
22. MANY DOMAIN SCIENTISTS ARE NOT EXPERTS AT COMPUTING TECHNOLOGY. CREATE PURPOSE-BUILT, HIGHLY INTUITIVE INTERFACES
23. TACC CREATES A LOT OF SOFTWARE…
Over half of our staff build software to empower data-intensive science communities:
Earthquake/Flood/Wind Engineering, Genomics and Genetics, Immunology, Geology, Drug Discovery, Archaeology, Infectious Disease, and many more…
designsafe-ci.org, vdjserver.org, digitalrocksportal.org, chameleoncloud.org, jetstream-cloud.org, araport.org, cyverse.org (née iplantc.org), xsede.org
24. Point-and-click interfaces
• Data management, sharing, and analysis
• Publishing reproducible analysis workflows
• Discovery of new or updated tools and data
• Interactive visualization of results
Backed by world-class computing and data capacity
25. Hosted SaaS
• JupyterHub notebooks (a configuration sketch follows this list)
• RStudio
• Web-based VNC
Also backed by world-class computing and data capacity
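For a flavor of how a hosted-notebook service like this gets wired up, here is a minimal JupyterHub configuration sketch. The authenticator choice, port, and the /work path are illustrative assumptions, not TACC's actual deployment.

# jupyterhub_config.py -- a minimal, hypothetical hosted-notebook setup
c = get_config()  # injected by JupyterHub when it loads this file

# Authenticate against existing institutional accounts (PAM here;
# OAuth2 authenticators are available as separate packages).
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Serve the hub on all interfaces, behind the portal's reverse proxy.
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000

# Start each user's notebook server in shared project storage;
# '/work/{username}' is an assumed filesystem layout.
c.Spawner.notebook_dir = '/work/{username}'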
26. Easy-to-use Cloud Computing
• Atmosphere (CyVerse)
• Jetstream (IU, UA, TACC)
• Chameleon (UC, TACC)
Cloud consoles are aimed at sysadmins and are unintuitive. We’re changing that with improved UX and support.
• APIs are still available (see the sketch below)
• No cost to the end user
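Because the APIs remain available underneath the friendlier consoles, an expert can still script against the cloud directly. A hedged sketch using the openstacksdk library; the cloud profile, image, flavor, and network names are hypothetical placeholders.

import openstack

# Credentials come from a named profile in clouds.yaml;
# 'chameleon' is a hypothetical profile name.
conn = openstack.connect(cloud='chameleon')

# Look up a base image, flavor, and network (names are illustrative).
image = conn.compute.find_image('CC-Ubuntu14.04')
flavor = conn.compute.find_flavor('m1.medium')
network = conn.network.find_network('sharednet1')

# Boot an analysis node and block until it is ACTIVE.
server = conn.compute.create_server(
    name='analysis-node',
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{'uuid': network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)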
27. GIVE EXPERTS ACCESS TO EVERY SINGLE ONE OF YOUR BUILDING BLOCKS. WEB SERVICE APIs EVERYWHERE. AUGMENT WITH PROFESSIONAL TOOLING.
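As an illustration of the "web service APIs everywhere" idea, here is what scripting against such a science API might look like. The base URL, endpoints, and JSON shapes are hypothetical; a real deployment would sit behind the OAuth2 layer described on slide 30.

import requests

BASE = 'https://api.example.org/science/v2'           # hypothetical endpoint
HEADERS = {'Authorization': 'Bearer <ACCESS_TOKEN>'}  # from an OAuth2 grant

# List the files in a shared project workspace.
resp = requests.get(BASE + '/files/listings/project-data', headers=HEADERS)
resp.raise_for_status()
for item in resp.json()['result']:
    print(item['name'], item['length'])

# Submit an analysis job against an app published in the portal.
job = {'appId': 'feature-extract-1.0',
       'inputs': {'images': 'project-data/run42/'}}
resp = requests.post(BASE + '/jobs', headers=HEADERS, json=job)
print('submitted job', resp.json()['result']['id'])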
30. KNIT COMPLEX SCIENCE INFRASTRUCTURE TOGETHER WITH USEFUL ABSTRACTIONS
Reference Design: Extensible Research Infrastructure (diagram rows, top to bottom):
• Web site, Workspace, Data Portal, Learning, 3rd Party Dev Zone
• Science APIs, IAM, Data Movement, Smart Search
• iRODS, Shibboleth/OAuth2, OpenStack & Docker
Diagram axis labels: Ease of Use, Ease of Re-use, Extensibility, Capability
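The Data Movement row of this stack can be scripted too. A minimal sketch using the python-irodsclient package; the host, zone, credentials, and paths are placeholders, not a real deployment.

from irods.session import iRODSSession

# Connection details here are hypothetical placeholders.
with iRODSSession(host='data.example.org', port=1247, user='eliza',
                  password='...', zone='projectZone') as session:
    # Push a local instrument file into the shared collection.
    session.data_objects.put('image001.tif',
                             '/projectZone/home/shared/run42/image001.tif')
    # List what the collection now holds.
    coll = session.collections.get('/projectZone/home/shared/run42')
    for obj in coll.data_objects:
        print(obj.name, obj.size)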
31. MIKE: LEARNING COMPUTATIONAL SKILLS
• Mike generates mass-spec data from his samples, which are used to decide on future experiments
• Eliza and Paulo have published analytical scripts for his data as ‘apps’ in a project web portal
• Paulo has wired up the lab’s equipment to send data directly to remote storage (a sketch of this wiring follows the list)
• Mike can perform basic analysis and reporting on his data from the web interface
• Roshan can see Mike’s results in the project portal and discuss them with him
• Mike collaborates with Eliza to improve the results of their analytical scripts
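"Wiring up the lab's equipment" could be as simple as a watcher process on the instrument PC. A sketch using the watchdog package; both directories are hypothetical, and a production version might use the iRODS put shown earlier rather than a plain copy.

import shutil
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

REMOTE = '/mnt/remote-storage/incoming'    # hypothetical mounted store
WATCHED = '/instruments/mass-spec/output'  # hypothetical output folder

class UploadNewRuns(FileSystemEventHandler):
    def on_created(self, event):
        # Copy each newly written file to remote storage.
        if not event.is_directory:
            shutil.copy(event.src_path, REMOTE)

observer = Observer()
observer.schedule(UploadNewRuns(), WATCHED, recursive=True)
observer.start()
observer.join()  # run until killed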
32. ELIZA: AUGMENTING HER RESEARCH CAPABILITY
• Eliza collects and analyzes in situ hybridization imagery as a major part of her inquiry
• She uses code developed by Paulo and others to perform feature detection
• She helps Paulo develop code for Mike to use in his proteomics work
• She has developed scripted workflows that run locally on her laptop to complete her analyses
• She writes her own code in Python and R for aggregate analysis and shares it via Docker Hub and the project portal
• She presents her data viz via interactive R Shiny apps that she deploys to the project portal
33. PAULO: CONCENTRATING ON COMPUTATIONAL RESEARCH
• Paulo likes to solve hard computational problems but is also responsible for lab research infrastructure
• He has automated data movement from instruments to remote storage, including duplication to AWS Glacier. Roshan gets the bill
• He builds and maintains the project web portal. He didn’t have much experience with such technology when the project started
• Paulo has used Spark to develop new software for feature extraction from in situ hybridization images; he can automatically deploy it to the portal, powered by TACC Wrangler, as part of his build process (a hedged Spark sketch follows this list)
• He is working on a paper describing his software and is getting valuable feedback from other folks he has shared it with via the project portal and public source repositories
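Paulo's actual code isn't shown in the talk, but a distributed feature-extraction pass over images is easy to sketch in PySpark. The input/output paths and the toy features (mean intensity and bright-pixel fraction) are illustrative assumptions.

import io
from PIL import Image
from pyspark import SparkContext

sc = SparkContext(appName='ish-feature-extraction')

# Read whole image files in parallel as (path, bytes) pairs;
# the input path is a hypothetical shared store.
images = sc.binaryFiles('hdfs:///projects/ish-images/*.png')

def extract_features(record):
    path, raw = record
    img = Image.open(io.BytesIO(raw)).convert('L')  # grayscale
    pixels = list(img.getdata())
    mean = sum(pixels) / float(len(pixels))
    bright = sum(1 for p in pixels if p > 200) / float(len(pixels))
    return (path, (mean, bright))

features = images.map(extract_features)
features.saveAsTextFile('hdfs:///projects/ish-features')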
34. ROSHAN: FOCUSING ON THE BIG PICTURE
• Roshan has deep experience in gene expression analysis; Paulo has populated the project portal with many of the tools she needs to accomplish her goals
• She collaborates with Mike and Eliza on interpreting their experimental results
• Because Eliza has included a custom notification in her scripting workflow, Roshan knows when new image analyses have been completed and can schedule time to look at them (one possible notification hook is sketched after this list)
• She can work with Paulo to enable the project web portal to make use of her newly awarded XSEDE computing and storage allocation
• She routinely shares results with experimentalist colleagues and reviewers via a simple Dropbox-like interface
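Eliza's "custom notification" could be any small hook at the end of her workflow. One plausible sketch, using only the Python standard library; the addresses and hostnames are hypothetical.

import smtplib
from email.message import EmailMessage

def notify(results_url):
    # Email Roshan a link to the newly completed analyses.
    msg = EmailMessage()
    msg['Subject'] = 'New image analyses ready'
    msg['From'] = 'pipeline@example.org'
    msg['To'] = 'roshan@example.org'
    msg.set_content('New results are available: ' + results_url)
    with smtplib.SMTP('smtp.example.org') as server:
        server.send_message(msg)

notify('https://portal.example.org/projects/run42')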
35. FUTURE DIRECTIONS AND CHALLENGES
Technological challenges remain…
• Data volume
  • TACC has capacity for 160 PB today. The Moore's-law curve is holding for tape and disk (whew). But POSIX isn't sustainable long-term.
• Data velocity
  • We routinely move 5 TB/h between large research universities.
  • Network bandwidth is still scaling, but data spread is getting worse.
  • Ubiquitous streaming data is coming.
• Massive parallelism
  • All server-class chips will have 40+ cores in the next 18 months.
  • Performant software will require embracing parallelism at the instruction, thread, and task levels (see the sketch after this list).
• High transaction rate I/O
  • Flash & NVRAM allow architecting massive-speed systems for small I/O (graph traversal, databases, etc.).
  • TACC Wrangler can do 800 GB per second of I/O.
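For scale, 5 TB/h works out to roughly 1.4 GB/s sustained. To make "parallelism at instruction, thread, and task levels" concrete, here is a hedged Numba sketch: prange distributes rows across threads (thread level), while the compiled inner loop is eligible for SIMD auto-vectorization (instruction level). The workload is a toy example, not TACC code.

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def row_norms(x):
    # prange: rows are split across threads; the inner loop compiles
    # to machine code the LLVM backend can auto-vectorize (SIMD).
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):
        acc = 0.0
        for j in range(x.shape[1]):
            acc += x[i, j] * x[i, j]
        out[i] = acc ** 0.5
    return out

data = np.random.rand(100000, 64)
print(row_norms(data)[:3])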
36. FUTURE DIRECTIONS AND CHALLENGES
But also…
• Collaboration and Sharing
• Noisy Data QA/QC (Quality Assurance/Quality Control)
• Provenance and Reproducibility
• Managing Metadata
• Compliance and Data Security
TACC's Systems and Expertise Change How Discovery is Done, and Change the World
37. WE’RE HIRING
Life Science, Data Science, Machine Learning, Web and Mobile, Cloud, HPC, Sysadmin
jobs@tacc.utexas.edu
FOLLOW US
@TACC
fb.com/tacc.utexas
www.tacc.utexas.edu
Editor's Notes
Talk a little about what we mean by BIG INNOVATIVE SYSTEMS
Then, spend the rest of my time talking about how we are HELPING PEOPLE GO FAST
Cloud systems weave all this together
We also have testbed systems online or coming in the next 6 months: Altera FPGAs, POWER/NVIDIA, HP Apollo powered by solar HVDC
Everyone’s generating TERASCALE DATA
There aren’t thousands of locations capable of computing at this scale
Collaborative teams are geographically dispersed
iPlant can HELP
Break up the genome into little bits, determine the sequence of each, stitch them back together.
$1B and 5 years then; $500 and 24 hours now.
This is the Sen research group, and they’re a set of model users for explaining how Science as a Service can work.
Coming from a traditional molecular-science background, Dr. Sen now relies on large-scale sequencing, proteomics, and imaging data sets.
You’ll recognize these users from your own communities…
“Create detailed spatial-temporal molecular atlas (RNA, proteins, metabolites) of the developing lung”
Here are their high level requirements. Seems familiar, right?
They’re inevitably bogged down in these kinds of details…
all while their NEED for computing is outpacing their resources
Ah-ha, you say. They should just move to the cloud!
Open source, open access is the rule of the land. A Cambrian explosion of languages, frameworks, and hardware.
Orange: what the LSC group at TACC has its fingers on as of Jan 2016. It’s a lot to keep up with…
The iPlant Data Store is foundational to our work
Big (1+PB). Secure. REALLY EASY SHARING!
Next to heavy metal @ TACC and UA
How do we get people to use these resources?
Make it easier to use than suffering through screens that look like this!
The iPlant DE features tools that run on all TACC systems, as well as institutional clusters at UA, U. Hawaii, and the UK, and soon many other sites. Storage is distributed, too.
Cloud: infinitely, dynamically scalable versus “I have 100% control of my computing environment”
Give them access to all your BUILDING BLOCKS
APPLICATION PROGRAMMING INTERFACES
MAKE DEVELOPERS FIRST CLASS residents of your infrastructure
These provide a necessary abstraction to the nasty middleware
Consolidate, abstract, interpret
Semantically consistent
Universal ACLs
Portable
Below the fold: access to physical systems. HPC, HTC, DBMS, DIS, cloud, object and block storage, NoSQL, you name it
Make it easy to INTEGRATE across the resources you have at your disposal