Digital Science: Reproducibility and Visibility in Astronomy
Jose Enrique Ruiz
The science done in Astronomy is digital science, from observing proposals to final publication, including the data and software used: every element and action involved in producing scientific output could be recorded in electronic form. Even so, the final outcome of an experiment is often still difficult to reproduce: the procedure can be long, tedious, and not easily accessible or understandable, even to the author. At the same time, we have a rich infrastructure of files, observational data and publications. This could be used more efficiently if we achieved greater visibility of scientific production, which would avoid duplication of effort and reinvention.
Reproducibility is a cornerstone of the scientific method, and extraction of relevant information from the current and future data flood is key in Astronomy. The AMIGA group (Analysis of the interstellar Medium of Isolated GAlaxies, IAA-CSIC, http://amiga.iaa.es) faces these two challenges in the European project "Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science", which enables the preservation of scientific methodology in scalable semantic repositories to facilitate its discovery, access, inspection, exploitation and distribution. These repositories store experiments as "Research Objects" whose main constituents are digital scientific workflows. Workflows provide a comprehensive view and a clear scientific interpretation of the experiment, as well as automation of the method, going beyond the usual pipelines that normally stop at data processing.
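The automation of the method that such workflows provide can be pictured as executing a directed acyclic graph of processing steps while recording what ran and in what order. The following minimal sketch illustrates the idea only; the step names, functions, and data are hypothetical and are not part of Wf4Ever or any AMIGA pipeline.

```python
# Minimal sketch of a scientific workflow as a DAG of steps, executed in
# dependency order with a provenance log. All step names and data here
# are hypothetical illustrations of the workflow idea.
from graphlib import TopologicalSorter

def fetch(inputs):      # e.g. retrieve observations from an archive
    return {"raw": [3.2, 4.1, 5.0]}

def calibrate(inputs):  # e.g. apply a calibration offset
    return {"cal": [x - 0.1 for x in inputs["raw"]]}

def measure(inputs):    # e.g. compute a summary statistic
    return {"mean": sum(inputs["cal"]) / len(inputs["cal"])}

# step name -> (function, list of dependencies)
workflow = {
    "fetch": (fetch, []),
    "calibrate": (calibrate, ["fetch"]),
    "measure": (measure, ["calibrate"]),
}

def run(workflow):
    results, provenance = {}, []
    order = TopologicalSorter({k: set(d) for k, (_, d) in workflow.items()})
    for step in order.static_order():
        func, deps = workflow[step]
        inputs = {k: v for d in deps for k, v in results[d].items()}
        results[step] = func(inputs)
        provenance.append(step)   # record what ran, and in what order
    return results, provenance

results, provenance = run(workflow)
```

Rerunning `run(workflow)` reproduces both the result and the recorded order of steps, which is the property a preserved Research Object is meant to guarantee.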
The quantitative leap in volume and complexity of the next generation of archives will require analysis and data mining tasks to live closer to the data, in distributed computing and storage environments; but these tasks should also be modular enough to allow customization by scientists, and easily accessible enough to foster their dissemination in the community. Astronomy is a collaborative science, but, like many other disciplines, it has also become highly specialized. Sharing, preservation, discovery and much simpler access to resources for composing scientific workflows will let astronomers benefit greatly from each other's highly specialized know-how; they constitute a way to push Astronomy to share and publish not only results and data, but also processes and methodologies.
We will show how the use of scientific workflows can improve the reproducibility of experiments, enable more efficient exploitation of astronomical archives, and increase the visibility and reuse of the scientific methodology.
A Recommender Story: Improving Backend Data Quality While Reducing Costs
Databricks
Information overload is one of the biggest challenges academics face daily when looking for the right knowledge to advance science. With around 7,000 research articles being published every day, how do you find the right ones?
Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With many data sources and signals being available, data science and big data engineering provide the perfect opportunity to deliver more value to researchers.
Here we will focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information through features such as a crowd-sourced collection of academic documents (the Catalogue) and various personalized recommender systems. Mendeley Suggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalized recommenders are powered by the Mendeley Catalogue, which clusters 2 billion records into canonical records using state-of-the-art algorithms and big data solutions (e.g. Spark).
In the past few years, we noticed that as our content grew, the quality of the canonical records started to drift due to scalability issues. As a result, we faced clustering accuracy problems, which in turn also impacted the recommenders. In this talk we will highlight how we rearchitected the construction of the Mendeley Catalogue to improve its scalability and accuracy. In addition, we will show how the migration from Hadoop MapReduce to Spark has helped us reduce costs as well as improve maintainability.
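The core task of clustering duplicate bibliographic records into canonical records can be sketched as key-based blocking: normalize each record to a blocking key, group records sharing a key, and elect the most complete record in each group. This is a generic illustration of the deduplication idea, not Mendeley's actual algorithm; the field names and completeness heuristic are hypothetical.

```python
# Hedged sketch: grouping duplicate bibliographic records into canonical
# records via a normalized blocking key (title + year). Illustrates the
# general deduplication idea only, not Mendeley's production algorithm.
import re
from collections import defaultdict

def blocking_key(record):
    # Normalize the title: lowercase, strip punctuation, collapse spaces.
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    title = " ".join(title.split())
    return (title, record.get("year"))

def canonicalize(records):
    clusters = defaultdict(list)
    for rec in records:
        clusters[blocking_key(rec)].append(rec)
    # Elect the most complete record in each cluster as the canonical one.
    return [max(group, key=lambda r: sum(v is not None for v in r.values()))
            for group in clusters.values()]

records = [
    {"title": "Deep Learning for Astronomy", "year": 2019, "doi": None},
    {"title": "deep learning for astronomy.", "year": 2019, "doi": "10.1/x"},
    {"title": "Spark at Scale", "year": 2018, "doi": "10.1/y"},
]
canonical = canonicalize(records)  # first two records collapse into one
```

At Catalogue scale the same grouping step would run as a distributed key-by-and-reduce (e.g. in Spark) rather than an in-memory dictionary.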
IPython Notebooks have given us a substantial improvement in the documentation of scripts, as well as in their inspection and reuse. IPython Notebooks also provide access to several programming languages (Fortran, IDL, R, Shell, ...) within a single script, which, together with their web-based access, makes them an ideal tool for collaborative work (multi-language, multi-user, multi-platform, etc.). I will describe the kinds of things that can be done with IPython Notebooks: from collaborative development of multi-language code, through the reuse of tutorials and interactive visualization of results, to the distribution of more modular code and the final publication of a verifiable and reproducible digital experiment, the prelude to executable papers.
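Part of what makes notebooks so easy to share, version, and publish as digital experiments is that a notebook is just a JSON document with a small, stable schema. A minimal sketch of building a valid nbformat-4 notebook by hand, using only the standard library (the cell contents are illustrative):

```python
# A Jupyter/IPython notebook is a plain JSON document, which is what
# makes it easy to share, diff, and archive. Minimal hand-built
# nbformat-4 notebook using only the standard library.
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"language_info": {"name": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# A reproducible experiment\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["result = 2 + 2\n", "print(result)\n"]},
    ],
}

serialized = json.dumps(notebook, indent=1)  # ready to save as .ipynb
```

Saving `serialized` to a `.ipynb` file yields a notebook that Jupyter can open directly; in practice the `nbformat` library is the usual way to build and validate these documents.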
Astronomical Data Processing on the LSST Scale with Apache Spark
Databricks
The next decade promises to be exciting for both astronomy and computer science, with a number of large-scale astronomical surveys in preparation. One of the most important is the Large Synoptic Survey Telescope, or LSST. LSST will produce the first 'video' of the deep sky in history by continually scanning the visible sky, taking one 3.2-gigapixel image every 20 seconds. In this talk we will describe LSST's unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy Extensions for Spark), a system based on Apache Spark. We will explain its design and what is behind its great cross-matching performance.
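Fast catalog cross-matching of the kind AXS performs typically rests on a zoning idea: partition the sky into declination zones so each source is compared only against sources in its own and adjacent zones, rather than the whole catalog. The sketch below illustrates that idea with a simplified flat-sky distance; it is not AXS code, and the zone height and radius are illustrative choices.

```python
# Hedged sketch of zone-based catalog cross-matching, the idea behind
# fast cross-match joins such as AXS's: bucket sources into declination
# zones so each source is only compared against neighbors in adjacent
# zones. Simplified flat-sky distance; not the actual AXS implementation.
import math
from collections import defaultdict

ZONE_HEIGHT = 0.5   # zone height in degrees of declination

def zone(dec):
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT))

def cross_match(cat1, cat2, radius):
    zones = defaultdict(list)
    for obj in cat2:
        zones[zone(obj["dec"])].append(obj)
    matches = []
    for a in cat1:
        z = zone(a["dec"])
        # Only the same zone and its two neighbors need checking.
        for b in [o for dz in (-1, 0, 1) for o in zones[z + dz]]:
            dra = (a["ra"] - b["ra"]) * math.cos(math.radians(a["dec"]))
            ddec = a["dec"] - b["dec"]
            if math.hypot(dra, ddec) <= radius:
                matches.append((a["id"], b["id"]))
    return matches

cat1 = [{"id": "a1", "ra": 10.0000, "dec": 20.0000}]
cat2 = [{"id": "b1", "ra": 10.0001, "dec": 20.0001},
        {"id": "b2", "ra": 50.0000, "dec": -5.0000}]
pairs = cross_match(cat1, cat2, radius=1.0 / 3600)  # 1 arcsecond
```

In a distributed setting the zone number doubles as a partitioning key, so the cross-match becomes a co-partitioned join that never shuffles the full catalogs against each other.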
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Accelerating Discovery via Science Services
Ian Foster
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Ontology
Laurent Lefort
Presentation of the SSN XG results at eResearch Australia 2011 https://eresearchau.files.wordpress.com/2012/06/74-semantically-enabling-the-web-of-things-the-w3c-semantic-sensor-network-ontology.pdf
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Geo exploration simplified with Elastic Maps
Elasticsearch
The latest versions of Elasticsearch and Kibana deliver significant enhancements to geo capabilities. Learn how these new features take your location data further, from ingest to insight, with Elastic Maps.
See the video: https://www.elastic.co/elasticon/tour/2019/washington-dc/geo-exploration-simplified-with-elastic-maps
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Data Tribology: Overcoming Data Friction with Cloud Automation
Ian Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Big Data Modeling Challenges and Machine Learning with No Code
Liana Ye
Presented at SF Bay ACM, 202001015, by Karthik Chinnusamy.
What are the challenges of Big Data modeling in today's field? With a few best-practice recommendations and machine learning approaches, I will use KNIME to show the modeling advantages for Big Data under the following themes:
- Performance: Good data models help us quickly query the required data and reduce I/O throughput.
- Cost: Good data models significantly reduce unnecessary data redundancy, reuse computing results, and reduce the storage and computing costs of the big data system.
- Efficiency: Good data models greatly improve the user experience and increase the efficiency of data utilization.
- Quality: Good data models make data statistics more consistent and reduce the possibility of computing errors.
I will also describe tools for Sources, Ingestion, Exploration, Modeling and Machine Learning.
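The Performance and Cost themes above come down to one recurring mechanism: laying data out so queries touch only what they need. A toy in-memory sketch of partition pruning (the dataset and key are made up; real systems such as Hive, Spark, or BigQuery apply the same idea to files on disk):

```python
# Hedged illustration of the Performance/Cost themes: partitioning data
# by a commonly queried key (here, event date) lets a query scan only
# the relevant partition instead of the full table. Toy in-memory model.
from collections import defaultdict

events = [{"date": f"2020-01-{d:02d}", "value": d}
          for d in range(1, 31) for _ in range(10)]  # 300 rows, 30 days

# Unpartitioned layout: a query for one day must scan every row.
scanned_full = sum(1 for e in events)                # 300 rows touched

# Partitioned by date: the same query scans one partition only.
partitions = defaultdict(list)
for e in events:
    partitions[e["date"]].append(e)
scanned_pruned = len(partitions["2020-01-15"])       # 10 rows touched
```

A 30x reduction in rows scanned translates directly into lower I/O, and in pay-per-scan systems, into lower cost.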
Scaling collaborative data science with Globus and Jupyter
Ian Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Big data visualization frameworks and applications at Kitware
bigdataviz_bay
Marcus Hanwell, Technical Leader at Kitware, Inc.
March 27th 2014
Kitware develops permissively licensed open source frameworks and applications for scientific data analysis and related areas. Some of the frameworks developed by our High Performance Computing and Visualization group address current challenges in big data visualization and analysis in a number of application domains, including geospatial visualization, social media, finance, chemistry, biology (phylogenetics), and climate. The frameworks used to develop solutions in these areas will be described, along with the applications and the nature of the underlying data. These solutions focus on shared frameworks providing data storage, indexing, retrieval, client-server delivery models, and server-side serial and parallel data reduction, analysis, and diagnostics. Additionally, they provide mechanisms that enable server-side or client-side rendering based on the capabilities and configuration of the system.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and to the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Big Data Visualization
Kwan-Liu Ma
Professor of Computer Science and Chair of the Graduate Group in Computer Science (GGCS) at the University of California-Davis
January 22nd 2014
We are entering a data-rich era. Advanced computing, imaging, and sensing technologies enable scientists to study natural and physical phenomena at unprecedented precision, resulting in an explosive growth of data. The amount of information collected about Web and mobile device users is expected to be even greater. To make sense of and maximize the utilization of such vast amounts of data for knowledge discovery and decision making, we need a new set of tools beyond conventional data mining and statistical analysis. One such tool is visualization. I will present visualizations designed for gleaning insight from massive data and for guiding complex data analysis tasks. I will show case studies using data from cyber/homeland security, large-scale scientific simulations, medicine, and sociological studies.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax Academy
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
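A common way to make geotemporal IoT data queryable in a wide-column store such as Cassandra is to derive the partition key from a coarse spatial bucket plus a time window, so a query for "this area, this hour" lands on a single partition. The sketch below uses a simple lat/lon grid rather than a true geohash, and the cell size and window are illustrative choices, not a DataStax recommendation.

```python
# Hedged sketch of a geotemporal partition key of the kind used to model
# IoT data in wide-column stores such as Cassandra: bucket each reading
# by a coarse spatial grid cell plus an hourly time window. Simple
# lat/lon grid here, not a true geohash; sizes are illustrative.
from datetime import datetime, timezone

CELL_DEG = 0.1   # ~11 km grid cells at the equator

def partition_key(lat, lon, ts):
    cell = (int(lat // CELL_DEG), int(lon // CELL_DEG))
    hour = ts.strftime("%Y%m%d%H")   # hourly time bucket
    return f"{cell[0]}:{cell[1]}:{hour}"

reading = {"lat": 37.7749, "lon": -122.4194,
           "ts": datetime(2019, 6, 1, 14, 30, tzinfo=timezone.utc)}
key = partition_key(reading["lat"], reading["lon"], reading["ts"])
# Readings from the same ~0.1 degree cell and hour share this partition.
```

Choosing the bucket sizes is the real design decision: cells and windows that are too coarse create hot, oversized partitions, while too-fine buckets force queries to fan out across many partitions.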
Astronomical Data Processing on the LSST Scale with Apache SparkDatabricks
The next decade promises to be exciting for both astronomy and computer science with a number of large-scale astronomical surveys in preparation. One of the most important ones is Large Scale Survey Telescope, or LSST. LSST will produce the first ‘video’ of the deep sky in history by continually scanning the visible sky and taking one 3.2 giga-pixel image every 20 seconds. In this talk we will describe LSST’s unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy Extensions for Spark), a system based on Apache Spark. We will explain its design and what is behind its great cross-matching performance.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Accelerating Discovery via Science ServicesIan Foster
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...Laurent Lefort
Presentation of the SSN XG results at eResearch Australia 2011 https://eresearchau.files.wordpress.com/2012/06/74-semantically-enabling-the-web-of-things-the-w3c-semantic-sensor-network-ontology.pdf
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Geo exploration simplified with Elastic MapsElasticsearch
The latest versions of Elasticsearch and Kibana deliver significant enhancements to geo capabilities. Learn how these new features take your location data further, from ingest to insight, with Elastic Maps.
See the video: https://www.elastic.co/elasticon/tour/2019/washington-dc/geo-exploration-simplified-with-elastic-maps
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high- performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Big Data Modeling Challenges and Machine Learning with No CodeLiana Ye
Presented at SF BAY ACM_202001015_by_Karthik Chinnusamy
What are the Big Data model challenges in today's field? With a few best practice recommendations and Machine Learning approaches, I will use Knime to show the modeling advantages for Big Data with the following themes:
.Performance: Good data models can help us quickly query the required data and reduce I/O throughput.
.Cost: Good data models can significantly reduce unnecessary data redundancy, reuse computing results, and reduce the storage and computing costs for the big data system.
.Efficiency: Good data models can greatly improve user experience and increase the efficiency of data utilization.
.Quality: Good data models make data statistics more consistent and reduce the possibility of computing errors.
I will also describe tools for Sources, Ingestion, Exploration, Modeling and Machine Learning.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Big data visualization frameworks and applications at Kitware
Marcus Hanwell, Technical Leader at Kitware, Inc.
March 27th 2014
Kitware develops permissively licensed open source frameworks and applications for scientific data processing and related areas. Some of the frameworks developed by our High Performance Computing and Visualization group address current challenges in big data visualization and analysis in a number of application domains including geospatial visualization, social media, finance, chemistry, biology (phylogenetics), and climate. The frameworks used to develop solutions in these areas will be described, along with the applications and the nature of the underlying data. These solutions focus on shared frameworks providing data storage, indexing, retrieval, client-server delivery models, server-side serial and parallel data reduction, analysis, and diagnostics. Additionally, they provide mechanisms that enable server-side or client-side rendering based on the capabilities and configuration of the system.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Big Data Visualization
Kwan-Liu Ma
Professor of Computer Science and Chair of the Graduate Group in Computer Science (GGCS) at the University of California-Davis
January 22nd 2014
We are entering a data-rich era. Advanced computing, imaging, and sensing technologies enable scientists to study natural and physical phenomena at unprecedented precision, resulting in an explosive growth of data. The size of the collected information about the Web and mobile device users is expected to be even greater. To make sense and maximize utilization of such vast amounts of data for knowledge discovery and decision making, we need a new set of tools beyond conventional data mining and statistical analysis. One such tool is visualization. I will present visualizations designed for gleaning insight from massive data and guiding complex data analysis tasks. I will show case studies using data from cyber/homeland security, large-scale scientific simulations, medicine, and sociological studies.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax Academy
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
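As an illustration of the kind of partition-key design such systems rely on, the sketch below buckets readings by a coarse spatial cell and a day, so that a region-plus-time-window query touches only a few partitions. This is a plain-Python sketch of the idea, not DSE or Cassandra API code; the function name and cell size are invented for illustration.

```python
from datetime import datetime, timezone

def geotemporal_key(lat, lon, ts, cell_deg=1.0):
    """Bucket a reading into a (spatial cell, day) partition key.

    Readings close in space and time share a partition, so a
    region+time-window query touches only a few partitions.
    """
    cell = (int(lat // cell_deg), int(lon // cell_deg))
    return (cell, ts.strftime("%Y-%m-%d"))

reading = {"lat": 37.77, "lon": -122.42,
           "ts": datetime(2020, 5, 1, 12, 30, tzinfo=timezone.utc)}
key = geotemporal_key(reading["lat"], reading["lon"], reading["ts"])
```

In a real Cassandra table this tuple would become the partition key, with the exact timestamp as a clustering column.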
The geophysical validation of satellite borne atmospheric chemistry instruments requires a large number of independent observations by a variety of in-situ, remote-sensing and satellite instruments. In the case of the Aura and Envisat missions more than 300 instruments have been formally included in the validation efforts.
In order to extract maximum information from the "ground-truth" measurements, these independent validation datasets must be readily accessible to all investigators in a simple format with standard metadata, variable naming and data structures.
Beginning in 2000, NASA and ESA have been collaborating under the auspices of the CEOS Working Group Calibration/Validation (WGCV) Atmospheric Chemistry Sub Group (ACSG) on the file format definition, metadata formulations, and variable definitions for validation datasets.
This standard, known as the AVDC/Envisat HDF format, is the basis of the Aura Validation Data Center (AVDC) and the Envisat Cal/Val Data Center implementations. Recently, the AVDC/Envisat HDF formulation has been accepted as the new reporting standard of the Network for the Detection of Atmospheric Composition Change (NDACC), an international network of high quality remote sensing stations, and for the next ESA GMES Earth Observing missions. The concept behind the AVDC/Envisat HDF reporting standard, its implementation and future applications will be presented in detail.
WOTS2E: A Search Engine for a Semantic Web of Things
Andreas Kamilaris
A Semantic Web of Things (SWoT) brings together the Semantic Web and the Web of Things (WoT), associating semantically annotated information to web-enabled physical devices, services and their data, towards seamless data integration and better understanding of real-world information. A missing element in order to realize SWoT is a standardized, scalable and flexible way to globally discover, in (near) real time, web-connected embedded devices as well as their semantic data. To address this gap, we propose the WOT Semantic Search Engine (WOTS2E), a search engine for the SWoT based on web crawling, able to discover Linked Data endpoints and, through them, WoT-enabled devices and their services. In this presentation, we describe the design, development and implementation of WOTS2E, as well as an evaluation procedure showing its operation and performance across the web.
Presentation of the European project ECHOES, given on 28 June 2018 in Leiden (Netherlands), where CSUC showed the project's objectives and main features to Dutch technology companies.
These slides provide an overview of current functionality, techniques, and tips for visualization and query of HDF and netCDF data in ArcGIS, as well as future plans. Hierarchical Data Format (HDF) and netCDF (network Common Data Form) are two widely used data formats for storing and manipulating scientific data. The netCDF format also supports temporal data by using multidimensional arrays. The basic structure of data in this format and how to work with it will be covered in the context of standardized data structures and conventions. The slides also demonstrate the tools and techniques for ingesting HDF and netCDF data efficiently in ArcGIS, as well as some common workflows to employ the visualization capabilities of ArcGIS for effective animation and analysis of your data.
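As a minimal illustration of the multidimensional arrays mentioned above, the sketch below builds a hypothetical (time, lat, lon) variable with NumPy and shows the slicing patterns behind animation (per-time-step maps), point queries (per-cell time series) and temporal statistics; the array contents are synthetic.

```python
import numpy as np

# Hypothetical netCDF-style variable: temperature over 4 time steps
# on a 3x5 lat/lon grid (values are synthetic).
temp = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)

snapshot = temp[0]            # one time step: a (lat, lon) map to animate
series = temp[:, 1, 2]        # time series at a single grid cell
mean_map = temp.mean(axis=0)  # time-averaged field for analysis
```

Tools that animate netCDF time slices are, conceptually, stepping through `temp[t]` for each `t`.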
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Data Con LA
This session will explore how to apply GeoSpatial analytics using Apache Spark on high-velocity streaming (data-in-motion) and high-volume batch (data-at-rest). Demonstrations will be performed throughout the session to cement these concepts.
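A minimal sketch of the per-micro-batch spatial aggregation idea, in plain Python rather than Spark: each event is snapped to a coarse grid cell (an invented stand-in for a proper spatial index such as a geohash) and counted. The coordinates are synthetic.

```python
from collections import Counter

def cell(lat, lon, size=0.5):
    # Snap a point to a coarse grid cell: an invented stand-in for a
    # proper spatial index (geohash, H3, ...).
    return (round(lat / size) * size, round(lon / size) * size)

# One micro-batch of (lat, lon) events, as a streaming job would see it.
batch = [(34.05, -118.24), (34.10, -118.20), (40.71, -74.01)]
counts = Counter(cell(lat, lon) for lat, lon in batch)
```

In Spark the same shape appears as a `map` to cells followed by a `reduceByKey` over each micro-batch or partition of data-at-rest.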
This presentation provides an introduction and walk-through of the LoCloud Geocoding application, used during the LoCloud training workshops. The application is a tool to add geographic coordinates to existing data (such as records describing items of content in digital libraries). The presentation includes a step-by-step walk-through of the application.
20160922 Materials Data Facility TMS Webinar
Ben Blaiszik
Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
Arc Hydro is a set of data models and tools that operates within ArcGIS to support geospatial and temporal data analyses.
Use Arc Hydro to delineate and characterize watersheds in raster and vector formats, define and analyze hydro geometric networks, manage time series data, and configure and export data to numerical models.
Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the 5-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by DPC, in Glasgow on 15-19 July 2013.
The presentation gives an introduction to the SCAPE Platform, it presents scenarios from SCAPE Testbeds and it finally describes how to create scalable workflows and execute them on the SCAPE Platform.
Scalable Data Analytics and Visualization with Cloud Optimized Services
Globus
These slides were presented by Esri's Sudhir Shrestha at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Jupyter notebooks have arrived to stay as a means to document the scientific analysis protocol, as well as to provide executable recipes shared seamlessly among the community. This has triggered the rise of a plethora of complementary tools and services associated with them. This talk will cover different possibilities to use Jupyter notebooks and the JupyterLab interface. We will start with the description of their basic functionalities, as well as functionality extensions not widely known by the community. We will describe how to take advantage of their cross-language capabilities to enhance collaborative work, and also use them as complementary assets in the paper publication process to provide reproducibility of the results. Other aspects on how to deal with modularity and scalability of long complex notebooks will be covered, and we will see several platforms for rendering and execution other than the browser and the local desktop. We will finish with how they are actually being used together with Docker and Binder as part of the versioned executable documentation of a project like Gammapy.
Astronomy is a collaborative science, but it has also become highly specialized, like many other disciplines. Improved sharing, discovery and access to resources will enable astronomers to greatly benefit from each other's highly specialized know-how. Some initiatives led by scientists and publishers complement traditional paper publishing with assets published in more interactive digital formats. Among the main goals of these efforts are improving the reproducibility and clarity of the scientific outcome, going beyond the static PDF file, and fostering re-use, which turns into a more efficient exploitation of available digital resources.
The science performed in Astronomy is digital science, from observing proposals to final publication, including data and software used: each of the elements and actions involved in the scientific output could be recorded in electronic form.
Even so, the final outcome of an experiment is still difficult to reproduce. An exhaustive process of documentation can be long and tedious, access to all the resources must be granted, and even then the repeatability of results is not guaranteed. At the same time, we have access to a wealth of files, observational data and publications that could be used more efficiently with better visibility of the scientific production, avoiding duplication of effort and reinvention.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN
Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is growing quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment across a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Richard's aventures in two entangled wonderlands
Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing by which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes, from worms and insects to mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
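The base-pairing rule above can be sketched in a few lines: a small RNA silences an mRNA that contains the reverse complement of its own sequence. The sequences below are invented for illustration.

```python
# A ~21-nt small RNA silences an mRNA that contains the reverse
# complement of its sequence (sequences invented for illustration).
COMP = str.maketrans("AUGC", "UACG")

def revcomp(rna):
    return rna.translate(COMP)[::-1]

def has_target_site(small_rna, mrna):
    # Perfect complementarity, the siRNA case; miRNAs tolerate mismatches.
    return revcomp(small_rna) in mrna

mrna = "GGGAUUCAGGACUUAGGCAUCGAUCGGAUCCAAGU"
sirna = revcomp("CAGGACUUAGGCAUCGAUCGG")  # 21-nt siRNA against this mRNA
```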
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length: 23-25 nt
Trans-acting
Binds the target mRNA with mismatches
Translation inhibition
siRNA
Length: 21 nt
Cis-acting
Binds the target mRNA with a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAi:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
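The three steps above can be caricatured in code, with strings standing in for RNA strands; the function names and lengths are illustrative only, not real biochemistry.

```python
def dicer(long_dsrna, size=21):
    """Cut a long double-stranded RNA into ~21-nt duplexes."""
    sense, antisense = long_dsrna
    return [(sense[i:i + size], antisense[i:i + size])
            for i in range(0, len(sense) - size + 1, size)]

def risc_load(duplex):
    """RISC keeps one strand (the guide) and discards the other."""
    guide, passenger = duplex
    return guide

dsrna = ("A" * 63, "U" * 63)        # toy 63-nt duplex
duplexes = dicer(dsrna)             # three 21-nt pieces
guide = risc_load(duplexes[0])      # single-stranded guide held by RISC
```

The loaded guide then pairs with a complementary mRNA, which RISC cleaves.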
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex which triggers mRNA degradation.
Unwinding of the double-stranded siRNA by an ATP-independent helicase.
The active component of RISC is the Ago proteins (endonucleases), which cleave the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: central component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity)
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
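As a small illustration of that hierarchy, the sketch below models one well-known branch (the globin-like fold) as nested records and walks it; the nested-dict layout is invented for illustration and is not SCOP's actual data format.

```python
# One branch of the SCOP hierarchy (fold > superfamily > family >
# protein), using the classic globin example.
scop = {
    "fold": "Globin-like",
    "superfamilies": [{
        "name": "Globin-like",
        "families": [{
            "name": "Globins",
            "proteins": ["Myoglobin", "Hemoglobin, alpha-chain"],
        }],
    }],
}

def proteins_in_fold(node):
    # Flatten the hierarchy to list every protein under a fold.
    return [p for sf in node["superfamilies"]
            for fam in sf["families"] for p in fam["proteins"]]
```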
Implementing a VO archive for datacubes of galaxies
1. Implementing a VO archive for datacubes of galaxies
José Enrique Ruiz
Instituto de Astrofísica de Andalucía – CSIC
Sprint 2014 IVOA Interop
May 20th 2014 - ESAC
14. Database Filling
IPython Notebook Executable Recipe
• Web tool executable in the browser
• Imports a single Collection each time
• Single-target single-line velocity data cubes of galaxies
• Configuration of parameters for each Collection
• Type, SubType, File Format, Calibration Level
• Collection, Access Rights, Publisher, Contact, Creator, CreationType
• Facility, Instrument, Bandpass, DataSource, Proposal
• UCD/Unit Axis Characterization, Line Characterization
• Fills a specific MySQL database structure
• DataSets are FITS files placed in a specified folder
• Configurable FITS headers mapping
• Produces a SQL Script
• Addition of Derived Metadata for Targets in a second step
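The configurable FITS-headers-to-database mapping described above might look like the following sketch; OBJECT, TELESCOP and INSTRUME are standard FITS keywords, but the table name, column names and header values are invented, and the real tool's mapping format is surely different.

```python
# Invented header-to-column mapping (hypothetical column names).
HEADER_TO_COLUMN = {
    "OBJECT": "target_name",
    "TELESCOP": "facility",
    "INSTRUME": "instrument",
    "BUNIT": "flux_unit",
}

def insert_statement(table, fits_header):
    """Produce one line of the SQL script from a FITS header dict."""
    cols, vals = [], []
    for key, col in HEADER_TO_COLUMN.items():
        if key in fits_header:
            cols.append(col)
            vals.append("'%s'" % fits_header[key])
    return "INSERT INTO %s (%s) VALUES (%s);" % (
        table, ", ".join(cols), ", ".join(vals))

header = {"OBJECT": "CIG 96", "TELESCOP": "VLA", "INSTRUME": "WIDAR"}
sql = insert_statement("datasets", header)
```

Running this over every FITS file in the input folder yields the SQL script the notebook produces.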
34. Position Velocity at Fixed RA
3D Exploration /Access Data Use Cases
Position Velocity at Fixed Dec
Range in velocities
Range in flux
35. Position Velocity at Defined Position Angle
3D Exploration /Access Data Use Cases
Range in velocities
Range in flux
Position angle
36. Renzogram: 2D overlay of contour channels with different colors
3D Exploration /Access Data Use Cases
Range in velocities
Range in flux
Position angle
37. Aperture Velocity Profile
3D Exploration /Access Data Use Cases
Integrated Emission Radial Profile
Spatial aperture
Range in velocities
Range in flux
Range in velocities
Range in flux
Position angle
Elliptical rings
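Assuming a (velocity, dec, ra) axis order (an assumption; real cubes declare their axes in the FITS header), the access patterns listed in these use cases reduce to simple cube slices, sketched below with a synthetic NumPy cube.

```python
import numpy as np

# Synthetic datacube with assumed (velocity, dec, ra) axis order.
cube = np.arange(8 * 4 * 5, dtype=float).reshape(8, 4, 5)

pv_fixed_dec = cube[:, 2, :]    # position-velocity slice at fixed Dec
pv_fixed_ra = cube[:, :, 3]     # position-velocity slice at fixed RA
# Aperture velocity profile: sum flux in a spatial box per channel.
profile = cube[:, 1:3, 2:4].sum(axis=(1, 2))
```

Slices at an arbitrary position angle or within elliptical rings need interpolation or masking on top of the same indexing idea.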
38. Visualization as a service
IPython implementation of yt-project software
3D Exploration /Access Data Use Cases
39. Conclusions
• SIAv2 prototype/testbed evolving with in-progress standards
• Two different collections of single-target, single line-emission velocity cubes of galaxies
• Import from FITS into MySQL database with the IPython executable notebook/tool
• Web Interface for discovery and display of SIAv2/ImageDM UCDs and Utypes
• AccessData Use Cases implemented as programmatic access services
• Additions performed in the DM to provide better knowledge of the dataset:
• Emission-line characterization (velocity and flux)
• Post-processing provenance in radio interferometric cleaned datacubes
• Archive-specific physical properties of Targets