Just the sketch: advanced streaming analytics in Apache Metron (DataWorks Summit)
Doing advanced analytics in streaming architectures presents unique challenges around the tradeoff between context and performance. Typically, performance and scalability requirements mandate that each message in a stream be operated on without the context of the messages that came before it. In this talk, we discuss using sketching algorithms to engineer a compromise that lets us consider historical state without compromising scalability.
Analyzing the capabilities of many similar SIEMs and cybersecurity platforms, we found that a good portion of the advanced analytics boil down to simple rules enriched with the ability to do statistical baselining, set existence, and set cardinality computations. These operations are difficult to do in-stream, so they are often done after the fact. We look at ways to open up these analytics to stream computation without sacrificing scalability.
Specifically, we will introduce the infrastructure built for Apache Metron to perform these kinds of tasks. We will cover the novel integration between Apache Storm and Apache HBase, orchestrated by a custom domain-specific language called Stellar, which takes the sting out of constructing sketches and using them for simple and more advanced analytics, such as statistical outlier analysis, in-stream. CASEY STELLA, Principal Software Engineer, Hortonworks
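To make the "statistical baselining in-stream" idea concrete, here is a minimal sketch in Python (not Metron's Stellar DSL or Storm/HBase integration) of one of the techniques the abstract refers to: a Welford-style running baseline that flags outliers per message without retaining the full message history. The field name and threshold are hypothetical.

```python
import math

class StreamingBaseline:
    """Welford's online algorithm: running mean/variance without storing messages."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, z=3.0):
        """Flag a value more than z standard deviations from the running mean."""
        s = self.stddev()
        return s > 0 and abs(x - self.mean) / s > z


# Hypothetical usage: score each message's byte count as it streams past.
baseline = StreamingBaseline()
for msg in [{"bytes": 512}, {"bytes": 480}, {"bytes": 530}, {"bytes": 9000}]:
    if baseline.n > 2 and baseline.is_outlier(msg["bytes"]):
        print("outlier:", msg)
    baseline.update(msg["bytes"])
```

Set existence and set cardinality would similarly be handled by fixed-size sketches (Bloom filters, HyperLogLog) rather than by storing the raw stream.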
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data... (Databricks)
Chesapeake Regional Information System for our Patients (CRISP) is a nonprofit healthcare information exchange (HIE) whose customers include states like Maryland and healthcare providers such as Johns Hopkins. CRISP’s work supports the local healthcare community by securely sharing the kind of data that facilitates care and improves health outcomes.
When the pandemic started, the Maryland Department of Health reached out to CRISP with a request: Get us the demographic data we need to track COVID-19 and proactively support our communities. As a result, CRISP employees spent long hours attempting to handle multiple data sources with complex data enrichment processes. To automate these requests, CRISP partnered with Slalom to build a data platform powered by Databricks and Delta Lake.
Using the power of the Databricks Lakehouse platform and the flexibility of Delta Lake, Slalom helped CRISP provide the Maryland Department of Health with near real-time reporting of key COVID-19 measures. With this information, Maryland has been able to track the path of the pandemic, target the locations of new testing sites, and ultimately improve access for vulnerable communities.
The work did not stop there. Once CRISP’s customers saw the value of the platform, more requests started coming in. Now, nearly one year since the platform was created, CRISP has processed billions of records from hundreds of data sources in an effort to combat the pandemic. Notable outcomes from the work include hourly contact tracing with data already cross-referenced for individual risk factors, automated reporting on COVID-19 hospitalizations, real-time ICU capacity reporting for EMTs, tracking of COVID-19 patterns in student populations, tracking of the vaccination campaign, connecting Maryland MCOs to vulnerable people who need to be prioritized for the vaccine, and analysis of the impact of COVID-19 on pregnancies.
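As a rough illustration of the near real-time reporting pattern described above (not CRISP's actual pipeline), a Spark Structured Streaming job can read a Delta table of incoming records and continuously maintain an aggregated Delta table for dashboards. All paths and column names below are hypothetical, and a Spark cluster with delta-spark configured is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid-reporting").getOrCreate()

# Stream newly ingested records from a (hypothetical) bronze Delta table.
tests = spark.readStream.format("delta").load("/lake/bronze/test_results")

# Aggregate results by county and day for near real-time reporting.
daily = (tests
         .groupBy("county", F.to_date("collected_at").alias("day"))
         .agg(F.count("*").alias("tests"),
              F.sum(F.col("positive").cast("int")).alias("positives")))

# Continuously rewrite the reporting table as new data arrives.
(daily.writeStream
      .format("delta")
      .outputMode("complete")
      .option("checkpointLocation", "/lake/_checkpoints/daily_counts")
      .start("/lake/gold/daily_counts"))
```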
20131011 - Los Gatos - Netflix - Big Data Design Patterns (Allen Day, PhD)
This document discusses design patterns for big data applications. It begins by defining what a design pattern is and providing examples from architecture and software design. It then analyzes characteristics of big data applications to determine appropriate patterns, including volume, velocity, variety, and more. Common patterns are presented like percolation, recommendation, and encapsulated processes. Examples include personalized search, medicine, and market segmentation. The document concludes that applying the right patterns can improve productivity, performance, and maintainability of big data systems.
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... (Ian Foster)
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
Accelerating Data-driven Discovery in Energy Science (Ian Foster)
A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a... (DataWorks Summit)
Hadoop Distributed File System (HDFS) based architectures allow faster ingestion and processing of larger quantities of time series data than presently possible in current seismic, hydroacoustic, and infrasonic (SHI) analysis platforms. We have developed a data acquisition and signal analysis system using Hadoop, Accumulo, and NiFi. The data model allows individual waveform samples and their associated metadata to be stored in Accumulo. This is a significant departure from traditional storage practices, where continuous waveform segments are stored with their associated metadata as a single entity. Our design allows for rapid table scans of large data archives within Accumulo for locating, retrieving, and analyzing specific waveform segments directly. The scalability of Hadoop permits the system to accommodate the ingestion and analysis of new data as a sensor network grows. Our system is currently acquiring data from over 200 SHI sensors. Peak ingest rates are approaching 500k entries per second, while preserving constant sub-second access times to any range of entries. The average load produced by the data ingest process is consuming less than 10 percent of available system resources. CHARLES HOUCHIN, Computer Scientist, Air Force Technical Applications Center (AFTAC) and JOHN HIGHCOCK, Systems Architect, Hortonworks
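The exact schema is not given in the abstract, but the one-entry-per-sample idea can be sketched as a key-design exercise. The hypothetical Python helper below builds a lexicographically sortable row key from station, channel, and timestamp so that a range scan over a sensor and time window returns individual samples directly; it illustrates the data model only and is not the AFTAC implementation.

```python
def row_key(station: str, channel: str, epoch_micros: int) -> bytes:
    """Build a sortable Accumulo-style row key: station:channel:zero-padded time."""
    # Zero-padding keeps lexicographic order identical to numeric time order.
    return f"{station}:{channel}:{epoch_micros:020d}".encode()

def scan_range(station: str, channel: str, start_us: int, end_us: int):
    """Start/end keys for a range scan over one sensor's samples in a time window."""
    return row_key(station, channel, start_us), row_key(station, channel, end_us)

# Hypothetical usage: one entry per waveform sample, value = the sample itself.
key = row_key("MKAR", "BHZ", 1_500_000_000_000_000)
start, stop = scan_range("MKAR", "BHZ", 1_500_000_000_000_000, 1_500_000_060_000_000)
print(key, start < stop)
```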
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics - those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps that has borrowed some of the techniques of DevOps.
Streamlined data sharing and analysis to accelerate cancer research (Ian Foster)
Advances in genomics and data analytics create new opportunities for cancer research and personalized medical treatment via large-scale federation of genomic, clinical, imaging and other data from many thousands of patients across institutions around the world. Despite these opportunities and promising early results, cancer research is often stymied by information technology barriers. One major barrier is a lack of tools for the reliable, secure, rapid, and easy transfer, sharing, and management of large collections of human data. In the absence of such tools, security and performance concerns often prevent sharing altogether or force researchers to resort to slow and error prone shipping of physical media. If data are received, timely analysis is further impeded by the difficulties inherent in verifying data integrity and managing who can access data and for what purpose. I will discuss how the mature Globus data management platform addresses these obstacles to discovery and explain how its intuitive, web-based interfaces enable use by researchers without specialized IT knowledge. I also describe how Globus technologies can be extended to meet the security requirements of human data so as to enable use in data-intensive cancer research.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing p... (GigaScience, BGI Hong Kong)
Jesse Xiao at the Data Publishing session at CODATA2017: Updates to the GigaDB open access data publishing platform. Wednesday 11th October in St Petersburg, Russia
What is Data Commons and How Can Your Organization Build One? (Robert Grossman)
1. Data commons co-locate large biomedical datasets with cloud computing infrastructure and analysis tools to create shared resources for the research community.
2. The NCI Genomic Data Commons is an example of a data commons that makes over 2.5 petabytes of cancer genomics data available through web portals, APIs, and harmonized analysis pipelines.
3. The Gen3 platform is an open source software stack for building data commons that can interoperate through common APIs and data models to support reproducible, collaborative research across projects.
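To make the "available through APIs" point concrete: the NCI Genomic Data Commons exposes a public REST API. The short sketch below lists a few projects; the endpoint and response fields are given as I recall them from the GDC documentation and should be verified against the current docs. It assumes the requests package and network access.

```python
import requests

# Public GDC API endpoint (open-access metadata needs no authentication).
resp = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={"size": 5, "fields": "project_id,name,primary_site"},
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("name"))
```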
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980's and 90's was spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we motivate that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
Open source stack of big data techs - openSUSE Asia (Muhammad Rifqi)
This document summarizes the key technologies in the open source stack for big data. It discusses Hadoop, the leading open source framework for distributed storage and processing of large data sets. Components of Hadoop include HDFS for distributed file storage and MapReduce for distributed computations. Other related technologies are also summarized like Hive for data warehousing, Pig for data flows, Sqoop for data transfer between Hadoop and databases, and approaches like Lambda architecture for batch and real-time processing. The document provides a high-level overview of implementing big data solutions using open source Hadoop technologies.
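To ground the HDFS/MapReduce portion of that overview, here is a minimal Hadoop Streaming-style word count: the mapper and reducer are plain Python functions that read stdin and write tab-separated key/value lines, which is all the Streaming contract requires. The file name and the exact hadoop-streaming invocation vary by distribution.

```python
#!/usr/bin/env python3
"""Word count usable with Hadoop Streaming: run with 'mapper' or 'reducer' as argv[1]."""
import sys

def mapper():
    # Emit one (word, 1) pair per token; Hadoop sorts these by key between phases.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped by key, so a running total per key is enough.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "mapper" else reducer()
```

A typical (distribution-dependent) invocation passes this script as both the -mapper and -reducer arguments to the hadoop-streaming jar.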
The document discusses big data testing and provides examples of big data projects. It defines big data as large volumes of data that are analyzed to make better decisions. Big data has three characteristics - volume, velocity, and variety. Traditional testing approaches are not suitable for big data, which requires new testing strategies and tools to handle the scale and complexity. Automating testing and understanding the data and processes are important for big data testing. The document outlines challenges and provides examples of batch and real-time systems as well as automation tools like Talend Open Studio.
LendingClub RealTime BigData Platform with Oracle GoldenGate (Rajit Saha)
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
1. The document discusses the limitations of Hadoop for advanced analytics tasks beyond basic statistics like mean and variance.
2. It introduces several distributed data analytics platforms like Spark, Storm, and GraphLab that can perform tasks like linear algebra, graph processing, and iterative machine learning algorithms more efficiently than Hadoop.
3. Specific use cases from companies that moved from Hadoop to these platforms are discussed, where they saw significantly faster performance for tasks like logistic regression, collaborative filtering, and k-means clustering.
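As a minimal illustration of the kind of iterative workload those platforms handle better than plain MapReduce, here is a k-means clustering sketch using PySpark's MLlib; the toy data is made up and a local Spark installation is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("kmeans-demo").getOrCreate()

# Tiny made-up dataset: two obvious clusters in 2-D feature space.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.2, 0.1]),),
     (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),)],
    ["features"],
)

# Spark keeps the working set in memory across the algorithm's iterations,
# which is exactly where MapReduce's per-iteration disk I/O hurts.
model = KMeans(k=2, seed=42).fit(df)
print([c.tolist() for c in model.clusterCenters()])

spark.stop()
```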
Rob peglar introduction_analytics _big data_hadoop (Ghassan Al-Yafie)
This document provides an introduction to analytics and big data using Hadoop. It discusses the growth of digital data and challenges of big data. Hadoop is presented as a solution for storing and processing large, unstructured datasets across commodity servers. The key components of Hadoop - HDFS for distributed storage and MapReduce for distributed processing - are described at a high level. Examples of industries using big data analytics are also listed.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All the copyrights are reserved with the author
The document discusses how transformational science can be empowered through open data access, optimized data formats, and open-source tools. It argues that traditional methods of accessing large datasets can be inefficient, with 80% of time spent on data preparation and only 10% on analysis. New approaches using analytics-optimized data stores (AODS) like Zarr, together with tools like Xarray and Dask, allow accessing large datasets with a single line of code and performing analyses within minutes by leveraging lazy loading and parallel computing. This represents a paradigm shift from traditional project timelines that can reduce barriers to science and increase reproducibility, empowering more researchers to efficiently analyze data and focus on scientific questions.
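A minimal sketch of that "one line to open, minutes to analyze" workflow, assuming a hypothetical cloud-hosted Zarr store and the xarray, dask, zarr, and (for S3 paths) s3fs packages:

```python
import xarray as xr

# Opening is lazy: only metadata is read; the chunked arrays stay in object storage.
ds = xr.open_zarr("s3://example-bucket/reanalysis.zarr", consolidated=True)

# Build a computation graph (still no data movement)...
monthly_mean = ds["air_temperature"].resample(time="1MS").mean()

# ...then pull only the chunks needed, in parallel via dask.
result = monthly_mean.compute()
print(result)
```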
This deck covers some of the open problems in the big data analytics space, starting with a discussion of state-of-the-art analytics using Spark/Hadoop YARN. It examines whether each of these is an appropriate technology and explores alternatives wherever possible. It ends with an important problem discussion - how to build a single system to handle big data pipelines without explicit data transfers.
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh... (Robert Grossman)
Data commons are emerging as a solution to challenges in analyzing and sharing large biomedical datasets. A data commons co-locates data with cloud computing infrastructure and software tools to create an interoperable resource for the research community. Examples include the NCI Genomic Data Commons and the Open Commons Consortium. The open source Gen3 platform supports building disease- or project-specific data commons to facilitate open data sharing while protecting patient privacy. Developing interoperable data commons can accelerate research through increased access to data.
MD Anderson Cancer Center implemented Hadoop to help manage and analyze big data as part of its big data program. The implementation included building Hadoop clusters to store and process structured and unstructured data from various sources. Lessons learned included that implementing Hadoop is complex and a journey, and to leverage existing strengths, collaborate openly, learn from experts, start with one cluster for multiple use cases, and follow best practices. Next steps include expanding the Hadoop platform, ingesting more data types, identifying high-value use cases, and developing and training people with new big data skills.
Sharing massive data analysis: from provenance to linked experiment reports (Gaignard Alban)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
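As a small, hypothetical example of point 3 (publishing provenance as linked open data), the snippet below uses rdflib to state, in PROV-O terms, that a result file was generated by an analysis activity; the URIs are placeholders.

```python
from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

result = EX["segmentation_result_42"]
activity = EX["brain_segmentation_run_42"]

# The result is an Entity generated by an Activity (the core PROV-O pattern).
g.add((result, RDF.type, PROV.Entity))
g.add((activity, RDF.type, PROV.Activity))
g.add((result, PROV.wasGeneratedBy, activity))

print(g.serialize(format="turtle"))
```

Publishing such triples alongside the results is what makes them queryable and reusable across studies.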
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ... (Allen Day, PhD)
This document discusses Google's capabilities for handling large genomic and biomedical data sets. It describes how Google uses technologies like Google Cloud, BigQuery, Dataflow and TensorFlow to process, store and analyze massive volumes of genomic and medical data. Google's systems can handle hundreds of terabytes to petabytes of data and enable fast querying and machine learning on these data sets. The document also provides examples of how Google is applying these capabilities to challenges in genomics, healthcare and precision medicine.
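To illustrate the "fast querying" claim, here is a hedged google-cloud-bigquery sketch; the table name is a hypothetical placeholder (substitute a dataset you have access to), and a GCP project with application-default credentials is assumed.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Hypothetical variants table; replace with a real dataset you can query.
sql = """
    SELECT reference_name, COUNT(*) AS n_variants
    FROM `my-project.genomics.variants`
    GROUP BY reference_name
    ORDER BY n_variants DESC
    LIMIT 5
"""

for row in client.query(sql).result():
    print(row.reference_name, row.n_variants)
```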
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
The document introduces Tag.bio as a low-code analytics application platform built from interconnected data products in a data mesh architecture. It consists of data, algorithms, and analysis apps contributed by different groups - data engineers, data scientists, and domain experts. The platform can integrate various data sources and enable collaboration between groups. It then provides demos of the Tag.bio developer studio and data portal. Key capabilities discussed include integration with AWS services like AI/ML and HealthLake, as well as security features like confidential computing. Example use cases presented are for clinical trials, healthcare, life sciences, and universities.
Google Cloud Platform: Prototype -> Production -> Planet scale (Idan Tohami)
As one of Big Data’s founding fathers, Google explores the technological changes we have faced over the past 10 years and presents its solutions to the new data challenges within the Google Cloud ecosystem.
20170402 Crop Innovation and Business - Amsterdam (Allen Day, PhD)
This document discusses applying machine learning and artificial intelligence techniques like deep neural networks to problems in genomics and agriculture. It provides examples of using Google Cloud platforms and services for storing and analyzing large genomic datasets, as well as developing models for tasks like variant calling from sequencing data and marker-assisted breeding. The document advocates that Google is well-positioned to handle massive volumes of genomic and agricultural data and help advance the application of AI in these domains.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ... (Bonnie Hurwitz)
The document discusses extending the iPlant cyberinfrastructure to support microbes in addition to plants. It provides an overview of iPlant, including its funding from NSF, collaborations, resources like data storage and computing platforms, and applications for analysis. Future plans are outlined to build tools and streamline workflows for metagenomics and enable high-throughput computing for microbial data.
The crusade for big data in the AAL domain (AALForum)
This document summarizes a keynote presentation about big data integration in the context of drug discovery. It discusses challenges with integrating diverse data sources, including issues with data volume, variety, veracity, and velocity. It presents the Open PHACTS platform as a case study, which integrates multiple biomedical databases into a single access point using semantic web technologies. Open PHACTS has developed apps and APIs to enable complex queries across integrated data related to diseases, tissues, targets, compounds and pathways. The talk highlights ongoing work to address issues like data licensing, identity resolution, quantitative data standards, quality assurance, and data provenance tracking in big data integration efforts.
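For a flavor of the "complex queries across integrated data" that such a platform enables, the sketch below issues a SPARQL query with the SPARQLWrapper package. The endpoint URL and vocabulary are placeholders, not the actual Open PHACTS API (which is exposed as a REST service); the point is a single query spanning data that originated in separate sources.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint and vocabulary for illustration only.
sparql = SPARQLWrapper("https://example.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?compound ?target WHERE {
        ?compound ex:inhibits ?target .
        ?target   ex:associatedWithDisease ex:Alzheimers .
    } LIMIT 10
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["compound"]["value"], "->", binding["target"]["value"])
```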
The pulse of cloud computing with bioinformatics as an example (Enis Afgan)
The document discusses how cloud computing can enable large-scale genomic analysis by providing on-demand access to computational resources and petabytes of reference data. It describes how tools like Galaxy and CloudMan allow researchers to perform genomic analysis in the cloud through a web browser by automating the provisioning and configuration of cloud resources. This approach makes genomic research more accessible and enables the elastic scaling of analysis as needed.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio (Alluxio, Inc.)
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
How novel compute technology transforms life science research (Denis C. Bauer)
Unprecedented data volumes and pressure on turnaround time driven by commercial applications require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3,000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning applications on genomic data is hence capable of scaling up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
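A minimal sketch of that serverless fan-out pattern (not the actual GT-Scan2 code): each candidate target site is scored in its own Lambda invocation, so thousands of sites can be evaluated in parallel with roughly constant wall-clock time. The event fields and the scoring rule below are hypothetical placeholders.

```python
import json

def handler(event, context):
    """AWS Lambda entry point: score one candidate CRISPR target site.

    'event' is assumed to carry the 20-nt protospacer and its genomic position;
    the scoring rule below is a placeholder, not the GT-Scan2 model.
    """
    site = event["protospacer"]

    # Placeholder on-target heuristic: penalize extreme GC content.
    gc = sum(base in "GC" for base in site) / len(site)
    score = 1.0 - abs(gc - 0.5) * 2

    return {
        "statusCode": 200,
        "body": json.dumps({"position": event.get("position"), "score": round(score, 3)}),
    }

if __name__ == "__main__":
    # Local smoke test; in production an upstream service invokes one Lambda per site.
    print(handler({"protospacer": "GACGTTACGGATCCAGTCAA", "position": 1201}, None))
```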
This document provides an overview of next generation sequencing (NGS) analysis. It discusses various NGS platforms such as Illumina, Roche 454, PacBio, and Ion Torrent. It also covers common file formats for sequencing data like FASTQ, quality control measures to assess data quality, and applications of NGS such as RNA-seq and ChIP-seq. The document aims to introduce researchers to basic concepts in NGS analysis and highlights available resources for storing and analyzing large sequencing datasets.
Life Technologies' Journey to the Cloud (ENT208) | AWS re:Invent 2013 (Amazon Web Services)
Life Technologies initially planned to build out its own data center infrastructure, but when a cost analysis revealed that by using Amazon Web Services the company would save $325,000 in hardware alone for a single new initiative, the company decided to use AWS instead. Within 6 months of adopting AWS, Life Technologies launched their Digital Hub platform in production, which now undergirds Life Technologies' entire instrumentation product suite. This immediately began to decrease their time-to-market and enhance their customers' user experience. In this session, we provide an overview of our path to the AWS cloud, with particular focus on the evaluation criteria used to make a cloud vendor decision. We also discuss the lessons learned since going into production.
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery (Chris Schalk)
This document introduces several new Google cloud technologies: Google Storage for storing data in Google's cloud, the Prediction API for machine learning and predictive analytics, and BigQuery for interactive analysis of large datasets. It provides overviews and examples of using each service, highlighting their capabilities for scalable data storage, predictive modeling, and fast querying of massive amounts of data.
This is a talk about Big Data, focusing on its impact on all of us. It also encourages institutions to take a close look at providing courses in this area.
This document discusses using the T-BioInfo platform to provide practical education in bioinformatics. It describes how the platform can integrate different types of omics data and analysis into intuitive, visual pipelines. This allows non-experts to analyze and interpret complex datasets. Example projects are provided, such as using RNA-seq data to identify genes involved in a disease. The goal is to teach bioinformatics through collaborative, project-based learning without requiring programming skills. Learners would reconstruct simulated biological processes and contribute to ongoing analysis of real scientific datasets.
Docker in Open Science Data Analysis Challenges by Bruce Hoff (Docker, Inc.)
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
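A skeleton of what such a submitted container's entry point might look like (hypothetical paths and file layout; the real interfaces are defined by the challenge organizers): the same image exposes a train step and a predict step, and the organizers run both against data the participant never sees.

```python
#!/usr/bin/env python3
"""Hypothetical entry point for a Dockerized challenge submission."""
import argparse
import json
import pathlib
import statistics

MODEL_PATH = pathlib.Path("/model/model.json")   # persisted between stages
TRAIN_PATH = pathlib.Path("/data/train.csv")     # mounted by the organizers
TEST_PATH = pathlib.Path("/data/test.csv")
PRED_PATH = pathlib.Path("/output/predictions.csv")

def train():
    # Toy "model": predict the mean of the training labels (last CSV column).
    labels = [float(line.rsplit(",", 1)[1])
              for line in TRAIN_PATH.read_text().splitlines()[1:]]
    MODEL_PATH.write_text(json.dumps({"mean": statistics.mean(labels)}))

def predict():
    model = json.loads(MODEL_PATH.read_text())
    rows = TEST_PATH.read_text().splitlines()[1:]
    PRED_PATH.write_text("\n".join(str(model["mean"]) for _ in rows))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("stage", choices=["train", "predict"])
    args = parser.parse_args()
    train() if args.stage == "train" else predict()
```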
This presentation was given at the GlobusWorld 2020 Virtual Conference, by Ian Foster, Rachana Ananthakrishnan, and Vas Vasiliadis from the University of Chicago.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
Cybowall is committed to protecting organizations of all sizes, including securing the IP reputations of some of the largest Service Provider networks in the world.
AML Transaction Monitoring Tuning Webinar (Idan Tohami)
Poorly defined thresholds have a number of key impacts on a bank’s operations and compliance departments. Oftentimes, analysts spend considerable time investigating useless alerts, which increases operational costs significantly and causes delays in regulatory filings. Also, the absence of risk-focused thresholds may cause potential money laundering patterns to go undetected, which poses higher monitoring risk to the bank.
Learn how financial institutions can leverage advanced analytics techniques to improve the productivity of the rules by setting up appropriate thresholds. Our speaker will also discuss how to leverage automation techniques for alert investigation in order to reduce the effort spent on false positives, thereby giving more time for the investigations to focus on true suspicious activities.
Topics covered:
- Regulatory Implications
- Managing AML Risks and Emerging Typologies
- Developing Targeted Detection Scenarios
- Customer Segmentation/Population Groups
- Understanding Normal and Outliers
- Operational Improvement through automation
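As a toy illustration of the threshold-tuning idea discussed above (not a description of any vendor's detection engine), the snippet below derives a per-segment alert threshold from the empirical distribution of historical transaction amounts instead of a single global cut-off; the segments, amounts, and z choice are hypothetical.

```python
import statistics

# Hypothetical historical transaction amounts, grouped by customer segment.
history = {
    "retail":    [120, 80, 95, 60, 110, 75, 130, 90, 85, 4000],
    "corporate": [15000, 22000, 18000, 30000, 25000, 21000, 27000],
}

def threshold(amounts, z=3.0):
    """Segment-specific threshold: mean + z standard deviations of past activity."""
    mu = statistics.mean(amounts)
    sigma = statistics.pstdev(amounts)
    return mu + z * sigma

thresholds = {seg: threshold(vals) for seg, vals in history.items()}

# A 20,000 transfer is routine for a corporate client but alert-worthy for retail.
for seg, amount in [("retail", 20000), ("corporate", 20000)]:
    flag = "ALERT" if amount > thresholds[seg] else "ok"
    print(f"{seg}: amount={amount} threshold={thresholds[seg]:.0f} -> {flag}")
```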
Robotic Process Automation (RPA) Webinar - By Matrix-IFS (Idan Tohami)
(1) RPA can automate repetitive tasks in financial crime compliance like AML/KYC to reduce manual work and costs. It allows focusing investigator time on more complex cases.
(2) The document discusses how RPA can enhance operations throughput by automating tasks like external data retrieval and form filling. A case study shows an organization improved alerts processed per day from 200 to 1200 using RPA.
(3) The presentation recommends organizations first assess their operations to identify automation opportunities, then start with a pilot RPA project and scale up based on proven value and ROI. RPA benefits include faster processes, accuracy, and scalability with business needs.
Open Banking / PSD2 & GDPR Regulations and How They Are Changing Fraud & Fina... (Idan Tohami)
The purpose of this webinar is to help Financial Institutions understand the implications of financial crime and fraud prevention, and get ready to review and upgrade their systems accordingly where required.
Topics covered:
-Overview of GDPR and PSD2 regulations with respect to Financial Crime
-Implications of each of the regulations on Fraud and Financial Crime (FFC)
-The challenges and opportunities offered by those regulations
-Which steps should Financial Institutions take to mitigate the cost of FFC
Robotic Automation Process (RPA) Webinar - By Matrix-IFS (Idan Tohami)
Anshul Arora presented Matrix-IFS' RPA solution, which covered:
- Integrating AML, Fraud and Cyber-security Investigations
- Eliminate Manual Time Consuming Tasks Using Automation
- Proactive Investigations - System Triggering using AI and Machine Learning Trends
Public cloud spending is growing rapidly, with the public cloud market expected to reach $236 billion by 2020. While public cloud platforms are growing the fastest, cloud and on-premises environments still need to co-exist. There are different hybrid models organizations can choose from based on their environment, tiers, load requirements, and cloud readiness. A hybrid multi-cloud environment provides capabilities across infrastructure, security, integration, service operation, and service transition to manage applications and data across on-premises and multiple cloud platforms.
The document discusses CloudZone's path to helping customers adopt AWS cloud services. It describes AWS' global infrastructure including regions and availability zones. CloudZone provides assessments, governance, workload reviews, and implementation to help customers migrate systems to AWS cloud. Ongoing services include cost optimization and managed services. Two customer case studies are presented: a Ministry of Health using AWS for big data healthcare research, and a manufacturer using AWS for worldwide connectivity of factory data collection.
The document discusses how enterprises are accelerating their journey to the cloud. It notes that change has become more dynamic and that transformation can take years during which the patient/enterprise needs to remain conscious. It discusses how the traditional IT model lacks agility to keep pace with startups. Adopting capabilities of startups can help but bridging the gap is not simple. AWS provides services that can help enterprises and startups bridge this gap. Moving to the cloud allows enterprises to focus on their core mission rather than IT operations. It also discusses how enterprises can become more agile like startups through practices like DevOps and continuous delivery. The document also discusses how the cloud makes it feasible for enterprises to move to the next generation
This document provides an overview of Google Cloud Fundamentals. It introduces Andrew Liaskovski as the teacher and covers various Google Cloud topics including migration, security, DevOps, big data, and disaster recovery services. It also discusses CloudZone's full service package including consulting, managed services, and professional services. The rest of the document focuses on specific Google Cloud products and services such as Compute Engine, App Engine, Container Engine, Cloud Storage, Cloud SQL, networking, big data, and machine learning.
This document provides instructions for deploying the necessary environments and tools for a data analytics lab. It includes setting up a Hortonworks sandbox cluster on Azure, creating an Azure data science virtual machine, and optional configurations for Azure Data Lake and SQL Data Warehouse. Completing these steps ensures students have all required software and access installed prior to the lab. The document estimates completion of the prerequisite setup should take less than 30 minutes.
Cloud Regulations and Security Standards by Ran Adler (Idan Tohami)
The document discusses regulations and standards related to cloud computing and privacy. It outlines various regulations including GDPR, Ramot (Israeli privacy authority), and Privacy Shield. It also discusses standards such as ISO 27017 and 27018 which provide guidance on information security controls for cloud computing. The document suggests that cloud computing raises risks regarding confidentiality but can improve availability and integrity if proper security policies and frameworks are implemented.
Azure Logic Apps by Gil Gross, CloudZoneIdan Tohami
This document discusses Azure Logic Apps and serverless computing. It defines key cloud computing models such as IaaS, PaaS, and serverless, where serverless means running code without provisioning or managing dedicated servers. Logic Apps automate workflows between cloud services without code by using connectors; popular connectors include FTP, HTTP, and Office 365. Logic Apps are billed per action, and pricing examples are provided. Advanced uses of Logic Apps include orchestrating API apps, data validation, transformation, and connectivity between cloud and on-premises systems.
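(Not from the deck, just an illustration of the trigger model described above: a Logic App that starts with the "When an HTTP request is received" trigger can be invoked by posting JSON to its callback URL. The URL and payload below are placeholders.)

# Minimal Python sketch: invoking a Logic App HTTP request trigger.
# The callback URL (with its SAS signature) is shown in the Logic Apps
# designer after the trigger is saved; the one here is a placeholder.
import requests

CALLBACK_URL = "https://prod-00.westeurope.logic.azure.com/workflows/<id>/triggers/manual/paths/invoke?sig=<sas>"
payload = {"orderId": "12345", "status": "shipped"}  # example body the workflow expects

resp = requests.post(CALLBACK_URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.status_code)  # 200 or 202, depending on whether the workflow responds synchronously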
AWS Fundamentals @Back2School by CloudZoneIdan Tohami
This document provides an overview of an AWS Fundamentals course. The course objectives are to teach attendees how to navigate the AWS Management Console, understand foundational AWS services like EC2, VPC, S3, and EBS, manage security and access with IAM, use database services like DynamoDB and RDS, and manage resources with services like Auto Scaling, ELB, and CloudWatch. The agenda covers introductions to AWS, foundational services, security and IAM, databases, and management tools.
Rolling presentation during Couchbase Day, including:
Introduction to NoSQL
Why NoSQL?
Introduction to Couchbase
Couchbase Architecture
Single Node Operations
Cluster Operations
HA and DR
Availability and XDCR
Backup/Restore
Security
Developing with Couchbase
Couchbase SDKs
Couchbase Indexing
Couchbase GSI and Views
Indexing and Query
Couchbase Mobile
Sarine's Big Data Journey by Rostislav AaronovIdan Tohami
This document discusses how Sarine, a company that provides technology for the diamond industry, uses Elasticsearch. It notes that Sarine stores over 400 million documents totaling 1 terabyte of data across 125 indices. Sarine uses Elasticsearch for logging application requests, monitoring system activity, collecting statistics, and visualizing and reporting on data. The document offers recommendations for implementing and using Elasticsearch, such as running at least three nodes, carefully designing index mappings, educating teams, and engaging partners for consulting.
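(As a rough illustration of those recommendations, and not code from the talk, the Python sketch below creates an index with an explicit, strict mapping and replica settings sized for a cluster of at least three nodes; host, index, and field names are made up.)

# Hedged sketch using the official elasticsearch-py client (8.x-style keyword arguments).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="app-request-logs",
    settings={
        "number_of_shards": 3,    # spread primaries across a cluster of at least three data nodes
        "number_of_replicas": 1,  # one replica per primary for availability
    },
    mappings={
        "dynamic": "strict",      # reject fields that were not designed up front
        "properties": {
            "timestamp":  {"type": "date"},
            "service":    {"type": "keyword"},
            "request_id": {"type": "keyword"},
            "latency_ms": {"type": "integer"},
            "message":    {"type": "text"},
        },
    },
)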
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done in teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop ideas for their own qualitative coding ChatGPT. Participants who have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and a slide deck that participants will be able to use to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, only for trying out personal GPTs during it.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
Phenomics assisted breeding in crop improvementIshaGoswami9
The global population is increasing and will reach about 9 billion by 2050; together with climate change, this makes it difficult to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and a growing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding complex characteristics controlled by multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data linkable to genomic information at all growth stages have become as important as genotyping; thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
ESPP presentation to EU Waste Water Network, 4th June 2024: “EU policies driving nutrient removal and recycling and the revised UWWTD (Urban Waste Water Treatment Directive)”
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of food. There are various methods for treating food to preserve it, and irradiation is one of them. It is the most common and most harmless method of food preservation, as it does not alter the essential micronutrients of the food. Although irradiated food does not harm human health, quality assessment of food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of food and the free radicals induced during its processing. The ESR spin trapping technique is useful for detecting highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin trapping technique.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk, Journées Nationales du GDR GPL 2024
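(To make the sampling idea above concrete, here is a small illustrative Python sketch, not taken from the talk, that enumerates a toy configuration space and draws a uniform random sample of configurations to measure; the option names and the measurement function are placeholders.)

import itertools
import random

options = {
    "compiler_flag": ["-O1", "-O2", "-O3"],
    "library_version": ["1.2", "1.3"],
    "input_size": ["small", "large"],
}

def measure(config):
    # Placeholder for a real measurement (build, run, record a metric).
    return hash(frozenset(config.items())) % 1000  # fake "runtime in ms"

# Enumerate the full space, then draw a uniform random sample of configurations.
space = [dict(zip(options, values)) for values in itertools.product(*options.values())]
sample = random.sample(space, k=min(5, len(space)))

for config in sample:
    print(config, "->", measure(config))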
2. Table of Contents
Section 1: Getting from Research to Application… Faster. What are the bottlenecks for translating research into products? Emphasis on information processing.
Section 2: From CompBio Research to CompBio Engineering. Getting results, more of them, and predictably improving.
Section 3: Data Integration - Cutting Edge Use Cases. What’s happening right now in industry and academia?
Throughout: How to use Google Cloud? I’ll introduce specific cloud services, along with examples of how they’ve been used successfully: Compute Engine, Kubernetes, Dataflow, Cloud ML, Genomics API.
3. How to Understand?
Linear B is a syllabic script that was used for writing Mycenaean Greek, the earliest attested form of Greek. The script predates the Greek alphabet by several centuries. The oldest Mycenaean writing dates to about 1450 BC.
6. DNA Sequencing Value Chain
[Chart from Sboner et al., 2011, "The real cost of sequencing: higher than you think!": %Effort (0 to 100) spent on Experiment Design, DNA Sequencing, Secondary Analytics, and Analytics/Interpretation/Planning across three eras: Pre-NGS (~2000), Now, and Future (~2020).]
7. Human Genetics Scenario
[Same %Effort chart (Sboner et al., 2011), applied to a human genetics scenario.]
Situation: Unlimited free DNA. Result: Slow to understand.
8. Q: Why Slow to Understand? A1: Data Processing
[Same %Effort chart (Sboner et al., 2011).]
Situation: We still have an analysis bottleneck. Result: Slow to understand.
13. Google is good at handling massive volumes of data
Uploads per minute: 300 hrs
Users: 500M+
Search index: 100 PB+
Query response time: 0.25 s
14. Google is good at handling massive volumes of genomic data
Uploads per minute: 300 hrs, or ~6 WGS
Users: 500M+, or >100x US PhDs
Search index: 100 PB+, or ~1M WGS
Query response time: 0.25 s
15. Google Genomics (August 2015)
16. Google Genomics is more than infrastructure
[Diagram: general-purpose cloud infrastructure (Virtual Machines & Storage; Data Services & Tools) plus genomics-specific features (Genomics API).]
17. Google’s vision to tackle complex health data
[Diagram: a BioQuery Analysis Engine spanning public data, Baseline Study data, and private data (medical records, genomics, devices, imaging, patient reports), serving pharma, health providers, and others.]
19. 3.75 TERABYTES PER HUMAN
1.00 TB GENOME
2.00 TB EPIGENOME
0.70 TB TRANSCRIPTOME
0.06 TB METABOLOME
0.04 TB PROTEOME
~1 MB STANDARD LAB TESTS
5-YR LONGITUDINAL STUDY
BASELINE STUDY: BIG DATA ANALYSIS
Validate a pipeline to process complex phenotypic, biochemical, and genomic data
● Pilot Study (N=200)
○ Determine optimal biospecimen collection strategy for stable sampling and reproducible assays
○ Determine optimal assay methodology
○ Validate quality control methods
○ Validate device data against surrogate and primary endpoints
● Baseline Study (N=10,000+)
○ 6 cohorts from low to high risk for cardiovascular and cancer
○ Characterize human systems biology
○ Define normal values for a given parameter in heterogeneous states
○ Predict meaningful events
○ Validate wearable devices for human monitoring
○ Characterize transitions in disease state
20. Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery; you pay only for the queries that you perform on the data (the first 1 TB per month is free).
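(A hedged sketch of the pay-per-query model described above, using the official google-cloud-bigquery Python client; the 1000 Genomes table id is illustrative and should be checked against the current public-data listing before running.)

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

sql = """
    SELECT reference_name, COUNT(*) AS variant_count
    FROM `genomics-public-data.1000_genomes.variants`
    GROUP BY reference_name
    ORDER BY variant_count DESC
    LIMIT 10
"""

# Run the query and print one row per chromosome with its variant count.
for row in client.query(sql).result():
    print(row.reference_name, row.variant_count)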
21. Population-scale Genome Projects
[Slide listing example public datasets. Medical (human): Platinum Genomes, 1000 Genomes, Personal Genome Project, Human Microbiome Project, NCBI GEO Human 100K, Cancer Genome Atlas. Veterinary: 1000 Bulls, 10K Dog Genomes. Agriculture: Open Cannabis Project, Genome To Fields, Panzea (1000 Maize). Many other interesting datasets...]
22. PI / Biologist: variant calls for the 1000 Genomes
23. Information: principal coordinates analysis (1000 Genomes)
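(Rough sketch of how a principal coordinates analysis like the one on this slide can be computed: classical MDS on a pairwise distance matrix. The genotype matrix here is randomly generated as a stand-in; this is not the deck's code.)

import numpy as np

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 500)).astype(float)  # samples x variants, allele counts 0/1/2

# Pairwise squared Euclidean distances between samples.
sq_norms = (genotypes ** 2).sum(axis=1)
d2 = np.clip(sq_norms[:, None] + sq_norms[None, :] - 2 * genotypes @ genotypes.T, 0, None)

# Classical MDS / PCoA: double-center the squared distances, then eigendecompose.
n = d2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ d2 @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
coords = eigvecs[:, order[:2]] * np.sqrt(np.maximum(eigvals[order[:2]], 0))

print(coords[:5])  # first two principal coordinates for the first five samples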
26. Dataflow + BigQuery
Dataflow: used for Extract, Transform, Load (ETL), analytics, real-time computation, and process orchestration. cloud.google.com/dataflow
BigQuery: run SQL queries against multi-terabyte datasets in seconds. cloud.google.com/bigquery
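(Hedged sketch of the Dataflow + BigQuery pairing on this slide: an Apache Beam pipeline, Python SDK, that reads rows from BigQuery, aggregates them, and writes the counts back. Project, bucket, dataset, and column names are placeholders, not values from the deck.)

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",  # or "DirectRunner" for local testing
    project="YOUR_PROJECT",
    temp_location="gs://YOUR_BUCKET/tmp",
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read one row per variant from a BigQuery table (placeholder names).
        | "ReadVariants" >> beam.io.ReadFromBigQuery(
            query="SELECT reference_name FROM `YOUR_PROJECT.your_dataset.variants`",
            use_standard_sql=True,
        )
        | "KeyByChromosome" >> beam.Map(lambda row: (row["reference_name"], 1))
        | "CountPerChromosome" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"reference_name": kv[0], "variant_count": kv[1]})
        # Write the per-chromosome counts back to BigQuery.
        | "WriteCounts" >> beam.io.WriteToBigQuery(
            "YOUR_PROJECT:your_dataset.variant_counts",
            schema="reference_name:STRING,variant_count:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )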
29. Example: GATK Analysis Pipeline
Old way: install applications on the host (kernel, libs, apps), orchestrated with Makefiles, CWL, or WDL on a virtual machine.
30.
31.
33. Example: GATK Analysis Pipeline
Old way: install applications on the host (kernel, libs, apps), orchestrated with Makefiles, CWL, or WDL on a virtual machine.
New way: deploy containers (Dockerflow: Dataflow + Docker).
Benefits:
● Decouple process management from host configuration
● Portable across OS distros and clouds
● Consistent environment from development to production
● Immutable images
[Diagram contrasting the host-installed stack with per-container app and libs images sharing one kernel.]
34. Use Case: Reproducible Science with Docker
● Objective: Build a mutation-detection pipeline
● Provided to competitors
○ Training data set
○ Evaluation data set
● Competitors submit pipelines as Docker images to DREAM Challenge host, Sage Bionetworks
● Submitted pipelines were used to process unseen data set
● Post-competition, Docker images made public
● Incidentally, Google won this competition with a deep-learning based variant caller called DeepVariant: cloud.google.com/genomics/v1alpha2/deepvariant
35. Threats to reproducible science
[Figure: an idealized version of the hypothetico-deductive model of the scientific method. Various potential threats to this model exist (indicated in red), including hypothesizing after the results are known (HARKing) and lack of data sharing. Together these undermine the robustness of results, and may impact on the ability of science to self-correct.]
http://www.nature.com/articles/s41562-016-0021
36. To run it:
> java -jar target/dockerflow*dependencies.jar \
    --project=YOUR_PROJECT \
    --workflow-file=hello.yaml \
    --workspace=gs://YOUR_BUCKET/YOUR_FOLDER \
    --runner=DataflowPipelineRunner
[Pipeline diagram with components: Sequencer, DNA Reads, PubSub queue, Genomics API, Your Variant Caller, Variant Calls, BigQuery, Your Other Tool.]
37. GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM
42. Q: Why Slow to Understand? A1: Data Processing
[Recap of slide 8: same %Effort chart (Sboner et al., 2011). Situation: We still have an analysis bottleneck. Result: Slow to understand.]
43. Q: Why Slow to Understand? A2: Limited Feedback
[Same %Effort chart (Sboner et al., 2011).]
Situation: Data acquisition cost approaches zero. However, still slow to understand, because:
1. Restricted choice of what can be observed, i.e. controlled modifications and artificial selection
2. Passive learning: limited feedback => low rate of learning
Contrast with active learning...
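(Illustrative sketch, not from the deck, of the active-learning contrast above: instead of passively consuming whatever data arrives, the learner repeatedly asks for labels on the examples it is least certain about. Uses scikit-learn on synthetic data; all names are placeholders.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=20, replace=False))  # small seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for round_ in range(10):
    model.fit(X[labeled], y[labeled])
    # Query the pool example the model is least certain about (probability closest to 0.5).
    probs = model.predict_proba(X[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)    # "ask the experiment" for this label
    unlabeled.remove(query)
    print(f"round {round_}: accuracy on full data = {model.score(X, y):.3f}")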
53. Verily: Assisting Pathologists in Detecting Cancer with Deep Learning
research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. The model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.