The document discusses using cloud computing resources for large-scale cheminformatics and computational chemistry applications. It describes how the cloud enables on-demand access to massive computing power and data storage. While some legacy applications can directly use cloud clusters, others may require redesigning algorithms to take advantage of the cloud's parallel and distributed capabilities, such as using Hadoop for large database searches and machine learning over big datasets. Simpler programming frameworks like Pig Latin can make Hadoop applications easier to write.
YARN - Hadoop Next Generation Compute Platform (Bikas Saha)
The presentation emphasizes the new mental model of YARN as the cluster operating system, on which one can write and run different kinds of applications in Hadoop within a cooperative, multi-tenant cluster.
The document discusses high availability in Hadoop 2.0 and YARN. It describes the differences between Hadoop 1.0 and 2.0, including changes to configuration files and directories. It then explains the components and workflow of YARN, including how it separates resource management and scheduling from job execution. Finally, it discusses setting up high availability for the NameNode using shared storage and Zookeeper.
BIGDATA - Survey on Scheduling Methods in Hadoop MapReduce (Mahantesh Angadi)
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Operating multi-tenant clusters requires careful planning of capacity for on-time launch of big data projects and applications within expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operate big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques and methodology applied on a per-project or user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated to arrive at the most appropriate calculations across these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan optimum capacity allocation per project with respect to given standard hardware configurations.
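As a back-of-the-envelope illustration of the kind of sizing arithmetic such estimation tools encode (a generic formula, not the talk's actual model), the number of HDFS data nodes needed for a dataset can be approximated as

    N_{\mathrm{nodes}} \approx \left\lceil \frac{D \cdot r \cdot (1 + o)}{c} \right\rceil

where D is the raw data volume, r the HDFS replication factor (commonly 3), o a headroom fraction for intermediate and temporary data, and c the usable storage per data node. Analogous per-resource calculations (CPU, memory, region-server or supervisor counts) are what feed this style of capacity planning.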
Apache Drill is an open source engine for interactive analysis of large-scale datasets. It was inspired by Google's Dremel, which allows interactive querying of trillions of records at fast speeds. Drill uses a SQL-like language called DrQL to query nested data in a column-based manner. It has a flexible architecture that allows pluggable query languages, execution engines, data formats and sources. Drill aims to be fast, flexible, dependable and easy to use for interactive analysis of big data.
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran (MapR Technologies)
(1) The amount of data in the world is growing exponentially, with unstructured data making up over 80% of collected data by 2020. (2) Apache Drill provides data agility for Hadoop by enabling self-service data exploration through a flexible data model and schema discovery. (3) Drill allows business users to rapidly query diverse data sources like files, HBase tables, and Hive without requiring IT, through a simple SQL interface.
Resource Aware Scheduling for Hadoop [Final Presentation] (Lu Wei)
The document describes a resource-aware scheduler for Hadoop that aims to improve task scheduling by considering both job resource demands and node resource availability. It captures job and node profiles, estimates task execution times, and applies scheduling policies like shortest job first. Evaluation on word count and Pi estimation workloads showed the estimated task times closely matched the actual times, demonstrating the accuracy of the scheduler's resource modeling and estimations.
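To make the shortest-job-first policy concrete, here is a minimal, hypothetical sketch (not the authors' implementation) that orders pending jobs by their estimated execution time; the JobProfile class and the estimates are invented for illustration:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Hypothetical sketch: order pending jobs by estimated execution time,
    // as a resource-aware scheduler applying shortest-job-first might do.
    public class ShortestJobFirst {
        static class JobProfile {
            final String jobId;
            final double estimatedSeconds; // estimate derived from job/node profiles

            JobProfile(String jobId, double estimatedSeconds) {
                this.jobId = jobId;
                this.estimatedSeconds = estimatedSeconds;
            }
        }

        public static void main(String[] args) {
            PriorityQueue<JobProfile> queue =
                    new PriorityQueue<>(Comparator.comparingDouble((JobProfile j) -> j.estimatedSeconds));
            queue.add(new JobProfile("wordcount", 120.0));
            queue.add(new JobProfile("pi-estimation", 45.0));
            queue.add(new JobProfile("log-aggregation", 300.0));

            // Jobs are dispatched in order of increasing estimated runtime.
            while (!queue.isEmpty()) {
                System.out.println("schedule next: " + queue.poll().jobId);
            }
        }
    }

In the actual scheduler, the estimates would come from the captured job and node profiles rather than hard-coded values.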
The document provides an overview of the Hadoop ecosystem. It introduces Hadoop and its core components, including MapReduce and HDFS. It describes other related projects like HBase, Pig, Hive, Mahout, Sqoop, Flume and Nutch that provide data access, algorithms, and data import capabilities to Hadoop. The document also discusses hosted Hadoop frameworks and the major Hadoop providers.
This document introduces MapR and Hadoop. It provides an overview of Hadoop, including how MapReduce works and the Hadoop ecosystem of tools. It explains that MapR is mostly compatible with Hadoop but aims to improve reliability, performance, and management compared to other Hadoop distributions through its architecture and features. The objectives are to explain why Hadoop is important for big data, describe MapReduce jobs, identify Hadoop tools, and compare MapR to other Hadoop distributions.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations (Sumeet Singh)
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
Treasure Data on The YARN - Hadoop Conference Japan 2014 (Ryu Kobayashi)
Ryu Kobayashi from Treasure Data gave a presentation on using YARN (Yet Another Resource Negotiator) with Hadoop. Some key points:
- YARN was introduced to improve Hadoop resource management by separating processing from scheduling.
- Configuration changes are required when moving from MRv1 to YARN, including properties for memory allocation and scheduler configuration (a minimal sketch of typical settings follows this list).
- Container execution, directories, and other components were adapted in the transition from JobTracker to the ResourceManager and NodeManager architecture in YARN.
- Proper configuration of YARN is important to avoid bugs, and tools from distributions can help with configuration.
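As a sketch of the memory-related properties typically involved in such a migration (the property names are standard Hadoop/YARN keys; the values below are placeholders, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    // Illustrative only: common YARN/MapReduce memory settings touched when moving
    // from MRv1 to YARN. The values are placeholders; tune them for your cluster.
    public class YarnMemoryConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.set("yarn.nodemanager.resource.memory-mb", "8192");   // RAM a NodeManager may hand out
            conf.set("yarn.scheduler.minimum-allocation-mb", "1024");  // smallest container the RM will grant
            conf.set("yarn.scheduler.maximum-allocation-mb", "8192");  // largest container the RM will grant
            conf.set("mapreduce.map.memory.mb", "1536");               // container size for map tasks
            conf.set("mapreduce.reduce.memory.mb", "3072");            // container size for reduce tasks
            conf.set("mapreduce.framework.name", "yarn");              // run MapReduce on YARN rather than classic MRv1
            return conf;
        }
    }

The same keys can equally be set in yarn-site.xml and mapred-site.xml, which is where distribution tooling usually manages them.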
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
Vinod Kumar Vavilapalli presented on Apache Hadoop YARN: Present and Future. He discussed how YARN improved on Hadoop 1 by separating resource management from processing, allowing multiple types of applications on the same platform. He summarized recent Hadoop releases including YARN enhancements like high availability and preemption. Future plans include improved isolation, multi-dimensional scheduling, and supporting long-running services. YARN aims to be a general resource management platform powering a growing ecosystem of applications beyond just MapReduce.
Brad Anderson from MapR Technologies presented on technologies for interactive analysis (Apache Drill) and stream processing (Storm) beyond traditional batch processing with Hadoop/MapReduce. Drill allows interactive queries over large datasets through its columnar storage and distributed query engine. Storm is a framework for real-time computation over streaming data through topologies of processing components. M7 provides a more reliable and higher performance alternative to HBase through its unified storage and simplified architecture with no external daemons.
This document provides an overview of Hadoop and big data concepts. It discusses Hadoop core components like HDFS, YARN, MapReduce and how they work. It also covers related technologies like Hive, Pig, Sqoop and Flume. The document discusses common Hadoop configurations, deployment modes, use cases and best practices. It aims to help developers get started with Hadoop and build big data solutions.
Challenges & Capabilities in Managing a MapR Cluster by David Tucker (MapR Technologies)
"If you're using Hadoop in production, how do you manage it? Does the distribution you're using provide any tools to make the job easier? What are the pitfalls? Are there parts of the system that are less robust or that have problems more often? Are you running Hadoop on bare metal, or in a cloud environment, and is one easier than the other?"
MapR Senior Solutions Architect David Tucker speaks about the challenges and capabilities in managing a cluster. This talk was given at the SF Bay Area Large Scale Production Engineering Meetup (Sept 19, 2013).
Hadoop Installation, Configuration, and MapReduce Program (Praveen Kumar Donta)
This presentation gives a brief description of big data, along with Hadoop installation and configuration and a MapReduce word count program with its explanation.
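For reference, the word count program mentioned here generally follows the stock Hadoop MapReduce example; a minimal version looks like the following sketch:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job is typically launched with something like hadoop jar wordcount.jar WordCount <input> <output>, where input and output are HDFS paths.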
1) The document discusses using search and big data technologies to enable reflected intelligence applications through crowd sourcing.
2) It provides background on Ted Dunning and Grant Ingersoll and outlines use cases that combine search, analytics, and machine learning like social media analysis in telecom, claims analysis, and content recommendation.
3) The authors propose a reference architecture combining LucidWorks Search, MapR technologies, and other tools to build a next generation search and discovery platform for these types of reflected intelligence applications.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
The document discusses using Hadoop for scientific workloads and summarizes early results from benchmarking Hadoop. It explores using Hadoop and MapReduce for data-intensive scientific applications like BLAST sequence analysis. Performance results show that Hadoop can provide comparable performance to existing parallel file systems. Challenges include lack of turn-key solutions, managing data formats, and performance tuning. The research aims to understand the unique needs of science clouds and how to effectively support data-intensive scientific applications on cloud platforms.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document discusses best practices for scaling Hadoop applications. It covers causes of sublinear scalability like sequential bottlenecks, load imbalance, over-partitioning, and synchronization issues. It also provides equations for analyzing scalability and discusses techniques like reducing algorithmic overheads, increasing task granularity, and using compression. The document recommends using higher-level languages, tuning configuration parameters, and minimizing remote procedure calls to improve scalability.
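A standard equation used in this kind of scalability analysis (whether or not it is the exact form in the slides) is Amdahl's law:

    S(N) = \frac{1}{(1 - p) + p/N}

where p is the fraction of the job that parallelizes and N is the number of workers. Even with p = 0.95, the speedup on 100 nodes is capped near 17x, which is why sequential bottlenecks and synchronization issues dominate sublinear scaling.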
Characterization of Chemical Libraries Using Scaffolds and Network Models (Rajarshi Guha)
This document discusses characterizing chemical libraries using scaffold networks. Scaffold networks represent chemical structures as nodes in a network connected by edges denoting substructure relationships. Network metrics can characterize overall library structure and distributions of properties at the scaffold level. Comparing libraries involves identifying common scaffolds and comparing their properties in the different networks. Reducing the networks to forests of trees provides a simplified representation while retaining structural information to enable comparisons of library coverage and diversity.
This document discusses high-throughput screening (HTS) workflows for identifying biologically active small molecules. It describes how robots are used to rapidly screen large libraries of compounds in assays and generate large datasets. Statistical and machine learning methods in R can then be used to build predictive models from these datasets to identify promising leads and guide the screening of additional compounds. Caveats regarding the applicability of models to new chemical spaces are also discussed.
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATS (Rajarshi Guha)
This document discusses the work of the National Center for Advancing Translational Sciences (NCATS) in bridging chemistry, biology and informatics to improve the process of translational research. It describes NCATS' mission to develop new methods and technologies to enhance drug development and implementation of interventions to improve human health. Specifically, it outlines initiatives at NCATS such as the Chemical Genomics Center, which performs high-throughput screens and develops chemical probes and leads. It also discusses how translational bioinformatics uses data integration to move between molecular to clinical scales to enable decision-making in areas like drug design and target validation.
The design of chemical libraries is usually informed by pre-existing characteristics and desired features. On the other hand, assessing the prospective performance of a new library is more difficult. Importantly, a given screening library is often screened in a variety of systems, which can differ in cell lines, readouts, formats and so on. In this study we explore to what extent pre-existing libraries can shed light on the relation between library activity and assay features. Using an ontology such as the BAO, it is possible to construct a hierarchy of annotations associated with an assay. Based on this annotation hierarchy we can then ask how likely molecules associated with a specific annotation are to be identified as active. To allow generalization we consider substructural features, as represented by a structural key fingerprint, rather than whole molecules. We employ a Bayesian framework to quantify the association between a substructural feature and a given assay annotation, using a set of NCGC assays that have been annotated with BAO terms. We discuss our approach to training the Bayesian model and describe benchmarks that characterize model performance relative to the position of the annotation in the BAO hierarchy. Finally, we discuss the role of this approach in a library design workflow that includes traditional design features such as chemical space coverage and physicochemical properties but also takes into account screening platform features.
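One common way to formulate such a score (shown purely as an illustration of the general idea, not necessarily the exact model used in this work) is a Laplace-corrected conditional probability for a substructural feature f under an assay annotation a:

    \hat{P}(\mathrm{active} \mid f, a) = \frac{A_{f,a} + k\,\bar{p}_a}{N_{f,a} + k}

where A_{f,a} is the number of active molecules containing f in assays carrying annotation a, N_{f,a} is the total number of molecules containing f in those assays, \bar{p}_a is the overall hit rate for the annotation, and k is a smoothing constant. Features whose corrected probability sits well above \bar{p}_a would then be treated as enriched for activity under that annotation.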
So I have an SD File … What do I do next? (Rajarshi Guha)
This document discusses important considerations for working with chemical structure data files. It recommends SMILES and MOL files as the best formats for data storage and manipulation due to their wide support. While file formats store structure representations, canonical formats like InChI and canonical SMILES are needed to reliably determine molecular identity. Special attention must be paid to stereochemistry and aromaticity representations between different file formats and software.
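To illustrate the canonicalization point, the sketch below uses the Chemistry Development Kit (CDK) to read a SMILES string and emit canonical SMILES and an InChI; the class and method names reflect recent CDK releases and may differ slightly between versions:

    import org.openscience.cdk.inchi.InChIGeneratorFactory;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.silent.SilentChemObjectBuilder;
    import org.openscience.cdk.smiles.SmiFlavor;
    import org.openscience.cdk.smiles.SmilesGenerator;
    import org.openscience.cdk.smiles.SmilesParser;

    public class CanonicalIdentity {
        public static void main(String[] args) throws Exception {
            // Parse an input SMILES into a molecule object.
            SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
            IAtomContainer mol = parser.parseSmiles("OC(=O)c1ccccc1O"); // salicylic acid

            // Canonical SMILES: the same molecule always yields the same string.
            String canonical = new SmilesGenerator(SmiFlavor.Canonical).create(mol);

            // InChI: a layered, canonical identifier useful for cross-file comparison.
            String inchi = InChIGeneratorFactory.getInstance()
                    .getInChIGenerator(mol).getInchi();

            System.out.println(canonical);
            System.out.println(inchi);
        }
    }

Two files that encode the same molecule differently should yield identical canonical SMILES and InChI strings, which is what makes these forms suitable for identity checks.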
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform (Rajarshi Guha)
This document summarizes work done using a high throughput screening (HTS) platform to discover novel drug combinations. Over 140,000 combinations across 320 cell lines have been screened involving 1911 small molecules focused on oncology, infectious disease, and stem cell biology. Analysis of the large dataset seeks to identify global trends in synergistic combinations based on targets, mechanisms of action, and physicochemical properties. Challenges include accurately characterizing combination quality and predicting synergies. Ongoing work involves exploring differential responses, other readout measures beyond viability, and translating combinations to in vivo models.
Big Data and Hadoop in Cloud - Leveraging Amazon EMR (Vijay Rayapati)
This document discusses big data, Hadoop, and using Hadoop in the cloud via Amazon EMR. It provides an overview of big data and what Hadoop is, explains how Hadoop works and how it can help store and process large datasets. It then discusses how Amazon EMR can be used to deploy Hadoop clusters in the cloud without having to manage the underlying infrastructure, and provides instructions on setting up and using EMR. Finally, it discusses debugging, profiling, and performance tuning Hadoop jobs and EMR clusters.
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All the copyrights are reserved with the author
An Introduction to Apache Hadoop, Mahout and HBase (Lukas Vlcek)
Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers. It implements the MapReduce programming model pioneered by Google and a distributed file system (HDFS). Mahout builds machine learning libraries on top of Hadoop. HBase is a non-relational distributed database modeled after Google's BigTable that provides random access and real-time read/write capabilities. These projects are used by many large companies for large-scale data processing and analytics tasks.
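To illustrate the random-access, real-time read/write model that HBase provides, here is a minimal sketch using the standard HBase client API (the table, column family and values are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write a single cell keyed by row.
                Put put = new Put(Bytes.toBytes("user-42"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);

                // Read it back in real time, no MapReduce job required.
                Result result = table.get(new Get(Bytes.toBytes("user-42")));
                byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }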
- We present a software model built on the Apache software stack (ABDS) that is widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
- We discuss the layers in this stack.
- We give examples of integrating ABDS with HPC.
- We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators.
- We present Cloudmesh as supporting Software-Defined Distributed System as a Service (SDDSaaS) with multiple services on multiple clouds/HPC systems.
- We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported.
1) Hadoop is a framework for distributed processing of large datasets across clusters of computers using a simple programming model.
2) Virtualizing Hadoop enables rapid deployment, high availability, elastic scaling, and consolidation of big data workloads on a common infrastructure.
3) Serengeti is a tool that automates the deployment and management of Hadoop clusters on vSphere in under 30 minutes through simple commands.
The document discusses virtualizing Hadoop clusters on VMware vSphere. It describes how Hadoop enables parallel processing of large datasets across clusters using MapReduce. Virtualizing Hadoop provides benefits like simple operations, high availability, and elastic scaling. The document outlines challenges with using Hadoop and how virtualization addresses them. It provides examples of deploying Hadoop clusters on Serengeti and configuring different distributions. Performance results show little overhead from virtualization and benefits of local storage. Joint engineering with Hortonworks adds high availability to Hadoop master daemons using vSphere features.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for distributed storage and fault tolerance, YARN for resource management, and MapReduce for parallel processing of large datasets. It provides details on the architecture of HDFS including the name node, data nodes, and clients. It also explains the MapReduce programming model and job execution involving map and reduce tasks. Finally, it states that as data volumes continue rising, Hadoop provides an affordable solution for large-scale data handling and analysis through its distributed and scalable architecture.
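A small sketch of how a client interacts with HDFS through the FileSystem API (the NameNode address and paths are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            // The client talks to the NameNode for metadata; block data flows to/from DataNodes.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/example.txt");

                // Write a file; HDFS splits it into blocks and replicates them across DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read it back.
                try (BufferedReader reader =
                             new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                    System.out.println(reader.readLine());
                }
            }
        }
    }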
Big Data Analytics with Hadoop, MongoDB and SQL Server (Mark Kromer)
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
Hadoop Training: enhance your big data knowledge with online training. For more info: http://www.hadooponlinetutor.com
This document discusses big data and cloud computing. It introduces cloud storage and computing models. It then discusses how big data requires distributed systems that can scale out across many commodity machines to handle large volumes and varieties of data with high velocity. The document outlines some famous cloud products and their technologies. Finally, it provides an overview of the company's focus on enterprise big data management leveraging cloud technologies, and lists some of its cloud products and services including data storage, object storage, MapReduce and compute cloud services.
Apache Spark and the Emerging Technology Landscape for Big Data (Paco Nathan)
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
Josh Patterson gave a presentation on Hadoop and how it has been used. He discussed his background working on Hadoop projects including for the Tennessee Valley Authority. He outlined what Hadoop is, how it works, and examples of use cases. This includes how Hadoop was used to store and analyze large amounts of smart grid sensor data for the openPDC project. He discussed integrating Hadoop with existing enterprise systems and tools for working with Hadoop like Pig and Hive.
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA (Tomas Cervenka)
Tomáš Červenka will discuss Hive, an open-source data warehousing system built on Hadoop that provides SQL-like queries over large datasets. He will explain what Hive is useful for (big data analytics and processing), and not useful for (real-time queries and algorithms difficult to parallelize). He will demonstrate how to get started with Hive using Amazon EMR and provide a sample query, and discuss how VisualDNA uses Hive for analytics, reporting pipelines, and machine learning inference. Tips provided include using fast instance types, compression, and partitioning data.
Jumpstart your career with the world's most in-demand technology: Hadoop. Hadooptrainingacademy provides Hadoop online training with quality videos, comprehensive online live training, and detailed study material. For more info, visit: http://www.hadooptrainingacademy.com/
This document discusses the next generation of Apache Hadoop and MapReduce. It outlines limitations with the current MapReduce framework including scalability, single points of failure, and lack of support for other programming paradigms. The next generation architecture addresses these by splitting the JobTracker into a ResourceManager and ApplicationMaster, distributing application management, and allowing custom application frameworks. This improves scalability, availability, utilization, and supports additional paradigms like iterative processing, while maintaining wire compatibility.
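As a sketch of how the ResourceManager/ApplicationMaster split surfaces to client code, the YARN client API lets any application query the ResourceManager for running applications (the fields printed here are illustrative):

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApplications {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // The client talks only to the ResourceManager; each application's
            // lifecycle is handled by its own ApplicationMaster running in a container.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " " + app.getApplicationType()
                        + " " + app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }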
Similar to Chemogenomics in the cloud: Is the sky the limit?
Pharos: A Torch to Use in Your Journey in the Dark Genome (Rajarshi Guha)
This document summarizes Pharos, a knowledge management center and database for biomedical research targets developed by the National Center for Advancing Translational Sciences (NCATS). Key points:
- Pharos contains over 20,000 biomedical targets, drugs, diseases, and related publications to facilitate exploration and hypothesis generation.
- It provides search, visualization, and filtering capabilities to browse entity relationships and summaries. Target knowledge vectors also enable new methods like recommendations and clustering.
- The goals are to characterize understudied targets, identify small molecules/biologics, explore research landscapes, and support data mining. Pharos is intended for biologists, clinical researchers, and informatics scientists.
- Future work
This document describes Pharos, a web application and database that provides integrated access to data on biological targets and their relationships to diseases, publications, drugs and other entities. It summarizes the current status and capabilities of Pharos, how it can be used by different audiences, and the long term vision to improve inference across data types and generate natural language summaries.
Pharos – A Torch to Use in Your Journey In the Dark Genome (Rajarshi Guha)
This document summarizes a webinar presentation about the Illuminating the Druggable Genome (IDG) Knowledge Management Center (KMC) and its Pharos interface. The presentation discusses how Pharos provides access to integrated data on over 20,000 protein targets to help characterize well and poorly studied targets. It highlights how Pharos analyzes target data using methods like knowledge availability scoring and similarity in knowledge space to help prioritize "dark" targets lacking extensive research that may have therapeutic potential. The long-term vision is for Pharos to generate customized, semi-natural language summaries of target data to act as a biological dashboard for users.
This document provides an overview and update on Pharos, an interface for the KMC (Knowledge Management Center) that allows browsing and searching of biomedical entities. It summarizes recent updates including improved indexing, performance, documentation and visualization features. Usage statistics show increasing adoption. The long term vision is to incorporate more dependencies between data types to support inference and personalized summaries. Feedback is encouraged to further improve the tool.
The document discusses fingerprint representations of chemical structures that can be used for tasks like searching, prediction, and clustering. It provides examples of generating, reading, manipulating, and comparing fingerprints in R using the fingerprint package. Fingerprints allow efficient comparison of large collections of molecules through bit vector representations. The document also discusses using fingerprints for predictive modeling of compound properties from high-throughput screening data and analyzing results.
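The fingerprint comparisons described here (done in R with the fingerprint package) ultimately reduce to bit-vector arithmetic; the sketch below shows the Tanimoto coefficient over two toy fingerprints, written in Java purely for illustration:

    import java.util.BitSet;

    public class TanimotoSketch {
        // Tanimoto similarity = |A AND B| / |A OR B| over the set bits of two fingerprints.
        static double tanimoto(BitSet a, BitSet b) {
            BitSet and = (BitSet) a.clone();
            and.and(b);
            BitSet or = (BitSet) a.clone();
            or.or(b);
            return or.isEmpty() ? 0.0 : (double) and.cardinality() / or.cardinality();
        }

        public static void main(String[] args) {
            BitSet fp1 = new BitSet(1024);
            BitSet fp2 = new BitSet(1024);
            // Toy fingerprints: in practice these bits come from hashed structural features.
            fp1.set(3); fp1.set(17); fp1.set(250); fp1.set(901);
            fp2.set(3); fp2.set(17); fp2.set(512);

            System.out.println(tanimoto(fp1, fp2)); // 2 shared bits / 5 distinct bits = 0.4
        }
    }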
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D... (Rajarshi Guha)
The document discusses high-throughput screening of compound combinations. It describes how combination screening can provide insights into drug efficacy, resistance, toxicity, and pathways. The author details their workflow for combination screens, which involves running single agent dose responses and 6x6 or 10x10 combination matrices. They have conducted over 300 screens testing thousands of combinations against hundreds of cell lines. Challenges include quality control of large datasets and effective reporting of combination results. Network representations are proposed as one way to analyze and visualize combination screening data.
When the whole is better than the parts (Rajarshi Guha)
The document discusses high-throughput screening of drug combinations to identify synergistic interactions that could lead to increased efficacy, delayed resistance, or reduced toxicity. The author outlines their workflow for combination screening against diverse cancer cell lines and molecular libraries. Over 300 screens have been conducted to date, assessing over 1,000 drug combinations. Challenges include automated quality control of large combination datasets and effective analysis methods to rank and interpret combination responses based on multiple factors. Network representations are proposed to help analyze and visualize combination screening results.
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ... (Rajarshi Guha)
The document describes efforts to screen drug combinations in high throughput settings beyond traditional one-dimensional metrics. It discusses the infrastructure and workflows used to screen compound combinations against a library of over 2000 small molecules with diverse mechanisms of action. Quality control of combination screening experiments poses challenges due to the multi-dimensional nature of the data. The researchers are exploring various metrics and analytical approaches to characterize synergistic, additive and antagonistic combination responses across different cell lines and combinations.
Pushing Chemical Biology Through the Pipes (Rajarshi Guha)
This document discusses the BioAssay Research Database (BARD) API, which was developed to make bioassay data from the NIH Molecular Libraries Program more accessible. The BARD API uses a RESTful design to provide access to data through resources like assays, compounds, and experiments. It also has an extensibility framework that allows new functionality to be added through plugins written in Java. The document outlines how users can search, access, and extend the BARD API resources.
Characterization and visualization of compound combination responses in a hig... (Rajarshi Guha)
This document summarizes a study characterizing drug combination responses in a high-throughput screening setting. It describes the workflow used, which involves running single agent dose responses, then 6x6 matrices to identify potential synergies followed by 10x10 matrices for confirmation. Methods for visualizing and reporting the combination results are discussed, including heatmaps and clustering response surfaces. The primary focus of the study is investigating combinations of Ibrutinib, a Btk inhibitor, for treating diffuse large B-cell lymphoma.
The BioAssay Research Database (BARD) aims to enable scientists to utilize data from the Molecular Libraries Program Collection (MLPCN) to generate new hypotheses. BARD provides a platform for public data sharing and analysis through intuitive query and visualization tools accessible via a web portal or desktop client. BARD integrates data from multiple sources and centers, and aims to improve data annotation and standardization to enable more meaningful experiment descriptions and discovery. The project involves ongoing community engagement and development of new analytical tools through its open API and plugin framework.
This document discusses using cloud computing resources for cheminformatics applications. It describes how Hadoop and MapReduce can be used to perform large-scale parallel computations on chemical data and databases. Specific examples discussed include counting atoms in large datasets using MapReduce and performing substructure searches using SMARTS queries on Hadoop. The document also compares different approaches to programming Hadoop applications and how Pig Latin can simplify writing cheminformatics jobs for Hadoop.
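As a rough sketch of how a SMARTS substructure search could be phrased as a Hadoop mapper (this is not the document's actual code; it assumes one tab-separated identifier and SMILES string per input line, and uses CDK's SmartsPattern, whose package location and factory methods vary across CDK versions):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.silent.SilentChemObjectBuilder;
    import org.openscience.cdk.smarts.SmartsPattern;
    import org.openscience.cdk.smiles.SmilesParser;

    // Emits (id, smiles) for every molecule matching the SMARTS query; a reducer
    // (or an identity reducer) can then collect the hits.
    public class SmartsSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
        private SmartsPattern query;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Query taken from the job configuration; default is a benzene ring.
            String smarts = context.getConfiguration().get("smarts.query", "c1ccccc1");
            query = SmartsPattern.create(smarts);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Expected input: "<id>\t<smiles>" per line.
            String[] fields = value.toString().split("\t", 2);
            if (fields.length < 2) {
                return;
            }
            try {
                IAtomContainer mol = parser.parseSmiles(fields[1]);
                if (query.matches(mol)) {
                    context.write(new Text(fields[0]), new Text(fields[1]));
                }
            } catch (Exception e) {
                // Skip unparsable structures rather than failing the task.
                context.getCounter("smarts", "parse-errors").increment(1);
            }
        }
    }

Paired with an identity reducer, a job built around a mapper like this writes out only the identifiers and structures that contain the query substructure.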
Chemical Data Mining: Open Source & Reproducible (Rajarshi Guha)
This document discusses using the R programming environment for cheminformatics and chemical data mining. It describes how R is enhanced by open source software like the Chemistry Development Kit (CDK) which provides capabilities for working with chemical structure data within R. The CDK allows reading/writing different file formats, generating fingerprints, visualizing molecules, and accessing public databases directly from R to streamline workflows like quantitative structure-activity relationship modeling. This enables reproducible science by ensuring analyses can be updated based on the latest data.
This document discusses quantifying sentiment in text using R. It describes cleaning Twitter data, using sentiment dictionaries to score words as positive or negative, and using these scores to quantify the sentiment of tweets on a scale. It finds that most tweets are neutral and that two different scoring functions produce similar sentiment distributions and behaviors. It also explores how sentiment varies with time of day.
This document discusses using PMML (Predictive Model Markup Language) for exchanging QSAR (Quantitative Structure-Activity Relationship) models between different software. It notes that hundreds of QSAR models have been published but are difficult to reproduce. Using PMML allows building models in one program, saving them in PMML format, distributing them, and evaluating descriptors for new observations. The R packages rcdk and pmml provide cheminformatics support in R and can be used to take in molecules, output descriptors, build a model, and encode it in PMML format for exchange.
The document discusses using molecular fragments to explore large chemical spaces. It describes how fragmenting a set of bioactive molecules yields thousands of unique fragments. Fragments can be used for learning structure-activity relationships, as filters for clustering, and to explore activity profiles across targets. The document outlines setting up a database from ChEMBL to aggregate activity data by scaffold and target, which can provide insights into known activity and promiscuity for a given scaffold.
Small Molecules and siRNA: Methods to Explore Bioactivity Data (Rajarshi Guha)
This document discusses methods for exploring bioactivity data from small molecules and siRNA. It begins with background on cheminformatics methods like QSAR and machine learning approaches. It then outlines exploring structure-activity relationships using a "landscape" view and quantifying cliffs in activity. Models can be developed to predict these SAR landscapes. The document also discusses linking small molecule and siRNA screening data by looking at shared targets and pathways. It notes challenges integrating the different data types due to differences in dimensionality and resolution.
Predicting Activity Cliffs - Can Machine Learning Handle Special Cases? (Rajarshi Guha)
This document discusses using machine learning to predict "activity cliffs", which are large changes in biological activity from small changes in molecular structure. The author introduces the Structure Activity Landscape Index (SALI) to characterize cliffs numerically. Models are developed to predict SALI values for pairs of molecules based on molecular descriptors, with the goal of predicting the overall structure-activity relationship landscape. On several test datasets, the SALI prediction models perform comparably or better than models that directly predict activity, demonstrating the potential of this approach. However, the models struggle more with smaller cliffs and have a large error range for any new molecule. Domain applicability metrics may help address these limitations.
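For context, the Structure Activity Landscape Index mentioned here is commonly defined for a pair of molecules i and j as

    \mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}

where A_i and A_j are the activities and sim(i, j) is the structural similarity (for example, Tanimoto similarity over fingerprints). Pairs that are structurally very similar yet differ greatly in activity receive large SALI values and mark the cliffs in the landscape.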
R & CDK: A Sturdy Platform in the Oceans of Chemical Data (Rajarshi Guha)
R & CDK provides a platform for working with chemical data in R. It allows loading and working with molecules from files and databases using classes and methods from the Chemistry Development Kit. Key capabilities include substructure searching, visualization of 2D structures, and calculating molecular descriptors. However, it does not provide a complete interface to the CDK API and some functionality like 3D coordinates is limited on Mac OS X.
Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the ... (Rajarshi Guha)
The document discusses high throughput screening (HTS) at the NIH Center for Translation Therapeutics (NCTT). It outlines the center's capabilities for small molecule and RNAi HTS, including assay formats, detection methods, quantitative high throughput screening (qHTS) to obtain dose response curves, and associated bioinformatics activities like automated curve fitting, data integration and structure-activity relationship analysis. The goal is to enable discoveries through these HTS approaches and computational analyses.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help you do it!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary spending, for example when a person document is used instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics are covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to use it best
- Tips for common problem areas, such as team mailboxes and functional/test users
- Practical examples and best practices to put into action immediately
Skybuffer SAM4U tool for SAP license adoption, by Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, a complimentary SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
GraphRAG for Life Science to increase LLM accuracy, by Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Ocean Lotus Threat Actors project, by John Sitima (2024)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
NUnit vs XUnit vs MSTest: Differences Between These Unit Testing Frameworks, by flufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers, by akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
A Comprehensive Guide to DeFi Development Services in 2024, by Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Monitoring and Managing Anomaly Detection on OpenShift, by Tosin Akinosho
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
TrustArc Webinar - 2024 Global Privacy Survey, by TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Fueling AI with Great Data with Airbyte Webinar, by Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
1. Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha, Ph.D.
NIH Center for Translational Therapeutics
June 28, 2012
2. The cloud as infrastructure
• Cloud computing is a service for
  – Infrastructure
  – Platform
  – Software
• Many of the benefits of cloud computing are
  – Economic
  – Political
• Won't be discussing the remote hosting aspects of clouds
3. Characteristics of the cloud
[Diagram] Cloud computing is characterized by: virtual assembly, pay-per-use, offsite technology, shared workloads, massive scale, and on-demand self-service.
http://www.slideshare.net/haslinatuanhim/slides-cloud-computing
4. Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need
• Chemistry applications don't usually have very dynamic loads
• But large-scale resources are an opportunity for large-scale (parallel) computations
5. Storing chemical information
• Fill up a hard drive, mail to Amazon
• Copy over the network
  – Aspera
  – GridFTP
• Still need to pay for storage space
• Lots of options on the cloud – S3, relational DBs
• See Chris Dagdigian's talk for views on storage
http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches
6. Recoding for the cloud?
• Only if we really have to
• Large amounts of legacy code run perfectly well on local clusters
  – May not make sense to recode as a map-reduce job
  – May not be possible to recode at all
• Different levels of HPC on the cloud
  – Legacy HPC
  – 'Cloudy' HPC
  – Big Data HPC
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
7. Recoding for the cloud?
Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do
Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc.
Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
8. How does the cloud enable science?
• How does the cloud change computational chemistry, cheminformatics, …
  – The way we do them
  – The scale at which we do them
Are there problems that we can address that we could not have addressed if we didn't have on-demand, scalable cloud resources?
9. Big data & cheminformatics
• Computation over large chemical databases
  – PubChem, ChEMBL, …
• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models over large data
  – Predictions for large data
• Certain applications just need structures
• Access to correspondingly massive experimental datasets is tough (impossible?)
10. Big data & cheminformatics
• GDB-13 is a truly big database – 977 million different structures
  – Current search interface is based on NN searches using a reduced representation
  – Could be a good candidate for a Hadoop-based analysis
• More generally, enumerated virtual libraries can also lead to very big data
  – Time required to enumerate is a bottleneck
11. Big data & cheminformatics
• Fundamentally, "big chemical data" lets us explore larger chemical spaces
  – Can plow through large catalogs
  – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al]
• This can push predictive models to their limits
  – Brings us back to the global vs local arguments
12. The Hadoop ecosystem
• A framework for the map-reduce algorithm
  – Not something you can download and just run
  – Need to implement the infrastructure and then develop code to run using the infrastructure
• Low-level Hadoop programs can be large, complex and tedious
• Abstractions have been developed that make Hadoop queries more SQL-like – results in much more concise code
13. The Hadoop ecosystem
[Diagram] The ecosystem sits on Hadoop Common and the Hadoop Distributed Filesystem, with the MapReduce engine on top and related projects around it: Chukwa, ZooKeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, Hive.
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
15. Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code
• SQL-like; requires a UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin:

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

UDF for SMARTS search:

    public class SMATCH extends FilterFunc {
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }
16. Working on top of Hadoop
• Hadoop doesn't know anything about cheminformatics
  – Need to write your own code, UDFs, etc. (see the mapper sketch below)
• But application layers have been developed for other purposes
  – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
  – Possible to build virtual screening pipelines based on the Hadoop framework
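To make the "write your own code" point concrete, here is a minimal sketch (not from the original deck) of a Hadoop mapper that reads one SMILES string per line and emits a heavy-atom count, in the spirit of the atom-counting example referenced later. The class name, input layout and error handling are hypothetical; it assumes the standard org.apache.hadoop.mapreduce API and the CDK's SmilesParser.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.openscience.cdk.DefaultChemObjectBuilder;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.smiles.SmilesParser;

    // Hypothetical sketch: one SMILES per input line -> (smiles, atom count).
    public class AtomCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final SmilesParser parser =
                new SmilesParser(DefaultChemObjectBuilder.getInstance());

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String smiles = value.toString().trim();
            if (smiles.isEmpty()) return;
            try {
                IAtomContainer mol = parser.parseSmiles(smiles);
                context.write(new Text(smiles), new IntWritable(mol.getAtomCount()));
            } catch (Exception e) {
                // Skip unparseable records rather than failing the whole job.
                context.getCounter("atomcount", "bad_smiles").increment(1);
            }
        }
    }

An identity (or no) reducer is enough here; the point is that all the chemistry awareness lives in user code, while Hadoop only handles splitting the input and collecting the output.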
17. What Hadoop is not for
• Doesn't replace an actual database
• It's not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
  – e.g., CPU-bound methods and those requiring communication
18. Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
19. How big is big?
• Bryk et al performed a LBVS of 5 million compounds to identify PKR inhibitors
  – Pharmacophore fingerprints + perceptron
  – Required conformer generation
• Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn't take long
• Example: RF models built on 512-bit binary fingerprints give us predictions for 5M fingerprints in 12 min, roughly 7,000 predictions per second [single core, 3 GHz Xeon, OS X 10.6.8]
20. Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
21. Going beyond chunking?
• Applications that make use of pairwise (or higher-order) calculations could benefit from a map-reduce incarnation (see the reducer sketch below)
  – Doesn't always avoid the O(N²) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Search algorithms such as GAs and particle swarms can make use of map-reduce
  – GA-based docking
  – Feature selection for QSAR models
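As a concrete, hypothetical illustration of the pairwise case, here is a sketch of the reduce side of such a job: it assumes a mapper has already grouped candidate molecules under a shared bucket key, each value being an id and a 0/1 fingerprint string separated by a tab; the reducer then does the all-pairs Tanimoto work within each bucket. This is a sketch under those assumptions, not code from the deck.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: values are "id<TAB>bitstring" records grouped under a
    // bucket key by the (not shown) mapper; emits Tanimoto similarity per pair.
    public class PairwiseTanimotoReducer extends Reducer<Text, Text, Text, DoubleWritable> {

        private static BitSet parseBits(String bits) {
            BitSet bs = new BitSet(bits.length());
            for (int i = 0; i < bits.length(); i++)
                if (bits.charAt(i) == '1') bs.set(i);
            return bs;
        }

        private static double tanimoto(BitSet a, BitSet b) {
            BitSet and = (BitSet) a.clone();
            and.and(b);
            BitSet or = (BitSet) a.clone();
            or.or(b);
            int union = or.cardinality();
            return union == 0 ? 0.0 : (double) and.cardinality() / union;
        }

        @Override
        protected void reduce(Text bucket, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Materialize the bucket; buckets are assumed small enough for memory.
            List<String> ids = new ArrayList<String>();
            List<BitSet> fps = new ArrayList<BitSet>();
            for (Text v : values) {
                String[] fields = v.toString().split("\t");
                ids.add(fields[0]);
                fps.add(parseBits(fields[1]));
            }
            // The O(n^2) part now runs per bucket instead of over the whole dataset.
            for (int i = 0; i < fps.size(); i++)
                for (int j = i + 1; j < fps.size(); j++)
                    context.write(new Text(ids.get(i) + "," + ids.get(j)),
                            new DoubleWritable(tanimoto(fps.get(i), fps.get(j))));
        }
    }

How the mapper assigns buckets (hash blocks, bit-count bins, locality-sensitive hashing, …) is the interesting design choice; it determines how much of the O(N²) work is actually avoided.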
22. Going beyond chunking?
• Machine learning for massive chemical datasets?
  – MR jobs (descriptor generation) + Mahout (model building) let us handle this in a straightforward manner
• But will QSAR models benefit from more data?
  – Helgee et al suggest global models are preferable
  – But diversity and the structure of the chemical space will affect performance of global models
  – Unsupervised methods may be more relevant
  – Philosophical question?
23. Going beyond chunking?
• Many clustering algorithms are amenable to map-reduce style
  – K-means, spectral, EM, minhash, …
  – Many are implemented in Mahout (a k-means assignment step is sketched below)
Problems where we generate large numbers of combinations can be amenable to map-reduce
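For a flavor of what "clustering as map-reduce" looks like at the algorithmic level, below is a minimal, hypothetical sketch of the k-means assignment step; the reducer (not shown) would average the vectors assigned to each centroid to produce the next set of centroids, and the job is iterated. The configuration property name and the comma-separated input format are assumptions; Mahout's own implementations are the production route.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical k-means assignment mapper: each input line is a comma-separated
    // descriptor vector; current centroids arrive via the job configuration as
    // "x,y,...;x,y,...". Output: (index of nearest centroid, original vector).
    public class KMeansAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        private double[][] centroids;

        @Override
        protected void setup(Context context) {
            String[] specs = context.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new double[specs.length][];
            for (int i = 0; i < specs.length; i++) {
                String[] parts = specs[i].split(",");
                centroids[i] = new double[parts.length];
                for (int j = 0; j < parts.length; j++)
                    centroids[i][j] = Double.parseDouble(parts[j]);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            double[] v = new double[parts.length];
            for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);

            // Assign the point to its nearest centroid (squared Euclidean distance).
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < v.length; i++) {
                    double diff = v[i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            context.write(new IntWritable(best), value);
        }
    }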
24. Networks & integration
• Network models of molecules and targets are common
  – Allow for the incorporation of lots of associated information
  – Diseases, pathways, OTEs, … [Yildirim, M.A. et al]
• When linked with clinical data & outcomes, we can generate massive networks
  – Adverse events (FDA AERS)
  – An analysis by Cloudera considered > 10^6 drug-drug-reaction triples
25. Networks & integration
• SAR data can be viewed in a network form
  – SALI- and SARI-based networks [Peltason, L. et al; http://sali.rguha.net/]
  – Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
26. Networks & integration
• When we apply a network view we can consider many interesting applications & make use of cloud-scale infrastructure
  – Network-based similarity
  – Community detection (aka clustering) [Bauer-Mehren et al]
  – PageRank-style ranking (of targets, compounds, …)
  – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
27. Conclusions
• Cheminformatics applications can be rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• Integrating chemistry with clinical & pharmacological data can lead to big datasets
28. Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor