This document discusses attaching cloud storage to a campus grid using Parrot, Chirp, and Hadoop. The authors present a solution that bridges the Chirp distributed filesystem to Hadoop, giving jobs running on the campus grid simple access to large datasets stored in Hadoop. Chirp layers additional grid computing features on top of Hadoop, such as simple deployment without special privileges, easy access via Parrot, and strong, flexible security through access control lists. The authors evaluate the performance of connecting Parrot directly to Hadoop for better scalability versus connecting Parrot to Chirp and then to Hadoop for greater stability.
The Swiss National Supercomputing Center (CSCS) needed a centralized storage solution to handle rapidly doubling data volumes from high performance computing simulations. CSCS implemented a solution using IBM General Parallel File System (GPFS) software on IBM hardware, including System x servers and System Storage arrays. GPFS provided high scalability, compatibility across operating systems, parallel access, and failover. The solution supports massively parallel read/write operations and provides extremely high availability and scalability to meet CSCS's growing storage needs.
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand (Richard McDougall)
Elastic, Multi-tenant Hadoop on Demand! Richard McDougall, Chief Architect, Application Infrastructure and Big Data, VMware, Inc., @richardmcdougll, ApacheCon Europe, 2012. The talk broadens the application of Hadoop technology across horizontal and vertical use cases. Hadoop enables parallel processing through a programming framework for highly parallel data processing using MapReduce and the Hadoop Distributed File System (HDFS) for distributed data storage. Serengeti automates deployment of Hadoop on virtual platforms in under 30 minutes for multi-tenant, elastic Hadoop as a service.
Architecting Virtualized Infrastructure for Big Data (Richard McDougall)
This document discusses architecting virtualized infrastructure for big data. It notes that data is growing exponentially and that the value of data now exceeds hardware costs. It advocates using virtualization to simplify and optimize big data infrastructure, enabling flexible provisioning of workloads like Hadoop, SQL, and NoSQL clusters on a unified analytics cloud platform. This platform leverages both shared and local storage to optimize performance while reducing costs.
1) The document discusses using search and big data technologies to enable reflected intelligence applications through crowd sourcing.
2) It provides background on Ted Dunning and Grant Ingersoll and outlines use cases that combine search, analytics, and machine learning like social media analysis in telecom, claims analysis, and content recommendation.
3) The authors propose a reference architecture combining LucidWorks Search, MapR technologies, and other tools to build a next generation search and discovery platform for these types of reflected intelligence applications.
Hadoop clusters can be provisioned quickly and easily on virtual infrastructure using techniques like linked clones and thin provisioning. This allows Hadoop to leverage capabilities of virtualization like high availability, resource controls, and re-using spare resources. Shared storage like SAN is useful for VM images and metadata, while local disks provide scalable bandwidth for HDFS data. Virtualizing Hadoop simplifies operations and enables flexible, on-demand provisioning of Hadoop clusters.
This document proposes improving infrastructure as a service (IaaS) cloud computing by overlapping on-demand resource allocation with opportunistic provisioning of cycles from idle cloud nodes. It extends the Nimbus cloud computing toolkit to deploy backfill virtual machines on idle nodes for high throughput computing workloads. Evaluation shows backfill VMs can increase cloud infrastructure utilization from 38.5% to 100% with only 6.39% overhead for processing workloads, benefiting both cloud providers and users.
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
1) Hadoop is a framework for distributed processing of large datasets across clusters of computers using a simple programming model.
2) Virtualizing Hadoop enables rapid deployment, high availability, elastic scaling, and consolidation of big data workloads on a common infrastructure.
3) Serengeti is a tool that automates the deployment and management of Hadoop clusters on vSphere in under 30 minutes through simple commands.
Characterization of Hadoop jobs using unsupervised learning (João Gabriel Lima)
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
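As an aside, the clustering workflow described here can be sketched in a few lines. The snippet below is a hedged illustration, not the researchers' pipeline: the job metrics, their values, and the reduced cluster count are invented, and scikit-learn's KMeans stands in for whatever tooling the study actually used.

```python
# Hypothetical sketch: cluster Hadoop job metrics with k-means, in the spirit of the study above.
# The columns and numbers are illustrative, not the Yahoo production dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row: [num_map_tasks, num_reduce_tasks, input_bytes, output_bytes]
jobs = np.array([
    [10,    2,   1e8,  1e6],
    [2000,  50,  5e11, 1e9],
    [5,     1,   1e7,  5e5],
    [800,   20,  1e11, 5e8],
])

# Standardize features so byte counts do not dominate task counts.
X = StandardScaler().fit_transform(jobs)

# The paper uses 8 clusters; this tiny demo dataset only supports a couple.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster labels:", kmeans.labels_)
print("centroids (standardized units):")
print(kmeans.cluster_centers_)
```

The centroids of each cluster are what the researchers interpret as "characteristic jobs".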
This paper discusses implementing NoSQL databases for robotics applications. NoSQL databases are well-suited for robotics because they can store massive amounts of data, retrieve information quickly, and easily scale. The paper proposes using a NoSQL graph database to store robot instructions and relate them according to tasks. MapReduce processing is also suggested to break large robot data problems into parallel pieces. Implementing a NoSQL system would allow building more intelligent humanoid robots that can process billions of objects and learn quickly from massive sensory inputs.
Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference ... (EMC)
The document describes a partnership between Cisco and Greenplum to deliver optimized high-performance Hadoop reference configurations. Key elements include:
- Greenplum MR provides a high-performance distribution of Hadoop with features like direct data access, high availability, and advanced management.
- Cisco UCS is the exclusive hardware platform and provides a flexible, scalable computing platform optimized for Hadoop workloads.
- The Cisco Greenplum MR Reference Configuration combines these software and hardware components into an integrated solution for running Hadoop and big data analytics workloads.
This document summarizes the 25 most promising open source projects of 2010 according to Bruno Michel of af83. It provides descriptions and advantages for several key-value stores, document databases, distributed databases, workqueue services, and configuration management systems, including Redis, MongoDB, Riak, Cassandra, Resque, Beanstalkd, and Puppet.
This document discusses Facebook's deployment of Hadoop and HBase to support real-time applications at massive scale. It describes how Facebook Messages, Insights, and other applications require high throughput writes, large datasets, and low-latency reads. The document outlines why Hadoop and HBase were chosen over other systems to meet these needs, including elasticity, consistency, availability, and fault tolerance. It also describes enhancements made to HDFS and HBase to optimize for Facebook's workloads.
The Google File System is a scalable distributed file system designed to meet the rapidly growing data storage needs of Google. It provides fault tolerance on inexpensive commodity hardware and high aggregate performance to large numbers of clients. Key aspects of its design include handling frequent component failures as the norm, managing huge files up to multiple gigabytes in size containing many objects, optimizing for file appending and sequential reads of appended data, and co-designing the file system interface to increase flexibility for applications. The largest deployment to date includes over 1,000 storage nodes providing hundreds of terabytes of storage.
The document discusses virtualizing Hadoop clusters on VMware vSphere. It describes how Hadoop enables parallel processing of large datasets across clusters using MapReduce. Virtualizing Hadoop provides benefits like simple operations, high availability, and elastic scaling. The document outlines challenges with using Hadoop and how virtualization addresses them. It provides examples of deploying Hadoop clusters on Serengeti and configuring different distributions. Performance results show little overhead from virtualization and benefits of local storage. Joint engineering with Hortonworks adds high availability to Hadoop master daemons using vSphere features.
Pivotal: Virtualize Big Data to Make the Elephant Dance (EMC)
Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential of bringing the two together has not been fully realized. In this session, learn how virtualization brings the advantages of greater elasticity, stronger isolation for multi-tenancy, and one-click HA protection to Hadoop, while maintaining performance comparable to Hadoop on physical machines.
After this session you will be able to:
Objective 1: Understand the benefits of virtualizing Hadoop.
Objective 2: Understand how to get started with Pivotal HD Hadoop.
Objective 3: Understand where to find more information.
1) Running Hadoop on VMs provides advantages like easier cluster management, ability to consolidate clusters on spare resources, and more elastic scaling of clusters.
2) Separating Hadoop compute and data nodes into different VMs allows truly elastic scaling of clusters.
3) Hortonworks is working with VMware to provide first class support for running Hadoop on VMs, including high availability features and optimizations for performance.
CERN's IT infrastructure is reaching its limits and needs to expand to support increasing computing capacity demands while maintaining a fixed staff size. CERN is addressing this by expanding its data center capacity through a new remote facility in Budapest, Hungary, and by adopting new open source configuration, monitoring and infrastructure tools to improve efficiency. Key projects include deploying OpenStack for infrastructure as a service, Puppet for configuration management, and integrating monitoring across tools. The transition will take place between 2012-2014 alongside LHC upgrades.
This document discusses cache and consistency in NoSQL databases. It introduces distributed caching using Memcached to improve performance and reduce load on database servers. It discusses using consistent hashing to partition and replicate data across servers while maintaining consistency. Paxos is presented as an efficient algorithm for maintaining consistency during updates in a distributed system in a more flexible way than traditional 2PC and 3PC approaches.
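To make the consistent-hashing idea concrete, here is a minimal sketch of a hash ring such as a Memcached client might use to pick a cache server for each key. It is a generic illustration under assumed server names and virtual-node count, not Memcached's or any particular client library's implementation.

```python
# Minimal consistent-hashing ring: a key maps to the first server clockwise from its
# hash position, so adding or removing a server only remaps the keys nearest to it.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        # vnodes: virtual nodes per server, to spread load more evenly around the ring.
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_server(self, key):
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a:11211", "cache-b:11211", "cache-c:11211"])
for k in ["user:42", "session:abc", "profile:99"]:
    print(k, "->", ring.get_server(k))
```

The replication and update-consistency aspects the document covers (Paxos versus 2PC/3PC) sit on top of this partitioning scheme and are not shown here.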
This document provides an overview of Watson, an IBM computer system designed to answer questions posed in natural language. It summarizes that Watson uses IBM's DeepQA architecture to analyze natural language questions and evidence from 200 million pages of text to generate hypotheses and answer questions. Watson competed successfully on the Jeopardy! quiz show in 2011, answering questions faster and as accurately as human experts. The document describes how Watson harnesses massive parallel processing across 2880 POWER7 cores in a cluster of 90 servers to perform the complex analytics required for natural language question answering at scale.
IBM Watson Jeopardy! white paper which explains Watson’s workload optimised system design based on IBM DeepQA architecture and POWER7® processor-based servers.
Virtual Machines are a mainstay in the enterprise. Apache Hadoop is normally run on bare machines. This talk walks through the convergence and the use of virtual machines for running Apache Hadoop. We describe the results from various tests and benchmarks which show that the overhead of using VMs is small, a small price to pay for the advantages offered by virtualization. The second half of the talk compares multi-tenancy with VMs versus multi-tenancy with Hadoop's Capacity Scheduler. We follow on with a comparison of resource management in vSphere and the finer-grained resource management and scheduling in NextGen MapReduce. NextGen MapReduce supports a general notion of a container (such as a process, JVM, or virtual machine) in which tasks are run. We compare the role of such first-class VM support in Hadoop.
This document discusses the design and implementation of a network device driver in Linux using NAPI (New API) to improve performance. It begins with an introduction to network device drivers and challenges with high interrupt loads. It then describes NAPI and how it uses polling instead of interrupts to process packets. The rest of the document provides details on the specific NAPI implementation for an ARM920T processor, including advantages like reduced interrupt processing and packet dropping. It evaluates the performance improvement from using NAPI during high packet loads. In summary, NAPI is a technique for network device drivers to improve Linux performance under heavy network traffic by reducing interrupt processing and using polling.
NVIDIA DEEP LEARNING INFERENCE PLATFORM PERFORMANCE STUDY
| TECHNICAL OVERVIEW
Introduction
Artificial intelligence (AI), the dream of computer scientists for over half a century, is no longer science fiction—it is already transforming every industry. AI is the use of computers to simulate human intelligence. AI amplifies our cognitive abilities—letting us solve problems where the complexity is too great, the information is incomplete, or the details are too subtle and require expert training.

While the machine learning field has been active for decades, deep learning (DL) has boomed over the last five years. In 2012, Alex Krizhevsky of the University of Toronto won the ImageNet image recognition competition using a deep neural network trained on NVIDIA GPUs—beating all the human expert algorithms that had been honed for decades. That same year, recognizing that larger networks can learn more, Stanford's Andrew Ng and NVIDIA Research teamed up to develop a method for training networks using large-scale GPU computing systems. These seminal papers sparked the "big bang" of modern AI, setting off a string of "superhuman" achievements. In 2015, Google and Microsoft both beat the best human score in the ImageNet challenge. In 2016, DeepMind's AlphaGo recorded its historic win over Go champion Lee Sedol and Microsoft achieved human parity in speech recognition.

GPUs have proven to be incredibly effective at solving some of the most complex problems in deep learning, and while the NVIDIA deep learning platform is the standard industry solution for training, its inferencing capability is not as widely understood. Some of the world's leading enterprises from the data center to the edge have built their inferencing solution on NVIDIA GPUs. Some examples include:
Today's data centers are being asked to do more with little additional financial support forthcoming. They are under a constant barrage to do things faster, more inexpensively, and with no disruption to the revenue-generating part of the company. Frequently, they are experiencing 24x7 operations, numerous new application deployments, and explosive data growth. Data storage often becomes a crucial limiting factor in meeting these stringent demands.
Faced with these seemingly insurmountable challenges, CIOs are discovering that the old way of responding to application and data growth is unacceptable. That is, the outdated "rip and replace" method to improve capacity and IO performance, which often meant disruptive migration, doesn't work anymore. Instead, a more nimble alternative is needed. In response to this need for storage agility, NetApp has recently released a new version of their storage operating environment called Data ONTAP® 8.1. This new software update successfully eliminates many problems of typical monolithic or legacy storage systems.
Hadoop, as we know, is a Java-based, massively scalable, distributed framework for processing large data (several petabytes) across a cluster of thousands of commodity computers.
The Hadoop ecosystem has grown over the last few years and there is a lot of jargon in terms of tools as well as frameworks.
Many organizations are investing and innovating heavily in Hadoop to make it better and easier. The mind map on the next slide should be useful to get a high-level picture of the ecosystem.
The report discusses the key components and objectives of HDFS, including data replication for fault tolerance, HDFS architecture with a NameNode and DataNodes, and HDFS properties like large data sets, write once read many model, and commodity hardware. It provides an overview of HDFS and its design to reliably store and retrieve large volumes of distributed data.
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets. It discusses what Hadoop is, its applications and architecture, advantages like ability to store and process huge amounts of data quickly in a fault-tolerant and flexible manner. The document also outlines some disadvantages of Hadoop like security concerns. It describes key Hadoop components like HDFS, MapReduce, YARN and popular related projects. Finally, it provides guidance on when Hadoop use is appropriate, such as for large data aggregation and ETL operations.
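As a quick illustration of the MapReduce model referenced throughout these summaries, the following is a single-process, pure-Python analogy of the map, shuffle, and reduce phases (a word count). It is a teaching sketch of the programming model only, not Hadoop code.

```python
# Single-process analogy of MapReduce: map emits (key, 1) pairs,
# shuffle groups values by key, reduce sums the counts per key.
from collections import defaultdict

documents = ["hadoop stores big data", "hadoop processes big data in parallel"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'hadoop': 2, 'big': 2, 'data': 2, ...}
```

In real Hadoop the map and reduce phases run as distributed tasks over blocks stored in HDFS, with YARN handling resource scheduling.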
This document discusses virtualizing Hadoop deployments. It describes how Hadoop has become a standard for storing and analyzing large datasets economically. Virtualization technologies have enabled consolidating diverse workloads onto fewer servers. The document examines considerations for virtualizing Hadoop, including that it is best suited for test/development and multi-tenant environments where workloads vary or isolation is needed. Performance impacts like oversubscription and shared infrastructure must be planned for. Virtualized Hadoop can provide benefits but dedicated environments are still best for production use cases.
Hadoop is an open source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle data at massive scale, Hadoop exploits the Hadoop Distributed File System, termed HDFS. Like most distributed file systems, HDFS shares a familiar problem of data sharing and availability among compute nodes, which often leads to decreased performance. This paper is an experimental evaluation of Hadoop's computing performance, made by designing a rack-aware cluster that utilizes Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit the scenario. To validate this, experiments were conducted on a rack-aware cluster setup, which significantly reduced task completion time; however, once the volume of data being processed increases, there is a considerable cutback in computational speed due to the update cost. Further, the threshold level for the balance between update cost and replication factor is identified and presented graphically.
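To illustrate the access-count prediction step, here is a minimal sketch of Lagrange interpolation applied to one block's access history. The sample history and the replication threshold are invented for illustration and are not taken from the paper.

```python
# Sketch: predict the next access count of a data block with Lagrange interpolation,
# then decide whether to raise its replication factor. Numbers are illustrative only.
def lagrange_predict(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Observed access counts of one block over time periods 1..4 (hypothetical).
periods = [1, 2, 3, 4]
accesses = [3, 5, 9, 15]

predicted = lagrange_predict(periods, accesses, 5)
print(f"predicted accesses in period 5: {predicted:.1f}")

REPLICATION_THRESHOLD = 20  # assumed threshold, not the paper's value
if predicted > REPLICATION_THRESHOLD:
    print("hot block: increase replication factor")
else:
    print("keep default replication factor")
```

The trade-off the paper quantifies is exactly the one this toy decision hides: raising the replication factor improves availability but increases the update cost as data volumes grow.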
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins... (EMC)
Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how you can manage it.
After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.
Objective 3: Have a view of the tools necessary to solve the challenges of managing a large Hadoop cluster.
The document discusses analyzing temperature data using Hadoop MapReduce. It describes importing a weather dataset from the National Climatic Data Center into Eclipse to create a MapReduce program. The program will classify days in the Austin, Texas data from 2015 as either hot or cold based on the recorded temperature. The steps outlined are: importing the project, exporting it as a JAR file, checking that the Hadoop cluster is running, uploading the input file to HDFS, and running the JAR file with the input and output paths specified. The goal is to analyze temperature variation and find the hottest/coldest days of the month/year from the large climate dataset.
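The project described above is a Java MapReduce program exported as a JAR. Purely to illustrate the same hot/cold classification logic, here is a hedged Hadoop Streaming-style mapper in Python; the tab-separated input format and the temperature threshold are assumptions, not details of the original NCDC dataset.

```python
# mapper.py -- hedged Hadoop Streaming sketch (not the original Java project).
# Input lines are assumed to be "YYYY-MM-DD<TAB>max_temp_F"; emits "date<TAB>HOT/COLD".
import sys

HOT_THRESHOLD_F = 77.0  # assumed cutoff for labelling a day as "hot"

for line in sys.stdin:
    parts = line.strip().split("\t")
    if len(parts) != 2:
        continue  # skip malformed records
    date, temp = parts
    try:
        label = "HOT" if float(temp) >= HOT_THRESHOLD_F else "COLD"
    except ValueError:
        continue  # skip records with non-numeric temperatures
    print(f"{date}\t{label}")
```

A script like this could, in principle, be run as a map-only Hadoop Streaming job against the HDFS input and output paths mentioned above, but the workflow the document describes uses the exported JAR directly.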
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID at a server level).

What is Hadoop? Suppose your data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB. And then 100 GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10 TB, and then 100 TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data; it complements OnLine Transaction Processing and OnLine Analytical Processing.
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI... (cscpconf)
Cloud has been a computational and storage solution for many data-centric organizations. The problem those organizations face today with the cloud is searching data in an efficient manner. A framework is required to distribute the work of searching and fetching across thousands of computers. The data in HDFS is scattered and takes a lot of time to retrieve. The major idea is to design a web server in the map phase using the Jetty web server, which gives a fast and efficient way of searching data in the MapReduce paradigm. For real-time processing on Hadoop, a searchable mechanism is implemented in HDFS by creating a multilevel index in the web server with multi-level index keys. The web server is used to handle traffic throughput. With web clustering technology we can improve application performance. To keep the work down, the load balancer should automatically be able to distribute load to newly added nodes in the server.
This document describes a proposed architecture for improving data retrieval performance in a Hadoop Distributed File System (HDFS) deployed in a cloud environment. The key aspects are:
1) A web server would replace the map phase of MapReduce to provide faster searching of data. The web server uses multi-level indexing for real-time processing on HDFS.
2) An Apache load balancer distributes requests across backend application servers to improve throughput and scalability.
3) The NameNode is divided into master and slave servers, with the master containing the multi-level index and slaves storing data and lower-level indexes. This allows distributed data retrieval.
Integrating DBMSs as a read-only execution layer into Hadoop (João Gabriel Lima)
The document discusses integrating database management systems (DBMSs) as a read-only execution layer into Hadoop. It proposes a new system architecture that incorporates modified DBMS engines augmented with a customized storage engine capable of directly accessing data from the Hadoop Distributed File System (HDFS) and using global indexes. This allows DBMS to efficiently provide read-only operators while not being responsible for data management. The system is designed to address limitations of HadoopDB by improving performance, fault tolerance, and data loading speed.
This document discusses security issues with Hadoop and available solutions. It identifies vulnerabilities in Hadoop including lack of authentication, unsecured data in transit, and unencrypted data at rest. It describes current solutions like Kerberos for authentication, SASL for encrypting data in motion, and encryption zones for encrypting data at rest. However, it notes limitations of encryption zones for processing encrypted data efficiently with MapReduce. It proposes a novel method for large scale encryption that can securely process encrypted data in Hadoop.
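As a generic illustration only (not the novel method proposed in the document, and not HDFS encryption zones), data can be encrypted on the client before it is written to HDFS. The sketch below uses the `cryptography` package's Fernet recipe; the file names are hypothetical.

```python
# Hedged sketch: encrypt a file on the client before uploading it to HDFS,
# so data at rest is unreadable without the key. Not the document's proposed scheme.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice the key would come from a key management service
cipher = Fernet(key)

with open("records.csv", "rb") as f:       # hypothetical local input file
    ciphertext = cipher.encrypt(f.read())

with open("records.csv.enc", "wb") as f:   # this encrypted file is what gets uploaded to HDFS
    f.write(ciphertext)

# Reading the data back later requires the same key:
plaintext = cipher.decrypt(ciphertext)
```

As the document notes, data encrypted this way cannot be processed efficiently by MapReduce without first decrypting it, which is exactly the limitation the proposed large-scale encryption method targets.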
Presentation on cloud computing security issues using HADOOP and HDFS ARCHITE... (Pushpa)
We discuss security issues for cloud computing and present a layered framework for secure clouds, then focus on two of the layers, i.e., the storage layer and the data layer. In particular, we discuss a scheme for secure third-party publication of documents in a cloud. Next, we discuss secure federated query processing with MapReduce and Hadoop, and the use of secure co-processors for cloud computing. Finally, we discuss an XACML implementation for Hadoop and the belief that building trusted applications from untrusted components will be a major aspect of secure cloud computing.
Fundamentals of Big Data, Hadoop project design, and a case study or use case.
General planning considerations and the most essential parts of the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with Wi-Fi log analysis as a real-life use case.
M. Florence Dayana - Hadoop Foundation for Analytics.pptx (Dr. Florence Dayana)
Hadoop Foundation for Analytics
History of Hadoop
Features of Hadoop
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Eco Projects
Essential of Hadoop ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
This presentation provides an overview of big data concepts and Hadoop technologies. It discusses what big data is and why it is important for businesses to gain insights from massive data. The key Hadoop technologies explained include HDFS for distributed storage, MapReduce for distributed processing, and various tools that run on top of Hadoop like Hive, Pig, HBase, HCatalog, ZooKeeper and Sqoop. Popular Hadoop SQL databases like Impala, Presto and Stinger are also compared in terms of their performance and capabilities. The document discusses options for deploying Hadoop on-premise or in the cloud and how to integrate Microsoft BI tools with Hadoop for big data analytics.
- Big data refers to large sets of data that businesses and organizations collect, while Hadoop is a tool designed to handle big data. Hadoop uses MapReduce, which maps large datasets and then reduces the results for specific queries.
- Hadoop jobs run under five main daemons: the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.
- HDFS is Hadoop's distributed file system that stores very large amounts of data across clusters. It replicates data blocks for reliability and provides clients high-throughput access to files.
Presented by: Rahul Sharma
B-Tech (Cloud Technology & Information Security)
2nd Year 4th Sem.
Poornima University (I.Nurture), Jaipur
www.facebook.com/rahulsharmarh18
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. Since the amount of data collected and analyzed in enterprises has increased several-fold in volume, variety, and velocity of generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of the new class of system called Hadoop that was built to handle Big Data.
This paper summarizes the design, development, and deployment of YARN (Yet Another Resource Negotiator), the next generation compute platform for Apache Hadoop. YARN decouples the programming model from the resource management infrastructure, allowing multiple programming frameworks like MapReduce, Dryad, Giraph, and Spark to run on top of it. This separation of concerns improves scalability, efficiency, and flexibility compared to the original Hadoop architecture. The authors provide experimental evidence of these improvements and discuss real-world deployments of YARN at Yahoo and other companies.
Similar to Attaching cloud storage to a campus grid using parrot, chirp, and hadoop (20)
The document discusses the steps of a data analysis project and provides a case study example. The key steps are:
1) Defining the business problem and purpose of the analysis.
2) Choosing and preparing appropriate data sources.
3) Applying relevant techniques such as data preparation, pattern recognition, and data analysis.
4) Evaluating the results and delivering value or insights to address the original business problem.
The case study examines building a neuromarketing tool using AI to predict areas of visual attention in images and memory retention. Pattern recognition techniques are trained on labeled datasets to help identify these patterns autonomously.
We are a company that delivers value to our customers by lowering costs with digital marketing and increasing the efficiency of campaigns and their conversions. Using the most advanced artificial intelligence models from a neuromarketing perspective, we have been able to predict the effectiveness of a marketing campaign before it is published. After its publication, we evaluate the campaign, segmenting the audience according to the pattern extracted from each market segment, delivering information for strategic and efficient management.
High-Performance Applications with JHipster Full Stack (João Gabriel Lima)
A presentation on the JHipster framework for building full-stack Java applications. The framework provides generators, scaffolding, and structures for creating scalable web applications and APIs with Spring Boot and Angular/React. The document discusses JHipster's architecture, project generation, folder structure, debugging, production, and usage tips.
The document discusses augmented reality with React Native and ARKit. It presents examples of impressive augmented reality applications and lists the requirements for using ARKit on iOS. It explains how to start an augmented reality project with React Native and ARKit, including how to create the application and link its dependencies.
The document discusses the concepts of Big Data, Artificial Intelligence, and Machine Learning. It presents the main tools and techniques in these areas, including deep neural networks, clustering, linear regression, and random forests. It also addresses the importance of these technologies for strategic decision-making and generating knowledge from data.
The regression model is then used to predict the value of an unknown dependent variable, given the values of the independent variables.
In this lesson, I show a step-by-step theoretical and practical approach to performing linear regression using WEKA.
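As a parallel to the WEKA walkthrough described above, here is a hedged scikit-learn sketch of the same ordinary least squares idea; the data are invented purely to show the workflow, and scikit-learn stands in for WEKA.

```python
# Hedged sketch: ordinary least squares regression with scikit-learn instead of WEKA.
# The house-style data below are made up purely to show the fit/predict workflow.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [size_m2, bedrooms]; target: price in thousands.
X = np.array([[50, 1], [70, 2], [90, 3], [120, 3], [150, 4]])
y = np.array([110, 150, 190, 240, 300])

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)   # estimated effect of size and bedrooms
print("intercept:", model.intercept_)
print("predicted price for 100 m2, 3 bedrooms:",
      model.predict(np.array([[100, 3]]))[0])
```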
The document presents case studies on internet security conducted by Professor João Gabriel Lima, including the attack on the Ashley Madison website, malvertising attacks hiding malware in the pixels of advertising banners, and the large 2016 DDoS attack against Dyn's servers that caused instability across many sites and services.
The document describes several types of internet security threats, including viruses, worms, adware, phishing, and denial-of-service attacks. The author, Prof. João Gabriel Lima, provides concise definitions of each threat, from basic concepts to advanced techniques used by crackers.
The document presents an introduction to machine learning with JavaScript, discussing artificial intelligence and machine learning concepts, example applications, tools, and the challenges of implementing machine learning on the web with JavaScript.
Data Mining with RapidMiner - A Case Study on Churn Rate in... (João Gabriel Lima)
In this talk, we walk through a step-by-step approach to building a classification model that identifies the patterns of customers of a telephone company who cancelled the service, so that the carrier can predict the risk of churn and start working to prevent it.
Data Mining with RapidMiner + WEKA - Clustering (João Gabriel Lima)
In this presentation, I give a practical step-by-step guide on how to cluster data and, more importantly, how to interpret the results and apply them to support decision-making.
At the end there is a very interesting practice exercise that gives us the opportunity to apply the knowledge acquired.
jgabriel.ufpa@gmail.com
Data Mining in Practice with RapidMiner and Weka (João Gabriel Lima)
The document presents an introduction to linear regression using the WEKA data mining software. It explains what data mining and regression are, how to load and format data in WEKA, how to create a linear regression model to predict house prices based on variables such as size and number of bedrooms, and how to interpret the model's results.
In this presentation I present both architectures and show that, rather than choosing between one and the other, we can take the best of each and use them in a clean, simple, and objective way.
Game of data - Prediction and Analysis of the Game of Thrones series using... (João Gabriel Lima)
In this presentation I show a study conducted by the University of Munich that aims to predict the probability of a character dying in the next season based on 24 pre-selected characteristics.
The document discusses the IPVA Cidadão Mobile app, which allows citizens to pay the IPVA (Brazilian motor vehicle property tax) on their mobile devices. The app is available for the states of Ceará and Mato Grosso and allows payments to be made quickly and conveniently.
[Estácio - IESAM] Automating Tasks with Gulp.js (João Gabriel Lima)
The document describes how to automate tasks with the Gulp.js tool. It explains that Gulp helps automate repetitive tasks such as concatenating files, minifying, and running tests. It also provides examples of how to use Gulp to run JavaScript tests, minify HTML, CSS, and JavaScript, and optimize images. It recommends useful preprocessors and plugins and encourages exploring more Gulp features.
The document discusses how JavaScript can be used to connect Internet of Things devices, such as home appliances and smart clothing. It presents JavaScript as a language widely used on the Internet with many libraries and frameworks useful for IoT. It also lists some IoT projects built with JavaScript and areas where it can be applied, such as smart cities and agriculture.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Project Management Semester Long Project - Acuityjpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
…performance, but a less robust system. By connecting Parrot to Chirp and then to Hadoop, we achieve a much more stable system, but with a modest performance penalty. The final section of the paper evaluates the scalability of our solution on both local area and campus area networks.

2 Background

The following is an overview of the systems we are trying to connect to allow campus grid users access to scalable distributed storage. First, we examine the Hadoop filesystem and provide a summary of its characteristics and limitations. Throughout the paper we refer to the Hadoop filesystem as HDFS or simply Hadoop. Next, we introduce the Chirp filesystem, which is used at the University of Notre Dame and other research sites for grid storage. Finally, we describe Parrot, a virtual filesystem adapter that provides applications transparent access to remote storage systems.

2.1 Hadoop

Hadoop [8] is an open source large scale distributed filesystem modeled after the Google File System [7]. It provides extreme scalability on commodity hardware, streaming data access, built-in replication, and active storage. Hadoop is well known for its extensive use in web search by companies such as Yahoo!, as well as by researchers for data intensive tasks like genome assembly [10].

A key feature of Hadoop is its resiliency to hardware failure. The designers of Hadoop expect hardware failure to be normal during operation rather than exceptional [3]. Hadoop is a "rack-aware" filesystem that by default replicates all data blocks three times across storage components to minimize the possibility of data loss.

Large data sets are a hallmark of the Hadoop filesystem. File sizes are anticipated to range from hundreds of gigabytes to terabytes, and users may create millions of files in a single data set and still expect excellent performance. The file coherency model is "write once, read many": a file that is written and then closed cannot be written to again. In the future, Hadoop is expected to provide support for file appends. Generally, users do not need support for random writes to a file in everyday applications.
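As a brief illustration of the write-once model using the standard Hadoop shell (file and path names here are hypothetical):

$ hadoop fs -put results.dat /datasets/results.dat    # first write succeeds
$ hadoop fs -put results.dat /datasets/results.dat    # fails: a closed file cannot be rewritten
$ hadoop fs -rm /datasets/results.dat                 # the file must be removed before writing it again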
Despite many attractive properties, Hadoop has several essential problems that make it difficult to use from a campus grid environment:

• Interface. Hadoop provides a native Java API for accessing the filesystem, as well as a POSIX-like C API, which is a thin wrapper around the Java API. In principle, applications could be re-written to access these interfaces directly, but this is a time consuming and intrusive change that is unattractive to most users and developers. A FUSE [1] module allows unmodified applications to access Hadoop through a user-level filesystem driver, but this method still suffers from the remaining three problems:

• Deployment. To access Hadoop by any method, it is necessary to install a large amount of software, including the Java virtual machine, and resolve a number of dependencies between components. Using the FUSE module requires administrator access to install the FUSE kernel modules, install a privileged executable, and update the local administrative group definitions to permit access. On a campus grid consisting of thousands of borrowed machines, this level of access is simply not practical.

• Authentication. Hadoop was originally designed with the assumption that all of its components were operating on the same local area, trusted network. As a result, there is no true authentication between components – when a client connects to Hadoop, it simply claims to represent principal X, and the server component accepts that assertion at face value. On a campus-area network, this level of trust is inappropriate, and a stronger authentication method is required.

• Interoperability. The communication protocols used between the various components of Hadoop are also tightly coupled, and change with each release of the software as additional information is added to each interaction. This presents a problem on a campus grid with many simultaneous users, because applications may run for weeks or months at a time, requiring upgrades of both clients and servers to be done independently and incrementally. A good solution requires both forwards and backwards compatibility.

2.2 Chirp

Chirp [15] is a distributed filesystem used in grid computing for easy user-level export and access to file data. It is a regular user-level process that requires no privileged access during deployment. The user simply specifies a directory on the local filesystem to export for outside access. Chirp offers flexible and robust security through per-directory ACLs and various strong authentication modes. Figure 1a provides a general overview of the Chirp distributed filesystem and how it is used.

A Chirp server is set up using a simple command as a regular user. The server is given the root directory as an option to export to outside access. Users can authenticate to the server using various schemes including the Grid Security Infrastructure (GSI) [6], Kerberos [11], or hostnames. Access to data is controlled by a per-directory ACL system using these authentication credentials.
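A minimal sketch of that workflow follows (the directory, hostname, and ACL subject are hypothetical, and the exact options and ACL syntax may differ between cctools releases):

$ chirp_server -r /home/alice/export -p 9094                 # export a directory as an ordinary user
$ chirp localhost:9094 setacl / hostname:*.cse.nd.edu rl     # grant read and list rights to campus hosts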
Figure 1: (a) The Chirp Filesystem. (b) The Chirp Filesystem Bridging Hadoop. Applications can use Parrot, FUSE, or libchirp directly to communicate with a Chirp server on the client side. With the modifications presented in this paper, Chirp can now multiplex between a local Unix filesystem and a remote storage service such as Hadoop.
A Chirp server requires neither special privileges nor kernel modification to operate. Chirp can be deployed to a grid by a regular user to operate on personal data securely. We use Chirp at the University of Notre Dame to support workflows that require a personal file bridge to expose filesystems such as AFS [9] to grid jobs. Chirp is also used for scratch space that harnesses the extra disk space of various clusters of machines. One can also use Chirp as a cluster filesystem for greater I/O bandwidth and capacity.

2.3 Parrot

To enable existing applications to access distributed filesystems such as Chirp, the Parrot [12] virtual filesystem adapter is used to redirect I/O operations for unmodified applications. Normally, applications would need to be changed to allow them to access remote storage services such as Chirp or Hadoop. With Parrot attached, unmodified applications can access remote distributed storage systems transparently. Because Parrot works by trapping program system calls through the ptrace debugging interface, it can be deployed and installed without any special system privileges or kernel changes.

One service Parrot supports is the Chirp filesystem previously described. Users may connect to Chirp using a variety of methods. For instance, the libchirp library provides an API that manages connections to a Chirp server and allows users to construct custom applications. However, a user will generally use Parrot to transparently connect existing code to a Chirp filesystem without having to interface directly with libchirp. While applications may also use FUSE to remotely mount a Chirp fileserver, Parrot is more practical in a campus grid system where users may not have the necessary administrative access to install and run the FUSE kernel module. Because Parrot runs in userland and has minimal dependencies, it is easy to distribute and deploy in restricted campus grid environments.
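For example, an unmodified tool can be pointed at Chirp data purely by path. Here is a minimal sketch using the parrot_run wrapper and the /chirp/<server>/<path> namespace from cctools (the server and file names are hypothetical):

$ parrot_run cp /chirp/storage.cse.nd.edu/mydata/input.dat .
$ parrot_run tcsh                      # or start a whole shell in which /chirp paths are visible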
3 Design and Implementation

This section describes the changes we have made to implement a usable and robust system that allows applications to transparently access the Hadoop filesystem. The first approach is to connect Parrot directly to HDFS, which provides scalable performance, but has some limitations. The second approach is to connect Parrot to Chirp and thereby Hadoop, which allows us to overcome these limitations with a modest performance penalty.

3.1 Parrot + HDFS

Our initial attempt at integrating Hadoop into our campus grid environment was to utilize the Parrot middleware filesystem adapter to communicate with our test Hadoop cluster. To do this we used libhdfs, a C/C++ binding for HDFS provided by the Hadoop project, to implement a HDFS service for Parrot. As with other Parrot services, the
HDFS Parrot module (parrot hdfs) allows unmodified end user applications to access remote filesystems, in this case HDFS, transparently and without any special privileges. This is different from tools such as FUSE, which require kernel modules and thus administrative access. Since Parrot is a static binary and requires no special privileges, it can easily be deployed around our campus grid and used to access remote filesystems.

Figure 2: In the parrot hdfs tool, we connect Parrot directly to Hadoop using libhdfs, which allows client applications to communicate with a HDFS service.

Figure 2 shows the architecture of the parrot hdfs module. Applications run normally until an I/O operation is requested. When this occurs, Parrot traps the system call using the ptrace system interface and reroutes it to the appropriate filesystem handler. In the case of Hadoop, any I/O operation on the Hadoop filesystem is implemented using the libhdfs library. Of course, the trapping and redirection of system calls incurs some overhead, but as shown in a previous paper [4], the parrot hdfs module is quite serviceable and provides adequate performance.
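For example, a job wrapped in Parrot can read a file out of HDFS by path alone. This is a sketch assuming the /hdfs/<host>:<port>/ naming convention for the Parrot HDFS service, with hypothetical host and file names, and it presumes JAVA_HOME and CLASSPATH already point at a local Java and Hadoop installation:

$ parrot_run sh -c 'cp /hdfs/headnode.hadoop:9100/datasets/genome.fa .'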
Unfortunately, the parrot hdfs module was not quite a perfect solution for integrating Hadoop into our campus infrastructure. First, while Parrot is lightweight and has minimal dependencies, libhdfs, the library required for communicating with Hadoop, does not. In fact, to use libhdfs we must have access to a whole Java virtual machine installation, which is not always possible in a heterogeneous campus grid with multiple administrative domains.

Second, only sequential writing was supported by the module. This is because Hadoop, by design, is a write-once-read-many filesystem and thus is optimized for streaming reads rather than frequent writes. Moreover, due to the instability of the append feature in the version of Hadoop deployed in our cluster, it was also not possible to support appends to existing files. Random and sequential reads, however, were fully supported. The parrot hdfs module did not provide a means of overcoming this limitation.

Finally, the biggest impediment to using both Hadoop and the parrot hdfs module was the lack of proper access controls. Hadoop's permissions can be entirely ignored as there is no authentication mechanism. The filesystem accepts whatever identity the user presents to Hadoop. So while a user may partition the Hadoop filesystem for different groups and users, any user may report themselves as any identity, thus nullifying the permission model. Since parrot hdfs simply provides the underlying access control system, this problem is not mitigated in any way by the service. For a small isolated cluster filesystem this is not a problem, but for a large campus grid, such a lack of permissions is worrisome and a deterrent for wide adoption, especially if sensitive research data is being stored.

So although parrot hdfs is a useful tool for integrating Hadoop into a campus grid environment, it has some limitations that prevent it from being a complete solution. Moreover, Hadoop itself has limitations that need to be addressed before it can be widely deployed and used in a campus grid.

3.2 Parrot + Chirp + HDFS

To overcome these limitations in both parrot hdfs and Hadoop, we decided to connect Parrot to Chirp and then Hadoop. In order to accomplish this, we had to implement a new virtual multiplexing interface in Chirp to allow it to use any backend filesystem. With this new capability, a Chirp server could act as a frontend to potentially any other filesystem, such as Hadoop, which is not usually accessible with normal Unix I/O system calls.

In our modified version of Chirp, the new virtual multiplexing interface routes each I/O Remote Procedure Call (RPC) to the appropriate backend filesystem. For Hadoop, all data access and storage on the Chirp server is done through Hadoop's libhdfs C library. Figure 1b illustrates how Chirp operates with Hadoop as a backend filesystem.

With our new system, a simple command line switch changes the backend filesystem during initialization of the server. The command to start the Chirp server remains simple to start and deploy:

$ chirp_server -f hdfs -x headnode.hadoop:9100

Clients that connect to a Chirp server such as the one above will work over HDFS instead of on the server's local filesystem. The head node of Hadoop and the port it listens on are specified using the -x <host>:<port> switch. The root of the Hadoop filesystem exported by Chirp is changed via the -r switch. Because Hadoop and Java require multiple inconvenient environment variables to be set up prior to execution, we have also
provided the chirp_server_hdfs wrapper, which sets these variables to configuration-defined values before running chirp_server.
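A complete invocation might therefore look like the following sketch (the environment variable values and the exported root are illustrative of what such a wrapper sets; -f, -x, and -r are the switches described above):

$ export JAVA_HOME=/usr/lib/jvm/default                                            # JVM location needed by libhdfs
$ export CLASSPATH=$(ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar | tr '\n' ':')   # Hadoop jars for libhdfs
$ chirp_server -f hdfs -x headnode.hadoop:9100 -r /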
Recall that Hadoop has no internal authentication system; it relies on external mechanisms. A user may declare any identity and group membership when establishing a connection with the Hadoop cluster. The model naturally relies on a trustworthy environment; it is not expected to be resilient to malicious use. In a campus environment, this may not be the case. So, besides being an excellent tool for grid storage in its own right, Chirp brings the ability to wall off Hadoop from the campus grid while still exposing it in a secure way. To achieve this, we would expect the administrator to firewall off regular Hadoop access while still exposing Chirp servers running on top of Hadoop. This setup is a key difference from parrot hdfs, discussed in the previous section. Chirp brings its strong authentication mechanisms and flexible access control lists to protect Hadoop from unintended or accidental access, whereas parrot hdfs connects to an exposed, unprotected Hadoop cluster.
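As a rough sketch of such a split, a hypothetical iptables policy on the storage cluster's gateway might look like the following (the Hadoop ports shown reuse the head node port from the example above plus a common datanode default, but real deployments vary, and 9094 is Chirp's usual default port):

$ iptables -A INPUT ! -s 10.0.0.0/24 -p tcp --dport 9100 -j DROP    # Hadoop head node RPC: LAN only
$ iptables -A INPUT ! -s 10.0.0.0/24 -p tcp --dport 50010 -j DROP   # Hadoop datanode transfers: LAN only
$ iptables -A INPUT -p tcp --dport 9094 -j ACCEPT                   # Chirp server stays reachable from campus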
3.3 Limitations when Bridging Hadoop with Chirp

The new, modified Chirp provides users with an approximate POSIX interface to the Hadoop filesystem. Certain restrictions are unavoidable due to limitations stemming from design decisions in HDFS. The API provides most POSIX operations and semantics that are needed for our bridge with Chirp, but not all. We list certain notable problems we have encountered here.

• Errno Translation. The Hadoop libhdfs C API binding interprets exceptions or errors within the Java runtime and sets errno to an appropriate value. These translations are not always accurate and some are missing. A defined errno value within libhdfs is EINTERNAL, which is the generic error used when another is not found to be more appropriate. Chirp tries to sensibly handle these errors by returning a useful status to the user. Some errors must be caught early because Hadoop does not properly handle all error conditions. For example, attempting to open a file that is a directory results in errno being set to EINTERNAL (an HDFS generic EIO error) rather than the correct EISDIR.

• Appends to a File. Appends are largely unsupported in Hadoop within production environments. Hadoop went without appends for a long time in its history because the MapReduce abstraction did not benefit from them. Still, use cases for appends surfaced, and work to add support – and determine the appropriate semantics – has been in progress since 2007. The main barriers to its development are coherency semantics with replication and correctly dealing with the (now) mutable last block of a file. As of now, turning appends on requires setting experimental parameters in Hadoop. Chirp provides some support for appends by implementing the operation through other primitives within Hadoop. That is, Chirp copies the file into memory, deletes the file, writes the contents of memory back to the file, and returns the file for further writes – effectively allowing appends. Support for appends to small files fits most use cases, logs in particular. Appending to a large file is not advised. The solution is naturally less than optimal but is necessary until Hadoop provides non-experimental support.

• Random Writes. Random writes to a file are not allowed by Hadoop. This type of restriction generally affects programs, like linkers, that do not write to a file sequentially. In Chirp, writes to areas other than the end of the file are silently changed to appends instead.

• Other POSIX Incompatibilities. Hadoop has other peculiar incompatibilities with POSIX. The execute bit for regular files is nonexistent in Hadoop. Chirp gets around this by reporting the execute bit set for all users. Additionally, Hadoop does not allow renaming a file to one that already exists. This would be equivalent to the Unix mv operation from one existing file to another existing file. Further, opening, unlinking, and closing a file will result in Hadoop failing with an EINTERNAL error. For these types of semantic errors, the Chirp server and client have no way of reasonably responding to or anticipating such an error. We expect that as the HDFS binding matures many of these errors will be responded to in a more reasonable way.

4 Evaluation

To evaluate the performance characteristics of our Parrot + Chirp + HDFS system, we ran a series of benchmark tests. The first is a set of micro-benchmarks designed to test the overhead for individual I/O system calls when using Parrot with various filesystems. The second is a data transfer performance test of our new system versus the native Hadoop tools. Our tests ran on a 32 node cluster of machines (located off Notre Dame's main campus) with Hadoop set up using 15 TB of aggregate capacity. The cluster is using approximately 50% of available space. The Hadoop cluster shares a 1 gigabit link to communicate with the main campus network. We have set up a Chirp server as a frontend to the Hadoop cluster on one of the data nodes.
Figure 3: Campus network read and write bandwidth (MB/s) versus file size (MB) for 2, 4, 8, and 16 clients. (a) Campus Network Reads to Chirp+HDFS. (b) Campus Network Reads to HDFS. (c) Campus Network Writes to Chirp+HDFS. (d) Campus Network Writes to HDFS.
Figure 5: Latency of I/O operations (stat, open, read, close, in milliseconds) for Parrot+Chirp+Unix, Parrot+HDFS, and Parrot+Chirp+HDFS.

4.1 Micro-benchmarks

The first set of tests we ran were a set of micro-benchmarks that measure the average latency of various I/O system calls such as stat, open, read, and close. For this test we wanted to know the cost of putting Chirp in front of HDFS and using Parrot to access HDFS through our Chirp server. We compared these latencies to those for accessing a Chirp server hosting a normal Unix filesystem through Parrot, and for accessing HDFS directly using Parrot.

The results of these micro-benchmarks are shown in Figure 5. As can be seen, our new Parrot + Chirp + HDFS system has some overhead when compared to accessing a Chirp server or HDFS service using Parrot. In particular, the metadata operations of stat and open are particularly heavy in our new system, while read and close are relatively close. The reason for this high overhead on metadata-intensive operations is the access control lists Chirp uses to implement security and permissions. For each stat or open, multiple files must be opened and checked in order to verify valid access, which increases the latencies for these system calls.

4.2 Performance benchmarks

For our second set of benchmarks, we stress test the Chirp server by determining the throughput achieved when putting and getting very large files. These types of commands are built into Chirp as the RPCs getfile and putfile. Our tests compared the equivalent Hadoop native client hadoop fs get and hadoop fs put commands to the Chirp ones. The computer on the campus network initiating the tests used the Chirp command line tool, which uses the libchirp library without any external dependencies, and the Hadoop native client, which requires all the Hadoop and Java dependencies. We want to test how the Chirp server performs under load compared to the Hadoop cluster by itself. For these tests, we communicated with the Hadoop cluster and the Chirp server frontend from the main campus. All communication went through the 1 gigabit link to the cluster.
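Concretely, the two access paths being compared look roughly like the following (one-shot chirp client commands versus the standard Hadoop shell; host and file names are placeholders):

$ chirp chirpfrontend.nd.edu put big.dat /big.dat     # through the Chirp frontend, libchirp only
$ chirp chirpfrontend.nd.edu get /big.dat copy.dat
$ hadoop fs -put big.dat /big.dat                     # native client, requires the full Java/Hadoop stack
$ hadoop fs -get /big.dat copy.dat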
Figure 4: LAN read and write bandwidth (MB/s) versus file size (MB) for 2, 4, 8, and 16 clients. (a) LAN Network Reads to Chirp+HDFS. (b) LAN Network Reads to HDFS. (c) LAN Network Writes to Chirp+HDFS. (d) LAN Network Writes to HDFS.
The graphs in Figures 3a and 3b show the results of reading (getting) a file from the Chirp server frontend and via the native Hadoop client, respectively. We see that the Chirp server is able to maintain throughput comparable to the Hadoop native client; the Chirp server does not present a real bottleneck to reading a file from Hadoop. In Figures 3c and 3d, we have the results of writing (putting) a file to the Chirp server frontend and the native Hadoop client, again respectively. These graphs similarly show that the Chirp server obtains similar throughput to the Hadoop native client for writes. For these tests, the 1 gigabit link plays a major role in limiting the exploitable parallelism obtainable by writing to Hadoop over a faster (local) network. Despite hamstringing Hadoop's potential parallelism, we find these tests valuable because machines on the campus network will access the Hadoop storage in this way. For contrast, in the next tests, we remove this barrier.

Next, we place the clients on the same LAN as the Hadoop cluster and the Chirp server frontend. We expect that throughput will not be hindered by the link rate of the network and that parallelism will allow Hadoop to scale. Similar to the tests from the main campus grid, the clients put and get files to the Chirp server and via the Hadoop native client.

The graphs in Figures 4a and 4b show the results of reading a file over the LAN setup. We see that we get similar throughput over the Chirp server as in the main campus tests. While the link to the main campus grid is no longer a limiting factor, the Chirp server's machine still only has a gigabit link to the LAN. Meanwhile, Hadoop achieves respectably good performance, as expected, when reading files with a large number of clients. The parallelism gives Hadoop almost twice the throughput. For a larger number of clients, Hadoop's performance outstrips Chirp's by a factor of 10. The write performance in Figures 4c and 4d shows similar results. The writes over the LAN bottleneck on the gigabit link to the Chirp server, while Hadoop sustains only minor performance hits as the number of clients increases.
In both cases, the size of the file did not have a significant effect on the average throughput. Because one (Chirp) server cannot handle the throughput due to network link restrictions in a LAN environment, we recommend adding more Chirp servers on other machines in the LAN to help distribute the load.

We conclude from these benchmark evaluations that Chirp achieves adequate performance compared to Hadoop, with some overhead. In a campus grid setting we see that Chirp gets approximately equivalent streaming throughput compared to the native Hadoop client. We find this acceptable for the type of environment we are targeting.

5 Conclusions

Hadoop is a highly scalable filesystem used in a wide array of disciplines but lacks support for campus or grid computing needs. Hadoop lacks strong authentication mechanisms and access control. Hadoop also lacks ease of access due to the number of dependencies and program requirements. We have designed a solution that enhances the functionality of Chirp, coupled with Parrot or FUSE, to provide a bridge between a user's work and Hadoop. This new system can be used to correct these problems in Hadoop so it may be used safely and securely on the grid without significant loss in performance.

The new version of the Chirp software is now capable of multiplexing the I/O operations on the server so that operations are handled by filesystems other than the local Unix system. Chirp being capable of bridging Hadoop to the grid for safe and secure use is a particularly significant consequence of this work. Chirp can be freely downloaded at http://www.cse.nd.edu/~ccl/software.

References

[1] Filesystem in user space. http://sourceforge.net/projects/fuse.

[2] Boilergrid web site. http://www.rcac.purdue.edu/userinfo/resources/boilergrid, 2010.

[3] D. Borthakur. The Hadoop distributed file system: Architecture and design. http://hadoop.apache.org/, 2007.

[4] H. Bui, P. Bui, P. Flynn, and D. Thain. ROARS: A scalable repository for data intensive scientific computing. In The Third International Workshop on Data Intensive Distributed Computing at ACM HPDC 2010, 2010.

[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, 2004.

[6] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational grids. In ACM Conference on Computer and Communications Security, pages 83–92, San Francisco, CA, November 1998.

[7] S. Ghemawat, H. Gobioff, and S. Leung. The Google filesystem. In ACM Symposium on Operating Systems Principles, 2003.

[8] Hadoop. http://hadoop.apache.org/, 2007.

[9] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51–81, February 1988.

[10] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, 2009.

[11] J. Steiner, C. Neuman, and J. I. Schiller. Kerberos: An authentication service for open network systems. In Proceedings of the USENIX Winter Technical Conference, pages 191–200, 1988.

[12] D. Thain and M. Livny. Parrot: Transparent user-level middleware for data-intensive computing. Technical Report 1493, University of Wisconsin, Computer Sciences Department, December 2003.

[13] D. Thain and M. Livny. How to measure a large open source distributed system. Concurrency and Computation: Practice and Experience, 18(15):1989–2019, 2006.

[14] D. Thain and C. Moretti. Efficient access to many small files in a filesystem for grid computing. In IEEE Conference on Grid Computing, Austin, TX, September 2007.

[15] D. Thain, C. Moretti, and J. Hemmes. Chirp: A practical global file system for cluster and grid computing. Journal of Grid Computing, to appear in 2008.

[16] D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley, 2003.