Sector is an open-source cloud platform designed for data-intensive computing. It offers several advantages over Hadoop: it is up to 2x faster on benchmarks, it supports user-defined functions in addition to MapReduce-style processing, and it exploits data locality and network topology. Sector uses a layered architecture consisting of user-defined functions (Sphere), a distributed file system, and a UDP-based transport protocol (UDT). Experimental results show that Sector outperforms Hadoop on benchmarks and incurs less than a 5% performance penalty, relative to a local cluster, when run on wide-area clusters connected by 10 Gbps networks.
Sector CloudSlam 09
1. Sector: An Open Source Cloud for Data Intensive Computing
Robert Grossman
University of Illinois at Chicago
Open Data Group
Yunhong Gu
University of Illinois at Chicago
April 20, 2009
3. What is a Cloud?
Clouds provide on-demand resources or services over a network with the scale and reliability of a data center.
There is no standard definition, and cloud architectures are not new. What is new:
– Scale
– Ease of use
– Pricing model
4. Categories of Clouds
On-demand resources & services over the Internet at the scale of a data center.
On-demand computing instances
– IaaS: Amazon EC2, S3, etc.; Eucalyptus
– Supports many Web 2.0 users
On-demand computing capacity
– Data intensive computing (say 100 TB, 500 TB, 1 PB, 5 PB)
– GFS/MapReduce/Bigtable, Hadoop, Sector, …
5. Requirements for Clouds Designed for Data Intensive Computing
The four requirements: scale to data centers, scale across data centers, support large data flows, and security.
(Slide table marks which requirements apply to Business, E-science, and Healthcare; E-science needs three of the four, Business and Healthcare two each.)
Sector/Sphere is a cloud designed for data intensive computing supporting all four requirements.
6. Sector Overview
Sector is fast
– Over 2x faster than Hadoop on the MalStone benchmark
– Sector exploits data locality and network topology to improve performance
Sector is easy to program
– Supports MapReduce style over (key, value) pairs
– Supports user-defined functions over records
– Easy to process binary data (images, specialized formats, etc.)
Sector clouds can be wide area
10. Sector’s Layered Cloud Services
Sector’s stack, from top to bottom:
– Applications
– Compute Services: Sphere’s UDFs
– Data Services
– Storage Services: Sector’s Distributed File System (SDFS)
– Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
11. Computing an Inverted Index Using Hadoop’s MapReduce
Stage 1 (Map, then Shuffle): process each HTML file (e.g. page_1 containing word_x, word_y, word_z) and hash each (word, file_id) pair to a bucket keyed by the word’s first character (Bucket-A, Bucket-B, …, Bucket-Z).
Stage 2 (Sort, then Reduce): sort each bucket on its local node and merge entries for the same word, producing postings such as word_z → 1, 5, 10.
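As a minimal, framework-neutral sketch of the two stages above (plain C++ rather than Hadoop’s actual Java API; the bucket-by-first-character rule is taken from the slide), the logic might look like:

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stage 1 (Map + Shuffle): emit (word, page_id) pairs and hash each pair
// to a bucket keyed by the word's first character.
std::map<char, std::vector<std::pair<std::string, int>>>
mapAndShuffle(const std::vector<std::pair<int, std::vector<std::string>>>& pages) {
    std::map<char, std::vector<std::pair<std::string, int>>> buckets;
    for (const auto& [pageId, words] : pages)
        for (const auto& word : words)
            buckets[word[0]].push_back({word, pageId});   // bucket by 1st char
    return buckets;
}

// Stage 2 (Sort + Reduce): within each bucket, merge entries for the same
// word into a sorted posting list, e.g. word_z -> 1, 5, 10.
std::map<std::string, std::set<int>>
sortAndReduce(const std::map<char, std::vector<std::pair<std::string, int>>>& buckets) {
    std::map<std::string, std::set<int>> index;
    for (const auto& [firstChar, pairs] : buckets)
        for (const auto& [word, pageId] : pairs)
            index[word].insert(pageId);
    return index;
}

int main() {
    // page_1 contains word_x, word_y, word_z; pages 5 and 10 also contain word_z
    auto buckets = mapAndShuffle({{1, {"word_x", "word_y", "word_z"}},
                                  {5, {"word_z"}}, {10, {"word_z"}}});
    for (const auto& [word, postings] : sortAndReduce(buckets)) {
        std::cout << word << ":";
        for (int id : postings) std::cout << " " << id;
        std::cout << "\n";
    }
}

In a real MapReduce or Sphere run, each stage executes in parallel across nodes and the shuffle moves buckets between them; this sketch only shows the per-record logic.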
12. Idea 1 – Support UDFs Over Files
Think of MapReduce as
– Map acting on (text) records
– With a fixed Shuffle and Sort
– Followed by Reduce acting on (text) records
We generalize this framework as follows:
– Support a sequence of User Defined Functions (UDFs) acting on segments (= chunks) of files.
– In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc.
13. Computing an Inverted Index Using Sphere’s User Defined Functions (UDF)
The same computation as the previous slide, expressed as a sequence of UDFs:
Stage 1: UDF1 (Map) processes each HTML file and emits (word, file_id) pairs; UDF2 (Shuffle) hashes each pair to a bucket by the word’s first character (Bucket-A … Bucket-Z).
Stage 2: UDF3 (Sort) sorts each bucket on its local node; UDF4 (Reduce) merges entries for the same word, producing postings such as word_z → 1, 5, 10.
16. Sector Programming Model
A Sector dataset consists of one or more physical files.
Sphere applies User Defined Functions over streams of data consisting of data segments.
Data segments can be data records, collections of data records, or files.
Examples of UDFs: a Map function, a Reduce function, a Split function for CART, etc.
Outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node.
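To make the model concrete, here is a minimal sketch of what a UDF over a data segment could look like. This is an illustrative interface only, not the actual Sector/Sphere C++ API; the types Segment and Output and the Route enum are hypothetical.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types, for illustration only (not the real Sphere API).
struct Segment {                 // a data segment handed to the UDF:
    const char* data;            // a record, a collection of records, or a whole file
    std::size_t size;
};

enum class Route { ReturnToOrigin, WriteLocal, ShuffleTo };

struct Output {
    std::vector<std::string> records;  // records produced by the UDF
    Route route = Route::WriteLocal;   // where the framework should send them
    int   destBucket = -1;             // used when route == Route::ShuffleTo
};

// In this sketch a UDF is any function with this shape; the framework,
// not the UDF, decides which node runs it and restarts it on failure.
using UDF = Output (*)(const Segment&);

// Example: a map-style UDF that shuffles its output to a bucket
// chosen from the first byte of the segment.
Output firstByteShuffle(const Segment& seg) {
    Output out;
    out.records.emplace_back(seg.data, seg.size);
    out.route = Route::ShuffleTo;
    out.destBucket = seg.size > 0 ? static_cast<unsigned char>(seg.data[0]) % 26 : 0;
    return out;
}

A sequence of such UDFs (e.g. Map, Shuffle, Sort, Reduce from the inverted-index example) can then be chained, with the framework handling data placement between stages.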
17. Idea 2: Add Security From the Start
The security server maintains information about users and slaves (the AAA data).
User access control: password and client IP address.
File-level access control.
Messages are encrypted over SSL; a certificate is used for authentication.
Sector is HIPAA capable.
(Diagram: the master, the client, and the security server communicate over SSL; the security server holds the AAA data; the master manages the slaves.)
18. Idea 3: Extend the Stack
(Diagram: the Google/Hadoop stack consists of Compute Services, Data Services, and Storage Services; Sector’s stack adds a fourth layer, Routing & Transport Services, beneath them.)
19. Sector is Built on Top of UDT
• UDT is a specialized network transport protocol.
• UDT can take advantage of wide area, high performance 10 Gbps networks.
• Sector is a wide area distributed file system built over UDT.
• Sector is layered over the native file system (vs. being a block-based file system).
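UDT exposes a socket-style C++ API (udt.sourceforge.net). The snippet below is a minimal client sketch, assuming the standard UDT4 headers; the server address, port, and message are placeholders.

#include <arpa/inet.h>
#include <cstring>
#include <iostream>
#include <udt.h>          // from udt.sourceforge.net

int main() {
    UDT::startup();                                        // initialize the UDT library

    UDTSOCKET client = UDT::socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in serv{};                                    // placeholder server address
    serv.sin_family = AF_INET;
    serv.sin_port = htons(9000);
    inet_pton(AF_INET, "192.0.2.10", &serv.sin_addr);

    if (UDT::ERROR == UDT::connect(client, (sockaddr*)&serv, sizeof(serv))) {
        std::cerr << "connect: " << UDT::getlasterror().getErrorMessage() << "\n";
        return 1;
    }

    const char* msg = "hello over UDT";
    UDT::send(client, msg, std::strlen(msg), 0);           // reliable transfer over UDP

    UDT::close(client);
    UDT::cleanup();
}

The API deliberately mirrors BSD sockets, so applications such as Sector can route their data channel over UDT with relatively little code change.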
20. UDT Has Been Downloaded 25,000+ Times
udt.sourceforge.net
(Slide shows adopter logos: Sterling Commerce, Movie2Me, Globus, Power Folder, Nifty TV.)
21. Alternatives to TCP – Decreasing Increases AIMD Protocols
(Figure: for UDT, Scalable TCP, HighSpeed TCP, and AIMD (TCP NewReno), the increase of the packet sending rate as a function of the current rate x, together with each protocol’s decrease factor.)
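For context, a generic AIMD-style control adjusts the sending rate x roughly as follows (a textbook formulation, not taken from the slide):

x \leftarrow x + \alpha(x) \ \text{per acknowledged interval}, \qquad x \leftarrow (1 - \beta)\,x \ \text{on a loss event}

“Decreasing increases” means the increase \alpha(x) shrinks as x grows, so a flow slows its growth as it approaches link capacity; the protocols in the figure differ in their choices of \alpha(x) and the decrease factor \beta.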
22. Using UDT Enables Wide Area Clouds
Using UDT, Sector can take advantage of wide area high performance networks (10+ Gbps), delivering 10 Gbps per application.
24. Comparing Sector and Hadoop

                     Hadoop                      Sector
Storage cloud        Block-based file system     File-based
Programming model    MapReduce                   UDF & MapReduce
Protocol             TCP                         UDP-based protocol (UDT)
Replication          At time of writing          Periodically
Security             Not yet                     HIPAA capable
Language             Java                        C++
25. Open Cloud Testbed – Phase 1 (2008)
Phase 1: 4 racks, 120 nodes, 480 cores, connected by 10+ Gb/s wide area networks (C-Wave, CENIC, Dragon, MREN), running Hadoop, Sector/Sphere, Thrift, and Eucalyptus.
Each node in the testbed is a Dell 1435 computer with 12 GB memory, a 1 TB disk, a 2.0 GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network interface cards.
26. MalStone Benchmark
A benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
Code to generate the required synthetic data is available from code.google.com/p/malgen.
A stylized analytic computation that is easy to implement in MapReduce and its generalizations.
27. MalStone B
(Diagram: entities visiting sites over time, with time windows d_{k-2}, d_{k-1}, d_k.)
28. MalStone B Benchmark

                             MalStone B
Hadoop v0.18.3               799 min
Hadoop Streaming v0.18.3     142 min
Sector v1.19                 44 min

# Nodes: 20    # Records: 10 billion    Size of dataset: 1 TB

These are preliminary results and we expect these results to change as we improve the implementations of MalStone B.
29. Terasort – Sector vs Hadoop Performance

                  LAN     MAN        WAN 1              WAN 2
Number of cores   58      116        178                236
Hadoop (secs)     2252    2617       3069               3702
Sector (secs)     1265    1301       1430               1526
Locations         UIC     UIC, SL    UIC, SL, Calit2    UIC, SL, Calit2, JHU

All times in seconds.
30. With Sector, the “Wide Area Penalty” Is < 5%
Used the Open Cloud Testbed and wide area 10 Gb/s networks.
Ran a data intensive computing benchmark on 4 clusters distributed across the U.S. vs. one cluster in Chicago.
The difference in performance was less than 5% for Terasort.
One expects quite different results, depending upon the particular computation.
31. Penalty for Wide Area Cloud Computing on Uncongested 10 Gb/s

                     28 local nodes   4 x 7 distributed nodes   Wide area “penalty”
Hadoop, 3 replicas   8650             11600                     34%
Hadoop, 1 replica    7300             9600                      31%
Sector               4200             4400                      4.7%

All times in seconds, using the MalStone A benchmark on the Open Cloud Testbed.
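The “penalty” column matches the relative slowdown of the distributed run over the local run; as a worked check (my arithmetic, not from the slide):

\text{penalty} = \frac{T_{\text{distributed}} - T_{\text{local}}}{T_{\text{local}}}, \qquad \frac{4400 - 4200}{4200} \approx 4.8\% \ \text{(Sector)}, \quad \frac{11600 - 8650}{8650} \approx 34\% \ \text{(Hadoop, 3 replicas)}

(The slide reports 4.7% for Sector; the small difference is presumably rounding.)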
32. For More Information & To Obtain Sector
To obtain Sector or learn more about it: sector.sourceforge.net
To learn more about the Open Cloud Consortium: www.opencloudconsortium.org
For related work by Robert Grossman: blog.rgrossman.com, www.rgrossman.com
For related work by Yunhong Gu: www.lac.uic.edu/~yunhong