Ceph, GlusterFS, Lustre, and MooseFS are popular open-source distributed file systems. They differ in their architectures and in how they manage metadata and data, handle failures, and balance load. Ceph uses object storage devices and metadata servers, while GlusterFS relies on replication between storage devices without a centralized metadata server. Lustre and MooseFS both use centralized metadata management, with MooseFS providing metadata-server failover.
The document discusses different types of parallel computer architectures, including shared-memory multiprocessors. It describes taxonomy of parallel computers including SISD, SIMD, MISD, and MIMD models. For shared-memory multiprocessors, it outlines consistency models including strict, sequential, processor, weak and release consistency. It also discusses UMA and NUMA architectures, cache coherence protocols like MESI, and examples of multiprocessors using crossbar switches or multistage networks.
A Distributed Shared Memory (DSM) system provides a logical abstraction of shared memory built using interconnected nodes with distributed physical memories. There are hardware, software, and hybrid DSM approaches. DSM offers simple abstraction, improved portability, potential performance gains, large unified memory space, and better performance than message passing in some applications. Consistency protocols ensure shared data coherency across distributed memories according to the memory consistency model.
A PowerPoint presentation on distributed operating systems: reasons for choosing distributed systems over centralized systems, types of distributed systems, and process migration and its advantages.
Mobile Ad hoc Networks (MANETs) allow devices to connect spontaneously without infrastructure by acting as both hosts and routers, forwarding traffic in a multi-hop fashion. They face challenges from dynamic topology, limited bandwidth and security, and use reactive routing protocols like Dynamic Source Routing (DSR) that discover routes on demand through flooding route requests. MANETs have applications in military operations, disaster relief, vehicular networks, and personal area networks.
The document discusses various multiplexing techniques including frequency division multiplexing (FDM), wavelength division multiplexing (WDM), time division multiplexing (TDM), and code division multiplexing (CDM). It provides examples of how each technique works, such as using different carrier frequencies for FDM, assigning time slots to each channel for TDM, and multiplying data values by unique code sequences for CDM. The techniques allow multiple signals to be combined and transmitted over a shared medium then separated again at the receiving end.
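The code-division idea above can be sketched concretely. In this illustrative example (the station names and 4-chip Walsh-style codes are assumptions, not from the source), each sender multiplies its data bit (+1 or −1) by its own orthogonal chip sequence, the channel simply sums the signals, and a receiver recovers any one sender's bit by correlating the summed signal with that sender's code:

```python
# Orthogonal chip sequences (Walsh-style, illustrative).
codes = {
    "A": [+1, +1, +1, +1],
    "B": [+1, -1, +1, -1],
    "C": [+1, +1, -1, -1],
}

def transmit(bits):
    """bits: dict mapping station name to +1 or -1.
    Each station multiplies its bit by its code; the shared
    medium adds the signals chip by chip."""
    return [sum(bits[s] * codes[s][i] for s in bits) for i in range(4)]

def recover(signal, station):
    """Correlate the summed signal with one station's code;
    orthogonality cancels the other stations' contributions."""
    code = codes[station]
    return sum(x * c for x, c in zip(signal, code)) // len(code)

signal = transmit({"A": +1, "B": -1, "C": +1})
```

Because the codes are mutually orthogonal, `recover(signal, "B")` yields −1 even though all three stations transmitted at once.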
The document discusses spread spectrum techniques used to prevent eavesdropping and jamming by adding redundancy. It describes two types of spread spectrum: Frequency Hopping Spread Spectrum (FHSS) which spreads signals across the frequency domain, and Direct Sequence Spread Spectrum (DSSS) which spreads signals across the time domain. The document then compares FHSS and DSSS in terms of performance, issues, acceptance and applications.
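The DSSS mechanism can be sketched as spreading each data bit over a chip sequence. In this toy sketch (the 11-chip sequence and the majority-vote recovery are illustrative assumptions, not taken from the source), each bit is XOR-ed with every chip of the sequence, and the receiver de-spreads by voting, which tolerates a few corrupted chips:

```python
# Illustrative 11-chip spreading sequence (Barker-like, not authoritative).
CHIPS = [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0]

def spread(bits):
    """Replace each data bit with 11 chips: bit XOR chip."""
    return [b ^ c for b in bits for c in CHIPS]

def despread(chips):
    """Recover each bit by XOR-ing the received chips with the
    sequence again and taking a majority vote, so a few chip
    errors do not flip the bit."""
    out = []
    for i in range(0, len(chips), len(CHIPS)):
        block = chips[i:i + len(CHIPS)]
        votes = sum(x ^ c for x, c in zip(block, CHIPS))
        out.append(1 if votes > len(CHIPS) // 2 else 0)
    return out
```

The redundancy is what the summary above refers to: each bit occupies eleven chip times, widening the signal's bandwidth while making it harder to jam or intercept.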
The document provides an overview of IEEE 802.11 standards for wireless local area networks. It discusses the creation of 802.11 by IEEE, the physical layer, frame formats, and various 802.11 protocols including 802.11b, 802.11a, 802.11g, 802.11n, and 802.11ac. It also describes the media access control including CSMA/CA and security features like authentication and WEP encryption.
This document discusses multiple access protocols, which allow multiple devices to share a common communication channel or transmission medium. It begins by explaining that the data link layer has two sublayers, the data link control (DLC) sublayer and the media access control (MAC) sublayer, the latter of which resolves access to shared media. It then describes different types of multiple access links and defines multiple access protocols. The rest of the document covers random access methods like ALOHA and CSMA, controlled access methods like polling and token passing, and channelization methods like FDMA, TDMA, and CDMA, with details on the concepts, procedures, and flow diagrams of each.
Distributed shared memory (DSM) allows nodes in a cluster to access shared memory across the cluster in addition to each node's private memory. DSM uses a software memory manager on each node to map local memory into a virtual shared memory space. It consists of nodes connected by high-speed communication and each node contains components associated with the DSM system. Algorithms for implementing DSM deal with distributing shared data across nodes to minimize access latency while maintaining data coherence with minimal overhead.
This document discusses distributed systems and their evolution. It defines a distributed system as a collection of networked computers that communicate and coordinate actions by passing messages. Distributed systems have several advantages over centralized systems, including better utilization of resources and the ability to share information among distributed users. The document describes several models of distributed systems including mini computer models, workstation models, workstation-server models, processor pool models, and hybrid models. It also discusses why distributed computing systems are gaining popularity due to their ability to effectively manage large numbers of distributed resources and handle inherently distributed applications.
Difference between Homogeneous and Heterogeneous, by Faraz Qaisrani
Muhammad Faraz Qaisrani from the 2nd Batch at Benazir Bhutto Shaheed University discusses types of distributed database management systems (DDBMS). There are two main types: homogeneous, where all data centers use the same software, and heterogeneous, where different data centers may use different database products. Homogeneous systems are easier to design and manage but can be difficult for organizations to implement uniformly. Heterogeneous systems allow integration of existing databases but require translations between different hardware and software.
The document discusses MAC layer protocols, specifically CSMA/CD and CSMA/CA.
CSMA/CD is used for wired networks and works by having nodes listen to check if the medium is free before transmitting. If a collision is detected, transmission stops and resumes after a backoff time.
CSMA/CA is used for wireless networks and aims to avoid collisions through the use of request to send, clear to send, and acknowledgement frames exchanged between nodes, rather than detecting collisions.
Both protocols reduce collisions compared to simple CSMA, but CSMA/CA is less efficient and cannot completely solve collisions in wireless networks due to issues like hidden terminals.
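The collision-avoidance behavior described above rests on binary exponential backoff: after each failed attempt the contention window roughly doubles, and the station waits a random number of idle slots before retrying. A minimal sketch, assuming 802.11-style default window bounds (the parameter values are illustrative):

```python
import random

def backoff_slots(attempt, cw_min=15, cw_max=1023):
    """Binary exponential backoff in the spirit of CSMA/CA:
    the contention window starts at cw_min, doubles on each
    failed attempt, and is capped at cw_max; the station then
    waits a uniformly random number of idle slots in [0, cw]."""
    cw = min(cw_max, (cw_min + 1) * (2 ** attempt) - 1)
    return random.randint(0, cw)
```

Randomizing the wait is what keeps two stations that collided from deterministically colliding again on their next attempt.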
This document discusses various approaches to improving TCP performance over mobile networks. It describes Indirect TCP, Snooping TCP, Mobile TCP, optimizations like fast retransmit/recovery and transmission freezing, and transaction-oriented TCP. Each approach is summarized in terms of its key mechanisms, advantages, and disadvantages. Overall, the document evaluates different ways TCP has been adapted to better support mobility and address challenges like frequent disconnections, packet losses during handovers, and high bit error rates over wireless links.
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ..., by vtunotesbysree
1) The document provides solutions to problems from Chapter 1 and Chapter 2 of a computer organization textbook.
2) In Chapter 1, it discusses basic computer structure, performance improvement through overlapping operations, and performance comparisons between RISC and CISC processors. In Chapter 2, it covers machine instructions, binary representations of numbers, assembly language programming, and addressing modes.
3) Some of the problems solved include calculating non-overlapped and overlapped execution times, comparing RISC and CISC processors under different clock rates, implementing addition and subtraction in binary, writing assembly code to calculate a dot product, and designing programs that use indexed addressing modes.
This document discusses satellite communication architecture and access. It describes the key components of satellite communication systems including the satellite subsystems of sensors, transponders and transmitters and the earth station subsystems of ground stations, amplifiers, receivers and displays. It also discusses the different types of communication links between satellites and ground stations including uplinks, downlinks, crosslinks and intersatellite links. Multiple access techniques for satellite communications like FDMA, TDMA and CDMA are also summarized.
This document discusses Pure ALOHA, a random access protocol in which multiple stations may transmit simultaneously, potentially leading to collisions and garbled data. Under Pure ALOHA, stations transmit whenever they have data, without coordination. If an acknowledgment is not received within the allotted time, the station waits a random back-off time and resends the frame. Any overlap between two frames destroys both, requiring retransmission; the standard throughput analysis assumes uniform frame lengths. The maximum throughput of Pure ALOHA is about 0.184 (18.4% of channel capacity), reached when the offered load is 0.5 frames per frame time.
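The throughput figure quoted above follows from the standard formula S = G·e^(−2G), where G is the offered load in frames per frame time (Poisson arrivals assumed). A quick sketch:

```python
import math

def pure_aloha_throughput(G):
    """Throughput S of Pure ALOHA: the expected fraction of the
    channel carrying successful frames when stations collectively
    offer G frames per frame time (Poisson arrivals assumed)."""
    return G * math.exp(-2 * G)

# S peaks at G = 0.5, giving S = 0.5 * e**-1, roughly 0.184.
```

Differentiating S with respect to G and setting the result to zero gives G = 1/2, which is why pushing more traffic onto a Pure ALOHA channel beyond that point only increases collisions and lowers throughput.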
Chapter 12 discusses mass storage systems and their role in operating systems. It describes the physical structure of disks and tapes and how they are accessed. Disks are organized into logical blocks that are mapped to physical sectors. Disks connect to computers via I/O buses and controllers. RAID systems improve reliability through redundancy across multiple disks. Operating systems provide services for disk scheduling, management, and swap space. Tertiary storage uses tape drives and removable disks to archive less frequently used data in large installations.
An internetwork connects individual networks together so they function as a single large network. It addresses the challenges of connecting different networks that may use varying technologies and speeds. The OSI reference model describes how information passes through seven layers as it moves between software applications on different computer systems. Each layer adds control information in the form of headers and trailers to communicate with its peer layer on other systems. This allows information to be reliably exchanged between networked devices.
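The header-and-trailer encapsulation described above can be sketched with a toy example. The layer names and tag strings here are illustrative, not part of any real protocol: each layer wraps the payload from the layer above in its own header, and the data link layer also appends a trailer, which the peer layers strip off in reverse order on the receiving side.

```python
# Layers shown top-down; real stacks have more (this is a sketch).
LAYERS = ["transport", "network", "datalink"]

def encapsulate(payload):
    """Wrap the payload in one header per layer, innermost first,
    plus a data link trailer."""
    for layer in LAYERS:
        payload = f"[{layer}-hdr]{payload}"
    return payload + "[datalink-trl]"

def decapsulate(frame):
    """Peel the headers and trailer off in the reverse order,
    as the receiving system's peer layers would."""
    frame = frame.removesuffix("[datalink-trl]")
    for layer in reversed(LAYERS):
        frame = frame.removeprefix(f"[{layer}-hdr]")
    return frame
```

Each layer only inspects and removes its own control information, which is what lets peer layers on different systems communicate without understanding the layers above or below them.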
A centralized database stores all data in a single location, typically on a central server, making it easy to get a complete view of data but slowing down with many users accessing the same file. A distributed database splits data across multiple physical locations and processing across nodes, avoiding bottlenecks while requiring synchronization and replication across locations.
The document provides an overview of the World Wide Web (WWW) and Hypertext Transfer Protocol (HTTP). It discusses the structure of the WWW including clients, servers, caches and components like HTML, URLs, and browsers. HTTP is described as the application protocol that allows for data communication across the internet using requests and responses. Key aspects of HTTP like features, architecture, status codes, and request methods are summarized.
This document discusses cloud computing protocols. It begins by defining cloud computing as using remote servers over the internet to store and access data and applications. The cloud is broken into three categories: applications, storage, and connectivity. Protocols are then defined as sets of rules that allow electronic devices to connect and exchange information. Ten specific protocols are described: Gossip protocol for failure detection and messaging; Connectionless network protocol for fragmentation; State routing protocol for path selection; Internet group management protocol for multicasting; Secure shell protocol for secure remote login; Coverage enhanced ethernet protocol for network traffic handling; Extensible messaging and presence protocol for publish/subscribe systems; Advanced message queuing protocol for point-to-point messaging; and Enhanced Interior Gateway Routing Protocol for routing.
X.25 is an ITU-T standard protocol that defines how connections between user devices and network devices are established and maintained over packet-switched networks. It uses the Packet Layer Protocol (PLP) for network layer functions, Link Access Procedure, Balanced (LAPB) for data link layer functions, and various physical layer standards. X.25 supports both switched and permanent virtual circuits for data transfer.
This document discusses various data link control protocols. It covers framing, flow and error control, and specific protocols like HDLC and PPP. Framing involves adding structure like headers and trailers to organize data into packets. Flow and error control techniques like stop-and-wait ARQ and sliding window protocols are used to ensure reliable transmission over noisy channels. HDLC is a widely used bit-oriented protocol that defines frame structures and error control. PPP is a point-to-point protocol commonly used for dial-up internet access.
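The stop-and-wait scheme mentioned above has a well-known efficiency cost: the sender idles for a full round trip after every frame. Its utilization in the error-free case is commonly written U = 1 / (1 + 2a), where a is the ratio of propagation delay to frame transmission time. A small sketch (function name is ours):

```python
def stop_and_wait_utilization(frame_time, prop_delay):
    """Fraction of time the link carries data under stop-and-wait
    ARQ with no errors: the sender transmits for frame_time, then
    waits one round trip (2 * prop_delay) for the acknowledgment.
    U = 1 / (1 + 2a), with a = prop_delay / frame_time."""
    a = prop_delay / frame_time
    return 1 / (1 + 2 * a)
```

This is precisely why sliding window protocols exist: by keeping several frames in flight, they fill the round-trip gap that stop-and-wait leaves idle.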
Mobile transport layer - traditional TCP, by Vishal Tandel
This document summarizes several mechanisms proposed to improve TCP performance in wireless networks. It discusses approaches like indirect TCP, snooping TCP, and mobile TCP that split the TCP connection to isolate the wireless link. It also covers fast retransmit/recovery techniques, transmission freezing, and selective retransmission to more efficiently handle packet losses due to mobility. While each approach aims to address TCP issues in wireless networks, they often do so by mixing layers or requiring changes to the basic TCP protocol stack.
This document discusses different distributed computing system (DCS) models:
1. The minicomputer model consists of a few minicomputers with remote access allowing resource sharing.
2. The workstation model consists of independent workstations scattered throughout a building where users log onto their home workstation.
3. The workstation-server model includes minicomputers, diskless and diskful workstations, and centralized services like databases and printing.
It provides an overview of the key characteristics and advantages of different DCS models.
Message and Stream Oriented Communication, by Dilum Bandara
Message and Stream Oriented Communication in distributed systems. Persistent vs. Transient Communication. Event queues, Pub/sub networks, MPI, Stream-based communication, Multicast communication
HDFS is Hadoop's implementation of a distributed file system designed to store large amounts of data across clusters of machines. It is based on Google's GFS and addresses limitations of other distributed file systems like NFS. HDFS uses a master/slave architecture with a NameNode master storing metadata and DataNodes storing data blocks. Data is replicated across multiple DataNodes for reliability. The file system is optimized for large, sequential reads and writes of entire files rather than random access or updates.
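The block-based layout described above can be sketched numerically. Assuming the common Hadoop defaults of a 128 MB block size and a replication factor of 3 (both are configurable; the function names are ours):

```python
import math

def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Divide a file into fixed-size blocks the way an HDFS-style
    system does: full blocks, then one final partial block."""
    if file_size <= 0:
        return []
    n = math.ceil(file_size / block_size)
    sizes = [block_size] * (n - 1)
    sizes.append(file_size - block_size * (n - 1))
    return sizes

def replicated_storage(file_size, block_size=128 * 1024 * 1024,
                       replication=3):
    """Raw cluster storage consumed once every block is replicated."""
    return sum(split_into_blocks(file_size, block_size)) * replication
```

For example, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and with 3-way replication it consumes 900 MB of raw cluster storage, which is the trade the NameNode makes for surviving DataNode failures.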
The document summarizes the Hadoop Distributed File System (HDFS), which is designed to reliably store and stream very large datasets at high bandwidth. It describes the key components of HDFS, including the NameNode which manages the file system metadata and mapping of blocks to DataNodes, and DataNodes which store block replicas. HDFS allows scaling storage and computation across thousands of servers by distributing data storage and processing tasks.
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS, by RaheemUnnisa1
The document discusses parallel file systems for Linux clusters. It describes how parallel file systems distribute data across multiple storage servers to enable high-performance access through simultaneous input/output operations. This allows each process on every node in a Linux cluster to perform I/O to and from a common storage target. Examples of parallel file systems for Linux clusters include PVFS, IBM GPFS, and Lustre. Parallel file systems enhance the performance of Linux clusters by optimizing the use of storage resources.
HDFS is a distributed file system that stores large data across multiple nodes in a Hadoop cluster. It divides files into blocks and replicates them across nodes for reliability. The NameNode manages the file system namespace and regulates client access, while DataNodes store data blocks. HDFS provides interfaces for applications to access data blocks efficiently and is highly fault tolerant due to replication.
HDFS is a distributed file system designed for storing large files across clusters of commodity hardware. It provides high-throughput access to application data reliably, even in the event of hardware failures, through data replication across multiple nodes. The master node (namenode) manages metadata like file paths and block locations, while slave nodes (datanodes) store file blocks. Files are split into large blocks which are replicated to provide fault tolerance.
Big data interview questions and answersKalyan Hadoop
This document provides an overview of the Hadoop Distributed File System (HDFS), including its goals, design, daemons, and processes for reading and writing files. HDFS is designed for storing very large files across commodity servers, and provides high throughput and reliability through replication. The key components are the NameNode, which manages metadata, and DataNodes, which store data blocks. The Secondary NameNode assists the NameNode in checkpointing filesystem state periodically.
We have entered an era of Big Data. Huge information is for the most part accumulation of information sets so extensive and complex that it is exceptionally hard to handle them utilizing close by database administration devices. The principle challenges with Big databases incorporate creation, curation, stockpiling, sharing, inquiry, examination and perception. So to deal with these databases we require, "exceedingly parallel software's". As a matter of first importance, information is procured from diverse sources, for example, online networking, customary undertaking information or sensor information and so forth. Flume can be utilized to secure information from online networking, for example, twitter. At that point, this information can be composed utilizing conveyed document frameworks, for example, Hadoop File System. These record frameworks are extremely proficient when number of peruses are high when contrasted with composes.
Best Hadoop Institutes : kelly tecnologies is the best Hadoop training Institute in Bangalore.Providing hadoop courses by realtime faculty in Bangalore.
Hadoop Distributed Filesystem (HDFS) is a distributed filesystem designed for storing very large files across commodity hardware. It is optimized for streaming data access and is a good fit for large files, terabytes or petabytes in size, with streaming write-once and read-many access patterns. HDFS uses a master-slave architecture with a Namenode managing the filesystem metadata and Datanodes storing and retrieving block data. Blocks are replicated across Datanodes for reliability. The Namenode tracks block locations and clients read/write data by communicating with the Namenode and Datanodes in a pipeline.
The document summarizes and compares several distributed file systems, including Google File System (GFS), Kosmos File System (KFS), Hadoop Distributed File System (HDFS), GlusterFS, and Red Hat Global File System (GFS). GFS, KFS and HDFS are based on the GFS architecture of a single metadata server and multiple chunkservers. GlusterFS uses a decentralized architecture without a metadata server. Red Hat GFS requires a SAN for high performance and scalability. Each system has advantages and limitations for different use cases.
The document provides an introduction to the key concepts of Big Data including Hadoop, HDFS, and MapReduce. It defines big data as large volumes of data that are difficult to process using traditional methods. Hadoop is introduced as an open-source framework for distributed storage and processing of large datasets across clusters of computers. HDFS is described as Hadoop's distributed file system that stores data across clusters and replicates files for reliability. MapReduce is a programming model where data is processed in parallel across clusters using mapping and reducing functions.
Sector and Hadoop both provide distributed file systems and parallel data processing capabilities. However, Sector was designed for wide area data sharing and distribution across networks while Hadoop focuses on large data processing within a single data center. Sector also allows direct processing of files and directories using user-defined functions, making it up to 4x faster than Hadoop's MapReduce framework in some benchmarks.
This presentation discusses the following topics:
Hadoop Distributed File System (HDFS)
How does HDFS work?
HDFS Architecture
Features of HDFS
Benefits of using HDFS
Examples: Target Marketing
HDFS data replication
GlusterFS is an open-source distributed file system that aggregates various storage servers over a network into one large parallel file system. It does not require a metadata server, ensuring better performance, linear scalability, and reliability compared to traditional distributed file systems. Red Hat Gluster Storage provides a scalable, reliable data management platform using GlusterFS to streamline file access across physical, virtual, and cloud environments. It supports various volume types including distributed replicated volumes which replicate data across multiple servers for high availability.
The document describes the architecture and design of the Hadoop Distributed File System (HDFS). It discusses key aspects of HDFS including its master/slave architecture with a single NameNode and multiple DataNodes. The NameNode manages the file system namespace and regulates client access, while DataNodes store and retrieve blocks of data. HDFS is designed to reliably store very large files across machines by replicating blocks of data and detecting/recovering from failures.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a Master and Slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. The NameNode tracks locations of data blocks and regulates access to files, while DataNodes store file blocks and manage read/write operations as directed by the NameNode. HDFS provides high-performance, scalable access to data across large Hadoop clusters.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It was developed to support distributed processing of large datasets. The document provides an overview of Hadoop architecture including HDFS, MapReduce and key components like NameNode, DataNode, JobTracker and TaskTracker. It also discusses Hadoop history, features, use cases and configuration.
Survey of clustered_parallel_file_systems_004_lanl.pptRohn Wood
Lustre is a parallel file system that uses object storage and metadata servers to provide high performance storage for computing clusters. It separates metadata operations from data operations for enhanced scalability. Metadata servers maintain file system metadata while object storage targets handle file I/O and interface with storage devices. Lustre has been deployed on some of the world's fastest supercomputers and can sustain over 3 GB/s of bandwidth on large file systems.
AWS and Slack Integration - Sending CloudWatch Notifications to Slack.pdfManish Chopra
This document is a brief tutorial for integration AWS and Slack. It shows implementing AWS CloudWatch notification to Slack, when any of your AWS service thresholds cross the set boundary.
The document introduces ChatGPT, an AI chatbot created by OpenAI, and provides instructions for getting started using it through a web browser by signing in with a Microsoft or Google account. It also describes some of ChatGPT's capabilities like answering questions and limitations, as well as resources for learning more about developing with OpenAI and ChatGPT through their GitHub repositories and other online materials.
Grafana and AWS - Implementation and UsageManish Chopra
This article provides a use case scenario of Grafana on AWS (Amazon Web Services). It demonstrates a brief implementation of Grafana as a Docker container, and connects to Amazon Cloudwatch data source. It then monitors events and allows visualizing using native methods and customization. Grafana can be setup in production as given in the document.
In this article, AWS Auto scaling is demonstrated with a web service implementation on Amazon ECS (Elastic Container Service) that uses NGinx web server. To accommodate large number to simultaneous requests, AWS Auto scaling is planned and implemented. Thereafter, load testing is carried out, which triggers an alarm in AWS and the ECS containers start to scale up to become 4 similar tasks running in parallel. When the load on web server is removed, the ECS containers gradually decrement back to the original state.
OpenKM is a Free/Libre document management system that provides a web interface for managing arbitrary files. OpenKM includes a content repository, Lucene indexing, and jBPM workflow. The OpenKM system was developed using Java technology.
Alfresco Content Services provides open, flexible, highly scalable Enterprise Content Management
(ECM) capabilities. Content is accessible wherever and however you work and easily integrates with
your other business applications.
Microservices with Dockers and KubernetesManish Chopra
This is a customized study guide to get started with Microservices using Docker and Kubernetes. This guide attempts to bridge the gap in the least possible time, and covers the essentials features to get started with Microservices, Docker, and Kubernetes.
An up-to-date course material detailing UNIX and Linux operating systems for graduate students who aspire to work towards gaining meaningful employment in the corporate world.
This is the Day-4 lab exercise for CGI group webinar series. It primarily includes demonstrations on Hive, Analytics and other tools on the Cloudera Hadoop Platform.
This tutorial presents how a new Dataset can be prepared by joining multiple Excel files into a single CSV file. The final Dataset can be used with RDBMS systems and Big Data based NoSQL systems.
Organizations with largest hadoop clustersManish Chopra
Yahoo has the largest Hadoop cluster with 42,000 nodes. LinkedIn has the second largest with 4,100 nodes, and Facebook has the third largest with 1,400 nodes. The document lists the number of nodes in the Hadoop clusters of various companies.
Difference between hadoop 2 vs hadoop 3Manish Chopra
Hadoop 3.x includes improvements over Hadoop 2.x such as supporting Java 8 as the minimum version, using erasure coding for fault tolerance which reduces storage overhead, improving the YARN timeline service for better scalability and reliability, and moving default ports out of the ephemeral range to prevent startup failures. Hadoop 3.x also adds support for the Microsoft Azure Data Lake filesystem and provides better scalability by allowing clusters to scale to over 10,000 nodes. Key features for resource management, high availability, and running analytics workloads are also continued from Hadoop 2.x in Hadoop 3.x.
This document provides instructions for installing Oracle Solaris 11 in a virtualized environment using VMware. It outlines downloading the Solaris 11 ISO file, creating a new virtual machine in VMware using the ISO, and following additional installation steps displayed in subsequent slides to complete the installation process. The presentation demonstrates and guides the user through the full Solaris 11 installation within a virtual machine.
A customized course that covers topics ranging from usage of Linux, understanding of Big Data including several Distributed File Systems like GlusterFS, Ceph, Lustre, Hadoop, Hive, Pig, NoSQL databases, Spark, different types of Analytics like Business/Predictive/Real-Time/Web and Big Data Analytics, Proof of Concept solutions and use cases.
Emergence and Importance of Cloud Computing for the EnterpriseManish Chopra
Cloud appeared as a buzzword around a decade ago, and the technology has made inroads into many enterprises now.
For any modern IT organisation, it is essential to have an active presence on the internet. Few techniques of automation would enhance such presence, as customers, stakeholders and employees get a single platform to be in sync anytime and from anywhere.
Steps to create an RPM package in LinuxManish Chopra
This exercise was done on RHEL 6 and same steps are applicable for other variants too. This tutorial provides you with steps to create your own RPM packages in Linux. Following procedure shows creating a basic RPM package that includes a shell script. After the RPM is installed, the script is executed on the command prompt to display the output.
Setting up a HADOOP 2.2 cluster on CentOS 6Manish Chopra
Create your own Hadoop distributed cluster using 3 virtual machines. Linux (CentOS 6 or RHEL 6) can be used, along with Java and Hadoop binary distributions.
Google's search engine consists of 3 main parts: Googlebot the web crawler, the indexer that sorts words from crawled pages into a database, and the query processor that compares search terms to the index to return relevant results. Googlebot simultaneously crawls thousands of pages using links and submissions to comprehensively index the web. The indexer ignores common words and stores each word with linked pages. The query processor uses algorithms like PageRank and proximity to return the most relevant results matching search terms and phrases.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Distributed File Systems (DFS)
A distributed file system is any file system that allows access to files from multiple hosts shared via a
computer network. This makes it possible for multiple users on multiple machines to share files and
storage resources. Distributed file systems differ in their performance, mutability of content, handling of
concurrent writes, handling of permanent or temporary loss of nodes or storage, and their policy of storing
content.
Let us first understand what a Big Data project comprises, so that there is no doubt in differentiating between a traditional project and a Big Data project. Traditional projects have typically relied on an RDBMS for managing data, whereas a Big Data project differs chiefly in the distributed nature of its data management.
During the analysis phase of a Big Data project, one of the most important decisions is the choice of distributed storage system. Distributed storage is relatively easy to understand but complex to configure and manage. Contrast this with standalone systems, where hard drives are locally attached to a machine and accessed from that machine itself, whether it runs Windows, Linux, or any other operating system; examples of locally managed storage include a laptop's C: drive, a USB drive, SAN-connected storage, or the internal drives of desktop- and server-class machines.
When choosing storage for a Big Data system, several factors need to be analysed, and more than one distributed file system may be incorporated.
Types of Distributed File Systems
There are several types of Distributed File Systems (DFSs), and many of them are deployed in production depending upon the use case. Some DFSs are broadly popular, while others serve specific applications. The following is a list of a few important ones:
Ceph
GlusterFS
Lustre
MooseFS
These have been chosen based on their popularity, use in production, and regular updates.
Let us now look at each of these in more detail.
Ceph
Ceph is an open source distributed file system developed by Sage Weil.
Architecture
Ceph is a fully distributed system. Unlike HDFS, Ceph ensures scalability through dynamic distributed metadata management, using a cluster of metadata servers (MDSs) and storing data and metadata in Object Storage Devices (OSDs). The MDSs manage the namespace, security, and consistency of the system and serve metadata queries, while the OSDs perform I/O operations.
Cache consistency
Ceph is a near-POSIX system in which reads must reflect an up-to-date version of data and writes must reflect a particular order of occurrence. It allows concurrent read and write access using a simple locking mechanism called capabilities. These capabilities, granted by the MDSs, give clients the ability to read, read and cache, write, or write and buffer data. Ceph also provides tools to relax consistency where strict semantics are not required.
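The capability mechanism can be pictured as a small set of grant flags held per client. Below is a minimal sketch in Python; the flag names are hypothetical simplifications, and real Ceph capabilities are considerably richer:

```python
from enum import Flag, auto

class Capability(Flag):
    """Illustrative capability bits an MDS might grant a client."""
    READ = auto()
    CACHE = auto()   # client may serve reads from its local cache
    WRITE = auto()
    BUFFER = auto()  # client may buffer writes locally

def can_cache_reads(caps: Capability) -> bool:
    """A client may only cache read data if the MDS granted CACHE."""
    return Capability.CACHE in caps

# A client granted READ|CACHE may answer reads locally; a client
# granted only WRITE must push every write through immediately.
caps = Capability.READ | Capability.CACHE
print(can_cache_reads(caps))       # True
print(Capability.BUFFER in caps)   # False
```

Revoking the CACHE or BUFFER flag is how the MDS forces clients back to synchronous behaviour when files are shared, which is what keeps concurrent access consistent.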
Load Balancing
Ceph ensures load balancing at two levels: data and metadata. First, it maintains a counter for each piece of metadata recording its access frequency, so Ceph can replicate or move the most frequently accessed metadata from an overloaded MDS to another one. Second, Ceph uses weighted devices. For example, assume two storage resources with weights of one and two respectively: the second will store twice as much data as the first. Ceph monitors the disk space used by each OSD and moves data to balance the system according to the weight associated with each OSD. When a new device is added, for instance, Ceph can move data from OSDs that have become unbalanced onto the new device.
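The weighted-device idea can be sketched as deterministic, hash-based placement in which each OSD receives objects in proportion to its weight. This is a simplified stand-in for Ceph's actual placement algorithm (CRUSH); the OSD names and weights are illustrative:

```python
import hashlib

# Hypothetical weights: osd2 should receive roughly twice
# as much data as osd1.
osds = {"osd1": 1, "osd2": 2}

def place(object_name: str, osds: dict) -> str:
    """Deterministically pick an OSD for an object, with probability
    proportional to the OSD's weight (simplified CRUSH-like idea)."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    point = h % sum(osds.values())
    for name, weight in sorted(osds.items()):
        if point < weight:
            return name
        point -= weight
    raise AssertionError("unreachable: point always falls in a bucket")

counts = {name: 0 for name in osds}
for i in range(3000):
    counts[place(f"object-{i}", osds)] += 1
print(counts)  # osd2 ends up with roughly twice as many objects as osd1
```

Because placement is computed from the object name and the weight map alone, every node agrees on where an object lives without consulting a central table; changing a weight simply shifts which objects need to migrate.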
Fault Detection
Ceph uses the same fault detection model as HDFS. OSDs periodically exchange heartbeats to detect
failures. Ceph provides a small cluster of monitors which holds a copy of the cluster map. Each OSD can
send a failure report to any monitor. The inaccurate OSD will be marked as down. Each OSD can also
query an up-to-date version of the cluster map from the cluster of monitor. Thus, a formerly defaulting
OSD can join the system again.
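The heartbeat-and-report model can be sketched as a monitor that marks an OSD down once heartbeats stop arriving within a timeout. The timeout value and class names here are assumptions for illustration, not Ceph's actual defaults:

```python
# Minimal sketch of heartbeat-based failure detection. Timestamps are
# passed in explicitly so the logic is easy to follow and to test.
HEARTBEAT_TIMEOUT = 20.0  # seconds; illustrative value

class Monitor:
    def __init__(self):
        self.last_seen = {}  # osd id -> timestamp of last heartbeat
        self.status = {}     # osd id -> "up" / "down"

    def heartbeat(self, osd_id, now):
        self.last_seen[osd_id] = now
        self.status[osd_id] = "up"  # a formerly down OSD rejoins here

    def check(self, now):
        for osd_id, t in self.last_seen.items():
            if now - t > HEARTBEAT_TIMEOUT:
                self.status[osd_id] = "down"
        return self.status

mon = Monitor()
mon.heartbeat("osd1", now=0.0)
mon.heartbeat("osd2", now=0.0)
mon.heartbeat("osd1", now=15.0)   # osd2 falls silent
print(mon.check(now=25.0))        # {'osd1': 'up', 'osd2': 'down'}
```

A later heartbeat from osd2 would flip its status back to "up", mirroring how a recovered OSD rejoins the Ceph cluster after fetching a fresh cluster map.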
Refer to the following URL for installing Ceph:
https://github.com/ceph/ceph
GlusterFS
GlusterFS is an open source distributed file system developed by the Gluster core team; the company behind it was acquired by Red Hat in 2011, and the project remains in active development and production use. GlusterFS is a scale-out network-attached storage file system, and has found applications in cloud computing, streaming media services, and content delivery networks.
Architecture
GlusterFS is different from the other DFSs. It has a client-server design in which there is no metadata
server. Instead, GlusterFS stores data and metadata on several devices attached to different servers.
The set of devices is called a volume which can be configured to stripe data into blocks and/or replicate
them. Thus, blocks will be distributed and/or replicated across several devices inside the volume.
Naming
GlusterFS does not manage metadata in a dedicated, centralised server; instead, it locates files algorithmically using the Elastic Hashing Algorithm (EHA) to provide a global namespace. EHA uses a hash function to convert a file's pathname into a fixed-length, uniform and unique value. Each storage device is assigned a range of values, allowing the system to store a file based on its hash value. For example, assume there are two storage devices, disk1 and disk2, which respectively store files with values from 1 to 20 and from 21 to 40. If the file Data.txt hashes to the value 30, it will be stored on disk2.
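The disk1/disk2 example above can be sketched in a few lines: hash the pathname, then look up which device owns the resulting value. The hash function and range boundaries below are illustrative, not the actual EHA internals:

```python
import hashlib

# Each device owns a contiguous range of hash values (assumed layout).
ranges = {"disk1": range(0, 20), "disk2": range(20, 40)}

def locate(pathname: str) -> str:
    """Map a pathname to the device that owns its hash value."""
    value = int(hashlib.md5(pathname.encode()).hexdigest(), 16) % 40
    for disk, r in ranges.items():
        if value in r:
            return disk
    raise ValueError("no device owns this hash value")

# Every client computes the same value independently, so no metadata
# server lookup is needed to find where a file lives.
print(locate("/data/Data.txt"))
```

The key property is that placement is a pure function of the pathname and the range map, which is what lets GlusterFS dispense with a metadata server entirely.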
API and client access
GlusterFS provides a REST API and a mount module (FUSE or the mount command) to give clients access to the system. Clients interact with a volume to store and retrieve files.
Cache consistency
GlusterFS does not use client-side caching, thus avoiding cache consistency issues.
Replication and synchronisation
Unlike the other DFSs, GlusterFS does not replicate data file by file, but relies on a RAID 1 scheme. It
makes several copies of a whole storage device onto others inside the same volume using synchronous
writes. That is, a volume is composed of several subsets, each containing an initial storage disk and its
mirrors. Therefore, the number of storage devices in a volume must be a multiple of the desired
replication factor. For example, assume we have four storage devices and the replication factor is two.
The first device is replicated onto the second, forming one subset; likewise, the third and fourth
devices form another subset.
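The grouping rule can be made concrete with a short sketch. The function name and device labels are invented for illustration; the constraint and the consecutive-grouping behaviour are as described above.

```python
# Hypothetical sketch of GlusterFS-style replicated volumes: the device count
# must be a multiple of the replication factor, and consecutive devices form
# mirrored (RAID 1 style) subsets.
def make_subsets(devices, replication):
    if len(devices) % replication != 0:
        raise ValueError("device count must be a multiple of the replication factor")
    return [devices[i:i + replication] for i in range(0, len(devices), replication)]

# Four devices, replication factor two: two mirrored subsets.
subsets = make_subsets(["d1", "d2", "d3", "d4"], replication=2)
print(subsets)  # [['d1', 'd2'], ['d3', 'd4']]
```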
Load balancing
GlusterFS uses a hash function to convert files' pathnames into values and assigns several logical
storage units to different sets of values. This is done uniformly, ensuring that each logical storage
unit holds roughly the same number of files. Thus, when a disk is added or removed, logical storage
units can be moved to a different physical device in order to rebalance the system.
Fault detection
In GlusterFS, when a server becomes unavailable, it is removed from the system and no I/O operations
can be performed on it.
Refer to following URL for installing GlusterFS:
https://github.com/gluster/glusterfs
Lustre File System
Lustre is a parallel distributed file system, generally used for large-scale cluster computing. The name
Lustre is a portmanteau of Linux and cluster. Lustre file system software is available under the GNU
General Public License (version 2 only) and provides high-performance file systems for computer clusters
ranging in size from small workgroup clusters to large-scale, multi-site clusters.
Because Lustre file systems have high performance and open licensing, they are often used in
supercomputers. Since June 2005, Lustre has consistently been used by at least half of the top ten, and
more than 60 of the top 100, fastest supercomputers in the world, including the world's No. 2 and No. 3
ranked TOP500 supercomputers in 2014, Titan and Sequoia.
Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of
client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per
second (TB/s) of aggregate I/O throughput. This makes Lustre file systems a popular choice for
businesses with large data centers, including those in industries such as meteorology, simulation, oil and
gas, life science, rich media, and finance.
Architecture
Lustre is a centralised distributed file system which differs from the other DFSs in that it does not
keep copies of data or metadata itself. It stores metadata on a shared storage target called the
Metadata Target (MDT), attached to two Metadata Servers (MDS), thus offering active/passive failover.
The MDS are the servers that handle metadata requests. Data are managed in the same way: they are
split into objects and distributed across several shared Object Storage Targets (OST), each of which
can be attached to several Object Storage Servers (OSS) to provide active/active failover. The OSS are
the servers that handle I/O requests.
Naming
Lustre's single global name space is provided to users by the MDS. Like HDFS or MooseFS, Lustre uses
inodes, together with extended attributes that map each file object name to its corresponding OSTs.
Clients are thereby informed of which OSTs to query for each requested piece of data.
API and client access
Lustre provides tools to mount the entire file system in user space (FUSE). File transfers and query
processing work in the same way as in the other centralised DFSs: clients first ask the MDS to locate
data and then perform I/O directly with the appropriate OSTs.
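The locate-then-read pattern can be modelled with a small toy. All class names, the stripe layout, and the data structures here are illustrative assumptions, not Lustre's actual API; the point is the two-step protocol: one metadata RPC, then direct I/O with the object servers.

```python
# Toy model of the centralised-DFS access pattern: ask the metadata
# server where a file's objects live, then read them directly.
class MDS:
    def __init__(self):
        # pathname -> ordered list of (ost_name, object_id) stripes
        self.layout = {"/data/a.bin": [("ost0", 17), ("ost1", 42)]}

    def locate(self, path):
        return self.layout[path]

class OST:
    def __init__(self):
        self.objects = {}

    def read(self, object_id):
        return self.objects.get(object_id, b"")

def client_read(mds, osts, path):
    stripes = mds.locate(path)  # step 1: a single metadata request
    # step 2: I/O goes straight to the OSTs, bypassing the MDS entirely
    return b"".join(osts[name].read(oid) for name, oid in stripes)

mds = MDS()
osts = {"ost0": OST(), "ost1": OST()}
osts["ost0"].objects[17] = b"hello "
osts["ost1"].objects[42] = b"world"
print(client_read(mds, osts, "/data/a.bin"))  # b'hello world'
```

Keeping bulk data off the metadata path is what lets this architecture scale I/O throughput with the number of object servers.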
Cache consistency
Lustre implements a Distributed Lock Manager (DLM) to manage cache consistency. This sophisticated
locking mechanism allows Lustre to support concurrent read and write access. For example, Lustre can
grant a write-back lock in a directory with little contention. A write-back lock allows clients to cache a
large number of updates in memory and commit them to disk later, thus reducing data transfers. In a
highly concurrently accessed directory, by contrast, Lustre can serve each request one by one to
provide strong data consistency.
Replication and synchronisation
Lustre does not ensure replication. Instead, it relies on independent software.
Load balancing
Lustre provides simple tools for load balancing. When an OST is unbalanced, that is, it uses more disk
space than the other OSTs, the MDS will choose OSTs with more free space to store new data.
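This balancing rule amounts to a greedy choice at allocation time. A minimal sketch, with invented OST names and free-space figures:

```python
# Sketch of the simple allocation rule described above: when placing a new
# object, prefer the OST that currently has the most free space.
def pick_ost(free_space):
    """free_space: mapping of OST name -> free bytes. Returns the chosen OST."""
    return max(free_space, key=free_space.get)

print(pick_ost({"ost0": 10, "ost1": 55, "ost2": 30}))  # ost1
```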
Fault detection
Lustre differs from other DFSs in that faults are detected during clients' operations (metadata
queries or data transfers) instead of during server-to-server communication. When a client requests a
file, it first contacts the MDS. If the latter is unavailable, the client asks an LDAP server which other
MDS it can query. The system then performs a failover: the first MDS is marked as dead and the
second becomes the active one. OST failures are handled in the same way: if a client sends data to an
OST that does not respond within a certain time, it is redirected to another OST.
Refer to following URL for installing Lustre File System:
https://gist.github.com/joshuar/4e283308c932ec62fc05
MooseFS
Moose File System (MooseFS) is an open source, POSIX-compliant distributed file system developed by
Core Technology. MooseFS aims to be a fault-tolerant, highly available, high-performance, scalable,
general-purpose network distributed file system for data centers. Initially proprietary software, it was
released to the public as open source on May 5, 2008.
Architecture
MooseFS works much like HDFS. It has a master server managing metadata and several chunk servers
storing and replicating data blocks. One small difference is that MooseFS provides failover between the
master server and the metalogger servers: machines which periodically download metadata from the
master so that they can be promoted to the new master in case of failure.
Naming
MooseFS manages the namespace as HDFS does. It stores metadata (permissions, last access time,
and so on) and the hierarchy of files and directories in the master's main memory, while keeping a
persistent copy on the metalogger. It provides users with a global name space of the system.
API and client access
MooseFS clients access the file system by mounting the name space (using the mfsmount command)
into their local file system. They can then perform the usual operations by communicating with the
master server, which redirects them to the chunk servers in a seamless way. mfsmount is based on the
FUSE mechanism.
Replication and synchronisation
In MooseFS, each file has a goal, that is, a specific number of copies to be maintained. When a client
writes data, the master server sends it a list of chunk servers in which the data will be stored. The
client then sends the data to the first chunk server, which orders the others to replicate the file
synchronously.
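The chained write path can be sketched as follows. The class and method names are invented for illustration; the shape of the protocol — client writes to the head of a master-chosen chain, and each server forwards synchronously to the next — is as described above.

```python
# Toy sketch of pipeline replication: the master picks a chain of chunk
# servers; the client writes only to the first, which forwards the data
# down the chain until every replica holds the chunk.
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def write(self, chunk_id, data, chain):
        self.chunks[chunk_id] = data       # store locally
        if chain:                          # then forward to the next replica
            chain[0].write(chunk_id, data, chain[1:])

def client_write(chain, chunk_id, data):
    """chain: the chunk servers chosen by the master, goal copies long."""
    chain[0].write(chunk_id, data, chain[1:])

servers = [ChunkServer(f"cs{i}") for i in range(3)]
client_write(servers, chunk_id=7, data=b"block")
# all three servers now hold chunk 7
```

Chaining keeps the client's upload bandwidth cost at one copy regardless of the replication goal.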
Load balancing
MooseFS provides load balancing. When a server is unavailable due to a failure, some data fall below
their goal. In this case, the system stores additional replicas on other chunk servers. Conversely, files
can exceed their goal; in that case, the system removes a copy from a chunk server. MooseFS also
maintains a data version number so that if a server becomes available again, mechanisms allow it to
update the data it stores. Finally, it is possible to dynamically add new data servers, which will be used
to store new files and replicas.
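The goal-maintenance logic reduces to comparing replica counts against the goal. A minimal sketch, with invented chunk identifiers and counts:

```python
# Sketch of MooseFS-style "goal" maintenance: chunks with fewer copies than
# their goal are scheduled for re-replication; chunks with more lose a copy.
def rebalance(copies, goal):
    """copies: chunk_id -> current replica count. Returns repair actions."""
    actions = {}
    for chunk, count in copies.items():
        if count < goal:
            actions[chunk] = ("replicate", goal - count)
        elif count > goal:
            actions[chunk] = ("remove", count - goal)
    return actions

print(rebalance({"c1": 1, "c2": 3, "c3": 2}, goal=2))
# {'c1': ('replicate', 1), 'c2': ('remove', 1)}
```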
Fault detection
In MooseFS, when a server becomes unavailable, it is put in quarantine and no I/O operations can be
performed on it.
Refer to following URL for installing MooseFS:
https://github.com/moosefs/moosefs
The following table summarises popular and widely used Distributed File Systems. It is essential to
understand that each of these may be suited to specific use cases; some of them are functionally
similar and serve the same purpose.
| Client | Written in | License | Access API |
|---|---|---|---|
| Ceph | C++ | LGPL | librados (C, C++, Python, Ruby), S3, Swift, FUSE |
| BeeGFS | C / C++ | FRAUNHOFER FS (FhGFS) EULA, GPLv2 client | POSIX |
| GlusterFS | C | GPLv3 | libglusterfs, FUSE, NFS, SMB, Swift, libgfapi |
| Infinit | C++ | Proprietary (to be open sourced in 2016) | FUSE, Installable File System, NFS/SMB, POSIX, CLI, SDK (libinfinit) |
| ObjectiveFS | C | Proprietary | POSIX, FUSE |
| MooseFS | C | GPLv2 | POSIX, FUSE |
| Quantcast File System | C | Apache License 2.0 | C++ client, FUSE (C++ server: MetaServer and ChunkServer are both in C++) |
| Lustre | C | GPLv2 | POSIX, liblustre, FUSE |
| OpenAFS | C | IBM Public License | Virtual file system, Installable File System |
| Tahoe-LAFS | Python | GNU GPL 2+ and other | HTTP (browser or CLI), SFTP, FTP, FUSE via SSHFS, pyfilesystem |
| HDFS | Java | Apache License 2.0 | Java and C client, HTTP |
| XtreemFS | Java, C++ | BSD License | libxtreemfs (Java, C++), FUSE |
| Ori | C, C++ | MIT | libori, FUSE |

Table: Features of Distributed File Systems