This is a seminar for the Advanced Operating Systems course at the University of Salerno. It presents the first cluster-based storage technology (NASD) and its evolution through to the development of the Google File System.
File Replication: High availability is a desirable feature of a good distributed file system, and file replication is the primary mechanism for improving file availability. Replication is a key strategy for improving reliability, fault tolerance and availability; duplicating files on multiple machines therefore improves both availability and performance.
Replicated file: A replicated file is a file that has multiple copies, with each copy located on a separate file server. Each copy in the set of copies that comprises a replicated file is referred to as a replica of the replicated file.
Replication is often confused with caching, probably because both deal with multiple copies of data. The two concepts have the following basic differences:
A replica is associated with a server, whereas a cached copy is associated with a client.
The existence of a cached copy depends primarily on locality in file access patterns, whereas the existence of a replica normally depends on availability and performance requirements.
Satyanarayanan [1992] distinguishes a replicated copy from a cached copy by calling them first-class replicas and second-class replicas, respectively.
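The replica/cache distinction above can be sketched in a few lines of Python. This is a hypothetical toy model (class and file names are invented, not any real system's API): first-class replicas live on servers and keep reads available across server failures, while the second-class cached copy lives on the client and exists only because of access locality.

```python
import random

class FileServer:
    """Holds first-class replicas; their existence serves availability goals."""
    def __init__(self, name):
        self.name = name
        self.replicas = {}  # filename -> contents

    def store(self, fname, data):
        self.replicas[fname] = data

    def read(self, fname):
        return self.replicas[fname]

class Client:
    """Holds second-class cached copies; their existence follows access locality."""
    def __init__(self, servers):
        self.servers = servers
        self.cache = {}

    def read(self, fname):
        if fname in self.cache:                 # locality: reuse the cached copy
            return self.cache[fname]
        live = [s for s in self.servers if fname in s.replicas]
        data = random.choice(live).read(fname)  # any replica will do
        self.cache[fname] = data
        return data

servers = [FileServer(f"s{i}") for i in range(3)]
for s in servers:                               # replicate the file on every server
    s.store("report.txt", b"quarterly numbers")

client = Client(servers)
assert client.read("report.txt") == b"quarterly numbers"
servers[0].replicas.clear()                     # one server fails...
assert client.read("report.txt") == b"quarterly numbers"  # ...reads still succeed
```

Note how losing one server leaves the file readable: that is the availability argument for replication in a nutshell.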
Google is a multi-billion dollar company. It's one of the big power players on the World Wide Web and beyond. The company relies on a distributed computing system to provide users with the infrastructure they need to access, create and alter data.
Surely Google buys state-of-the-art computers and servers to keep things running smoothly, right?
Wrong. The machines that power Google's operations aren't cutting-edge power computers with lots of bells and whistles. In fact, they're relatively inexpensive machines running on Linux operating systems. How can one of the most influential companies on the Web rely on cheap hardware? It's due to the Google File System (GFS), which capitalizes on the strengths of off-the-shelf servers while compensating for any hardware weaknesses. It's all in the design.
Google uses the GFS to organize and manipulate huge files and to allow application developers the research and development resources they require. The GFS is unique to Google and isn't for sale. But it could serve as a model for file systems for organizations with similar needs.
A simple replication-based mechanism has been used to achieve high data reliability in the Hadoop Distributed File System (HDFS). However, replication-based mechanisms impose a high disk-storage overhead, since they copy full blocks without regard to storage size. Studies have shown that erasure coding can provide more usable storage space when used as an alternative to replication, and that it can increase write throughput compared to replication. To improve both the space efficiency and the I/O performance of HDFS while preserving the same level of data reliability, we propose HDFS+, an erasure-coding-based Hadoop Distributed File System. The proposed scheme writes a full block on the primary DataNode and then performs erasure coding with a Vandermonde-based Reed-Solomon algorithm that divides data into m data fragments and encodes them into n data fragments (n > m), which are saved on n distinct DataNodes such that the original object can be reconstructed from any m fragments. The experimental results show that our scheme can save up to 33% of storage space while outperforming the original scheme in write performance by 1.4 times. Our scheme provides the same read performance as the original scheme as long as data can be read from the primary DataNode, even under single-node or double-node failure. Otherwise, the read performance of HDFS+ decreases to some extent. However, as the number of fragments increases, we show that the performance degradation becomes negligible.
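The key erasure-coding property the abstract relies on, that any m of the n fragments suffice to rebuild the block, can be illustrated with a deliberately tiny code. This sketch uses XOR parity with m = 2, n = 3 instead of the Vandermonde-based Reed-Solomon coding the paper actually uses; it only tolerates a single lost fragment, but it shows the reconstruct-from-any-m idea with 1.5x storage overhead instead of replication's 3x.

```python
def encode(data: bytes):
    """Split a block into m=2 data fragments plus one XOR parity fragment (n=3)."""
    if len(data) % 2:
        data += b"\x00"                        # pad to an even length
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d1, d2))
    return [d1, d2, parity]

def decode(fragments):
    """Rebuild the block from any 2 surviving fragments (None marks a lost one)."""
    d1, d2, p = fragments
    if d1 is None:
        d1 = bytes(a ^ b for a, b in zip(d2, p))  # XOR recovers the lost half
    if d2 is None:
        d2 = bytes(a ^ b for a, b in zip(d1, p))
    return d1 + d2

block = b"hello, hdfs+"
frags = encode(block)
assert decode([None, frags[1], frags[2]]) == block   # survive losing fragment 1
assert decode([frags[0], None, frags[2]]) == block   # survive losing fragment 2
```

A real deployment would place the three fragments on three distinct DataNodes; Reed-Solomon generalizes the same algebra so that any m of n fragments, not just these specific pairs, reconstruct the data.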
Coordinating Metadata Replication: Survival Strategy for Distributed Systems (Konstantin V. Shvachko)
Hadoop Summit, April 2014
Amsterdam, Netherlands
Just as the survival of living species depends on the transfer of essential knowledge within the community and between generations, the availability and reliability of a distributed computer system relies upon consistent replication of core metadata between its components. This presentation will highlight the implementation of a replication technique for the namespace of the Hadoop Distributed File System (HDFS). In HDFS, the namespace represented by the NameNode is decoupled from the data storage layer. While the data layer is conventionally replicated via block replication, the namespace remains a performance and availability bottleneck. Our replication technique relies on quorum-based consensus algorithms and provides an active-active model of high availability for HDFS where metadata requests (reads and writes) can be load-balanced between multiple instances of the NameNode. This session will also cover how the same techniques are extended to provide replication of metadata and data between geographically distributed data centers, providing global disaster recovery and continuous availability. Finally, we will review how consistent replication can be applied to advance other systems in the Apache Hadoop stack; e.g., how in HBase coordinated updates of regions selectively replicated on multiple RegionServers improve availability and overall cluster throughput.
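The quorum intuition behind an active-active NameNode setup like the one described above can be sketched as follows. This is an invented toy model, not HDFS code: with N replicas, write quorum W and read quorum R chosen so that R + W > N, every read quorum intersects every write quorum, so a read always observes the latest committed metadata update even though requests are load-balanced across nodes.

```python
class QuorumNamespace:
    """Toy active-active metadata store: N nodes, write quorum W, read quorum R."""
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n                      # quorum intersection guarantee
        self.nodes = [{} for _ in range(n)]   # each node: key -> (version, value)
        self.w, self.r = w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        for node in self.nodes[: self.w]:     # commit once W nodes acknowledge
            node[key] = (self.version, value)

    def read(self, key):
        # Query R nodes; the overlap with the write quorum holds the newest version.
        replies = [n[key] for n in self.nodes[-self.r :] if key in n]
        return max(replies)[1]                # highest version wins

ns = QuorumNamespace()
ns.write("/user/a", "dir")
assert ns.read("/user/a") == "dir"
ns.write("/user/a", "dir v2")
assert ns.read("/user/a") == "dir v2"        # read quorum sees the latest write
```

Real systems layer a consensus protocol on top to order concurrent writes; the quorum arithmetic above is the availability core that lets reads and writes be served by different NameNode instances.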
Threads,
System model,
Processor allocation,
Scheduling in distributed systems,
Load balancing and sharing approaches,
Fault tolerance,
Real-time distributed systems,
Process migration and related issues
There are different dimensions to the scalability of a distributed storage system: more data, more stored objects, more nodes, more load, additional data centers, etc. This presentation addresses the geographic scalability of HDFS. It describes unique techniques implemented at WANdisco, which allow scaling HDFS over multiple geographically distributed data centers for continuous availability. The distinguishing principle of our approach is that metadata is replicated synchronously between data centers using a coordination engine, while the data is copied over the WAN asynchronously. This allows strict consistency of the namespace on the one hand and fast LAN-speed data ingestion on the other. In this approach geographically separated parts of the system operate as a single HDFS cluster, where data can be actively accessed and updated from any data center. The presentation also covers advanced features such as selective data replication.
Extended version of presentation at Strata + Hadoop World. November 20, 2014. Barcelona, Spain.
http://strataconf.com/strataeu2014/public/schedule/detail/39174
The talk introduces the JBOD setup for Apache Kafka and shows how LinkedIn saved more than 30% of its Kafka storage cost by adopting it. The talk was given at the LinkedIn Streaming meetup in May 2017.
Cluster computing is a type of computing in which a group of computers is linked together so that the group behaves as a single system. There is a wide variety of reasons why people use cluster computing for different tasks; it is also used to ensure that a computing system is always available. It is unknown when the cluster computing concept was first developed, and several different organizations have claimed to have invented it.
The Google File System (GFS) presented in 2003 is the inspiration for the Hadoop Distributed File System (HDFS). Let's take a deep dive into GFS to better understand Hadoop.
2. Agenda
Context
Goals of design
NASD
NASD prototype
Distributed file systems on NASD
NASD parallel file system
Conclusions
A Cost-Effective,
High-Bandwidth Storage Architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang,
Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka, 1997-2001
3. Agenda
The File System
Motivations
Architecture
Benchmarks
Comparisons and conclusions
[Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung,
2003]
8. What is NASD?
Network-Attached Secure Disk
direct transfer to clients
secure interfaces via cryptographic support
asynchronous oversight
variable-size data objects map to blocks
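The "secure interfaces via cryptographic support" point can be sketched as a capability check. This is a hypothetical, minimal model (names and key scheme invented for illustration): the file manager seals a capability with a key it shares with the drive, so the drive can authorize client requests without a round trip to the manager.

```python
import hashlib
import hmac

# Hypothetical key shared between the file manager and one NASD drive.
SECRET = b"key-shared-by-file-manager-and-drive"

def issue_capability(object_id: int, rights: str) -> bytes:
    """File manager: seal (object_id, rights) for the drive."""
    msg = f"{object_id}:{rights}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).digest()

def drive_check(object_id: int, rights: str, capability: bytes) -> bool:
    """Drive: recompute the seal and compare; no file-manager contact."""
    expected = hmac.new(SECRET, f"{object_id}:{rights}".encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, capability)

cap = issue_capability(42, "read")
print(drive_check(42, "read", cap))   # valid capability is accepted
print(drive_check(42, "write", cap))  # rights mismatch is rejected
```

This captures the "asynchronous oversight" idea: policy decisions happen once at the file manager, while every data access is checked locally at the drive.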
10. NASD prototype
Based on the Unix inode interface
Network with 13 NASD drives
Each NASD runs on:
•DEC Alpha 3000, 133 MHz, 64 MB RAM
•2 x Seagate Medalist on a 5 MB/s SCSI bus
•Connected to 10 clients via ATM (155 Mb/s)
Ad hoc handler modules (~16K LOC)
12. DFS on NASD
Porting NFS and AFS to the NASD architecture:
o OK, no performance loss
o But there are concurrency limitations
Solution:
A new higher-level parallel file system
must be used…
13. NASD parallel file system
Scalable I/O low-level interface
Cheops as storage management layer
Exports the same object interfaces of NASD devices
Maps them to objects on devices
Maps striped objects
Supports concurrency control for multi-disk accesses
(10K loc)
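Cheops' mapping of striped objects onto per-device objects can be sketched as simple stripe arithmetic. A hypothetical sketch (stripe unit and layout are assumptions, not from the source; only the 13-drive count comes from the prototype):

```python
STRIPE_UNIT = 64 * 1024  # hypothetical stripe unit in bytes
NUM_DEVICES = 13         # the prototype's 13 NASD drives

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a striped object to (device, offset on device)."""
    stripe = offset // STRIPE_UNIT
    device = stripe % NUM_DEVICES
    # Offset on that device: full stripes already stored there, plus
    # the remainder inside the current stripe unit.
    on_device = (stripe // NUM_DEVICES) * STRIPE_UNIT + offset % STRIPE_UNIT
    return device, on_device

print(locate(0))                        # first stripe unit on device 0
print(locate(STRIPE_UNIT * 13 + 100))   # wraps back around to device 0
```

Concurrency control for multi-disk accesses then amounts to coordinating updates that span several `(device, offset)` ranges produced by this mapping.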
14. NASD parallel file system Test
Clustering data mining application
*Each NASD drive provides 6.2MB/s
15. Conclusions
High Scalability
Direct transfer to clients
Working prototype
Usable with existing file systems
But… very high costs:
•Network adapters
•ASIC microcontroller
•Workstation
increasing the total cost by over 80%
17. The Google File System
• Started with their Search Engine
• They provided new services like:
Google Video
Gmail
Google Maps, Earth
Google App Engine
… and many more
18. Design overview
Observing common operations in Google applications led
developers to make several assumptions:
Multiple clusters distributed worldwide
Fault tolerance and auto-recovery need to be built into the
systems, because component failures are frequent
A modest number of large files (100+ MB or multi-GB)
Workloads consist of either large streaming reads or small
random reads, while write operations are sequential and
append large quantities of data to files
Google applications and GFS should be co-designed
Producer – consumer pattern
20. GFS Architecture: Chunks
Similar to standard file system blocks, but much
larger
Size: 64 MB (configurable)
Advantages:
• Reduces clients' need to contact the
master
• A client may perform many operations on a
single chunk
• Fewer chunks means less metadata in the master
• No internal fragmentation, thanks to lazy space
allocation
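The client-side arithmetic behind "reduced need to contact the master" can be sketched in a few lines: the client translates a byte offset into a chunk index, and a whole read maps to a small set of chunk indices it can resolve with one master request. A minimal sketch (function names are invented for illustration):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def chunk_index(offset: int) -> int:
    """Translate a file byte offset into the chunk index sent to the master."""
    return offset // CHUNK_SIZE

def chunk_range(offset: int, length: int) -> range:
    """All chunk indices a read of `length` bytes at `offset` touches."""
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)

print(chunk_index(200 * 1024 * 1024))                 # byte 200 MB -> chunk 3
print(list(chunk_range(60 * 1024**2, 10 * 1024**2)))  # spans chunks [0, 1]
```

With 64 MB chunks, even a multi-gigabyte streaming read touches only a few dozen chunks, so the master is consulted rarely relative to the data moved.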
21. GFS Architecture: Chunks
Disadvantages:
• Small files, made of only a few
chunks, may be accessed many times (hot spots)
• Not a major issue, since Google apps mostly
read large multi-chunk files sequentially
• Moreover, this can be mitigated with a higher
replication factor
22. GFS Architecture: Master
A single process running on a separate machine
Stores all metadata in its RAM:
• File and chunk namespaces
• Mapping from files to chunks
• Chunk locations
• Access control information and file locking
• Chunk versioning (snapshot handling)
• And so on…
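The two central in-RAM structures, the file-to-chunks mapping and the chunk-to-replicas mapping, can be sketched with plain dictionaries. All names, handles, and server identifiers below are hypothetical placeholders:

```python
# file path -> ordered list of chunk handles
namespace: dict[str, list[str]] = {
    "/logs/web.0": ["handle-0001", "handle-0002"],
}
# chunk handle -> chunkservers holding a replica
locations: dict[str, list[str]] = {
    "handle-0001": ["cs-a", "cs-b", "cs-c"],
    "handle-0002": ["cs-b", "cs-d", "cs-e"],
}

def lookup(path: str, index: int) -> tuple[str, list[str]]:
    """What the master answers a client: chunk handle + replica locations."""
    handle = namespace[path][index]
    return handle, locations[handle]

print(lookup("/logs/web.0", 1))
```

Keeping both maps in RAM is what makes master lookups cheap; the replica locations are not persisted but rebuilt from chunkserver reports at startup.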
23. GFS Architecture: Master
The master has the following responsibilities:
Chunk creation, re-replication,
rebalancing and deletion for:
balancing space utilization and access speed
spreading replicas across racks to reduce
correlated failures (usually 3 copies of each chunk)
rebalancing data to smooth out storage and
request load
Persistent and replicated logging of critical
metadata updates
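"Spreading replicas across racks" can be sketched as round-robin placement over racks, so no two of the three replicas share a failure domain. A hypothetical sketch (rack layout and function name are invented for illustration):

```python
import itertools

# Hypothetical rack -> chunkserver layout.
racks = {
    "rack1": ["cs1", "cs2"],
    "rack2": ["cs3", "cs4"],
    "rack3": ["cs5"],
}

def place_replicas(n: int = 3) -> list[str]:
    """Pick n chunkservers, visiting racks round-robin so the first n
    picks all land on distinct racks."""
    rotation = itertools.chain.from_iterable(
        itertools.zip_longest(*racks.values()))
    servers = [s for s in rotation if s is not None]
    return servers[:n]

print(place_replicas())  # one server from each of the three racks
```

A real placement policy would also weigh disk utilization and recent creation load on each chunkserver, per the balancing goals above.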
24. GFS Architecture: M - CS
Communication
Master and chunkservers communicate regularly
(heartbeats) so the master can track their state:
o Is the chunkserver down?
o Are there disk failures on any chunkserver?
o Are any replicas corrupted?
o Which chunk replicas does a given chunkserver
store?
Moreover, the master handles garbage collection
and deletes «stale» replicas:
o The master logs the deletion, then renames the
target file to a hidden name
o A lazy GC removes the hidden files after a
given amount of time
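The rename-then-reclaim deletion scheme can be sketched directly. A minimal sketch with invented names; the grace period is an assumption (the GFS paper uses a default of three days):

```python
GRACE_SECONDS = 3 * 24 * 3600          # assumed grace period
files = {"/logs/web.0": b"...data..."}  # stands in for the namespace

def delete(path: str, now: float) -> None:
    """Log the deletion by renaming to a hidden, timestamped name."""
    files[f".deleted-{int(now)}-{path}"] = files.pop(path)

def gc_scan(now: float) -> None:
    """Background pass: reclaim hidden files older than the grace period."""
    for name in list(files):
        if name.startswith(".deleted-"):
            ts = int(name.split("-")[1])
            if now - ts > GRACE_SECONDS:
                del files[name]

delete("/logs/web.0", now=0)
gc_scan(now=GRACE_SECONDS + 1)
print(files)  # {} -- reclaimed only after the grace period elapses
```

The point of the indirection: deletion is just a cheap metadata rename on the critical path, and the expensive space reclamation happens lazily in the background, where it can also be undone during the grace period.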
25. GFS Architecture: M - CS
Communication
Server requests
The client retrieves metadata from the master for the
requested file
Read/write dataflows between client and
chunkserver are decoupled from the master's control
flow
The single master is no longer a bottleneck: its
involvement with reads and writes is minimized:
clients communicate directly with
chunkservers
the master logs operations as soon as they
are completed
Less than 64 bytes of metadata for each chunk
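A back-of-the-envelope check shows why under 64 bytes per chunk keeps the single master viable. The 1 PB figure below is an illustrative assumption, not from the slides:

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MB chunks
META_PER_CHUNK = 64         # < 64 bytes of master metadata per chunk

data = 1024**5              # assume 1 PB of stored file data
chunks = data // CHUNK_SIZE
print(chunks)                              # 16,777,216 chunks
print(chunks * META_PER_CHUNK / 1024**3)   # => 1.0 GB of master metadata
```

So even a petabyte of data needs only about a gigabyte of master RAM for chunk metadata, consistent with the tens of megabytes reported for the real clusters below.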
30. GFS Architecture: Writing
[Diagram: the application issues a write through the GFS client; the
client sends the file name and chunk index to the master, whose in-RAM
metadata yields the chunk handle plus the primary and secondary replica
locations; the client then pushes the data into the buffers of the
primary chunkserver and the two secondary chunkservers.]
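The write path in this slide can be sketched end to end. A hypothetical, heavily simplified model (all names invented; real GFS pipelines the data push and runs replicas on separate machines): the client resolves the chunk through the master, pushes data to every replica's buffer, then the primary applies the mutation and the secondaries apply it in the same order.

```python
# (file name, chunk index) -> (handle, primary, secondaries) -- master state
master = {("file", 0): ("handle-1", "primary", ["secondary-a", "secondary-b"])}
buffers = {}   # chunkserver -> data buffered but not yet applied
chunks = {}    # chunkserver -> applied chunk contents

def write(name: str, chunk_idx: int, data: bytes) -> None:
    handle, primary, secondaries = master[(name, chunk_idx)]
    # 1. Data flow: push data to all replicas' buffers,
    #    decoupled from the master's control flow.
    for server in [primary, *secondaries]:
        buffers[server] = data
    # 2. Control flow: the primary applies the mutation, then the
    #    secondaries apply it in the same serial order.
    for server in [primary, *secondaries]:
        chunks[server] = chunks.get(server, b"") + buffers.pop(server)

write("file", 0, b"record-1")
print(chunks)  # all three replicas hold the appended record
```

Separating step 1 (bulk data movement) from step 2 (a small ordering decision at the primary) is what keeps the master and the primary off the high-bandwidth path.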
34. Fault Tolerance
GFS has its own relaxed consistency
model:
Consistent: all replicas have the same
value
Defined: each replica reflects the
performed mutation in its entirety
GFS is highly available:
fast recovery (machines reboot quickly)
chunks replicated at least 3 times (take
that, RAID-6)
35. Benchmarking: small cluster
GFS tested on a small cluster:
1 master
16 chunkservers
16 clients
Server machines connected to a 100 Mbps central switch
Same for the client machines
The two switches are connected by a 1 Gbps link
37. Benchmarking: real-world cluster
Cluster A: 342 PCs
Used for research and development
Tasks last a few hours, reading TBs of data, processing
them and writing results back
Cluster B: 227 PCs
Continuously generates and processes multi-TB data sets
Typical tasks last longer than cluster A's tasks
38. Benchmarking: real-world cluster
Cluster                 A                    B
Chunkservers            342                  227
Available disk space    72 TB                180 TB
Used disk space         55 TB                155 TB
# of files              735,000              737,000
# of chunks             992,000              1,550,000
Metadata at CSs         13 GB                21 GB
Metadata at master      48 MB                60 MB
Read rate               580 MB/s (750 MB/s)  380 MB/s (1300 MB/s)
Write rate              30 MB/s              100 MB/s (x3 w/ replication)
Master ops              202~380 ops/s        347~533 ops/s
39. Benchmarking: recovery time
One chunkserver killed in cluster B:
o This chunkserver had 15,000 chunks
containing 600 GB of data
o All chunks were restored in 23.2 minutes,
with a replication rate of 440 MB/s
Two chunkservers killed in cluster B:
o Each with 16,000 chunks and 660 GB
of data; 266 chunks were left with a single replica
o These 266 chunks were re-replicated at
a higher priority within 2 minutes
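The quoted recovery rate is easy to sanity-check from the other two numbers on the slide (using 1 GB = 1000 MB; the small gap to 440 MB/s comes from rounding in the reported figures):

```python
# Cluster B single-chunkserver recovery: 600 GB restored in 23.2 minutes.
data_gb = 600
minutes = 23.2
rate = data_gb * 1000 / (minutes * 60)  # MB/s
print(round(rate))  # ~431 MB/s, consistent with the quoted 440 MB/s
```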
40. Comparison to other models
GFS vs RAID, xFS, GPFS, AFS and NASD:
• Like xFS and GPFS, GFS spreads file data across
storage servers
• But it is simpler, using only replication for redundancy
• Like AFS, it offers a location-independent namespace
• A centralized approach rather than xFS-style
distributed management
• Unlike NASD, commodity machines instead of
network-attached disks
• Lazily allocated fixed-size chunks rather than
variable-length objects
41. Conclusion
GFS demonstrates how to support large-scale
processing workloads on commodity hardware:
designed to tolerate frequent component
failures
optimised for huge files that are mostly
appended to and read sequentially
It has met Google's storage needs, and is therefore
good enough for them
GFS has massively influenced computer
science in the last few years