The Google File System (GFS) presented in 2003 is the inspiration for the Hadoop Distributed File System (HDFS). Let's take a deep dive into GFS to better understand Hadoop.
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
This is the complete information about data replication that you need; I focus on these topics:
What is replication?
Who uses it?
Types?
Implementation Methods?
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Presentation slides of the workshop on "Introduction to Pig" at Fifth Elephant, Bangalore, India on 26th July, 2012.
http://fifthelephant.in/2012/workshop-pig
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Learn various tips about the importance of data warehousing, data cleansing, extraction, etc. For more details visit: http://www.skylinecollege.com/business-analytics-course
Designed by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung of Google in 2002-03.
Provides fault tolerance, serving a large number of clients with high aggregate performance.
Google's work goes well beyond search.
Google stores its data on more than 15,000 pieces of commodity hardware.
The design handles Google's failure modes and other Google-specific challenges in its distributed file system.
Class lecture by Prof. Raj Jain on Big Data. The talk covers Why Big Data Now?, Big Data Applications, ACID Requirements, Terminology, Google File System, BigTable, MapReduce, MapReduce Optimization, Story of Hadoop, Hadoop, Apache Hadoop Tools, Apache Other Big Data Tools, Other Big Data Tools, Analytics, Types of Databases, Relational Databases and SQL, Non-relational Databases, NewSQL Databases, Columnar Databases. Video recording available in YouTube.
This Hadoop presentation will help you understand the different tools present in the Hadoop ecosystem. This Hadoop video will take you through an overview of the important tools of the Hadoop ecosystem, which include Hadoop HDFS, Hadoop Pig, Hadoop Yarn, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger, and Oozie, and will also discuss the architecture of these tools. It will cover the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming and more. Now, let us get started and understand each of these tools in detail.
Below topics are explained in this Hadoop ecosystem presentation:
1. What is Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop Yarn (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Learn Spark SQL, creating, transforming, and querying Data frames
14. Understand the common use-cases of Spark and the various interactive algorithms
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling the storage management of HDFS: the centralized scheme within the NameNode becomes the main bottleneck and limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its ways, and knowing them up front can save valuable time. This session is a rundown of the ever-recurring lessons learned from running various Hadoop clusters in production since version 0.15.
What to expect from Hadoop - and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community and pointers to interesting projects that complement the Hadoop experience.
My presentation for the first user group meeting of our lab's Big Data IWT TETRA project [*]. In the presentation, I gave a demo of Cloudera Manager, discussed 4 micro benchmarks and finalized the presentation with an overview of the Big Bench benchmark.
[*] For more information on what IWT TETRA funding exactly is, see http://www.iwt.be/english/funding/subsidy/tetra
Best Hadoop institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses by real-time faculty in Bangalore.
In this session you will learn:
History of Hadoop
Hadoop Ecosystem
Hadoop Animal Planet
What is Hadoop?
Distinctions of Hadoop
Hadoop Components
The Hadoop Distributed Filesystem
Design of HDFS
When Not to use Hadoop?
HDFS Concepts
Anatomy of a File Read
Anatomy of a File Write
Replication & Rack awareness
Mapreduce Components
Typical Mapreduce Job
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions... (Simplilearn)
This video on Hadoop interview questions part-1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
A brief introduction to Hadoop distributed file system. How a file is broken into blocks, written and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS-1.x
A simple replication-based mechanism has been used to achieve high data reliability in the Hadoop Distributed File System (HDFS). However, replication-based mechanisms have a high disk storage requirement, since they make copies of full blocks without regard to storage size. Studies have shown that an erasure-coding mechanism can provide more storage space when used as an alternative to replication. It can also increase write throughput compared to a replication mechanism. To improve both the space efficiency and the I/O performance of HDFS while preserving the same level of data reliability, we propose HDFS+, an erasure-coding-based Hadoop Distributed File System. The proposed scheme writes a full block on the primary DataNode and then performs erasure coding with the Vandermonde-based Reed-Solomon algorithm, which divides data into m data fragments and encodes them into n data fragments (n > m), saved in N distinct DataNodes, such that the original object can be reconstructed from any m fragments. The experimental results show that our scheme can save up to 33% of storage space while outperforming the original scheme in write performance by 1.4 times. Our scheme provides the same read performance as the original scheme as long as data can be read from the primary DataNode, even under single-node or double-node failure. Otherwise, the read performance of HDFS+ decreases to some extent. However, as the number of fragments increases, we show that the performance degradation becomes negligible.
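As a rough illustration of the space argument (generic figures, not the exact configuration evaluated in the work above): triple replication stores every byte three times, while a Reed-Solomon code with m data fragments and n − m parity fragments stores each byte roughly n/m times.

$$\text{raw space}_{\,3\text{-way replication}} = 3\times \qquad\text{vs.}\qquad \text{raw space}_{\,\mathrm{RS}(m,n)} \approx \frac{n}{m}\times, \qquad \text{e.g. } m=6,\ n=9 \;\Rightarrow\; 1.5\times .$$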
Hadoop institutes: Kelly Technologies is the best Hadoop training institute in Hyderabad, providing Hadoop training by real-time faculty in Hyderabad.
HDFS Design Principles
1. HDFS Design Principles
The Scale-out-Ability of Distributed Storage
Konstantin V. Shvachko
May 23, 2012
SVForum
Software Architecture & Platform SIG
2. Big Data
Computations that need the power of many computers
Large datasets: hundreds of TBs, tens of PBs
Or use of thousands of CPUs in parallel
Or both
Big Data management, storage and analytics
Cluster as a computer
2
3. Hadoop is an ecosystem of tools for processing
“Big Data”
Hadoop is an open source project
What is Apache Hadoop
3
4. The Hadoop Family
HDFS Distributed file system
MapReduce Distributed computation
Zookeeper Distributed coordination
HBase Column store
Pig Dataflow language, SQL
Hive Data warehouse, SQL
Oozie Complex job workflow
BigTop Packaging and testing
4
5. Hadoop: Architecture Principles
Linear scalability: more nodes can do more work within the same time
Linear on data size:
Linear on compute resources:
Move computation to data
Minimize expensive data transfers
Data are large, programs are small
Reliability and Availability: Failures are common
A drive fails every 3 years => probability of failing today ≈ 1/1000
How many drives fail per day on a 1000-node cluster with 10 drives per node? (worked out below)
Simple computational model
hides complexity in efficient execution framework
Sequential data processing (avoid random reads)
5
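A quick back-of-envelope answer to the question above, using only the numbers on this slide:

$$P(\text{drive fails today}) \approx \frac{1}{3 \times 365} \approx \frac{1}{1000}, \qquad 1000 \text{ nodes} \times 10 \text{ drives} = 10{,}000 \text{ drives}, \qquad E[\text{failures per day}] \approx 10{,}000 \times \frac{1}{1000} = 10 .$$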
6. Hadoop Core
A reliable, scalable, high performance distributed computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
With more sophisticated layers on top
MapReduce – distributed computation framework
Hadoop scales computation capacity, storage capacity, and I/O bandwidth
by adding commodity servers.
Divide-and-conquer using lots of commodity hardware
6
9. Hadoop Distributed File System
The name space is a hierarchy of files and directories
Files are divided into blocks (typically 128 MB)
Namespace (metadata) is decoupled from data
Lots of fast namespace operations, not slowed down by
Data streaming
Single NameNode keeps the entire name space in RAM
DataNodes store block replicas as files on local drives
Blocks are replicated on 3 DataNodes for redundancy and availability
9
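A minimal sketch of how a client can observe this file/block/replica structure through the public HDFS Java API; the NameNode address and file path below are placeholders, not values from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/users/shv/data.log");         // hypothetical file
    FileStatus status = fs.getFileStatus(file);

    // Each BlockLocation describes one block of the file and the DataNodes holding its replicas
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```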
10. NameNode Transient State
NameNode RAM holds three transient structures (shown on the slide with a small example):
Hierarchical Namespace – the directory tree, e.g. /apps, /hbase, /hive, /users/shv
Block Manager – map from block to replica locations, e.g. blk_123_001 -> dn-1, dn-2, dn-3; blk_234_002 -> dn-11, dn-12, dn-13; blk_345_003 -> dn-101, dn-102, dn-103
Live DataNodes – per-node state reported via heartbeats: disk used, disk free, xceiver count
11. NameNode Persistent State
The durability of the name space is maintained by a
write-ahead journal and checkpoints
Journal transactions are persisted into edits file before replying to the client
Checkpoints are periodically written to fsimage file
Handled by Checkpointer, SecondaryNameNode
Block locations discovered from DataNodes during startup via block reports.
Not persisted on NameNode
Types of persistent storage devices
Local hard drive
Remote drive or NFS filer
BackupNode
Multiple storage directories
Two on local drives, and one remote server, typically NFS filer
11
12. DataNodes
DataNodes register with the NameNode, and provide periodic block reports
that list the block replicas on hand
block report contains block id, generation stamp and length for each replica
DataNodes send heartbeats to the NameNode every 3 seconds to confirm they are alive
If no heartbeat is received during a 10-minute interval, the node is
presumed to be lost, and the replicas hosted by that node are considered unavailable
NameNode schedules re-replication of lost replicas
Heartbeat responses give instructions for managing replicas
Replicate blocks to other nodes
Remove local block replicas
Re-register or shut down the node
Send an urgent block report
12
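A small arithmetic check on the intervals above:

$$\frac{10 \text{ min} \times 60 \text{ s/min}}{3 \text{ s per heartbeat}} = 200 \text{ consecutive missed heartbeats before a DataNode is presumed lost.}$$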
13. HDFS Client
Supports conventional file system operation
Create, read, write, delete files
Create, delete directories
Rename files and directories
Permissions
Modification and access times for files
Quotas: 1) namespace, 2) disk space
Per-file, when created
Replication factor – can be changed later
Block size – cannot be reset after creation
Block replica locations are exposed to external applications
13
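A hedged sketch of these per-file settings through the Java FileSystem API; the path and parameter values are illustrative, not taken from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOptions {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/users/shv/events.log");        // hypothetical file

    // Replication factor and block size are chosen per file at creation time
    short replication = 3;
    long blockSize = 128L * 1024 * 1024;                   // 128 MB
    int bufferSize = 4096;
    FSDataOutputStream out =
        fs.create(file, true, bufferSize, replication, blockSize);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Replication can be changed later; block size cannot be reset after creation
    fs.setReplication(file, (short) 2);
    fs.close();
  }
}
```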
14. HDFS Read
To read a block, the client requests the list of replica locations from the
NameNode
Then it pulls data from a replica on one of the DataNodes
14
15. HDFS Read
Open file returns DFSInputStream
DFSInputStream for a current block fetches replica locations from
NameNode
10 at a time
Client caches replica locations
Replica Locations are sorted by their proximity to the client
Choose first location
Open a socket stream to chosen DataNode, read bytes from the stream
If the read fails, add the node to the dead DataNodes list
Choose the next DataNode in the list
Retry 2 times
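A minimal read sketch: the replica-location lookup, proximity sorting, and retry logic described above happen inside the stream returned by open(), not in application code. The file path is a placeholder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() returns a seekable stream (backed by DFSInputStream on HDFS)
    try (FSDataInputStream in = fs.open(new Path("/users/shv/data.log"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```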
16. Replica Location Awareness
MapReduce schedules a task assigned to process block B to a DataNode
serving a replica of B
Local access to data
16
Diagram: the JobTracker schedules each Task on a TaskTracker whose co-located DataNode holds a replica of the task's Block; the NameNode supplies the replica locations.
17. HDFS Write
To write a block of a file, the client requests a list of candidate DataNodes
from the NameNode, and organizes a write pipeline.
17
18. HDFS Write
Create file in the namespace
Call addBlock() to get next block
NN returns prospective replica locations sorted by proximity to the client
Client creates a pipeline for streaming data to DataNodes
HDFS client writes into internal buffer and forms a queue of Packets
DataStreamer sends a packet to DN1 as it becomes available
DN1 streams to DN2 the same way, and so on
If a node fails, the pipeline is recreated with the remaining nodes
as long as at least one node remains
Re-replication of the missing replicas is handled later by the NameNode
18
19. Write Leases
HDFS implements a single-writer, multiple-reader model.
HDFS client maintains a lease on files it opened for write
Only one client can hold a lease on a single file
Client periodically renews the lease by sending heartbeats to the NameNode
Lease expiration:
Until soft limit expires client has exclusive access to the file
After soft limit (10 min): any client can reclaim the lease
After hard limit (1 hour): the NameNode forcibly closes the file and revokes the lease
Writer's lease does not prevent other clients from reading the file
19
20. Append to a File
Original implementation supported write-once semantics
After the file is closed, the bytes written cannot be altered or removed
Now files can be modified by reopening for append
Block modifications during appends use the copy-on-write technique
Last block is copied into temp location and modified
When “full” it is copied into its permanent location
HDFS provides consistent visibility of data for readers before file is closed
hflush operation provides the visibility guarantee
On hflush current packet is immediately pushed to the pipeline
hflush waits until all DataNodes successfully receive the packet
hsync also guarantees the data is persisted to local disks on DataNodes
20
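A short sketch of the visibility and durability calls described above, using the public Java stream API; the path is a placeholder and append support must be enabled on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendAndFlush {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/users/shv/app.log");   // hypothetical existing file

    // Reopen an existing file for append
    try (FSDataOutputStream out = fs.append(log)) {
      out.writeBytes("new record\n");
      out.hflush();   // visible to new readers once all DataNodes in the pipeline acknowledge
      out.hsync();    // additionally persisted to local disks on the DataNodes
    }
    fs.close();
  }
}
```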
21. Block Placement Policy
Cluster topology
Hierarchal grouping of nodes according to
network distance
Default block placement policy - a tradeoff
between minimizing the write cost,
and maximizing data reliability, availability and aggregate read bandwidth
1. First replica on the node local to the writer
2. Second and third replicas on two different nodes in a different rack
3. The rest are placed on random nodes, with restrictions:
no more than one replica of the same block is placed on any one node, and
no more than two replicas are placed in the same rack (if there are enough racks)
HDFS provides a configurable block placement policy interface
experimental
21
Diagram: cluster topology with root “/” above two racks – Rack 0 (DN00, DN01, DN02) and Rack 1 (DN10, DN11, DN12).
22. System Integrity
Namespace ID
a unique cluster id common for cluster components (NN and DNs)
Namespace ID assigned to the file system at format time
Prevents DNs from other clusters from joining this cluster
Storage ID
DataNode persistently stores its unique (per cluster) Storage ID
makes DN recognizable even if it is restarted with a different IP address or port
assigned to the DataNode when it registers with the NameNode for the first time
Software Version (build version)
Different software versions of NN and DN are incompatible
Starting from Hadoop-2: version compatibility for rolling upgrades
Data integrity via Block Checksums
22
23. Cluster Startup
NameNode startup
Read image, replay journal, write new image and empty journal
Enter SafeMode
DataNode startup
Handshake: check Namespace ID and Software Version
Registration: NameNode records Storage ID and address of DN
Send initial block report
SafeMode – read-only mode for NameNode
Disallows modifications of the namespace
Disallows block replication or deletion
Prevents unnecessary block replications until the majority of blocks are reported
Minimally replicated blocks
SafeMode threshold, extension
Manual SafeMode during unforeseen circumstances
23
24. Block Management
Ensure that each block always has the intended number of replicas
Conform with block placement policy (BPP)
Replication is updated when a block report is received
Over-replicated blocks: choose replica to remove
Balance storage utilization across nodes without reducing the block’s availability
Try not to reduce the number of racks that host replicas
Remove replica from DataNode with longest heartbeat or least available space
Under-replicated blocks: place into the replication priority queue
Fewer replicas means higher priority in the queue
Minimize the cost of replica creation while conforming with BPP
Mis-replicated block, not to BPP
Missing block: nothing to do
Corrupt block: try to replicate good replicas, keep corrupt replicas as is
24
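A toy sketch (not HDFS code) of the under-replication priority idea on this slide: blocks with fewer live replicas come out of the queue first.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class ReplicationQueueSketch {
  // Minimal stand-in for an under-replicated block entry
  static final class UnderReplicatedBlock {
    final String blockId;
    final int liveReplicas;
    final int targetReplicas;
    UnderReplicatedBlock(String blockId, int liveReplicas, int targetReplicas) {
      this.blockId = blockId;
      this.liveReplicas = liveReplicas;
      this.targetReplicas = targetReplicas;
    }
  }

  public static void main(String[] args) {
    // Fewer live replicas => higher replication priority
    PriorityQueue<UnderReplicatedBlock> queue = new PriorityQueue<>(
        Comparator.comparingInt((UnderReplicatedBlock b) -> b.liveReplicas));
    queue.add(new UnderReplicatedBlock("blk_123_001", 2, 3));
    queue.add(new UnderReplicatedBlock("blk_234_002", 1, 3)); // more urgent: one replica left
    while (!queue.isEmpty()) {
      UnderReplicatedBlock b = queue.poll();
      System.out.println("schedule " + (b.targetReplicas - b.liveReplicas)
          + " new replica(s) of " + b.blockId);
    }
  }
}
```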
25. HDFS Snapshots: Software Upgrades
Snapshots protect against data corruption/loss during software upgrades
allow rollback if a software upgrade goes bad
Layout Version
identifies the data representation formats
persistently stored in the NN’s and the DNs’ storage directories
NameNode image snapshot
Start hadoop namenode -upgrade
Read checkpoint image and journal based on persisted Layout Version
o New software can always read old layouts
Rename current storage directory to previous
Save image into new current
DataNode upgrades
Follow NameNode instructions to upgrade
Create a new storage directory and hard link existing block files into it
25
26. Administration
Fsck verifies
Missing blocks, block replication per file
Block placement policy
Reporting tool, does not fix problems
Decommission
Safe removal of a DataNode from the cluster
Guarantees replication of blocks to other nodes from the one being removed
Balancer
Rebalancing of used disk space when new empty nodes are added to the cluster
Even distribution of used space between DataNodes
Block Scanner
Block checksums provide data integrity
Readers verify checksums on the client and report errors to the NameNode
Other blocks periodically verified by the scanner
26
27. Hadoop Size
Y! cluster 2010
70 million files, 80 million blocks
15 PB capacity
4000+ nodes. 24,000 clients
50 GB heap for NN
Data warehouse Hadoop cluster at Facebook 2010
55 million files, 80 million blocks. Estimate 200 million objects (files + blocks)
2000 nodes. 21 PB capacity, 30,000 clients
108 GB heap for NN should allow for 400 million objects
Analytics Cluster at eBay (Ares)
77 million files, 85 million blocks
1000 nodes: 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU
Cluster capacity 19 PB raw disk space
Runs up to 38,000 MapReduce tasks simultaneously
27
30. The Future: Hadoop 2
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
Cluster is a family of volumes with shared block storage layer
User sees volumes as isolated file systems
ViewFS: the client-side mount table
Federated approach provides a static partitioning of the federated namespace
High Availability for NameNode
Next Generation MapReduce
Separation of JobTracker functions
1. Job scheduling and resource allocation
o Fundamentally centralized
2. Job monitoring and job life-cycle coordination
o Delegate coordination of different jobs to other nodes
Dynamic partitioning of cluster resources: no fixed slots
30
34. Why Is High Availability Important?
Nothing is perfect:
Applications and servers crash
Avoid downtime
Conventional for traditional RDB and
enterprise storage systems
Industry standard requirement
34
35. And Why it is Not?
Scheduled downtime dominates Unscheduled
OS maintenance
Configuration changes
Other reasons for Unscheduled Downtime
60 incidents in 500 days on 30,000 nodes
24 Full GC – the majority
System bugs / Bad application / Insufficient resources
“Data Availability and Durability with HDFS”
R. J. Chansler USENIX ;login: February, 2012
Pretty reliable
35
36. NameNode HA Challenge
Naïve Approach
Start new NameNode on the spare host, when the primary NameNode dies:
Use LinuxHA or VCS for failover
Not good enough:
NameNode startup may take up to 1 hour
Read the namespace image and the journal edits: 20 min
Wait for block reports from DataNodes (SafeMode): 30 min
36
37. Failover Classification
37
Manual-Cold (or no-HA) – an operator manually shuts down and restarts
the cluster when the active NameNode fails.
Automatic-Cold – save the namespace image and the journal into a
shared storage device, and use standard HA software for failover.
It can take up to an hour to restart the NameNode.
Manual-Hot – the entire file system metadata is fully synchronized on both
active and standby nodes, operator manually issues a command to failover
to the standby node when active fails.
Automatic-Hot – the real HA, provides fast and completely automated
failover to the hot standby.
Warm HA – the BackupNode maintains an up-to-date namespace fully
synchronized with the active NameNode. The BN rediscovers block locations from
DataNode block reports during failover. May take 20-30 minutes.
38. HA: State of the Art
Manual failover is a routine maintenance procedure: Hadoop Wiki
Automatic-Cold HA first implemented at ContextWeb
uses DRBD for mirroring local disk drives between two nodes and Linux-HA
as the failover engine
AvatarNode from Facebook – a manual-hot HA solution.
Used for planned NameNode software upgrades without downtime
Five proprietary installations running hot HA
Designs:
HDFS-1623. High Availability Framework for HDFS NN
HDFS-2064. Warm HA NameNode going Hot
HDFS-2124. NameNode HA using BackupNode as Hot Standby
Current implementation Hadoop-2
Manual failover with shared NFS storage
In progress: no NFS dependency, automatic failover based on internal algorithms
38
39. Automatic-Hot HA: the Minimalistic Approach
Standard HA software
LinuxHA, VCS, Keepalived
StandbyNode
keeps the up-to-date image of the namespace via Journal stream
available for read-only access
can become active NN
LoadReplicator
DataNodes send heartbeats / reports to both NameNode and StandbyNode
VIPs are assigned to the cluster nodes by their role:
NameNode – nn.vip.host.com
StandbyNode – sbn.vip.host.com
IP-failover
Primary node is always the one that has the NameNode VIP
Rely on proven HA software
39
41. Limitations of the Implementation
Single master architecture: a constraining resource
Limit to the number of namespace objects
A NameNode object (file or block) requires < 200 bytes in RAM
Block to file ratio is shrinking: 2 -> 1.5 -> 1.2
64 GB of RAM yields: 100 million files; 200 million blocks
Referencing 20 PB of data with a block-to-file ratio of 1.5 and replication of 3
Limits for linear performance growth
linear increase in # of workers puts a higher workload on the single NameNode
Single NameNode can be saturated by a handful of clients
Hadoop MapReduce framework reached its scalability limit at 40,000 clients
Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
“HDFS Scalability: The limits to growth” USENIX ;login: 2010
41
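A back-of-envelope version of the sizing argument above, treating the slide's figures as rough estimates:

$$(10^{8} \text{ files} + 2\times 10^{8} \text{ blocks}) \times 200 \text{ B} \approx 60 \text{ GB of NameNode RAM, fitting a 64 GB heap,}$$
$$20 \text{ PB of data} \times 3 \text{ replicas} = 60 \text{ PB of raw cluster storage.}$$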
42. From Horizontal to Vertical Scaling
Horizontal scaling is limited by single-master-architecture
Vertical scaling leads to cluster size shrinking
While Storage capacities, Compute power, and Cost remain constant
42
Chart: largest production Hadoop clusters shrinking after the horizontal scalability limit was reached –
2008: Yahoo, 4000-node cluster
2010: Facebook, 2000 nodes
2011: eBay, 1000 nodes
2013: clusters of 500 nodes
43. Namespace Partitioning
Static: Federation
Directory sub-trees are statically distributed
between separate instances of FSs
Relocating sub-trees without copying is
challenging
Scale x10: billions of files
Dynamic
Files, directory sub-trees can move automatically
between nodes based on their utilization or load
balancing requirements
Files can be relocated without copying data blocks
Scale x100: 100s of billions of files
Orthogonal independent approaches.
Federation of distributed namespaces is possible
43
44. Distributed Metadata: Known Solutions
Ceph
Metadata stored on OSD
MDS cache metadata
Dynamic Metadata Partitioning
GFS Colossus from Google (S. Quinlan and J. Dean)
100 million files per metadata server
Hundreds of servers
Lustre
Plans to release clustered namespace
Code ready
VoldFS, CassFS, MySQL – prototypes
44
45. HBase Overview
Distributed table storage system
Tables: big, sparse, loosely structured
Collection of rows
Has arbitrary number of columns
Tables are horizontally partitioned into regions. Dynamic partitioning
Columns can be grouped into Column families
Vertical partition of the table
Distributed cache: Regions loaded into RAM on cluster nodes
Timestamp: user-defined cell versioning, the 3rd dimension.
Cell id: <row_key, column_key, timestamp>
45
46. Giraffa File System
Goal: build from existing building blocks
minimize changes to existing components
HDFS + HBase = Giraffa
Store metadata in HBase table
Dynamic table partitioning into regions
Cached in RAM for fast access
Store data in blocks on HDFS DataNodes
Efficient data streaming
Use NameNodes as block managers
Flat namespace of block IDs – easy to partition
Handle communication with DataNodes
Perform block replication
46
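Purely as an illustration of the “file metadata as HBase rows” idea on this slide — this is not Giraffa's actual schema or API; the table name, column family, and fields are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("namespace"))) {  // invented table name
      // One row per file, keyed by full path; the block list is kept as a cell value
      Put row = new Put(Bytes.toBytes("/users/shv/data.log"));
      row.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("length"),
          Bytes.toBytes(268435456L));
      row.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("blocks"),
          Bytes.toBytes("blk_123_001,blk_234_002"));
      table.put(row);
    }
  }
}
```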
48. Giraffa Facts
HBase * HDFS = high scalability
More data & more files
High Availability
no SPOF, load balancing
Single cluster
no management overhead for operating more nodes
48
                      HDFS          Federated HDFS   Giraffa
Space                 25 PB         120 PB           1 EB = 1000 PB
Files + blocks        200 million   1 billion        100 billion
Concurrent Clients    40,000        100,000          1 million