Big Data raises challenges about how to process such a vast pool of raw data and how to derive value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
This presentation helps you understand the basics of Hadoop.
What is Big Data? How does Google search so fast, and what is the MapReduce algorithm? All of these questions are answered in the presentation.
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
This presentation describes the company where I did my summer training and covers what big data is, why we use it, its challenges and issues, solutions to those issues, and tools such as Hadoop, Docker, and Ansible.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
This is an updated version of Amr's Hadoop presentation. Amr gave this talk recently at the NASA CIDU event, the TDWI LA Chapter, and Netflix HQ. You should watch the PowerPoint version, as it has animations. The slides also include handout notes with additional information.
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Hadoop MapReduce Performance Enhancement Using In-Node Combiners
While advanced analysis of large datasets is in high demand, data sizes have surpassed the capabilities of conventional software and hardware. The Hadoop framework distributes large datasets over multiple commodity servers and performs parallel computations. We discuss the I/O bottlenecks of the Hadoop framework and propose methods for enhancing I/O performance. A proven approach is to cache data to maximize the memory-locality of all map tasks. We introduce an approach to optimize I/O: the in-node combining design, which extends the traditional combiner to the node level. The in-node combiner reduces the total number of intermediate results and curtails network traffic between mappers and reducers.
Design Issues and Challenges of Peer-to-Peer Video on Demand Systems
P2P media streaming and file downloading are among the most popular applications on the Internet. These systems reduce server load and provide scalable content distribution. P2P networking is a new paradigm for building distributed applications. This paper describes the design requirements for P2P media streaming and compares live and video-on-demand systems based on their architecture. We describe and study the traditional approaches to P2P streaming systems, their design issues and challenges, and current approaches for providing P2P VoD services.
Survey of Parallel Data Processing in Context with MapReduce
MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation with two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing, and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallelizing computation on large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications, describes the workflow of the MapReduce process, and studies important issues, like fault tolerance, in more detail, including an illustration of how MapReduce works. The data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. The author addresses the placement of data across nodes in a way that gives each node a balanced data-processing load, stored in a parallel manner. For a data-intensive application running on a Hadoop MapReduce cluster, the author illustrates how data placement is done in the Hadoop architecture, the role of MapReduce within that architecture, and the amount of data stored in each node to achieve improved data-processing performance.
This presentation provides a comprehensive introduction to the Hadoop Distributed System, a powerful and widely used framework for distributed storage and processing of large-scale data. Hadoop has revolutionized the way organizations manage and analyze data, making it a crucial tool in the field of big data and data analytics.
In this presentation, we explore the key components and features of Hadoop, shedding light on the fundamental building blocks that enable its exceptional data processing capabilities. We cover essential topics, including the Hadoop Distributed File System (HDFS), MapReduce, YARN (Yet Another Resource Negotiator), and Hadoop Ecosystem components like Hive, Pig, and Spark.
Survey on Performance of Hadoop MapReduce Optimization Methods
Abstract: Hadoop is an open-source software framework for storing and processing large datasets on clusters of commodity hardware. Hadoop provides a reliable shared storage and analysis system: storage is provided by HDFS and analysis by MapReduce. MapReduce frameworks are foraying into the domain of high-performance computing with stringent non-functional requirements, namely execution times and throughput. MapReduce provides a simple programming interface with two functions: map and reduce. The functions can be executed automatically in parallel on a cluster without requiring any intervention from the programmer. Moreover, MapReduce offers other benefits, including load balancing, high scalability, and fault tolerance. The challenge arises when data is produced dynamically and continuously from different geographical locations. For dynamically generated data, an efficient algorithm is needed to guide the timely transfer of data into the cloud; for geo-dispersed data sets, one must select the best data center to aggregate all data onto, given that a MapReduce-like framework is most efficient when the data to be processed is all in one place rather than spread across data centers, due to the enormous overhead of moving data between data centers in the shuffle and reduce stages. Recently, many researchers have implemented and deployed data-intensive and/or computation-intensive algorithms on the MapReduce parallel computing framework for high processing efficiency.
What is Hadoop?
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for the deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. It is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID at a server level).
Suppose your data is stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB. And then 100 GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10 TB, and then 100 TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Nor is it suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
A Seminar Report
On
HADOOP
By Varun Narang
MA 399 Seminar
IIT Guwahati
Roll Number: 09012332
Index of Topics:
1. Abstract
2. Introduction
3. What is MapReduce?
4. HDFS
• Assumptions
• Design
• Concepts
• The Communication Protocols
• Robustness
• Cluster Rebalancing
• Data Integrity
• Metadata Disk Failure
• Snapshots
5. Data Organisation
• Data Blocks
• Staging
• Replication Pipelining
6. Accessibility
7. Space Reclamation
• File Deletes and Undeletes
• Decrease Replication Factor
• Hadoop Filesystems
8. Hadoop Archives
Bibliography
1) Hadoop: The Definitive Guide, O'Reilly / Yahoo! Press, 2009.
2) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat.
3) Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce, Delip Rao and David Yarowsky, Dept. of Computer Science, Johns Hopkins University.
4) Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, University of California, Berkeley.
5) MapReduce in a Week, Hannah Tang, Albert Wong, and Aaron Kimball, Winter 2007.
Abstract
Problem Statement:
The total amount of digital data in the world has exploded in recent years, primarily due to information (or data) generated by various enterprises all over the globe. In 2006, the universal data was estimated to be 0.18 zettabytes, and was forecast to grow tenfold by 2011, to 1.8 zettabytes.
1 zettabyte = 10^21 bytes
The problem is that while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300 seconds. In 2010, 1 TB drives are the standard hard disk size, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
Solution Proposed:
Parallelisation:
An obvious solution to this problem is parallelisation. The input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. Reading 1 TB from a single hard drive takes a long time, but parallelising the read over 100 machines brings it down to about 2 minutes.
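To make the arithmetic explicit, using the 2010 drive figures from the problem statement:

    single drive:  10^6 MB ÷ 100 MB/s ≈ 10,000 s ≈ 2.8 hours
    100 machines:  10,000 s ÷ 100    ≈ 100 s    ≈ 2 minutes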
The key issues involved in this solution:
• Hardware failure
• Combining the data after analysis (i.e., reading)
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.
It solves the problem of hardware failure through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available (the Hadoop Distributed File System).
The second problem is solved by a simple programming model, MapReduce. This programming paradigm abstracts the problem away from raw data reads and writes into a computation over a series of keys and values.
Even though HDFS and MapReduce are the most significant features of Hadoop, other subprojects provide complementary services. The various subprojects of Hadoop include:
• Core
• Avro
• Pig
• HBase
• ZooKeeper
• Hive
• Chukwa
Introduction
Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. A single machine with 1000 CPUs (i.e., a supercomputer with vast memory storage) would cost a lot. Hadoop instead parallelizes the computation by tying smaller and more reasonably priced machines together into a single cost-effective compute cluster.
The features of Hadoop that stand out are its simplified programming model and its efficient, automatic distribution of data and work across machines.
We now take a deeper look into these two main features of Hadoop and list their important characteristics.
1. Data Distribution:
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. If a system failure leaves only partial storage, the data is re-replicated from the remaining copies. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into segments, and each segment is processed by a node. The Hadoop framework schedules processes to run in proximity to the location of the data/records, using knowledge from the distributed file system; each computation process running on a node operates on a subset of the data, and which data a node operates on is decided by its proximity to that node. Most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving the computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.
2. MapReduce: Isolated Processes
Hadoop limits the amount of communication that can be performed by the processes, as each individual record is processed by a task in isolation from the others. This makes the whole framework much more reliable. Programs must be written to conform to a particular programming model, named "MapReduce."
MapReduce is composed of two chief elements: Mappers and Reducers.
1. Data segments or records are processed in isolation by tasks called Mappers.
2. The output from the Mappers is then brought together by Reducers, where results from different Mappers are merged.
Separate nodes in a Hadoop cluster communicate only implicitly: pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
By restricting the communication between nodes, Hadoop makes the distributed system much more reliable. Individual node failures can be worked around by restarting tasks on other machines; the other workers continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.
What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication (i.e., this procedure is abstracted, or hidden, from the user, who can focus on the computational problem).
Note: This abstraction was inspired by the map and reduce primitives present in Lisp and many other functional languages.
The Programming Model:
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
Map and Reduce (Associated Types):
The input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values, however, are from the same domain as the output keys and values.
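In the notation of the original MapReduce paper, these types are:

    map    (k1, v1)        → list(k2, v2)
    reduce (k2, list(v2))  → list(v2)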
Hadoop Map-Reduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Analyzing the Data with Hadoop MapReduce:
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. It splits the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are
then input to the reduce tasks.
Both the input and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.
Data Locality Optimisation:
Typically the compute nodes and the storage nodes are the same. The Map-Reduce framework
and the Distributed File System run on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data is already present, resulting in
very high aggregate bandwidth across the cluster.
There are two types of nodes that control the job execution process:
1. jobtrackers
2. tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the
overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different tasktracker.
Input splits: Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.
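As a quick sanity check, a hypothetical 10 GB input file at the 64 MB default would yield:

    10,240 MB ÷ 64 MB = 160 splits, and therefore 160 map tasks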
Map tasks write their output to the local disk, not to HDFS. Why? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away, so storing it in HDFS, with replication, would be a waste of time. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task.
Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is
normally the output from all mappers.
In case we have a single reduce task that is fed by all of the map tasks: The sorted map outputs
have to be transferred across the network to the node where the reduce task is running, where they
are merged and then passed to the user-defined reduce function.
The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
MapReduce data flow with a single reduce task
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition.
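Hadoop's default partitioner chooses the partition by hashing the key. A minimal sketch of the idea in Python (the real HashPartitioner is Java and uses key.hashCode(); this is only an illustration):

    def partition(key, num_reducers):
        # Records with the same key always hash to the same partition,
        # so all values for a key end up at a single reducer.
        return hash(key) % num_reducers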
MapReduce data flow with multiple reduce tasks
It is also possible to have zero reduce tasks, as illustrated in the figure below.
MapReduce data flow with no reduce tasks
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster. In order to minimize the data transferred between the map and reduce tasks, combiner functions are introduced. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Combiner functions can help cut down the amount of data shuffled between the maps and the reduces.
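For a word count job, for instance, the combiner can perform the same summation as the reducer, but locally on each map task's output before the shuffle. A sketch in Python under that assumption:

    from collections import Counter

    def combine(map_output):
        # Pre-aggregate the (word, 1) pairs emitted by one map task, so each
        # distinct word crosses the network once instead of once per occurrence.
        counts = Counter()
        for word, count in map_output:
            counts[word] += count
        return list(counts.items())

    # combine([("the", 1), ("the", 1), ("cat", 1)]) -> [("the", 2), ("cat", 1)]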
Hadoop Streaming:
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions
in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and your program, so you can use any language that can read standard input and
write to standard output to write your MapReduce program.
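As a sketch of what Streaming makes possible, here is a minimal word-count mapper and reducer in Python (the script names are illustrative; the reducer relies on Streaming delivering its input sorted by key, which the framework guarantees):

    #!/usr/bin/env python3
    # mapper.py: emit a tab-separated (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py: sum the counts for each word; identical words arrive
    # consecutively because Streaming sorts the mapper output by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Because the scripts only read standard input and write standard output, they can be tested without a cluster: cat input.txt | ./mapper.py | sort | ./reducer.py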
Hadoop Pipes:
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which
uses standard input and output to communicate with the map and reduce code, Pipes uses sockets
as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)
Filesystems that manage the storage across a network of machines are called distributed filesystems. They are network-based, and thus all the complications of network programming are also present in a distributed filesystem. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
ASSUMPTIONS AND GOALS:
1. Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part
of the file system’s data. In case of such a large number of nodes, the probability of one of them
failing becomes substantial.
2. Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
3. Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate
data bandwidth and scale to hundreds of nodes in a single cluster.
4. Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created,
written, and closed need not be changed. This assumption simplifies data coherency issues and
enables high throughput data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.
5. “Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data
it operates on. This is especially true when the size of the data set is huge. This minimizes
network congestion and increases the overall throughput of the system. HDFS provides interfaces
for applications to move themselves closer to where the data is located.
6. Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
The Design of HDFS:
• Very large files:
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time. Each analysis will involve a large
proportion, if not all, of the dataset, so the time to read the whole dataset is more important
than the latency in reading the first record.
• Commodity hardware:
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware, for which the chance of node failure somewhere in the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. There are, however, applications for which HDFS does not work well today. These have the following common characteristics:
• Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will
not work well with HDFS.
• Lots of small files:
Since the master node (or namenode) holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes (a worked example follows this list).
• Multiple writers, arbitrary file modifications:
Files in HDFS may be written to by a single writer. Writes are always made at the end of the
file. There is no support for multiple writers, or for modifications at arbitrary offsets in the
file.
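To make the 150-byte rule of thumb concrete, a back-of-the-envelope estimate (assuming one block per file):

    1,000,000 files × (1 file entry + 1 block entry) × 150 bytes ≈ 300 MB of namenode memory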
A Few Important Concepts of the Hadoop Distributed File System:
1. Blocks:
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.
HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
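For example, a hypothetical 1 GB file at the 64 MB default, with the three-way replication described below, breaks down as:

    1024 MB ÷ 64 MB = 16 blocks; 16 blocks × 3 replicas = 48 block replicas stored across the cluster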
HDFS blocks are large compared to disk blocks.
Having a block abstraction for a distributed filesystem brings several benefits:
• A file can be larger than any single disk in the network. There's nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
• Blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).
2. Namenodes and Datanodes:
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
The namenode has three chief functions:
• To manage the filesystem namespace.
• To maintain the filesystem tree and the metadata for all the files and directories in the tree.
• To know the datanodes on which all the blocks for a given file are located.
This information is stored persistently on the local disk in the form of two files: the namespace
image and the edit log. The namenode however does not store block locations persistently, since
this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and
datanodes.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if
the machine running the namenode were obliterated, all the files on the filesystem would be lost
since there would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop
provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
Another approach is to run a secondary namenode. It does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost guaranteed.
3. The File System Namespace:
HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas or access
permissions.
The Namenode maintains the file system namespace. Any change to the file system namespace
or its properties is recorded by the Namenode.
4. Data Replication:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
5. Replica Placement:
Optimizing replica placement distinguishes HDFS from most other distributed file systems. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization.
Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth
between machines in different racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put
one replica on one node in the local rack, another on a different node in the local rack, and the last
on a different node in a different rack. This policy cuts the inter-rack write traffic which generally
improves write performance.
6. Replica Selection:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read
request from a replica that is closest to the reader.
7. Safemode:
On startup, the NameNode enters a special state called Safemode. Replication of data blocks
does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat
and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that
a DataNode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas, and replicates these blocks to other DataNodes.
8. The Persistence of File System Metadata:
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called
the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog. The entire file system
namespace, including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in
memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of
RAM is plenty to support a huge number of files and directories
The DataNode stores HDFS data in files in its local file system. The DataNode has no
knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file
system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to
determine the optimal number of files per directory and creates subdirectories appropriately. It is
not optimal to create all local files in the same directory because the local file system might not
be able to efficiently support a huge number of files in a single directory. When a DataNode starts
up, it scans through its local file system, generates a list of all HDFS data blocks that correspond
to each of these local files, and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols:
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client
establishes a connection to a configurable TCP port on the NameNode machine. It talks the
ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode
Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the
DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds
to RPC requests issued by DataNodes or clients.
Robustness:
The primary objective of HDFS is to store data reliably even in the presence of failures. The
three common types of failures are NameNode failures, DataNode failures and network
partitions.
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition
can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode
detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes
without recent Heartbeats as dead and does not forward any new IO requests to them. Any data
that was registered to a dead DataNode is not available to HDFS any more. DataNode death may
cause the replication factor of some blocks to fall below their specified value. The NameNode
constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become
unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the
replication factor of a file may be increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might
automatically move data from one DataNode to another if the free space on a DataNode falls
below a certain threshold. In the event of a sudden high demand for a particular file, a scheme
might dynamically create additional replicas and rebalance other data in the cluster. These types
of data rebalancing schemes are not yet implemented.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can
occur because of faults in a storage device, network faults, or buggy software. The HDFS client
software implements checksum checking on the contents of HDFS files. When a client creates an
HDFS file, it computes a checksum of each block of the file and stores these checksums in a
separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies
that the data it received from each DataNode matches the checksum stored in the associated
checksum file. If not, then the client can opt to retrieve that block from another DataNode that has
a replica of that block.
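A toy sketch of this checksum scheme in Python (illustrative only; HDFS itself computes CRC-32 checksums over small chunks of each block, with a configurable chunk size):

    import zlib

    CHUNK = 512  # bytes of data per checksum in this sketch

    def checksums(block: bytes):
        # Compute one CRC-32 per fixed-size chunk of the block.
        return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

    def verify(block: bytes, stored_checksums):
        # On read, recompute and compare; a mismatch means this replica is
        # corrupt, and the client can fetch the block from another DataNode.
        return checksums(block) == stored_checksums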
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can
cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured
to support maintaining multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously.
This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate
of namespace transactions per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and
EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode
machine fails, manual intervention is necessary. Currently, automatic restart and failover of the
NameNode software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot
feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
HDFS does not currently support snapshots but will in a future release.
Data Organization
Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are
those that deal with large data sets. These applications write their data only once but they read it
one or more times and require these reads to be satisfied at streaming speeds. HDFS supports
write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an
HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a
different DataNode.
Staging
A client request to create a file does not reach the NameNode immediately. In fact, initially the
HDFS client caches the file data into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file accumulates data worth over one HDFS
block size, the client contacts the NameNode. The NameNode inserts the file name into the file
system hierarchy and allocates a data block for it. The NameNode responds to the client request
with the identity of the DataNode and the destination data block. Then the client flushes the block
of data from the local temporary file to the specified DataNode. When a file is closed, the
remaining un-flushed data in the temporary local file is transferred to the DataNode. The client
then tells the NameNode that the file is closed. At this point, the NameNode commits the file
creation operation into a persistent store. If the NameNode dies before the file is closed, the file is
lost.
The above approach has been adopted after careful consideration of target applications that run
on HDFS. These applications need streaming writes to files. If a client wrote to a remote file
directly without any client-side buffering, network speed and network congestion would impact
throughput considerably. This approach is not without precedent: earlier distributed file
systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement
has been relaxed to achieve higher performance of data uploads.
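From the application's point of view the staging is invisible. A minimal write sketch (the path is hypothetical) looks like an ordinary output stream; the local buffering and the NameNode interaction are handled by the client library:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagedWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/user/data/staged.txt")); // hypothetical path
            // these bytes are buffered in a temporary local file first
            out.writeBytes("flushed to a DataNode once a block fills or the file is closed\n");
            out.close(); // transfers remaining data and tells the NameNode the file is closed
            fs.close();
        }
    }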
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained
in the previous section. Suppose the HDFS file has a replication factor of three. When the local
file accumulates a full block of user data, the client retrieves a list of DataNodes from the
NameNode. This list contains the DataNodes that will host a replica of that block. The client then
flushes the data block to the first DataNode. The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local repository and transfers that portion to the second
DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data
block, writes that portion to its repository, and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local repository. A DataNode can therefore be
receiving data from the previous node in the pipeline while simultaneously forwarding data to
the next one; in this way, the data is pipelined from one DataNode to the next.
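Once the pipeline has run, the placement it produced can be observed through the Java API. This sketch (hypothetical path) asks the NameNode for the block locations of a file; with a replication factor of three, each block should report three DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/data/staged.txt")); // hypothetical path
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                // each entry lists the DataNodes holding a replica of that block
                System.out.println(block);
            }
            fs.close();
        }
    }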
26. Accessibility:
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a
Java API for applications to use. A C language wrapper for this Java API is also available. In
addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in
progress to expose HDFS through the WebDAV protocol.
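A minimal sketch of the native Java API mentioned above (the NameNode URI, port, and directory are hypothetical): connect to an HDFS instance and list a directory:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDir {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), new Configuration());
            for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
            fs.close();
        }
    }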
Space Reclamation:
1. File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as
long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the
expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion
of a file causes the blocks associated with the file to be freed. Note that there could be an
appreciable time delay between the time a file is deleted by a user and the time of the
corresponding increase in free space in HDFS.
A user can undelete a file after deleting it, as long as it remains in the /trash directory: the user
simply navigates the /trash directory and retrieves the file. The /trash directory contains only
the latest copy of the file that was deleted. The
/trash directory is just like any other directory with one special feature: HDFS applies specified
policies to automatically delete files from this directory. The current default policy is to delete
files from /trash that are more than 6 hours old. In the future, this policy will be configurable
through a well defined interface.
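Note that in Hadoop of this era the trash mechanism is a client-side feature (used by the fs shell); a plain programmatic FileSystem.delete bypasses it, but a client can route a delete through the trash with the org.apache.hadoop.fs.Trash class. The sketch below assumes that class's constructor and moveToTrash method as found in this era, plus a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class DeleteViaTrash {
        public static void main(String[] args) throws Exception {
            Trash trash = new Trash(new Configuration());
            // moves the file under /trash instead of freeing its blocks immediately
            boolean moved = trash.moveToTrash(new Path("/user/data/old.log")); // hypothetical path
            System.out.println(moved ? "moved to trash" : "trash disabled or move failed");
        }
    }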
2. Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can
be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then
removes the corresponding blocks, and the freed space appears in the cluster. Once
again, there might be a time delay between the completion of the setReplication API call and the
appearance of free space in the cluster.
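A minimal sketch of the setReplication call mentioned above (the path and new factor are hypothetical). The return value only indicates that the request was accepted; as noted, the space is reclaimed asynchronously:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // drop the file's replication factor to 2; excess replicas are
            // selected by the NameNode and removed on later Heartbeats
            boolean ok = fs.setReplication(new Path("/user/data/big.log"), (short) 2);
            System.out.println(ok ? "replication change accepted" : "not a file");
            fs.close();
        }
    }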
3. Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there
are several concrete implementations, which are described in the following table.
(Java implementation classes are given relative to the org.apache.hadoop package.)

Local (URI scheme: file; implementation: fs.LocalFileSystem)
A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem
for a local filesystem with no checksums.

HDFS (URI scheme: hdfs; implementation: hdfs.DistributedFileSystem)
Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with
MapReduce.

HFTP (URI scheme: hftp; implementation: hdfs.HftpFileSystem)
A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no
connection with FTP.) Often used with distcp.

HSFTP (URI scheme: hsftp; implementation: hdfs.HsftpFileSystem)
A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection
with FTP.)

HAR (URI scheme: har; implementation: fs.HarFileSystem)
A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically
used for archiving files in HDFS to reduce the namenode's memory usage.

KFS (CloudStore) (URI scheme: kfs; implementation: fs.kfs.KosmosFileSystem)
CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's
GFS, written in C++.

FTP (URI scheme: ftp; implementation: fs.ftp.FTPFileSystem)
A filesystem backed by an FTP server.

S3 (native) (URI scheme: s3n; implementation: fs.s3native.NativeS3FileSystem)
A filesystem backed by Amazon S3.

S3 (block-based) (URI scheme: s3; implementation: fs.s3.S3FileSystem)
A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome
S3's 5 GB file size limit.
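The URI scheme in the table is what selects the concrete implementation at runtime; a small sketch (the URIs are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class PickFilesystemByScheme {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem local = FileSystem.get(URI.create("file:///"), conf); // fs.LocalFileSystem
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // hdfs.DistributedFileSystem
            System.out.println(local.getClass().getName());
            System.out.println(hdfs.getClass().getName());
        }
    }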
28. Hadoop Archives:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is
held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory
on the namenode. (Note, however, that small files do not take up any more disk space than is
required to store the raw contents of the file. For example, a 1 MB file stored with a block size of
128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file
archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode
memory usage while still allowing transparent access to files. In particular, Hadoop Archives can
be used as input to MapReduce.
Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a
MapReduce job to process the input files in parallel, so you need a running MapReduce cluster
to use it.
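A hedged example of an invocation, using the archive tool's syntax from Hadoop of this era (the archive name and paths are hypothetical):

    hadoop archive -archiveName files.har /my/files /my

This would archive the contents of /my/files into /my/files.har; the archived files can then be addressed transparently through har:/// URIs.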
Limitations
There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of
the original files, so you need as much disk space as the files you are archiving to create the
archive (although you can delete the originals once you have created the archive). There is
currently no support for archive compression, although the files that go into the archive can be
compressed (HAR files are like tar files in this respect). Archives are immutable once they have
been created. To add or remove files, you must recreate the archive. In practice, this is not a
problem for files that don’t change after being written, since they can be archived in batches on a
regular basis, such as daily or weekly. As noted earlier, HAR files can be used as input to
MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a
single MapReduce split, so processing lots of small files, even in a HAR file, can still be
inefficient.