This document discusses distributed data processing using MapReduce and Hadoop in a cloud computing environment. It describes the need for scalable, economical, and reliable distributed systems that process petabytes of data across thousands of nodes, and introduces Hadoop, an open-source framework for distributed processing of large datasets across clusters of computers. Key aspects discussed include Hadoop's core components: HDFS for distributed file storage and MapReduce for distributed computation.
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers:
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
Distributed Data processing in a Cloud
1. Distributed Data Processing in a Cloud
Rajiv Chittajallu
Yahoo! Inc
rajive@yahoo-inc.com
Software as a Service and Cloud Computing Workshop
ISEC 2008, Hyderabad, India
22 February 2008
2. Desiderata
• Operate scalably
– Petabytes of data on thousands of nodes
– Much larger than RAM; disk I/O required
• Operate economically
– Minimize $ spent on CPU cycles, disk, RAM, network
– Lash thousands of commodity PCs into an effective compute and storage platform
• Operate reliably
– In a large enough cluster, something is always broken
– Seamlessly protect user data and computations from hardware and software flakiness
Yahoo! Inc.
3. Problem: bandwidth to data
• Need to process 100TB datasets
• On a 1000-node cluster reading from remote storage (on LAN)
– Scanning @ 10 MB/s = 165 min
• On a 1000-node cluster reading from local storage
– Scanning @ 50–200 MB/s = 33–8 min
• Moving computation is more efficient than moving data
– Need visibility into data placement
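As a sanity check, the scan-time arithmetic above works out as follows (a quick sketch using the slide's own figures):

```python
def scan_minutes(total_bytes, nodes, mb_per_sec):
    """Time for `nodes` machines to scan `total_bytes` in parallel."""
    per_node = total_bytes / nodes
    seconds = per_node / (mb_per_sec * 1e6)
    return seconds / 60

TB = 1e12
# 100 TB spread over 1000 nodes
print(round(scan_minutes(100 * TB, 1000, 10)))   # remote @ 10 MB/s  -> ~167 min
print(round(scan_minutes(100 * TB, 1000, 50)))   # local  @ 50 MB/s  -> ~33 min
print(round(scan_minutes(100 * TB, 1000, 200)))  # local  @ 200 MB/s -> ~8 min
```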
4. Problem: scaling reliably is hard
• Need to store petabytes of data
– On 1000s of nodes
– MTBF < 1 day
– With so many disks, nodes, and switches, something is always broken
• Need a fault-tolerant store
– Handle hardware faults transparently and efficiently
– Provide reasonable availability guarantees
5. Distributed File System
• Fault-tolerant, scalable, distributed storage system
• Designed to reliably store very large files across machines in a large cluster
• Common namespace for the entire filesystem
– Distribute the namespace for scalability and failover
• Data model
– Data is organized into files and directories
– Files are divided into uniform-sized blocks and distributed across cluster nodes
– Blocks are replicated to handle hardware failure
– Data checksums for corruption detection and recovery
– Block placement is exposed so that computation can be migrated to the data
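The data model above can be sketched as a toy block-placement routine (the 128 MB block size matches the deck's later example; the round-robin policy is illustrative, not HDFS's actual placement algorithm):

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as in the deck's example
REPLICATION = 3                 # replicas per block

def place_blocks(file_size, nodes):
    """Split a file into fixed-size blocks and assign each block
    REPLICATION nodes round-robin (toy policy, not HDFS's)."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(nodes)
    return {b: [next(ring) for _ in range(REPLICATION)]
            for b in range(n_blocks)}

layout = place_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
# a 300 MB file -> 3 blocks, each stored on 3 of the 4 nodes
```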
6. Problem: seeks are expensive
• CPU & transfer speed, RAM & disk size double every 18–24 months
• Seek time nearly constant (~5%/year)
• Time to read an entire drive is growing
• Moral: scalable computing must go at transfer rate
7. Two database paradigms: seek versus transfer
• B-Tree (relational DBs)
– operates at seek rate, log(N) seeks/access
• sort/merge of flat files (MapReduce)
– operates at transfer rate, log(N) transfers/sort
• Caveats:
– sort & merge is batch-based
• although it is possible to work around this
– other paradigms exist (in-memory, streaming, etc.)
8. Example: updating a terabyte DB
• given:
– 10 MB/s transfer
– 10 ms/seek
– 100 B/entry (10B entries)
– 10 kB/page (1B pages)
• updating 1% of entries (100M) takes:
– 1000 days with random B-Tree updates
– 100 days with batched B-Tree updates
– 1 day with sort & merge
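The sort & merge figure can be sanity-checked directly: rewriting the full terabyte sequentially at the quoted transfer rate takes about a day (the B-Tree figures additionally depend on seek-pattern assumptions the slide does not spell out):

```python
DB_BYTES = 10_000_000_000 * 100  # 10B entries x 100 B/entry = 1 TB
TRANSFER = 10 * 1e6              # 10 MB/s sequential transfer

seconds = DB_BYTES / TRANSFER    # one full sequential rewrite of the DB
days = seconds / 86_400
print(round(days, 1))            # -> 1.2 days, i.e. the "1 day" figure
```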
9. Map/Reduce: sort/merge-based distributed processing
• Best for batch-oriented processing
• Sort/merge is the primitive
– Operates at transfer rate
• Simple programming metaphor:
– input | map | shuffle | reduce > output
– cat * | grep | sort | uniq -c > file
• Pluggable user code runs in a generic, reusable framework
– A natural fit for log processing; great for most web search processing
– A lot of SQL maps trivially to this construct (see Pig)
• Distribution & reliability
– Handled by the framework
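The `cat * | grep | sort | uniq -c` pipeline above maps onto the three phases directly; here is a minimal word-count sketch (not Hadoop's API, just the shape of the computation):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: sort by key so equal keys are adjacent -- like `sort`."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(pairs):
    """Reduce: combine all values for one key -- like `uniq -c`."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["the cat sat", "the cat"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
print(result)  # -> {'cat': 2, 'sat': 1, 'the': 2}
```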
10. Map/Reduce
• The application writer specifies
– a pair of functions called Map and Reduce, and a set of input files
• Workflow
– The input phase generates a number of FileSplits from the input files (one per Map task)
– The Map phase executes a user function to transform input key/value pairs into a new set of key/value pairs
– The framework sorts & shuffles the key/value pairs to output nodes
– The Reduce phase combines all key/value pairs with the same key into new key/value pairs
– The output phase writes the resulting pairs to files
• All phases are distributed, with many tasks doing the work
– The framework handles scheduling of tasks on the cluster
– The framework handles recovery when a node fails
(Diagram: Input 0–2 → Map 0–2 → Shuffle → Reduce 0–1 → Output 0–1)
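The "sorts & shuffles to output nodes" step routes every copy of a key to the same Reduce task. A common scheme, and roughly what Hadoop's default hash partitioner does, is hash-mod partitioning (crc32 here is an illustrative stand-in for Java's hashCode):

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministic hash-mod partitioning: same key -> same reducer."""
    return zlib.crc32(key.encode()) % num_reducers

def shuffle_to_reducers(pairs, num_reducers):
    """Group mapped (key, value) pairs by destination reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("the", 1), ("cat", 1), ("the", 1)]
buckets = shuffle_to_reducers(pairs, 2)
# both copies of "the" are guaranteed to land in the same bucket
```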
11. Map/Reduce features
• Fine-grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Locality optimizations
– With big data, bandwidth to data is a problem
– MapReduce + DFS is a very effective solution
– MapReduce queries DFS for the locations of input data
– Map tasks are scheduled local to their inputs when possible
• Re-execution and speculative execution
– In a large cluster, some nodes are always slow or flaky
– This introduces long tails or failures in computation
– The framework re-executes failed tasks
– The framework runs multiple instances of the last few tasks and uses the ones that finish first
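Speculative execution can be illustrated with a toy model: a job finishes when its slowest task does, so re-running the straggler on a typical node cuts the tail (a simplification; the real framework picks stragglers dynamically):

```python
from statistics import median

def job_time(task_times):
    """A job finishes when its slowest task does."""
    return max(task_times)

def job_time_speculative(task_times):
    """Toy model: the slowest task gets a backup attempt on a typical
    (median-speed) node; the task finishes when either attempt does."""
    times = sorted(task_times)
    backup_attempt = median(times)
    times[-1] = min(times[-1], backup_attempt)
    return max(times)

tasks = [10, 11, 10, 12, 60]       # one straggler dominates
print(job_time(tasks))             # -> 60
print(job_time_speculative(tasks)) # -> 12
```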
12. Map/Reduce: pros and cons
• Developing large-scale systems is expensive; this is a shared platform
– Reduces development and debug time
– Leverages common optimizations, tools, etc.
• Not always a natural fit
– With moderate force, many things will fit
• Not always optimal
– But not far off, and often cheaper in the end
13. Hadoop
• Apache Software Foundation project
– Framework for running applications on large clusters of commodity hardware
– Since we convinced Doug Cutting to split Hadoop into a separate project, Yahoo! has been the main contributor of source code to the infrastructure base
– A search startup has adapted Hadoop to run on Amazon's EC2 and S3, and has contributed HBase, a BigTable-like extension
• http://hadoop.apache.org/hbase/
• Includes
– HDFS, a distributed filesystem
– Map/Reduce, an offline computing engine
– HBase, online data access
• Still pre-1.0, but already used by many
– http://wiki.apache.org/hadoop/PoweredBy
– alpha (0.16) release available for download
• http://lucene.apache.org/hadoop
14. Hadoop Map/Reduce architecture
• Master-slave architecture
• Map/Reduce master: "JobTracker"
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to TaskTrackers
– Monitors task and TaskTracker status; re-executes tasks upon failure
• Map/Reduce slaves: "TaskTrackers"
– Run Map and Reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
15. HDFS architecture
• Master-slave architecture
• DFS master: "NameNode"
– Manages the filesystem namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints the namespace and journals namespace changes for reliability
• DFS slaves: "DataNodes"
– Serve read/write requests from clients
– Perform replication tasks upon instruction from the NameNode
16. HDFS
• Notable differences from mainstream DFS work
– Single "storage + compute" cluster vs. separate clusters
– Simple I/O-centric API vs. attempts at POSIX compliance
• Not against POSIX, but currently prioritizing scale and reliability
18. HDFS API
• Most common file and directory operations are supported:
– create, open, close, read, write, seek, tell, list, delete, etc.
• Files are write-once and have exactly one writer
– Append/truncate coming soon
• Some operations are peculiar to HDFS:
– set replication, get block locations
• Support for owners and permissions (v0.16)
19. HDFS command line utils
gritgw1004:/grid/0/tmp/rajive$ ls -lt
total 1300392
-rw-r--r-- 1 rajive users 244827000 Jan 20 05:02 1.5K-alice30.txt
-rw-r--r-- 1 rajive users 8160900 Jan 20 05:02 50-alice30.txt
-rw-r--r-- 1 rajive users 1077290150 Jan 20 04:58 part-00737
gritgw1004:/grid/0/tmp/rajive$ hadoop dfs -ls
Found 1 items
/user/rajive/rand0 <dir> 2008-01-20 05:00
gritgw1004:/grid/0/tmp/rajive$ hadoop dfs -ls /user/rajive
Found 5 items
/user/rajive/alice <dir> 2008-01-20 05:15
/user/rajive/alice-1.5k <dir> 2008-01-20 05:20
/user/rajive/rand0 <dir> 2008-01-20 05:00
gritgw1004:/grid/0/tmp/rajive$ hadoop dfs -put 50-alice30.txt /user/rajive/alice
gritgw1004:/grid/0/tmp/rajive$ hadoop dfs -ls /user/rajive/alice
Found 1 items
/user/rajive/alice/50-alice30.txt <r 3> 8160900 2008-01-20 05:05
gritgw1004:/grid/0/tmp/rajive$ hadoop dfs -cat /user/rajive/alice/50-alice30.txt
***This is the Project Gutenberg Etext of Alice in Wonderland***
*This 30th edition should be labeled alice30.txt or alice30.zip.
***This Edition Is Being Officially Released On March 8, 1994***
**In Celebration Of The 23rd Anniversary of Project Gutenberg***
23. Hadoop: Two Services in One
Cluster nodes run both DFS and MR (taking the computation to the data)
[Diagram: input files X, O, and Y are split into 128MB blocks spread across the DFS nodes; the MR tasks for each file run on the nodes holding its blocks]
24. HOD (Hadoop on Demand)
• Map/Reduce is just one programming model
• Hadoop is not a resource manager or scheduler
– Most sites already have a deployed solution
• HOD
– Bridge between Hadoop and resource managers
– Currently supports Torque
– Part of contrib in Hadoop 0.16 release
– http://hadoop.apache.org/core/docs/current/hod.html
25. HOD: Provisioning Hadoop
• Hadoop is submitted like any other job
• User specifies number of nodes desired
• HOD deals with allocation and setup
– Allocates requested nodes
– Brings up Map/Reduce and (optionally) HDFS daemons
• User submits Map/Reduce jobs
26. HOD Benefits
• Effective usage of the grid
– No need to do ‘social scheduling’
– No need for static node allocation
• Automated setup for Hadoop
– Users / Ops no longer need to know where and how to bring up daemons
27. Running Jobs
gritgw1004:/grid/0/tmp/rajive$ hod -m 5
HDFS UI on grit1002.yahooresearchcluster.com:50070
Mapred UI on grit1278.yahooresearchcluster.com:55118
Hadoop config file in: /grid/0/kryptonite/hod/tmp/hod-15575-tmp/hadoop-site.xml
allocation information:
1 job tracker node
4 task tracker nodes
5 nodes in total
[hod] (rajive) >>
28. Running Jobs
[hod] (rajive) >> run jar /grid/0/hadoop/current/hadoop-examples.jar wordcount /user/rajive/alice-1.5k /user/rajive/wcout2
08/01/20 05:21:26 WARN mapred.JobConf: Deprecated resource 'mapred-default.xml' is being loaded, please discontinue its usage!
08/01/20 05:21:27 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/20 05:21:30 INFO mapred.JobClient: Running job: job_200801200511_0002
08/01/20 05:21:31 INFO mapred.JobClient: map 0% reduce 0%
08/01/20 05:21:38 INFO mapred.JobClient: map 3% reduce 0%
08/01/20 05:21:42 INFO mapred.JobClient: map 12% reduce 0%
08/01/20 05:21:48 INFO mapred.JobClient: map 20% reduce 0%
08/01/20 05:22:12 INFO mapred.JobClient: map 27% reduce 0%
08/01/20 05:22:18 INFO mapred.JobClient: map 37% reduce 0%
08/01/20 05:22:21 INFO mapred.JobClient: map 41% reduce 0%
08/01/20 05:22:41 INFO mapred.JobClient: map 45% reduce 0%
08/01/20 05:22:48 INFO mapred.JobClient: map 54% reduce 0%
08/01/20 05:22:51 INFO mapred.JobClient: map 59% reduce 0%
08/01/20 05:22:59 INFO mapred.JobClient: map 62% reduce 0%
08/01/20 05:23:19 INFO mapred.JobClient: map 71% reduce 0%
08/01/20 05:23:22 INFO mapred.JobClient: map 76% reduce 0%
08/01/20 05:23:29 INFO mapred.JobClient: map 83% reduce 0%
08/01/20 05:23:49 INFO mapred.JobClient: map 88% reduce 0%
08/01/20 05:23:52 INFO mapred.JobClient: map 93% reduce 0%
08/01/20 05:23:59 INFO mapred.JobClient: map 100% reduce 0%
08/01/20 05:24:19 INFO mapred.JobClient: map 100% reduce 100%
08/01/20 05:24:20 INFO mapred.JobClient: Job complete: job_200801200511_0002
08/01/20 05:24:20 INFO mapred.JobClient: Counters: 11
08/01/20 05:24:20 INFO mapred.JobClient: Job Counters
08/01/20 05:24:20 INFO mapred.JobClient: Launched map tasks=2
08/01/20 05:24:20 INFO mapred.JobClient: Launched reduce tasks=1
08/01/20 05:24:20 INFO mapred.JobClient: Map-Reduce Framework
08/01/20 05:24:20 INFO mapred.JobClient: Map input records=5779500
08/01/20 05:24:20 INFO mapred.JobClient: Map output records=42300000
08/01/20 05:24:20 INFO mapred.JobClient: Map input bytes=244827000
08/01/20 05:24:20 INFO mapred.JobClient: Map output bytes=398698500
08/01/20 05:24:20 INFO mapred.JobClient: Combine input records=42300000
08/01/20 05:24:20 INFO mapred.JobClient: Combine output records=59080
08/01/20 05:24:20 INFO mapred.JobClient: Reduce input groups=5908
08/01/20 05:24:20 INFO mapred.JobClient: Reduce input records=59080
08/01/20 05:24:20 INFO mapred.JobClient: Reduce output records=5908
[hod] (rajive) >>
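The counters above show the combiner at work: 42,300,000 map output records collapse to 59,080 combine output records before the shuffle, roughly a 700x reduction in data sent to the reducer. A small sketch of the same effect on synthetic input (`map_outputs` and `combine` are illustrative names, not the Hadoop API):

```python
from collections import Counter

def map_outputs(lines):
    # One (word, 1) record per word, as the wordcount mapper emits.
    return [(w, 1) for line in lines for w in line.split()]

def combine(pairs):
    # Map-side local aggregation: collapse to one record per distinct word.
    agg = Counter()
    for word, count in pairs:
        agg[word] += count
    return list(agg.items())

lines = ["to be or not to be"] * 1000
raw = map_outputs(lines)
combined = combine(raw)
print(len(raw), "->", len(combined))  # 6000 -> 4
```

Only the four distinct-word records cross the (simulated) shuffle, which is exactly why running a combiner on map output is one of the cheapest wordcount optimizations.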
31. Thank you
• Questions?
• Hadoop: http://hadoop.apache.org
• Blog: http://developer.yahoo.com/blogs/hadoop
• This presentation: http://public.yahoo.com/rajive/isec2008.pdf
• email: rajive@yahooinc.com