Best Hadoop Institutes : kelly tecnologies is the best Hadoop training Institute in Bangalore.Providing hadoop courses by realtime faculty in Bangalore.
This presentation explains the architecture of classic mapreduce or MapReduce 1 in Hadoop, Most of the sides are animated. So please download and read it.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
This presentation explains the architecture of classic mapreduce or MapReduce 1 in Hadoop, Most of the sides are animated. So please download and read it.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
Uber developed an new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is of the order hundreds of TBs. Omar details how to tackle such scale and insights into the optimizations techniques. Some key highlights are how to understand bottlenecks in Spark applications, to cache or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to reduce amortize the cost of your application by multiplexing your jobs, different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
Clustering is often an essential first step in datamining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can
offer a richer representation by suggesting the potential group
structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a
parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of
single-linkage clustering algorithm due to its natural expression
of iterative process. Our algorithm can be deployed easily in
Amazon’s cloud environment. And a thorough performance
evaluation in Amazon’s EC2 verifies that the scalability of our
algorithm sustains when the datasets scale up.
Spark adds some abstractions and generalizations and performance optimizations to achieve much better efficiency especially in iterative workloads. Yet, spark does not concern itself with being a data file system while Hadoop has what is called HDFS.
Spark can leverage existing distributed files systems (like HDFS), a distributed data base (like HBase), traditional databases through its JDBC or ODBC adaptors, and flat files in local file systems or on a file store like S3 in Amazon cloud.
Hadoop MapReduce framework is similar to Spark in that it uses master slave-like paradigm. It has one Master node (which consists of a job tracker, name node, and RAM) and Worker Nodes (each worker node consists of a task tracker, data node, and a RAM). The task tracker in a worker node is analogues to an executor in Spark environment.
Slides from my talk at 2016 Apache: Big Data conference.
Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems to retain fine-grained control over the cloud resources and address common problems of fault-tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide the developers through a simple REEF application and discuss current state of Apache REEF project and its place in the Hadoop ecosystem.
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
A deeper explanation of Spark's evaluation principals including lazy evaluation, the Spark execution environment, anatomy of a Spark Job (Tasks, Stages, Query execution plan) and presents one use case to demonstrate these concepts.
Distributed real time stream processing- why and howPetr Zapletal
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data has to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications, include trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
Uber developed an new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is of the order hundreds of TBs. Omar details how to tackle such scale and insights into the optimizations techniques. Some key highlights are how to understand bottlenecks in Spark applications, to cache or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to reduce amortize the cost of your application by multiplexing your jobs, different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
Clustering is often an essential first step in datamining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can
offer a richer representation by suggesting the potential group
structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a
parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of
single-linkage clustering algorithm due to its natural expression
of iterative process. Our algorithm can be deployed easily in
Amazon’s cloud environment. And a thorough performance
evaluation in Amazon’s EC2 verifies that the scalability of our
algorithm sustains when the datasets scale up.
Spark adds some abstractions and generalizations and performance optimizations to achieve much better efficiency especially in iterative workloads. Yet, spark does not concern itself with being a data file system while Hadoop has what is called HDFS.
Spark can leverage existing distributed files systems (like HDFS), a distributed data base (like HBase), traditional databases through its JDBC or ODBC adaptors, and flat files in local file systems or on a file store like S3 in Amazon cloud.
Hadoop MapReduce framework is similar to Spark in that it uses master slave-like paradigm. It has one Master node (which consists of a job tracker, name node, and RAM) and Worker Nodes (each worker node consists of a task tracker, data node, and a RAM). The task tracker in a worker node is analogues to an executor in Spark environment.
Slides from my talk at 2016 Apache: Big Data conference.
Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems to retain fine-grained control over the cloud resources and address common problems of fault-tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide the developers through a simple REEF application and discuss current state of Apache REEF project and its place in the Hadoop ecosystem.
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
A deeper explanation of Spark's evaluation principals including lazy evaluation, the Spark execution environment, anatomy of a Spark Job (Tasks, Stages, Query execution plan) and presents one use case to demonstrate these concepts.
Distributed real time stream processing- why and howPetr Zapletal
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data has to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications, include trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
This presentation walks through the 5 characteristics of an excellent instructional designer. These include 1) belief in impact, 2) proactive, 3) positive attitude, 4) discerning focus, and 5) emotional intelligence.
kita akan mempelajari tentang Studi design dalam epidemiologi. Studi disain epidemiologi yang akan kita pelajari yaitu Eksperimental studi yaitu studi intervensi kemudian observasional studi yaitu studi disain kohort, kasus kontrol dan crossectional studi.
Secara garis besar, desain penelitian dalam epidemiologi terbagi menjadi dua kelompok besar yaitu penelitian eksperimental dan penelitian observasi.Tujuan dari penelitian eksperimen/ uji klinis adalah untuk mengukur efek dari suatu intervensi terhadap hasil tertentu yang diprediksi sebelumnya.Desain ini merupakan metode utama untuk menginvestigasi terapi baru.Misal, efek dari obat X dan obat Y terhadap kesembuhan penyakit Z atau efektivitas suatu program kesehatan terhadap peningkatan kesehatan masyarakat. Sedangkan penelitian observasional tidak melakukan intervensi apapun, peneliti hanya mengobservasi kejadian atau fenomena yang terjadi di suatu masyarakat untuk menjawab pertanyaan penelitian.Misalnya, peneliti ingin mengetahui hubungan antara pengetahuan ibu tentang gizi anak terhadap status gizi anak. Peneliti tidak melakukan intervensi berupa penyuluhan atau pelatihan seputar gizi anak kepada target penelitian terlebih dahulu. Peneliti hanya menyelidiki apakah salah satu yang mempengaruhi status gizi anak itu adalah pengetahuan ibu yang telah mereka miliki sebelumnya tentang gizi anak, mungkin dari media atau penyuluhan rutin oleh tenaga kesehatan di lokasi setempat.
Najmah, 2015, Epidemiologi untuk mahasiswa kesehatan masyarakat. Penerbit: Raja Grafindo
http://rajagrafindoonline.com/kesehatan/buku-epidemiologi-untuk-mahasiswa-kesehatan-masyarakat-pengarang-najmah-skm-mph
Vibrant Technologies is headquarted in Mumbai,India.We are the best Hadoop training provider in Navi Mumbai who provides Live Projects to students.We provide Corporate Training also.We are Best Hadoop classes in Mumbai according to our students and corporates
This presentation will give you Information about :
1. Map/Reduce Overview and Architecture Installation
2. Developing Map/Red Jobs Input and Output Formats
3. Job Configuration Job Submission
4. Practicing Map Reduce Programs (atleast 10 Map Reduce
5. Algorithms )Data Flow Sources and Destinations
6. Data Flow Transformations Data Flow Paths
7. Custom Data Types
8. Input Formats
9. Output Formats
10. Partitioning Data
11. Reporting Custom Metrics
12. Distributing Auxiliary Job Data
Hadoop Institutes : kelly technologies is the best Hadoop Training Institutes in Hyderabad. Providing Hadoop training by real time faculty in Hyderabad.
Spark Training Institutes: kelly technologies is the best Spark class Room training institutes in Bangalore. Providing Spark training by real time faculty in Bangalore.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
It is a flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.
http://CloudComputing.IndicThreads.com
Abstract: The processing of massive amount of data gives great insights into analysis for business. Many primary algorithms run over the data and gives information which can be used for business benefits and scientific research. Extraction and processing of large amount of data has become a primary concern in terms of time, processing power and cost. Map Reduce algorithm promises to address the above mentioned concerns. It makes computing of large sets of data considerably easy and flexible. The algorithm offers high scalability across many computing nodes. This session will introduce Map Reduce algorithm, followed by few variations of the same and also hands on example in Map Reduce using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead from Infosys Labs, Bangalore. He has over 5 years of experience in software industry in various technologies. He has extensively worked on GWT, Eclipse Plugin development, Lucene, Solr, No SQL databases etc. He speaks at the developer events like ACM Compute, Indic Threads and Dev Camps.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing framework that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. In this talk, we will focus on the following aspects of Spark streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Hadoop Online Training : kelly technologies is the bestHadoop online Training Institutes in Bangalore. ProvidingHadoop online Training by real time faculty in Bangalore.
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
How to Create Map Views in the Odoo 17 ERPCeline George
The map views are useful for providing a geographical representation of data. They allow users to visualize and analyze the data in a more intuitive manner.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
2. Announcements
My office hours: M 2:30—3:30 in CSE 212
Cluster is operational; instructions in assignment 1
heavily rewritten
Eclipse plugin is “deprecated”
Students who already created accounts: let me know
if you have trouble
www.kellytechno.com
3. Breaking news!
Hadoop tested on 4,000 node cluster
32K cores (8 / node)
16 PB raw storage (4 x 1 TB disk / node)
(about 5 PB usable storage)
http://developer.yahoo.com/blogs/hadoop/2008/09/
scaling_hadoop_to_4000_nodes_a.html
www.kellytechno.com
4. You Say, “tomato…”
Google calls it: Hadoop equivalent:
MapReduce Hadoop
GFS HDFS
Bigtable HBase
Chubby Zookeeper
www.kellytechno.com
5. Some MapReduce Terminology
Job – A “full program” - an execution of a Mapper and
Reducer across a data set
Task – An execution of a Mapper or a Reducer on a
slice of data
a.k.a. Task-In-Progress (TIP)
Task Attempt – A particular instance of an attempt to
execute a task on a machine
www.kellytechno.com
6. Terminology Example
Running “Word Count” across 20 files is one job
20 files to be mapped imply 20 map tasks + some
number of reduce tasks
At least 20 map task attempts will be performed…
more if a machine crashes, etc.
www.kellytechno.com
7. Task Attempts
A particular task will be attempted at least once,
possibly more times if it crashes
If the same input causes crashes over and over, that
input will eventually be abandoned
Multiple attempts at one task may occur in
parallel with speculative execution turned on
Task ID from TaskInProgress is not a unique identifier;
don’t use it that way
www.kellytechno.com
9. Node-to-Node Communication
Hadoop uses its own RPC protocol
All communication begins in slave nodes
Prevents circular-wait deadlock
Slaves periodically poll for “status” message
Classes must provide explicit serialization
www.kellytechno.com
10. Nodes, Trackers, Tasks
Master node runs JobTracker instance, which accepts
Job requests from clients
TaskTracker instances run on slave nodes
TaskTracker forks separate Java process for task
instances
www.kellytechno.com
11. Job Distribution
MapReduce programs are contained in a Java “jar”
file + an XML file containing serialized program
configuration options
Running a MapReduce job places these files into
the HDFS and notifies TaskTrackers where to
retrieve the relevant program code
… Where’s the data distribution?
www.kellytechno.com
12. Data Distribution
Implicit in design of MapReduce!
All mappers are equivalent; so map whatever data is
local to a particular node in HDFS
If lots of data does happen to pile up on the same
node, nearby nodes will map instead
Data transfer is handled implicitly by HDFS
www.kellytechno.com
13. Configuring With JobConf
MR Programs have many configurable options
JobConf objects hold (key, value) components
mapping String ’a
e.g., “mapred.map.tasks” 20
JobConf is serialized and distributed before running the
job
Objects implementing JobConfigurable can
retrieve elements from a JobConf
www.kellytechno.com
15. Job Launch Process: Client
Client program creates a JobConf
Identify classes implementing Mapper and Reducer
interfaces
JobConf.setMapperClass(), setReducerClass()
Specify inputs, outputs
FileInputFormat.addInputPath(),
FileOutputFormat.setOutputPath()
Optionally, other options too:
JobConf.setNumReduceTasks(), JobConf.setOutputFormat()
…
www.kellytechno.com
16. Job Launch Process: JobClient
Pass JobConf to JobClient.runJob() or submitJob()
runJob() blocks, submitJob() does not
JobClient:
Determines proper division of input into InputSplits
Sends job data to master JobTracker server
www.kellytechno.com
17. Job Launch Process: JobTracker
JobTracker:
Inserts jar and JobConf (serialized to XML) in shared
location
Posts a JobInProgress to its run queue
www.kellytechno.com
18. Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically
query JobTracker for work
Retrieve job-specific jar and config
Launch task in separate instance of Java
main() is provided by Hadoop
www.kellytechno.com
19. Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce components
via RPC
Uses TaskRunner to launch user process
www.kellytechno.com
20. Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, MapRunner work in a
daisy-chain to launch your Mapper
Task knows ahead of time which InputSplits it should
be mapping
Calls Mapper once for each record retrieved from the
InputSplit
Running the Reducer is much the same
www.kellytechno.com
21. Creating the Mapper
You provide the instance of Mapper
Should extend MapReduceBase
One instance of your Mapper is initialized by the
MapTaskRunner for a TaskInProgress
Exists in separate process from all other instances of
Mapper – no data sharing!
www.kellytechno.com
23. What is Writable?
Hadoop defines its own “box” classes for strings
(Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
www.kellytechno.com
25. Reading Data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v)
records from the input source
www.kellytechno.com
26. FileInputFormat and Friends
TextInputFormat – Treats each ‘n’-terminated line of
a file as a value
KeyValueTextInputFormat – Maps ‘n’- terminated
text lines of “k SEP v”
SequenceFileInputFormat – Binary file of (k, v) pairs
with some add’l metadata
SequenceFileAsTextInputFormat – Same, but maps
(k.toString(), v.toString())
www.kellytechno.com
27. Filtering File Inputs
FileInputFormat will read all files out of a specified
directory and send them to the mapper
Delegates filtering this file list to a method subclasses
may override
e.g., Create your own “xyzFileInputFormat” to read
*.xyz from directory list
www.kellytechno.com
28. Record Readers
Each InputFormat provides its own RecordReader
implementation
Provides (unused?) capability multiplexing
LineRecordReader – Reads a line from a text file
KeyValueRecordReader – Used by
KeyValueTextInputFormat
www.kellytechno.com
29. Input Split Size
FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive file, offset, and length of
chunk
Custom InputFormat implementations may override
split size – e.g., “NeverChunkFile”
www.kellytechno.com
30. Sending Data To Reducers
Map function receives OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) can be used
By default, mapper output type assumed to be same
as reducer output type
www.kellytechno.com
32. Sending Data To The Client
Reporter object sent to Mapper allows simple
asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input
InputSplit getInputSplit()
www.kellytechno.com
34. Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
HashPartitioner used by default
Uses key.hashCode() to return partition num
JobConf sets Partitioner implementation
www.kellytechno.com
35. Reduction
reduce( K2 key,
Iterator<V2> values,
OutputCollector<K3, V3> output,
Reporter reporter)
Keys & values sent to one partition all go to the same
reduce task
Calls are sorted by key – “earlier” keys are reduced
and output before “later” keys
www.kellytechno.com