Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from single servers to thousands of machines, each offering local computation and storage. Its core features include HDFS, a distributed file system that stores data on commodity machines and provides fault tolerance, and MapReduce, a programming model that lets users write applications as a set of map and reduce functions that are automatically parallelized across the distributed system.
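The map/reduce model itself can be demonstrated without any Hadoop machinery at all. Here is a minimal, in-memory sketch in plain Java — the collections stand in for the framework's shuffle phase, and all class and method names are illustrative, not Hadoop APIs:

```java
import java.util.*;
import java.util.stream.*;

// A minimal, in-memory sketch of the MapReduce model (plain Java, no Hadoop):
// "map" emits (word, 1) pairs per record, the framework groups pairs by key
// (the shuffle), and "reduce" sums the values for each key.
public class WordCountModel {

    // Map: one input record (a line of text) -> a list of (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce: one key plus all values emitted for it -> a single result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // The "framework": map every record, shuffle by key, then reduce each group.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick brown fox", "the lazy dog")));
    }
}
```

Because map runs independently per record and reduce independently per key, the framework is free to run both phases in parallel across many machines.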
Moore’s law has finally hit the wall and CPU speeds have actually decreased in the last few years. The industry is reacting with hardware that has an ever-growing number of cores and software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of Open Source APIs at the forefront of this grid computing revolution and is considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is currently being leveraged in production by prominent names such as Yahoo!, IBM, Amazon, Adobe, AOL, Facebook, and Hulu.
In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the incredible auto-replicating, redundant, and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk’s conclusion, you’ll feel equipped to take on any massive data set and processing job your employer can throw at you.
7. “I use Hadoop often. That’s the sound my TiVo makes every time I skip a commercial.”
—Brian Goetz, author of Java Concurrency in Practice
44. Applications
Protein folding: pharmaceutical research
Search engine indexing: walking billions of web pages
Product recommendations: based on other customer purchases
Sorting: terabytes to petabytes in size
Classification: government intelligence
46. SELECT reccProd.name, reccProd.id
FROM products reccProd
WHERE purchases.customerId =
(SELECT customerId
FROM customers
WHERE purchases.productId = thisProd)
LIMIT 5
51. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis …

To appear in OSDI 2004
52. “A programming model and implementation for processing and generating large data sets”
69. /**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
70. /** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}
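Stripped of the Hadoop types, the heart of RegexMapper is a few lines of java.util.regex. Here is a plain-Java sketch of the same per-record behavior — the class and method names are mine, and the OutputCollector is replaced by a returned list so it runs standalone:

```java
import java.util.*;
import java.util.regex.*;

// Plain-Java sketch of what RegexMapper.map does for a single input record:
// find every match of the configured pattern and emit a (matchedText, 1) pair.
public class RegexMapperSketch {
    private final Pattern pattern;
    private final int group;   // capture group to emit; 0 = the whole match

    RegexMapperSketch(String regex, int group) {
        this.pattern = Pattern.compile(regex);
        this.group = group;
    }

    // Equivalent of map(key, value, output, reporter): returns the emitted pairs.
    List<Map.Entry<String, Long>> map(String value) {
        List<Map.Entry<String, Long>> output = new ArrayList<>();
        Matcher matcher = pattern.matcher(value);
        while (matcher.find()) {
            output.add(Map.entry(matcher.group(group), 1L));
        }
        return output;
    }
}
```

Downstream, a summing reducer turns these (match, 1) pairs into per-pattern counts across the whole data set — a distributed grep-and-count.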
71. Have Code, Will Travel
Code travels to the data
Opposite of traditional systems
74. “Scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
75. HDFS Basics
Open Source implementation of the Google File System (GFS)
Replicated data store
Stored in 64MB blocks
76. HDFS
Rack location aware
Configurable redundancy factor
Self-healing
Looks almost like *NIX filesystem
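Rack awareness and the redundancy factor work together: replicas are spread so that losing an entire rack cannot destroy every copy of a block. A toy placement policy makes the idea concrete — this is a deliberate simplification for illustration, not HDFS's actual BlockPlacementPolicy:

```java
import java.util.*;

// Toy, illustrative sketch of rack-aware replica placement (a simplification,
// NOT HDFS's real BlockPlacementPolicy): the first replica lands on the
// writing node, and later replicas prefer nodes on *other* racks so that a
// single rack failure cannot lose every copy of a block.
public class RackAwarePlacement {

    /** nodeToRack maps node name -> rack name; returns the chosen replica nodes. */
    static List<String> place(Map<String, String> nodeToRack, String localNode, int replication) {
        List<String> chosen = new ArrayList<>();
        chosen.add(localNode);                        // replica 1: the local node
        String localRack = nodeToRack.get(localNode);

        // Prefer nodes sitting on a different rack than the writer's.
        for (String node : nodeToRack.keySet()) {
            if (chosen.size() >= replication) break;
            if (!chosen.contains(node) && !nodeToRack.get(node).equals(localRack)) {
                chosen.add(node);
            }
        }
        // Fall back to any remaining node if more replicas are still needed.
        for (String node : nodeToRack.keySet()) {
            if (chosen.size() >= replication) break;
            if (!chosen.contains(node)) {
                chosen.add(node);
            }
        }
        return chosen;
    }
}
```

The self-healing property follows from the same bookkeeping: when the NameNode notices a block has fewer live replicas than the configured factor, it schedules re-replication onto healthy nodes.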
95. Server Funerals
No pagers go off when machines die
Report of dead machines once a week
Clean out the carcasses
96. Robustness attributes prevented from bleeding into application code
Data redundancy
Node death
Retries
Data geography
Parallelism
Scalability
128. Pig Questions
Ask big questions on unstructured data
How many ___?
Should we ____?
Decide on the questions you want to ask long after you’ve collected the data.
129. Pig Data
999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com,
999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,ad2U
999993,female,Ruth,G,Carter,4890 Murphy Court,Shakopee,MN,55379,US,Ruth.G.Carter@mailinator.com,uaseu8e
999994,male,Lionel,J,Carter,2701 Irving Road,Saint Clairsville,OH,43950,US,Lionel.J.Carter@trashymail.c
999995,female,Georgia,C,Medina,4541 Cooks Mine Road,CLOVIS,NM,88101,US,Georgia.C.Medina@trashymail.com,
999996,male,Stanley,S,Cruz,1463 Jennifer Lane,Durham,NC,27703,US,Stanley.S.Cruz@pookmail.com,aehoh5rooG
999997,male,Justin,A,Delossantos,3169 Essex Court,MANCHESTER,VT,05254,US,Justin.A.Delossantos@mailinato
999998,male,Leonard,K,Baker,4672 Margaret Street,Houston,TX,77063,US,Leonard.K.Baker@trashymail.com,Aep
999999,female,Charissa,J,Thorne,2806 Cedar Street,Little Rock,AR,72211,US,Charissa.J.Thorne@trashymail.
1000000,male,Michael,L,Powell,2797 Turkey Pen Road,New York,NY,10013,US,Michael.L.Powell@mailinator.com
130. Pig Sample
Person = LOAD 'people.csv' USING PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames);
STORE NameCount INTO 'names.out';
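The Pig script above boils down to "take field $2, group by it, count the groups". For comparison, here is the same logic in plain Java streams (assuming the same comma-separated layout; the ORDER step, which only affects output ordering, is handled by the sorted map):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java equivalent of the Pig name-count script: extract field $2 (the
// first name) from each CSV record, group by it, and count records per name.
// A TreeMap keeps the names in sorted order, mirroring the ORDER BY step.
public class NameCount {
    static Map<String, Long> nameCounts(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(",")[2])      // $2 in Pig: the name field
                .collect(Collectors.groupingBy(
                        name -> name, TreeMap::new, Collectors.counting()));
    }
}
```

The contrast is the point of the slide: Pig expresses the whole pipeline declaratively and compiles it to MapReduce jobs, so the same five lines scale from one CSV file to a cluster-sized data set.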
163. [Cluster topology diagram: start-all.sh invokes start-dfs.sh and start-mapred.sh. The NameNode and JobTracker run on a robust, primary machine, while a DataNode and a TaskTracker run on each of the 1..N commodity machines.]