The document provides an overview of Hadoop, describing it as an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It discusses key Hadoop components like HDFS for storage, MapReduce for distributed processing, and YARN for resource management. The document also gives examples of how organizations are using Hadoop at large scale for applications like search indexing and data analytics.
Hanborq Optimizations on Hadoop MapReduce (Hanborq Inc.)
A Hanborq-optimized Hadoop distribution focused on high-performance MapReduce. It is the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).
A presentation from the technical track of #BitByte, a professional development festival held on May 19 in Saint Petersburg.
Vladimir Suvorov, Industry Solutions Engineer at EMC2: "Practical use of Apache Hadoop, a distributed data processing technology."
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to write your first MapReduce job and how to run it on a pseudo-distributed Hadoop installation.
This session was given in Arabic, and I may provide a video for the session soon.
This page provides access to information about how to integrate Apache Hadoop with Lustre. We have made several enhancements to improve the use of Hadoop with Lustre and have conducted performance tests to compare the performance of Lustre vs. HDFS when used with Hadoop.
http://wiki.lustre.org/index.php/Integrating_Hadoop_with_Lustre
A session focused on ramping you up on what Hadoop is, how it works, and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table, and at some future projects in the Hadoop space to keep an eye on.
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R... (Cloudera, Inc.)
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the open-source programming language, used by more than 2 million users, that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will suggest how enterprises can take advantage of both of these industry-leading technologies.
Speaking of big data analysis, what comes to mind is probably using HDFS and MapReduce within Hadoop. But to write a MapReduce program, one must face the problem of learning to write native Java. One might wonder: is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing and big data analysis?
This slide deck shows how to install RHadoop step by step and how to write a MapReduce program in R. More importantly, it discusses whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs.
Please mail me if you find any problem with the slides. EMAIL: tr.ywchiu@gmail.com
Hadoop, being a disruptive data processing framework, has made a large impact in the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage the adoption and allow businesses to get value out of their Hadoop investment quickly. R, being a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop projects such as RMR, and RHDFS work with Hadoop, and will show you examples of using these technologies on the Hortonworks Data Platform.
Processing massive amounts of data with MapReduce using Apache Hadoop (IndicThreads)
Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.
http://CloudComputing.IndicThreads.com
Abstract: The processing of massive amounts of data gives great insight for business analysis. Many primary algorithms run over the data and give information that can be used for business benefit and scientific research. Extraction and processing of large amounts of data has become a primary concern in terms of time, processing power, and cost. The MapReduce algorithm promises to address these concerns. It makes computing over large sets of data considerably easier and more flexible, and it offers high scalability across many computing nodes. This session will introduce the MapReduce algorithm, followed by a few variations of it, along with a hands-on MapReduce example using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead at Infosys Labs, Bangalore. He has over 5 years of experience in the software industry across various technologies. He has worked extensively on GWT, Eclipse plugin development, Lucene, Solr, NoSQL databases, etc. He speaks at developer events such as ACM Compute, IndicThreads, and DevCamps.
This presentation will give you information about:
HDFS Overview and Architecture
1. Configuring HDFS
2. Interacting with HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Installation
6. Hadoop File System Shell
7. File System Java API
Scheduler optimization to handle multiple jobs in a Hadoop cluster (Shivraj Raj)
This effort aims to give a high-level summary of what Big Data is, how to address the issues raised by the four V's, and how data stored in HDFS can be queried by setting up Hadoop, Pig, and Hive with various configuration parameters to retrieve useful information from bulky data sets.
How to build a news website using the CMS WordPress (baran19901990)
This tutorial will show you how to build a news website using the CMS WordPress, step by step, from basic to advanced. I think this tutorial is a great starting point for newbies.
Scale 12x: Efficient Multi-tenant Hadoop 2 Workloads with YARN (David Kaiser)
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
Hadoop MapReduce Performance Enhancement Using In-Node Combiners (ijcsit)
While advanced analysis of large datasets is in high demand, data sizes have surpassed the capabilities of conventional software and hardware. The Hadoop framework distributes large datasets over multiple commodity servers and performs parallel computations. We discuss the I/O bottlenecks of the Hadoop framework and propose methods for enhancing I/O performance. A proven approach is to cache data to maximize the memory-locality of all map tasks. We introduce an approach to optimize I/O: the in-node combining design, which extends the traditional combiner to the node level. The in-node combiner reduces the total number of intermediate results and curtails network traffic between mappers and reducers.
This presentation provides a comprehensive introduction to the Hadoop Distributed System, a powerful and widely used framework for distributed storage and processing of large-scale data. Hadoop has revolutionized the way organizations manage and analyze data, making it a crucial tool in the field of big data and data analytics.
In this presentation, we explore the key components and features of Hadoop, shedding light on the fundamental building blocks that enable its exceptional data processing capabilities. We cover essential topics, including the Hadoop Distributed File System (HDFS), MapReduce, YARN (Yet Another Resource Negotiator), and Hadoop Ecosystem components like Hive, Pig, and Spark.
2. What Is Hadoop?
• A system for processing mind-bogglingly large amounts of data.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
3. Hadoop Core
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware
• IT cost reduction
• MapReduce: computation
• HDFS: storage
4. Hadoop, Why?
• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day: failure is expected, rather than exceptional
• The number of nodes in a cluster is not constant
• Need common infrastructure: efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but here workloads are I/O bound, not CPU bound
5. Hadoop History
• 2004: Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by the Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! sets up a Hadoop research cluster of 300 nodes.
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Research clusters: two clusters of 1,000 nodes.
• April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day onto research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
7. How does HDFS Work?
Suppose we have a file of size 300 MB.
(Slide sidebar: MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.)

8. How does HDFS Work?
HDFS splits the file into blocks. The size of each block is 128 MB, so the 300 MB file becomes two 128 MB blocks plus one 44 MB block.

9. How does HDFS Work?
HDFS will keep 3 copies of each block. HDFS stores these blocks on datanodes, distributing the blocks to the DNs.

10. How does HDFS Work?
The Name Node tracks blocks and datanodes.
[Diagram: a Name Node connected to several datanodes (DN).]

11. How does HDFS Work?
Sometimes a datanode will die. Not a problem,
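The block arithmetic in the example above (a 300 MB file with a 128 MB block size) can be sketched in a few lines of Java. `HdfsBlockMath` is a hypothetical helper written for this illustration, not a Hadoop class:

```java
// Sketch of how HDFS-style block splitting divides a file.
// HdfsBlockMath is an illustrative name, not part of Hadoop.
public class HdfsBlockMath {
    // Returns the sizes (in MB) of the blocks a file is split into.
    static long[] split(long fileSizeMb, long blockSizeMb) {
        int full = (int) (fileSizeMb / blockSizeMb);   // number of full blocks
        long rest = fileSizeMb % blockSizeMb;          // size of the last, partial block
        long[] blocks = new long[full + (rest > 0 ? 1 : 0)];
        for (int i = 0; i < full; i++) blocks[i] = blockSizeMb;
        if (rest > 0) blocks[blocks.length - 1] = rest;
        return blocks;
    }

    public static void main(String[] args) {
        // A 300 MB file with 128 MB blocks -> 128 MB, 128 MB, 44 MB.
        for (long b : split(300, 128)) System.out.println(b + " MB");
    }
}
```

Each of these blocks, not the whole file, is then replicated 3 times across the datanodes.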
16. MapReduce: Programming Model
• Process data using special map() and reduce() functions
• The map() function is called on every item in the input and emits a series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key, with its value list, and emits a value that is added to the output
17. MapReduce: Programming Model
[Diagram: word-count data flow through the MapReduce framework, from Input through Map and Reduce to Output. The input lines "How now brown cow" and "How does it work now" pass through map tasks, which emit pairs such as <How,1>, <now,1>, <brown,1>, <cow,1>, <does,1>, <it,1>, <work,1>. The framework groups the pairs by key (e.g. <How,1 1>, <now,1 1>), and reduce tasks emit the final counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]
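The map/group/reduce flow in the diagram above can be imitated in plain Java, without Hadoop, to make the model concrete. `LocalMapReduce` is an illustrative, single-process sketch of the programming model, not a distributed implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A tiny, single-process imitation of the MapReduce flow:
// "map" emits <word, 1> pairs, the framework groups pairs by key,
// and "reduce" sums each key's value list. A sketch of the model only.
public class LocalMapReduce {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit <word, 1> for every token, grouped by key as we go.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        // Reduce phase: sum the value list of each unique key.
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            out.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("How now brown cow", "How does it work now")));
    }
}
```

Running this reproduces the counts in the diagram: How 2, now 2, and 1 each for brown, cow, does, it, and work.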
19. MapReduce Life Cycle
• Write a map function
• Write a reduce function
• Run this program as a MapReduce job
20. Hadoop Environment
Hadoop has become the kernel of the distributed operating system for Big Data. The project includes these modules:
• Hadoop Common: the common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
23. What is ZooKeeper?
• A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
• A set of tools to build distributed applications that can safely handle partial failures.
• ZooKeeper was designed to store coordination data: status information, configuration, and location information.
24. ZooKeeper
• ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system.
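To make the "hierarchical namespace of data registers" concrete, here is a toy in-memory registry keyed by slash-separated paths. This is only a sketch of the znode data model under that assumption; it is not the ZooKeeper client API, and `ToyZnodes` is a made-up name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of a znode-like hierarchical namespace: slash-separated
// paths map to small payloads, and the children of a path can be listed,
// much like a file system. Not the real ZooKeeper API.
public class ToyZnodes {
    private final Map<String, String> nodes = new TreeMap<>();

    void create(String path, String data) { nodes.put(path, data); }
    String read(String path) { return nodes.get(path); }

    // List the direct children of a path, e.g. children("/services/web").
    List<String> children(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        List<String> out = new ArrayList<>();
        for (String p : nodes.keySet())
            if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0)
                out.add(p.substring(prefix.length()));
        return out;
    }

    public static void main(String[] args) {
        ToyZnodes registry = new ToyZnodes();
        // Typical coordination data: location and configuration information.
        registry.create("/services/web/host1", "10.0.0.1:8080");
        registry.create("/services/web/host2", "10.0.0.2:8080");
        registry.create("/config/replication", "3");
        System.out.println(registry.children("/services/web")); // [host1, host2]
    }
}
```

In real ZooKeeper the same shape of data (status, configuration, location) lives in znodes, with watches and ordering guarantees layered on top.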
26. Flume
• Flume is a distributed, reliable, and available data collection service.
• It efficiently collects, aggregates, and moves large amounts of data.
• Fault tolerant, with many failover and recovery mechanisms.
• A one-stop solution for data collection in all formats.
• It has a simple and flexible architecture based on streaming data flows.
29. Sqoop
• Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
• Easy, parallel database import/export.
• What do you want to do?
• Import data from an RDBMS into HDFS
• Export data from HDFS back into an RDBMS
32. Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master.
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code.
• Many organizations have programmers who are skilled at writing code in scripting languages.
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce.
• Hive was initially developed at Facebook, Pig at Yahoo!.
33. Hive
What is Hive?
• An SQL-like interface to Hadoop.
• A data warehouse infrastructure that provides easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop-compatible file systems.
• MapReduce for execution, HDFS for storage.
Hive Query Language
• Basic SQL: SELECT, FROM, JOIN, GROUP BY
• Equi-join, multi-table insert, multi-group-by
• Batch queries, for example:
SELECT storeid, count(*) FROM purchases WHERE price > 100 GROUP BY storeid
35. Pig
• Apache Pig is a platform to analyze large data sets.
• In simple terms: you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and then run that processing on the data.
• The other way is to write Pig scripts, which are in turn converted to MapReduce code that processes your data.
• Pig consists of two parts: the Pig Latin language and the Pig engine.
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
36. Pig Latin & Pig Engine
• Pig Latin is a scripting language which lets you describe how data flowing from one or more inputs should be read, how it should be processed, and then where it should be stored.
• The flows can be simple or complex, with some processing applied in between. Data can be picked up from multiple inputs.
• We can say Pig Latin describes a directed acyclic graph whose edges are data flows and whose nodes are operators that process the data.
• Pig Engine: the job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
37. Why Pig is required when we can code it all in MR
• Pig provides all the standard data processing operations, like sort, group, join, filter, order by, and union, right inside Pig Latin.
• In MR we have to do lots of manual coding.
• Pig optimizes Pig Latin scripts while compiling them into MR jobs: it creates an optimized version of MapReduce to run on Hadoop.
• It takes much less time to write a Pig Latin script than to write the corresponding MR code.
• Where Pig is useful:
• Transactional ETL data pipelines (the most common use)
• Research on raw data
• Iterative processing
39. WordCount Example
• Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
40. WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
41. WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;
42. WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
45. HBase
• Apache HBase™ is the Hadoop database: a distributed, scalable, big data store.
• Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al.
• Coordinated by ZooKeeper
• Low latency
• Random reads and writes
• Distributed key/value store
• Simple API:
– PUT
– GET
– DELETE
– SCAN
46. HBase
HBase is a type of "NoSQL" database. NoSQL? "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language; there are many types of NoSQL databases. BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
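The four-call surface listed above (PUT, GET, DELETE, SCAN) can be illustrated with a sorted in-memory map. `ToyKeyValueStore` is a hypothetical sketch of the access pattern only; it is not the HBase client API, and real HBase spreads sorted key ranges across region servers rather than one map:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of an HBase-like key/value access pattern over a sorted map.
// A single TreeMap stands in for HBase's distributed, sorted store.
public class ToyKeyValueStore {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) { rows.put(rowKey, value); }
    String get(String rowKey) { return rows.get(rowKey); }
    void delete(String rowKey) { rows.remove(rowKey); }

    // Scan returns all rows whose keys fall in [startRow, stopRow),
    // which is cheap because the keys are kept sorted.
    SortedMap<String, String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }

    public static void main(String[] args) {
        ToyKeyValueStore store = new ToyKeyValueStore();
        store.put("row1", "alpha");
        store.put("row2", "beta");
        store.put("row3", "gamma");
        store.delete("row2");
        System.out.println(store.scan("row1", "row3").keySet()); // [row1]
    }
}
```

The sorted-by-row-key layout is what makes range scans a first-class operation in this kind of store, in contrast to hash-based key/value databases.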
49. What is Oozie?
• Oozie is a server-based workflow scheduler system to manage Apache Hadoop jobs (e.g. loading data, storing data, analyzing data, cleaning data, running MapReduce jobs, etc.)
• A Java web application
• Oozie is a workflow scheduler for Hadoop; workflows can be triggered by time or by data availability
[Diagram: an example workflow of five jobs (Job 1 through Job 5) with Time and Data triggers.]
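The core idea of dependency-driven scheduling, where a job runs only after its upstream jobs finish, can be sketched with a simple topological ordering. `ToyWorkflow` and the job names are illustrative assumptions; this is the scheduling idea, not Oozie's workflow XML or API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy workflow runner: each job lists the jobs it depends on, and a job
// is "run" (appended to the order) only once all of its dependencies
// have run. Assumes the dependency graph is acyclic, as a workflow must be.
public class ToyWorkflow {
    static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        while (order.size() < deps.size()) {
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                // Runnable when not yet run and every dependency has run.
                if (!order.contains(e.getKey()) && order.containsAll(e.getValue()))
                    order.add(e.getKey());
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical workflow: Job 3 waits for Jobs 1 and 2;
        // Jobs 4 and 5 wait for Job 3.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Job 1", List.of());
        deps.put("Job 2", List.of());
        deps.put("Job 3", List.of("Job 1", "Job 2"));
        deps.put("Job 4", List.of("Job 3"));
        deps.put("Job 5", List.of("Job 3"));
        System.out.println(runOrder(deps));
    }
}
```

Oozie adds to this core idea the time and data triggers mentioned above, retries, and forks/joins for running independent branches in parallel.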
51. What is Mahout?
• A machine-learning tool
• Distributed and scalable machine learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
• "Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm."