This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and Map Reduce.
1. Apache Hadoop
DFS and Map Reduce
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. In Artificial Intelligence, Pattern Recognition, and Digital
Image
Course 2014/2015
2. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Who has not heard
about Hadoop?
3. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
4. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Who knows exactly what Hadoop is?
5. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Being simplistic: DFS + Map Reduce
What is Apache Hadoop?
6. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Google publishes paper about GFS (2003).
http://research.google.com/archive/gfs.html
➢ Distributed data among cluster of computers
➢ Fault tolerant
➢ Highly scalable with commodity hardware
A bit of history: Distributed File
System (DFS)
7. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Google publishes paper about MR (2004).
http://research.google.com/archive/mapreduce.html
➢ Algorithm for processing distributed data in parallel
➢ Simple in concept, extremely useful in practice
A bit of history: Map Reduce (MR)
8. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Doug Cutting and Mike Cafarella → Apache Nutch
➢ Doug Cutting goes to Yahoo
➢ Yahoo implements Apache Hadoop
A bit of history: Hadoop is born
9. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Framework for distributed computing
➢ Still based on DFS and MR
➢ It is the main actor in Big Data
➢ Last major release: Apache Hadoop 2.6.0 (Nov 2014)
http://hadoop.apache.org/
Apache Hadoop now
10. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
DFS architecture
11. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS:
creating dirs
➢ Examples:
hdfs dfs -mkdir data
hdfs dfs -mkdir results
12. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS:
uploading files
➢ Examples:
hdfs dfs -put datasets/students.tsv data/students.tsv
hdfs dfs -put datasets/grades.tsv data/grades.tsv
13. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: listing
➢ Examples:
hdfs dfs -ls data
Found 2 items
-rw-r--r-- 3 sanguix supergroup 450 2015-02-09 10:50 data/grades.tsv
-rw-r--r-- 3 sanguix supergroup 194 2015-02-09 10:45 data/students.tsv
14. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: get a
file
➢ Examples:
hdfs dfs -get data/students.tsv
hdfs dfs -get data/grades.tsv
15. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS:
deleting files
➢ Examples:
hdfs dfs -rm data/students.tsv
hdfs dfs -rm data/grades.tsv
16. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: space
use info
➢ Examples:
hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://localhost 1.5 T 12 K 491.6 G 0%
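Not shown on the slides, but handy while following the examples (a standard HDFS shell command, applied here to one of the files uploaded earlier):
hdfs dfs -cat data/students.tsv   # print the contents of a file stored in HDFS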
17. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce: Overview
[Diagram: input data chunks feed map tasks, which emit (key, value) pairs; after the shuffle, reduce tasks aggregate the values for each key and write the output values.]
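Schematically (a sketch of the contract, not Hadoop's exact Java signatures), the two phases can be summarised as:
map(chunk of data) → list of (key, value) pairs
reduce(key, list of values for that key) → output value(s)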
18. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map: Transform data to (key, value)
[Diagram: each input data chunk is read by a map task, which transforms it into (key, value) pairs.]
19. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Shuffle: Send (key, values)
[Diagram: the shuffle routes every (key, value) pair emitted by the map tasks to the reduce task responsible for that key.]
20. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Reduce: Aggregating (key,values)
[Diagram: each reduce task aggregates the values received for its keys and writes the resulting output values.]
21. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce
[Diagram: the full pipeline once more: input data chunks → map tasks → (key, value) pairs → shuffle → reduce tasks → output values.]
22. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word count
CHUNK 1
this class is about big
data and artificial
intelligence
CHUNK 2
there is nothing big
about this example
CHUNK 3
I am a big artificial
intelligence enthusiast
➢ The file is divided into chunks to be processed in parallel
➢ Data is sent untransformed to the map nodes
23. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word count
MAP TASK
Raw chunk:
this class is about big data and artificial intelligence
Tokenize:
[this, class, is, about, big, data, and, artificial, intelligence]
Prepare (key, value) pairs, ready to shuffle:
(this,1), (class,1), (is,1), (about,1), (big,1), (data,1), (and,1), (artificial,1), (intelligence,1)
24. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word count
REDUCE TASK
From shuffle: (big,1), (big,1), (big,1)
Sum
Output: (big,3)
25. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Exercise: Matrix power
row column value
1 1 3.2
2 3 4.3
3 3 5.1
1 3 0.1
26. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce variants: No reduce
[Diagram: input data chunks → map tasks → (key, value) pairs written directly as output data, with no reduce phase.]
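Not on the slide, but worth knowing (standard Hadoop MapReduce API, using the Job object shown later in the deck): a map-only job is obtained by setting the number of reduce tasks to zero.
job.setNumReduceTasks(0);   // 0 reducers: map output is written directly to the output path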
27. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce variants: chaining
[Diagram: two Map Reduce stages chained together: the output of the first map → reduce stage becomes the input of a second map → reduce stage.]
28. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Maps are executed in parallel
➢ Reducers do not start until all maps have finished
➢ The output is not finished until all reducers have finished
➢ Bottleneck: unbalanced map/reduce tasks
○ Change the key distribution
○ Increase the number of reducers to increase parallelism (a sketch follows this slide)
Map Reduce: bottlenecks
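A minimal sketch of both mitigations, assuming the WordCount job defined later in the deck (Partitioner, setPartitionerClass and setNumReduceTasks are standard Hadoop MapReduce APIs; the class below mimics the default hash partitioner and is purely illustrative):
// A custom Partitioner decides which reducer receives each key and can be
// used to fight key skew; this version uses the same scheme as Hadoop's
// default HashPartitioner.
public static class BalancedPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
// In the driver:
job.setPartitionerClass(BalancedPartitioner.class);
job.setNumReduceTasks(8);   // example value: more reducers, more reduce-side parallelism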
29. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Hadoop is implemented in Java
➢ It is possible to program jobs formed by maps and reduces in Java
➢ We won't go deep into these matters (bear with me!)
Map Reduce in Hadoop
30. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
http://hadoop.apache.org/
Hadoop architecture
31. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (token, 1) for every token in the input value
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
Map Reduce job in Hadoop
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sum all counts received for a key and emit (key, total)
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
...
32. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Map Reduce job in Hadoop
33. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Compiling
javac -cp opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d WordCount source/hadoop/WordCount.java
jar -cvf WordCount.jar -C WordCount/ .
➢ Submitting
hadoop jar WordCount.jar es.upv.dsic.iarfid.haia.WordCount /user/your_username/data/students.tsv /user/your_username/wc
Compiling and submitting a MR job
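Once the job finishes, the result can be read back from HDFS (assuming the output path used above; MapReduce typically names each reducer's output file part-r-00000, part-r-00001, ...):
hdfs dfs -cat /user/your_username/wc/part-r-00000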
34. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Hadoop ecosystem
35. Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ http://hadoop.apache.org
➢ Hadoop in Practice. Alex Holmes. Ed. Manning Publications
➢ Hadoop: The Definitive Guide. Tom White. Ed. O’Reilly
➢ StackOverflow
Extra information
36. Apache Hadoop
DFS and Map Reduce
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. In Artificial Intelligence, Pattern Recognition, and Digital
Image
Course 2014/2015