7. Changing of the Guard
“Scale out guarantees that hardware and software will fail”
“I don’t want to see any more 2011 papers about how awesome my IT team was because they could reshard my database on demand.”
13. Solutions
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
15. Cluster Computing
Complexities
Process management
Communication
Data movement
Task coordination
Partial failures
Scheduling
Tracking
Robustness
Resilience
Performance
Simplicity
16. Where Do You Fit?
[Diagram: Input Splits 1 through n, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
18. Where Do You Fit?
[Diagram: Input Splits A and B, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
19. Mapper Purpose
Sanitize Data
Select Subsets
Convert
[Diagram: Input Split A flowing through a Record Reader, Mapper, and Partitioner]
20. Mapper
Input:
Key
Value
Context
Output:
Key
Value
[Diagram: Input Split A flowing through a Record Reader, Mapper, and Partitioner, with the Mapper highlighted]
21. Word Count Mapper
Input: (Long, Text)
Key: 0
Value: “the cat sat on the mat”
Output: (Text, Long)
Key Value
the 1
cat 1
sat 1
on 1
the 1
mat 1
22. Where Do You Fit?
[Diagram: Input Splits A and B, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
27. Bibliography
Rear Admiral Hopper: http://www.youtube.com/watch?v=1-vcErOPofQ
Mike Olson talk: http://web.archive.org/web/20130729201323id_/http://itc.conversationsnetwork.org/shows/detail4868.html
Large-Scale C++ Software Design by John Lakos: http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620
I’d like to get an idea of where you are coming from, so I have a couple of quick questions. How many have heard of Big Data? How many have heard of Hadoop? How many have used Hadoop?
Big Data is one of those hot buzzwords that leaves you with the impression that it can do superhuman things, like a superhero of the software world, maybe. It’s a stretch question, but what have these three images got to do with Big Data? Big Data is sometimes defined as Volume, Velocity, and Variety.
Any two of these often qualify as some form of big data. Volume is increasing because the number of devices that can generate data, even without direct human input, keeps growing. Some of those, like GPS movement, accelerometers, microphones, or cameras, can generate a lot more data than a human. Variety is important because many new kinds of data sources are coming into play, and they may not fit into the schema you have today.
Hadoop was created in 2005 by Doug Cutting after reading the Google File System white paper and deciding he needed such a system for a project he was working on. The name “Hadoop” comes from a nonsense name his son used for a yellow toy elephant. It was short, pronounceable, and the domain was open, so Doug used it for his new project. The key pieces to see are that Hadoop is a framework for both storage and computing. It does the heavy lifting in each of those areas for you.
My first exposure to Hadoop was a Mike Olson talk at the 2011 MySQL Conference. In it he said a number of important things, but two of them really stuck with me. The first one is pretty obvious; it’s just a numbers game: the more you have, the higher the probability of failure. The second one, though… For me it was like the late ’90s, reading John Lakos talk about automated tests for his own software: “Cool, but how? Tell me! I want to know.” Just like John’s book, Mike’s talk really didn’t give any answers. Mike is the CEO of Cloudera, a Hadoop distribution publisher, so his version of the answer lies within Hadoop. It took me a couple of years before I’d dug in enough to understand his position. Let’s explore how Hadoop provides an answer to Mike’s second statement.
I said Hadoop was a framework for storage and processing. Here we have the storage aspect. This is the answer to both of Mike Olson’s statements. A Hadoop cluster is intentionally made up of commodity hardware, which makes it cost less to scale out. But commodity hardware means no RAID drives, no redundant hot-swappable power supplies, and none of the other things that raise the number of 9’s for a server. To raise the reliability of the system, Hadoop plays the role of a RAID controller. When you add a file to Hadoop, it breaks the file into blocks (128 MB by default) and replicates each block onto multiple computers in the cluster. The default replication factor is three, which means each block will exist on three nodes of the cluster.
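If you want to see that replication for yourself, here is a minimal sketch using the Hadoop FileSystem API; the class name and the file path are hypothetical, and it assumes a reachable cluster configured in the usual core-site.xml/hdfs-site.xml files.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical utility: list the blocks of an HDFS file and the nodes holding
// each replica (three per block with the default replication factor).
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // cluster settings from the default config
    Path file = new Path(args[0]);                       // path of a file already loaded into HDFS
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; getHosts() names the datanodes with a copy.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", replicas on " + Arrays.toString(block.getHosts()));
    }
  }
}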
When a node in the cluster invariably fails, two more copies of its blocks exist and the system can continue to function without data loss. You may be asking, “Why not just use RAID drives or a SAN?”
Before we get to the answer, I’d like to briefly introduce an energy nerd, Amory Lovins. Amory does a lot of work in the field of saving energy. He has a favorite phrase: “Tunneling through the Cost Barrier”. Many will want to save energy by just adding insulation. However, you reach a point where adding more insulation doesn’t reduce the heat loss significantly. Because of this, many would stop there, but not Amory. He continues to add insulation and other efficiency features to a building. Having done this, he can then start to take out furnaces, ducting, and other expensive capital equipment. In the end he saves more money by going past the point of diminishing returns. I defined Hadoop as a cluster storage and computing framework. We looked briefly at the storage aspect. It is the computing aspect that lets us tunnel through the cost barrier.
In traditional computing, you can choose to scale up, but you reach a point of diminishing returns. At some point you just can’t effectively build a big enough computer. It is then time for people to step in with unconventional ideas.
Can anyone identify who this is? <Rear Admiral Grace Hopper> Despite the stern look on her face here, she was a card and a master of word pictures. She is credited with popularizing the term “debugging” after pulling a moth from the relays of a Mark II computer. I put a YouTube link in the bibliography if you want to watch her. Here is one of her word pictures that leads to a scale-out solution.
<Pause for audience to read> When we encounter problems that require a large amount of computing resources, the Hadoop solution isn’t a bigger computer but a system of computers, as Rear Admiral Hopper would say. But cluster computing isn’t without its own set of problems.
I once heard a joke about multi-threading: the beginning developer thinks multi-threading is hard; the intermediate developer thinks multi-threading is easy; the advanced developer knows multi-threading is hard. If multi-threaded development is hard when it all takes place on a single machine, then managing parallel processing across many machines is going to be harder. The computing aspect of the Hadoop framework takes out much of the complexity.
These are the major pieces in the system. You have the ability to specify the types designated by rounded rectangles. In most cases you specifically implement the orange rounded rectangles. Why do I have the top square in each column labeled as input split 1..n?
If you recall from our earlier example, our sample file was split into two blocks, and each block was replicated to multiple servers. By adding the ability to process each block where it is stored, we have just tunneled through the cost barrier. Not only did we get rid of expensive RAID drives, we also added a bunch of cores to do analysis work for us. A whole map phase will take place on one of the servers where a block is stored.
So in our sample, the map phase would look more like this. Each green square takes place on a single node of the cluster. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted, and the resulting data is delivered to the reducers.
In BI terms you might think of the mapper as the Extract Transform portion of a standard ETL.
For each record found in the input split, the mapper gets called once. The input is always a key and a value. The mapper does its magic and writes out a key and a value. Very often the output key/value types are different from the input key/value types.
Word Count is the “hello world” of Hadoop. This sample assumes that we are reading from an HDFS file. A record is a line of text from the file. The key of the record is the starting byte offset of the line, and the value is the text of the line. Since we want to count the unique words, we transform our input: we split the input value into individual words and write to the output once for each word.
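A minimal sketch of that mapper, assuming the standard org.apache.hadoop.mapreduce API and the (Long, Text) in / (Text, Long) out types from the slide; the class name and the whitespace split are my own choices.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line, input value = the line of text.
// Output key = a word, output value = the count 1.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into words and emit (word, 1) once per word.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

For the sample line on the slide, this writes (the, 1) twice along with the other four words, matching the output table above.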
Let’s go back to our topology view. We’ve looked a bit at the mapper; now let’s look at the reducer. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted, and the resulting data is delivered to the reducers.
The important thing to note here is that the keys are sorted and that all of the values the mappers output for a given key are gathered into an array.
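Continuing the word count example, here is a hedged sketch of the matching reducer (the class name is again hypothetical): for each word, the framework hands it the sorted key and every 1 the mappers emitted for that word, and it writes out the sum.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) after shuffle and sort. Output: (word, total count).
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable total = new LongWritable();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable count : values) {
      sum += count.get();
    }
    total.set(sum);
    context.write(key, total);
  }
}

For the sample line, the reducer would receive (the, [1, 1]) and write (the, 2).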
I said that Hadoop was a framework, and that is true, but it is also a platform. All of these Apache projects are built on top of Hadoop.
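For completeness, here is a sketch of a driver that wires the hypothetical mapper and reducer above into a Hadoop Job. The input and output paths come from the command line, and the default hash Partitioner plus the text input and output formats stand in for the remaining boxes in the topology diagram.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // The pieces you implement yourself (the orange boxes).
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // The pieces you usually just pick from the framework; the record reader
    // comes from the input format, and the default hash Partitioner is used.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the mapper and reducer emit the same key/value types here, one setOutputKeyClass/setOutputValueClass pair covers both; if they differed, you would also call setMapOutputKeyClass and setMapOutputValueClass.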