HADOOP 101
Cluster Computing Made Easy
Show of Hands
Big Data
Volume
Variety
Velocity
Common Types of Analysis
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
Hadoop
Hadoop is a cluster storage and computing framework.
Changing of the Guard
“Scale out guarantees that hardware and software will fail”
“I don’t want to see any more 2001 papers about how awesome my IT team was because they could reshard my database on demand.”
Storage
[Diagram: a sample file split into blocks A and B, with each block replicated on several nodes of the cluster]
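The speaker notes for the Storage slides describe a file being broken into blocks (128 MB by default) with each block replicated three times. That arithmetic can be worked through in a few lines. This is a plain Python sketch, not the Hadoop API; the helper name `plan_storage` and the 300 MB example file are assumptions made for illustration.

```python
import math

MB = 1024 * 1024

def plan_storage(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (block_count, replica_count, raw_bytes_stored) for one file.

    Hypothetical helper for illustration only. The defaults mirror the
    figures quoted in the talk; this is not a Hadoop class or method.
    """
    # The last block may be only partly full, so round up.
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is copied onto `replication` different nodes.
    replicas = blocks * replication
    # A partly filled final block only occupies its real size,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes

blocks, replicas, raw = plan_storage(300 * MB)
print(blocks, replicas, raw // MB)  # 3 9 900
```

With the defaults, losing any single node still leaves two copies of every block it held, which is the point the notes make about node failure.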
Tunneling Through the Cost Barrier
Solutions
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
Cluster Computing Complexities
• Process management
• Communication
• Data movement
• Task coordination
• Partial failures
• Scheduling
• Tracking
Robustness
Resilience
Performance
Simplicity
Where Do You Fit?
[Diagram: Input Split 1..n -> Record Reader -> Mapper -> Partitioner, then Shuffle and Sort -> Reducer -> Output Format -> Output File]
Storage
[Diagram repeated: blocks A and B of the sample file, each replicated on several nodes of the cluster]
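The notes for this slide say that a whole map phase runs on one of the servers where a block is stored. A minimal sketch of that idea follows, in plain Python rather than Hadoop's real scheduler. The node names, the `block_locations` table, and the `pick_node` helper are all hypothetical.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Choose a node to run the map task for `block_id`.

    Prefers a free node that already stores a replica of the block, so the
    data does not have to cross the network. Hypothetical helper; this is
    not Hadoop's actual scheduling logic.
    Returns (node, is_data_local).
    """
    for node in block_locations[block_id]:
        if node in free_nodes:
            return node, True
    # Assumption for this sketch: fall back to any free node, which would
    # then have to read the block from another machine.
    fallback = sorted(free_nodes)[0] if free_nodes else None
    return fallback, False

# Assumed layout: two blocks, three replicas each, four nodes.
block_locations = {
    "A": ["node1", "node2", "node4"],
    "B": ["node2", "node3", "node4"],
}
print(pick_node("A", block_locations, {"node2", "node3"}))  # ('node2', True)
print(pick_node("A", block_locations, {"node3"}))           # ('node3', False)
```

Because every block lives on several nodes, there is usually more than one place where the work can run next to its data.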
Where Do You Fit?
[Diagram: the pipeline for the sample file. Input Split A and Input Split B each pass through Record Reader -> Mapper -> Partitioner, then Shuffle and Sort -> Reducer -> Output Format -> Output File]
Mapper Purpose
• Sanitize data
• Select subsets
• Convert
[Diagram: Input Split A -> Record Reader -> Mapper -> Partitioner]
Mapper
• Input:
  • Key
  • Value
  • Context
• Output:
  • Key
  • Value
[Diagram: Input Split A -> Record Reader -> Mapper -> Partitioner]
Mapper
Word Count Mapper
• Input: (Long, Text)
  • Key: 0
  • Value: “the cat sat on the mat”
• Output: (Text, Long)
Key   Value
the   1
cat   1
sat   1
on    1
the   1
mat   1
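The word count mapper on this slide can be sketched in a few lines. This is plain Python standing in for a Hadoop Mapper, not the Hadoop API: the function name `word_count_mapper` is hypothetical, and yielding pairs from a generator is an assumption that replaces writing to the Context object.

```python
def word_count_mapper(key, value):
    """Emit (word, 1) for every word in one input record.

    key:   starting byte offset of the line (unused by word count)
    value: the text of the line
    """
    for word in value.split():
        # One output record per word, duplicates included.
        yield word, 1

pairs = list(word_count_mapper(0, "the cat sat on the mat"))
print(pairs)
# [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]
```

The mapper does no counting of its own. It emits "the" twice and leaves the adding up to a later stage.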
Where Do You Fit?
[Diagram repeated: Input Split A and Input Split B each pass through Record Reader -> Mapper -> Partitioner, then Shuffle and Sort -> Reducer -> Output Format -> Output File]
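Between the mappers and the reducers sits the Shuffle and Sort step, which the notes describe as collecting the mapper results and delivering them, sorted, to the reducers. A rough sketch in plain Python follows. The function name `shuffle_and_sort` and the way the sample sentence is divided between two splits are assumptions, and the sketch ignores the partitioner by pretending there is a single reducer.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Group values by key across all mappers and return keys in sorted order.

    mapper_outputs: one list of (key, value) pairs per mapper.
    Hypothetical stand-in for the framework's shuffle, assuming one reducer.
    """
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return [(key, grouped[key]) for key in sorted(grouped)]

# Assumed: the sample sentence happened to be divided between two splits.
split_a = [("the", 1), ("cat", 1), ("sat", 1)]
split_b = [("on", 1), ("the", 1), ("mat", 1)]
print(shuffle_and_sort([split_a, split_b]))
# [('cat', [1]), ('mat', [1]), ('on', [1]), ('sat', [1]), ('the', [1, 1])]
```

The two "the" records come from different mappers but end up under one key, which is why this step cannot start delivering final groups until every mapper has finished.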
Reducer
• Input:
  • Key
  • Values // This is an iterable
  • Context
• Output:
  • Key
  • Value
Reducer
Input:
Key   Values
cat   1
mat   1
on    1
sat   1
the   1, 1
Output:
Key   Value
cat   1
mat   1
on    1
sat   1
the   2
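The step from the input table to the output table on this slide is a sum over each key's values. A minimal sketch in plain Python, not the Hadoop Reducer API: the name `word_count_reducer` is hypothetical, and returning a pair is an assumption that replaces writing to the Context object.

```python
def word_count_reducer(key, values):
    """Collapse all the counts for one word into a single total.

    key:    a word
    values: an iterable of the 1s emitted by the mappers for that word
    """
    return key, sum(values)

# The sorted, grouped records shown on the slide.
reducer_input = [
    ("cat", [1]), ("mat", [1]), ("on", [1]), ("sat", [1]), ("the", [1, 1]),
]
for key, values in reducer_input:
    word, total = word_count_reducer(key, values)
    print(word, total)
```

The reducer is called once per key, so "the" arrives with both of its values and comes out as a single record with a count of 2.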
Reducer
reduce(){
}
part-r-00001
Demo
• MRUnit
• Mapper
• Reducer
• Run the whole cycle
Platform
Bibliography
• Rear Admiral Hopper http://www.youtube.com/watch?v=1-vcErOPofQ
• Mike Olson talk http://web.archive.org/web/20130729201323id_/http://itc.conversationsnetwork.org/shows/detail4868.html
• Large-Scale C++ Software Design by John Lakos http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620
Jim Argeropoulos
• tenholeharp@gmail.com
• @exploremqt
• https://github.com/exploremqt

Hadoop 101: North East Wisconsin Code Camp

Editor's Notes

  1. I’d like to get an idea of where you are coming from, so I have a couple of quick questions. How many have heard of Big Data? How many have heard of Hadoop? How many have used Hadoop?
  2. Big Data is one of those hot buzzwords that leaves you with the impression that it can do superhuman things, like a superhero of the software world maybe. It’s a stretch question, but what have these three images got to do with Big Data? Big Data is sometimes defined as Volume, Velocity, and Variety.
  3. Any two of these often qualify as some form of big data. Volume is increasing as the number of devices that can generate data, even without direct human input, is increasing. Some of those, like GPS movement, accelerometers, microphones, or cameras, can generate a lot more data than a human. Variety is important because there are many new kinds of data sources coming into play and they may not fit into the schema you have today.
  4. Hadoop was created in 2005 by Doug Cutting after reading the Google File System white paper and deciding he needed such a system for a project he was working on. The name “Hadoop” comes from a nonsense name his son used for his yellow toy elephant. It was short, pronounceable, and an open domain, so Doug used it for his new project. The key pieces to see are that Hadoop is a framework for both storage and computing. It does the heavy lifting in each of those areas for you.
  5. My first exposure to Hadoop was a Mike Olson talk at the 2011 MySQL Conference. In it he said a number of important things, but two of them really stuck with me. The first one is pretty obvious; it’s just a numbers game. The more machines you have, the higher the probability that something fails. The second one though… For me it was like the late ’90s and reading John Lakos talk about automated tests for his own software. “Cool, but how? Tell me! I want to know.” Just like John’s book, Mike’s talk really didn’t give any answers. Mike is the CEO of Cloudera, a Hadoop distribution publisher, so his version of the answer lies within Hadoop. It took me a couple of years before I dug in enough to understand his position. Let’s explore how Hadoop provides an answer to Mike’s second statement.
  6. I said Hadoop was a framework for storage and processing. Here we have the storage aspect. This is the answer to both of Mike Olson’s statements. A Hadoop cluster is intentionally made up of commodity hardware. This makes it cost less to scale out. But commodity hardware means no RAID drives, no redundant hot-swappable power supplies, and none of the other things that raise the number of 9’s for a server. To raise the reliability of the system, Hadoop plays the role of a RAID controller. When you add a file to Hadoop it breaks the file into blocks (128 MB by default) and replicates each block onto multiple computers in the cluster. The default replication factor is three, which means each block will exist on three nodes of the cluster.
  7. When a node in the cluster inevitably fails, two more copies exist and the system can continue to function without data loss. You may be asking, “Why not just use RAID drives or a SAN?”
  8. Before we go to the answer, I’d like to briefly introduce an energy nerd, Amory Lovins. Amory does a lot of work in the field of saving energy. He has a favorite phrase: “Tunneling through the Cost Barrier”. Many will want to save energy by just adding insulation. However, you reach a point where adding more insulation doesn’t reduce the heat loss significantly. Because of this many would stop here, but not Amory. He continues to add insulation and other efficiency features to a building. Having done this, he can then start to take out furnaces, ducting, and other expensive capital equipment. In the end he saves more money by going past the point of diminishing returns. I defined Hadoop as a cluster storage and computing framework. We looked briefly at the storage aspect. It is the computing aspect that lets us tunnel through the cost barrier.
  9. In traditional computing, you can choose to scale up, but you reach a point of diminishing returns. At some point you just can’t effectively build a big enough computer. It is then time for people to step in with unconventional ideas.
  10. Can anyone identify who this is? <Rear Admiral Grace Hopper> Despite the stern look on her face here, she was a card and a master in the use of word pictures. She is credited with popularizing the term “debugging” after pulling a moth from the relays of a Mark II computer. I put a YouTube link in the bibliography if you want to watch her. Here is one of her word pictures that leads to a scale-out solution.
  11. <Pause for audience to read> When we encounter problems that require a large amount of computing resources, the Hadoop solution isn’t a bigger computer, but a system of computers, as Rear Admiral Hopper would say. But cluster computing isn’t without its own set of problems.
  12. I once heard a joke about multi-threading: The beginning developer thinks multi-threading is hard. The intermediate developer thinks multi-threading is easy. The advanced developer knows multi-threading is hard. If multi-threaded development is hard when it all takes place on a single machine, then managing parallel processing across many machines is going to be harder. The computing aspect of the Hadoop framework takes out much of the complexity.
  13. I once heard a joke about multi-threading: The beginning developer thinks multi-threading is hard. The intermediate developer thinks multi-threading is easy. The advanced developer knows multi-threading is hard. If multi-threaded development is hard when it all takes place on a single machine, then managing parallel processing on many machines is going to be harder. The computing aspect of the Hadoop framework takes out much of the complexity.
  14. These are the major pieces in the system. You can specify the types shown as rounded rectangles. In most cases you implement only the orange rounded rectangles yourself. Why do I have the top square in each column labeled as input split 1..n?
  15. If you recall from our earlier example, our sample file was split into two blocks. And you recall each block was replicated to multiple servers. By adding the ability to process each block where it is stored, we have just tunneled through the cost barrier. Not only did we get rid of expensive RAID drives, we also added a bunch of cores to do analysis work for us. A whole map phase will take place on one of the servers where a block is stored.
  16. So in our sample, the map phase would look more like this. Each green square takes place on a single cluster node. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted. The resulting data is delivered to the reducers.
  17. In BI terms you might think of the mapper as the Extract and Transform portion of a standard ETL process.
  18. For each record found in the input split, the mapper gets called once. The input is always a key and a value. The mapper does its magic and writes out a key and a value. Very often the input key/value types are different from the output key/value types.
  19. Word Count is the hello world of Hadoop. This sample assumes that we are reading from an HDFS file. A record is a line of text from the file. The key of the record is the starting byte offset of the line. The value is the text of the line. Since we want to count the unique words, we will transform our input. We split the input value into individual words and write to the output once for each word.
  20. Let’s go back to our topology view. We’ve looked a bit at the mapper; now let’s look at the reducer. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted. The resulting data is delivered to the reducers.
  21. The important thing to note here is that the keys are sorted and the individual values emitted by the mappers for each key are gathered into an array.
  22. I said that Hadoop was a framework. And that is true, but it is also a platform. All of these Apache projects are built on top of Hadoop.