SlideShare a Scribd company logo
1 of 35
1
What Can Hadoop Do for You?
@EvaAndreasson | Sr. Product Manager, Cloudera
2014
CONFIDENTIAL - RESTRICTED
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/hadoop-usage-examples
http://www.infoq.com/presentati
ons/nasa-big-data
http://www.infoq.com/presentati
ons/nasa-big-data
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Today’s data challenges
• Apache Hadoop and its ecosystem 101
• Common use cases
• Where to learn more
• Q&A
2
Today’s Data Challenges
3
• “Return On Bytes”
• How to cost efficiently query, manage, and store
TBs (or even PBs) of data?
• Pre-mature data death
• Off-disk and archived data
difficult and costly to access
• Forced data silos
• Costly copies and moves of data
• Organizational blind-spots
Key Challenge #1: Volume
• Enough time to process raw data before you need it
• Data ingest from sensors, cameras, feeds, streaming, logs,
user interactions…
• Raw data structuring for
various ETL and DB models
Key Challenge #2: Velocity
• Costly adaption to new data types
• Saving account info, images, videos, url clicks, logs, and
transactional data – together?
• Inflexible data models
• Major surgery for future queries
• Most data is modeled for questions we know will be asked…
• Raw data value loss
Key Challenge #3: Variety
Apache Hadoop and its Ecosystem 101
7
8
A software framework that lets you find and
process data directly where it is stored and
where you only need to apply structure at
query time
What is Hadoop?
Hadoop Distributed File System (HDFS)
9
MapReduce: A Parallel, Scalable Processing Framework
HDFS
The Apache Hadoop Ecosystem – a Zoo!
MapReduce
HDFS
The Apache Hadoop Ecosystem – a Zoo!
MapReduce
Yarn
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
Yarn
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
Hive Pig
Mahout,
Oryx, …
Yarn
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
HBase
Hive Pig
Mahout,
Oryx, …
ZooKeeper
Yarn
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
HBase
Hive Pig
Mahout,
Oryx, …
ZooKeeper
Yarn
Oozie
Hue
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
HBase
Hive Pig
Mahout,
Oryx, …
ZooKeeperOozie
Impala Solr
Hue
Yarn
Spark
HDFS
The Apache Hadoop Ecosystem – a Zoo!
Flume
MapReduce
HBase
Hive Pig
Mahout,
Oryx, …
ZooKeeperOozie
Hue
Yarn
Impala Solr
Spark
Sentry
Inter-
active
SQL
Distributed File System
The Hadoop Ecosystem – Explained!
Stream / event-based data ingest
Batch Processing
KeyValue
Store
SQL
Proc.
Oriented
Query
Machine
Learning
Process MgmtWorkflow Mgmt
GUI
Resource and Workload Scheduler
Free-
Text
Search Real Time
Processing
Access Control
Example Enterprise Architecture: Use Hadoop as an EDH
The Enterprise Data Hub: One landing zone for all data
Some Typical Use cases
#1: Scale Data Processing at Low Cost
• Do what I usually do, but on a larger set of data
• Do my complex queries, but within a reasonable time
#2: Create an Active Archive
• Eliminate pre-mature data death
• Data as a (back-chargeable) service across the
organization
#3: Break Silos and Ask Bigger Questions
• What new insights can we achieve by combining
siloed data sets?
• What else can we find by asking questions over new
types of data?
There are a Lot of Use Cases…
• Predictive analysis (event prediction)
• Anomaly detection
• Customer profiling
• Recommendation engines
• Clickstream analysis
• Image processing
• Product and process improvements (feedback loops)
• Genome sequence processing
• Object matching
• Path optimization
• …
Do the Same – and More – to a Lower Cost
• Network and storage solution company
• Create pro-active support
• 600000 “phone home” logs/week
• SLA: 40% done within 18 hours
• Complex queries taking weeks or not even possible to run
• Expected future data growth of ~7TB a month!
• With Hadoop et al and Cloudera’s help
• Future proof scale
• Faster and more flexible analytic capabilities
• Correlation of disk latency with manufacturer (24 billion records)
• 64x query performance improvement (from weeks to hours)
• Pattern matching queries in the same infrastructure detecting bugs
(240 billion records)
• TCO freed up budget for other customer-focused projects
Increasing Revenue Through an Active Archive
• Global on-line retailer
• Need to correlate online/offline data
• 10+ year history records, 1,000’s product categories
• Large parts of data archived, complex to access
• With Hadoop et al and Cloudera’s help
• Unified long-term storage and processing over all data
• Machine learning and query without data moves
• Correlate all customer, product, and sales data ad hoc
• Lead to targeted marketing and increased revenue streams
Where To Learn More?
28
To Learn More…
1. Read some good stuff
• Order the Hadoop Operations book
(http://shop.oreilly.com/product/0636920025085.do) and/or the Definitive Guide
to Hadoop (http://shop.oreilly.com/product/0636920021773.do)
• Visit Cloudera’s blog: blog.cloudera.com/
2. Play on your own
• Cloudera QuickStart VM:
https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Dem
o+VM
• View the videos at gethue.com
3. Get help and training
• Join or send an email to: cdh-user@cloudera.org
• Visit the Cloudera dev center: cloudera.com/content/dev-center/en/home.html
• Get training: university.cloudera.com
4. Contact Cloudera
• eva@cloudera.com
• On-line contact form: http://cloudera.com/content/cloudera/en/about/contact-
us/contact-form.html
Q&A
32 ©2012 Cloudera, Inc.
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/hadoop-
usage-examples

More Related Content

More from C4Media

Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 

More from C4Media (20)

Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 

Recently uploaded

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

What Can Hadoop Do for You?

  • 1. 1 What Can Hadoop Do for You? @EvaAndreasson | Sr. Product Manager, Cloudera 2014 CONFIDENTIAL - RESTRICTED
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /hadoop-usage-examples http://www.infoq.com/presentati ons/nasa-big-data http://www.infoq.com/presentati ons/nasa-big-data
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Agenda • Today’s data challenges • Apache Hadoop and its ecosystem 101 • Common use cases • Where to learn more • Q&A 2
  • 6. • “Return On Bytes” • How to cost efficiently query, manage, and store TBs (or even PBs) of data? • Pre-mature data death • Off-disk and archived data difficult and costly to access • Forced data silos • Costly copies and moves of data • Organizational blind-spots Key Challenge #1: Volume
  • 7. • Enough time to process raw data before you need it • Data ingest from sensors, cameras, feeds, streaming, logs, user interactions… • Raw data structuring for various ETL and DB models Key Challenge #2: Velocity
  • 8. • Costly adaption to new data types • Saving account info, images, videos, url clicks, logs, and transactional data – together? • Inflexible data models • Major surgery for future queries • Most data is modeled for questions we know will be asked… • Raw data value loss Key Challenge #3: Variety
  • 9. Apache Hadoop and its Ecosystem 101 7
  • 10. 8 A software framework that lets you find and process data directly where it is stored and where you only need to apply structure at query time What is Hadoop?
  • 11. Hadoop Distributed File System (HDFS) 9
  • 12. MapReduce: A Parallel, Scalable Processing Framework
  • 13. HDFS The Apache Hadoop Ecosystem – a Zoo! MapReduce
  • 14. HDFS The Apache Hadoop Ecosystem – a Zoo! MapReduce Yarn
  • 15. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce Yarn
  • 16. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce Hive Pig Mahout, Oryx, … Yarn
  • 17. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce HBase Hive Pig Mahout, Oryx, … ZooKeeper Yarn
  • 18. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce HBase Hive Pig Mahout, Oryx, … ZooKeeper Yarn Oozie Hue
  • 19. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce HBase Hive Pig Mahout, Oryx, … ZooKeeperOozie Impala Solr Hue Yarn Spark
  • 20. HDFS The Apache Hadoop Ecosystem – a Zoo! Flume MapReduce HBase Hive Pig Mahout, Oryx, … ZooKeeperOozie Hue Yarn Impala Solr Spark Sentry
  • 21. Inter- active SQL Distributed File System The Hadoop Ecosystem – Explained! Stream / event-based data ingest Batch Processing KeyValue Store SQL Proc. Oriented Query Machine Learning Process MgmtWorkflow Mgmt GUI Resource and Workload Scheduler Free- Text Search Real Time Processing Access Control
  • 22. Example Enterprise Architecture: Use Hadoop as an EDH The Enterprise Data Hub: One landing zone for all data
  • 24. #1: Scale Data Processing at Low Cost • Do what I usually do, but on a larger set of data • Do my complex queries, but within a reasonable time
  • 25. #2: Create an Active Archive • Eliminate pre-mature data death • Data as a (back-chargeable) service across the organization
  • 26. #3: Break Silos and Ask Bigger Questions • What new insights can we achieve by combining siloed data sets? • What else can we find by asking questions over new types of data?
  • 27. There are a Lot of Use Cases… • Predictive analysis (event prediction) • Anomaly detection • Customer profiling • Recommendation engines • Clickstream analysis • Image processing • Product and process improvements (feedback loops) • Genome sequence processing • Object matching • Path optimization • …
  • 28. Do the Same – and More – to a Lower Cost • Network and storage solution company • Create pro-active support • 600000 “phone home” logs/week • SLA: 40% done within 18 hours • Complex queries taking weeks or not even possible to run • Expected future data growth of ~7TB a month! • With Hadoop et al and Cloudera’s help • Future proof scale • Faster and more flexible analytic capabilities • Correlation of disk latency with manufacturer (24 billion records) • 64x query performance improvement (from weeks to hours) • Pattern matching queries in the same infrastructure detecting bugs (240 billion records) • TCO freed up budget for other customer-focused projects
  • 29. Increasing Revenue Through an Active Archive • Global on-line retailer • Need to correlate online/offline data • 10+ year history records, 1,000’s product categories • Large parts of data archived, complex to access • With Hadoop et al and Cloudera’s help • Unified long-term storage and processing over all data • Machine learning and query without data moves • Correlate all customer, product, and sales data ad hoc • Lead to targeted marketing and increased revenue streams
  • 30. Where To Learn More? 28
  • 31. To Learn More… 1. Read some good stuff • Order the Hadoop Operations book (http://shop.oreilly.com/product/0636920025085.do) and/or the Definitive Guide to Hadoop (http://shop.oreilly.com/product/0636920021773.do) • Visit Cloudera’s blog: blog.cloudera.com/ 2. Play on your own • Cloudera QuickStart VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Dem o+VM • View the videos at gethue.com 3. Get help and training • Join or send an email to: cdh-user@cloudera.org • Visit the Cloudera dev center: cloudera.com/content/dev-center/en/home.html • Get training: university.cloudera.com 4. Contact Cloudera • eva@cloudera.com • On-line contact form: http://cloudera.com/content/cloudera/en/about/contact- us/contact-form.html
  • 32.
  • 33. Q&A
  • 35. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/hadoop- usage-examples