SlideShare a Scribd company logo
1 of 24
High Speed Smart Data Ingest
into Hadoop
Oleg Zhurakousky
Principal Architect @ Hortonworks
Twitter: z_oleg

© Hortonworks Inc. 2012
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/real-time-hadoop

InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Domain Problem
– Event Sourcing
– Streaming Sources
– Why Hadoop?

• Architecture
– Compute Mechanics
– Why things are slow – understand the problem
– Mechanics of the ingest
– Write >= Read | Else ;(

– Achieving Mechanical Sympathy –
– Suitcase pattern – optimize Storage, I/O and CPU

© Hortonworks Inc. 2012
Domain Problem

© Hortonworks Inc. 2012
Domain Events
• Domain Events represent the state of entities at a
given time when an important event occurred
• Every state change of interest to the business is
materialized in a Domain Event
Retail & Web
Point of Sale
Web Logs

Energy
Metering
Diagnostics

Telco
Network Data
Device Interactions
Financial
Services
Trades
Market Data
© Hortonworks Inc. 2012

Healthcare
Clinical Trials
Claims
Page 4
Event Sourcing
• All Events are sent to an Event Processor
• The Event Processor stores all events in an Event Log
• Many different Event Listeners can be added to the
Event Processor (or listen directly on the Event Log)

Web

Messaging

Batch

© Hortonworks Inc. 2012

Search

Event
Processor

Analytics

Apps

Page 5
Streaming Sources
• When dealing with Event data (log messages, call
detail records, other event / message type data) there
are some important observations:
– Source data is streamed (not a static copy), thus always in motion
– More then one streaming source
– Streaming (by definition) never stops
– Events are rather small in size (up to several hundred bytes), but
there is a lot of them
Hadoop was not design with that in mind
Telco Use Case
E

E

Geos

© Hortonworks Inc. 2012

E

– Tailing the syslog of a network device
– 200K events per second per streaming source
– Bursts that can double or triple event production
Why Hadoop for Event Sourcing?
• Event data is immutable, an event log is append-only
• Event Sourcing is not just event storage, it is event
processing (HDFS + MapReduce)
• While event data is generally small in size, event
volumes are HUGE across the enterprise
• For an Event Sourcing Solution to deliver its benefits
to the enterprise, it must be able to retain all history
• An Event Sourcing solution is ecosystem dependent –
it must be easily integrated into the Enterprise
technology portfolio (management, data, reporting)

© Hortonworks Inc. 2012

Page 7
Ingest Architecture

© Hortonworks Inc. 2012
Mechanics of Data Ingest & Why and which
things are slow and/or inefficient ?
• Demo

© Hortonworks Inc. 2012

Page 9
Mechanics of a compute
Atos, Portos, Aramis and the D’Artanian – of the compute

© Hortonworks Inc. 2012
Why things are slow?
• I/O | Network bound
• Inefficient use of storage
• CPU is not helping
• Memory is not helping

Imbalanced resource usage leads to limited growth potentials
and misleading assumptions about the capability of yoru
hardware

© Hortonworks Inc. 2012

Page 11
Mechanics of the ingest
• Multiple sources of data (readers)
• Write (Ingest) >= Read (Source) | Else
• “Else” is never good
– Slows the everything down or
– Fills accumulation buffer to the capacity which slows everything
down
– Renders parts or whole system unavailable which slows or brings
everything down
– Etc…
“Else” is just not good!

© Hortonworks Inc. 2012

Page 12
Strive for Mechanical Sympathy
• Mechanical Sympathy – ability to write software with
deep understanding of its impact or lack of impact on
the hardware its running on
• http://mechanical-sympathy.blogspot.com/
• Ensure that all available resources (hardware and
software) work in balance to help achieve the end
goal.

© Hortonworks Inc. 2012

Page 13
Achieving mechanical sympathy during the
ingest?
• Demo

© Hortonworks Inc. 2012

Page 14
The Suitcase Pattern
• Storing event data in uncompressed form not feasible
– Event Data just keeps on coming, and storage is finite
– The more history we can keep in Hadoop the better our trend
analytics will be
– Not just a Storage problem, but also Processing
– I/O is #1 impact on both Ingest & MapReduce performance

• Suitcase Pattern
– Before we travel, we take our clothes off the
rack and pack them (easier to store)
– We then unpack them when we arrive and put
them back on the rack (easier to process)
– Consider event data “traveling” over the network to Hadoop
– we want to compress it before it makes the trip, but in a way that
facilitates how we intend to process it once it arrives
© Hortonworks Inc. 2012
What’s next?
• How much data is available or could be available to us
during the ingest is lost after the ingest?
• How valuable is the data available to us during the
ingest after the ingest completed?
• How expensive would it be to retain (not loose) the
data available during the ingest?

© Hortonworks Inc. 2012

Page 16
Data available during the ingest?
• Record count
• Highest/Lowest record length
• Average record length
• Compression ratio
But with a little more work. . .
• Column parsing
– Unique values
– Unique values per column
– Access to values of each column independently from the record
– Relatively fast column-based searches, without indexing
– Value encoding
– Etc…
Imagine the implications on later data access. . .
© Hortonworks Inc. 2012

Page 17
How expensive is to do that extra work?
• Demo

© Hortonworks Inc. 2012

Page 18
Conclusion - 1
• Hadoop wasn’t optimized to process <1 kB records
– Mappers are event-driven consumers: they do not have a running
thread until a callback thread delivers a message
– Message delivery is an event that triggers the Mapper into action

• “Don’t wake me up unless it is important”
– In Hadoop, generally speaking,
several thousand bytes to
several hundred thousand bytes
is deemed important
– Buffering records during collection allows us to collect more data
before waking up the elephant
– Buffering records during collection also allows us to compress the
whole block of records as a single record to be sent over the
network to Hadoop – resulting in lower network and file I/O
– Buffering records during collection also helps us handle bursts
© Hortonworks Inc. 2012
Conclusion - 2
• Understand your SLAs, but think ahead
– While you may have accomplished your SLA for the ingest your
overall system may still be in the unbalanced state
– Don’t settle on “throw more hardware” until you can validate that
you have approached the limits of your existing hardware
– Strive for mechanical sympathy. Profile your software to
understand its impact on the hardware its running on.
– Don’t think about the ingest in isolation. Think how the ingested
data is going to be used.
– Analyze, the information that is available to you during the ingest
and devise a convenient mechanism for storing it. Your data
analysts will thank you for that.

© Hortonworks Inc. 2012

Page 20
Thank You!
Questions & Answers
Follow: @z_oleg

© Hortonworks Inc. 2012

Page 21
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/realtime-hadoop

More Related Content

More from C4Media

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 

More from C4Media (20)

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 

Recently uploaded

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Recently uploaded (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

High Speed Smart Data Ingest into Hadoop

  • 1. High Speed Smart Data Ingest into Hadoop Oleg Zhurakousky Principal Architect @ Hortonworks Twitter: z_oleg © Hortonworks Inc. 2012
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /real-time-hadoop InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Agenda • Domain Problem – Event Sourcing – Streaming Sources – Why Hadoop? • Architecture – Compute Mechanics – Why things are slow – understand the problem – Mechanics of the ingest – Write >= Read | Else ;( – Achieving Mechanical Sympathy – – Suitcase pattern – optimize Storage, I/O and CPU © Hortonworks Inc. 2012
  • 6. Domain Events • Domain Events represent the state of entities at a given time when an important event occurred • Every state change of interest to the business is materialized in a Domain Event Retail & Web Point of Sale Web Logs Energy Metering Diagnostics Telco Network Data Device Interactions Financial Services Trades Market Data © Hortonworks Inc. 2012 Healthcare Clinical Trials Claims Page 4
  • 7. Event Sourcing • All Events are sent to an Event Processor • The Event Processor stores all events in an Event Log • Many different Event Listeners can be added to the Event Processor (or listen directly on the Event Log) Web Messaging Batch © Hortonworks Inc. 2012 Search Event Processor Analytics Apps Page 5
  • 8. Streaming Sources • When dealing with Event data (log messages, call detail records, other event / message type data) there are some important observations: – Source data is streamed (not a static copy), thus always in motion – More then one streaming source – Streaming (by definition) never stops – Events are rather small in size (up to several hundred bytes), but there is a lot of them Hadoop was not design with that in mind Telco Use Case E E Geos © Hortonworks Inc. 2012 E – Tailing the syslog of a network device – 200K events per second per streaming source – Bursts that can double or triple event production
  • 9. Why Hadoop for Event Sourcing? • Event data is immutable, an event log is append-only • Event Sourcing is not just event storage, it is event processing (HDFS + MapReduce) • While event data is generally small in size, event volumes are HUGE across the enterprise • For an Event Sourcing Solution to deliver its benefits to the enterprise, it must be able to retain all history • An Event Sourcing solution is ecosystem dependent – it must be easily integrated into the Enterprise technology portfolio (management, data, reporting) © Hortonworks Inc. 2012 Page 7
  • 11. Mechanics of Data Ingest & Why and which things are slow and/or inefficient ? • Demo © Hortonworks Inc. 2012 Page 9
  • 12. Mechanics of a compute Atos, Portos, Aramis and the D’Artanian – of the compute © Hortonworks Inc. 2012
  • 13. Why things are slow? • I/O | Network bound • Inefficient use of storage • CPU is not helping • Memory is not helping Imbalanced resource usage leads to limited growth potentials and misleading assumptions about the capability of yoru hardware © Hortonworks Inc. 2012 Page 11
  • 14. Mechanics of the ingest • Multiple sources of data (readers) • Write (Ingest) >= Read (Source) | Else • “Else” is never good – Slows the everything down or – Fills accumulation buffer to the capacity which slows everything down – Renders parts or whole system unavailable which slows or brings everything down – Etc… “Else” is just not good! © Hortonworks Inc. 2012 Page 12
  • 15. Strive for Mechanical Sympathy • Mechanical Sympathy – ability to write software with deep understanding of its impact or lack of impact on the hardware its running on • http://mechanical-sympathy.blogspot.com/ • Ensure that all available resources (hardware and software) work in balance to help achieve the end goal. © Hortonworks Inc. 2012 Page 13
  • 16. Achieving mechanical sympathy during the ingest? • Demo © Hortonworks Inc. 2012 Page 14
  • 17. The Suitcase Pattern • Storing event data in uncompressed form not feasible – Event Data just keeps on coming, and storage is finite – The more history we can keep in Hadoop the better our trend analytics will be – Not just a Storage problem, but also Processing – I/O is #1 impact on both Ingest & MapReduce performance • Suitcase Pattern – Before we travel, we take our clothes off the rack and pack them (easier to store) – We then unpack them when we arrive and put them back on the rack (easier to process) – Consider event data “traveling” over the network to Hadoop – we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives © Hortonworks Inc. 2012
  • 18. What’s next? • How much data is available or could be available to us during the ingest is lost after the ingest? • How valuable is the data available to us during the ingest after the ingest completed? • How expensive would it be to retain (not loose) the data available during the ingest? © Hortonworks Inc. 2012 Page 16
  • 19. Data available during the ingest? • Record count • Highest/Lowest record length • Average record length • Compression ratio But with a little more work. . . • Column parsing – Unique values – Unique values per column – Access to values of each column independently from the record – Relatively fast column-based searches, without indexing – Value encoding – Etc… Imagine the implications on later data access. . . © Hortonworks Inc. 2012 Page 17
  • 20. How expensive is to do that extra work? • Demo © Hortonworks Inc. 2012 Page 18
  • 21. Conclusion - 1 • Hadoop wasn’t optimized to process <1 kB records – Mappers are event-driven consumers: they do not have a running thread until a callback thread delivers a message – Message delivery is an event that triggers the Mapper into action • “Don’t wake me up unless it is important” – In Hadoop, generally speaking, several thousand bytes to several hundred thousand bytes is deemed important – Buffering records during collection allows us to collect more data before waking up the elephant – Buffering records during collection also allows us to compress the whole block of records as a single record to be sent over the network to Hadoop – resulting in lower network and file I/O – Buffering records during collection also helps us handle bursts © Hortonworks Inc. 2012
  • 22. Conclusion - 2 • Understand your SLAs, but think ahead – While you may have accomplished your SLA for the ingest your overall system may still be in the unbalanced state – Don’t settle on “throw more hardware” until you can validate that you have approached the limits of your existing hardware – Strive for mechanical sympathy. Profile your software to understand its impact on the hardware its running on. – Don’t think about the ingest in isolation. Think how the ingested data is going to be used. – Analyze, the information that is available to you during the ingest and devise a convenient mechanism for storing it. Your data analysts will thank you for that. © Hortonworks Inc. 2012 Page 20
  • 23. Thank You! Questions & Answers Follow: @z_oleg © Hortonworks Inc. 2012 Page 21
  • 24. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/realtime-hadoop