High Speed Smart Data Ingest into Hadoop


Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/18tBPec.

Oleg Zhurakousky discusses architectural tradeoffs and alternative implementations of real-time high speed data ingest into Hadoop. Filmed at qconnewyork.com.

Oleg Zhurakousky is Principal Architect with Hortonworks. Before that, Oleg was part of the SpringSource/VMWare team where he was a core engineer working on Spring Integration framework, leading Spring Integration Scala DSL. As a speaker Oleg presented seminars at dozens of conferences worldwide (i.e.SpringOne, JavaOne, Java Zone, Jazoon, Java2Days, Scala Days, Uberconf, and others)."

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

High Speed Smart Data Ingest into Hadoop

  1. 1. High Speed Smart Data Ingest into Hadoop Oleg Zhurakousky Principal Architect @ Hortonworks Twitter: z_oleg © Hortonworks Inc. 2012
  2. 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /real-time-hadoop InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Agenda • Domain Problem – Event Sourcing – Streaming Sources – Why Hadoop? • Architecture – Compute Mechanics – Why things are slow – understand the problem – Mechanics of the ingest – Write >= Read | Else ;( – Achieving Mechanical Sympathy – – Suitcase pattern – optimize Storage, I/O and CPU © Hortonworks Inc. 2012
  5. 5. Domain Problem © Hortonworks Inc. 2012
  6. 6. Domain Events • Domain Events represent the state of entities at a given time when an important event occurred • Every state change of interest to the business is materialized in a Domain Event Retail & Web Point of Sale Web Logs Energy Metering Diagnostics Telco Network Data Device Interactions Financial Services Trades Market Data © Hortonworks Inc. 2012 Healthcare Clinical Trials Claims Page 4
  7. 7. Event Sourcing • All Events are sent to an Event Processor • The Event Processor stores all events in an Event Log • Many different Event Listeners can be added to the Event Processor (or listen directly on the Event Log) Web Messaging Batch © Hortonworks Inc. 2012 Search Event Processor Analytics Apps Page 5
  8. 8. Streaming Sources • When dealing with Event data (log messages, call detail records, other event / message type data) there are some important observations: – Source data is streamed (not a static copy), thus always in motion – More then one streaming source – Streaming (by definition) never stops – Events are rather small in size (up to several hundred bytes), but there is a lot of them Hadoop was not design with that in mind Telco Use Case E E Geos © Hortonworks Inc. 2012 E – Tailing the syslog of a network device – 200K events per second per streaming source – Bursts that can double or triple event production
  9. 9. Why Hadoop for Event Sourcing? • Event data is immutable, an event log is append-only • Event Sourcing is not just event storage, it is event processing (HDFS + MapReduce) • While event data is generally small in size, event volumes are HUGE across the enterprise • For an Event Sourcing Solution to deliver its benefits to the enterprise, it must be able to retain all history • An Event Sourcing solution is ecosystem dependent – it must be easily integrated into the Enterprise technology portfolio (management, data, reporting) © Hortonworks Inc. 2012 Page 7
  10. 10. Ingest Architecture © Hortonworks Inc. 2012
  11. 11. Mechanics of Data Ingest & Why and which things are slow and/or inefficient ? • Demo © Hortonworks Inc. 2012 Page 9
  12. 12. Mechanics of a compute Atos, Portos, Aramis and the D’Artanian – of the compute © Hortonworks Inc. 2012
  13. 13. Why things are slow? • I/O | Network bound • Inefficient use of storage • CPU is not helping • Memory is not helping Imbalanced resource usage leads to limited growth potentials and misleading assumptions about the capability of yoru hardware © Hortonworks Inc. 2012 Page 11
  14. 14. Mechanics of the ingest • Multiple sources of data (readers) • Write (Ingest) >= Read (Source) | Else • “Else” is never good – Slows the everything down or – Fills accumulation buffer to the capacity which slows everything down – Renders parts or whole system unavailable which slows or brings everything down – Etc… “Else” is just not good! © Hortonworks Inc. 2012 Page 12
  15. 15. Strive for Mechanical Sympathy • Mechanical Sympathy – ability to write software with deep understanding of its impact or lack of impact on the hardware its running on • http://mechanical-sympathy.blogspot.com/ • Ensure that all available resources (hardware and software) work in balance to help achieve the end goal. © Hortonworks Inc. 2012 Page 13
  16. 16. Achieving mechanical sympathy during the ingest? • Demo © Hortonworks Inc. 2012 Page 14
  17. 17. The Suitcase Pattern • Storing event data in uncompressed form not feasible – Event Data just keeps on coming, and storage is finite – The more history we can keep in Hadoop the better our trend analytics will be – Not just a Storage problem, but also Processing – I/O is #1 impact on both Ingest & MapReduce performance • Suitcase Pattern – Before we travel, we take our clothes off the rack and pack them (easier to store) – We then unpack them when we arrive and put them back on the rack (easier to process) – Consider event data “traveling” over the network to Hadoop – we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives © Hortonworks Inc. 2012
  18. 18. What’s next? • How much data is available or could be available to us during the ingest is lost after the ingest? • How valuable is the data available to us during the ingest after the ingest completed? • How expensive would it be to retain (not loose) the data available during the ingest? © Hortonworks Inc. 2012 Page 16
  19. 19. Data available during the ingest? • Record count • Highest/Lowest record length • Average record length • Compression ratio But with a little more work. . . • Column parsing – Unique values – Unique values per column – Access to values of each column independently from the record – Relatively fast column-based searches, without indexing – Value encoding – Etc… Imagine the implications on later data access. . . © Hortonworks Inc. 2012 Page 17
  20. 20. How expensive is to do that extra work? • Demo © Hortonworks Inc. 2012 Page 18
  21. 21. Conclusion - 1 • Hadoop wasn’t optimized to process <1 kB records – Mappers are event-driven consumers: they do not have a running thread until a callback thread delivers a message – Message delivery is an event that triggers the Mapper into action • “Don’t wake me up unless it is important” – In Hadoop, generally speaking, several thousand bytes to several hundred thousand bytes is deemed important – Buffering records during collection allows us to collect more data before waking up the elephant – Buffering records during collection also allows us to compress the whole block of records as a single record to be sent over the network to Hadoop – resulting in lower network and file I/O – Buffering records during collection also helps us handle bursts © Hortonworks Inc. 2012
  22. 22. Conclusion - 2 • Understand your SLAs, but think ahead – While you may have accomplished your SLA for the ingest your overall system may still be in the unbalanced state – Don’t settle on “throw more hardware” until you can validate that you have approached the limits of your existing hardware – Strive for mechanical sympathy. Profile your software to understand its impact on the hardware its running on. – Don’t think about the ingest in isolation. Think how the ingested data is going to be used. – Analyze, the information that is available to you during the ingest and devise a convenient mechanism for storing it. Your data analysts will thank you for that. © Hortonworks Inc. 2012 Page 20
  23. 23. Thank You! Questions & Answers Follow: @z_oleg © Hortonworks Inc. 2012 Page 21
  24. 24. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/realtime-hadoop