© Hortonworks Inc. 2012
High Speed Smart Data Ingest
into Hadoop
Oleg Zhurakousky
Principal Architect @ Hortonworks
Twitte...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Agenda
• Domain Problem
–Event Sourcing
–Streaming Sources
–Why Hadoop?
• A...
© Hortonworks Inc. 2012
Domain Problem
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Domain Events
• Domain Events represent the state of entities at a
given ti...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Event Sourcing
• All Events are sent to an Event Processor
• The Event Proc...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Streaming Sources
• When dealing with Event data (log messages, call
detail...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Why Hadoop for Event Sourcing?
• Event data is immutable, an event log is a...
© Hortonworks Inc. 2012
Ingest Mechanics
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Mechanics of Data Ingest & Why and which
things are slow and/or inefficient...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Mechanics of a compute
Atos, Portos, Aramis and the D’Artanian – of the com...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Why things are slow?
• I/O | Network bound
• Inefficient use of storage
• C...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Mechanics of the ingest
• Multiple sources of data (readers)
• Write (Inges...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Strive for Mechanical Sympathy
• Mechanical Sympathy – ability to write sof...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Achieving mechanical sympathy during the
ingest?
• Demo
Page 14
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
The Suitcase Pattern
• Storing event data in uncompressed form not feasible...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
What’s next?
• How much data is available or could be available to us
durin...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Data available during the ingest?
• Record count
• Highest/Lowest record le...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
How expensive is to do that extra work?
• Demo
Page 18
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Conclusion - 1
• Hadoop wasn’t optimized to process <1 kB records
–Mappers ...
© Hortonworks Inc. 2012© Hortonworks Inc. 2012
Conclusion - 2
• Understand your SLAs, but think ahead
–While you may have ...
© Hortonworks Inc. 2012
Thank You!
Questions & Answers
Follow: @z_oleg, @tmccuch, @hortonworks
Page 21
Upcoming SlideShare
Loading in …5
×

High Speed Smart Data Ingest

1,070
-1

Published on

This talk will explore the area of real-time data ingest into Hadoop and present the architectural trade-offs as well as demonstrate alternative implementations that strike the appropriate balance across the following common challenges: * Decentralized writes (multiple data centers and collectors) * Continuous Availability, High Reliability * No loss of data * Elasticity of introducing more writers * Bursts in Speed per syslog emitter * Continuous, real-time collection * Flexible Write Targets (local FS, HDFS etc.)

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,070
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
20
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Multiple sources and source itself can help with the ingest by preparing data before the actual ingest
  • High Speed Smart Data Ingest

    1. 1. © Hortonworks Inc. 2012 High Speed Smart Data Ingest into Hadoop Oleg Zhurakousky Principal Architect @ Hortonworks Twitter: z_oleg Tom McCuch Solution Engineer @ Hortonworks Twitter: tmccuch
    2. 2. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Agenda • Domain Problem –Event Sourcing –Streaming Sources –Why Hadoop? • Architecture –Compute Mechanics –Why things are slow – understand the problem –Mechanics of the ingest – Write >= Read | Else ;( –Achieving Mechanical Sympathy – – Suitcase pattern – optimize Storage, I/O and CPU
    3. 3. © Hortonworks Inc. 2012 Domain Problem
    4. 4. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Domain Events • Domain Events represent the state of entities at a given time when an important event occurred • Every state change of interest to the business is materialized in a Domain Event Page 4 Retail & Web Point of Sale Web Logs Telco Network Data Device Interactions Financial Services Trades Market Data Energy Metering Diagnostics Healthcare Clinical Trials Health Records
    5. 5. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Event Sourcing • All Events are sent to an Event Processor • The Event Processor stores all events in an Event Log • Many different Event Listeners can be added to the Event Processor (or listen directly on the Event Log) Event Processor Web Messaging Search Batch Analytics Page 5 Apps
    6. 6. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Streaming Sources • When dealing with Event data (log messages, call detail records, other event / message type data) there are some important observations: –Source data is streamed (not a static copy), thus always in motion –More then one streaming source –Streaming (by definition) never stops –Events are rather small in size (up to several hundred bytes), but there is a lot of them Hadoop was not design with that in mind E E E Geos Telco Use Case – Tailing the syslog of a network device – 200K events per second per streaming source – Bursts that can double or triple event production
    7. 7. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Why Hadoop for Event Sourcing? • Event data is immutable, an event log is append-only • Event Sourcing is not just event storage, it is event processing (HDFS + MapReduce) • While event data is generally small in size, event volumes are HUGE across the enterprise • For an Event Sourcing Solution to deliver its benefits to the enterprise, it must be able to retain all history • An Event Sourcing solution is ecosystem dependent – it must be easily integrated into the Enterprise technology portfolio (management, data, reporting) Page 7
    8. 8. © Hortonworks Inc. 2012 Ingest Mechanics
    9. 9. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Mechanics of Data Ingest & Why and which things are slow and/or inefficient ? • Demo Page 9
    10. 10. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Mechanics of a compute Atos, Portos, Aramis and the D’Artanian – of the compute
    11. 11. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Why things are slow? • I/O | Network bound • Inefficient use of storage • CPU is not helping • Memory is not helping Imbalanced resource usage leads to limited growth potentials and misleading assumptions about the capability of yoru hardware Page 11
    12. 12. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Mechanics of the ingest • Multiple sources of data (readers) • Write (Ingest) >= Read (Source) | Else • “Else” is never good –Slows the everything down or –Fills accumulation buffer to the capacity which slows everything down –Renders parts or whole system unavailable which slows or brings everything down –Etc… “Else” is just not good! Page 12
    13. 13. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Strive for Mechanical Sympathy • Mechanical Sympathy – ability to write software with deep understanding of its impact or lack of impact on the hardware its running on • http://mechanical-sympathy.blogspot.com/ • Ensure that all available resources (hardware and software) work in balance to help achieve the end goal. Page 13
    14. 14. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Achieving mechanical sympathy during the ingest? • Demo Page 14
    15. 15. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 The Suitcase Pattern • Storing event data in uncompressed form not feasible –Event Data just keeps on coming, and storage is finite –The more history we can keep in Hadoop the better our trend analytics will be –Not just a Storage problem, but also Processing – I/O is #1 impact on both Ingest & MapReduce performance –Consider event data “traveling” over the network to Hadoop – we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives • Suitcase Pattern –Before we travel, we take our clothes off the rack and pack them (easier to store) –We then unpack them when we arrive and put them back on the rack (easier to process)
    16. 16. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 What’s next? • How much data is available or could be available to us during the ingest is lost after the ingest? • How valuable is the data available to us during the ingest after the ingest completed? • How expensive would it be to retain (not lose) the data available during the ingest? Page 16
    17. 17. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Data available during the ingest? • Record count • Highest/Lowest record length • Average record length • Compression ratio But with a little more work. . . • Column parsing –Unique values –Unique values per column –Access to values of each column independently from the record –Relatively fast column-based searches, without indexing –Value encoding –Etc… Imagine the implications on later data access. . . Page 17
    18. 18. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 How expensive is to do that extra work? • Demo Page 18
    19. 19. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Conclusion - 1 • Hadoop wasn’t optimized to process <1 kB records –Mappers are event-driven consumers: they do not have a running thread until a callback thread delivers a message –Message delivery is an event that triggers the Mapper into action • “Don’t wake me up unless it is important” –In Hadoop, generally speaking, several thousand bytes to several hundred thousand bytes is deemed important –Buffering records during collection allows us to collect more data before waking up the elephant –Buffering records during collection also allows us to compress the whole block of records as a single record to be sent over the network to Hadoop – resulting in lower network and file I/O –Buffering records during collection also helps us handle bursts
    20. 20. © Hortonworks Inc. 2012© Hortonworks Inc. 2012 Conclusion - 2 • Understand your SLAs, but think ahead –While you may have accomplished your SLA for the ingest your overall system may still be in the unbalanced state –Don’t settle on “throw more hardware” until you can validate that you have approached the limits of your existing hardware –Strive for mechanical sympathy. Profile your software to understand its impact on the hardware its running on. –Don’t think about the ingest in isolation. Think how the ingested data is going to be used. –Analyze, the information that is available to you during the ingest and devise a convenient mechanism for storing it. Your data analysts will thank you for that. Page 20
    21. 21. © Hortonworks Inc. 2012 Thank You! Questions & Answers Follow: @z_oleg, @tmccuch, @hortonworks Page 21
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×