Soft-Shake 2013 : Enabling Realtime Queries to End Users


Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major drawbacks: query latency and data freshness.

At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style, from simple key/value lookups to complex queries.
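
As a rough illustration of the random-access style described above, the following Python sketch maps HTTP verbs and a URI path onto key/value operations. It is a toy under stated assumptions: the `KeyValueApi` class, its in-memory dict, and the status codes chosen are hypothetical and not taken from the talk, and POST is treated like PUT for simplicity.

```python
from urllib.parse import urlparse


class KeyValueApi:
    """Toy in-memory store addressed by REST-style URIs (hypothetical)."""

    def __init__(self):
        self._store = {}

    def handle(self, method, uri, body=None):
        # The URI path acts as the key, e.g. /api/v2/domain/identifier
        key = urlparse(uri).path
        if method == "GET":
            if key in self._store:
                return 200, self._store[key]
            return 404, None
        if method in ("PUT", "POST"):
            # Simplification: POST and PUT both write the value at the key
            created = key not in self._store
            self._store[key] = body
            return (201 if created else 200), body
        if method == "DELETE":
            if key in self._store:
                del self._store[key]
                return 204, None
            return 404, None
        # Any other verb is not supported by this sketch
        return 405, None


api = KeyValueApi()
uri = "http://company/api/v2/domain/identifier"
print(api.handle("PUT", uri, {"status": "active"}))
print(api.handle("GET", uri))
```

A batch-only platform cannot answer the `GET` above synchronously, which is the gap the talk sets out to fill.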

Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.

There is a lot of traction in this area today, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets. After discussing how common Hadoop platform deployments have evolved, a hybrid approach called the lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact.
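
To make the hybrid idea more concrete, here is a minimal Python sketch of a query that combines a precomputed batch view with the events that arrived after the batch cutoff. It only illustrates the general principle; the `HybridView` class, its method names, and the use of per-key counts are assumptions made for this example and do not come from the talk.

```python
from collections import Counter


class HybridView:
    """Toy merge of a batch view and a realtime view (hypothetical)."""

    def __init__(self):
        self.batch_view = Counter()  # counts precomputed by the batch layer
        self.batch_cutoff = 0        # events with timestamp < cutoff are in the batch view
        self.recent_events = []      # (timestamp, key) pairs kept by the realtime layer

    def ingest(self, timestamp, key):
        # The realtime layer sees every event as it arrives
        self.recent_events.append((timestamp, key))

    def publish_batch(self, counts, cutoff):
        # A new batch is ready: swap the view and drop events it already covers
        self.batch_view = Counter(counts)
        self.batch_cutoff = cutoff
        self.recent_events = [(ts, k) for ts, k in self.recent_events if ts >= cutoff]

    def query(self, key):
        # Batch result plus complementary data from the realtime layer
        recent = sum(1 for ts, k in self.recent_events
                     if k == key and ts >= self.batch_cutoff)
        return self.batch_view[key] + recent


view = HybridView()
for ts in (1, 2, 5):
    view.ingest(ts, "example.com")
view.publish_batch({"example.com": 2}, cutoff=3)
print(view.query("example.com"))
```

The answer stays the same before and after a batch is published, because the realtime layer only contributes events the batch has not yet covered.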

Published in: Technology, Business

    1. Enabling Real-time Queries to End Users
         Benoit Perroud
         SoftShake, Geneva, October 24, 2013
    2. About Me
         • Benoit Perroud
         • Software Engineer @ Verisign
         • Leading Hadoop Infrastructure Team
         • Apache Committer
         • @killerwhile
         Verisign Public
    3. Agenda
         • What’s going on
         • Data lifecycle
         • Batch and Realtime
         • Hadoop Deployments
         • Next Steps
    4. What’s going on
         • Mainframes are obsolete, replaced by clusters of commodity hardware
         • TenG (10Gb/s) links are the new standard
         • RESTful APIs are everywhere
         • Everybody wants to visit Paxos Island
         • Firehoses do not only carry water
         • Asynchronous non-blocking functional programming is taught at primary school
         • NoSQL is the new way to store data at scale
         • API management startups are rising (and raising)
         • Hadoop keywords boost your LinkedIn profile by 2000%
         • Public clouds are responsible for more than 50% of the global Internet traffic
         • … and counting …
    5. A Possible Deployment
         [Diagram]
         Source:
         Note: the diagram is dated 2009; it is probably partially or even completely outdated today
    6. Data Lifecycle
    7. Data Lifecycle
         [Diagram labels: Producers, Data Ingestion, Data Storage, Data Processing, Data Retrieval, Consumers]
    8. Data Ingestion
         • Copying internal and external sources of data into the cluster
         • Pre-processing: data cleanup, proper format, …
         • Time vs. block-size tradeoff
         • Targeted property: Availability
         [Diagram labels: Source of Data, Ingesting the flow, Local buffering, Uploading to HDFS, HDFS]
    9. Data Storage
         • Hadoop HDFS is a well-established distributed file system
         • The file system is the central component of every data-driven approach
         • Space vs. network tradeoff
         • Targeted property: Reliability
         [Diagram labels: File1, Upload to HDFS, DataNode1, DataNode2, DataNode3, DataNode4]
    10. Data Processing
         • Hadoop MapReduce
         • Higher level tools (Hive, Pig, Impala) help
         • Data catalog needs to be maintained
         • Targeted property: parallelism
    11. Data Retrieval
         • Only way to make use of the data
         • Business-driven need
         • At scale, data needs to be stored the way it is queried
         • DPI: Data Programmable Interfaces
         • Targeted property: user friendliness, reliability
    12. Batch and Realtime
    13. Batch Processing
         [Timeline diagram, t1 to t5. Labels: Batch 1 starts processing; Batch 1 ready to be served; Query data from batch 1; Batch 2 starts processing; Batch 2 ready to be served; Query data from batch 2; Batch 3 starts processing; Data gap]
    14. Batch Processing in Detail
         [Timeline diagram. Labels: New batch granularity period; Batch with data from yesterday; Allow some time for data to finish uploading; Processing time; Load results in a data store; Notify the retrieval system that a new batch is ready to be served; Query data from the day before yesterday?]
    15. Realtime Query
         • Interactive query
         • REST-like request/response queries
         • With SLA
         And
         • Query the latest version of the data
         • Latest means n seconds ago, with n predictable
    16. Hadoop Deployments
    17. Naïve Hadoop Deployment
         [Diagram labels: Gateway (hdfs dfs -put, mapred job …jar, hdfs dfs -get), NameNode, JobTracker, DataNodes, Processing]
    18. Industry Hadoop Deployment
         [Diagram labels: Gateway, Data In GW, Data Out GW, NameNode, JobTracker, DataNodes, Processing, Monitoring, Research, Data Science, Metadata Store]
    19. Realtime Hadoop Deployment
         [Diagram labels: Gateway, Data In GW, NameNode, JobTracker, DataNodes, Processing, RT processing, RT Data Out GW]
    20. Hybrid Approach
         [Timeline diagram, t1 to t4. Labels: Batch 1 starts processing; Batch 1 ready to be served; Complementary data for batch 1; Batch 2 starts processing; Batch 2 ready to be served; Complementary data for batch 2]
    21. Realtime Search with Hadoop
         [Diagram labels: Gateway, Data In GW, NameNode, JobTracker, DataNodes, Generate Indexes, Coordinator, Update indexes, RT Data Out GW]
    22. Next Steps
    23. Hadoop Ecosystem … is moving … really fast
         • Interactive Queries: Cloudera Impala, Apache Drill, Tez, …
         • Search: SolrCloud, Elasticsearch, Cloudera Search
         • Hybrid layer: Twitter Summingbird
         • … and counting…
    24. Thanks for the attention! Follow @killerwhile
         “Copyright © 2013 VeriSign, Inc. All rights reserved. The VERISIGN word mark, the Verisign logo, and other Verisign trademarks, service marks, and designs that may appear herein are registered or unregistered trademarks or service marks of VeriSign, Inc., and its subsidiaries in the United States and foreign countries. All other trademarks, service marks, and designs are property of their respective owners. Verisign has made efforts to ensure the accuracy and completeness of the information in this document. However, Verisign makes no warranties of any kind (whether express, implied or statutory) with respect to the information contained herein. Verisign assumes no liability to any party for any loss or damage (whether direct or indirect) caused by any errors, omissions, or statements of any kind contained in this document. Further, Verisign assumes no liability arising from the application or use of the products, services, or materials described or referenced herein and specifically disclaims any representation that any such products, services, or materials do not infringe upon any existing or future intellectual property rights.”
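
The batch timeline on slides 13 and 14 can be turned into a small worked calculation. The Python sketch below estimates how old the freshest queryable record is in a purely batch-driven system. The function names and the example durations are assumptions chosen for illustration; the slides name the steps (upload grace period, processing, loading) but give no figures.

```python
def freshness_bounds(period_h, grace_h, processing_h, load_h):
    """Return (best, worst) age in hours of the newest record that can be queried.

    A batch covering one period is ready to be served only after the grace
    period for late uploads, the processing time, and the load into the store.
    """
    delay = grace_h + processing_h + load_h
    # Best case: right after a batch goes live. Worst case: just before the next one.
    return delay, period_h + delay


def newest_served_batch(now_h, period_h, delay_h):
    """Index of the newest batch being served at time now_h, or None if none is ready.

    Batch k covers [k * period_h, (k + 1) * period_h) and is ready at
    (k + 1) * period_h + delay_h.
    """
    index = int((now_h - delay_h) // period_h) - 1
    return index if index >= 0 else None


# Assumed figures: daily batches, 1 h grace, 3 h processing, 1 h load
print(freshness_bounds(24, 1, 3, 1))
print(newest_served_batch(50, 24, 5))
```

With these assumed figures the served data is between 5 and 29 hours old, and at hour 50 (02:00 on the third day) queries are still answered from batch 0, that is, from the day before yesterday, as slide 14 suggests.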