Aug 2012 HUG: Random vs. Sequential



Hadoop helps make big data tasks feasible by providing two important services: HDFS introduces controlled redundancy to prevent data loss, and the Map/Reduce framework encourages algorithm designers to read and write data sequentially, optimizing throughput and resource utilization. In this talk we dive into the details of how sequential access affects performance. In the first part of the talk, we show that sequential access matters not only for hard drives but for all storage components used in today's computers. Based on this observation, we then discuss statistical techniques for improving the performance of common analytical tasks. In particular, we show how randomness can be used strategically to improve speed and possibly accuracy.
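The claim that sequential access matters for all storage components can be probed with a small experiment that is not part of the talk: read the same file once in block order and once in shuffled block order. Note this is only an illustrative sketch; on a small, freshly written file the OS page cache hides most of the gap, and the effect becomes dramatic only on data larger than RAM, especially on spinning disks.

```python
import os
import random
import tempfile
import time

BLOCK = 4096
BLOCKS = 2048  # 8 MiB test file; real experiments need data larger than RAM

def read_blocks(path, offsets):
    """Read one BLOCK at each offset; return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total

def bench():
    """Time a sequential pass vs. a random-order pass over the same file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(BLOCK * BLOCKS))
        sequential = [i * BLOCK for i in range(BLOCKS)]
        shuffled = sequential[:]
        random.shuffle(shuffled)  # same blocks, random visiting order
        t0 = time.perf_counter()
        n_seq = read_blocks(path, sequential)
        t_seq = time.perf_counter() - t0
        t0 = time.perf_counter()
        n_rnd = read_blocks(path, shuffled)
        t_rnd = time.perf_counter() - t0
        return t_seq, t_rnd, n_seq, n_rnd
```

Both passes read exactly the same bytes; only the access pattern differs, which isolates the cost of seeking from the cost of transfer.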

Presenter: Ulrich Rückert, Datameer

Published in: Technology, Education


  1. Random vs. Sequential. Ulrich Rückert. © 2012 Datameer, Inc. All rights reserved.
  2. Motivation. The slide shows the first page of the position paper "Nobody ever got fired for using Hadoop on a cluster" by Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O'Shea, and Andrew Douglas (Microsoft Research, Cambridge):

     Abstract. The norm for data analytics is now to run them on commodity clusters with MapReduce-like abstractions. One only needs to read the popular blogs to see the evidence of this. We believe that we could now say that "nobody ever got fired for using Hadoop on a cluster"! We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask if this is the right path for general purpose data analytics? Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s GB of DRAM. We therefore ask, should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

     Categories and Subject Descriptors: C.2.4 [Distributed Systems]: Distributed applications. General Terms: Performance. Keywords: MapReduce, Hadoop, scalability, analytics, big data.

     1. Introduction. We are all familiar with the saying "nobody ever got fired for buying IBM". In the data analytics world, are we at the point where "nobody ever got fired for using Hadoop on a cluster"? We understand that there are many jobs that process exabytes, petabytes, or multi-terabytes of data that need infrastructures like MapReduce [1, 6] (Hadoop) running over [...] at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB, and 90% of jobs on a Facebook cluster have input sizes under 100 GB. We also speculate that many algorithms are complex to scale out and therefore expensive in terms of human engineering. Many important analytics jobs, for example iterative machine-learning algorithms, do not map trivially to MapReduce. Typically the algorithm is mapped into a series of MapReduce "rounds", and to achieve scalability it often has to be approximated, thus sacrificing accuracy and/or increasing the number of MapReduce rounds. Finally, we observe that DRAM is at an inflection point. The latest 16 GB DIMMs cost around $220, meaning 192 GB can be put on a server for less than half the price of the server. This has significant cost implications: if increasing the server memory allowed 33% fewer servers to be provisioned, it would reduce capital expenditure as well as operating expenditure. Further, Moore's Law benefits memory as well, so every eighteen months the price drops by a factor of two. Next year we would expect to be able to buy 384 GB for the same price as 192 GB today. In this position paper we take the stand that we should not automatically scale up all data analytics workloads by using clusters running Hadoop. In many cases it may be easier and cheaper to scale up using a single server and adding more memory. We will provide evidence of this using some of our own experiences, and we also note that this has been recognized for a long time by the in-memory database community. However, in the MapReduce/Hadoop data analytics community this message seems to have been drowned out. To support this we consider (i) memory costs, (ii) job input sizes, (iii) implementation complexity, and (iv) performance. The first two are based on empirical data. [...]
  3. Motivation
  4. Agenda
     • Why sequential access?
     • Counting and Aggregation
     • How to reduce runtime by 99%
     • Algorithm Design
     • Conclusion
  5. What is Hadoop About?
     • Failure tolerance
       • Solution: redundant storage
     • Resource utilization
       • Solution: parallelism, sequential data access
     • But these problems are universal.
  6. Sequential Access
  7. This presentation is only partially previewed.
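The DIMM pricing quoted in the paper excerpt on slide 2 can be checked with a few lines of arithmetic (all figures are the paper's 2012 numbers; the implied server price is our inference from "less than half the price of the server", not a figure the paper states):

```python
# 2012 figures quoted in the excerpt: 16 GB DIMMs at ~$220 each.
dimm_gb = 16
dimm_price_usd = 220

dimms_needed = 192 // dimm_gb                 # DIMMs for a 192 GB server
memory_cost = dimms_needed * dimm_price_usd   # total memory spend

# "for less than half the price of the server" implies the server itself
# cost more than twice the memory spend.
implied_min_server_price = 2 * memory_cost

# With prices halving roughly every eighteen months, the same budget buys
# twice the capacity a cycle later, matching the excerpt's 192 -> 384 GB claim.
next_cycle_gb = 192 * 2
```

This makes the paper's point concrete: in 2012, hundreds of gigabytes of DRAM cost a few thousand dollars, well within the budget of a single analytics server.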
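The preview ends before the agenda's "How to reduce runtime by 99%" and "Algorithm Design" material, so the talk's actual method is not available here. As a purely hypothetical sketch of the idea the abstract hints at (using randomness strategically to trade a little accuracy for a lot of speed), one can estimate an aggregate from a 1% uniform sample instead of scanning all the data; the function name and parameters below are illustrative, not from the talk.

```python
import random

def estimate_sum(values, fraction=0.01, seed=0):
    """Estimate sum(values) by scanning only a uniform random sample.

    Reading a `fraction` of the data cuts the scan cost by roughly
    (1 - fraction); the estimation error shrinks as the sample grows.
    """
    rng = random.Random(seed)
    k = max(1, int(len(values) * fraction))
    sample = rng.sample(values, k)          # sample without replacement
    return sum(sample) * (len(values) / k)  # scale up to the full data set

data = list(range(100_000))
exact = sum(data)
approx = estimate_sum(data)  # touches only ~1% of the rows
```

For sums and counts the standard error of such an estimate falls as 1/sqrt(sample size), so even a 1% sample of a large data set typically lands within a few percent of the exact answer.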