NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications


Slides from:

There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.

I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.

About the speaker: Ted Dunning - Chief Application Architect (MapR)

Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.

Published in: Technology
  • A constant time interval implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
  • Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
  • Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
  • The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
  • A large enterprise cannot do that. It has to keep access to the old data and has to share between old data and Hadoop-accessible data. This doesn’t have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
  • But stock Hadoop does not handle this well.
  • This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
  • Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what provides the fundamental limitation or, in MapR’s case, the lack of a limit.
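The first note's claim about history can be checked with a little arithmetic. The sketch below is only an illustration and assumes that the amount accumulated per 10-unit window grows by a constant factor `g` from one window to the next (the values of `g` are made up, not taken from the slides). Under that assumption everything accumulated before the latest window is `1 / (g - 1)` times the latest window, so it is less than half whenever `g` exceeds 3.

```python
def history_ratio(growth_per_window: float, windows_of_history: int = 200) -> float:
    """Ratio of everything accumulated before the latest window to the latest window alone.

    Assumes each window holds `growth_per_window` times as much as the one before it.
    """
    latest = 1.0
    # Walk backwards in time: each older window is smaller by the growth factor.
    older = sum(latest / growth_per_window ** k for k in range(1, windows_of_history + 1))
    return older / latest


# Compare the summed history with the closed form 1 / (g - 1).
for g in (2.0, 4.0, 10.0):
    print(g, round(history_ratio(g), 4), round(1.0 / (g - 1.0), 4))
```

With a factor of 4 per window, for example, all earlier history adds up to about a third of the latest window, which is the regime the note describes.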

    1. 1. MapR, Architecture, Philosophy and Applications<br />NY HUG – October 2011<br />
    2. 2. Outline<br />Architecture (MapR)<br />Philosophy<br />Architectural (Machine learning)<br />
    3. 3. Map-Reduce, the Original Mission<br />Shuffle<br />Input<br />Output<br />
    4. 4. Bottlenecks and Issues<br />Read-only files<br />Many copies in I/O path<br />Shuffle based on HTTP<br />Can’t use new technologies<br />Eats file descriptors<br />Spills go to local file space<br />Bad for skewed distribution of sizes<br />
    5. 5. MapR Areas of Development<br />
    6. 6. MapR Improvements<br />Faster file system<br />Fewer copies<br />Multiple NICs<br />No file descriptor or page-buf competition<br />Faster map-reduce<br />Uses distributed file system<br />Direct RPC to receiver<br />Very wide merges<br />
    7. 7. MapR Innovations<br />Volumes<br />Distributed management<br />Data placement<br />Read/write random access file system<br />Allows distributed meta-data<br />Improved scaling<br />Enables NFS access<br />Application-level NIC bonding<br />Transactionally correct snapshots and mirrors<br />
    8. 8. MapR's Containers<br />Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks<br /><ul><li>Each container contains</li><li>Directories & files</li><li>Data blocks</li><li>Replicated on servers</li><li>No need to manage directly</li></ul>Containers are 16-32 GB segments of disk, placed on nodes<br />
    13. 13. MapR's Containers<br /><ul><li>Each container has a replication chain</li><li>Updates are transactional</li><li>Failures are handled by rearranging replication</li></ul>
    20. 20. Container locations and replication<br />CLDB<br />N1, N2<br />N1<br />N3, N2<br />N1, N2<br />N2<br />N1, N3<br />N3, N2<br />N3<br />Container location database (CLDB) keeps track of nodes hosting each container and replication chain order<br />
    21. 21. MapR Scaling<br />Containers represent 16-32 GB of data<br /><ul><li>Each can hold up to 1 billion files and directories</li><li>100M containers = ~2 exabytes (a very large cluster)</li></ul>250 bytes DRAM to cache a container<br /><ul><li>25 GB to cache all containers for a 2 EB cluster</li><li>But not necessary, can page to disk</li><li>Typical large 10 PB cluster needs 2 GB</li></ul>Container reports are 100x-1000x smaller than HDFS block reports<br /><ul><li>Serve 100x more data nodes</li><li>Increase container size to 64 GB to serve a 4 EB cluster</li><li>Map/reduce not affected</li></ul>
    26. 26. MapR's Streaming Performance<br />11 x 7200rpm SATA<br />11 x 15Krpm SAS<br />MB per sec<br />Higher is better<br />Tests: i. 16 streams x 120GB ii. 2000 streams x 1GB<br />
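The headline figures on the MapR Scaling slide can be reproduced with back-of-the-envelope arithmetic. This is only a sketch: the container count, container sizes and 250-byte cache entry come from the slide, while the use of decimal units (1 GB = 10^9 bytes) and the function names are assumptions made for the illustration.

```python
GB = 10**9   # assumption: decimal units
EB = 10**18

CONTAINERS = 100_000_000          # "100M containers" from the slide
CACHE_BYTES_PER_CONTAINER = 250   # "250 bytes DRAM to cache a container"


def cluster_capacity_eb(container_gb: float, n: int = CONTAINERS) -> float:
    """Total data held by n containers of the given size, in exabytes."""
    return n * container_gb * GB / EB


def cache_size_gb(n: int = CONTAINERS) -> float:
    """DRAM needed to cache the location entry of every container, in gigabytes."""
    return n * CACHE_BYTES_PER_CONTAINER / GB


low, high = cluster_capacity_eb(16), cluster_capacity_eb(32)
print(f"capacity: {low:.1f} to {high:.1f} EB")
print(f"container cache: {cache_size_gb():.0f} GB")
```

The slide's "~2 exabytes" sits inside the range this gives for 16 to 32 GB containers, and the cache figure matches the 25 GB quoted for the 2 EB cluster.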
    27. 27. Terasort on MapR<br />10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm<br />Elapsed time (mins)<br />Lower is better<br />
    28. 28. HBase on MapR<br />YCSB Random Read with 1 billion 1K records<br />10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM<br />Records per second<br />Higher is better<br />
    29. 29. Small Files (Apache Hadoop, 10 nodes)<br />Out of box<br />Op: - create file - write 100 bytes - close<br />Notes:<br />- NN not replicated<br />- NN uses 20G DRAM<br />- DN uses 2G DRAM<br />Tuned<br />Rate (files/sec)<br /># of files (m)<br />
    30. 30. MUCH faster for some operations<br />Same 10 nodes …<br />Create<br />Rate<br /># of files (millions)<br />
    31. 31. What MapR is not<br />Volumes != federation<br />MapR supports > 10,000 volumes all with independent placement and defaults<br />Volumes support snapshots and mirroring<br />NFS != FUSE<br />Checksum and compress at gateway<br />IP fail-over<br />Read/write/update semantics at full speed<br />MapR != maprfs<br />
    32. 32. Philosophy<br />
    33. 33. Physics of startup companies<br />
    34. 34. For startups<br />History is always small<br />The future is huge<br />Must adopt new technology to survive<br />Compatibility is not as important<br />In fact, incompatibility is assumed<br />
    35. 35. Physics of large companies<br />Absolute growth still very large<br />Startup phase<br />
    36. 36. For large businesses<br />Present state is always large<br />Relative growth is much smaller<br />Absolute growth rate can be very large<br />Must adopt new technology to survive<br />Cautiously!<br />But must integrate technology with legacy<br />Compatibility is crucial<br />
    37. 37. The startup technology picture<br />No compatibility requirement<br />Old computers<br />and software<br />Expected hardware<br />and software growth<br />Current computers<br />and software<br />
    38. 38. The large enterprise picture<br />Must work<br />together<br />?<br />Proof of concept Hadoop cluster<br />Current hardware<br />and software<br />Long-term Hadoop cluster<br />
    39. 39. What does this mean?<br />Hadoop is very, very good at streaming through things in batch jobs<br />HBase is good at persisting data in very write-heavy workloads<br />Unfortunately, the foundation of both systems is HDFS, which does not export or import well<br />
    40. 40. Narrow Foundations<br />Pig<br />Hive<br />Big data is heavy and expensive to move.<br />Web Services<br />Sequential File Processing<br />OLAP<br />OLTP<br />Map/<br />Reduce<br />Hbase<br />RDBMS<br />NAS<br />HDFS<br />
    41. 41. Narrow Foundations<br />Because big data has inertia, it is difficult to move<br />It costs time to move<br />It costs reliability because of more moving parts<br />The result is many duplicate copies<br />
    42. 42. One Possible Answer<br />Widen the foundation<br />Use standard communication protocols<br />Allow conventional processing to share with parallel processing<br />
    43. 43. Broad Foundation<br />Pig<br />Hive<br />Web Services<br />Sequential File Processing<br />OLAP<br />OLTP<br />Map/<br />Reduce<br />Hbase<br />RDBMS<br />NAS<br />HDFS<br />MapR<br />
    44. 44. New Capabilities<br />
    45. 45. Export to the world<br />NFS<br />Server<br />NFS<br />Server<br />NFS<br />Server<br />NFS<br />Server<br />NFS<br />Client<br />
    46. 46. Local server<br />Client<br />Application<br />NFS<br />Server<br />Cluster Nodes<br />
    47. 47. Universal export to self<br />Cluster Nodes<br />Cluster<br />Node<br />Task<br />NFS<br />Server<br />
    48. 48. Cluster<br />Node<br />Task<br />NFS<br />Server<br />Cluster<br />Node<br />Task<br />Cluster<br />Node<br />Task<br />NFS<br />Server<br />NFS<br />Server<br />Nodes are identical<br />
    49. 49. Application architecture<br />High performance map-reduce is nice<br />But algorithmic flexibility is even nicer<br />
    50. 50. Hybrid model flow<br />Map-reduce<br />Map-reduce<br />Feature extraction<br />and <br />down sampling<br />Down <br />stream <br />modeling<br />Deployed<br />Model<br />??<br />SVD<br />(PageRank)<br />(spectral)<br />
    52. 52. Hybrid model flow<br />Feature extraction<br />and <br />down sampling<br />Down <br />stream <br />modeling<br />Deployed<br />Model<br />Sequential<br />Map-reduce<br />SVD<br />(PageRank)<br />(spectral)<br />
    53. 53. Sharded text indexing<br />Mapper assigns document to shard<br />Shard is usually hash of document id<br />Reducer indexes all documents for a shard<br />Indexes created on local disk<br />On success, copy index to DFS<br />On failure, delete local files<br />Must avoid directory collisions<br />Can’t use shard id!<br />Must manage and reclaim local disk space<br />
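The mapper and reducer roles on the sharded text indexing slide can be sketched in a few lines. This is an illustration only: it uses a toy in-memory inverted index in place of a real search index, leaves out the local-disk and DFS copy steps, and the function names and sample documents are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Mapper step: a stable hash of the document id picks the shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def map_phase(docs, num_shards):
    """Group (doc_id, text) pairs by shard, as the shuffle would."""
    shards = defaultdict(list)
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)].append((doc_id, text))
    return shards


def reduce_phase(shard_docs):
    """Reducer step: index every document assigned to one shard."""
    index = defaultdict(set)
    for doc_id, text in shard_docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(indexes, term):
    """Query every shard and merge the hits."""
    hits = set()
    for index in indexes.values():
        hits |= index.get(term.lower(), set())
    return hits


docs = {
    "doc-1": "Hadoop streams through large files",
    "doc-2": "NFS exports the cluster to legacy tools",
    "doc-3": "Hadoop and NFS can share one file system",
}
indexes = {shard: reduce_phase(items) for shard, items in map_phase(docs, 4).items()}
print(sorted(search(indexes, "hadoop")))
```

`zlib.crc32` is used instead of Python's built-in `hash` because string hashing is randomized per process, and shard assignment has to be the same on every node.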
    54. 54. Sharded text Indexing<br />Index text to local disk and then copy index to distributed file store<br />Assign documents to shards<br />Map<br />Reducer<br />Clustered index storage<br />Input documents<br />Copy to local disk typically required before index can be loaded<br />Local<br />disk<br />Search<br />Engine<br />Local<br />disk<br />
    55. 55. Conventional data flow<br />Failure of search engine requires another download of the index from clustered storage.<br />Map<br />Failure of a reducer causes garbage to accumulate in the local disk<br />Reducer<br />Clustered index storage<br />Input documents<br />Local<br />disk<br />Search<br />Engine<br />Local<br />disk<br />
    56. 56. Simplified NFS data flows<br />Index to task work directory via NFS<br />Map<br />Reducer<br />Search<br />Engine<br />Input documents<br />Clustered index storage<br />Failure of a reducer is cleaned up by map-reduce framework<br />Search engine reads mirrored index directly.<br />
    57. 57. Simplified NFS data flows<br />Search<br />Engine<br />Mirroring allows exact placement of index data<br />Map<br />Reducer<br />Input documents<br />Search<br />Engine<br />Arbitrary levels of replication also possible<br />Mirrors<br />
    58. 58. K-means<br />Classic E-M based algorithm<br />Given cluster centroids,<br />Assign each data point to nearest centroid<br />Accumulate new centroids<br />Rinse, lather, repeat<br />
    59. 59. K-means, the movie<br />Centroids<br />Assign to nearest centroid<br />Input<br />Aggregate new centroids<br />
    60. 60. Parallel Stochastic Gradient Descent<br />Model<br />Train sub model<br />Input<br />Average models<br />
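The "train sub model, average models" flow on the parallel SGD slide can be illustrated with a toy problem. This is a minimal sketch under stated assumptions: a one-dimensional linear model, noiseless synthetic data, and a learning rate, epoch count and shard count chosen only for the example. None of these come from the slides.

```python
import random


def sgd_train(samples, epochs=200, lr=0.1, seed=0):
    """Train y = w*x + b on one shard with plain stochastic gradient descent."""
    rng = random.Random(seed)
    samples = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)          # visit the shard in a new order each epoch
        for x, y in samples:
            err = (w * x + b) - y     # gradient of squared error, up to a factor of 2
            w -= lr * err * x
            b -= lr * err
    return w, b


def average_models(models):
    """Combine sub-models by averaging their parameters."""
    n = len(models)
    return (sum(w for w, _ in models) / n, sum(b for _, b in models) / n)


# Synthetic data from y = 2x + 1 (hypothetical), split round-robin into shards.
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]
num_shards = 4
shards = [data[k::num_shards] for k in range(num_shards)]

sub_models = [sgd_train(shard, seed=k) for k, shard in enumerate(shards)]
w, b = average_models(sub_models)
print(f"w={w:.3f} b={b:.3f}")
```

In a map-reduce setting each `sgd_train` call would run in its own task and the averaging would happen in a single downstream step.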
    61. 61. Variational Dirichlet Assignment<br />Model<br />Gather sufficient statistics<br />Input<br />Update model<br />
    62. 62. Old tricks, new dogs<br />Mapper<br />Assign point to cluster<br />Emit cluster id, (1, point)<br />Combiner and reducer<br />Sum counts, weighted sum of points<br />Emit cluster id, (n, sum/n)<br />Output to HDFS<br />Read from local disk from distributed cache<br />Read from<br />HDFS to local disk by distributed cache<br />Written by map-reduce<br />
    63. 63. Old tricks, new dogs<br />Mapper<br />Assign point to cluster<br />Emit cluster id, (1, point)<br />Combiner and reducer<br />Sum counts, weighted sum of points<br />Emit cluster id, (n, sum/n)<br />Output to HDFS<br />Read from<br />NFS<br />Written by map-reduce<br />MapR FS<br />
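The mapper and combiner/reducer described on the "Old tricks, new dogs" slides amount to one k-means iteration. The sketch below follows the slide's emit shapes, `(cluster id, (1, point))` from the mapper and `(n, sum/n)` from the combiner and reducer, but it runs in memory: the centroids are passed as a plain list rather than read via the distributed cache or NFS, and all names and sample points are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))


def mapper(point, centroids):
    """Assign the point to a cluster and emit (cluster id, (1, point))."""
    return nearest(point, centroids), (1, point)


def combine(values):
    """Sum counts and the count-weighted sum of points, then emit (n, sum/n).

    The output has the same shape as the input, so the same function
    serves as both combiner and reducer.
    """
    total = sum(n for n, _ in values)
    dims = len(values[0][1])
    mean = tuple(sum(n * p[d] for n, p in values) / total for d in range(dims))
    return total, mean


def kmeans_step(splits, centroids):
    """One iteration: map and combine per input split, then reduce per cluster."""
    reducer_input = defaultdict(list)
    for split in splits:                      # one map task per split
        local = defaultdict(list)
        for point in split:
            cid, value = mapper(point, centroids)
            local[cid].append(value)
        for cid, values in local.items():     # combiner runs on map output
            reducer_input[cid].append(combine(values))
    new_centroids = list(centroids)           # clusters with no points keep their centroid
    for cid, values in reducer_input.items():
        new_centroids[cid] = combine(values)[1]
    return new_centroids


splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0), (0.0, 1.0)]]
centroids = kmeans_step(splits, [(1.0, 1.0), (9.0, 9.0)])
print(centroids)
```

Carrying the count alongside the mean is what lets partial results from different splits be merged without bias.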
    64. 64. Poor man’s Pregel<br />Mapper<br />Lines in bold can use conventional I/O via NFS<br />while not done:<br /> read and accumulate input models<br /> for each input:<br /> accumulate model<br /> write model<br /> synchronize<br /> reset input format<br />emit summary<br />
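The loop on the "Poor man's Pregel" slide can be sketched with ordinary file I/O. The code below is a toy under several assumptions: threads stand in for mapper tasks, a temporary directory stands in for an NFS-mounted cluster path, `threading.Barrier` stands in for whatever synchronization the real job would use, and the "model" is a single number pulled toward each worker's local mean. None of these details are from the slides.

```python
import json
import tempfile
import threading
from pathlib import Path


def worker(rank, inputs, shared, barrier, supersteps):
    model = 0.0
    for step in range(supersteps):
        # read and accumulate input models (everything written last superstep)
        previous = [json.loads(p.read_text())
                    for p in shared.glob(f"step{step - 1}-*.json")]
        if previous:
            model = sum(previous) / len(previous)
        # for each input: accumulate model
        total = 0.0
        for x in inputs:
            total += x
        model += 0.5 * (total / len(inputs) - model)
        # write model with ordinary file I/O
        (shared / f"step{step}-worker{rank}.json").write_text(json.dumps(model))
        # synchronize: nobody starts the next superstep until all have written
        barrier.wait()
        # "reset input format": the in-memory list is simply re-read next time


def run(shards, supersteps=40):
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)                     # stand-in for an NFS mount
        barrier = threading.Barrier(len(shards))
        threads = [threading.Thread(target=worker,
                                    args=(rank, shard, shared, barrier, supersteps))
                   for rank, shard in enumerate(shards)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        final = [json.loads(p.read_text())
                 for p in shared.glob(f"step{supersteps - 1}-*.json")]
        return sum(final) / len(final)         # emit summary


summary = run([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(summary)
```

Each worker only reads files from the previous superstep, all of which were fully written before the barrier released, so no reader ever sees a partial model.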
    65. 65. Click modeling architecture<br />Map-reduce<br />Side-data<br />Now via NFS<br />Feature extraction and down sampling<br />Input<br />Data join<br />Sequential SGD Learning<br />
    66. 66. Click modeling architecture<br />Map-reduce<br />Map-reduce<br />Side-data<br />Map-reduce cooperates with NFS<br />Sequential SGD Learning<br />Feature extraction and down sampling<br />Sequential SGD Learning<br />Input<br />Data join<br />Sequential SGD Learning<br />Sequential SGD Learning<br />
    67. 67. Trivial visualization interface<br />Map-reduce output is visible via NFS<br />Legacy visualization just works<br />$ R<br />> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")<br />> plot(error ~ t, x)<br />> q(save='n')<br />
    68. 68. Conclusions<br />We used to know all this<br />Tab completion used to work<br />5 years of work-arounds have clouded our memories<br />We just have to remember the future<br />