MapR, Implications for Integration


Ted Dunning's talk for the Sept. 2011 TriHUG meeting. Covers MapR's distributed file system and map-reduce technology.


MapR, Implications for Integration

  1. MapR, Implications for Integration
     CHUG – August 2011
  2. Outline
     MapR system overview
     Map-reduce review
     MapR architecture
     Performance results
     Map-reduce on MapR
     Architectural implications
     Search indexing / deployment
     EM algorithm for machine learning
     … and more …
  3. Map-Reduce
     [Diagram: input → map → shuffle → reduce → output]
  4. Bottlenecks and Issues
     Read-only files
     Many copies in the I/O path
     Shuffle based on HTTP
       Can’t use new technologies
       Eats file descriptors
     Spills go to local file space
       Bad for skewed distribution of sizes
  5. MapR Areas of Development
     [Diagram of MapR areas of development]
  6. MapR Improvements
     Faster file system
       Fewer copies
       Multiple NICs
       No file descriptor or page-buf competition
     Faster map-reduce
       Uses distributed file system
       Direct RPC to receiver
       Very wide merges
  7. MapR Innovations
     Volumes
       Distributed management
       Data placement
     Read/write random access file system
       Allows distributed meta-data
       Improved scaling
       Enables NFS access
     Application-level NIC bonding
     Transactionally correct snapshots and mirrors
  8. MapR's Containers
     Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
     Each container contains:
       Directories & files
       Data blocks
     Replicated on servers
     No need to manage directly
     Containers are 16-32 GB segments of disk, placed on nodes
  13. Container locations and replication
      [Diagram: CLDB mapping containers to hosting nodes N1, N2, N3]
      Container location database (CLDB) keeps track of nodes hosting each container
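The CLDB's role can be sketched as a simple location map. This is an illustrative model only; the class and method names here are hypothetical, not MapR's actual API or data structures:

```python
# Minimal sketch (hypothetical, not MapR's implementation) of a
# container location database: a map from container id to the list
# of nodes hosting that container's replicas.

class ContainerLocationDB:
    """Tracks which nodes host each container replica."""

    def __init__(self):
        self.locations = {}  # container id -> list of node names

    def register(self, container_id, nodes):
        # Record the nodes holding replicas of this container.
        self.locations[container_id] = list(nodes)

    def lookup(self, container_id):
        # Clients ask the CLDB once, then talk to the nodes directly.
        return self.locations.get(container_id, [])

cldb = ContainerLocationDB()
cldb.register(1, ["N1", "N2"])
cldb.register(2, ["N3", "N2"])
print(cldb.lookup(1))  # ['N1', 'N2']
```

The point of the design is that the CLDB only answers location queries; the data path goes straight to the hosting nodes.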
  14. MapR Scaling
      Containers represent 16-32 GB of data
        Each can hold up to 1 billion files and directories
        100M containers = ~2 exabytes (a very large cluster)
      250 bytes of DRAM to cache a container
        25 GB to cache all containers for a 2 EB cluster
        But not necessary; can page to disk
        Typical large 10 PB cluster needs 2 GB
      Container reports are 100x-1000x smaller than HDFS block reports
        Serve 100x more data nodes
        Increase container size to 64 GB to serve a 4 EB cluster
        Map/reduce not affected

  19. MapR's Streaming Performance
      [Chart: MB per second, 11 x 7200 rpm SATA vs. 11 x 15K rpm SAS; higher is better]
      Tests: (i) 16 streams x 120 GB; (ii) 2000 streams x 1 GB
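The container-scaling arithmetic a couple of slides back can be sanity-checked. This is my own back-of-envelope calculation using a ~20 GB container as a midpoint of the stated 16-32 GB range:

```python
# Sanity check of the container-scaling numbers quoted in the deck.
# The 20 GB container size is an assumed midpoint, not a quoted figure.

GB = 2**30
container_size = 20 * GB          # containers are 16-32 GB; use ~20 GB
n_containers = 100_000_000        # 100M containers

cluster_bytes = n_containers * container_size
print(cluster_bytes / 2**60)      # ~1.9 exabytes, i.e. roughly 2 EB

dram_per_container = 250          # bytes of DRAM to cache one container
cache_bytes = n_containers * dram_per_container
print(cache_bytes / GB)           # ~23 GB, in line with the quoted ~25 GB
```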
  20. Terasort on MapR
      10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
      [Chart: elapsed time (mins); lower is better]
  21. HBase on MapR
      YCSB random read with 1 billion 1K records
      10+1 node cluster: 8 core, 24 GB DRAM, 11 x 1 TB 7200 rpm
      [Chart: records per second; higher is better]
  22. Small Files (Apache Hadoop, 10 nodes)
      Op: create file, write 100 bytes, close
      Notes: NN not replicated; NN uses 20 GB DRAM; DN uses 2 GB DRAM
      [Chart: rate (files/sec) vs. # of files (millions), out-of-box and tuned configurations]
  23. MUCH faster for some operations
      Same 10 nodes …
      [Chart: file-create rate vs. # of files (millions)]
  24. What MapR is not
      Volumes != federation
        MapR supports > 10,000 volumes, all with independent placement and defaults
        Volumes support snapshots and mirroring
      NFS != FUSE
        Checksum and compress at gateway
        IP fail-over
        Read/write/update semantics at full speed
      MapR != maprfs
  25. New Capabilities
  26. NFS mounting models
      Export to the world
        NFS gateway runs on selected gateway hosts
      Local server
        NFS gateway runs on the local host
        Enables local compression and checksumming
      Export to self
        NFS gateway runs on all data nodes, mounted from localhost
  27. Export to the world
      [Diagram: an NFS client talking to several NFS servers in front of the cluster]
  28. Local server
      [Diagram: client application with a local NFS server in front of the cluster nodes]
  29. Universal export to self
      [Diagram: a cluster node running both a task and a local NFS server]
  30. [Diagram continued: every cluster node runs a task and an NFS server]
      Nodes are identical
  31. Application architecture
      So now we have a hammer
      Let’s find us some nails!
  32. Sharded text indexing
      Index text to local disk, then copy the index to the distributed file store
      Assign documents to shards
      [Diagram: input documents → map → reducer → local disk → clustered index storage → search engine]
      Copy to local disk typically required before the index can be loaded
  33. Sharded text indexing
      Mapper assigns each document to a shard
        Shard is usually a hash of the document id
      Reducer indexes all documents for a shard
        Indexes created on local disk
        On success, copy index to DFS
        On failure, delete local files
      Must avoid directory collisions
        Can’t use shard id!
      Must manage and reclaim local disk space
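The shard-assignment step above can be sketched as follows. The hash function and names here are illustrative, not from the talk:

```python
# Sketch of the mapper's shard-assignment step: route each document
# to a shard by hashing its id. MD5 is an arbitrary stable choice.

import hashlib

N_SHARDS = 8

def shard_for(doc_id):
    # Stable hash so the same document always lands on the same shard,
    # regardless of which mapper processes it.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_SHARDS

docs = ["doc-1", "doc-2", "doc-3"]
for d in docs:
    print(d, "->", shard_for(d))
```

Note the slide's warning still applies: the shard id alone is not a safe output directory name, since a re-run reducer for the same shard would collide with its failed predecessor's files.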
  34. Conventional data flow
      Failure of the search engine requires another download of the index from clustered storage
      Failure of a reducer causes garbage to accumulate on local disk
      [Diagram: input documents → map → reducer → local disk → clustered index storage → local disk → search engine]
  35. Simplified NFS data flows
      Index to the task work directory via NFS
      Failure of a reducer is cleaned up by the map-reduce framework
      Search engine reads the mirrored index directly
      [Diagram: input documents → map → reducer → clustered index storage → search engine]
  36. Simplified NFS data flows
      Mirroring allows exact placement of index data
      Arbitrary levels of replication also possible
      [Diagram: map → reducer → mirrors → search engines]
  37. How about another one?
  38. K-means
      Classic E-M based algorithm
      Given cluster centroids:
        Assign each data point to the nearest centroid
        Accumulate new centroids
        Rinse, lather, repeat
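A minimal sketch of this E-M loop, on 1-D points for brevity (illustrative code, not from the talk):

```python
# Minimal k-means E-M loop on 1-D points.
# E step: assign points to nearest centroid; M step: recompute means.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # E step: assign each point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # M step: move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0]))  # roughly [1.0, 9.0]
```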
  39. K-means, the movie
      [Diagram: centroids → input assigned to nearest centroid → aggregate new centroids → back to centroids]
  40. But …
  41. Parallel Stochastic Gradient Descent
      [Diagram: input splits → train sub-models in parallel → average models → model]
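The train-then-average pattern in the diagram can be sketched like this, with a trivial stand-in learner rather than the talk's actual SGD model:

```python
# Sketch of parallel model training with averaging: each worker trains
# on its own split, then the sub-models' parameters are averaged.
# The "learner" is a trivial stand-in (model = mean of the split).

def train_submodel(split):
    # Stand-in for per-split SGD training.
    return sum(split) / len(split)

def average_models(models):
    # Combine the independently trained sub-models.
    return sum(models) / len(models)

splits = [[1.0, 2.0], [3.0, 5.0], [4.0, 6.0]]
submodels = [train_submodel(s) for s in splits]
print(average_models(submodels))  # 3.5
```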
  42. Variational Dirichlet Assignment
      [Diagram: input → gather sufficient statistics → update model → back to model]
  43. Old tricks, new dogs
      Mapper
        Assign point to cluster
        Emit cluster id, (1, point)
      Combiner and reducer
        Sum counts, weighted sum of points
        Emit cluster id, (n, sum/n)
      Output to HDFS
      Centroids: read from HDFS to local disk by the distributed cache, then read from local disk
      Results: written by map-reduce
  44. Old tricks, new dogs
      Mapper
        Assign point to cluster
        Emit cluster id, (1, point)
      Combiner and reducer
        Sum counts, weighted sum of points
        Emit cluster id, (n, sum/n)
      Output to HDFS
      Centroids: read directly from NFS (MapR FS)
      Results: written by map-reduce
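The mapper/combiner steps above can be sketched for one k-means iteration, emulating emits with an in-memory stream rather than the Hadoop API (illustrative only):

```python
# One k-means iteration in mapper/combiner form, with emits emulated
# by generators and dicts instead of the Hadoop framework.

from collections import defaultdict

def mapper(points, centroids):
    # Emit (cluster id, (1, point)) for each point.
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda i: abs(p - centroids[i]))
        yield cid, (1, p)

def combine(emits):
    # Sum counts and weighted sums, emit (cluster id, (n, sum/n)).
    counts = defaultdict(int)
    sums = defaultdict(float)
    for cid, (n, p) in emits:
        counts[cid] += n
        sums[cid] += p
    return {cid: (counts[cid], sums[cid] / counts[cid]) for cid in counts}

print(combine(mapper([1.0, 1.2, 9.0, 9.4], [0.0, 5.0])))
```

Because the combiner's output has the same shape as its input (a count and a weighted mean), the same aggregation runs on map outputs and again at the reducer.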
  45. Poor man’s Pregel
      Lines in bold can use conventional I/O via NFS
      Mapper:
          while not done:
              read and accumulate input models
              for each input:
                  accumulate model
              write model
              synchronize
              reset input format
          emit summary
  46. Click modeling architecture
      [Diagram: input → feature extraction and down-sampling (map-reduce) → data join → sequential SGD learning]
      Side-data now via NFS
  47. Click modeling architecture
      [Diagram: input → feature extraction and down-sampling (map-reduce) → data join (map-reduce) → several sequential SGD learners in parallel]
      Map-reduce cooperates with NFS
  48. And another…
  49. Hybrid model flow
      [Diagram: feature extraction and down-sampling (map-reduce) → SVD / PageRank / spectral (map-reduce) → ?? → down-stream modeling → deployed model]
  51. Hybrid model flow
      [Diagram: feature extraction and down-sampling → SVD / PageRank / spectral (sequential map-reduce) → down-stream modeling → deployed model]
  52. And visualization…
  53. Trivial visualization interface
      Map-reduce output is visible via NFS
      Legacy visualization just works

      $ R
      > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
      > plot(error ~ t, x)
      > q(save='n')
  54. Conclusions
      We used to know all this
      Tab completion used to work
      5 years of work-arounds have clouded our memories
      We just have to remember the future