How to Make Hadoop Easy, Dependable and Fast


  • MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution, including Hive, Pig, Oozie, and Sqoop, plus additional packages such as Cascading. We have spent over two years in a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates while providing a distribution that is 100% compatible with Apache Hadoop. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all roughly equivalent to one another.
  • With MapR, Hadoop is lights-out data center ready. MapR provides five 9's (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover; MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features. MapR provides point-in-time recovery to protect against application and user errors. There is end-to-end checksumming, so data corruption is automatically detected and corrected by MapR's self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations: every two weeks, an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.

    1. How to Make Hadoop Easy, Dependable, and Fast ©MapR Technologies - Confidential
    2. Agenda
       • Quick MapR overview
       • Typical and atypical use cases
         – restaurant recommendation
         – network security
         – mega-scale fraud modeling
         – log analysis through creative abuse of text retrieval
       • Lessons Learned
         – data import and export techniques
         – zen integration
         – accessing data from a variety of applications
         – how to protect data from the most common cause of data loss
    3. MapR’s Complete Distribution for Apache Hadoop
       • Apache applications are integrated, tested, hardened, and supported: Hive, Pig, Oozie, Sqoop, HBase, Whirr, Mahout, Cascading, Flume, ZooKeeper
       • MapR Control System: Heatmap™, LDAP/NIS integration, quotas, alerts, alarms, CLI, REST API; Nagios and Ganglia integration
       • 100% Hadoop, HBase, and HDFS API compatible: easy portability/migration between distributions, with no changes required to Hadoop applications
       • MapR’s Storage Services™: Direct Access NFS, real-time streaming, volumes, mirrors, snapshots, and data placement
       • No-NameNode architecture with high-performance direct shuffle, stateful failover, and self-healing
    4. MapR: Lights Out Data Center Ready
       • Reliable compute:
         – automated stateful failover
         – automated re-replication
         – self-healing from HW and SW failures
         – load balancing
         – rolling upgrades
         – no lost jobs or data
         – five 9’s (99.999%) of uptime
       • Dependable storage:
         – business continuity with snapshots and mirrors
         – recover to a point in time
         – end-to-end checksumming
         – strong consistency
         – data safe
         – mirror across sites to meet Recovery Time Objectives
    5. Restaurant Recommendation
       • Use transaction data to characterize users
       • Determine restaurant affinities for transactors
       • On demand, produce geo-local restaurant recommendations
       • Web or mobile interface
    6. Restaurant Recommendation
       • Training goes-ins:
         – transaction data from purchases
         – user feedback on recommendations
       • Training goes-outs:
         – large recommendation data files
       • Online goes-ins:
         – user id, current location, recent transaction history
         – filters
       • Online goes-outs:
         – restaurant recommendations
    7. What Is the Delivery Mechanism?
       • A database?
         – export takes forever
         – limited scalability
       • A key-value store?
         – export takes forever
         – YAWTM (yet another widget to maintain)
       • Do we really need a mechanism at all?
    8. Deploying Recommendations
       • Final recommendations computed in the browser/app
    9. Summary
       • With mirrors and NFS, no special “deployment” mechanism is necessary
       • The user’s browser can do final assembly of the recommendations
       • Recommendation components are served as static files by the web server
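The "no deployment mechanism" idea above can be sketched in a few lines. This is an illustrative sketch only: the component contents, names, and scoring rule are hypothetical stand-ins, and in the deck's design this merge happens in the browser or app rather than on a server.

```python
# Sketch of "final assembly" from static component files (all names and
# scores are hypothetical): the trainer publishes per-user affinity files
# and geo-local candidate files via mirrors + NFS, the web server serves
# them unchanged, and the client merges them at request time.
user_affinities = {"thai": 0.9, "bbq": 0.4}        # per-user component file
nearby = {"Thai Basil": "thai",                    # geo-local component file
          "Smoke Shack": "bbq",
          "Noodle House": "thai"}

# Final assembly: rank nearby restaurants by the user's cuisine affinity.
ranked = sorted(nearby, key=lambda r: user_affinities.get(nearby[r], 0.0),
                reverse=True)
print(ranked)  # ['Thai Basil', 'Noodle House', 'Smoke Shack']
```

Because both inputs are just static files, "deployment" reduces to writing new files and letting mirrors propagate them.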
    10. Mega-scale Fraud Modeling
       • Why not use the simplest modeling technology around?
         – similar folk do similar things!
         – just find tens of thousands of similar folk and see what they did
       • Can we make it a million times faster than the prototype?
         – well, yes … we can
       • And can you deploy that into a live system?
       • And can sequential and parallel versions co-exist?
    11. Modeling with k-nearest Neighbors (diagram: example points a, b, c)
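To show just how simple the "similar folk do similar things" technique is, here is a minimal k-nearest-neighbor classifier. This is an illustrative sketch, not the deck's production code; the training points and labels are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled
    points. `train` is a list of ((x, y), label) pairs."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "ok"), ((0.1, 0.2), "ok"),
         ((5.0, 5.0), "fraud"), ((5.1, 4.9), "fraud"), ((4.8, 5.2), "fraud")]
print(knn_predict(train, (4.9, 5.0)))  # fraud
```

A parallel version can partition `train` across nodes and merge each node's candidate neighbors, which is why having interchangeable sequential and parallel implementations matters on the following slides.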
    12. Speeds and Feeds
       • Single-machine version can cluster at 20 μs per point
         – 1 million points in ~20 s
         – 100 million points in ~2000 s ≈ 33 minutes
       • Parallel version can cluster at (20 μs / nodes) per point, plus ~30 s of startup overhead
         – 1 million points in ~31 s on 20 nodes (ish)
         – 100 million points in ~130 s ≈ 2 minutes (on 20 nodes)
       • Really would like interchangeable versions
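The scaling model behind these figures can be written down directly. This is a back-of-the-envelope sketch; the 20 μs per point and ~30 s parallel startup overhead are the slide's numbers.

```python
def cluster_time(points, nodes=1, per_point_us=20, startup_s=30):
    """Rough scaling model from the slide: per-point cost split across
    nodes, plus a fixed startup overhead when running in parallel."""
    compute_s = points * per_point_us / 1e6 / nodes
    return compute_s + (startup_s if nodes > 1 else 0)

print(cluster_time(1_000_000))              # 20.0 s on one machine
print(cluster_time(100_000_000))            # 2000.0 s on one machine
print(cluster_time(1_000_000, nodes=20))    # 31.0 s on 20 nodes
print(cluster_time(100_000_000, nodes=20))  # 130.0 s on 20 nodes
```

The fixed 30 s term is why parallelism only pays off for large inputs, and why the sequential and parallel versions need to stay interchangeable.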
    13. What About Deployment?
       • Final matrix size is several GB
       • Can’t have a copy per thread
         – can’t even wait to load many copies
       • What about mmap?
         – needs real files, so it can’t use HDFS
         – NFS works great
       • Need to deploy in both map-reduce and real-time environments
         – can’t depend on Hadoop features like the distributed cache
    14. (image slide)
    15. Summary
       • With mirrors and NFS, no special “deployment” mechanism is necessary
       • The modeling client can use NFS + mmap to share memory between threads or processes
       • Mirrors can stage as many replicas as desired on whichever machines are specified
    16. Network Security
       • Take an existing network security appliance
       • Add magical parallel machine learning to find new attacks
       • But don’t spend time copying data back and forth
       • And don’t change the legacy code
    17. (image slide)
    18. Summary
       • Legacy code “just works” with MapR’s NFS
       • Map-reduce programs don’t care where the input comes from
       • Exposing new control data requires no special mechanism
    19. Log Analysis
       • Receive 200K log lines per second or more
       • Want to do multi-field search
       • Want log lines searchable within 30 seconds of arrival
    20. Solr-Based Flexible Analytics
       • Solr/Lucene can index 500K small documents per second
       • Faceting provides simple aggregation
       • Multiple-index search is a given, not a special future enhancement
       • Solr/Lucene has an awesome record of stability
    21. Data Ingestion and Indexing (diagram: incoming data flows through Kafka to parallel text analysis and Solr indexers, which write raw documents, a real-time live index shard, and older index shards; the result is a set of time-sharded Solr indexes)
    22. Some Special Points
       • Textual analysis is done in parallel, outside of the indexer
       • Raw documents are stored outside of Solr to minimize index size
       • Index hot-spotting is a feature here because it gives time-based sharding
       • Indexing into NFS allows legacy code reuse
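Time-based sharding can be as simple as deriving the index name from the event timestamp. This is a sketch: the six-hour window and the naming scheme are made up here, but the effect matches the slide, with all writes landing in the current "hot" shard while older shards stay immutable.

```python
from datetime import datetime, timezone

def shard_for(ts: float, window_hours: int = 6) -> str:
    """Map an event timestamp to a time-based index shard name, so
    events from one time window all land in one Solr index."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    bucket = t.hour // window_hours * window_hours
    return f"logs-{t:%Y%m%d}-{bucket:02d}"

ts = datetime(2012, 6, 1, 14, 30, tzinfo=timezone.utc).timestamp()
print(shard_for(ts))  # logs-20120601-12
```

Queries over a time range then touch only the shards whose windows overlap that range.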
    23. Basic Search (diagram: a query goes to the web tier, which runs Solr searches across the live index shard and the older index shards produced by the indexers)
    24. Additional Points
       • The number of shards per core can be adjusted easily to match load
       • Near-real-time indexing is not really required
       • No transaction logs need be kept by Solr for failure tolerance
         – core failure requires other cores to take on the lost shards
         – indexer failure requires an indexer restart … Kafka retains unprocessed input
         – indexing is idempotent
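One common way to get idempotent indexing (a sketch, not necessarily what the deck's system does) is to derive each document's ID from its content, so replaying unprocessed Kafka input after an indexer restart overwrites the same documents instead of duplicating them.

```python
import hashlib

def doc_id(log_line: str) -> str:
    """Deterministic document ID: the same log line always maps to the
    same ID, so re-indexing a replayed line is a harmless overwrite."""
    return hashlib.sha1(log_line.encode("utf-8")).hexdigest()

line = "2012-06-01T14:30:00Z host7 sshd: session opened for user ted"
print(doc_id(line) == doc_id(line))          # True: replays collide on purpose
print(doc_id(line) != doc_id(line + "!"))    # True: distinct lines get distinct IDs
```

With deterministic IDs, at-least-once delivery from Kafka is enough; no Solr transaction log is needed to avoid duplicates.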
    25. Secure Search (diagram: a query goes to the web tier, where a security filter applies auth data before Solr searches run across the live and older index shards)
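The security filter can be sketched as the web tier appending a Solr filter query built from the user's entitlements. `q` and `fq` are standard Solr query parameters, but the `acl` field and the group names are hypothetical; this is an illustration of the pattern, not the deck's actual filter.

```python
from urllib.parse import urlencode

def secure_search_params(user_query, allowed_groups):
    """Build Solr query parameters with a mandatory ACL filter: the
    user's free-form query goes in `q`, and a filter query built from
    the user's groups goes in `fq`, so Solr intersects the search
    results with the documents the user is allowed to see."""
    fq = "acl:(" + " OR ".join(allowed_groups) + ")"
    return urlencode({"q": user_query, "fq": fq})

params = secure_search_params("error host:web1", ["ops", "security"])
print(params)
```

Keeping the filter in `fq` rather than splicing it into `q` means the user's query text can never widen the ACL.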
    26. Conclusions
    27. Lessons Learned
       • Import/export is often a non-issue
         – NFS allows processing in place
       • Legacy access via NFS provides high performance with minimal effort
       • Interchangeable map-reduce and conventional programs are key
       • Do simple tasks in simple ways; save the effort for the big tasks
    28. Zen Integration
       • The student went to the master and asked how to integrate multiple programs using different models.
         – The master said, “to do more, do less.”
       • The student went away and came back pointing out that HDFS allows copying data in and out. He quoted Turing.
         – The master said, “to do more, do less.”
       • The student thought about this for many days. In the meantime, the master installed MapR and deleted all the integration code. When the student returned and saw this, he asked where the integration was. The master answered “ ” and the student was enlightened.
    29. The Cause of Almost All Data Loss (image slide)
    30. The Cause of Almost All Data Loss (image slide)
    31. The Cause of Almost All Data Loss
       • And snapshots are the cure (partially)
    32. Time for Questions
       • Download MapR to learn more
       • Send email with questions later
       • Tweet as the spirit moves – @ted_dunning
       • These slides and other resources –