Expect More from Hadoop


MapR Technologies Chief Marketing Officer, Jack Norris, talks about the advantages of Hadoop. He elaborates on multiple use cases and explains how MapR Technologies offers the best Hadoop distribution.

Published in: Technology, Education
  • Let’s start with this chart. To reinforce that you’re in the right room and picked the right session: Hadoop. Not only is it the fastest growing Big Data technology, it is one of the fastest growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  • There are many drivers for Hadoop adoption…
  • One of the drivers for Hadoop adoption is storage cost: local disk is dramatically cheaper. You might say, “I can’t use raw disks because I need high-end availability, data protection, and speed.” We agree with you. That’s where MapR focused: bringing the performance and features of high-end storage to direct-attached storage. This is a paradigm shift.
  • MapReduce is a paradigm shift. Google is the poster child. What exactly does Hadoop look like?
  • This is a Hadoop distribution. It includes a series of open source packages that are tested, hardened, and combined into a complete suite. With MapR, we’ve combined this with our own innovations at the data platform level to make it highly available, dependable, and easier to access and integrate through industry standards like NFS, ODBC, etc.
  • How do you benefit? I mentioned that Hadoop is used across a wide variety of use cases. I’ve generalized these into 4 groups. The first…
  • …is expanding data: going from sampled data to all of the transactions. Netflix recommends 5 movies to you because they look at everybody’s movie watching and ratings and identify clusters of individuals like you. Risk triangles for insurance companies go from the zip code level down to the neighborhood street. Trading information goes from the last 3 months to 7 years.
  • Let’s look at a specific example…
  • Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis…
  • The problem with this process is that 70% of the EDW load is related to the CDR normalization process. AI: Why is this the case? CDR normalization is difficult within the EDW. Binary extraction and conversion to SQL is difficult.
  • The first is “simple algorithms and lots of data trump complex models”. This comes from an IEEE article written by 3 research directors at Google, titled “The Unreasonable Effectiveness of Data”. It was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, which made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. The Google paper talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But the authors found that in natural language processing, a notoriously complex area that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and, based on a corpus of thousands of pictures, fill in the missing background. This algorithm did rather poorly until they increased the corpus to millions of photos; with that amount of data, the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
  • Okay, interesting graphs, but how does this translate to the real world? Here are some broad examples.
  • Start with the right platform: one with the power to address your needs and the flexibility to grow with your expansion. If you haven’t started with this platform, it is easy to switch.
  • Take all of Twitter: 400 x 10^6 tweets per day < 400 GB per day < 40 MB/s
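The Twitter estimate in the notes above can be checked with a few lines of arithmetic. This is a minimal sketch, not something from the talk: the per-tweet size of 1,000 bytes is an assumption chosen because it is what makes 400 x 10^6 tweets come out to 400 GB, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(tweets_per_day, bytes_per_tweet):
    """Total data volume per day in gigabytes (1 GB = 10^9 bytes)."""
    return tweets_per_day * bytes_per_tweet / 1e9


def sustained_rate_mb_per_s(gb_per_day):
    """Average ingest rate in MB/s needed to absorb gb_per_day."""
    return gb_per_day * 1e9 / SECONDS_PER_DAY / 1e6


tweets_per_day = 400 * 10**6        # figure from the notes
assumed_bytes_per_tweet = 1000      # assumption: upper bound per tweet

volume_gb = daily_volume_gb(tweets_per_day, assumed_bytes_per_tweet)
rate_mb_s = sustained_rate_mb_per_s(volume_gb)

print(f"{volume_gb:.0f} GB/day, about {rate_mb_s:.1f} MB/s sustained")
```

Under that assumption the average rate is well below the 40 MB/s bound quoted in the notes, so the bound leaves generous headroom for peaks.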

    1. Expect More from Hadoop. Jack Norris, MapR Technologies (©MapR Technologies)
    2. Hadoop Growth
    3. Important Drivers for Hadoop
        • Data on compute
        • You don’t need to know what questions to ask beforehand
        • Simple algorithms on Big Data
        • Analysis of unstructured data
    4. The Cost of Enterprise Storage
        • SAN storage: $2 - $10/gigabyte. $1M gets 0.5 petabytes, 200,000 IOPS, 1 Gbyte/sec
        • NAS filers: $1 - $5/gigabyte. $1M gets 1 petabyte, 400,000 IOPS, 2 Gbyte/sec
        • Local storage: $0.02/gigabyte. $1M gets 50 petabytes, 10,000,000 IOPS, 800 Gbytes/sec
        • 1/100 to 1/20 the cost
    5. MapReduce: A Paradigm Shift
        • Distributed, scalable computing platform
            – Data/Compute framework
            – Commodity hardware
        • Pioneered at Google
        • Commercially available as Hadoop
    6. MapR Distribution for Apache Hadoop
        • Complete Hadoop distribution
        • Comprehensive management suite
        • Industry-standard interfaces
        • Enterprise-grade dependability
        • Higher performance
    7. How do you Benefit?
    8. Expanding data for existing applications
    9. Use Case #1
        • Major telecom vendor
        • Key step in billing pipeline handled by data warehouse (EDW)
        • EDW at maximum capacity
        • Multiple rounds of software optimization already done
        • Revenue limiting (= career limiting) bottleneck
    10. Original Flow (diagram labels: CDR billing records, Extract and Load, Transformation, Data Warehouse, Billing reports, Customer bills)
    11. Problem Analysis
        • 70% of EDW load is related to call detail record (CDR) normalization
            – < 10% of total lines of code
            – CDR normalization difficult within the EDW
            – Binary extraction and conversion
        • Data rates are too high for upstream transform
            – Requires high volume joins
    12. With ETL Offload (diagram labels: CDR billing records, ETL, Hadoop Cluster, Data Warehouse, Billing reports, Customer billing)
    13. Simplified Analysis
        • 70% of EDW consumed by ETL processing
            – Offload frees capacity
        • EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost
        • Additional EDW only increases capacity by 50% due to poor division of labor
    14. The Results
        • EDW strategy
            – 1.5x performance
            – $30 million
        • MapR strategy
            – 3x faster
            – 20x cost/performance advantage for MapR strategy
            – With High Availability and data protection
    15. Use Case #2: Combine Many Different Data Sources
    16. Use Case #2 – Customer Example
        • Global Credit Card Issuer
        • Launching a New Location Based Service
        • Benefits both Merchants and Consumers
    17. Combining different feeds on one platform (diagram labels: Hadoop and HBase Storage and Processing; Real-time data feed from social network; Stored in Hadoop; Historical Purchase Information; Billing Data; Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data)
    18. Results
        • New Service Rolled out in 1 quarter
        • Processing time cut from 20 hours per day to 3
        • Recommendation engine load time decreased from 8 hours to 3 minutes
        • Includes data versioning support for easier development and updating of models
    19. Use Case #3: New Application from New Data Source
    20. Ancestry.com – Family Tree
    21. Overview and Requirements
        • Collect and Collate information from disparate sources (Text files, Images, etc.)
        • Leverage new data source: Spit
        • Machine learning techniques and DNA Matching Algorithms
    22. The Results
        • Storage Infrastructure for billions of small and large files
        • Blob Store for large images through NoSQL solutions
        • Multi-tenant capability for data-mining and machine-learning algorithm development
        • One highly available, efficient platform
    23. MapR M7: Making HBase Enterprise Grade
        • Diagram labels: Other Distributions (Disks, ext3, JVM, DFS, JVM, HBase); Unified (Disks)
        • Column headings: Easy, Dependable, Fast
        • Features listed: No RegionServers; No compactions; Consistent low latency; Seamless splits; Instant recovery from node failure; Real-time in-memory configuration; Automatic merges; Snapshots; Disk and network compression; In-memory column families; Mirroring; Reduced I/O to disk
    24. Use Case: New Analytics on Existing Data
    25. Analytic Flexibility
        • MapReduce enabled Machine learning algorithms
        • Enhanced Search
        • Real-time event processing
        • No need to sample the data
        • Fraud Detection, Target Marketing, Consumer Behavior Analysis, …
    26. Hadoop Expands Analytics: “Simple algorithms and lots of data trump complex models” (Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems)
    27. Use Case #4: Combine All Three
    28. Where do you Start?
    29. One Platform for Big Data (diagram labels: Batch; 99.999% HA; Data Protection; Disaster Recovery; Scalability & Performance; Enterprise Integration; Multi-tenancy; Batch Processing; File-Based Applications; SQL; Database; Search; Stream Processing; Interactive; Realtime)
    30. World Record Performance
        • Why is MapR faster and more efficient?
            – C/C++ vs. Java
            – Distributed metadata
            – Optimized shuffle
        • New Minute Sort World Record: 1.5 TB in 1 minute, 2103 nodes
    31. Thank You
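The "$1M gets" capacity figures on the storage cost slide follow directly from the per-gigabyte prices. The sketch below reproduces that arithmetic only. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and uses the low end of each quoted price range for SAN and NAS, since that is what matches the slide's capacities; the names `petabytes_for_budget` and `PRICE_USD_PER_GB` are hypothetical.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # assumption: decimal units

# Price per gigabyte from the slide; low end of the range for SAN and NAS.
PRICE_USD_PER_GB = {
    "SAN storage": 2.00,
    "NAS filers": 1.00,
    "Local storage": 0.02,
}


def petabytes_for_budget(budget_usd, usd_per_gb):
    """Raw capacity in petabytes that a budget buys at a given $/GB."""
    return budget_usd / usd_per_gb / GB_PER_PB


capacity_pb = {
    tier: petabytes_for_budget(BUDGET_USD, price)
    for tier, price in PRICE_USD_PER_GB.items()
}

for tier, pb in capacity_pb.items():
    print(f"{tier}: {pb:.1f} PB per $1M")
```

The calculation covers capacity only. The IOPS and throughput figures on the slide depend on hardware details the slide does not give, so they are not modeled here.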