YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption


Published in: Technology
  • I want to take a minute and highlight a new offering that RedPoint recently announced: RedPoint Data Management for Hadoop. Simply put, it’s data management for Big Data: it lets users perform the same kinds of data management functions on Big Data that they already perform on traditional data, such as integrating, cleaning, appending, and reformatting it.


    Previously, if someone wanted to perform these data management functions on data stored in a Hadoop cluster, they had two options:

    They could use MapReduce, the programming model used with Hadoop. But programming data management processes with MapReduce is complex – it’s real coding, as you can see in this little snippet of MapReduce code. So, it requires new skills that many companies don’t have on staff. And MapReduce, while able to scale and process large volumes of data, isn’t actually a very efficient way to execute data management processes, so it winds up either being slow or being fast and consuming vast computing resources while it executes. So, MapReduce hasn’t been a great option for data management in Hadoop.

    The other option was to move data out of Hadoop into a more traditional data store and perform data management procedures there. But this takes extra time and effort, and is expensive because you need to buy the extra (often more expensive) storage on top of what you’ve already spent on Hadoop. Really, this approach defeats the entire purpose of Hadoop, which is to keep the data in Hadoop where it’s the most economical.


    But now, with the advent of Hadoop 2.0 and RedPoint Data Management for Hadoop, there’s another option. With RedPoint, you get an easy-to-use interface to perform your data management functions – the same user interface already used and appreciated by many RedPoint Data Management customers. This allows you to leverage your existing data management and data analyst skills, rather than investing in new MapReduce skills. All your data management processes will execute right in Hadoop, using the YARN infrastructure that’s part of Hadoop 2.0. And it’s fast and efficient, since there’s no MapReduce involved.

    Even more valuable, it’s possible to use RedPoint Data Management for Hadoop to combine the Big Data in Hadoop with your traditional data to create a more complete view of your customers, to increase customer insight, and to make targeted marketing more relevant and effective.

    And by using RedPoint Data Management for Hadoop, the data immediately becomes actionable, because RedPoint’s Data Management functionality is connected to RedPoint Interact, our campaign and interaction management software.

    All these benefits are only available from RedPoint because RedPoint Data Management for Hadoop is the only pure YARN data management platform.


    In summary, RedPoint Data Management for Hadoop makes Hadoop data management easy, fast, and low-cost. And it makes Big Data clean, integrated, and usable.

  • YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption

    1. YARN: The Key To Overcoming The Challenges To Broad Based Hadoop Adoption
    2. Overview - What is Hadoop/Hadoop 2.0
       RedPoint Global Inc., 17 June 2014, © Confidential
       Lower cost scaling; No need for structure; Ease of data capture
       Hadoop 1.0
       • All operations based on MapReduce
       • Intrinsic inconsistency of code-based solutions
       • Highly skilled and expensive resources needed
       • 3rd party applications constrained by the need to generate code
       Hadoop 2.0
       • Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
       • Mature applications can now operate directly on Hadoop
       • Reduced skill requirements and increased consistency
    3. RedPoint Data Management on Hadoop
       [Diagram. Labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; Parallel Section (UI); YARN; MapReduce]
    4. Top Challenges to Adoption
       Skills Gap
       • Severe shortage of MR-skilled resources
       • Very expensive resources that are hard to retain
       • Inconsistent skills lead to inconsistent results
       • Underutilizes existing resources
       • Prevents broad leverage of investments across the enterprise
       Maturity & Governance
       • A nascent technology ecosystem around Hadoop
       • Emerging technologies only address narrow slivers of functionality
       • New applications are not enterprise class
       • Legacy applications have built short-term capabilities
       Data Into Information
       • Data is not useful in its raw state; it must be turned into information
       • Benefit of Hadoop is that the same data can be used from many perspectives
       • Analysts must now do the structuring of the data based on the intended use of the data
    5. RedPoint Overcomes Challenges
       First YARN compliant ETL/data quality toolset on the market. Brings together both Big Data and traditional data to create Big Information!
       RANKED #1 by [logo] in:
       • Customer or Party Data
       • Processing Speed
       • Match Quality
       • Ease of Use
       The power to make your data the biggest asset your organization has
    6. Key features of RedPoint Data Management
       ETL & ELT
       • Profiling, reads/writes, transformations
       • Single project for all jobs
       Data Quality
       • Cleanse data
       • Parsing, correction
       • Geo-spatial analysis
       Integration & Matching
       • Grouping
       • Fuzzy match
       Master Key Management
       • Create keys
       • Track changes
       • Maintain matches over time
       Web Services Integration
       • Consume and publish
       • HTTP/HTTPS protocols
       • XML/JSON/SOAP formats
       Process Automation & Operations
       • Job scheduling, monitoring, notifications
       • Central point of control
       All functions can be used on both TRADITIONAL and BIG DATA
       Creates clean, integrated, actionable data, quickly, reliably and at low cost
    7. RedPoint Functional Footprint
       [Architecture diagram. Labels: Monitoring and Management Tools (AMBARI); Data Sources (RDBMS, EDW, JMS Queues, Files); SOURCE DATA (Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory DBs); LOAD (SQOOP, WebHDFS, Flume, NFS); LOAD (SQOOP/Hive, WebHDFS); DATA REFINEMENT; MAPREDUCE; HIVE; PIG; HCATALOG (metadata services); INTERACTIVE (HIVE Server2); REST; HTTP; STREAM; STRUCTURE; Query/Visualization/Reporting/Analytical Tools and Apps; YARN; HDFS nodes 1 to n]
    8. No Coding Necessary
       PREVIOUS OPTIONS
       Use MapReduce
       • complex
       • requires new skills
       • inefficient execution
       Move data out of Hadoop
       • extra time and effort
       • extra storage (expensive)
       • defeats the purpose of Hadoop
       WITH REDPOINT, the only pure YARN data management platform
       For data management in Hadoop:
       • Easy-to-use interface
       • Leverages existing skills
       • Executes in Hadoop 2.0 (using YARN architecture)
       • Fast: no MapReduce
       • Can combine Big Data with traditional data
       • Data becomes actionable by RedPoint Interaction
       Makes Hadoop data management easy, fast, low-cost. Makes Big Data clean, integrated, usable. You get more out of your Big Data investment.
    9. RedPoint DM for Hadoop: Processing Flow
       [Flow diagram with step markers 1, 2, 3. Components: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; three Node Managers hosting the DM App Master and DM Tasks. Arrow labels: Launches DM App Master; Launches Tasks; Running DM Task]
    10. The Data Management designer
    11. DM Parallel Section on Hadoop
    12. DM Hadoop Settings
    13. RedPoint Benchmarks - Project Gutenberg

        Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

            import java.io.IOException;
            import java.util.StringTokenizer;
            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.Mapper;

            // WordOffset is a custom key type defined elsewhere in the full code.
            public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
                // Characters treated as word separators (the inner double quote is escaped).
                private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                // Emits (token, 1) for every token in the input line.
                public void map(WordOffset key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer itr = new StringTokenizer(line, delimiters);
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, one);
                    }
                }
            }

        Sample Pig script without the UDF:

            SET pig.maxCombinedSplitSize 67108864
            SET pig.splitCombination true
            A = LOAD '/testdata/pg/*/*/*';
            -- The next line is clipped on the slide; the closing ")) AS word;" is assumed.
            B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
            C = FOREACH B GENERATE UPPER(word) AS word;
            D = GROUP C BY word;
            E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
            F = ORDER E BY occurrences DESC;
            STORE F INTO '/user/cleonardi/pg/pig-count';

        Comparison:
        • MapReduce: >150 lines of MR code; 6 hours of development; 6 minutes runtime; extensive optimization needed
        • Pig: ~50 lines of script code; 3 hours of development; 15 minutes runtime; User Defined Functions required prior to running script
        • RedPoint: 0 lines of code; 15 min. of development; 3 minutes runtime; no tuning or optimization required
    14. RedPoint in a modern data architecture
        • Pure YARN application. No MapReduce needed. No in-cluster installation.
        • One application, one graphical user interface for traditional and Big Data
        • Pre-built native adapters
        [Architecture diagram. Layers and labels:]
        APPLICATIONS: Any analytics; Any reporting; Any other application
        RedPoint (Data Quality, Data Integration, Identity Resolution): ELT, ETL, Cleanse, Match, De-dupe, Merge/Purge, Household, Partition, Parse, Append, Standardize, Key, Automate, Monitor, Notify
        DATA SYSTEMS: HADOOP (YARN; HDFS nodes 1 to n); TRADITIONAL REPOSITORIES + others
        DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensors, social media)
    15. Who Should Care
        • Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started.
        • Companies already investing heavily in Big Data Analytics technologies that are stuck due to the shortage of skilled resources.
        • Large organizations that are focused on “Operational Offloading” and need to achieve it cost-effectively.
        • Companies that recognize that much of the data that lands in Hadoop is external to the organization and that need Data Quality and proper data governance applied to their Hadoop data.
    16. RedPoint benefits and value
        FEATURES
        • Pure YARN, no MapReduce
        • Graphical UI, not code-based
        • All DQ/DI functions available
        • Executes in Hadoop, no data movement
        • Zero footprint install, nothing in the cluster
        • Same product for Hadoop and database
        • Top rated for ease-of-use
        BENEFITS
        • Overcomes the skills gap
        • Existing staff can start working now
        • No compromises in quality or integration for data in Hadoop
        • Minimizes extra computing resources
        • No time wasted moving data
        • No need for extra storage
        • Easy to integrate data from any source
        • Users can work across any/all data
        VALUE
        Makes Hadoop data management:
        • Faster
        • Easier
        • Less expensive
        • More effective
    17. For More Information on RedPoint
        • Visit us in booth P13
        • Download YARN article here: http://bit.ly/YARN-Article
        • Email: contact.us@redpoint.net
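
To make concrete what the benchmark on slide 13 computes, here is a minimal single-process sketch in plain Java of the word count that the MapReduce and Pig samples express: split each line on delimiters, upper-case each token (as the Pig script does with UPPER), count occurrences, and sort by count in descending order. It is an illustration only, not RedPoint's implementation and not a Hadoop job. The class name WordCountSketch, the method countWords, the ASCII-only delimiter set, and the alphabetical tie-break are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical, single-JVM illustration of the word count from slide 13.
final class WordCountSketch {
    // Assumption: whitespace plus an ASCII subset of the slide's delimiter set.
    private static final String DELIMITERS = " \t\r\n',./<>?;:\"[]{}-=_+()&*%^#$!@`~|";

    // Returns (word, occurrences) pairs, highest count first; ties are sorted by word.
    static List<Map.Entry<String, Integer>> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                // Upper-casing mirrors the UPPER(word) step of the Pig script.
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("The quick brown fox.", "The lazy dog; the end!");
        for (Map.Entry<String, Integer> entry : countWords(sample)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Running main prints one count and word per line for the two sample sentences. A distributed version additionally has to deal with input splitting, shuffling between map and reduce phases, and job configuration, none of which this sketch attempts.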