
Hadoop 2.0 - Solving the Data Quality Challenge

The Briefing Room with Dr. Claudia Imhoff and RedPoint Global
Live Webcast on July 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7bb4cbc33402c3b5f649343052cb9a6d

Whether data is big or small, quality remains the critical characteristic. While traditional approaches to cleansing data have made strides, data quality remains a serious hurdle for all organizations. This is especially true for identity resolution in customer data, but also for a range of other data sets, including social, supply chain, financial and other domains. One of the most promising approaches for solving this decades-old challenge incorporates the power of massively parallel processing, a la Hadoop.

Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Claudia Imhoff, who will explain how Hadoop 2.0 and its YARN architecture can make a serious impact on the previously intractable problem of data quality. She’ll be briefed by George Corugedo of RedPoint Global, who will show how his company’s platform can serve as a super-charged marshaling area for accessing, cleansing and delivering high-quality data. He’ll explain how RedPoint was one of the first applications to be certified for running on YARN, which is the latest rendition of the now-ubiquitous Hadoop.

Visit InsideAnalysis.com for more information.

Published in: Technology

  1. 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  2. 2. Hadoop 2.0: Solving the Data Quality Challenge The Briefing Room
  3. 3. Twitter Tag: #briefr The Briefing Room Welcome Host: Eric Kavanagh eric.kavanagh@bloorgroup.com @eric_kavanagh
  4. 4. The Briefing Room Mission • Reveal the essential characteristics of enterprise software, good and bad • Provide a forum for detailed analysis of today’s innovative technologies • Give vendors a chance to explain their product to savvy analysts • Allow audience members to pose serious questions... and get answers!
  5. 5. The Briefing Room Topics • This Month: INNOVATIVE TECHNOLOGY • August: BIG DATA ECOSYSTEM • September: INTEGRATION • 2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
  6. 6. Analyst: Dr. Claudia Imhoff. Claudia Imhoff is President & Founder of Intelligent Solutions, Inc.
  7. 7. RedPoint Global • RedPoint Global is a data management and integrated marketing technology company. • Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration. • RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
  8. 8. Guest: George Corugedo. George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
  9. 9. The Neglected Discipline of Data Quality in Hadoop July 2014
  10. 10. Overview – Challenges to Adoption. Skills Gap: • Severe shortage of MR skilled resources • Very expensive resources and hard to retain • Inconsistent skills lead to inconsistent results • Underutilizes existing resources • Prevents broad leverage of investments across enterprise. Maturity & Governance: • A nascent technology ecosystem around Hadoop • Emerging technologies only address narrow slivers of functionality • New applications are not enterprise class • Legacy applications have built short-term capabilities. Data Into Information: • Data is not useful in its raw state; it must be turned into information • Benefit of Hadoop is that the same data can be used from many perspectives • Analysts must now do the structuring of the data based on intended use of the data. © RedPoint Global Inc. 2014 Confidential
  11. 11. Key Points to Cover Today • Broad functionality across data processing domains • Validated ease of use, speed, match quality and party data superiority • Hadoop 2.0/YARN certified – 1 of first 17 companies to do so • Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications) • Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in MapReduce and requires none of the coding or Java skills. • Big functional footprint without touching a line of code • Design model consistent with data flow paradigm • RPDM has a “Zero-Footprint” install in the Hadoop cluster • The same interface and functionality is available for both structured and unstructured databases, so working across both is seamless from a user’s perspective. • Data quality done completely within the cluster
  12. 12. Key features of RedPoint Data Management • ETL & ELT: profiling, reads/writes, transformations; single project for all jobs • Data Quality: cleanse data; parsing, correction; geo-spatial analysis • Integration & Matching: grouping; fuzzy match • Master Key Management: create keys; track changes; maintain matches over time • Web Services Integration: consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats • Process Automation & Operations: job scheduling, monitoring, notifications; central point of control. All functions can be used on both TRADITIONAL and BIG DATA. Creates clean, integrated, actionable data – quickly, reliably and at low cost
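The slide lists "fuzzy match" as a feature but does not say how RedPoint implements it. As a rough illustration of the general idea only, the sketch below scores two strings with a normalized Levenshtein edit distance and calls them a match above a threshold. The class name `FuzzyMatchSketch`, the normalization rules and the 0.85 threshold are all assumptions made for this example, not anything taken from the product.

```java
import java.util.Locale;

// Illustrative fuzzy matcher. This is NOT RedPoint's algorithm, which the slides do not describe.
class FuzzyMatchSketch {

    // Levenshtein edit distance computed with two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: trim, lower-case, collapse runs of whitespace.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Similarity in [0, 1]: 1 minus the edit distance divided by the longer length.
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / max;
    }

    // Two values "match" when their similarity reaches the caller-supplied threshold.
    static boolean isMatch(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(isMatch("Jon Smith", "  jon   smyth ", 0.85)); // one substitution apart
        System.out.println(isMatch("Jon Smith", "Jane Doe", 0.85));
    }
}
```

Production matching engines usually combine several signals (phonetic keys, address standardization, blocking to avoid comparing every pair); a single edit-distance threshold is only the simplest possible starting point.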
  13. 13. RedPoint Data Management on Hadoop. Diagram labels: Parallel Section (UI); Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; YARN; MapReduce
  14. 14. RedPoint Functional Footprint. Diagram labels: Monitoring and Management Tools; AMBARI; DATA REFINEMENT; PIG; HIVE; MAPREDUCE; REST; HTTP; STREAM; STRUCTURE; HCATALOG (metadata services); INTERACTIVE; HIVE Server2; LOAD; SQOOP; WebHDFS; Flume; SQOOP/Hive; NFS; YARN; HDFS; Query/Visualization/Reporting/Analytical Tools and Apps; SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory; Data Sources: DBs, Files, JMS Queues, RDBMS, EDW
  15. 15. RedPoint Benchmarks – Project Gutenberg
      Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
          public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
              // Characters treated as token separators; the embedded double quote must be escaped.
              private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              // Emit (token, 1) for every token in the input line.
              public void map(WordOffset key, Text value, Context context)
                      throws IOException, InterruptedException {
                  String line = value.toString();
                  StringTokenizer itr = new StringTokenizer(line, delimiters);
                  while (itr.hasMoreTokens()) {
                      word.set(itr.nextToken());
                      context.write(word, one);
                  }
              }
          }
      Sample Pig script without the UDF:
          SET pig.maxCombinedSplitSize 67108864
          SET pig.splitCombination true
          A = LOAD '/testdata/pg/*/*/*';
          B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
          C = FOREACH B GENERATE UPPER(word) AS word;
          D = GROUP C BY word;
          E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
          F = ORDER E BY occurrences DESC;
          STORE F INTO '/user/cleonardi/pg/pig-count';
      Comparison:
          MapReduce: >150 lines of MR code; 6 hours of development; 6 minutes runtime; extensive optimization needed.
          Pig: ~50 lines of script code; 3 hours of development; 15 minutes runtime; User Defined Functions required prior to running script.
          RedPoint: 0 lines of code; 15 min. of development; 3 minutes runtime; no tuning or optimization required.
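To see what the MapReduce and Pig samples on this slide compute without setting up a cluster, the same logic (tokenize, upper-case, count, sort by descending count) can be sketched in plain single-process Java. This is only an illustration: the class name `WordCountSketch` and the shortened delimiter set are assumptions made for the example, and it says nothing about the benchmark timings quoted on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Single-process sketch of the word-count logic; no Hadoop or Pig dependency.
class WordCountSketch {

    // Assumed, shortened delimiter set (the slide's list also includes extended characters).
    static final String DELIMITERS = " \t\n\r',./<>?;:\"[]{}-=_+()&*%^#$!@`~|";

    // Tokenize each line, upper-case every token (as the Pig script does) and count occurrences.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken().toUpperCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order by descending count; ties are broken alphabetically so the output is deterministic.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The quick brown fox.", "The lazy dog; the end!");
        for (Map.Entry<String, Integer> e : sortedByCount(count(lines))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

With the two sample lines in `main`, `THE` is counted three times and every other word once. The distributed versions on the slide do the same work per input split and then merge the partial counts, which is where the extra code and tuning effort go.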
  16. 16. Attributes of Information. RELEVANT: Information must pertain to a specific problem. General data must be connected to reveal relevance of the information. COMPLETE: Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all. ACCURATE: This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information. CURRENT: As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale. ECONOMICAL: There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives the use if successful.
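Two of the attributes on this slide, COMPLETE and CURRENT, lend themselves to simple automated checks. The sketch below is one possible way to measure them, assuming a record is just a list of required field values plus a last-updated date. The class name `InformationChecksSketch`, the method names and the 365-day limit are hypothetical choices for illustration, not something the presenters specify.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;

// Hypothetical checks for two of the slide's attributes; all thresholds are assumptions.
class InformationChecksSketch {

    // COMPLETE: share of required field values that are present and non-blank.
    static double completeness(List<String> requiredFieldValues) {
        if (requiredFieldValues.isEmpty()) {
            return 1.0; // nothing required, so nothing is missing
        }
        long filled = requiredFieldValues.stream()
                .filter(v -> v != null && !v.trim().isEmpty())
                .count();
        return (double) filled / requiredFieldValues.size();
    }

    // CURRENT: true when the record was updated no more than maxAgeDays before asOf.
    static boolean isCurrent(LocalDate lastUpdated, LocalDate asOf, long maxAgeDays) {
        return ChronoUnit.DAYS.between(lastUpdated, asOf) <= maxAgeDays;
    }

    public static void main(String[] args) {
        // Arrays.asList is used because it permits null elements.
        List<String> record = Arrays.asList("Ann", null, " ", "Boulder");
        System.out.println(completeness(record));
        System.out.println(isCurrent(LocalDate.of(2014, 1, 1), LocalDate.of(2014, 7, 22), 365));
    }
}
```

RELEVANT, ACCURATE and ECONOMICAL are harder to score mechanically, since they depend on the business question and on a trusted reference to compare against.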
  17. 17. Reference Architecture for Matching in Hadoop. Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback
  18. 18. RedPoint DM for Hadoop: Processing Flow. Diagram labels: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; Node Manager (three instances); DM App Master; DM Task (several instances); Launches DM App Master; Launches Tasks; Running DM Task; steps numbered 1, 2, 3
  19. 19. The Data Management designer
  20. 20. DM Hadoop Settings
  21. 21. DM Parallel Section on Hadoop
  22. 22. Who Should Care • Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started. • Companies already investing heavily in Big Data Analytics technologies but stuck due to the shortage of skilled resources. • Large organizations that are focused on “Operational Offloading” and need to achieve it cost-effectively. • Companies that recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
  23. 23. RedPoint Convergent Marketing Ecosystem Data Inputs No SQL Social SQL Enhancement Mobile Social Digital RedPoint Interaction Segmentation Inbox Analysis Attribution GIS Marketing Rules Engine CRM Trigger Audience Offer RedPoint Data Management Machine Learning Analytics Email Address Std. Web Services Geocoding Real Time Cache Marketing Operations Analytics Hadoop
  24. 24. RedPoint real-time decisions: how it works (web site example). Diagram labels: web site; personalization opportunity; API call; CONTENT NEEDED; personalized content; REDPOINT EXECUTION ENVIRONMENT; profile data; context data; real-time profile; Machine Learning; rules; winning content; candidate content with associated eligibility & scoring rules; content stored in RedPoint, or RedPoint points to content in CMS or other system; RedPoint update/maintain over time; inbound personalizations combined with outbound contacts to create cross-channel interaction history; 1st Party Customer data in database(s) and/or Hadoop
  25. 25. RedPoint vs. alternatives (each row carries a check mark and a cross) • Pure YARN, no MapReduce • Graphical UI, not code-based • Top rated for ease-of-use • All DQ/DI functions available • Executes in Hadoop, no data movement • Zero footprint install, nothing in the cluster • Same product for Hadoop and database
  26. 26. Perceptions & Questions. Analyst: Dr. Claudia Imhoff
  27. 27. Data Quality in the Hadoop Age Solve your business puzzles with Intelligent Solutions By Claudia Imhoff, PhD Intelligent Solutions, Inc. Boulder BI Brain Trust Claudia@BBBT.US SPONSORED BY HOSTED BY Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
  28. 28. Claudia Imhoff, President and Founder, Intelligent Solutions, Inc. A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines. She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us. Email: claudia@bbbt.us Phone: 303-444-6650 Twitter: Claudia_Imhoff
  29. 29. Agenda § Extending the Data Warehouse Architecture § Things to Ponder…
  30. 30. Next Generation BI. Based on a concept by Shree Dandekar of Dell. Slide compliments of Colin White – BI Research, Inc. Diagram labels: New business insights; Reduced costs; New technologies; Enhanced data management; Advanced analytics; New deployment options; DRIVERS; TECHNOLOGIES
  31. 31. Systems of Record § Remember – It all starts here! § Transactional systems generate most of the data used for all other activities – operational processes, BI & analytical capabilities, etc. § The point here is a reminder: § Extend OLTP systems of record as a “key” source of data § Many companies do not (or cannot) leverage data they already have in their operational systems. Diagram labels: Operational systems; RT BI services; Other internal & external structured & multi-structured data; Real-time streaming data
  32. 32. Next Generation – Extended Data Warehouse Architecture (XDW). Slide created by Colin White – BI Research, Inc. Diagram labels: Analytic tools & applications; RT analysis platform; Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform; Operational real-time environment; Other internal & external structured & multi-structured data; Real-time streaming data; Operational systems; RT BI services
  33. 33. Use Case: Traditional EDW. Most BI environments today: § New technologies can be incorporated into the EDW environment to improve performance, efficiency & reduce costs. Use cases: § Production reporting (data quality sensitive) § Historical comparisons § Customer analysis (next best offer, segmentation, life-time value scores, churn analysis, etc.) § KPI calculations § Profitability analysis § Forecasting. Diagram labels: Analytic tools & applications; Traditional EDW environment; Data integration platform; Operational systems; RT BI services; real-time models & rules
  34. 34. Data Quality Needed § EDW is now the “production” analytical environment § Produces standard reports, comparisons, and analytics to be used as final word on situations § Data must be integrated as much as possible § Data must be run through data quality grist mill § There must be a full audit trail from source to ultimate report, analytic, etc.
  35. 35. Use Case: Data Refinery. Ingests raw detailed data in batch and/or real-time into managed data store (lake, hub, swamp, dump…). Distills the data into useful business information and distributes the results to downstream systems. May also directly analyze some data. Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost effectively. Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction. Diagram labels: Traditional EDW environment; Investigative computing platform; Data refinery; Data integration platform
  36. 36. Data Quality Needed § This is not a data dumping ground! § It should be monitored and assessed as to the data integration and quality needs § Just because you can store massive sets of data doesn’t mean it is ignored or assumed to not need governance § Nor does it mean that there is no need for a business case for the massive amount of data § If analytic accuracy is at 99% using 45% of the data, why deal with all of it? § But speed of integration and quality processing is also important
  37. 37. Use Case: Investigative Computing. New technologies used here include: § Hadoop, in-memory computing, columnar storage, data compression, appliances, etc. Use cases: § Data mining and predictive modeling for EDW and real-time environments § Cause and effect analysis § Data exploration (“Did this ever happen?” “How often?”) § Pattern analysis § General, unplanned investigations of data. Diagram labels: Operational systems; RT BI services; Analytic tools & applications; Investigative computing platform; Data refinery; Data integration platform; RT analysis platform; Operational real-time environment
  38. 38. Data Quality Needed § Much more experimental in nature – lots of queries with null results § Analytics may be approximations § Data integration may be needed for some data, not for others § Data quality also varies in terms of what data must go through DQ process § Difficulty is in determining what gets integrated and run through data quality processing
  39. 39. Use Case: Real Time Operational Environment. Embedded or callable BI services: § Real-time fraud detection § Real-time loan risk assessment § Optimizing online promotions § Location-based offers § Contact center optimization § Supply chain optimization. Real-time analysis engine: § Traffic flow optimization § Web event analysis § Natural resource exploration analysis § Stock trading analysis § Risk analysis § Correlation of unrelated data streams (e.g., weather effects on product sales). Diagram labels: RT analysis platform; Operational real-time environment; Other internal & external structured & multi-structured data; Real-time streaming data; Operational systems; RT BI services
  40. 40. Data Quality Needed § Because of operational nature, data must be as good as it can possibly be § Data may or may not be integrated with other operational systems’ data § False positives and negatives to models must be reconciled as quickly as possible § But speed of integration and quality processing is of the utmost importance!
  41. 41. All Components Must Work Together. Slide created by Colin White – BI Research, Inc. Diagram labels: Investigative computing platform; Analytic tools & apps; analytic models; analyses; Data refinery; Traditional EDW environment; Operational systems; existing customer data; next best customer offer; 3rd party data; location data; social data; feedback; RT analysis platform; call center dashboard or web event stream; Other internal & external structured & multi-structured data; Real-time streaming data
  42. 42. Agenda § Extending the Data Warehouse Architecture § Things to Ponder…
  43. 43. What Makes People Think These Have Gone Away? § Data Redundancy § Each system, application, and department in enterprise collects own version of key business entities and attributes § Data Inconsistency § Enormous resources (time, money, and people) spent in reconciliation because of fractured data § Business Inefficiency § Fractured data generates business inefficiency – low productivity, inefficient supply chain management, customer dissatisfaction, wasted marketing efforts § Business Change § Organizations are constantly changing and these disruptive events cause a constant stream of changes to data
  44. 44. Data Quality Challenges § Cultural Hurdles § Generating business case and obtaining executive backing and funding § Requires a phased approach to quality deployment § Overcoming political barriers § E.g., moving from enterprise view to LOB/parochial view of quality, yet still agreeing on common business definitions
  45. 45. Data Quality Challenges § Technology Challenges § Unusual sources of data § Creating a flexible data governance model § Supporting complex & constantly changing data § Providing a flexible data integration infrastructure § Wild West mentality…
  46. 46. Data Governance and Data Quality is Changing § People using BI must “trust” the data § IT must work with the business to create certified data sets § Note: not all data must be certified but all data usage must be documented and monitored § Governance still has an important role § Determine whether data used is “governed” (e.g., in a data warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external source) § Difficulty is figuring out differences – hence the need to monitor data usage § IT must have monitoring or oversight capability Note: LOB IT or experienced information producers may have to take on some previously traditional central IT roles
  47. 47. Questions § What are the biggest challenges for data quality in the Hadoop age? § How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”? § Not all data needs to be cleaned up and integrated but how do people determine what does and doesn’t? § What tips can you give us to help get the time, resources and funding for DQ in the refinery? § Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one? § Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?
  48. 48. Twitter Tag: #briefr The Briefing Room
  49. 49. The Briefing Room Upcoming Topics • This Month: INNOVATIVE TECHNOLOGY • August: BIG DATA ECOSYSTEM • September: INTEGRATION • www.insideanalysis.com/webcasts/the-briefing-room • 2014 Editorial Calendar at www.insideanalysis.com
  50. 50. THANK YOU for your ATTENTION!
