Hadoop:
      Do Data Warehousing rules apply?

    Tony Baer

    tony.baer@ovum.com

    June 14, 2012




1                              © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
Agenda



     §  Challenges traditional data stewardship practice

     §  Privacy – is all the world a stage?

     §  Limits to data lifecycle?

     §  Data quality: the big, the bad, the ugly – and it all might be good!




Data stewardship challenges –
    What's old is new

    Remember?

    § Back to undifferentiated gobblobs of data

    § Programmatic access reigns

    § File systems, not (always) tables

    § Batch is back

    Sample log data:

        10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"

        192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,
        172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0,

    Sample code:

        if index(tempvalue,'?') then tempvalue=scan(tempvalue,1,'?');
        else if index(tempvalue,'&')>1 then tempvalue=scan(tempvalue,1,'&');

    But…

    § Volume, variety, velocity, and where's the value??

    § Just because you can, should you?
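The log sample and the query-string stripping snippet on this slide could be handled programmatically along the following lines. This is a minimal Python sketch, not the presenter's code: the regular expression assumes the first sample is in the Apache combined log format, and `COMBINED`, `SAMPLE`, `parse_line`, and `strip_query` are hypothetical names introduced for illustration. `strip_query` only approximates the slide's snippet (which appears to be SAS) and may differ on edge cases such as a leading delimiter.

```python
import re

# Assumed layout: Apache "combined" log format, matching the first sample line.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The first log line from the slide, rejoined onto one line.
SAMPLE = (
    '10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] '
    '"GET /inventory/index.jsp HTTP/1.1" 200 4028 '
    '"http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def strip_query(value):
    """Approximate the slide's logic: cut at '?', else at a non-leading '&'."""
    if "?" in value:
        return value.split("?", 1)[0]
    if value.find("&") > 0:
        return value.split("&", 1)[0]
    return value


if __name__ == "__main__":
    record = parse_line(SAMPLE)
    print(record["host"], record["status"], strip_query(record["url"]))
```

Every field comes back as a string; turning them into typed columns is exactly the kind of work a warehouse schema would have forced up front.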


Data stewardship questions for Big Data


    §  Can we, should we control this data?

    §  Are there limits to how much we should know?

    §  Can we just keep piling up data forever?

    §  Can we cleanse terabytes of data?

    §  Do we still need good data?




Privacy –
    the more things change…

     "You have zero privacy anyway… Get over it"
        -- Scott McNealy, 1999

     "Facebook does not actually delete images… but instead merely removes the links – a fix is in sight"
        -- ZDNet, 2/6/12

     "Facebook agrees to 20 years of federal privacy audits"
        -- NY Times, 11/29/11



What privacy?



    "Florida made $63m last year by selling DMV information (name, date of birth, type of vehicle driven) to companies like LexisNexis & Shadow Soft."

    -- Terence Craig & Mary Ludloff, Privacy and Big Data (O’Reilly Media, 2011)




Big Data privacy 101 –
    Don't be creepy

    §  Governance problem first, technology second

    §  Understand the relationship with your customers & business partners

    §  Keep communications in context

    §  Don't catch your customers by surprise

    §  The law is still trying to catch up

    How Companies Learn Your Secrets

    "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?"
        -- NY Times 2/16/12

Data lifecycle –
     How long can this go on?

     §    Google, Yahoo, Facebook, etc. don't deprecate web data

     §    Hadoop designed for economical scale-out

     §    Moore's Law, declining cost of storage

     §    Is Hadoop Archive the answer?

     §    Is Hadoop the new tape?

     Management & skills will be the limit

     [Image: aerial view of Quincy, WA data centers]


Data Quality & Hadoop –
     Big Quality Questions

     §  Can we cleanse terabytes of data?

     §  Do we still need good data?

     §  Are there new approaches to cleansing Big Data?




Framing the issue

     §  Garbage in, garbage out, but DW forced the issue

     §  Traditional approaches
         §  Profiling, cleansing, MDM

     §  DW vs. Hadoop data quality challenges
         §  Known data sets & known criteria vs. vaguely known
         §  Bounded vs. less bounded tasks

     §  Limitations of MapReduce*
         §  Cleansing & transformation within a single Map operation
         §  Profiling & matching of unstructured data
         §  Matching of data in operations without inter-process communications

           *Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at
           http://www.dataroundtable.com/?p=8841
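To illustrate why matching is harder than record-level cleansing in a MapReduce-style job, here is a minimal sketch in plain Python (no Hadoop APIs are used). It assumes hypothetical two-field "name,email" records, and `clean_record` and `match_by_key` are illustrative names, not anything from the cited source. The first function needs only the record in front of it, as a Map operation would. The second has to see records together, which in MapReduce means grouping by a key, because tasks do not talk to each other.

```python
from collections import defaultdict


def clean_record(line):
    """Map-style step: normalize one 'name,email' record in isolation."""
    name, email = (field.strip() for field in line.split(",", 1))
    # Collapse repeated spaces and standardize case; no other record is needed.
    return " ".join(name.split()).title(), email.lower()


def match_by_key(records):
    """Reduce-style step: group cleaned records by email to surface duplicates."""
    groups = defaultdict(list)
    for name, email in records:
        groups[email].append(name)
    # Keep only keys that more than one record shares.
    return {email: names for email, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    raw = [
        "  ann   lee , Ann.Lee@Example.com",
        "Ann Lee,ann.lee@example.com",
        "Bo Chan, bo@example.com",
    ]
    cleaned = [clean_record(line) for line in raw]
    print(match_by_key(cleaned))
```

The sketch only finds exact key matches after cleansing; fuzzy matching of unstructured data does not reduce to a single grouping key so neatly.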


Is data quality necessary for Hadoop?


     §  The App
         §  How mission-critical?
         §  Regulatory compliance impacts?
         §  What degree of business impact?

     §  The Data
         §  The 4 Vs (volume, variety, velocity, value) determine what
             approaches to quality are feasible




Examples


     §    Web ad placement optimization

     §    Counter-party risk management
           for capital markets

     §    Customer sentiment analysis

     §    Managing smart utility grids or
           urban infrastructure




Bad data may be good


     §  Sensory data
         §  Outlier or drift?
         §  Time to recalibrate devices?
         §  Time to perform preventive
             maintenance?
         §  Are new/unaccounted environmental
             factors skewing readings?

     §  Human-readable data
         §  Flawed concept of reality?
         §  Flawed assumptions on data meaning?
         §  Changes producing new norm
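The "outlier or drift?" question for sensor readings can be made concrete with a small sketch. The approach below is one assumption among many possible ones, not a method from the talk: a reading outside a tolerance band is called an outlier when it stands alone, and drift when the mean of the recent window has also moved away from the baseline. The function name, the `baseline`, `tolerance`, and `window` parameters, and the sample values are all hypothetical.

```python
from statistics import mean


def classify_readings(readings, baseline, tolerance, window=5):
    """Label each reading 'ok', 'outlier', or 'drift' relative to a baseline."""
    labels = []
    for i, value in enumerate(readings):
        if abs(value - baseline) <= tolerance:
            labels.append("ok")
            continue
        # Look at the most recent `window` readings, including this one.
        recent = readings[max(0, i - window + 1): i + 1]
        # A shifted window mean suggests sustained drift, not a one-off spike.
        if len(recent) == window and abs(mean(recent) - baseline) > tolerance:
            labels.append("drift")
        else:
            labels.append("outlier")
    return labels


if __name__ == "__main__":
    temps = [20.0, 20.1, 35.0, 19.9, 20.0, 20.1, 20.0,
             22.5, 22.6, 22.7, 22.8, 22.9]
    print(classify_readings(temps, baseline=20.0, tolerance=1.0))
```

A "drift" label is a prompt to ask the slide's questions (recalibrate, maintain, or look for a new environmental factor), not a reason to discard the data.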


Big Data quality in Hadoop –
     Emergent approaches

     §    Crowdsourcing data –
            §  Collect data far & wide from as many diverse sources as possible. Torrents of data overcome the noise.
            §  Comparative trend analysis of incoming streams to dynamically ID the norm or sweet spot of good data
     §    Apply data science to correct the dots
            §  Don't go record by record. Statistically analyze the data set in aggregate.
            §  Iteratively analyze & re-analyze nature of data, keep analyzing outliers
            §  Apply off-the-wall approaches
     §    Enterprise Architectural approach
            §  Semantic (domain) model-driven
            §  Apply cleansing logic at run time
            §  Critical for sensitive, regulatory-driven apps
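As one possible reading of "statistically analyze the data set in aggregate", the sketch below flags suspect values using the median and the median absolute deviation (MAD) of the whole set, rather than applying a rule to each record in isolation. The technique, the conventional 3.5 cutoff for the modified z-score, the function name `flag_outliers`, and the sample values are assumptions chosen for illustration; the talk does not prescribe a specific statistic.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # MAD: the median distance from the median, robust to a few bad values.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


if __name__ == "__main__":
    readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 55.0]
    print(flag_outliers(readings))
```

Because the norm is derived from the data itself, the same function can be re-run as new data arrives, which fits the iterative "analyze and re-analyze" bullet above.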



Summary


     §    Challenges traditional data stewardship practice
            §  Combination of old & new
     §    Privacy – is all the world a stage?
            §  Best practices, legal requirements still in flux
            §  Don't be creepy!
     §    Limits to data lifecycle?
            §  Few enterprises are Google or Facebook
            §  Ability to manage large infrastructure will be major limit

     §    Data quality
            §  Strategy depends on type of app & data set(s)
            §  A spectrum of approaches, from none to classic ETL to aggregate statistical analysis
            §  No single silver bullet



Disclaimer


     All Rights Reserved.

     No part of this publication may be reproduced, stored in a retrieval system or
     transmitted in any form by any means, electronic, mechanical, photocopying,
     recording or otherwise, without the prior permission of the publisher, Ovum
     (an Informa business).

     The facts of this report are believed to be correct at the time of publication but
     cannot be guaranteed. Please note that the findings, conclusions and
     recommendations that Ovum delivers will be based on information gathered in
     good faith from both primary and secondary sources, whose accuracy we are not
     always in a position to guarantee. As such Ovum can accept no liability whatever
     for actions taken based on any information that may subsequently prove to be
     incorrect.




More Related Content

Viewers also liked

Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopRoman Nikitchenko
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeInside Analysis
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryDataWorks Summit/Hadoop Summit
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersDataWorks Summit/Hadoop Summit
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it DataWorks Summit/Hadoop Summit
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudDataWorks Summit/Hadoop Summit
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...RTTS
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasDataWorks Summit/Hadoop Summit
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
The Social Lifecycle: Consumer Insights to Improve Your Business
The Social Lifecycle: Consumer Insights to Improve Your BusinessThe Social Lifecycle: Consumer Insights to Improve Your Business
The Social Lifecycle: Consumer Insights to Improve Your BusinessHubSpot
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 

Viewers also liked (19)

Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with Hadoop
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data Discovery
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Extreme Analytics @ eBay
Extreme Analytics @ eBayExtreme Analytics @ eBay
Extreme Analytics @ eBay
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Building a Data Analytics PaaS for Smart Cities
Building a Data Analytics PaaS for Smart CitiesBuilding a Data Analytics PaaS for Smart Cities
Building a Data Analytics PaaS for Smart Cities
 
The Social Lifecycle: Consumer Insights to Improve Your Business
The Social Lifecycle: Consumer Insights to Improve Your BusinessThe Social Lifecycle: Consumer Insights to Improve Your Business
The Social Lifecycle: Consumer Insights to Improve Your Business
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 

Similar to Hadoop do data warehousing rules apply

EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...European Data Forum
 
Making Big Data a First Class citizen in the enterprise
Making Big Data a First Class citizen in the enterpriseMaking Big Data a First Class citizen in the enterprise
Making Big Data a First Class citizen in the enterpriseTony Baer
 
Getting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersGetting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersDatameer
 
Action from Insight - Joining the 2 Percent Who are Getting Big Data Right
Action from Insight - Joining the 2 Percent Who are Getting Big Data RightAction from Insight - Joining the 2 Percent Who are Getting Big Data Right
Action from Insight - Joining the 2 Percent Who are Getting Big Data RightStampedeCon
 
Fraud webinar - Prevention & Risk Management
Fraud webinar - Prevention & Risk ManagementFraud webinar - Prevention & Risk Management
Fraud webinar - Prevention & Risk ManagementFernando Mesa
 
Big Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your DataBig Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your DataKai Wähner
 
The Failure of Information Security Classification: A New Model is Afoot!
The Failure of Information Security Classification: A New Model is Afoot!The Failure of Information Security Classification: A New Model is Afoot!
The Failure of Information Security Classification: A New Model is Afoot!InnoTech
 
Mark logic ediscovery and governance v1
Mark logic ediscovery and governance v1Mark logic ediscovery and governance v1
Mark logic ediscovery and governance v1Fernando Mesa
 
A Buyer\'s Guide - What to look for in online backup and recovery services - ...
A Buyer\'s Guide - What to look for in online backup and recovery services - ...A Buyer\'s Guide - What to look for in online backup and recovery services - ...
A Buyer\'s Guide - What to look for in online backup and recovery services - ...jatabq
 
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...Dr. Haxel Consult
 
Veritas corporate brochure emea
Veritas corporate brochure emeaVeritas corporate brochure emea
Veritas corporate brochure emeaHayatollah Ayoubi
 
Big data introduction
Big data introductionBig data introduction
Big data introductionChirag Ahuja
 
Information Management As Emerging Discipline 20040329
Information Management As Emerging Discipline 20040329Information Management As Emerging Discipline 20040329
Information Management As Emerging Discipline 20040329iain heron
 
Mobile Workplace Risks
Mobile Workplace RisksMobile Workplace Risks
Mobile Workplace RisksParag Deodhar
 
DAMA Webinar: What Does "Manage Data Assets" Really Mean?
DAMA Webinar: What Does "Manage Data Assets" Really Mean?DAMA Webinar: What Does "Manage Data Assets" Really Mean?
DAMA Webinar: What Does "Manage Data Assets" Really Mean?DATAVERSITY
 
What are some Real-Life Challenges of Big Data? | JanBask Training
What are some Real-Life Challenges of Big Data? | JanBask TrainingWhat are some Real-Life Challenges of Big Data? | JanBask Training
What are some Real-Life Challenges of Big Data? | JanBask TrainingJanBask Training
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationDoug Denton
 

Similar to Hadoop do data warehousing rules apply (20)

EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
EDF2013: Invited Talk Daragh O'Brien: The Story of Maturity – How data in Bus...
 
Making Big Data a First Class citizen in the enterprise
Making Big Data a First Class citizen in the enterpriseMaking Big Data a First Class citizen in the enterprise
Making Big Data a First Class citizen in the enterprise
 
Getting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersGetting Started with Big Data for Business Managers
Getting Started with Big Data for Business Managers
 
Action from Insight - Joining the 2 Percent Who are Getting Big Data Right
Action from Insight - Joining the 2 Percent Who are Getting Big Data RightAction from Insight - Joining the 2 Percent Who are Getting Big Data Right
Action from Insight - Joining the 2 Percent Who are Getting Big Data Right
 
Fraud webinar - Prevention & Risk Management
Fraud webinar - Prevention & Risk ManagementFraud webinar - Prevention & Risk Management
Fraud webinar - Prevention & Risk Management
 
Big Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your DataBig Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your Data
 
The Failure of Information Security Classification: A New Model is Afoot!
The Failure of Information Security Classification: A New Model is Afoot!The Failure of Information Security Classification: A New Model is Afoot!
The Failure of Information Security Classification: A New Model is Afoot!
 
Mark logic ediscovery and governance v1
Mark logic ediscovery and governance v1Mark logic ediscovery and governance v1
Mark logic ediscovery and governance v1
 
A Buyer\'s Guide - What to look for in online backup and recovery services - ...
A Buyer\'s Guide - What to look for in online backup and recovery services - ...A Buyer\'s Guide - What to look for in online backup and recovery services - ...
A Buyer\'s Guide - What to look for in online backup and recovery services - ...
 
Big data primer
Big data primerBig data primer
Big data primer
 
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...
IC-SDV 2019: The Economics of Artificial Intelligence and Machine Learning fo...
 
Veritas corporate brochure emea
Veritas corporate brochure emeaVeritas corporate brochure emea
Veritas corporate brochure emea
 
Big data introduction
Big data introductionBig data introduction
Big data introduction
 
Information Management As Emerging Discipline 20040329
Information Management As Emerging Discipline 20040329Information Management As Emerging Discipline 20040329
Information Management As Emerging Discipline 20040329
 
Mobile Workplace Risks
Mobile Workplace RisksMobile Workplace Risks
Mobile Workplace Risks
 
DAMA Webinar: What Does "Manage Data Assets" Really Mean?
DAMA Webinar: What Does "Manage Data Assets" Really Mean?DAMA Webinar: What Does "Manage Data Assets" Really Mean?
DAMA Webinar: What Does "Manage Data Assets" Really Mean?
 
What are some Real-Life Challenges of Big Data? | JanBask Training
What are some Real-Life Challenges of Big Data? | JanBask TrainingWhat are some Real-Life Challenges of Big Data? | JanBask Training
What are some Real-Life Challenges of Big Data? | JanBask Training
 
IDOL presentation
IDOL presentationIDOL presentation
IDOL presentation
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentation
 
Ayala mar23
Ayala mar23Ayala mar23
Ayala mar23
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber

Hadoop: Do Data Warehousing rules apply?

  • 1. Hadoop: Do Data Warehousing rules apply? Tony Baer, tony.baer@ovum.com, June 14, 2012. © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
  • 2. Agenda § Challenges traditional data stewardship practice § Privacy – is all the world a stage? § Limits to data lifecycle? § Data quality: the big, the bad, the ugly – and it all might be good!
  • 3. Data stewardship challenges – What's old is new. Remember? § Back to undifferentiated gobblobs of data § Programmatic access reigns § File systems, not (always) tables § Batch is back. But… § Volume, variety, velocity, and where's the value? § Just because you can, should you? Sample data shown on the slide: 10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)" | 192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -, 172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0, | if index(tempvalue,'?') then tempvalue=scan(tempvalue,1,'?'); else if index(tempvalue,'&')>1 then tempvalue=scan(tempvalue,1,'&');
  • 4. Data stewardship questions for Big Data § Can we, should we control this data? § Are there limits to how much we should know? § Can we just keep piling up data forever? § Can we cleanse terabytes of data? § Do we still need good data?
  • 5. Use of repeated table of contents page § Challenges traditional data stewardship practice § Privacy – is all the world a stage? § Limits to data lifecycle? § Data quality: the big, the bad, the ugly – and it all might be good!
  • 6. Privacy – the more things change… "You have zero privacy anyway…. Get over it" -- Scott McNealy, 1999. "Facebook does not actually delete images… but instead merely removes the links – a fix is in sight" -- ZDNet, 2/6/12. "Facebook agrees to 20 years of federal privacy audits" -- NY Times, 11/29/11.
  • 7. What privacy? "Florida made $63m last year by selling DMV information (name, date of birth, type of vehicle driven) to companies like LexisNexis & Shadow Soft." -- Terence Craig & Mary Ludloff, Privacy and Big Data (O'Reilly Media, 2011)
  • 8. Big Data privacy 101 – Don't be creepy § Governance problem first, technology second § Understand the relationship with your customers & business partners § Keep communications in context § Don't catch your customers by surprise § The law is still trying to catch up. Sidebar, "How Companies Learn Your Secrets": "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?" -- NY Times, 2/16/12
  • 9. Use of repeated table of contents page § Challenges traditional data stewardship practice § Privacy – is all the world a stage? § Limits to data lifecycle? § Data quality: the big, the bad, the ugly – and it all might be good!
  • 10. Data lifecycle – How long can this go on? § Google, Yahoo, Facebook, etc. don't deprecate web data § Hadoop designed for economical scale-out § Moore's Law, declining cost of storage § Is Hadoop Archive the answer? § Is Hadoop the new tape? Management & skills will be the limit. [Image: aerial view of Quincy, WA data centers]
  • 11. Use of repeated table of contents page § Challenges traditional data stewardship practice § Privacy – is all the world a stage? § Limits to data lifecycle? § Data quality: the big, the bad, the ugly – and it all might be good!
  • 12. Data Quality & Hadoop – Big Quality Questions § Can we cleanse terabytes of data? § Do we still need good data? § Are there new approaches to cleansing Big Data?
  • 13. Framing the issue § Garbage in, garbage out, but DW forced the issue § Traditional approaches: profiling, cleansing, MDM § DW vs. Hadoop data quality challenges: known data sets & known criteria vs. vaguely known; bounded vs. less bounded tasks § Limitations of MapReduce*: cleansing & transformation within a single Map operation; profiling & matching of unstructured data; matching of data in operations without inter-process communications. *Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at http://www.dataroundtable.com/?p=8841
  • 14. Is data quality necessary for Hadoop? § The App: How mission-critical? Regulatory compliance impacts? What degree of business impact? § The Data: The 4 Vs (volume, variety, velocity, value) determine what approaches to quality are feasible
  • 15. Examples § Web ad placement optimization § Counter-party risk management for capital markets § Customer sentiment analysis § Managing smart utility grids or urban infrastructure
  • 16. Bad data may be good § Sensory data: Outlier or drift? Time to recalibrate devices? Time to perform preventive maintenance? Are new/unaccounted environmental factors skewing readings? § Human-readable data: Flawed concept of reality? Flawed assumptions on data meaning? Changes producing new norm
  • 17. Big Data quality in Hadoop – Emergent approaches § Crowdsourcing data: Collect data far & wide from as many diverse sources as possible. Torrents of data overcome the noise. Comparative trend analysis of incoming streams to dynamically ID the norm or sweet spot of good data § Apply data science to correct the dots: Don't go record by record. Statistically analyze the data set in aggregate. Iteratively analyze & re-analyze nature of data, keep analyzing outliers. Apply off-the-wall approaches § Enterprise Architectural approach: Semantic (domain) model-driven. Apply cleansing logic at run time. Critical for sensitive, regulatory-driven apps
  • 18. Summary § Challenges traditional data stewardship practice: Combination of old & new § Privacy – is all the world a stage? Best practices, legal requirements still in flux. Don't be creepy! § Limits to data lifecycle? Few enterprises are Google or Facebook. Ability to manage large infrastructure will be major limit § Data quality: Strategy depends on type of app & data set(s). A spectrum of approaches -- from none to classic ETL to aggregate statistical. No single silver bullet
  • 19. Disclaimer All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Ovum (an Informa business). The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.
  • 20. Sessions will resume at 11:25am
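
Slide 17 suggests analyzing a data set statistically in aggregate rather than cleansing it record by record, and slide 16 notes that outliers may themselves be informative. As a rough illustration only, and not something taken from the presentation, the sketch below flags unusual readings with a modified z-score built on the median and the median absolute deviation. The function name `flag_outliers`, the sample readings, and the 3.5 cutoff (one common convention) are all assumptions made for this example.

```python
from statistics import median


def flag_outliers(readings, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    Hypothetical helper: the statistics are computed over the whole batch,
    so no per-record cleansing rules are needed.
    """
    med = median(readings)
    # Median absolute deviation: a spread estimate that a few extreme
    # values cannot easily distort.
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # At least half the values equal the median; nothing to scale by.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [
        (i, x)
        for i, x in enumerate(readings)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Made-up sensor readings with one suspicious value.
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9, 20.2]
outliers = flag_outliers(readings)
print(outliers)
```

Flagged values are returned rather than dropped, which leaves room for the follow-up questions slide 16 raises: is this an outlier, sensor drift, or a new norm?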