Hadoop: Do Data Warehousing rules apply?

With Hadoop entering the mainstream, can -- and should -- it benefit from best practices from the world of Data Warehousing? Should the same ground rules developed for capacity-constrained internal enterprise DWs apply to Hadoop data stores designed for scale-out, or for harvesting data from the Internet? We will pinpoint 3 key areas: data quality, privacy & confidentiality, and lifecycle management, addressing issues such as:

1. Does it make sense to apply traditional data cleansing practices to Hadoop data? Or will removing "errors" remove the possibility of discovering new insights?
2. Do different standards for privacy protection apply when harvesting sources such as social media that are already public? Should enterprises track their customers on Facebook or Twitter?
3. Will Hadoop make conventional data archiving practices obsolete? Is it cost-effective to "move" petabytes of data offline? Just because the Googles & Yahoos of the world retain all their data, should mainstream enterprises? Should Hadoop be considered the new tape?

Published in: Technology, Business


Transcript

  • 1. Hadoop: Do Data Warehousing rules apply? Tony Baer, tony.baer@ovum.com, June 14, 2012. © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
  • 2. Agenda
      § Challenges traditional data stewardship practice
      § Privacy – is all the world a stage?
      § Limits to data lifecycle?
      § Data quality: the big, the bad, the ugly – and it all might be good!
  • 3. Data stewardship challenges – What's old is new
      Remember?
      § Back to undifferentiated gobblobs of data
      § Programmatic access reigns
      § File systems, not (always) tables
      § Batch is back
      But…
      § Volume, variety, velocity, and where's the value?
      § Just because you can, should you?
      Samples shown on the slide:
          10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"
          192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,
          172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0,
          if index(tempvalue,'?') then tempvalue=scan(tempvalue,1,'?');
          else if index(tempvalue,'&')>1 then tempvalue=scan(tempvalue,1,'&');
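The first sample line on this slide appears to be in the Apache "combined" access-log layout. As a rough illustration of the programmatic access the slide mentions, the sketch below turns that raw line into named fields with a regular expression. The pattern, the field names, and the function name `parse_access_log_line` are assumptions made for this example and are not from the talk; the sample line is the slide's, with spaces introduced by slide extraction removed.

```python
import re

# Pattern for the Apache "combined" access-log layout (assumed here; the
# slide does not name the format). Field names are illustrative.
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields; "-" means no response body was sent.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


SAMPLE = (
    '10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] '
    '"GET /inventory/index.jsp HTTP/1.1" 200 4028 '
    '"http://www.mycompany.com/index.jsp" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

record = parse_access_log_line(SAMPLE)
print(record["host"], record["path"], record["status"], record["size"])
```

Lines that do not fit the pattern come back as `None` rather than raising, so a batch job can count and inspect them instead of discarding them silently.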
  • 4. Data stewardship questions for Big Data
      § Can we, should we control this data?
      § Are there limits to how much we should know?
      § Can we just keep piling up data forever?
      § Can we cleanse terabytes of data?
      § Do we still need good data?
  • 6. Privacy – the more things change…
      § "You have zero privacy anyway…. Get over it" -- Scott McNealy, 1999
      § "Facebook does not actually delete images… but instead merely removes the links – a fix is in sight" -- ZDNet, 2/6/12
      § "Facebook agrees to 20 years of federal privacy audits" -- NY Times, 11/29/11
  • 7. What privacy?
      "Florida made $63m last year by selling DMV information (name, date of birth, type of vehicle driven) to companies like LexisNexis & Shadow Soft." -- Terence Craig & Mary Ludloff, Privacy and Big Data (O'Reilly Media, 2011)
  • 8. Big Data privacy 101 – Don't be creepy
      § Governance problem first, technology second
      § Understand the relationship with your customers & business partners
      § Keep communications in context
      § Don't catch your customers by surprise
      § The law still trying to catch up
      Sidebar, "How Companies Learn Your Secrets": "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?" -- NY Times, 2/16/12
  • 10. Data lifecycle – How long can this go on?
      § Google, Yahoo, Facebook, etc. don't deprecate web data
      § Hadoop designed for economical scale-out
      § Moore's Law, declining cost of storage
      § Is Hadoop Archive the answer?
      § Is Hadoop the new tape?
      Management & skills will be the limit
      [Image: aerial view of Quincy, WA data centers]
  • 12. Data Quality & Hadoop – Big Quality Questions
      § Can we cleanse terabytes of data?
      § Do we still need good data?
      § Are there new approaches to cleansing Big Data?
  • 13. Framing the issue
      § Garbage in, garbage out, but DW forced the issue
      § Traditional approaches
          § Profiling, cleansing, MDM
      § DW vs. Hadoop data quality challenges
          § Known data sets & known criteria vs. vaguely known
          § Bounded vs. less bounded tasks
      § Limitations of MapReduce*
          § Cleansing & transformation within a single Map operation
          § Profiling & matching of unstructured data
          § Matching of data in operations without inter-process communications
      *Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at http://www.dataroundtable.com/?p=8841
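One way to read the MapReduce bullets on this slide is that record-level cleansing fits inside a map step, because each record can be standardized on its own, while matching needs records that share a key to be brought together first. The sketch below imitates that split in plain Python rather than on Hadoop. The record layout (`name,email`), the use of the email as match key, and all function names are assumptions for illustration only.

```python
from collections import defaultdict


def clean_record(raw):
    """Map-side cleansing: standardize one record using only that record."""
    name, email = (field.strip() for field in raw.split(","))
    return {"name": " ".join(name.split()).title(), "email": email.lower()}


def map_phase(lines):
    # Emit (key, record) pairs; the email is a hypothetical match key.
    for line in lines:
        record = clean_record(line)
        yield record["email"], record


def shuffle(pairs):
    # Stand-in for the framework's shuffle step: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Matching needs every record that shares a key, so it runs after grouping.
    return {
        key: {**records[0], "duplicates": len(records) - 1}
        for key, records in groups.items()
    }


LINES = [
    "  ada  lovelace , Ada@Example.com",
    "Ada Lovelace,ada@example.com",
    "alan turing,alan@example.com",
]

merged = reduce_phase(shuffle(map_phase(LINES)))
for email, record in merged.items():
    print(email, record["name"], record["duplicates"])
```

The cleansing function never looks at a second record, which is why it can run in parallel across splits. The duplicate count only becomes available once the grouping has moved matching records to the same place.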
  • 14. Is data quality necessary for Hadoop?
      § The App
          § How mission-critical?
          § Regulatory compliance impacts?
          § What degree of business impact?
      § The Data
          § The 4 V's (volume, variety, velocity, value) determine what approaches to quality are feasible
  • 15. Examples
      § Web ad placement optimization
      § Counter-party risk management for capital markets
      § Customer sentiment analysis
      § Managing smart utility grids or urban infrastructure
  • 16. Bad data may be good
      § Sensory data
          § Outlier or drift?
          § Time to recalibrate devices?
          § Time to perform preventive maintenance?
          § Are new/unaccounted environmental factors skewing readings?
      § Human-readable data
          § Flawed concept of reality?
          § Flawed assumptions on data meaning?
          § Changes producing new norm
  • 17. Big Data quality in Hadoop – Emergent approaches
      § Crowdsourcing data
          § Collect data far & wide from as many diverse sources as possible. Torrents of data overcome the noise.
          § Comparative trend analysis of incoming streams to dynamically ID the norm or sweet spot of good data
      § Apply data science to correct the dots
          § Don't go record by record. Statistically analyze the data set in aggregate.
          § Iteratively analyze & re-analyze nature of data, keep analyzing outliers
          § Apply off-the-wall approaches
      § Enterprise Architectural approach
          § Semantic (domain) model-driven
          § Apply cleansing logic at run time
          § Critical for sensitive, regulatory-driven apps
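The "Don't go record by record" bullet can be illustrated with a small aggregate check. The sketch below is one possible approach, not the method from the talk: it computes the median and the median absolute deviation (MAD) over the whole data set, then flags values whose modified z-score is large. The function name `flag_outliers` and the sample readings are hypothetical; the 0.6745 constant and the 3.5 cutoff are the conventional choices for the modified z-score.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The statistics come from the data set as a whole, so no per-record
    validation rules are needed. Flagged values are returned, not deleted,
    so they stay available for further analysis.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread in the bulk of the data; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical sensor readings with one suspicious value.
READINGS = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.7, 20.1]

print(flag_outliers(READINGS))  # prints [35.7]
```

Median and MAD are used here instead of mean and standard deviation because a single extreme reading shifts them far less, so the "norm" is set by the bulk of the data. Re-running the check as new data arrives gives a simple form of the iterative re-analysis the slide describes.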
  • 18. Summary
      § Challenges traditional data stewardship practice
          § Combination of old & new
      § Privacy – is all the world a stage?
          § Best practices, legal requirements still in flux
          § Don't be creepy!
      § Limits to data lifecycle?
          § Few enterprises are Google or Facebook
          § Ability to manage large infrastructure will be major limit
      § Data quality
          § Strategy depends on type of app & data set(s)
          § A spectrum of approaches -- from none to classic ETL to aggregate statistical
          § No single silver bullet
  • 19. Disclaimer
      All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Ovum (an Informa business). The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.