Transcript of "DataEd Online: Demystifying Big Data"
Copyright 2013 by Data Blueprint

Demystifying Big Data

Yes, we face a data deluge, and big data seems to be largely about how to deal with it. But much of what has been written about big data is focused on selling hardware and services. The truth is that until the concept of big data can be objectively defined, any measurements, claims of success, quantifications, etc. must be viewed skeptically and with suspicion. While both the need for and approaches to these new requirements are faced by virtually every organization, jumping into the fray ill-prepared has (to date) reproduced the same dismal IT project results.

Date: May 14, 2013
Time: 2:00 PM ET/11:00 AM PT
Presenter: Peter Aiken, Ph.D.

"Every century, a new technology - steam power, electricity, atomic energy, or microprocessors - has swept away the old world with a vision of a new one. Today, we seem to be entering the era of Big Data."
– Michael Coren
Get Social With Us!
• Live Twitter Feed: Join the conversation! Follow us: @datablueprint, @paiken. Ask questions and submit your comments: #dataed
• Like Us on Facebook: www.facebook.com/datablueprint. Post questions and comments. Find industry news, insightful content, and event updates.
• Join the Group: Data Management & Business Intelligence. Ask questions, gain insights, and collaborate with fellow data management professionals.
Demystifying Big Data
It's about separating the signal from the noise
Presented by Peter Aiken, Ph.D.
Meet Your Presenter: Peter Aiken, Ph.D.
• 30 years of experience in data management
  – Multiple international awards
  – Founder, Data Blueprint (http://datablueprint.com)
• 8 books and dozens of articles
• Experienced w/ 500+ data management practices in 20 countries
• Multi-year immersions with organizations as diverse as the US DoD, Deutsche Bank, Nokia, Wells Fargo, and the Commonwealth of Virginia
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
Bills of Mortality by Captain John Graunt
Mortality Geocoding
Where is it happening?
Plague Peak
When is it happening? ("Whereas of the Plague")
Black Rats, or Rattus Rattus
Why is it happening?
What will happen?
John Snow's 1854 Cholera Map of London
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
Data Inflation
Unit           | Size                  | What it means
Bit (b)        | 1 or 0                | Short for "binary digit", after the binary code (1 or 0) computers use to store and process data
Byte (B)       | 8 bits                | Enough information to create an English letter or number in computer code. It is the basic unit of computing
Kilobyte (KB)  | 1,000, or 2^10, bytes | From "thousand" in Greek. One page of typed text is 2KB
Megabyte (MB)  | 1,000KB; 2^20 bytes   | From "large" in Greek. The complete works of Shakespeare total 5MB. A typical pop song is about 4MB
Gigabyte (GB)  | 1,000MB; 2^30 bytes   | From "giant" in Greek. A two-hour film can be compressed into 1-2GB
Terabyte (TB)  | 1,000GB; 2^40 bytes   | From "monster" in Greek. All the catalogued books in America's Library of Congress total 15TB
Petabyte (PB)  | 1,000TB; 2^50 bytes   | All letters delivered by America's postal service this year will amount to around 5PB. Google processes around 1PB every hour
Exabyte (EB)   | 1,000PB; 2^60 bytes   | Equivalent to 10 billion copies of The Economist
Zettabyte (ZB) | 1,000EB; 2^70 bytes   | The total amount of information in existence this year is forecast to be around 1.2ZB
Yottabyte (YB) | 1,000ZB; 2^80 bytes   | Currently too big to imagine
The prefixes are set by an intergovernmental group, the International Bureau of Weights and Measures. Yotta and zetta were added in 1991; terms for larger amounts have yet to be established. Source: The Economist
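The decimal prefixes in the table above scale by factors of 1,000. As a quick illustration (not part of the original deck; the `humanize` helper is invented for this sketch), the following Python snippet converts a raw byte count into those units:

```python
# A small sketch that turns a raw byte count into the human-readable
# units listed in the table above, using decimal (factor-of-1,000) steps.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes: float) -> str:
    """Scale a byte count down by factors of 1,000 until it fits its unit."""
    for unit in UNITS[:-1]:
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} {UNITS[-1]}"  # anything larger is "too big to imagine"

# The Library of Congress example from the table: 15TB of catalogued books
print(humanize(15 * 1000**4))  # → 15.0 TB
```

The same loop with 1,024 in place of 1,000 would give the binary (2^10-per-step) interpretation the table also mentions.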
What do they mean, "big"?
"Every 2 days we create as much information as we did up to 2003" – Eric Schmidt
The number of things that can produce data is rapidly growing (smart phones, for example).
IP traffic will quadruple by 2015. – Asigra 2012
Google Search Results for the Term "Big Data"
Data from Google Trends. Source: Gartner (October 2012)
Number of Internet Pages Mentioning Big Data
Data from Google Trends. Source: Gartner (October 2012)
Increasingly, individuals make use of things' data-producing capabilities to perform services for them.
Examples
• IBM
  – 100 terabytes uploaded daily
  – $2.1 billion spent on mobile ads in 2011
  – 4.8 trillion ad impressions daily
• WIPRO
  – 2.9 million emails sent every second
  – 20 hours of video uploaded to YouTube every minute
  – 50 million tweets per day
• Asigra
  – 2.5 quintillion bytes created daily
  – 90% of data was created in the last two years
  – By 2015, 8 zettabytes will have been created
2012 London Summer Games
• 60 GB of data/second
• 200,000 hours of big data will be generated testing systems
• 2,000 hours of media coverage daily
• 845 million Facebook users averaging 15 TB/day
• 13,000 tweets/second
• 4 billion watching
• 8.5 billion devices connected
Sloan Management Review/Harvard Business Review
MIT Sloan Management Review, Fall 2012, p. 22, by Thomas H. Davenport, Paul Barth, and Randy Bean
Big Data (has something to do with Vs, doesn't it?)
• Volume – Amount of data
• Velocity – Speed of data in and out
• Variety – Range of data types and sources
  (2001, Doug Laney)
• Variability – Many options or variable interpretations confound analysis (2011, ISRC)
• Vitality – A dynamically changing big data environment in which analysis and predictive models must continually be updated as changes occur, to seize opportunities as they arrive (2011, CIA)
• Virtual – Scoping the discussion to only include online assets (2012, Courtney Lambert)
Nanex 1/2-Second Trading Data (May 2, 2013, Johnson & Johnson)
http://www.youtube.com/watch?v=LrWfXn_mvK8
The European Union last year approved a new rule mandating that all trades must exist for at least a half-second; in this instance, that is 1,200 orders and 215 actual trades.
History Flow: Wikipedia entry for the word "Islam"
Spatial Information Flow: New York Talk Exchange
"Some Far-out Thoughts on Computers" by Orrin Clotworthy
• Predicted the use of not just computing in the intelligence community
• Also forecast predictive analytics
• And the accompanying privacy challenges
Defining Big Data
• "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." – Gartner 2012
• "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." – IBM 2012
• "Big data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time." – Wikipedia
• "Shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions." – NY Times 2012
• "Big data is about putting the 'I' back into IT." – Peter Aiken 2007
• We have no objective definition of big data!
  – Any measurements, claims of success, quantifications, etc. must be viewed skeptically and with suspicion!
• Question: Would it be more useful to refer to "big data techniques"?
Big Data Techniques
• New techniques available to impact the productivity (by an order of magnitude) of any analytical insight cycle; they complement, enhance, or replace conventional (existing) analysis methods
• Big data techniques are currently characterized by:
  – Continuous, instantaneously available data sources
  – Non-von Neumann processing (defined later in the presentation)
  – Capabilities approaching or past human comprehension
  – Architecturally enhanceable identity/security capabilities
  – Other tradeoff-focused data processing
• So a good question becomes: "Where in our existing architecture can we most effectively apply big data techniques?"
What Questions Can Architectures Address?
Components: computers, human resources, communication facilities, software, management responsibilities, policies, directives, and rules, and data.
• How and why do the components interact?
• Where do they go?
• When are they needed?
• Why and how will the changes be implemented?
• What should be managed organization-wide and what should be managed locally?
• What standards should be adopted?
• What vendors should be chosen?
• What rules should govern the decisions?
• What policies should guide the process?
Organizational needs become instantiated and integrated into a data/information architecture. Information system requirements authorize and articulate that architecture and satisfy specific organizational needs.
Data architectures produce, and are made up of, information models that are developed in response to organizational needs.
Enterprises Spend $38 Million a Year on Data
• Worldwide spending on business information is now $1.1 trillion/year
• Enterprises spend an average of $38 million on information/year
• Small and medium-sized businesses spend $332,000 on average
http://www.cio.com.au/article/429681/five_steps_how_better_manage_your_data/
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
Gartner Five-Phase Hype Cycle
http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
1. Technology Trigger: A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven.
2. Peak of Inflated Expectations: Early publicity produces a number of success stories, often accompanied by scores of failures. Some companies take action; many do not.
3. Trough of Disillusionment: Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.
4. Slope of Enlightenment: More instances of how the technology can benefit the enterprise start to crystallize and become more widely understood. Second- and third-generation products appear from technology providers. More enterprises fund pilots; conservative companies remain cautious.
5. Plateau of Productivity: Mainstream adoption starts to take off. Criteria for assessing provider viability are more clearly defined. The technology's broad market applicability and relevance are clearly paying off.
Gartner Big Data Hype Cycle
"A focus on big data is not a substitute for the fundamentals of information management."
Big Data in Gartner's Hype Cycle
Technology Continues to Advance
• (Gordon) Moore's law
  – Over time, the number of transistors on integrated circuits doubles approximately every two years
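The doubling rule above is easy to turn into arithmetic: growth after n years is 2^(n/2). A minimal sketch (not from the deck; the `moores_law_factor` helper is invented for illustration):

```python
# Moore's law as stated on the slide: transistor counts double roughly
# every two years, so the growth multiple after n years is 2 ** (n / 2).
def moores_law_factor(years: float) -> float:
    """Growth multiple after `years`, assuming one doubling every two years."""
    return 2 ** (years / 2)

print(moores_law_factor(2))   # → 2.0   (one doubling)
print(moores_law_factor(10))  # → 32.0  (five doublings in a decade)
```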
Pick any two! And there are still tradeoffs to be made!
"There's now a blurring between the storage world and the memory world"
• Faster processors outstripped not only the hard disk, but main memory
  – Hard disk too slow
  – Memory too small
• Flash drives remove both bottlenecks
  – Combined, Apple and Yahoo have spent more than $500 million to date
• Make it look like traditional storage or more system memory
  – Minimum 10x improvements
  – Dragonstone server is 3.2 TB of flash memory (Facebook)
• Bottom line: new capabilities!
Non-von Neumann Processing/Efficiencies
• von Neumann bottleneck (computer science)
  – "An inefficiency inherent in the design of any von Neumann machine that arises from the fact that most computer time is spent in moving information between storage and the central processing unit rather than operating on it" [http://encyclopedia2.thefreedictionary.com/von+Neumann+bottleneck]
• Michael Stonebraker
  – Ingres (Berkeley/MIT)
  – Modern database processing is approximately 4% efficient
• Many "big data" architectures are attempts to address this, but:
  – It is a zero-sum game
  – They trade characteristics (such as reliability and predictability) against each other
  – Examples: Google/MapReduce/Bigtable, Amazon/Dynamo, Netflix/Chaos Monkey, Hadoop, McDipper
• Big data exploits non-von Neumann processing
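The slide name-checks MapReduce-style architectures (Google/MapReduce/Bigtable, Hadoop) as attempts to work around this bottleneck. As a toy illustration only, and not material from the deck, here is a word count in Python showing the map/shuffle/reduce pattern those systems distribute across many machines (`map_phase`, `shuffle`, and `reduce_phase` are illustrative names):

```python
# Toy sketch of the MapReduce pattern: per-record "map" work is
# embarrassingly parallel, a "shuffle" groups emitted values by key,
# and a "reduce" step aggregates the values for each key.
from collections import defaultdict

def map_phase(record: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, a word count)."""
    return key, sum(values)

records = ["big data big techniques", "data techniques data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # → {'big': 2, 'data': 3, 'techniques': 2}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the structure, however, is exactly this.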
Potential Tradeoffs
CAP theorem: consistency, availability, and partition (fault) tolerance
• RDBMSs sit on the consistency side; NoSQL systems favor availability and partition tolerance
• Small datasets can be both consistent and available
• ACID (RDBMS): Atomicity, Consistency, Isolation, Durability
• BASE (NoSQL): Basic Availability, Soft-state, Eventual consistency
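The BASE side of this tradeoff gives up strict consistency to stay available. As a hedged sketch (not from the deck; the `EventuallyConsistentKV` class is invented for illustration), the toy key-value store below shows what "eventual consistency" means in practice: a read from a lagging replica can briefly return stale data until replication catches up:

```python
# Toy model of eventual consistency: a write is acknowledged after
# updating one replica, so the other replica serves stale data until
# an anti-entropy pass propagates the change.
class Replica:
    def __init__(self):
        self.store = {}

class EventuallyConsistentKV:
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.pending = []  # replication log not yet applied everywhere

    def write(self, key, value):
        # Availability first: acknowledge after updating just one replica
        self.replicas[0].store[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].store.get(key)

    def propagate(self):
        # Anti-entropy: apply the log to every replica, then converge
        for key, value in self.pending:
            for replica in self.replicas:
                replica.store[key] = value
        self.pending.clear()

kv = EventuallyConsistentKV()
kv.write("mode", "big-data")
print(kv.read("mode", 0))  # → big-data  (fresh replica)
print(kv.read("mode", 1))  # → None      (stale replica, not yet updated)
kv.propagate()
print(kv.read("mode", 1))  # → big-data  (replicas have converged)
```

An ACID system would instead block or fail the write until every replica agreed, which is exactly the availability cost the CAP triangle describes.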
Potential Either/Or Tradeoffs
• SQL vs. big data
• Privacy vs. big data
• Security vs. big data?
• Massive / high-speed / flexible
Analytics Insight Cycle
"Sensemaking" techniques build on an existing knowledge base:
• Things are happening – Sensemaking techniques address "what" is happening
• Patterns/objects and hypotheses emerge – What can be observed?
• Operationalizing – The dots can be repeatedly connected
• Feedback carries combined/informed insights back into the cycle
Volume, velocity, and variety feed potential/actual insights; discernment and pattern/object emergence lead toward exploitable insight, with an analytical bottleneck along the way. "Big Data" contributions are shown in orange in the original diagram.
Some Big Data Limitations
• Data analysis struggles with the social
  – Your brain is excellent at social cognition; people can:
    • Mirror each other's emotional states
    • Detect uncooperative behavior
    • Assign value to things through emotion
  – Data analysis measures the quantity of social interactions but not the quality
    • It can map interactions with co-workers you see during work days
    • It can't capture devotion to childhood friends seen annually
  – When making (personal) decisions about social relationships, it's foolish to swap the amazing machine in your skull for the crude machine on your desk
• Data struggles with context
  – Decisions are embedded in sequences and contexts
  – Brains think in stories, weaving together multiple causes and multiple contexts
  – Data analysis is pretty bad at narratives, emergent thinking, and explaining
• Data creates bigger haystacks
  – More data leads to more statistically significant correlations
  – Most are spurious and deceive us
  – Falsity grows exponentially with the greater amounts of data we collect
• Big data has trouble with big problems
  – For example: the economic stimulus debate
  – No one has been persuaded by data to switch sides
• Data favors memes over masterpieces
  – It can detect when large numbers of people take an instant liking to some cultural product
  – Products are hated initially because they are unfamiliar
• Data obscures values
  – Data is never raw; it's always structured according to somebody's predispositions and values
$300 billion is the potential annual value to health care
Savings come from a variety of agreed-upon categories and values:
• Reduced hospital re-admissions
• Patient monitoring: inpatient, outpatient, emergency visits, and ICU
• Preventive care for ACOs
• Epidemiology
• Patient care quality and program analysis
Breakdown (in billions):
• $165 – Transparency in clinical data and clinical decision support
• $108 – Research & development
• $47 – Advanced fraud detection and performance-based drug pricing
• $9 – Public health surveillance/response systems
• $5 – Aggregation of patient records, online platforms, & communities
Reversing the Measures
• Currently:
  – Analysts spend 80% of their time manipulating data and 20% of their time analyzing data
  – Hidden productivity bottlenecks
• After rearchitecting:
  – Analysts spend less time manipulating data and more of their time analyzing data
  – Significant improvements in knowledge worker productivity
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
What do we teach business people about data?
What percentage of them deal with it daily?
What do we teach IT professionals about data?
• 1 course
  – How to build a new database
  – Yet 80% of IT expenses are used to improve existing IT assets
• What impressions do IT professionals get from this education?
  – Data is a technical skill that is used to develop new databases
• This is not the best way to educate IT and business professionals; data is every organization's sole non-depletable, non-degrading, durable, strategic asset
Application-Centric Development
Original articulation from Doug Bagley @ Walmart
Strategy → Goals/Objectives → Systems/Applications → Network/Infrastructure → Data/Information
• In support of strategy, the organization develops specific goals/objectives
• The goals/objectives drive the development of specific systems/applications
• Development of systems/applications leads to network/infrastructure requirements
• Data/information are typically considered after the systems/applications and network/infrastructure have been articulated
• Problems with this approach:
  – Ensures that data is formed around the application and not the information requirements
  – Processes are narrowly formed around applications
  – Very little data reuse is possible
Einstein Quote
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them."
– Albert Einstein
What does it mean to treat data as an organizational asset?
• Assets are economic resources
  – Must own or control
  – Must use to produce value
  – Value can be converted into cash
• "An asset is a resource controlled by the organization as a result of past events or transactions and from which future economic benefits are expected to flow to the organization" [Wikipedia]
• With assets:
  – Formalize the care and feeding of data (cash management, HR planning)
  – Put data to work in unique/significant ways
  – Identify data the organization will need [Redman 2008]
Data-Centric Development Flow
Original articulation from Doug Bagley @ Walmart
Strategy → Goals/Objectives → Data/Information → Network/Infrastructure → Systems/Applications
• In support of strategy, the organization develops specific goals/objectives
• The goals/objectives drive the development of specific data/information assets with an eye to organization-wide usage
• Network/infrastructure components are developed to support organization-wide use of data
• Development of systems/applications is derived from the data/network architecture
• Advantages of this approach:
  – Data/information assets are developed from an organization-wide perspective
  – Systems support organizational data needs and complement organizational process flows
  – Maximum data/information reuse
CDO Reporting
Top Job → Top Finance Job / Top Information Technology Job / Top Marketing Job / Top Operations Job; the Chief Data Officer heads the Data Governance Organization
• There is enough work to justify the function
• There is not much talent
• The CDO provides significant input to the Top Information Technology Job
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
Traditional Systems Life Cycle Challenges (projectcartoon.com)
• Original business concept
• As the consultant described it
• As the customer explained it
• How the project leader understood it
• How the programmer wrote it
• What the beta testers received
• What operations installed
• As accredited for operation
• When it was delivered
• How the project was documented
• How the help desk supported it
• How the customer was billed
• After patches were applied
• What the customer wanted
"Waterfall" Model of Systems Development
"Waterfall" Model (with iteration possible)
Spiral Model of Systems Development
Barry W. Boehm, "A Spiral Model of Software Development and Enhancement", ACM SIGSOFT Software Engineering Notes, August 1986, 11(4):14-24
"The major distinguishing feature of the spiral model is that it creates a risk-driven approach ... rather than a primarily document-driven or code-driven process"
Prototyping & Big Data Technologies
Hierarchy of Data Management Practices (after Maslow)
• Five basic data management practice areas (data management basics):
  – Data Program Management
  – Organizational Data Integration
  – Data Stewardship
  – Data Development
  – Data Support Operations
• These are necessary but insufficient prerequisites to organizational data leveraging applications - that is, self-actualizing data, or advanced data practices:
  – Cloud, MDM, Mining, Big Data, Analytics, Warehousing, SOA
http://3.bp.blogspot.com/-ptl-9mAieuQ/T-idBt1YFmI/AAAAAAAABgw/Ib-nVkMmMEQ/s1600/maslows_hierarchy_of_needs.png
Experience Pebbles
Demystifying Big Data
• Data Analysis - Origins
• Challenges - Faced by virtually everyone
• Complement - Existing data management practices
• Prerequisites - Necessary to exploit big data techniques
• Prototyping - Iterative means of practicing big data techniques
• Takeaways and Q&A
Tweeting now: #dataed
Which Source of Data Represents the Most Immediate Opportunity?
Guidance:
• The real change: cost-effectiveness and timely delivery
• Business process optimization, not technology
• High-fidelity and quality information
• Big data technology: all is not new
• Keep your focus, develop skills
Source: Gartner (January 2013)
Gartner Recommendations (Gartner 2012)

Impact: Some of the new analytics that are made possible by big data have no precedent, so innovative thinking will be required to achieve value.
Recommendation: Treat big data projects as innovation projects that will require change management efforts. The business will take time to trust new data sources and new analytics.

Impact: Creative thinking can unearth valuable information sources already inside the enterprise that are underused.
Recommendation: Work with the business to conduct an inventory of internal data sources outside of IT's direct control, and consider augmenting existing data that is IT-controlled. With an innovation mindset, explore the potential insight that can be gained from each of these sources.

Impact: Big data technologies often create the ability to analyze faster, but getting value from faster analytics requires business changes.
Recommendation: Ensure that big data projects that improve analytical speed always include a process redesign effort that aims at getting maximum benefit from that speed.
Results: It is not always about money
• Solution:
  – Integrate multiple databases into one to create a holistic view of the data
  – Automation of a manual process
• Results:
  – Data is passed safely and effectively
  – Eliminated inconsistencies, redundancies, and corruption
  – Ability to cross-analyze
  – Significantly reduced turnaround time for matching patients with potential donors, increasing the potential to make life-saving connections in a manner that is faster, safer, and more reliable
  – Increased safe matches from 3 out of 10 to 6 out of 10
Data versus Tools: A History Lesson
• Sophisticated tools without data are useless
• Mediocre tools with data are frustrating
• Analysts will always opt for frustration over futility, if that is their only option
– Ira "Gus" Hunt, CIA CTO
http://www.huffingtonpost.com/2013/03/20/cia-gus-hunt-big-data_n_2917842.html
Questions?
Upcoming Events
• Unlock Business Value through Data Quality Engineering – June 11, 2013 @ 2:00 PM ET/11:00 AM PT
• Data Systems Integration & Business Value Pt. 1: Metadata – July 9, 2013 @ 2:00 PM ET/11:00 AM PT
Sign up here: www.datablueprint.com/webinar-schedule or www.dataversity.net
10124 W. Broad Street, Suite C
Glen Allen, Virginia 23060
804.521.4056