
Big Data: An Overview


A high level semi-technical overview of Big Data (specifically Hadoop).

Published in: Technology


  1. 1. Big Data: An Overview
  2. 2. What Is Big Data?
  3. 3. What Is Big Data? • Big Data is not simply a huge pile of information. • A good starting place is the following paraphrase: “Big Data describes datasets so large they become awkward to manage with traditional database tools at a reasonable cost.”
  4. 4. A Breakdown Of What Makes Up Big Data: Volume, Velocity, Variety, Value (social data, blogs, smart meters, raw bitstreams)
  5. 5. Data Growth Explosion • 1 GB of stored content can create 1 PB of data in transit • The totality of stored data is doubling about every 2 years • This meant 130 EB in 2005 • 1,227 EB in 2010 (1.19 ZB) • 7,910 EB in 2015 (7.72 ZB) (data & image courtesy of IDC)
  6. 6. Growth Of Big Data: Harnessing Insight From Big Data Is Now Possible • 1.8 trillion gigabytes of data was created in 2011 • More than 90% is unstructured data, managed outside relational databases • Approx. 500 quadrillion files • Quantity doubles every 2 years (chart: structured vs. unstructured data volume, 2005 to 2015)
  7. 7. So, Just Any Dataset? • Big Data can work with any dataset • However, Big Data shines when dealing with unstructured data
  8. 8. Structured Vs. Unstructured: Structured data is any data to which a pre-defined data model can be applied in an automated fashion, producing a semantically meaningful result without referencing outside elements. In other words, if you can apply some template to a data set and have it instantly make sense to the average person, it’s structured. If you can’t, it’s unstructured.
  9. 9. Really? Only Two Categories? Okay, there’s also semi-structured data, which basically means that after the template is applied, some of the result will make sense and some will not. XML is a classic example of this kind of data.
  10. 10. Formal Definitions Of Data Types
     Structured Data: Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases, such as relational transactional ones, where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all understood database management systems (DBMS) are designed for structured data.
     Semi-Structured Data: Semi-structured data are intermediate between the two other forms, wherein “tags” or “structure” are associated with or embedded within unstructured data. Semi-structured data are organized in semantic entities; similar entities are grouped together; entities in the same group may not have the same attributes; the order of attributes is not necessarily important; not all attributes may be required; and the size or type of the same attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).
     Unstructured Data: Data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at the time of data ingest.
  11. 11. Informal Definitions Of Data Types. Structured Data: fits neatly into a relational structure. Semi-Structured Data: think documents or EDI. Unstructured Data: can be anything (text, video, sound, images).
  12. 12. Tools For Dealing With Semi/Un-Structured Data
  13. 13. What Is Hadoop? “The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.”
  14. 14. The Paradigm Shift Of Hadoop: Centralized Processing Doesn’t Work. Moving data to a central location for processing (like, say, Informatica) cannot scale; you can only buy a machine so big.
  15. 15. The Paradigm Shift Of Hadoop: Bandwidth Is The Bottleneck • Moving data around is expensive • Bandwidth $$ > CPU $$
  16. 16. The Paradigm Shift Of Hadoop: Process The Data Locally, Where It Lives
  17. 17. The Paradigm Shift Of Hadoop: Then Return Only The Results • You move much less data around this way • You also gain the advantage of greater parallel processing
  18. 18. Where Did Hadoop Originate? GFS: presented to the public in 2003. MapReduce: presented to the public in 2004.
  19. 19. Spreading Out From Google: Doug Cutting was working on “Nutch”, an open-source search engine, when he read the Google papers and reverse engineered the technology. The elephant was his son’s toy, named….
  20. 20. Going Open Source: HDFS and MapReduce released to the public in 2006
  21. 21. A Bit More In Depth, Then A Lot More In Depth: HDFS is primarily a data redundancy solution. MapReduce is where the work gets done.
  22. 22. How Hadoop Works: Hadoop is basically a massively parallel, shared-nothing, distributed processing algorithm
  23. 23. GFS / HDFS: HDFS distributes files at the block level across multiple commodity devices, for redundancy on the cheap. Not RAID: distribution is across machines/racks.
  24. 24. Data Distribution: By default, HDFS writes into blocks, and the blocks are distributed x3
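The block-and-replicate idea above can be sketched in a few lines of Python. This is a toy illustration with made-up block size and node names, not actual HDFS code; real HDFS also applies rack-aware placement, which this skips.

```python
# Toy illustration of HDFS-style block placement: split a file into
# fixed-size blocks and assign each block to three distinct nodes.
import itertools

BLOCK_SIZE = 4          # bytes, tiny for illustration (real HDFS blocks are huge)
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop the byte stream into block_size pieces."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Round-robin each block onto `replication` distinct nodes."""
    ring = itertools.cycle(nodes)
    return {idx: [next(ring) for _ in range(replication)]
            for idx in range(len(blocks))}

blocks = split_into_blocks(b"hello hadoop world!")
placement = place_blocks(blocks)
for idx, nodes in placement.items():
    print(idx, nodes)
```

Each block ends up on three different machines, so losing any single node leaves two live copies, which is the redundancy-on-the-cheap argument of the previous slides.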
  25. 25. WORM: Data is written once and (basically) never erased
  26. 26. How Is The Data Manipulated? Not random reads: data is read from the stream in large, contiguous chunks
  27. 27. The Key To Hadoop Is MapReduce: In a shared-nothing architecture, programmers must break the work down into distinct segments that are: • Autonomous • Digestible • Able to be processed independently • Built with the expectation of incipient failure at every step
  28. 28. A Canonical MapReduce Example (image credit: Martijn van Groningen)
  29. 29. A MapReduce Example: The Input. The data arrives into the system.
  30. 30. A MapReduce Example: Splitting The Input Into Chunks. The data is moved into the HDFS system and divided into blocks, each of which is copied multiple times for redundancy.
  31. 31. A MapReduce Example: Mapping The Chunks. The Mapper picks up a chunk for processing. The MR framework ensures only one mapper will be assigned to a given chunk.
  32. 32. A MapReduce Example: Mapping The Chunks. In this case, the Mapper emits a word with the number of times it was found.
  33. 33. A MapReduce Example: A Shuffle Sort. The Shuffler can do a rough sort of like items (optional).
  34. 34. A MapReduce Example: Reducing The Emissions. The Reducer combines the Mapper’s output into a total.
  35. 35. A MapReduce Example: The Output. The job completes with a numeric index of words found within the original input.
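The map / shuffle / reduce flow the slides just walked through can be simulated in plain Python. This is a single-process sketch of the word-count data flow, not Hadoop itself; Hadoop would run the map phase on many nodes in parallel.

```python
# Single-process sketch of the word-count MapReduce flow described above.
from collections import defaultdict

def map_phase(chunk: str):
    """Mapper: emit (word, 1) for every word in the chunk."""
    for word in chunk.lower().split():
        yield word, 1

def shuffle_phase(mapped):
    """Shuffle: group all emitted values for the same key together."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: total the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The three functions correspond one-to-one to the mapper, shuffler, and reducer boxes on the preceding slides.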
  36. 36. MapReduce Is Not Only Hadoop: MapReduce is a programming paradigm, not a language. You can do MapReduce within an Oracle database; it’s just usually not a good idea. A large MapReduce job would quickly exhaust the SGA of any Oracle environment.
  37. 37. Problem Solving With MapReduce • The key feature is the shared-nothing architecture. • Any MapReduce program has to understand and leverage that architecture. • This is usually a paradigm shift for most programmers, and one that many cannot overcome.
  38. 38. Programming With MapReduce
     • HDFS & MapReduce are written in Java (excerpt from the standard WordCount2 tutorial example):

       package org.myorg;

       import java.io.*;
       import java.util.*;

       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.filecache.DistributedCache;
       import org.apache.hadoop.conf.*;
       import org.apache.hadoop.io.*;
       import org.apache.hadoop.mapreduce.*;
       import org.apache.hadoop.mapreduce.lib.input.*;
       import org.apache.hadoop.mapreduce.lib.output.*;
       import org.apache.hadoop.util.*;

       public class WordCount2 extends Configured implements Tool {

         public static class Map
             extends Mapper<LongWritable, Text, Text, IntWritable> {

           static enum Counters { INPUT_WORDS }

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           private boolean caseSensitive = true;
           private Set<String> patternsToSkip = new HashSet<String>();

           private long numRecords = 0;
           private String inputFile;

           public void setup(Context context) {
             Configuration conf = context.getConfiguration();
             caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
             inputFile = conf.get("map.input.file");

             if (conf.getBoolean("wordcount.skip.patterns", false)) {
               Path[] patternsFiles = new Path[0];
               try {
                 patternsFiles = DistributedCache.getLocalCacheFiles(conf);
               } catch (IOException ioe) {
                 System.err.println("Caught exception while getting cached files: "
                     + StringUtils.stringifyException(ioe));
               }
               for (Path patternsFile : patternsFiles) {
                 parseSkipFile(patternsFile);
               }
             }
           }

           private void parseSkipFile(Path patternsFile) {
             try { ...

     • Will work with any language supporting STDIN/STDOUT (Hadoop Streaming)
     • Lots of people using Python, R, Matlab, Perl, Ruby et al.
     • Is still very immature and requires low-level coding
  39. 39. What Are Some Big Data Use Cases? • Inverse Frequency / Weighting • Co-Occurrence • Behavioral Discovery • “The Internet Of Things” • Classification / Machine Learning • Sorting • Indexing • Data Intake • Language Processing. Basically, clustering and targeting.
  40. 40. Inverse Frequency Weighting: Recommendation Systems
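Inverse frequency weighting is the idea behind TF-IDF scoring, which underlies many recommendation and search-ranking systems: words appearing in fewer documents get a higher, more distinctive weight. A minimal sketch with made-up documents, not tied to any particular library:

```python
# Minimal TF-IDF sketch: rarer terms weigh more than common ones.
import math

docs = [
    "hadoop stores big data",
    "hadoop processes big data",
    "cats sleep all day",
]
tokenized = [doc.split() for doc in docs]

def idf(term: str) -> float:
    """Inverse document frequency: log(N / docs containing the term)."""
    containing = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / containing)

def tf_idf(term: str, doc_index: int) -> float:
    """Term frequency within one document, scaled by inverse frequency."""
    doc = tokenized[doc_index]
    return (doc.count(term) / len(doc)) * idf(term)

# "hadoop" appears in 2 of 3 docs, "cats" in only 1, so "cats" scores higher.
print(tf_idf("hadoop", 0), tf_idf("cats", 2))
```

In a recommender, the same weighting keeps ubiquitous items (everyone buys milk) from drowning out the distinctive ones that actually characterize a user.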
  41. 41. Co-Occurrence: Fundamental data mining. People who did this also do that.
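Co-occurrence ("people who did this also do that") reduces to counting item pairs across baskets; at Hadoop scale the same counting would run as a MapReduce job over transaction logs. A toy sketch with invented baskets:

```python
# Toy co-occurrence count over purchase baskets: which pairs of items
# are bought together most often?
from collections import Counter
from itertools import combinations

baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "salsa"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives a canonical order so (a, b) and (b, a) count as one pair
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))
```

The top pairs are the "also bought" suggestions; the map phase of a MapReduce version would emit the pairs, and the reduce phase would do the summing.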
  42. 42. Behavioral Discovery
  43. 43. Behavioral Discovery: “The best minds of my generation are thinking about how to make people click ads.” Jeff Hammerbacher, former research scientist at Facebook, currently Chief Scientist at Cloudera
  44. 44. “The Internet Of Things”: “Data Exhaust”
  45. 45. Classification / Machine Learning
  46. 46. Sorting. Record holders: • 10 PB sort, 8,000 nodes, 6 hours 27 minutes (September 7, 2011) • 1.5 TB, 2,103 nodes, 59 seconds (February 26, 2013)
  47. 47. Indexing
  48. 48. Data Intake: Hadoop can be used as a massively parallel ETL tool; Flume to ingest files, MapReduce to transform them.
  49. 49. Language Processing (includes sentiment analysis): How can you infer meaning from someone’s words? Does that smile mean happy? Sarcastic? Bemusement? Anticipation?
  50. 50. How Can Big Data Help You? 9 use cases: • Natural Language Processing • Internal Misconduct • Fraud Detection • Marketing • Risk Management • Compliance / Regulatory Reporting • Portfolio Management • IT Optimization • Predictive Analysis
  51. 51. Compliance / Regulatory Reporting
  52. 52. Predictive Analysis: Think data mining on steroids. One of the main benefits Hadoop brings to the enterprise is the ability to analyze every piece of data, not just a statistical sample or an aggregated form of the entire datastream.
  53. 53. Risk Management (photo credit: Guinness World Records; 88 catches, by the way)
  54. 54. Risk Management: Behavioral Analysis. When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from more sensitive positions.
  55. 55. Fraud Detection: “Dear Company: I hurt myself working on the line and now I can’t walk without a cane.” Then he tells his Facebook friends he’s going to his house in Belize for some waterskiing.
  56. 56. Internal Misconduct: One of the reasons why the FBI was able to close in on the identities of the people involved is that they geolocated the sender and recipient of the Gmail emails and connected those IP addresses with known users on those same IP addresses.
  57. 57. Portfolio Management • Evaluate portfolio performance on existing holdings • Evaluate portfolio for future activities • High-speed arbitrage trading • Simply keeping up: “Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row” of growth; 10,000 credit card transactions per second (statistics courtesy of ComputerWorld, April 2012)
  58. 58. Sentiment Analysis – Social Network Analysis: Companies used to rely on warranty cards and the like to collect demographic data. People either did not fill out the forms or did so with inaccurate information.
  59. 59. Sentiment Analysis – Social Network Analysis: People are much more likely to be truthful when talking to their friends.
  60. 60. Sentiment Analysis – Social Network Analysis: This person – and 20 of their friends – are talking about the NFL. This person is a runner. Someone likes Kindle. Someone is current with pop music.
  61. 61. Sentiment Analysis – Social Network Analysis: Even where you least expect it. You might be thinking something like “My customer will never use social media for anything I care about. No sergeant is ever going to tweet ‘The straps on this new rucksack are so comfortable!!!’”
  62. 62. Sentiment Analysis – Social Network Analysis: Internal Social Networking At Customer Sites • Oracle already uses an internal social network to facilitate work. • The US Military is beginning to explore a similar type of environment. • It is not unreasonable to plan for the DoD installing a network on base; your company could incorporate feedback from end users into design decisions.
  63. 63. Sentiment Analysis – Apple iOS6, Maps & Stock Price: Apple released iOS6 with their own version of Maps. It has had some issues, to put it mildly.
  64. 64. Sentiment Analysis – Apple iOS6, Maps & Stock Price: Over half of all trades in the US are initiated by a computer algorithm. Source: Planet Money (NPR), Aug 2012
  65. 65. Sentiment Analysis – Apple iOS6, Maps & Stock Price: People started to tweet about the maps problem, and it went viral (to the point that someone created a Tumblr blog to make fun of Apple’s fiasco).
  66. 66. Sentiment Analysis – Apple iOS6, Maps & Stock Price: As the Twitter stream started to peak, Apple’s stock price took a short dip. I believe it likely that automatic trading algorithms started to sell off Apple based on the negative sentiment analysis from Twitter and Facebook.
  67. 67. Natural Language Processing: Big, Huge, Blooming, Ample, Blimp, Gigantic, Abundant, Broad, Bulky, Capacious, Colossal, Comprehensive, Copious, Enormous, Excessive, Exorbitant, Extensive, Extravagant, Full, Generous, Giant, Goodly, Grand, Grandiose, Great, Hefty, Humongous, Immeasurable, Immense, Jumbo, Gargantuan, Massive, Monumental, Mountainous, Plentiful, Populous, Roomy, Sizable, Spacious, Stupendous, Substantial, Super, Sweeping, Vast, Voluminous, Whopping, Wide, Ginormous, Mongo, Badonka, Booku, Doozy
  68. 68. Natural Language Processing: All of the words above reduce to a single concept: Large
  69. 69. Natural Language Processing: Anticipate Customer Need
  70. 70. Natural Language Processing: React To Competitor’s Missteps
  71. 71. Natural Language Processing: Cultural Fit For Hires. As of Apr 22, there were 724 Hadoop openings in the DC area. There will be hundreds – if not thousands – of applicants for each position. How can you determine who is the most appropriate candidate, not just technically, but culturally?
  72. 72. Natural Language Processing: Cultural Fit? A good way to think of cultural fit is the “airport test.” If you’re thinking of hiring someone and you had to sit with them in an airport for a few hours because of a delayed flight, would that make you happy? Or would you cringe at the thought of hours of forced conversation?
  73. 73. Natural Language Processing: Analyze Their Writings For Cultural Fit. Go beyond simple keyword searches to find out more about the person. Regardless of what their resume says, language analysis can reveal details about where they grew up and where they experienced their formative years.
  74. 74. Natural Language Processing: Analyze Their Writings For Cultural Fit. Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or “hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words are synonyms, they carry cultural connotations with them. Find candidates with the same markers as your existing team for a more cohesive unit.
  75. 75. IT Optimization
  76. 76. IT Optimization – Enabling The Environment. Machines reporting in: “I’m running out of supplies!” “I’m overheating!” “Everything is fine.” “Wheel 21 is out of alignment.” “I’m 42.4% full.”
  77. 77. IT Optimization – Enabling The Shop Floor. A more specific example: “I’m 42.4% full.”
  78. 78. IT Optimization – Enabling The Shop Floor: Make The Trash Smart. We can make the trash bins “smart” by putting a wifi-enabled scale beneath each bin and using that to determine when the bins are reaching capacity.
  79. 79. IT Optimization – Enabling The Shop Floor: Cut Down On Clean-Up Labor. As of now, the custodian has to check each bin to see if it is full. With a “smart” bin, the custodian can check his smartphone and see what does and does not need to be done.
  80. 80. IT Optimization – Enabling The Shop Floor: Cut Down On Clean-Up Labor. More importantly, we can now focus on what is happening to the bins and how they are being used. For example, we may find outliers where one bin is filling much faster than all of the others.
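The smart-bin outlier hunt described above can be sketched as a simple fill-rate comparison. All readings are invented for illustration, and the 1.5-sigma threshold is an arbitrary rule of thumb, not a prescription:

```python
# Toy sketch of the "smart bin" idea: each wifi scale reports weight over
# time, we derive a fill rate per bin, and flag bins filling much faster
# than the rest as candidates for a closer look.
from statistics import mean, stdev

# hypothetical readings: bin id -> grams added per hour over one shift
fill_rates = {"bin_a": 120, "bin_b": 135, "bin_c": 110, "bin_d": 420, "bin_e": 125}

avg = mean(fill_rates.values())
sd = stdev(fill_rates.values())
# flag anything more than 1.5 standard deviations above the average
outliers = [b for b, rate in fill_rates.items() if rate > avg + 1.5 * sd]
print(outliers)
```

The flagged bin is exactly the one you would feed into the Six Sigma investigation the next slide describes.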
  81. 81. IT Optimization – Enabling The Shop Floor: Drilling Into Waste Production (“data exhaust”). We can drill into why that bin is filling faster, leverage the Six Sigma efficiency processes already in place, and improve the overall performance of the line.
  82. 82. IT Optimization – Classify Legacy Data. A customer can use a machine learning process to take unknown data and sort it into useful data elements. For example, a retail car part company might use this process to sort photos: is that circle a steering wheel, a hubcap or a tire?
  83. 83. So, All We Need Is Hadoop, Right? Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle): • Security • Ad-hoc query support • SQL support • Readily available technical resources
  84. 84. Then How Do We Fix Those Problems? In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis.
  85. 85. Oracle’s Big Data Appliance
  86. 86. Oracle’s Big Data Appliance: In Depth
  87. 87. Big Data Appliance: The Specs Of The Machine
     Hardware: • 18 compute/storage nodes (216 cores total) • 2 six-core Intel processors per node • 48 GB memory per node (up to 144 GB); 864 GB total (2.5 TB max) • 12 x 3 TB SAS disks per node; 648 TB total • 3 InfiniBand switches • Ethernet switch, KVM, PDU • 42U rack
     Software: • Oracle Linux • Java Virtual Machine • Cloudera Hadoop distribution • R (statistical programming language) • Oracle NoSQL Database
     Environmental: • 12.25 kVA (12.0 kW) power draw • 41k BTU/hr (42k kJ/hr) cooling • 1,886 CFM airflow
  88. 88. Big Data Appliance: The Cloudera Distribution
  89. 89. The Analytics Evolution: What Is Happening In The Industry (Copyright © 2012, Oracle and/or its affiliates. All rights reserved.)
     Descriptive (analyzing data to determine what has happened or is happening now): • Standard Reporting: What happened? • Ad Hoc Reporting: How many, how often, where? • Query/Drill Down: What exactly is the problem? • Alerts: What actions are needed?
     Predictive (examining data to discover whether trends will continue into the future): • Simulation: What could happen…? • Forecasting: What if these trends continue? • Predictive Modeling: What will happen next if…?
     Prescriptive (studying data to elevate the best course of action for the future): • Optimization: How can we achieve the best outcome? • Stochastic Optimization: How can we achieve the best outcome, including the effects of variability?
     Competitive advantage rises with the degree of complexity; some are here (the lower stages), with growing investment here (the upper stages).
     Source: Competing On Analytics: The New Science Of Winning; Thomas Davenport & Jeanne Harris, 2007
  90. 90. The Analytics Evolution: Where Big Data Fits On This Model. The same progression as the previous slide, highlighting where Big Data best fits.
  91. 91. Typical Stages In Analytics: Choosing The Right Solutions For The Right Data Needs (growing investment here)
  92. 92. The Data Warehouse Evolution: What Are Oracle’s Customers Deploying Today? Increasing business value with information architecture maturity (data & analytics diversity): • Data Marts: what happened yesterday • Consolidated Data / Data Warehouse: what is happening today (most are here!) • Big Data: what could happen tomorrow (some are here; growing investment here)
  93. 93. What Is Your Big Data Strategy? Acquire: Where does your data originate? How will you acquire live streams of unstructured data? (Acquire → Organize → Analyze → Decide)
  94. 94. What Is Your Big Data Strategy? Organize: What do you do with it once you have it? How will you organize big data so it can be integrated into your data center?
  95. 95. What Is Your Big Data Strategy? Analyze: How do you manipulate it once you have it? What skill sets and tools will you use to analyze big data?
  96. 96. What Is Your Big Data Strategy? Decide: What do you do after you’re done? How will you share the analysis in real time?
  97. 97. Big Data In Action: Make better decisions using big data (Acquire → Organize → Analyze → Decide)
  98. 98. The Big Data Development Process. Traditional BI: change requests. Big Data: hypothesis → identify data sources → explore results → reduce ambiguity → refine models → improved hypothesis.
  99. 99. Oracle’s Big Data Solution. Acquire: Oracle Big Data Appliance. Organize & Discover: Endeca Information Discovery. Analyze: Oracle Exadata. Decide: Oracle Exalytics, Oracle Real-Time Decisions. Connected over InfiniBand.
  100. 100. Oracle’s Big Data Solution: Pre-Built And Optimized Out Of The Box. A custom configuration reaches 100% performance achievement over a time measured in months (assemble dozens of components, test & debug failure modes, measure, diagnose, tune and reconfigure, multi-vendor fingerpointing); the appliance does so in days.
  101. 101. Big Data Appliance Performance Comparisons • 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs • 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
  102. 102. Oracle Big Data Connectors • Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database • Oracle Direct Connector for HDFS: access data directly in HDFS using external tables • ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH • Oracle R Connector for Hadoop • Load results into Oracle Database at 12 TB/hour (BDA to Oracle Exadata over InfiniBand)
  103. 103. Oracle Database Advanced Analytics Option: Oracle R Enterprise • The R open source environment for statistical computing and graphics is growing in popularity for advanced analytics • Widely taught in colleges and universities • Popular among millions of statisticians • R programs can run unchanged against data residing in the Oracle Database • Reduce latency • Improve data security • Augment results with powerful graphics • Integrate R results and graphics with OBIEE dashboards
  104. 104. Oracle Database Advanced Analytics Option: Oracle Data Mining (problem / algorithm / applicability)
     • Classification: Logistic Regression (GLM), a classical statistical technique; Decision Trees, popular for rules and transparency; Naïve Bayes, for embedded apps; Support Vector Machine, for wide/narrow data and text
     • Regression: Multiple Regression (GLM), a classical statistical technique; Support Vector Machine, for wide/narrow data and text
     • Anomaly Detection: One Class Support Vector Machine (SVM), for when examples are lacking
     • Attribute Importance: Minimum Description Length (MDL), for attribute reduction, identifying useful data, reducing data noise
     • Association Rules: Apriori, for market basket analysis and link analysis
     • Clustering: Hierarchical K-Means and Hierarchical O-Cluster, for product grouping, text mining, gene and protein analysis
     • Feature Extraction: Non-Negative Matrix Factorization (NMF), for text analysis and feature reduction
  105. 105. Oracle Database SQL Analytics: Included In The Oracle Database
     • Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
     • Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
     • LAG/LEAD functions: direct inter-row reference using offsets
     • Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
     • Statistical aggregates: correlation, linear regression family, covariance
     • Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
     • Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, median, stats_mode, variance, standard deviation, quantile values, +/- n sigma values, top/bottom 5 values
     • Correlations: Pearson’s correlation coefficients, Spearman’s and Kendall’s (both nonparametric)
     • Cross tabs: enhanced with % statistics (chi squared, phi coefficient, Cramer’s V, contingency coefficient, Cohen’s kappa)
     • Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon Signed Ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
     • Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test; Normal, Uniform, Weibull, Exponential
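Several of the window functions listed above, such as RANK and LAG, are standard SQL and can be tried with Python's built-in sqlite3 module (any SQLite 3.25 or later supports window functions; the table and values here are made up, and the SQL is generic rather than Oracle-specific):

```python
# Small demo of the ranking and LAG ideas above using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 200), ("west", 200)],
)

# RANK() numbers rows by amount (ties share a rank, leaving a gap after);
# LAG() pulls the previous row's amount without a self-join.
rows = conn.execute("""
    SELECT region,
           amount,
           RANK() OVER (ORDER BY amount DESC)      AS rnk,
           LAG(amount) OVER (ORDER BY amount DESC) AS prev_amount
    FROM sales
    ORDER BY amount DESC
""").fetchall()
for row in rows:
    print(row)
```

The same query shape runs on Oracle, which is the point of the slide: this analytic machinery ships inside the database.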
  107. 107. Having Said That…
  108. 108. Big Data Is More Than Just Hardware & Software
  109. 109. The Math Is The Hard Part. This is a very simple equation: a Fourier transformation of a wave kernel at 0.
  110. 110. The Math Is The Hard Part. This is a photograph of a data scientist’s whiteboard at
  111. 111. Data Scientists Are Expensive And Hard To Find • Typical job description: “Ph.D. in data mining, machine learning, statistical analysis, applied mathematics or equivalent; three-plus years hands-on practical experience with large-scale data analysis; and fluency in analytical tools such as SAS, R, etc.” • Looking for “baIT”: Business, Analytics, IT, all in the same person. These people exist, but are very expensive.
  112. 112. Growing Your Own Data Scientist • Business acumen • Familiarity with (and a liking for) computational linear algebra / matrix analysis • Interest in SAS, R, Matlab • Familiarity with (and a liking for) Lisp
  113. 113. Big Data Cannot Do Everything
  114. 114. Big Data Cannot Do Everything: Big Data is a great tool, but not a silver bullet. You would never run a POS system on Hadoop; Hadoop is far too batch-oriented to support this type of activity. Similarly, random access of data does not work well in the Hadoop world.
  115. 115. When Big Data? When Relational? Size of data (rough measure)
  116. 116. When Big Data? When Relational? RDBMS vs. Hadoop: A Comparison
     • RDBMS: fully SQL compliant / Hadoop: helper languages (Hive, Pig)
     • RDBMS: many vendors extend SQL in useful ways / Hadoop: very useful, but not as robust as SQL
     • RDBMS: optimized for query performance; tunable (input vs. output, long running queries, etc.) / Hadoop: optimized for analytics operations, specifically those of a statistical nature
     • RDBMS: armies of trained and available resources / Hadoop: resources are hard to find, and expensive when found
     • RDBMS: requires more specialized hardware at performance extremes / Hadoop: designed to work on commodity hardware at all levels
     • RDBMS: OLTP, OLAP, ODS, DSS, hybrid; more general purpose / Hadoop: basically only for analytics
     • RDBMS: expensive to implement over wide geographical distribution / Hadoop: designed to span data centers
     • RDBMS: very mature technology / Hadoop: very new technology
     • RDBMS: real-time or batch processing / Hadoop: batch operations only
     • RDBMS: nontrivial licensing costs / Hadoop: open source (“free” --ish)
     • RDBMS: about 2 PB as largest commercial cluster (a telecom company) / Hadoop: 100+ PB as largest commercial cluster (Facebook, as of March 2013)
     • RDBMS: ad hoc operations common, if not encouraged / Hadoop: ad hoc operations possible with HBase, but nontrivial
  117. 117. It Is Not An “Either/Or” Choice: RDBMS and Hadoop each solve different problems
  118. 118. Where Are Things Heading?
  119. 119. A Quick Recap. GFS: presented to the public in 2003. MapReduce: presented to the public in 2004.
  120. 120. Hadoop Is Already Dead? Yes* -- sort of. (* = for a specific set of problems…)
  121. 121. The New Stuff In Overview (name, publication year, use, what it does, impact, open source?)
     • Colossus (n/a): GFS for realtime systems. Not open source.
     • Caffeine (2009): real-time search; incremental updates of analytics and indexes in real time. Estimated to be 100x faster than Hadoop. Not open source.
     • Pregel (2009): social graphs, location graphs, learning & discovery, network optimization, the Internet of Things; analyzes next-neighbor problems. Estimated to handle billions of nodes & trillions of edges. Open source: Apache Giraph (alpha).
     • Percolator (2010): large-scale incremental processing using distributed transactions; makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Data in the environment remains much more up to date with less effort.
     • Dremel (2010): SQL-like language for queries on the above technologies; interactive, ad hoc queries over trillion-row tables in subsecond time; works against Caffeine / Pregel / Colossus without requiring MapReduce. Easier for analysts and non-technical people to be productive (i.e., not as many data scientists are required). Open source: Apache Drill (incubator, very alpha).
     • Spanner (Oct 2012): fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe; uses GPS sensors and atomic clocks to keep the clocks of servers in sync regardless of location or other factors. Transactional support on a global scale, at a fraction of the cost, and where (many times) not technically possible otherwise. Not open source, and unlikely to ever be.
     • Storm (2012): real-time Hadoop-like processing; the power of Hadoop in real time. Not from Google; from Twitter. Eliminates the requirement for batch processing. Open source (beta).
  122. 122. One Last Thing: Hadoop is just the start of the equation
  123. 123. One Last Thing: Hadoop For Analytics And Determining Boundary Conditions. Use Hadoop to analyze all of the data in your environment and then generate mathematical models from that data.
  124. 124. One Last Thing: Acting On Boundary Conditions. Once the model has been built (and vetted), it can be used to resolve events in real time, thereby getting around the batch bottleneck of Hadoop.
  125. 125. No Really. One More Last Thing
  126. 126. Who Is Hilary Mason? • Chief Data Scientist at • One of the major innovators in data science • Scary smart and fun to be around • A heck of a teacher, to boot (photo credit: Pinar Ozger, Strata 2011)
  127. 127. The Mason 5 Step Process For Big Data, In Reverse Order. Interpret: The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But before we can interpret the data, we must first…
  128. 128. The Mason 5 Step Process For Big Data, In Reverse Order. Model: Model the data into a useful paradigm which will allow us to make sense of any new data based on past experiences. But before we can model the data, we must first…
  129. 129. The Mason 5 Step Process For Big Data, In Reverse Order. Explore: Explore the data we have and look for meaningful patterns from which we could extract a useful model. But before we can look through the data for meaningful patterns, we first have to…
  130. 130. The Mason 5 Step Process For Big Data, In Reverse Order. Scrub: Clean and clarify the data we have, to make it as neat as possible and easier to manipulate. But before we can clean the data, we have to start with…
  131. 131. The Mason 5 Step Process For Big Data, In Reverse Order. Obtain: Obtain as much data as possible. Advances in technology, coupled with Moore’s law, mean that DASD is very, very cheap these days. So much so that you may as well hang on to as much data as you can, because you never know when it will prove useful.
  132. 132. Questions?
  133. 133. Some Resources
     White papers: • An Architect’s Guide To Big Data • Big Data For The Enterprise • Big Data Gets Real Time • Build vs. Buy For Hadoop
     This deck: Slideshare
     Web resources: • Oracle Big Data • Oracle Big Data Appliance • Oracle Big Data Connectors
     Me: charles dot scyphers oracle dot com; @scyphers (Twitter)