Big data


An exploration of what Big Data is, when it adds incremental information, and when it does not.

Published in: Education

  1. Big Data. Gaetan Lion, April 5, 2013
  2. Table of Contents
     1) Big Data trends.
     2) How Big is your Data?
     3) Big Data Potential.
     4) Big technologies. New databases.
     5) Big quantitative methods. New stats.
     6) Big Data temperaments.
     7) Is Big always better?
  3. 1) Big Data Trends
  4. Cost of data storage has dropped
  5. Social Media (Facebook & Twitter) has grown exponentially
     [Chart: Facebook vs Twitter, number of active users in thousands, showing exponential growth.]
     • Facebook started in Feb 2004. It has 1 billion active users.
     • Twitter started in March 2006. It has 500 million users.
     Social networks are creating a huge volume of live unstructured data.
  6. Unstructured Data is taking over…
  7. 2) How Big is your Data?
     • How Tall is it? How large is your sample (rows)?
     • How Wide is it? How many variables (columns)?
     • What is its Velocity? How frequently is it updated?
     • Does it include unstructured data (documents, emails, Social Media)?
  8. 3) Big Data Potential
  9. 4) Big Technologies. New Databases
  11. Database: Structured vs Unstructured
     Table columns: data type, database type, database language, database structure, reporting tool.
     • Structured (customers, transactions, numbers in rows): SQL (structured query language); Data Warehouse, Data Marts; relational database; Oracle Essbase & IBM Cognos Reporting; Business Intelligence; Hadoop Connectors.
     • Unstructured (Social Media, text documents, Web services): NoSQL (not only SQL); non-relational database; Hadoop.
  12. 5) Big quantitative methods. New Stats
  13. New Stats Map
     [Diagram: Predictive Analytics and its branches.]
     • Branches: Statistics & Regression Analysis; Data Mining & Machine Learning (formerly Artificial Intelligence); Natural Language Processing.
     • Methods shown: A/B Testing (hypothesis testing), Regression, Spatial Analysis, Time Series, Signal Processing, Association Rule Learning, Cluster Analysis, Classification, Pattern Recognition, Neural Networks, Optimization, Genetic Algorithms, Sentiment Analysis.
  14. Definitions. Part I
     • Association Rule Learning: a method to uncover interesting relationships by generating and testing possible rules. One application is “market basket analysis”, where a retailer figures out what products are frequently bought together. A cited example is that shoppers who buy diapers often buy beer.
     • Classification: identifies the categories in which new data belongs, based on an existing data set grouped in predefined categories. It differs from Cluster Analysis, which starts without predefined categories.
     • Genetic algorithms: an optimization method inspired by the “survival of the fittest” process. Potential solutions are encoded as “chromosomes” that can combine and mutate. The chromosomes are selected for survival within a modeled “environment.” Example: optimizing the performance of an investment portfolio.
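The market basket idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: the transactions are invented, the function name `pair_rules` and the `min_support` cutoff are assumptions, and only item pairs are considered (full methods such as Apriori also search larger item sets).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        # Confidence is the share of baskets with the antecedent that also
        # contain the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


RULES = pair_rules(TRANSACTIONS)
for antecedent, consequent, support, confidence in RULES:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, only the diapers and beer pair clears the 50% support cutoff, so the sketch reports the two rules between those items.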
  15. Definitions. Part II
     • Natural language processing (NLP): uses algorithms to analyze text data. Sentiment Analysis is a common application. It measures customers’ reaction to a product campaign by analyzing social media.
     • Neural networks: models inspired by the workings of neurons and synapses within the brain. Used for finding nonlinear patterns. They can be used for Pattern Recognition and Optimization. Examples of neural network applications include identifying customers that may leave and identifying fraudulent insurance claims.
     • Signal processing: an electrical engineering method to analyze signals (radio, etc.) and discern between signal and noise. It is used to extract the signal from the noise in a set of less precise data [Signal Detection Theory].
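As a rough illustration of separating signal from noise, the sketch below smooths a noisy series with a centered moving average, one of the simplest low-pass filters. The slide does not name a specific technique, so the choice of filter, the synthetic sine wave, the noise level, and the names `moving_average` and `mean_abs_error` are all assumptions.

```python
import math
import random


def moving_average(values, window):
    """Smooth a sequence with a centered moving average (odd window assumed)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)  # window shrinks at the edges
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean_abs_error(estimate, truth):
    """Average absolute gap between an estimate and the true signal."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic "true" signal: a slow sine wave (hypothetical data).
SIGNAL = [math.sin(2 * math.pi * i / 100) for i in range(400)]
# Observed data: the signal plus Gaussian noise.
NOISY = [s + rng.gauss(0, 0.5) for s in SIGNAL]
SMOOTHED = moving_average(NOISY, window=11)

print("error before smoothing:", round(mean_abs_error(NOISY, SIGNAL), 3))
print("error after smoothing: ", round(mean_abs_error(SMOOTHED, SIGNAL), 3))
```

Averaging neighboring points cancels much of the independent noise while leaving the slow-moving signal largely intact, which is why the smoothed series sits closer to the true one.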
  16. Definitions. Part III
     • Spatial Analysis: analyzes geographic location encoded within the data. The information comes from GPS. Applications include spatial regression to estimate a consumer’s willingness to purchase a product given their location.
  17. 6) Big Data Temperaments
     Source: Harvard Business Review, April 2012, by Shvetank Shah, Andrew Horne and Jaime Capella.
  18. 7) Is Big always better?
  19. No! says Nate Silver
     • “I came to realize that prediction in the era of Big Data was not going very well.”
     • “If the quantity of information is increasing [exponentially]… Most of it is just noise.”
     • He refers to John P. A. Ioannidis’s 2005 paper, “Why Most Published Research Findings Are False.” Two-thirds of scientific papers’ results can’t be replicated!
     “… numbers have no way of speaking for themselves. We speak for them.”
  20. Nate’s targets
     • Political pundits. Their “intuitive” election predictions have been disastrous. Granted, the cause was not Big Data but No Data. He showed them how to do it using Small Data (polls with samples < 1,000);
     • Economic forecasters. They have used Big Data with poor results. The majority of them can’t forecast a recession already underway. ECRI predicted with certainty a double-dip recession in 2011 using tens of variables they did not understand. Instead, the economy improved;
     • Stock market & financial market forecasters. Their performance is similar to that of economic forecasters;
     • Earthquake forecasting. The field is not well understood.
     “… Statistical inferences are much stronger when backed up by theory… about their root causes.”
  21. No! says Vincent Granville
     • Big Data is huge, but information is very sparse;
     • Storing and processing the entire data is very inefficient;
     • You can do better by smartly sampling only 5% of the data;
     You don’t need Big Data, you need Smart Data.
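A quick way to see why a 5% sample can be enough is to compare an estimate from the sample with the same estimate from the full data. The sketch below uses a simple random sample, which is only one possible reading of "smart" sampling; the synthetic purchase amounts and all variable names are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical "full" data set: 100,000 synthetic purchase amounts.
POPULATION = [rng.lognormvariate(3.0, 0.5) for _ in range(100_000)]

# Simple random sample of 5% of the rows, drawn without replacement.
SAMPLE = rng.sample(POPULATION, k=len(POPULATION) // 20)

full_mean = statistics.fmean(POPULATION)
sample_mean = statistics.fmean(SAMPLE)
relative_error = abs(sample_mean - full_mean) / full_mean

print(f"full data mean: {full_mean:.2f}")
print(f"5% sample mean: {sample_mean:.2f}")
print(f"relative error: {relative_error:.2%}")
```

For a summary statistic such as the mean, the sampling error shrinks with the square root of the sample size, so processing the remaining 95% of the rows adds little precision. Rare events and small subgroups are a different matter and may need stratified or targeted sampling.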
  22. Yes! says Chris Anderson
     • He quotes Peter Norvig, Google’s research director: “All models are wrong, and increasingly you can succeed without them.”
     • “… with massive data, [the scientific method] is becoming obsolete.”
     • “We can throw the numbers into the biggest computing clusters … and let statistical algorithms find patterns where science cannot.”
     He mentions examples such as J. Craig Venter’s gene sequencing, Google Search, and Google Translator, among other successes.
     “With enough data, the numbers speak for themselves.”
     “Correlation supersedes causation, and science can advance without coherent models, unified theories, or … any … explanation at all.”
  23. Big Data Effectiveness Map
     • Field needing causal understanding, theory not well understood. Tall data: more data, more noise; oversampling. Wide data: more variables, more false positives; multicollinearity; model overfitting. Examples: Economics, Financial markets, Earthquake forecasting.
     • Field needing causal understanding, theory well understood. Tall data: more data, more signal; oversampling. Wide data: more variables, more explanation; multicollinearity; model overfitting. Examples: Weather forecasting, Customer behavior.
     • Field not needing causal understanding. Tall and wide data: more data, better model performance. Examples: Google Search, Google Translator, Google Flu-trends, Customer behavior.
     • Rule based. Tall and wide data: more data, better model performance. Examples: Games & Sports [Chess, Baseball, etc.], Politics.
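The "more variables, more false positives" cell of the map can be illustrated with a small simulation. In the sketch below every column is pure random noise, so any predictor that appears correlated with the target is a false positive by construction. The sample size, the number of predictors, the 0.279 cutoff (roughly the two-sided 5% critical value of a correlation coefficient for 50 rows), and all names are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(target, predictors, threshold):
    """Count predictors whose absolute correlation with the target exceeds the cutoff."""
    return sum(1 for p in predictors if abs(pearson(target, p)) > threshold)


rng = random.Random(7)  # fixed seed so the run is repeatable
N_ROWS = 50
THRESHOLD = 0.279  # approx. two-sided 5% critical value of r for 50 rows

# Target and predictors are independent noise: no real relationship exists.
TARGET = [rng.gauss(0, 1) for _ in range(N_ROWS)]
PREDICTORS = [[rng.gauss(0, 1) for _ in range(N_ROWS)] for _ in range(2000)]

NARROW = count_spurious(TARGET, PREDICTORS[:20], THRESHOLD)  # 20 variables
WIDE = count_spurious(TARGET, PREDICTORS, THRESHOLD)         # 2,000 variables

print("apparent 'significant' predictors with 20 variables:   ", NARROW)
print("apparent 'significant' predictors with 2,000 variables:", WIDE)
```

About 5% of noise predictors pass the cutoff regardless of how many are tested, so widening the data from 20 to 2,000 columns multiplies the number of spurious findings roughly a hundredfold. Without theory to rule candidates out, a model built on those findings overfits.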