Introduction to Data Science


A two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a broad overview of both pattern mining and machine learning.

See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile.

Published in: Technology, Education

  2. 2. DATA SCIENCE WITH A BROAD BRUSH Concepts and methodologies
  3. 3. DATA SCIENCE IS AN UMBRELLA, A FUSION • Databases and infrastructure • Pattern mining • Statistics • Machine learning • Numerical optimization • Stochastic modeling • Data visualization … of specialties needed for data-driven business optimization
  4. 4. DATA SCIENTIST • Data scientist is defined as DS : business problem → data solution • Combination of strong programming, math, computational and business skills • Recipe for success 1. Convert vague business requirements into measurable technical targets 2. Develop a solution to reach the targets 3. Communicate business results 4. Deploy the solution in production
  5. 5. UNDERSTANDING DATA Monday 19 August 2013
  7. 7. UNSUPERVISED LEARNING • Could be called pattern recognition or structure discovery • What kind of a process could have produced this data? • Discovery of “interesting” phenomena in a dataset • Now how do you define interesting? • Learning algorithms exist for a huge collection of pattern types • Analogy: You decide if you want to see westerns or comedies, but the machine picks the movies • But does “interesting” imply useful and significant?
  8. 8. EXAMPLES OF STRUCTURES IN DATA • Clustering and mixture models: separation of data into parts • Dictionary learning: a compact grammar of the dataset • Single class learning: learn the natural boundaries of data Example: Early detection of machine failure or network intrusion • Latent allocation: learn hidden preferences driving purchase decisions • Source separation: find independent generators of the data Example: Independent phenomena affecting exchange rates
  9. 9. MORE EXAMPLES OF “INTERESTING” PATTERNS • { charcoal, mustard } ⇒ sausage • Grocery customer types with differing paths around the trading floor • Pricing trend change in a web ad exchange • Communities and topics in a social network • Distinct features of a person’s face and fingerprints • Objects emerging in front of a moving car
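A rule such as { charcoal, mustard } ⇒ sausage is usually scored by its support and confidence. The following is a minimal sketch of those two measures; the baskets and the function names are invented for illustration and are not from the lecture.

```python
# Toy shopping baskets; the items and counts are invented for illustration.
baskets = [
    {"charcoal", "mustard", "sausage"},
    {"charcoal", "mustard", "sausage", "beer"},
    {"charcoal", "mustard"},
    {"mustard", "bread"},
    {"sausage", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) from basket counts."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"charcoal", "mustard", "sausage"}, baskets)
rule_confidence = confidence({"charcoal", "mustard"}, {"sausage"}, baskets)
```

Support says how often the whole pattern occurs, while confidence says how often the right-hand side follows once the left-hand side is present.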
  10. 10. KNOW YOUR EIGENS AND SINGULARS • Eigenvalue and singular value decompositions are central data analysis tools • They describe the energy distribution and static core structures of data Examples • Face detection, speaker adaptation • Google PageRank is basically just the world’s largest EVD • Zombie outbreak risk is determined by its eigenvalues • As a sub-component in every second learning algorithm
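The PageRank remark can be illustrated with power iteration, which converges to the dominant eigenvector of the damped link matrix. This is a sketch on a hypothetical four-page web; the damping factor 0.85 and the handling of pages without outgoing links are common conventions assumed here, not details from the lecture.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a tiny link graph given as {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank evenly over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link structure, invented for illustration.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Each iteration is one multiplication by the link matrix, so the same idea scales to graphs far too large for a direct decomposition.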
  11. 11. DIMENSION REDUCTION • Some applications encounter large dimension counts up to millions • Dimension reduction may either 1. Retain space: preserve the most “descriptive” dimensions 2. Transform space: trade interpretability for powerful rendition • Usually transformations work oblivious to data (they are simple functions) • Curvilinear transformations try to see how the data is “folded” and build new dimensions specific to the given dataset
  12. 12. DIMENSION REDUCTION EXAMPLE • Singular value decomposition is commonly used to remove the “noise dimensions” with little energy • Example: gene expression data and movie preferences have lots of these • After this more complex methods can be used for unfolding the data
  14. 14. BLIND SOURCE SEPARATION • Find latent sources that generated the data • Tries to discover the real truth beneath all noise and convolution • Examples: • Air defense missile guidance systems • Error-correcting codes • Language modeling • Brain activity factors • Industrial process dynamics • Factors behind climate change
  15. 15. (STATISTICAL) SIGNIFICANCE TESTING • Example: Rejection rate increase in a manufacturing plant • “What is the probability of observing this increase if everything was OK?” • “What is the probability of having a valid alert if there really was something wrong?” • Reliability of significance testing results is wholly dependent on correct modeling of the data source and pattern type • Statistical significance is different from material significance
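The first question on this slide can be sketched as a one-sided binomial test. The numbers are hypothetical, and the binomial model (independent parts, fixed baseline rejection rate) is itself an assumption of the kind the slide warns about.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: the plant normally rejects 2% of parts,
# and today 7 of 100 inspected parts were rejected.
p_value = binomial_tail(n=100, k=7, p=0.02)
```

A small `p_value` means the observed increase would be rare if everything was OK; it says nothing about whether the increase matters materially.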
  16. 16. CORRELATION IS NOT CAUSALITY A correlation may hide an almost arbitrary truth • Cities with more firemen have more fires • Companies spending more in marketing have higher revenues • Marsupials exist mainly in Australia • However, making successful predictions does not require causality
  17. 17. MACHINE LEARNING Basics
  18. 18. SUPERVISED LEARNING • Simplistically, the task is to find a function f such that f(input) = output • Examples: spam filtering, speech recognition, steel strength estimation • Risks for different types of errors can be very skewed • Complex inputs may confuse or slow down models • Unsupervised methods are often useful in improving results by simplifying the input
  19. 19. SEMI-SUPERVISED LEARNING • Only a part of data is labeled • Needed when labeling data is expensive • Understanding the structure of unlabeled data enhances learning by bringing diversity and generalization and by constraining learning • Relates to multi-source learning, some sources labeled, some not • Examples: • Object detection from a video feed • Web page categorization • Sentiment analysis • Transfer learning between domains
  20. 20. TRAINING, TESTING, VALIDATION • A model is trained using a training dataset • The quality of the model is measured by using it on a separate testing dataset • A model often contains hyper-parameters chosen by the user • A separate validation dataset is split off from the training data • Validation data is used for testing and finding good hyper-parameter values • Cross-validation is common practice and asymptotically unbiased
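The data splits described above can be sketched with plain index lists. The function names, the 20% test fraction and the five folds are illustrative assumptions, not values from the lecture.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a test set that is never used for tuning."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=5):
    """Yield (training, validation) index lists; each row is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation

train_idx, test_idx = split_indices(100)
folds = list(k_fold(train_idx, k=5))
```

Hyper-parameters would be chosen by averaging the validation score over the folds, and only the final model is scored on `test_idx`.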
  21. 21. BIAS AND VARIANCE • Squared error of predictions consists of bias and variance (and noise) • BIAS Model incapability of approximating the underlying truth • VARIANCE Model reliance on whims of the observed data • Complex models often have low bias and high variance • Simple models often have high bias and low variance • Having more data instances (rows) may reduce variance • Having more detailed data (variables) may reduce bias • Testing different types of models can explain how to improve your data
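The first bullet can be written out as the standard decomposition. This assumes, beyond what the slide states, that the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that the expectation is taken over training sets and noise; $\hat{f}$ denotes the trained model.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```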
  22. 22. TRAINING AND TESTING, BIAS AND VARIANCE [Figure: error as a function of model complexity, from simple model to complex model, with the points of minimal testing error and minimal training error marked]
  23. 23. MACHINE LEARNING Learning new tricks
  24. 24. THE KERNEL TRICK • Many learning methods rely on inner products of data points • The “kernel trick” maps the data to an implicitly defined, high-dimensional space • Kernel is the matrix of the new inner products in this space • The mapping itself is often left unknown • Example: Gaussian kernel associates local Euclidean neighborhoods to similarity • Example: String kernels are used for modeling DNA sequence structure • Kernels can be combined and custom built to match expert knowledge A kernel is a dataset-specific space transformation; success depends on good understanding of the dataset
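The Gaussian kernel example can be sketched as follows. The width parameter `gamma` and the three sample points are illustrative assumptions.

```python
from math import exp

def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2): near 1 for close points, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)

def kernel_matrix(points, gamma=1.0):
    """Matrix of pairwise kernel values, i.e. inner products in the implicit space."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]

# Two points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0)]
K = kernel_matrix(points)
```

A kernel-based learner only ever reads `K`, which is why the mapping to the high-dimensional space never has to be computed explicitly.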
  25. 25. ENSEMBLE LEARNING • The power of many: combine multiple models into one • Wide and strong evidence of superior performance • Extra bonus: often trivially parallelizable • “Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.” Netflix $1M prize winner (ensemble of 107 models)
  26. 26. ENSEMBLE LEARNING IN PRACTICE • Boosting: weigh (⇒ low bias) focused (⇒ low bias) simple models (⇒ low variance) • Bagging: average (⇒ low variance) results of complex models (⇒ low bias) • What aspect of the data am I still missing? • Variable mixing, discretized jumps, independent factors, transformations, etc. • Questions about practical implementability and ROI • Failure: Netflix winner solution never taken to production • Success: Official US hurricane model is an ensemble of 43
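The variance reduction from bagging can be checked with a small simulation. This is a sketch under invented assumptions: the truth is y = x with Gaussian noise, the base model is a one-nearest-neighbor predictor (low bias, high variance), and 25 bootstrap resamples are averaged.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)

def make_data(n=30):
    """Noisy samples of the hypothetical truth y = x on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, x + rng.gauss(0.0, 0.5)) for x in xs]

def nearest_neighbor(data, x0):
    """A low-bias, high-variance model: copy the closest point's label."""
    return min(data, key=lambda point: abs(point[0] - x0))[1]

def bagged(data, x0, n_models=25):
    """Average the model over bootstrap resamples (drawn with replacement)."""
    predictions = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]
        predictions.append(nearest_neighbor(resample, x0))
    return mean(predictions)

# Repeat the experiment on fresh datasets to see how much each predictor varies.
single_preds, bagged_preds = [], []
for _ in range(300):
    data = make_data()
    single_preds.append(nearest_neighbor(data, 0.5))
    bagged_preds.append(bagged(data, 0.5))

single_var, bagged_var = pvariance(single_preds), pvariance(bagged_preds)
```

Across repeated datasets the bagged prediction varies less than the single model's, because each resample tends to pick a different nearby point and the noise partly averages out.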
  27. 27. RANDOMIZED LEARNING • Motivation: random variation beats expert guidance surprisingly often • Introducing randomness can improve generalization performance (smaller variance) • Randomness allows methods to discover unexpected success • Examples: genetic models, simulated annealing, parallel tempering • Increasingly useful to allow scale-out for large datasets • Many successful methods combine random models as an ensemble • Example: combining random projections or transformations can often beat optimized unsupervised models
  28. 28. ONLINE LEARNING • Instead of ingesting a training dataset, adjust the data model after every incoming (instance, label) pair • Allows quick adaptation and “always-on” operation • Finds good models fast, but may miss the great one ⟹ suitable also as a burn-in for other models • Useful especially for the present trend towards analyzing data streams
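As one concrete instance of adjusting a model after every (instance, label) pair, here is a sketch of the classic perceptron update. The choice of the perceptron, the learning rate and the toy stream are assumptions for illustration; the lecture does not name a specific algorithm.

```python
def perceptron_update(weights, bias, instance, label, learning_rate=1.0):
    """Adjust the model after one (instance, label) pair; label is +1 or -1."""
    score = sum(w * x for w, x in zip(weights, instance)) + bias
    if label * score <= 0:  # misclassified or on the boundary: nudge the model
        weights = [w + learning_rate * label * x for w, x in zip(weights, instance)]
        bias += learning_rate * label
    return weights, bias

def predict(weights, bias, instance):
    """Return +1 or -1 depending on which side of the boundary the instance falls."""
    return 1 if sum(w * x for w, x in zip(weights, instance)) + bias > 0 else -1

# Hypothetical stream: the label is +1 when the first feature exceeds the second.
examples = [((2.0, 1.0), 1), ((1.0, 3.0), -1), ((4.0, 0.5), 1), ((0.0, 2.0), -1)]
stream = examples * 5

weights, bias = [0.0, 0.0], 0.0
for instance, label in stream:
    weights, bias = perceptron_update(weights, bias, instance, label)
```

The model is usable at any point in the stream, and no training set ever has to be held in memory.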
  29. 29. BAYESIAN BASICS • Bayesians see data as fixed and parameters as distributions • Parameters have prior assumptions that can encode expert knowledge • Data is used as evidence for possible parameter values • Final output is a set of posterior distributions for the parameters • Models may employ only the most probable parameter values or their full probability distribution • Variational Bayes approximates the posterior with a simpler distribution
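The prior-to-posterior step can be sketched with the simplest conjugate case, a Beta prior on a success probability updated with binomial data. The conversion-rate scenario and all numbers are hypothetical.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_mode(a, b):
    """Most probable value (MAP estimate); defined for a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

# Hypothetical example: prior belief of roughly 20% conversion, Beta(2, 8),
# then 30 conversions observed in 100 visits.
a, b = beta_posterior(2, 8, successes=30, failures=70)
posterior_mean = beta_mean(a, b)
map_estimate = beta_mode(a, b)
```

Here `map_estimate` corresponds to using only the most probable parameter value, while the pair `(a, b)` carries the full posterior distribution.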
  30. 30. MODEL COMPLEXITY • Limiting model size and complexity can be used to avoid overfitting (excessive variance) • Minimum description length and Akaike/Bayesian information criteria are the Occam’s razor of data science • VC dimension of a model provides a theoretical upper bound for generalization error • Regularization can limit instance weights or parameter sizes • Bayesian models use hyper-parameters to limit parameter overfit
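The Akaike and Bayesian information criteria trade goodness of fit against parameter count. The sketch below uses their standard definitions; the two fitted models and their log-likelihoods are invented for illustration.

```python
from math import log

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * log(n_samples) - 2 * log_likelihood

# Hypothetical fits on n = 200 rows: the larger model fits slightly better.
small = {"log_likelihood": -310.0, "n_params": 3}
large = {"log_likelihood": -307.5, "n_params": 10}

aic_small, aic_large = aic(**small), aic(**large)
bic_small = bic(n_samples=200, **small)
bic_large = bic(n_samples=200, **large)
```

With these numbers both criteria prefer the smaller model: the seven extra parameters do not buy enough likelihood to pay for themselves.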
  31. 31. THE END