Department of Statistical Science
BOX 90251, DURHAM, NC 27708-0251
(919) 684-4210, WWW.STAT.DUKE.EDU
Spring (March 4 2019)
Sudipta Dasmohapatra (sd345@duke.edu)
Introduction to Data Science
and Machine Learning
Do you recognize this Company?
What is Data Science?
The Era of Big Data
• Digital data and ecommerce
• Online transactions
• Social media data
• Financial data
• Retail and other data
• Etc.
Volume, Velocity and Variety
Jobs in Data Science
• 25 best jobs of 2019 (US News)
https://money.usnews.com/money/careers/slideshows/the-25-best-jobs?slide=27
…
Google Trends (US)
Google Trends (Data Science)
What is Data Science?
• Data Science is an area at the interface of
statistics, computer science, and mathematics
• Statisticians contributed a large inferential framework,
important Bayesian perspectives, the bootstrap and
CART and random forests, and the concepts of
sparsity and parsimony
• Computer scientists pioneered neural networks,
boosting, PAC bounds, and developed frameworks
such as Spark and Hadoop for handling
Big Data
• Mathematicians contributed support vector machines,
modern optimization, tensor analysis and (maybe)
topological data analysis
What is Data Science?
• Data Science tries to find
hidden structure in large, high
dimensional datasets. But
there is significant variance in
the interpretability of results
• Interesting structure can arise
in regression analysis,
discriminant analysis, cluster
analysis, or more exotic
situations, such as
multidimensional scaling
What is Data Science?
Visualizations
• https://bost.ocks.org/mike/nations/
• https://observablehq.com/@d3/sankey-diagram
• http://bl.ocks.org/nbremer/94db779237655907b907
Machine Learning
• Machine learning is an application of artificial intelligence (AI)
that provides systems the ability to automatically learn and
improve from experience without being explicitly programmed
• Machine learning focuses on the development of computer
programs that can access data and use it to learn for
themselves
https://towardsdatascience.com/machine-learning-65dbd95f1603
Applications of ML
• Google Photos: To recognize faces,
emotions, location, etc.
• Google Gmail: Content modeling
• Youtube: Improve search results
• Amazon: Product recommendations
• Facebook: Rank and personalize News
Feed stories, filtering out offensive
content, highlighting trending topics,
ranking search results, and recognizing
image and video content
• Uber: UberEATS to estimate time to
deliver food
hiphotos35/Getty Images/iStockphoto
Machine Learning Algorithms
• Supervised Learning: Response
Multiple Linear Regression (MLR)
• Models with more than one predictor
variable are called multiple regression
models.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
Independent variables
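As a rough illustration of how the coefficients in the model above can be estimated, the sketch below fits them by least squares using the normal equations. The function name `fit_mlr` and the data are hypothetical, invented for this example; in practice a statistics library would normally be used instead.

```python
def fit_mlr(X, y):
    """Least-squares estimates [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    # Prepend a column of ones so the intercept b0 is estimated too.
    A = [[1.0] + [float(v) for v in row] for row in X]
    n = len(A[0])
    # Build the normal equations (A^T A) b = A^T y as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in A) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(A, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
coefficients = fit_mlr(X, y)
print([round(b, 6) for b in coefficients])
```

Because the made-up data contain no noise, the estimates should come out very close to the generating values 1, 2 and 3; with real data the error term means the fit is only approximate.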
Machine Learning Algorithms
• Supervised Learning: Response
List of Common Supervised ML
Algorithms
• Linear Regression: Approach to
modelling the relationship between a
response (or dependent variable) and
one or more explanatory variables (or
independent variables)
• Nearest Neighbor: ML algorithm for
classification and prediction. The
k-nearest neighbor technique classifies
an unknown observation by computing
its distance (on the variables/features)
to previously labeled observations and
assigning the most common label among
the k closest
• Decision Trees: Decision support tool
that makes decisions based on a
tree-like structure to classify possible
outcomes
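The nearest neighbor idea in the list above can be sketched in a few lines. This is a minimal illustration, not a production classifier: the function `knn_classify`, the choice of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest labeled points."""
    # Sort labeled observations by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of the k closest observations and return the most common.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled observations: (features, label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
label_near_a = knn_classify(train, (1.8, 1.6))
label_near_b = knn_classify(train, (6.2, 6.4))
print(label_near_a, label_near_b)
```

In practice the features are usually scaled first, since a variable measured in large units would otherwise dominate the distance.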
List of Common Supervised ML
Algorithms
• Support Vector Machines:
The goal of SVM is to find
the hyperplane (a line in two
dimensions) that separates
the two classes with the
largest margin
• Neural Networks: Neural
Networks (NN) are a class
of machine learning
techniques that are
modeled loosely after the
human brain, to recognize
patterns in the data
Machine Learning Algorithms
• Un-Supervised Learning: No Response
• Data analysis without a right answer
• You don’t have an outcome variable you are
seeking to fit or otherwise predict
• Often best applied as exploratory analysis en
route to predictive modeling
List of Common Unsupervised
ML Algorithms
• Cluster Analysis: A
clustering problem is
where you want to
discover the inherent
groupings in the data,
such as grouping
customers by
purchasing behavior
• Association Analysis: An
association rule learning
problem is where you
want to discover rules
that describe large
portions of your data,
such as people that buy
X also tend to buy Y
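A rule such as "people that buy X also tend to buy Y" is usually judged by its support (how often X and Y appear together) and its confidence (how often Y appears when X does). The sketch below computes both on a handful of made-up shopping baskets; the items, baskets, and function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in 3 of 5 baskets, and butter appears in 3 of the 4 baskets that contain bread.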
Data of arrests per 100,000 residents for
assault, murder, and rape in each of the 50 US
states in 1973
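To make the idea of discovering groupings concrete, here is a minimal k-means sketch. K-means is used here only as one familiar clustering method; the slides do not say which algorithm produced their figure. The data points and the naive "first k points" initialization are assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Very small k-means: returns (centroids, labels)."""
    # Naive initialization (an assumption of this sketch): use the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    return centroids, labels


# Hypothetical customers: (visits per month, average basket size).
points = [(1, 1), (9, 9), (1.5, 2), (8, 8), (1, 0.5), (9, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No outcome variable is supplied anywhere: the two groups emerge only from the distances between observations.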
Machine Learning
Remember that machine learning only works if the problem is actually solvable
with the data that you have.
Traditional Modeling to Deep
Learning
Source: Cook, 2019, houseofbots.com
Deep Learning?
• A method that makes predictions using a
sequence of non-linear processing stages
• The resulting intermediate representations can
be interpreted as feature hierarchies and the
whole system is jointly learned from data
• Deep learning is a new way of fitting neural
networks
Image Analysis and Deep
Learning
• Images take a lot of space, so
image compression is important
• Image segmentation is the
process of partitioning a
digital image into multiple
segments (sets of pixels)
• Image segmentation is
typically used to locate
objects and boundaries (lines,
curves, etc.) in images.
Image Preprocessing and
Segmentation
Deep Learning?
Source: Deshpande, https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Neural Networks Basic
Architecture
Deep Learning?
Source: Deshpande, https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
A Kernel
Input Layer
Hidden Layer
Visualization of the First Layer
Multiple Applications
Common Data Issues and
Problems
• Too much data across multiple sources
– no consolidation or integration
• Turf war over who owns the data
• Issues of data quality
• Understanding the low-hanging fruit
(data exploration and management,
standardization, visualization)
• Understanding value of data across
organization
• …
Questions?

NCCU: The Story of Data Science and Machine Learning Workshop - A Tutorial in Data Science and Machine Learning - Sudipta Dasmohapatra, March 4, 2019


Editor's Notes

  • #3 Can you identify what company this could be? There are a couple of things here. See that we are buying something. It shows recommendations and lists of what other customers are buying. We are using data science to generate this list.
  • #5 This is the era of big data: we have data coming from everywhere, which means we need the resources and skills to analyze it. Imagine data from all these sources and in all these industries. Small companies as well as big ones realize that there is value in looking at and evaluating data. Big Data is any data that is expensive to manage and hard to extract value from. Volume: the size of the data. Velocity: the latency of data processing relative to the growing demand for interactivity. Variety and complexity: the diversity of sources, formats, quality, and structures.
  • #6 Both 1 and 2 are very closely related to data science (computer science + statistics + Math)
  • #9 There are some cultural differences. A key concept in data science is sparsity, which is closely related to parsimony and regularization. One wants the simplest possible model that is adequate to one's purpose. This often implies that the model is parsimonious (containing only a few terms), and this may be achieved by regularization (e.g., forcing terms with small coefficients to zero). Sparsity is essentially Ockham’s Razor, and is a key idea in all inferential paradigms. It takes many forms. CART: classification and regression trees. PAC: probably approximately correct learning. “Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in the evidence of new data.” An example: suppose that, out of 4 championship races (F1) between Niki Lauda and James Hunt, Niki won 3 times while James managed only 1. If you were to bet on the winner of the next race, you would probably pick Niki Lauda. Here’s the twist: it rained once when James won and once when Niki won, and it is certain that it will rain on the date of the next race. Who would you bet on now? Intuitively, James's chances of winning have increased drastically, but by how much? Substituting the values into the conditional probability formula gives P(James wins | rain) = P(James wins and rain) / P(rain) = (1/4) / (2/4) = 50%, which is double the 25% obtained when rain was not taken into account. The new evidence (rain) strengthens our belief that James will win. This formula closely resembles Bayes' theorem, which is built on top of conditional probability and lies at the heart of Bayesian inference.
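The race example in the note above can be checked with a few lines of code. The sketch encodes the four races as described (which of Niki's three wins was the rainy one does not matter), computes the probability that James wins with and without the rain information, and then repeats the calculation in Bayes' theorem form. All variable and function names are illustrative.

```python
# Each race: (winner, did_it_rain). Niki won 3 races, James won 1;
# it rained once when each of them won.
races = [("Niki", False), ("Niki", False), ("Niki", True), ("James", True)]


def p_win(driver):
    """Unconditional probability that the driver wins."""
    return sum(winner == driver for winner, _ in races) / len(races)


def p_win_given_rain(driver):
    """Probability that the driver wins, restricted to rainy races."""
    rainy = [winner for winner, rain in races if rain]
    return sum(winner == driver for winner in rainy) / len(rainy)


prior = p_win("James")                 # belief before knowing about rain
posterior = p_win_given_rain("James")  # belief after learning it will rain

# The same number via Bayes' theorem: P(J | R) = P(R | J) * P(J) / P(R).
p_rain = sum(rain for _, rain in races) / len(races)
james_wins = [rain for winner, rain in races if winner == "James"]
p_rain_given_james = sum(james_wins) / len(james_wins)
via_bayes = p_rain_given_james * prior / p_rain

print(prior, posterior, via_bayes)
```

Both routes give 0.5 for the updated belief, compared with a prior of 0.25.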
  • #10 Statisticians tend to favor interpretation, whereas computer scientists often prefer black box models with good accuracy and broad applicability.
  • #13 Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data. For example, one kind of algorithm is a classification algorithm. It can put data into different groups. The same classification algorithm used to recognize handwritten numbers could also be used to classify emails into spam and not-spam without changing a line of code. It’s the same algorithm but it’s fed different training data so it comes up with different classification logic.
  • #14 Machine learning is widely used nowadays. Some examples we use on a daily basis: Facebook has used machine learning for ranking and personalizing News Feed stories, filtering out offensive content, highlighting trending topics, ranking search results, and recognizing image and video content. Google uses machine learning in almost every product. Photos: uses machine learning to recognize faces, locations, emotions, etc. Gmail: analyzes the content of emails and provides smart replies. YouTube: uses machine learning to improve search results. Previously it searched according to the meta tags and text provided by the content creator, but now it analyzes the video content to surface the best content for the user. Amazon uses machine learning for product recommendations. Uber uses machine learning in UberEATS to estimate food delivery times.
  • #15 “Machine learning” is an umbrella term covering lots of these kinds of generic algorithms. Supervised Learning: Let’s say you are a real estate agent. Your business is growing, so you hire a bunch of new trainee agents to help you out. But there’s a problem: you can glance at a house and have a pretty good idea of what it is worth, but your trainees don’t have your experience, so they don’t know how to price their houses. To help your trainees (and maybe free yourself up for a vacation), you decide to write a little app that can estimate the value of a house in your area based on its size, neighborhood, etc., and what similar houses have sold for. So you write down every time someone sells a house in your city for 3 months. For each house, you write down a bunch of details: number of bedrooms, size in square feet, neighborhood, etc. But most importantly, you write down the final sale price. Using that training data, we want to create a program that can estimate how much any other house in your area is worth. This is called supervised learning. You knew how much each house sold for, so in other words, you knew the answer to the problem and could work backwards from there to figure out the logic. To build your app, you feed your training data about each house into your machine learning algorithm. The algorithm tries to figure out what kind of math needs to be done to make the numbers work out. In supervised learning, you are letting the computer work out that relationship for you. And once you know what math was required to solve this specific set of problems, you can answer any other problem of the same type!
  • #17 The method for finding the line of best fit in multiple linear regression is exactly the same as in simple linear regression: the least squares method. The only thing that changes is the predicted value of the response, $\hat{y}_i$.
  • #21 Let’s go back to our original example with the real estate agent. What if you didn’t know the sale price for each house? Even if all you know is the size, location, etc. of each house, it turns out you can still do some really cool stuff. This is called unsupervised learning. This is kind of like someone giving you a list of numbers on a sheet of paper and saying “I don’t really know what these numbers mean, but maybe you can figure out if there is a pattern or grouping or something. Good luck!” So unsupervised learning is a broad term encompassing data analysis without a right answer. So what could you do with this data? For starters, you could have an algorithm that automatically identifies different market segments in your data. Maybe you’d find out that home buyers in the neighborhood near the local college really like small houses with lots of bedrooms, but home buyers in the suburbs prefer 3-bedroom houses with lots of square footage. Knowing about these different kinds of customers could help direct your marketing efforts. Another cool thing you could do is automatically identify any outlier houses that are very different from everything else. Maybe those outlier houses are giant mansions, and you can focus your best salespeople on those areas because they have bigger commissions.
  • #23 But it’s important to remember that machine learning only works if the problem is actually solvable with the data that you have. For example, if you build a model that predicts home prices based on the type of potted plants in each house, it’s never going to work. There just isn’t any kind of relationship between the potted plants in each house and the home’s sale price. So no matter how hard it tries, the computer can never deduce a relationship between the two. So remember, if a human expert couldn’t use the data to solve the problem manually, a computer probably won’t be able to either. Instead, focus on problems where a human could solve the problem, but where it would be great if a computer could solve it much more quickly. https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471 https://medium.com/@ageitgey/machine-learning-is-fun-part-2-a26a10b68df3
  • #25 Are you tired of reading endless news stories about deep learning and not really knowing what that means? Let’s change that!
  • #26 Images are very large: you can imagine that a dataset with 10 images could be 100 megabytes. What happens when you have 1,000 or 10,000 images? We’re working with color images, each with dimensions x, y, z, where x and y are specific to each photo. Image files, insofar as a computer understands them, are three layers of matrices stacked on top of each other, with each pixel being an individual entry in each matrix. So, to begin with, we use image algorithms to compress these images for processing. Many times, we grayscale and resize images so they’re smaller to work with. In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. You can see from this slide that the image of the traffic and surrounding view in the top figure is pixelated and then classified into segments that are represented as various colors in the bottom figure.
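The "grayscale and resize" step mentioned in the note above can be sketched without any imaging library by treating an image as a nested list of (R, G, B) pixels. The luminance weights below are one common convention, and the tiny 4×4 image is made up for illustration; real pipelines would typically rely on an image-processing library.

```python
def to_grayscale(image):
    """Collapse three color channels into one using common luminance weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]


def downsample_2x(gray):
    """Halve width and height by averaging each non-overlapping 2x2 block."""
    height, width = len(gray), len(gray[0])
    return [
        [(gray[y][x] + gray[y][x + 1] + gray[y + 1][x] + gray[y + 1][x + 1]) / 4
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


# Hypothetical 4x4 color image: left half white, right half black.
white, black = (255, 255, 255), (0, 0, 0)
image = [[white, white, black, black] for _ in range(4)]

gray = to_grayscale(image)   # 4x4, one value per pixel instead of three
small = downsample_2x(gray)  # 2x2
print(len(image) * len(image[0]) * 3, "values reduced to", len(small) * len(small[0]))
```

The 48 stored values of the color image shrink to 4, which shows why these steps make large image collections easier to process, at the cost of discarding color and fine detail.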
  • #27 This process can be clarified further by looking at another example. Image classification is the process of taking an input image and outputting a class number out of a set of categories. For example, if we know that our data consists of images of dogs, cats, birds, etc., we first classify an image as a dog based on a model trained on all the other images. In object localization, however, our job is not only to produce a class label but also a bounding box that describes where the object is in the picture. We also have the task of object detection, where localization needs to be done on all of the objects in the image. Therefore, you will have multiple bounding boxes and multiple class labels. Finally, in segmentation the task is to output a class label as well as an outline of every object in the input image.
  • #28 Any 3-year-old child can recognize a photo of a bird, but figuring out how to make a computer recognize objects has puzzled the very best computer scientists for over 50 years. In the last few years, we’ve finally found a good approach to object recognition using deep convolutional neural networks. That sounds like a bunch of made-up words from a William Gibson sci-fi novel, but the ideas are totally understandable if you break them down one by one.
  • #29 A NN typically contains one input layer, one or more hidden layers, and an output layer. The input layer consists of your p predictors, or input units / nodes. It is generally good practice to center, scale and transform predictors, if only to speed up the optimization procedure. These input units can be connected to one or more hidden units in the first hidden layer. A hidden layer that is fully connected to the preceding layer is called dense. In the diagram, both hidden layers are dense. The output layer computes the prediction, and the number of units therein is determined by the problem at hand. Conventionally, a binary classification problem requires a single output unit (as shown in the diagram), whereas a multiclass problem with k classes requires k corresponding output units. The former can simply use a sigmoid function to directly compute a probability, while the latter usually requires a softmax transformation, whereby the values across all k output units sum to one and can thus be treated as probabilities. Rather than having categorical predictions you can retrieve the actual probabilities, which are much more informative, and inspect their quality using calibration plots and lift charts. Every arrow displayed in the diagram passes on an input that is associated with a weight. Each weight is essentially one of many coefficient estimates that contribute to the regressions computed in the nodes the corresponding arrows point to. These are unknown parameters that must be tuned by the model so as to minimize the loss function, using an optimization procedure. In effect, for any particular observation each neuron can be mathematically represented as the equation that you see here. In this equation b denotes the intercept (also known as bias, and technically a weight itself) and W and x are m-long vectors carrying the weights and values from all m inputs, respectively. Before training, all weights are initialized with random values.
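The pieces described in the note above (a weighted sum of inputs plus a bias, a sigmoid for a single binary output, and a softmax across k outputs) can be written out directly. This is a sketch under assumptions: the slide's own equation is not reproduced in the notes, so the sigmoid activation inside `neuron` and all numeric values here are illustrative.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs, bias):
    """One unit: activation applied to the weighted sum W.x + b (sigmoid assumed)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def softmax(scores):
    """Turn k raw output scores into k probabilities that sum to one."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights, inputs and bias for a unit with m = 3 inputs.
binary_probability = neuron([0.5, -0.25, 0.1], [1.0, 2.0, 3.0], bias=0.2)

# Hypothetical raw scores from k = 3 output units.
class_probabilities = softmax([1.0, 2.0, 3.0])
print(binary_probability, class_probabilities)
```

Training then amounts to adjusting the weights and biases, starting from random values, so that these outputs minimize the chosen loss function.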
  • #31 To give you a flavor of some kernels, we look at an example of a kernel used for blurring. When you blur a photo in Photoshop, it does something like this: for each pixel, it takes a weighted average of the pixels around it. The pixel in the center gets a weight of 41, the closest pixels get a weight of 26, and the pixels at the edges get a weight of 1, so they still have some influence. On the other hand, you may want to emphasize edges, which means emphasizing contrast. A kernel that captures contrast takes differences into account: the weight in the middle may be zero, with negatives at the top and positives at the bottom. You could try these on a photograph yourself and see that they pick out the edges for you. Convolutional neural networks develop many different kernels that are themselves learned. You may say we need blurring, or we need features like contrast, but you don’t have to work out these features yourself. As part of this complex model fitting, working all the way backward from the right answer, the neural network figures out what features are needed. This is all accomplished with one fitting process. CNNs take these weighted averages of pixels in different ways in parallel, so one detects edges, another roundness, and they are all scored into “a face score” or “a car score”.
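The two kinds of kernel described in the note above can be tried on a tiny image. The blur kernel below is a widely used 5×5 Gaussian approximation whose weights (41 in the center, 26 next to it, 1 in the corners) match the numbers quoted in the note; that it is the same kernel as on the slide is an assumption. The edge kernel, the function name `apply_kernel`, and the 6×6 image are likewise illustrative.

```python
def apply_kernel(image, kernel, scale=1.0):
    """Slide the kernel over the image; at each position, multiply and sum."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            total = sum(kernel[i][j] * image[y + i][x + j]
                        for i in range(kh) for j in range(kw))
            row.append(total / scale)
        output.append(row)
    return output


# 5x5 Gaussian-style blur weights; dividing by their sum (273) makes it an average.
blur = [
    [1, 4, 7, 4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1, 4, 7, 4, 1],
]
# Contrast kernel: negatives at the top, zeros in the middle, positives at the bottom.
edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

# Hypothetical 6x6 grayscale image: dark top half, bright bottom half.
image = [[0] * 6 for _ in range(3)] + [[10] * 6 for _ in range(3)]

blurred = apply_kernel(image, blur, scale=273)
edges = apply_kernel(image, edge)
for row in edges:
    print(row)
```

The edge kernel responds only where the window straddles the dark-to-bright boundary, while the blur kernel smooths that boundary into intermediate values. A convolutional network learns kernel weights like these from data instead of having them written by hand.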
  • #32 Now, let’s go back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it is computing multiplications between the filter and the pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner. Remember, what we have to do is multiply the values in the filter with the original pixel values of the image. Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the multiplications summed together will result in a large value! Now let’s see what happens when we move our filter