www.edureka.in/data-science
Data Science: Make Smarter Business Decisions
Slide 2
www.edureka.co/r-for-analytics
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions
Objectives
At the end of this session, you will be able to explain:
• What data mining is
• What data science is
• Why data scientists are needed
• The stages of data mining
• The roles and responsibilities of a data scientist
• Sentiment analysis on Zomato reviews
Slide 3
Data Science Applications: Wine Recommendation
Slide 4
Data Science Applications: Pizza Hut
Slide 5
Data Science Applications: Summarize News
Slide 6
How about this?
Slide 7
What’s Common in These Applications?
According to Wikipedia: Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science.
These scenarios involve:
• Storing, organizing and integrating huge amounts of unstructured data
• Processing and analyzing the data
• Extracting knowledge and insights, and predicting the future from the data
Big data can be stored in Hadoop. For more details on Hadoop, please refer to the Big Data and Hadoop blog: http://www.edureka.in/blog/category/big-data-and-hadoop/
Processing, analyzing, and extracting knowledge and insights are done through machine learning.
Together, these technologies and steps can be termed the data mining process.
Slide 8
Stages of Analytics / Data Mining
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Slide 9
Stages of Analytics / Data Mining
Knowledge Discovery in Databases (KDD)
Slide 10
What is Data Science?
“More data usually beats better algorithms,” for example when recommending movies or music based on past preferences.
No matter how sophisticated your algorithm is, it can often be beaten simply by having more data (and a less sophisticated algorithm).
Slide 11
Components of Data Science
Slide 12
What is R?
R is a programming language
R is an environment for statistical analysis
R is data analysis software
Slide 13
Data Science: Demand Supply Gap
Filled vs. unfilled jobs in big data, Vacancy/Filled (%)
Role                         Filled   Unfilled
Big Data Analyst               50        50
Big Data Architect             43        57
Big Data Engineer              44        56
Big Data Research Analyst      31        69
Big Data Visualizer            23        77
Data Scientist                 18        82
Gartner Says Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By 2015
http://www.gartner.com/newsroom/id/2207915
Slide 14
Slide 15
R: Characteristics
An effective and fast data handling and storage facility
A suite of operators for calculations on arrays, lists, vectors, etc.
A large, integrated collection of tools for data analysis and visualization
Graphical facilities for data analysis and display, either on screen or on paper
A well-developed and effective programming language, based on the ‘S’ language
A complete range of packages to extend and enrich the functionality of R
Slide 16
Data Visualization in R
This plot represents the locations of all the traffic signals in the city. It is recognizable as Toronto without any other geographic data being plotted: the structure of the city comes out in the data alone.
Slide 17
Data Science: Job Trends
Slide 18
Machine Learning
Many data mining algorithms can be used to build systems that read past data and produce a model that can accommodate future data and derive useful insight from it.
Such algorithms come under machine learning.
Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
Diagram: training data → ML algorithms → model
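The flow from training data through an algorithm to a model can be illustrated with a very small example. The sketch below is written in Python rather than the R used in the course, and it swaps in ordinary least squares line fitting as a stand-in for "an ML algorithm". The function names `fit_line` and `predict` and the toy numbers are assumptions made for illustration; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Learn a (slope, intercept) model from past data by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data that follows y = 2x + 1.
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

model = fit_line(train_x, train_y)   # training data -> algorithm -> model
prediction = predict(model, 5)       # model applied to future data
print(prediction)                    # 11.0
```

The same pattern (fit on past data, then predict on new data) applies to the more capable algorithms named later in the deck; only the fitting step changes.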
Slide 19
Machine Learning: Types of Learning
Supervised Learning
1. Uses a known dataset to make predictions.
2. The training dataset includes input data and response values.
3. From it, the supervised learning algorithm builds a model to make predictions of the response values for a new dataset.
Unsupervised Learning
1. Draws inferences from datasets consisting of input data without labeled responses.
2. Used for exploratory data analysis to find hidden patterns or groupings in data.
3. The most common unsupervised learning method is cluster analysis.
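To make the unsupervised case concrete, here is a minimal sketch of cluster analysis using k-means, written in plain Python as an illustration (the course itself uses R). The function names, the naive choice of the first k points as starting centroids, and the six toy points are all assumptions for this example. Note that no response values are supplied: the grouping comes from the input data alone.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (labels, centroids)."""
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A supervised method would instead be given the group of each training point up front and would learn to reproduce it for new points.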
Slide 20
Common Machine Learning Algorithms
Types of Learning
Supervised Learning Algorithms
• Naïve Bayes
• Support Vector Machines
• Random Forests
• Decision Trees
Unsupervised Learning Algorithms
• K-means
• Fuzzy Clustering
• Hierarchical Clustering
• Gaussian mixture models
• Self-organizing maps
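As one example from the supervised list, a Naïve Bayes classifier can label short review text as positive or negative, which is the kind of task involved in sentiment analysis on restaurant reviews. The sketch below is a minimal multinomial Naïve Bayes with add-one (Laplace) smoothing in plain Python. The four reviews are invented for illustration and are not real Zomato data, and the function names are assumptions; the course implements this in R.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_texts):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented example reviews (hypothetical, for illustration only).
reviews = [
    ("great food and friendly service", "positive"),
    ("tasty pizza great ambience", "positive"),
    ("cold food and slow service", "negative"),
    ("rude staff terrible pizza", "negative"),
]
model = train_naive_bayes(reviews)
print(classify(model, "great tasty food"))        # positive
print(classify(model, "terrible slow service"))   # negative
```

With only four training reviews this is a toy; a realistic model would need far more labeled text and some preprocessing such as stop-word removal.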
Slide 21
Use Case : Zomato Ratings Review
Slide 22
Course Topics
• Module 1
» Introduction to Data Science
• Module 2
» Basic Data Manipulation using R
• Module 3
» Machine Learning Techniques using R Part 1
- Clustering
- TF-IDF and Cosine Similarity
- Association Rule Mining
• Module 4
» Machine Learning Techniques using R Part 2
- Supervised and Unsupervised Learning
- Decision Tree Classifier
• Module 5
» Machine Learning Techniques using R Part 3
- Random Forest Classifier
- Naïve Bayes Classifier
• Module 6
» Introduction to Hadoop Architecture
• Module 7
» Integrating R with Hadoop
• Module 8
» Mahout Introduction and Algorithm Implementation
• Module 9
» Additional Mahout Algorithms and Parallel Processing in R
• Module 10
» Project
Slide 23
Questions?
Enroll for the complete course at: www.edureka.in/data_science
Please don’t forget to fill in the survey report.
Class Recording and Presentation will be available in 24 hours at:
http://www.edureka.in/blog/application-of-clustering-in-data-science-using-real-life-examples/
