Presented By: Sarfaraz Hussain
Software Consultant
Knoldus Inc
Text Mining using R
Agenda
01 Introduction to R
02 Data Structures in R
03 Machine Learning
04 Data Visualization
05 Text Mining
Introduction to R
◾ R is a programming language and software environment for statistical analysis, graphical representation,
and reporting.
◾ It was built by statisticians and is widely used by statisticians and data miners for statistical
computing and graphics.
◾ It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in
1993 and is currently developed by the R Core Team.
◾ It is an interpreted language.
◾ It can call procedures written in C, C++, .NET, Python, or Fortran where efficiency
matters.
R vs. Python
[Chart: R vs. Python, 2013 to 2018]
◾ Speed – Winner = R
◾ Memory management – Winner = Python
◾ Visualization – Winner = R
◾ Deep Learning support – Winner = Python
◾ R is great for statistical analysis, and the RStudio IDE makes everyday tasks much easier.
◾ Python is great for deep learning tasks.
Data Structures in R
Machine Learning
Linear Regression:
◾ Regression analysis is a predictive modeling technique that investigates the relationship
between a dependent (target) variable and one or more independent (predictor) variables.
◾ It is a supervised learning technique.
◾ Here, we fit a curve or line to the data points so that the distances of the data points from the
curve or line are minimized (typically the sum of the squared vertical distances).
◾ In linear regression, the dependent variable is continuous, the independent variable(s) can be
continuous or discrete, and the regression line is linear.
◾ Linear regression establishes a relationship between the dependent variable (Y) and one or more
independent variables (X) using a best-fit straight line (also known as the regression line).
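The bullets above can be sketched with base R's `lm()`, which fits a least-squares line. The data and the variable names `hours` and `score` are made up for illustration and are not from the talk.

```r
# Hypothetical data: hours is the independent variable (X),
# score is the dependent variable (Y).
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
score <- c(52, 55, 61, 64, 70, 73, 79, 82)

# Fit the best-fit straight line score = intercept + slope * hours
# by minimizing the sum of squared residuals.
model <- lm(score ~ hours)

intercept <- coef(model)[["(Intercept)"]]
slope     <- coef(model)[["hours"]]
print(coef(model))

# Predict the dependent variable for a new value of the predictor.
predicted <- predict(model, newdata = data.frame(hours = 10))
print(predicted)

# Residuals are the vertical distances between the points and the line;
# their squared sum is the quantity the fit minimizes.
print(sum(residuals(model)^2))
```

`summary(model)` prints the same coefficients along with standard errors and the R-squared value.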
Machine Learning (contd.)
Data Visualization
Using the “plotrix” package, we will see the following visualizations in action:
- 2D Pie Chart
- 3D Pie Chart
- Bar Chart
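A minimal sketch of the three charts, assuming a small made-up data set (the `topics` counts are hypothetical). The 2D pie chart and bar chart use base R graphics. The 3D pie chart uses `pie3D()` from plotrix and is skipped if the package is not installed.

```r
# Hypothetical example data: number of talks per topic.
topics <- c(R = 12, Python = 9, Scala = 6, Java = 3)

# Build labels such as "R (40%)" from the share of each slice.
pct    <- round(100 * topics / sum(topics))
labels <- paste0(names(topics), " (", pct, "%)")

# 2D pie chart (base graphics).
pie(topics, labels = labels, main = "2D Pie Chart")

# 3D pie chart (plotrix), drawn only when the package is available.
if (requireNamespace("plotrix", quietly = TRUE)) {
  plotrix::pie3D(topics, labels = labels, explode = 0.1, main = "3D Pie Chart")
}

# Bar chart (base graphics).
barplot(topics, main = "Bar Chart", xlab = "Topic", ylab = "Talks")
```

If plotrix is missing, `install.packages("plotrix")` installs it from CRAN.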
Text Mining
◾ Text mining, or text analytics, is a branch of data analytics that focuses specifically on
textual data.
◾ It is the process of extracting meaningful insights from unstructured text.
◾ We can analyze the words, and clusters of words, used in documents with various
algorithms and packages.
◾ In the most general terms, text mining will “turn text into numbers”.
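To make "turning text into numbers" concrete, here is a minimal base R sketch that counts word frequencies. The input sentence is made up for illustration, and the cleaning steps (lowercasing, stripping punctuation) are a simplifying assumption rather than a full preprocessing pipeline.

```r
# Hypothetical input text, made up for illustration.
text <- "Text mining turns text into numbers. Numbers are easy to analyze."

# Lowercase, strip punctuation, and split on whitespace.
clean <- tolower(gsub("[[:punct:]]", "", text))
words <- strsplit(clean, "\\s+")[[1]]

# Count how often each word occurs, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Each word is now represented by a count, which is the simplest numeric form that later analysis can build on.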
Text Mining (contd.)
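◾ Corpus: A text corpus is a large collection of text documents, usually unstructured, gathered for analysis.
◾ Term-Document Matrix and Document-Term Matrix: a term-document matrix has one row per term and one
column per document, with each cell holding how often the term occurs in that document. A
document-term matrix is its transpose.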
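The two matrices can be sketched in base R as follows. The three-document corpus is made up for illustration. In practice the tm package provides `TermDocumentMatrix()` and `DocumentTermMatrix()` for this. The sketch avoids it only so that it runs without extra packages.

```r
# Hypothetical three-document corpus, made up for illustration.
docs <- c(doc1 = "r is great for statistics",
          doc2 = "python is great for deep learning",
          doc3 = "text mining in r")

# Tokenize each document on whitespace.
tokens <- strsplit(docs, "\\s+")

# Vocabulary: every distinct term in the corpus, sorted.
terms <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document,
# each cell holding how often the term occurs in that document.
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms

# Document-term matrix: the transpose (documents as rows, terms as columns).
dtm <- t(tdm)

print(tdm)
print(dtm)
```

Column sums of `dtm` give the total frequency of each term across the corpus, which is a common next step for finding frequent terms.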
Thank You!
