The session focused on data mining with the R language: I analyzed a large volume of text files to extract meaningful insights, using concepts such as DocumentTermMatrix and WordCloud.
2. Agenda
01 Introduction to R
02 Data Structures in R
03 Machine Learning
04 Data Visualization
05 Text Mining
3. Introduction to R
◾ R is a programming language and software environment for statistical analysis, graphical
representation, and reporting.
◾ It was built by statisticians and is widely used by statisticians and data miners for statistical
computing and graphics.
◾ It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the
year 1993 and is currently developed by the R Development Core Team.
◾ It is an interpreted language.
◾ It can call procedures written in C, C++, .NET, Python, or Fortran for efficiency.
4. R vs. Python
[Figure: chart with a year axis from 2013 to 2018]
◾ Speed – Winner = R
◾ Memory management – Winner = Python
◾ Visualization – Winner = R
◾ Deep Learning support – Winner = Python
◾ R is great for statistical analysis, and RStudio makes everyday tasks much easier.
◾ Python is great for deep learning tasks.
6. Machine Learning
Linear Regression:
◾ Regression analysis is a predictive modeling technique that investigates the relationship
between a dependent (target) variable and one or more independent (predictor) variables.
◾ It is a supervised learning technique.
◾ Here, we fit a curve or line to the data points so that the overall distance of the data points
from the curve or line is minimized (in ordinary least squares, the sum of squared vertical distances).
◾ In linear regression, the dependent variable is continuous, the independent variable(s) can be
continuous or discrete, and the regression line is linear.
◾ Linear regression establishes a relationship between a dependent variable (Y) and one or more
independent variables (X) using a best-fit straight line (also known as the regression line).
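As a minimal sketch of the best-fit line described above, the following uses base R's lm() on a small made-up data set. The values of x and y are hypothetical and chosen only for illustration; they are not from the session.

```r
# Hypothetical toy data: y is roughly 2 * x + 1 with small deviations.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)

# Fit y = intercept + slope * x by ordinary least squares.
fit <- lm(y ~ x)

# Extract the fitted intercept and slope of the regression line.
coefs <- coef(fit)
print(coefs)

# Use the fitted line to predict Y for new, unseen values of X.
new_data <- data.frame(x = c(6, 7))
predicted <- predict(fit, newdata = new_data)
print(predicted)
```

Calling summary(fit) would additionally report standard errors and the R-squared value, which helps judge how well the line describes the data.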
8. Data Visualization
Using the “plotrix” package, we will see the following visualizations in action:
- 2D Pie Chart
- 3D Pie Chart
- Bar Chart
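The three chart types can be sketched as follows. This is an illustration under assumptions rather than the session's actual code: the data are hypothetical, the 2D pie chart and bar chart use base R's pie() and barplot(), and only the 3D pie chart uses pie3D() from plotrix, which is skipped if the package is not installed.

```r
# Hypothetical data for illustration only.
shares <- c(A = 40, B = 45, C = 15)

# Draw into a temporary PDF so the script also runs without a display.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# 2D pie chart (base graphics).
pie(shares, labels = names(shares), main = "2D Pie Chart")

# 3D pie chart (plotrix); drawn only if plotrix is available.
if (requireNamespace("plotrix", quietly = TRUE)) {
  plotrix::pie3D(shares, labels = names(shares), explode = 0.1,
                 main = "3D Pie Chart")
}

# Bar chart (base graphics).
barplot(shares, main = "Bar Chart", ylab = "Value")

# Close the graphics device and flush the file to disk.
dev.off()
```

In an interactive RStudio session the pdf() and dev.off() lines can be dropped, and each chart then appears in the Plots pane.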
9. Text Mining
◾ Text Mining, or Text Analytics, is one of the branches of Data Analytics, in which we look
specifically at textual data.
◾ It is the process of extracting meaningful insights from unstructured text.
◾ We can analyze the words and clusters of words used in documents with various
algorithms and packages.
◾ In the most general terms, text mining will “turn text into numbers”.
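One common way to turn text into numbers is a matrix of word counts with one row per document and one column per term. The sketch below builds such a matrix with base R only, so it runs without extra packages; the three documents are hypothetical, and the tm package's DocumentTermMatrix() is the usual tool for real corpora.

```r
# Hypothetical mini-collection of documents, for illustration only.
docs <- c(doc1 = "R is great for statistics",
          doc2 = "Python is great for deep learning",
          doc3 = "Text mining turns text into numbers")

# Lower-case each document and split it into words on non-letter characters.
tokens <- lapply(tolower(docs), function(d) {
  words <- strsplit(d, "[^a-z]+")[[1]]
  words[nzchar(words)]
})
names(tokens) <- names(docs)

# The vocabulary is the sorted set of distinct words across all documents.
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: rows are documents, columns are terms, cells are counts.
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  dtm[i, names(counts)] <- as.integer(counts)
}

# The term-document matrix is simply the transpose.
tdm <- t(dtm)

# Overall term frequencies, most frequent first (the input a word cloud needs).
term_freq <- sort(colSums(dtm), decreasing = TRUE)
print(dtm)
print(head(term_freq))
```

A real pipeline would usually also remove stop words such as "is" and "for" before counting, since they dominate the frequencies without carrying much meaning.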
10. Text Mining (contd.)
◾ CORPUS: A text corpus is a large collection of text documents assembled for analysis.
◾ Term Document Matrix and Document Term Matrix: