Introduction to Data Science and
13 August 2013
– Introduction to Data Science
– Hidden skills of a Data Scientist
– Failure of current statistical tools like SAS and Excel
– Introduction to R language
– R Basic Commands
– Running SQL Server with R
– Visualizing Data with R
– Introduction to Shiny
– Future of R
Data Science is all about telling a STORY from the data.
5 Hidden Skills for Data Scientists
– Be Clear: Is Your Problem Really A Big Data Problem?
– Communicating About Your Data
– Invest in Interactive Analytics, not
– Understand the Role and Quality of Human Evaluations of Data
– Spend Time on the Plumbing
Difference between Data Science and Big Data
Big data is more concerned with the engineering components of data and in
answering the following questions:
– How do you store it?
– How do you manipulate it?
– How do you run parallelized computations on it?
– How do you access it?
– How do you mine it?
But data science is more than that.
– It deals with looking at the algorithmic and mathematical aspects of
extracting knowledge from data.
– Data science applies advanced analytical tools and algorithms to generate
predictive insights and new product innovations that are a direct result of
Shortcomings of current visualization and statistical tools
– The most commonly-used statistical software tools either fail completely or are
too slow to be useful on huge data sets
– Limited scalability
– Little flexibility to adopt new, fast, scalable algorithms
– Problems printing charts in Excel: legend data, or sometimes the x or y axis, goes missing
– If there’s a value in the upper-left corner of the data set (A1 in this case), Excel fails
to chart the data correctly.
Introduction to R
– R is a computer language and run-time environment which is used for
data manipulation, statistics, and graphics
– Base R comes with a wide range of standard statistical and graphical
analyses built in, and user-developed extension packages add more.
– R is an expression-based language.
– It is possible to interface procedures written in C, C++, or Fortran
for efficiency, and to write additional primitives.
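To make "expression-based" concrete, here is a minimal sketch in base R. The variable names are illustrative only; the point is that `if`, braced blocks, and even operators all evaluate to values.

```r
# Every construct in R is an expression that yields a value.
x <- c(4, 8, 15, 16, 23, 42)

# `if` returns a value, so it can sit on the right-hand side of an assignment.
size <- if (length(x) > 5) "large" else "small"

# A braced block evaluates to its last expression.
spread <- {
  lo <- min(x)
  hi <- max(x)
  hi - lo
}

# Operators are ordinary functions: `+`(1, 2) is the same as 1 + 2.
three <- `+`(1, 2)

print(size)
print(spread)
print(three)
```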
R, And the Rise of the Best Software Money Can’t
R users rely on functions that have been developed for them
by statistical researchers, but they can also create their own or
modify the existing ones as per their needs.
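As a small sketch of both options, the block below defines a new function and adapts an existing one. `coef_var` and `mean_na` are hypothetical names chosen for this example; only `sd()` and `mean()` come from base R.

```r
# Hypothetical helper: coefficient of variation (standard deviation / mean).
coef_var <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Adapting an existing function: a mean() variant that always drops
# missing values. Extra arguments such as `trim` are passed through.
mean_na <- function(x, ...) mean(x, na.rm = TRUE, ...)

values <- c(10, 12, NA, 14)
print(coef_var(values))
print(mean_na(values))
```

Plain `mean(values)` would return `NA` here, which is why a wrapper like this can be convenient.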
Getting started with R
▪ Latest version: 3.0.1 for Windows
▪ Link to download the R setup: http://cran.r-project.org/bin/windows/base/
▪ 51.5 MB setup file
▪ GUI for R: RStudio. Latest version: 0.97.551
▪ Link to download RStudio
▪ 32.5 MB exe file.
Introduction to Shiny – R web UI
• The R package Shiny from RStudio supplies
– interactive web applications / dynamic HTML pages with plain R
– GUI for own needs
– Website as server
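A minimal sketch of such an app, assuming the shiny package is installed and a reasonably recent release that provides `fluidPage()` and `shinyApp()`. The input id `n`, the output id `hist`, and the titles are made up for this example.

```r
library(shiny)  # assumes the shiny package is installed

# UI: one slider input and one plot output.
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("n", "Number of observations:",
              min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: the plot is re-rendered whenever input$n changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal sample", xlab = "Value")
  })
}

app <- shinyApp(ui = ui, server = server)
# Launch it from an interactive session with: runApp(app)
```

Moving the slider in the browser redraws the histogram without a page reload, and no HTML or JavaScript has to be written by hand.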
What makes Shiny so special?
– Very Simple: Ready to Use Components
– Shiny is very slick, producing interactive and pleasant-looking web UIs.
– Event-driven (reactive programming): input <-> output (without requiring a
reload of the browser)
– Shiny user interfaces can be built entirely using R, or can be written directly
– A highly customizable slider widget with built-in support for animation.
– Pre-built output widgets for displaying plots, tables, and printed output of R
– Fast bidirectional communication between the web browser and R using the
R is the most powerful and flexible statistical programming language in the
Job trends in the statistical software market
Software      2012    2013   Difference   Ratio
SAS          13234   12272         -961    0.93
SPSS          3299    3289          -10    1.00
R             1196    1693          497    1.42
Minitab       1769    1615         -154    0.91
Stata          842     898           56    1.07
JMP            644     619          -25    0.96
Statistica      61      71           10    1.17
Systat          14      15            1    1.07
BMDP             6      10            3    1.53
[Chart: Trend of jobs on Indeed.com in March 2012 and 2013, for SAS, SPSS, R, Minitab, Stata, JMP, Statistica, Systat, and BMDP]
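The Difference and Ratio columns can be computed in a few lines of R. The sketch below uses four rows copied from the table; the data frame and column names (`jobs`, `y2012`, `y2013`) are chosen for this example, and Ratio is assumed to be the 2013 count divided by the 2012 count.

```r
# Four rows copied from the job-trend table.
jobs <- data.frame(
  software = c("SPSS", "R", "Minitab", "Stata"),
  y2012 = c(3299, 1196, 1769, 842),
  y2013 = c(3289, 1693, 1615, 898)
)

# Absolute change and relative change between the two years.
jobs$difference <- jobs$y2013 - jobs$y2012
jobs$ratio <- round(jobs$y2013 / jobs$y2012, 2)

# Sort so the fastest-growing package comes first.
jobs <- jobs[order(-jobs$ratio), ]
print(jobs)
```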
Final Words of Warning
• “Using R is a bit akin to smoking.
The beginning is difficult, one may
get headaches and even gag the
first few times. But in the long
run, it becomes pleasurable and
even addictive. Yet, deep
down, for those willing to be
honest, there is something not
fully healthy in it.” --Francois
Visualization is only one slice of R
R also deals with
• Machine Learning
• Social Media Analytics
• Sentiment Analysis
• Predictive Modeling
• Network Analysis
• Time Series Analysis
• And a lot more
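As one small illustration of predictive modeling, here is a sketch that fits a linear model with base R's `lm()` on the built-in `cars` data set (speed and stopping distance). The choice of data set and the new speeds are assumptions made for this example.

```r
# `cars` ships with base R: speed (mph) and stopping distance (ft) for 50 cars.
fit <- lm(dist ~ speed, data = cars)

# Predict stopping distance for a few chosen speeds.
new_speeds <- data.frame(speed = c(10, 20, 30))
predicted <- predict(fit, newdata = new_speeds)

print(coef(fit))
print(predicted)
```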
To be continued...