Data Science as a Career and Intro to R

A Discussion on Data Science as a Career Option
- By Anshik
- Under Student Mentorship Prog.

Overview
As data has multiplied, so has the ability to collect, organize, and analyze it. Data
storage is cheaper than ever, processing power is greater than ever, and tools for mining
the huge amounts of available data for business intelligence are more accessible than ever.
The McKinsey Global Institute predicted that by 2018 the U.S. could face a shortage of
1.5 million people who know how to leverage data analysis to make effective decisions.
Enter: you, taking stock of your three main career options: data analyst, data scientist,
and data engineer.

Career options and the differences between them
Data Analyst (1.6L - 8L)
● Solves problems using existing tools.
● No mathematical or research background required.
● Manages the quality of scraped data, queries databases, and serves data as visualizations.
Data Scientist (3.5L - 18L)
● Similar to a data analyst in many aspects.
● Responsible for doing undirected research and tackling open-ended problems and questions.
● A data analyst summarizes the past; a data scientist strategizes for the future.
Data Engineer (3L - 21L)
● Does the groundwork for the former two.
● Responsible for compiling and installing database systems, writing complex queries, scaling to multiple machines, and putting disaster recovery systems into place.

Should you put in the time and effort?

What do you think?
Consider a data set that contains the salaries of people who work at an organization.
-- What questions can be formed?
-- What interpretations can be made?

1. Most of the positions sought Master's / PhD students (especially in statistics).

2. Learning from MOOCs is not easy and is time-consuming.

3. Condense what you know in a presentable manner.

Srikanth Velamakanni, CEO of CA-headquartered Fractal Analytics:
“In the next few years, the size of the analytics market will evolve to at least one-thirds of the global IT market from the current one-tenths”

Big Data Analytics Job Trends

Key points
● Huge Job Opportunities & Meeting the Skill Gap
● Salary Aspects
● The Rise of Unstructured and Semistructured Data Analytics
● Used Everywhere

Total Enterprise Data Growth 2005-2015
The way we capture, store, analyze, and distribute data is transforming.
Deduplication, compression, and analysis tools are lowering costs.

Tools and Resources

Categories and Links
Books: ISLR, R for Dummies, Advanced R, Machine Learning for Hackers (Py), NLP with Python
Websites and Blogs: Analytics Vidhya, R-bloggers, Kaggle Scripts, CrowdAnalytics, students.brown.edu, github.io
Statistics and Linear Algebra: Inferential and Descriptive Statistics by Udacity, MSR sir’s Prob & Stats slides, Khan Academy (Linear Algebra)
Machine Learning and AI: Andrew Ng's ML class, Johns Hopkins Data Analysis, Deepak Khemani (AI, NPTEL)
Data Storage and Visualization: MongoDB (Udacity), D3.js documentation and wiki

Timeline (Weeks) [Beginners]
Week markers: 1, 3, 5, 7, 10, 12, 14, 20
● Learn the language - R/Python
● Start doing hackathons/pet projects
● Practice the language, finish an intro to ML
● Do more advanced ML, start optimizing your code. Start reading git commits

Intro To ML & R
Installing Packages :-
To install a package, use the install.packages() function. Once a package is installed, it must be loaded
into your current R session with library() or require() before it can be used. Think of this as taking the
book off of the shelf and opening it up to read.
TIP :- Use require() when you want to check whether a package is available: it returns FALSE (with a
warning) if the package is not found, whereas library() stops with an error.
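The load-and-check pattern from the tip can be sketched as follows. The package name notARealPackage123 is made up for illustration and is assumed not to be installed; stats ships with every R installation, so it stands in for a package that loads successfully.

```r
# stats ships with R, so loading it should always succeed.
stats_loaded <- require("stats", character.only = TRUE)

# Hypothetical package name, assumed NOT to be installed, to show the failure case.
# require() returns FALSE with a warning instead of stopping with an error.
missing_loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

# Typical guard: install only when the package is missing.
if (!missing_loaded) {
  message("Package not found; you would run install.packages(\"<name>\") here.")
}
```

The return value of require() is what makes this guard possible; library() would have stopped the script at the missing package.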
Data Types :-
R has a number of basic data types.
1. Numeric :- Also known as double. The default type when dealing with numbers. Examples: 1, 1.0, 42.5
2. Integer :- Examples: 1L, 2L, 42L
3. Complex :- Example: 4 + 2i
4. Logical :- Two possible values: TRUE and FALSE. You can also use T and F, but this is not recommended. NA is also considered logical.
5. Character :- Examples: "a", "Statistics", "1 plus 2."
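A quick way to check these types in a session is typeof(). The short sketch below reuses the example values from the list; the variable names values and basic_types are only illustrative.

```r
# typeof() reports the basic (internal) type of each value.
values <- list(1, 42.5, 1L, 4 + 2i, TRUE, NA, "Statistics")
basic_types <- vapply(values, typeof, character(1))
print(basic_types)

# is.* helpers test for one specific type.
bare_one_is_integer <- is.integer(1)    # FALSE: a bare 1 is stored as a double
suffixed_is_integer <- is.integer(1L)   # TRUE: the L suffix requests an integer
```

Note that a plain 1 is a double, not an integer, which is why the L suffix exists.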

R Object-Oriented Systems
S3
● Lacks a formal class definition
● Objects are created by setting the class attribute
● Attributes are accessed using $
● Methods belong to generic functions
● Follows copy-on-modify semantics
S4
● Class defined using setClass()
● Objects are created using new()
● Attributes are accessed using @
● Methods belong to generic functions
● Follows copy-on-modify semantics
Reference Classes
● Class defined using setRefClass()
● Objects are created using generator functions
● Attributes are accessed using $
● Methods belong to the class
● Does not follow copy-on-modify semantics
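The three systems can be compared side by side with a toy example. This is a minimal sketch, not a recommended design: the class names car_s3, CarS4 and CarRC, the describe() generic, and the accelerate() method are all invented for illustration.

```r
library(methods)  # provides setClass(), new(), setRefClass()

# S3: no formal definition; an object is a list with a class attribute.
s3_car <- structure(list(speed = 10), class = "car_s3")
describe <- function(x, ...) UseMethod("describe")           # generic function
describe.car_s3 <- function(x, ...) paste("S3 car at speed", x$speed)

# S4: formal class via setClass(), objects via new(), slots via @.
setClass("CarS4", representation(speed = "numeric"))
s4_car <- new("CarS4", speed = 10)
s4_speed <- s4_car@speed

# Reference class: generator object, methods live inside the class.
CarRC <- setRefClass(
  "CarRC",
  fields = list(speed = "numeric"),
  methods = list(
    accelerate = function(by) {
      speed <<- speed + by   # modifies the object in place
      invisible(.self)
    }
  )
)
rc_car <- CarRC$new(speed = 10)

# Copy-on-modify: changing the S3 copy leaves the original untouched.
s3_copy <- s3_car
s3_copy$speed <- 99

# Reference semantics: rc_alias and rc_car are the same object.
rc_alias <- rc_car
rc_alias$accelerate(5)

c(s3_original = s3_car$speed, rc_original = rc_car$speed)
```

After running this, the S3 original still has speed 10 while the reference-class original has changed to 15, which is the practical meaning of the last bullet in each list.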

We will use the built-in cars dataset that ships with base R (in the datasets package).
Data gathered during the 1920s about the speed of cars
and the resulting distance it takes for the car to come to a
stop.
Objective :- How far does a car travel before stopping when
traveling at a certain speed?
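Before modelling, it helps to look at the data. The sketch below only uses base R functions; according to the dataset's help page (?cars), speed is recorded in mph and dist in feet. The names n_obs and var_names are illustrative.

```r
# cars is available in every R session via the datasets package.
str(cars)       # structure: number of rows and the two numeric columns
head(cars)      # first few rows
summary(cars)   # quick numeric summary of both columns

n_obs <- nrow(cars)
var_names <- names(cars)
```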

What sort of function should we use for f(X) in Y = f(X) + ϵ for
the cars data?
- A horizontal line, f(x) = c?
We see this doesn’t seem to do a very good job. Many of
the data points are very far from the orange line
representing c. This is an example of underfitting.
- Make f(x) depend on x, for example a straight line with a slope?
As speed increases, the distance required to come to a
stop increases. There is still some variation about this
line, but it seems to capture the overall trend.
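One way to put a number on "doesn't do a very good job" is to compare the residual sum of squares of the two candidates. This is a sketch under one assumption: the horizontal line is taken to be the least-squares constant, which is the mean of dist. The helper rss() is defined here for illustration.

```r
# Horizontal line: intercept-only model, f(x) = c with c = mean(dist).
const_model <- lm(dist ~ 1, data = cars)

# Line that depends on x: f(x) = b0 + b1 * speed.
line_model <- lm(dist ~ speed, data = cars)

# Residual sum of squares: total squared vertical distance to the fitted line.
rss <- function(model) sum(resid(model)^2)
rss_const <- rss(const_model)
rss_line <- rss(line_model)

c(constant = rss_const, line = rss_line)
```

The sloped line leaves a much smaller residual sum of squares than the constant, which matches what the scatter plot suggests.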

Assumptions of Linear Regression
LINE
Linear. The relationship between Y and x is linear, of the form β0 + β1x.
Independent. The errors ϵ are independent.
Normal. The errors ϵ are normally distributed. That is, the “error” around the line follows a
normal distribution.
Equal Variance. At each value of x, the variance of Y is the same, σ².
We have to find the line that minimizes the sum of squared vertical distances from the points to the line.
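For a single predictor, the minimizing line has a well-known closed form: the slope is Sxy / Sxx and the intercept is mean(y) - slope * mean(x). The sketch below computes both by hand for the cars data and checks them against lm(); the variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Sxy: sum of cross-deviations; Sxx: sum of squared deviations of x.
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

# Least-squares estimates of slope and intercept.
beta_1_hat <- Sxy / Sxx
beta_0_hat <- mean(y) - beta_1_hat * mean(x)

# lm() should give the same numbers (up to floating-point error).
lm_coefs <- unname(coef(lm(dist ~ speed, data = cars)))
all.equal(c(beta_0_hat, beta_1_hat), lm_coefs)
```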

lm()
stop_dist_model = lm(dist ~ speed, data = cars)
The abline() function is used to add lines of the form a + bx to a plot. (Hence abline.) When we give it
stop_dist_model as an argument, it automatically extracts the regression coefficient estimates (β̂0 and
β̂1) and uses them as the intercept and slope of the line. We can also pass lwd to modify the width of
the line, and col to modify the color of the line.
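A minimal sketch of the plot this describes, using base graphics. The axis labels, title, point style and the particular lwd and col values are illustrative choices, not taken from the slides.

```r
stop_dist_model <- lm(dist ~ speed, data = cars)

# Scatter plot of the raw data.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs speed",
     pch = 20, col = "grey40")

# abline() pulls the intercept and slope out of the fitted model.
abline(stop_dist_model, lwd = 3, col = "darkorange")

# The same two numbers abline() used, in the order (intercept, slope).
fitted_coefs <- coef(stop_dist_model)
```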
The lm() function returns an object of class "lm".
We can access its members using the $ operator.
> names(stop_dist_model)
> stop_dist_model$residuals
Use summary() to summarize the output of the linear regression. The summary() command also returns a
list, and we can again use names() to learn about the elements of this list.
> names(summary(stop_dist_model))
> summary(stop_dist_model)$r.squared
Use the predict() function to predict the output for given input values.
> predict(stop_dist_model, data.frame(speed = c(8, 21, 50)))
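When predicting, it is worth checking whether the new speeds fall inside the range the model was fitted on. The sketch below repeats the call above and adds that check; the column names predicted_dist and within_observed_range are invented for illustration.

```r
stop_dist_model <- lm(dist ~ speed, data = cars)

new_speeds <- data.frame(speed = c(8, 21, 50))
pred <- predict(stop_dist_model, newdata = new_speeds)

# Flag inputs outside the observed speeds: those predictions are extrapolations.
in_range <- new_speeds$speed >= min(cars$speed) &
  new_speeds$speed <= max(cars$speed)

cbind(new_speeds, predicted_dist = pred, within_observed_range = in_range)
```

Here 8 and 21 lie within the observed speeds, while 50 is well beyond them, so the last prediction relies on the linear trend continuing and should be treated with caution.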

Thank You
-Anshik
8826274098 (WhatsApp)
