This document discusses data science as a career option and provides an overview of the roles of data analyst, data scientist, and data engineer. It notes that data analysts solve problems using existing tools and manage data quality, while data scientists are responsible for undirected research and strategic planning. Data engineers compile and install database systems. The document also outlines the typical salaries for each role and discusses the growing demand for data science skills. It provides recommendations for learning tools and resources to pursue a career in data science.
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
Data Science as a Career and Intro to R
1. A Discussion on Data Science as a
Career option
-By Anshik
- Under Student Mentorship Prog.
2.
3. Overview
As data has multiplied, so has the ability to collect, organize, and analyze it. Data
storage is cheaper than ever, processing power is more massive than ever, and tools are
more accessible than ever to mine huge amount of available data for business
intelligence.
The McKinsey Global Institute predicted that by 2018 the U.S. could face a shortage of
1.5 million people who know how to leverage data analysis to make effective decisions.
Enter: you, taking stock of your three main career options: data analyst, data scientist,
and data engineer
4. Career options and difference between them
Data Analyst (1.6l - 8L)
Solve problems using
existing tools
No mathematical or
research background
required.
Manage quality of scraped
data, querying databases
and serve data as
visualization.
Data Scientist(3.5L - 18L)
Similar to data analyst in
many aspects.
Responsible for doing
undirected research and
tackle open-ended
problems and questions.
Data analyst summarizes
the past; a data scientist
strategizes for the future
Data Engineer(3L - 21L)
Does groundwork for the
former two.
Responsible for compiling
and installing database
systems, writing complex
queries, scaling to multiple
machines, and putting
disaster recovery systems
into place.
6. What do you think?
Data set that contains the
salaries of people who work
at an organization.
-- What questions can be
formed?
-- What Interpretations can
be made?
7. 1.Most of the
positions sought
Masters / PhD
students (especially
in statistics).
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed do
eiusmod tempor incididunt
The competition:
● Lorem ipsum
● Dolor sit amet
8. 2.Learning from
MOOCs is not easy
and is
time-consuming
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed do
eiusmod tempor incididunt
The competition:
● Lorem ipsum
● Dolor sit amet
9. 3.Condense what
you know in
presentable
manner.
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed do
eiusmod tempor incididunt
The competition:
● Lorem ipsum
● Dolor sit amet
11. Srikanth Velamakanni,CEO of CA
headquartered Fractal Analytics:
“In the next few years, the size of the
analytics market will evolve to at
least one-thirds of the global IT
market from the current one-tenths”
13. Key points
● Huge Job Opportunities & Meeting the Skill Gap
● Salary Aspects
● The Rise of Unstructured and Semistructured Data Analytics
● Used Everywhere
14. Total Enterprise Data Growth 2005-2015
The way we capture, store,
analyze, and distribute data
is transforming.
Deduplication,compression,
and analysis tools are
lowering costs.
15. Tools and
Resources
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed do
eiusmod tempor incididunt
The competition:
● Lorem ipsum
● Dolor sit amet
16. Categories and Links
Books
ISLR, R for Dummies, Advanced R, Machine learning
for Hackers(Py), NLP with Python
Websites and Blogs
Analytics Vidhya, Rbloggers, Kaggle Scripts,
CrowdAnalytics, students.brown.edu, github.io
Statistics and Linear
Algebra
Inferential and Descriptive statistics by Udacity,
MSR sir’s Prob & stats Slide, Khan Acad(Lin.Alg)
Machine Learning
and AI
Andrew Ng's ML Class, John Hopkins Data Analysis,
Deepak Khemani(AI-nptel)
Data Storage and
Visualization
MongoDB(Udacity), D3.js documentation and wiki
17. 1 3 5 7 10 12 14 20
Timeline(Weeks)[Beginers]
Learn the
Language -
R/Python
Start Doing
Hackathons/Pet Projects
Practice the
Langauge, Finish
Intro in ML
Do more advance
ML, start optimizing
your code.Start
reading git commits
18. Intro To ML & R
Installing Packages :-
To install a package, use the install.packages() function. Once a package is installed, it must be loaded
into your current R session before being used using library() or require(). Think of this as taking the
book off of the shelf and opening it up to read.
TIP :- Use require function for loading a package as it throws false if package is not found.
Data Types :-
R has a number of basic data types.
1. Numeric :- Also known as Double. The default type when dealing with numbers.
Examples: 1, 1.0, 42.5
2. Integer: - Examples: 1L, 2L, 42L
3. Complex : - Example: 4 + 2i
4. Logical : - Two possible values: TRUE and FALSE, you can also use T and F, but this is not
recommended.
NA is also considered logical.
5. Character :- Examples: "a", "Statistics", "1 plus 2."
19.
20. R Object oriented System
S3
Lacks formal definition
Objects are created by
setting the class attribute
Attributes are accessed
using $
Methods belong to generic
function
Follows copy-on-modify
semantics
S4
Class defined using
setClass()
Objects are created using
new()
Attributes are accessed
using @
Methods belong to generic
function
Follows copy-on-modify
semantics
Reference Classes
Class defined using
setRefClass()
Objects are created using
generator functions
Attributes are accessed
using $
Methods belong to the
class
Does not follow
copy-on-modify semantics
22. We will use inbuilt Cars dataset in R-base
Data gathered during the 1920s about the speed of cars
and the resulting distance it takes for the car to come to a
stop.
Objective :- How far a car travels before stopping, when
traveling at a certain speed?
23. What sort of function should we use for f(X)[Y=f(X) +e) for
the cars data?
- A Horizontal Line?
We see this doesn’t seem to do a very good job. Many of
the data points are very far from the orange line
representing cc . This is an example of underfitting.
- Make f(x) depend on x
- As speed increases, the distance required to come to a
stop increases. There is still some variation about this
line, but it seems to capture the overall trend.
24.
25. Assumptions of Linear Regression
LINE
Linear. The relationship between Y and x is linear, of the form β0+β1x .
Independent. The errors ϵ are independent.
Normal. The errors, ϵ are normally distributed. That is the “error” around the line follows a
normal distribution.
Equal Variance. At each value of x , the variance of Y is the same, σ2 .
We have to find a line that minimize sum of all squared distances from point to line.
26. lm()
stop_dist_model = lm(dist ~ speed, data = cars)
The abline() function is used to add lines of the form a+bx to a plot. (Hence abline.) When we give it
stop_dist_model as an argument, it automatically extracts the regression coefficient estimates ( β̂0 and
β̂1) and uses them as the slope and intercept of the line. Here we also use lwd to modify the width of
the line, as well as col to modify the color of the line.
lm() function returns an object of class lm()
We can access the members using $ operator
> names(stop_dist_model)
> stop_dist_model$residuals
Use summary() to summarize the output for linear regression.The summary() command also returns a
list, and we can again use names() to learn what about the elements of this list.
> names(summary(stop_dist_model))
> summary(stop_dist_model)$r.squared
Use predict function to predict output for certain input values
> predict(stop_dist_model, data.frame(speed = c(8, 21, 50)))
27. Thank You
-Anshik
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed do
eiusmod tempor incididunt
The competition:
● Lorem ipsum
● Dolor sit amet
8826274098 (Watsapp)