3. Data Vs Information
Data is a collection of facts, while information
puts those facts into context.
While data is raw and unorganized, information is
organized.
Data points are individual and sometimes
unrelated. Information maps out that data to
provide a big-picture view of how it all fits
together.
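As a minimal sketch of this distinction, the snippet below takes a few raw sales records (data) and summarizes them per region (information). The records, field layout, and region names are made up for illustration only.

```python
from collections import defaultdict

# Raw data: individual, unorganized facts (hypothetical values).
raw_sales = [
    ("north", 120),
    ("south", 80),
    ("north", 200),
    ("south", 50),
]

# Information: the same facts organized to show how they fit together.
totals_by_region = defaultdict(int)
for region, amount in raw_sales:
    totals_by_region[region] += amount

for region, total in sorted(totals_by_region.items()):
    print(f"{region}: total sales = {total}")
```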
4. Database Vs Data warehouse
A database is any collection of data organized for
storage, accessibility, and retrieval.
Example: Employee Database
A data warehouse is a type of database that
integrates copies of transaction data from
disparate source systems and provisions them for
analytical use.
Example: a TCS data warehouse that integrates
data from the databases at its multiple locations
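The idea of copying data from several source databases into one analytical store can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the two location names, the `sales` and `sales_fact` tables, and the helper `make_source` are hypothetical, and a real warehouse would use dedicated tooling.

```python
import sqlite3


def make_source(rows):
    """Build a small in-memory source database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Two hypothetical source systems, one per office location.
sources = {
    "chennai": make_source([("laptop", 900.0), ("mouse", 20.0)]),
    "mumbai": make_source([("laptop", 950.0)]),
}

# The "warehouse": one database holding copies of all source rows,
# tagged with their origin and ready for analytical queries.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (location TEXT, item TEXT, amount REAL)"
)
for location, conn in sources.items():
    rows = conn.execute("SELECT item, amount FROM sales").fetchall()
    warehouse.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?)",
        [(location, item, amount) for item, amount in rows],
    )
warehouse.commit()

# An analytical query that spans every location.
totals = dict(
    warehouse.execute(
        "SELECT item, SUM(amount) FROM sales_fact GROUP BY item"
    ).fetchall()
)
print(totals)
```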
5. Big Data
Big data is a very large volume of data,
information, or relevant statistics acquired by
large organizations and ventures.
Because big data is difficult to process manually,
many software tools and data storage systems
have been created for it. It is used to discover
patterns and trends and to make decisions related
to human behavior and interaction technology.
6. Data Science
Data Science is a field or domain that involves
working with a huge amount of data and using it
for building predictive and prescriptive
analytical models.
It’s about digging, capturing (building the model),
analyzing (validating the model), and utilizing the
data (deploying the best model). It is an
intersection of data and computing.
It is a blend of the field of Computer Science,
9. Course Objectives
To train the students in solving computational
problems.
To elucidate solving mathematical problems
using Python programming language.
To understand the fundamentals of Python
programming concepts and its applications.
Practical understanding of building different
types of models and their evaluation.
11. UNIT I - INTRODUCTION TO DATA
SCIENCE
Introduction to Data Science and its
importance - Data Science and Big Data - The
life cycle of Data Science - The Art of Data
Science - Work with data: data cleaning,
data managing, data manipulation.
Establishing computational environments for
data scientists using Python with IPython and
Jupyter.
12. UNIT I - INTRODUCTION TO DATA
SCIENCE
Launch the IPython shell and the Jupyter notebook.
Write a Python script to control the behaviour of
IPython using magic commands.
Create a file called hello.py
Replace the missing values with the expected
(mean) income in the custdata dataset.
Import data in Python.
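The missing-value exercise above might be sketched as follows with pandas. The custdata file itself is not part of these slides, so a tiny hypothetical DataFrame with an `income` column stands in for it; with a real file, `pd.read_csv` would be one way to load it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the custdata dataset; only the idea matters.
custdata = pd.DataFrame(
    {
        "custid": [1, 2, 3, 4],
        "income": [30000.0, np.nan, 50000.0, np.nan],
    }
)

# Mean of the observed incomes (NaN values are skipped by default).
mean_income = custdata["income"].mean()

# Replace the missing incomes with that mean.
custdata["income"] = custdata["income"].fillna(mean_income)
print(custdata)
```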
13. UNIT II INTRODUCTION TO
NUMPY
NumPy Basics: Arrays and Vectorized
Computation- The NumPy ndarray- Creating
ndarrays- Data Types for ndarrays- Arithmetic
with NumPy Arrays- Basic Indexing and
Slicing - Boolean Indexing - Transposing Arrays
and Swapping Axes. Universal Functions:
Fast Element-Wise Array Functions -
Mathematical and Statistical Methods -
Sorting - Unique and Other Set Logic.
14. UNIT II INTRODUCTION TO
NUMPY
Create NumPy arrays from Python Data Structures,
Intrinsic NumPy objects and Random Functions.
Manipulation of NumPy arrays- Indexing, Slicing,
Reshaping, Joining and Splitting.
Computation on NumPy arrays using Universal
Functions and Mathematical methods.
Import a CSV file and perform various Statistical and
Comparison operations on rows/columns.
Load an image file and do crop and flip operation using
NumPy Indexing.
Write a program to compute summary statistics such as
mean, median, mode, standard deviation and variance
of the given different types of data.
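The summary-statistics exercise above could be approached as in the sketch below. The sample values are made up, and NumPy has no built-in mode function, so the standard library's `statistics.mode` is used for that part as one possible choice.

```python
import statistics

import numpy as np

values = np.array([2, 4, 4, 5, 7, 9])  # made-up sample

mean = values.mean()
median = np.median(values)
variance = values.var()  # population variance (ddof=0)
std_dev = values.std()   # population standard deviation (ddof=0)
mode = statistics.mode(values.tolist())  # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("variance:", variance)
print("standard deviation:", std_dev)
```

For the sample (rather than population) variance and standard deviation, pass `ddof=1` to `var` and `std`.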
15. UNIT III DATA MANIPULATION
WITH PYTHON
Introduction to pandas Data Structures:
Series, DataFrame, Essential Functionality:
Dropping Entries - Indexing, Selection, and
Filtering- Function Application and Mapping-
Sorting and Ranking. Summarizing and
Computing Descriptive Statistics- Unique
Values, Value Counts, and Membership.
Reading and Writing Data in Text Format.
16. UNIT III DATA MANIPULATION
WITH PYTHON
a. Create Pandas Series and DataFrame from
various inputs.
b. Import any CSV file to Pandas DataFrame and
perform the following:
Visualize the first and last 10 records
Get the shape, index and column details.
Select/delete records (rows) or columns based
on conditions.
Perform ranking and sorting operations.
Do required statistical operations on the given
columns.
Find the count and uniqueness of the given
categorical values.
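A minimal sketch of exercise (b) is shown below. To keep it self-contained, the CSV is an in-memory string with hypothetical columns (`name`, `dept`, `salary`); with a real file, the path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a real file (hypothetical data).
csv_text = """name,dept,salary
Asha,HR,40000
Ravi,IT,65000
Meena,IT,72000
John,HR,38000
Lata,Sales,50000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(10))  # first records (up to 10)
print(df.tail(10))  # last records (up to 10)
print(df.shape, df.index, df.columns)

high_paid = df[df["salary"] > 45000]      # select rows by condition
without_dept = df.drop(columns=["dept"])  # delete a column

by_salary = df.sort_values("salary", ascending=False)   # sorting
df["salary_rank"] = df["salary"].rank(ascending=False)  # ranking

print(df["salary"].describe())           # basic statistics
dept_counts = df["dept"].value_counts()  # count per category
unique_depts = df["dept"].unique()       # distinct categories
print(dept_counts)
print(unique_depts)
```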
17. UNIT IV DATA CLEANING,
PREPARATION AND
VISUALIZATION
Data Cleaning and Preparation: Handling
Missing Data - Data Transformation:
Removing Duplicates, Transforming Data
Using a Function or Mapping, Replacing
Values, Detecting and Filtering Outliers- String
Manipulation: Vectorized String Functions in
pandas. Plotting with pandas: Line Plots, Bar
Plots, Histograms and Density Plots, Scatter
or Point Plots.
18. UNIT IV DATA CLEANING,
PREPARATION AND
VISUALIZATION
a. Import any CSV file to Pandas DataFrame
and perform the following:
Handle missing data by detecting and
dropping/ filling missing values.
Transform data using apply() and map()
method.
Detect and filter outliers.
Perform Vectorized String operations on
Pandas Series.
Visualize data using Line Plots, Bar Plots,
Histograms, Density Plots and Scatter Plots.
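The cleaning steps listed above might look like the following sketch. The DataFrame, its column names, and the city-to-zone mapping are hypothetical, and the 1.5 x IQR rule is only one common convention for flagging outliers. Plotting is left out so that the example runs without a plotting backend.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "city": [" chennai", "Mumbai ", "delhi", None, "pune"],
        "temp_c": [31.0, np.nan, 29.0, 30.0, 95.0],  # 95.0 is a planted outlier
    }
)

# Detect missing values, then fill one column and drop rows for the other.
missing_counts = df.isna().sum()
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].median())
df = df.dropna(subset=["city"]).copy()

# Vectorized string operations on a Series.
df["city"] = df["city"].str.strip().str.title()

# Transform data with apply() (a function) and map() (a lookup table).
df["temp_f"] = df["temp_c"].apply(lambda c: c * 9 / 5 + 32)
zone_by_city = {"Chennai": "south", "Mumbai": "west", "Delhi": "north", "Pune": "west"}
df["zone"] = df["city"].map(zone_by_city)

# Detect and filter outliers with the 1.5 x IQR rule.
q1 = df["temp_c"].quantile(0.25)
q3 = df["temp_c"].quantile(0.75)
iqr = q3 - q1
filtered = df[df["temp_c"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(filtered)
```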
19. UNIT V MACHINE LEARNING
USING PYTHON
Introduction to Machine Learning: Categories of
Machine Learning algorithms, Dimensionality
reduction - Introducing Scikit-Learn - Application:
Exploring Hand-written Digits. Feature
Engineering - Naive Bayes Classification -
Linear Regression - k-Means Clustering.
20. UNIT V MACHINE LEARNING
USING PYTHON
Write a program to demonstrate Linear
Regression analysis with residual plots on a given
data set.
Write a program to implement the Naïve Bayesian
classifier for a sample training data set stored as
a .CSV file. Compute the accuracy of the
classifier, considering a few test data sets.
Write a program to implement k-Nearest
Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions using
Python ML library classes.
Write a program to implement k-Means clustering
algorithm to cluster the set of data stored in .CSV
file. Compare the results of various “k” values for
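As a starting point for the linear regression exercise in this list, the sketch below fits a straight line and computes the residuals. It is an assumption-laden illustration: the data are synthetic, and NumPy's least-squares `polyfit` is used in place of an ML library class so that the example needs only NumPy. A residual plot would show `residuals` against `x`.

```python
import numpy as np

# Synthetic data: y is roughly 2x + 1 with some random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Ordinary least-squares fit of a degree-1 polynomial (a line).
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the differences between observed and fitted values.
predicted = slope * x + intercept
residuals = y - predicted

print("slope:", slope)
print("intercept:", intercept)
print("largest absolute residual:", np.abs(residuals).max())
```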
21. Text Book(s)
Wes McKinney, “Python for Data Analysis:
Data Wrangling with Pandas, NumPy, and
IPython”, O’Reilly, 2nd Edition, 2018.
Jake VanderPlas, “Python Data Science
Handbook: Essential Tools for Working with
Data”, O’Reilly, 2017.
22. Reference Books
Y. Daniel Liang, “Introduction to Programming
using Python”, Pearson, 2012.
Francois Chollet, “Deep Learning with Python”, 1st
Edition, Manning Publications Company, 2017.
Peter Wentworth, Jeffrey Elkner, Allen B. Downey
and Chris Meyers, “How to Think Like a
Computer Scientist: Learning with Python 3”, 3rd
Edition. Available at
https://www.ict.ru.ac.za/Resources/cspw/thinkcspy3/thinkcspy3.pdf
Paul Barry, “Head First Python: A Brain-Friendly
Guide”, 2nd Edition, O’Reilly, 2016.
Daniel Y. Chen, “Pandas for Everyone: Python Data
Analysis”, Pearson Education, 2019.
24. Outline
Data, Big Data and Challenges
Data Science
Introduction
Why Data Science
Data Scientists
What do they do?
Major/Concentration in Data Science
What courses to take.
25. Data All Around
Lots of data is being collected
and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
26. How Much Data Do We have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day
(5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs
(100 MB/s)
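The read-time figure above can be checked with a short calculation. Decimal units are assumed here (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the slide's "3 hrs" is the rounded result.

```python
TB = 10**12  # bytes in one terabyte (decimal units, an assumption)
MB = 10**6   # bytes in one megabyte

read_rate = 100 * MB      # bytes per second, from the slide
seconds = TB / read_rate  # time to read the whole disk once
hours = seconds / 3600

print(f"{seconds:.0f} s = {hours:.2f} h")  # prints "10000 s = 2.78 h"
```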
27. Big Data
Big Data is any data that is expensive to manage and
hard to extract value from
Volume
The size of the data
Velocity
The latency of data processing relative to the growing demand for
interactivity
Variety and Complexity
the diversity of sources, formats, quality, structures.
29. Types of Data We Have
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data once
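The single-scan constraint on streaming data can be illustrated with a one-pass computation of the mean and variance. The slide does not name a method; the sketch below uses Welford's online algorithm as one well-known option, and the function name `running_stats` and the sample readings are hypothetical.

```python
def running_stats(stream):
    """Return (count, mean, variance) after a single pass over stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance


# A generator can be consumed only once, like a real stream.
readings = (v for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, mean, variance = running_stats(readings)
print(count, mean, variance)
```

Each value is seen once and then discarded, so memory use stays constant no matter how long the stream is.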
30. What To Do With These Data?
Aggregation and Statistics
Data warehousing and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
31. Big Data and Data Science
“… the sexy job in the next 10 years will be
statisticians,” Hal Varian, Google Chief Economist
The U.S. will need 140,000-190,000 predictive
analysts and 1.5 million managers/analysts by
2018 (McKinsey Global Institute, June 2011)
New Data Science institutes being created or
repurposed – NYU, Columbia, Washington,
UCB,...
New degree programs, courses, boot-camps:
e.g., at Berkeley: Stats, I-School, CS, Astronomy…
One proposal (elsewhere) for an MS in “Big Data
Science”
32. What is Data Science?
An area that manages, manipulates, extracts, and
interprets knowledge from a tremendous amount of
data
Data science (DS) is a multidisciplinary field of
study with the goal of addressing the challenges
of big data
Data science principles apply to all data – big and
small
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
33. What is Data Science?
Theories and techniques from many fields and
disciplines are used to investigate and
analyze a large amount of data to help
decision makers in many industries such as
science, engineering, economics, politics,
finance, and education
Computer Science
Pattern recognition, visualization, data warehousing, High
performance computing, Databases, AI
Mathematics
Mathematical Modeling
Statistics
Statistical and Stochastic modeling, Probability.
37. Real Life Examples
Companies learn your secrets, shopping patterns,
and preferences
For example, can we know if a woman is pregnant,
even if she doesn’t want us to know? Target case
study
Data Science and election (2008, 2012)
1 million people installed the Obama Facebook app
that gave access to info on “friends”
38. Data Scientists
Data Scientist
The Sexiest Job of the 21st Century
They find stories, extract knowledge. They are not
reporters
39. Data Scientists
Data scientists are the key to realizing the
opportunities presented by big data. They
bring structure to it, find compelling patterns in it,
and advise executives on the implications for
products, processes, and decisions
40. What do Data Scientists
do?
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more ….
41. Concentration in Data Science
Mathematics and Applied Mathematics
Applied Statistics/Data Analysis
Solid Programming Skills (R, Python, Julia, SQL)
Data Mining
Database Storage and Management
Machine Learning and discovery