1. FUNDAMENTALS of DATA SCIENCE
Third Year Computer Science & Engineering (Data Science)
By:
Mr. Ganesh. I. Rathod
H.O.D, Data Science
D Y Patil College of Engineering
Salokhe Nagar, Kolhapur
19-11-2022 Department of Data Science Engineering 1
2. Contents
• Course Objectives.
• Course Outcomes.
• Introduction to Data Science
• Understanding the Syllabus
• Content Beyond the Syllabus
• Online Resources
3. Course Objective
Course Description:
• The aim is to bring students up to date with common tools used for Data Science
application development. The course serves as an introduction to the basics of data
science, including programming for data analytics.
Course Objectives:
1. To provide the students with the basic knowledge of Data Science.
2. To make the students develop solutions using Data Science tools.
3. To introduce them to Python packages and their usability.
4. Course Outcomes
1. Study the basics of data science and its scope.
2. Describe the basics of the data science process and recognize common tools
used for Data Science application development.
3. Explore the functions of Python libraries & packages.
4. Apply data science concepts and methods to find solutions to real-world
problems and communicate these solutions effectively.
5. Program Specific Outcomes
• PSO1: Knowledge of recent technology: Demonstrate the knowledge of
recent technologies like web development, mobile computing, grid
computing, cloud computing, big data analytics, mainframe etc.
• PSO2: Knowledge of programming languages: Demonstrate the knowledge
of programming languages in computer based problem solving.
• PSO3: Software development: Demonstrate the ability to analyse, design
and implement software products.
7. Introduction to Data Science
Data science is about extracting knowledge and insights from data.
The tools and techniques of data science are used to drive business
and process decisions.
10. UNIT NO. | UNIT NAME & DETAILS | NO. OF LECTURES
1 | Data Science and Its Scope: What Is Data Science, Data Science and Statistics, Role of Statistics in Data Science, A Brief History, Difference between Data Science and Data Analytics, Knowledge and Skills for Data Science Professionals, Some Technologies used in Data Science, Benefits and uses of data science, Facets of data. | 6
2 | The data science process: Overview, defining research goals and creating a project charter, retrieving data, Cleansing, integrating, and transforming data, Exploratory data analysis, Build the models, presenting findings and building applications on top of them. | 7
3 | Data Analysis Tools for Data Science and Analytics: Data Analysis Using Excel: Introduction, Getting Started with Excel, Format Data as a Table, Filter and Sort, Perform Simple Calculations, Data Manipulation: Sorting and Filtering Data, Derived Data, Highlighting Data, Aggregating Data: Count, Total Sum, Basic Calculation using Excel, Analyzing Data using Pivot Table/Pivot Chart, Descriptive Statistics using Excel, Visualizing Data using Excel Charts and Graphs, Visualizing Categorical Data: Bar Charts, Pie Charts, Cross Tabulation, Exploring the Relationship between Two and Three Variables: Scatter Plot, Bubble Graph, and Time-Series Plot. | 8
11. UNIT NO. | UNIT NAME & DETAILS | NO. OF LECTURES
4 | Introduction to NumPy: Creating Arrays from Scratch, NumPy Standard Data Types, The Basics of NumPy Arrays, Array Indexing, slicing, reshaping, Concatenation, splitting, Computation on NumPy Arrays: Universal Functions, Aggregations: Min, Max, Comparison operator, Boolean arrays. | 7
5 | Data Manipulation with Pandas: Introducing Pandas Objects, Data Indexing and Selection, Operating on Data in Pandas, Handling Missing Data, Hierarchical Indexing. Combining Datasets: Concat and Append, Combining Datasets: Merge and Join, Aggregation and Grouping, Pivot Tables. | 7
6 | Visualization with Matplotlib: General Matplotlib Tips, Simple Line Plots, Simple Scatter Plots, Visualizing Errors, Density and Contour Plots, Histograms, Binnings, and Density. | 7
12. Text Books
1) Davy Cielen, Arno D. B. Meysman, Mohamed Ali, “Introducing Data Science”, Manning
Publications. [Units 1 and 2]
2) Jake VanderPlas, “Python Data Science Handbook: Essential Tools for Working with Data”,
O’Reilly Publication. [Units 3, 4, 5]
3) Dr. Amar Sahay, “Essentials of Data Science and Analytics”, O’Reilly Publication.
[Units 1 and 3]
Reference Books
1. Joel Grus, Data Science from Scratch: First Principles with Python, O’Reilly Media, 2015.
2. Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data
Mining, John Wiley Publishers, 2000.
13. Content beyond the Syllabus
• R Programming
R is a programming language and free software developed by Ross Ihaka and Robert
Gentleman in 1993. R offers an extensive catalog of statistical and graphical methods,
including machine learning algorithms, linear regression, time series, and statistical
inference, to name a few.
• Power BI
"Power BI," Microsoft says, "is a business analytics solution that lets you visualize your data and share
insights across your organization, or embed them in your app or website."
14. Online Resources
• https://nptel.ac.in/courses/106/106/106106212/
• https://www.coursera.org/specializations/data-science-fundamentals-python-sql
• https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/
• https://www.youtube.com/watch?v=-ETQ97mXXF0&t=561s (Edureka)
• https://www.youtube.com/watch?v=KxryzSO1Fjs (Simplilearn)
26. Difference between Data Scientist & Data Analyst
Data Science | Data Analytics
Data science is a multi-disciplinary blend that involves algorithm development, data inference, and predictive modeling to solve analytically complex business problems. | Data analytics involves a few different branches of broader statistics and analysis.
Data science focuses more on machine learning and predictive modeling. | Data analytics focuses more on viewing historical data.
Data science focuses on discovering new questions that you might not have realized needed answering to drive innovation. | Data analytics involves answering questions generated for better business decision making. It uses existing information to uncover actionable data, and it focuses on specific areas with specific goals.
Data science tries to build connections and shape the questions to be answered in the future. | Data analytics involves checking a hypothesis.
If data science is a home for all the methods and tools, data analytics is a small room in that house.
27. Difference between Data Scientist & Data Analyst
Feature | Data Science | Data Analytics
Coding Language | Python is the most commonly used language for data science, along with other languages such as C++, Java, and Perl. | Knowledge of Python and R is essential for data analytics.
Programming Skills | In-depth knowledge of programming is required for data science. | Basic programming skills are necessary for data analytics.
Use of Machine Learning | Data science makes use of machine learning algorithms to get insights. | Data analytics doesn’t make use of machine learning.
28. Difference between Data Scientist & Data Analyst
Feature | Data Science | Data Analytics
Scope | The scope of data science is large. | The scope of data analytics is micro, i.e., small.
Goals | Data science deals with explorations and new innovations. | Data analytics makes use of existing resources.
Data Type | Data science mostly deals with unstructured data. | Data analytics deals with structured data.
30. Difference between Data Scientist & Data Analyst
Data Science vs Data Analytics: The Skills
Data Analytics:
• Knowledge of intermediate statistics and excellent problem-solving skills, along
with expertise in Excel and SQL databases.
• Experience working with BI tools like Power BI for reporting.
• Knowledge of statistical tools like Python and R.
31. Difference between Data Scientist & Data Analyst
Data Science vs Data Analytics: The Skills
Data Science:
• Math, advanced statistics, predictive modelling, machine learning, and
programming, along with proficiency in using big data tools like Hadoop and
Spark.
• Expertise in SQL and NoSQL databases like Cassandra and MongoDB.
• Experience with data visualization tools like QlikView, D3.js, and Tableau.
• Expertise in programming languages like Python, R,
32. Difference between Data Scientist & Data Analyst
Data Science vs Data Analytics: Sample Job Description
Data Analyst
33. Difference between Data Scientist & Data Analyst
Data Science vs Data Analytics: Sample Job Description
Data Scientist
34. Knowledge and Skills for Data Science Professionals
35. Knowledge and Skills for Data Science Professionals
• At least one programming language – R/ Python
• Data Extraction, Transformation, and Loading
• Data Wrangling and Data Exploration
• Machine Learning Algorithms
• Advanced Machine Learning (Deep Learning)
• Big Data Processing Frameworks
• Data Visualization
36. Knowledge and Skills for Data Science Professionals
• As a Data Scientist, you’ll be responsible for jobs that span
three domains of skills.
• Statistical/mathematical reasoning,
• Business communication/leadership, and
• Programming
37. Knowledge and Skills for Data Science Professionals
1. Statistics:
Wikipedia defines it as the study of the collection, analysis,
interpretation, presentation, and organization of data. It
shouldn’t be a surprise that data scientists need to know statistics.
For example, data analysis requires descriptive statistics and
probability theory, at a minimum. These concepts help you make
better business decisions from data.
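As a small illustration of the descriptive statistics mentioned above, the sketch below summarizes a list of marks using Python's built-in `statistics` module. The marks are made-up sample values, not data from the course.

```python
import statistics

# Hypothetical sample: marks of eight students (illustrative values only).
marks = [62, 75, 75, 81, 90, 58, 75, 68]

mean = statistics.mean(marks)      # central tendency: arithmetic average
median = statistics.median(marks)  # middle value of the sorted data
mode = statistics.mode(marks)      # most frequent value
stdev = statistics.stdev(marks)    # sample standard deviation (spread)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("stdev:", round(stdev, 2))
```

The mean, median, and mode describe where the data is centred, while the standard deviation describes how widely it is spread.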
38. Knowledge and Skills for Data Science Professionals
2. Programming Language R/ Python:
Python and R are among the languages most widely used by Data
Scientists. The primary reason is the number of packages available for
computing.
3. Data Extraction, Transformation, and Loading:
Suppose we have multiple data sources such as a MySQL DB and MongoDB. You
have to extract data from such sources and then transform it into a suitable
structure for the purposes of querying and analysis. Finally, you have to load
the data into the Data Warehouse, where you will analyze it. So, for people
with an ETL (Extract, Transform, and Load) background, Data Science can be a
good career option.
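The three ETL steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two sources are small in-memory CSV and JSON strings standing in for real systems such as MySQL or MongoDB, an in-memory SQLite database plays the role of the data warehouse, and the `sales` table and its columns are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sources: inline strings stand in for real databases or files.
CSV_SOURCE = "id,amount\n1,250.50\n2,99.00\n"
JSON_SOURCE = '[{"id": 3, "amount": "40"}, {"id": 4, "amount": "10.25"}]'

def extract():
    """Extract: read raw records from every source."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows += json.loads(JSON_SOURCE)
    return rows

def transform(rows):
    """Transform: give every record the same structure and data types."""
    return [(int(r["id"]), float(r["amount"])) for r in rows]

def load(rows, conn):
    """Load: store the cleaned rows in the (stand-in) data warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)

# Once loaded, the data can be queried and analyzed in one place.
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, total)
conn.close()
```

Keeping extract, transform, and load as separate functions mirrors the three stages and makes each one easy to replace when a real source or warehouse is used.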
39. Knowledge and Skills for Data Science Professionals
4. Data Wrangling and Data Exploration:
• Cleaning and unifying messy and complex data sets for easy access and
analysis is known as Data Wrangling.
• Exploratory Data Analysis (EDA) is the first step in your data analysis
process. Here you make sense of the data you have and then figure out what
questions you want to ask and how to frame them, as well as how best to
manipulate your available data sources to answer them.
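A minimal sketch of data wrangling in plain Python is shown below. The records and field names are invented for illustration; the point is only to show typical cleaning steps: trimming whitespace, unifying letter case, converting types, handling missing values, and dropping duplicates.

```python
# Hypothetical messy records, as they might arrive from a form or an export.
raw_records = [
    {"name": "  asha ", "city": "KOLHAPUR", "age": "21"},
    {"name": "Ravi", "city": "kolhapur", "age": ""},
    {"name": "  asha ", "city": "KOLHAPUR", "age": "21"},  # exact duplicate
]

def clean(records):
    """Return de-duplicated rows with consistent formatting and types."""
    seen = set()
    cleaned = []
    for rec in records:
        row = (
            rec["name"].strip().title(),                       # trim and unify case
            rec["city"].strip().title(),
            int(rec["age"]) if rec["age"].strip() else None,   # missing age -> None
        )
        if row not in seen:                                    # skip duplicates
            seen.add(row)
            cleaned.append(row)
    return cleaned

cleaned_rows = clean(raw_records)
for row in cleaned_rows:
    print(row)
```

In the course itself, the same steps would normally be done with Pandas, which is covered in Unit 5.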
5. Machine Learning
Machine Learning, as the name suggests, is the process of making
machines intelligent, so that they have the power to think, analyze, and make
decisions. By building Machine Learning models, an organization has a better
chance of identifying opportunities or avoiding unknown risks.
You should have good hands-on knowledge of various supervised and
unsupervised algorithms.
6. Big Data Processing Frameworks:
• Nowadays, most organizations use Big Data analytics to gain
insights. It is, therefore, a must-have skill for a Data Scientist.
• Therefore, we require frameworks like Hadoop and Spark to handle Big Data.
41. Benefits and uses of data science and big data
• Commercial companies in almost every industry use data science
and big data to gain insights into their customers, processes, staff,
competition, and products.
• A good example of this is Google AdSense, which collects data from internet users so relevant commercial
messages can be matched to the person browsing the internet.
• Human resource professionals use people analytics and text
mining to screen candidates, monitor the mood of employees, and
study informal networks among coworkers.
• Financial institutions use data science to predict stock markets,
determine the risk of lending money, and learn how to attract new
clients for their services.
42. Some Technologies used in Data Science
43. Benefits and uses of data science and big data
• Governmental organizations are also aware of data’s value. A
data scientist in a governmental organization gets to work on
diverse projects such as detecting fraud and other criminal activity
or optimizing project funding.
• Nongovernmental organizations (NGOs) are also no strangers to
using data. They use it to raise money and defend their causes. The
World Wildlife Fund (WWF), for instance, employs data scientists to
increase the effectiveness of their fundraising efforts.
• Universities use data science in their research but also to enhance
the study experience of their students.
• Example: MOOCs (massive open online courses).
44. Facets of data
The main categories of data are these:
• Structured
• Semi-structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
45. Structured Data
Structured data is all data that can be stored in a SQL database,
in tables with rows and columns.
It has relational keys and can easily be mapped into
pre-designed fields.
Today, it is the most commonly processed kind of data in development
and the simplest way to manage information.
However, structured data represents only 5 to 10% of all
data.
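To make the idea of rows, columns, and pre-designed fields concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `students` table, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO students (roll_no, name, dept) VALUES (?, ?, ?)",
    [
        (1, "Asha", "Data Science"),
        (2, "Ravi", "Computer Science"),
        (3, "Meera", "Data Science"),
    ],
)

# Every row has the same pre-designed fields, so querying is straightforward.
ds_students = conn.execute(
    "SELECT name FROM students WHERE dept = ? ORDER BY roll_no",
    ("Data Science",),
).fetchall()
print(ds_students)
conn.close()
```

Here `roll_no` acts as the key that uniquely identifies each row, which is what makes structured data easy to search and join.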
47. Semi Structured Data
• Semi-structured data is information that doesn’t reside in a
relational database but that does have some organizational
properties that make it easier to analyze.
• With some processing you can store it in a relational database (this
can be very hard for some kinds of semi-structured data), but
the semi-structured form exists to save space, improve clarity, or ease computation.
• Examples of semi-structured data: JSON, CSV, and XML documents.
• Like structured data, semi-structured data represents only a small
share of all data (5 to 10%).
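The sketch below shows what "some organizational properties" means in practice, using a small JSON document and Python's built-in `json` module. The records and the `flatten` helper are hypothetical; note that the two records do not share exactly the same fields, which is typical of semi-structured data.

```python
import json

# Hypothetical JSON document: records are self-describing, but fields may vary.
raw = """
[
  {"name": "Asha", "email": "asha@example.com", "skills": ["Python", "SQL"]},
  {"name": "Ravi", "skills": ["Excel"]}
]
"""

def flatten(records):
    """Turn JSON records into fixed-width rows suitable for a table."""
    rows = []
    for rec in records:
        rows.append((
            rec["name"],
            rec.get("email"),                   # optional field: None when absent
            ", ".join(rec.get("skills", [])),   # nested list flattened to text
        ))
    return rows

rows = flatten(json.loads(raw))
for row in rows:
    print(row)
```

The extra work in `flatten` (optional fields, nested lists) is the "processing" needed before such data fits into relational rows and columns.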
48. Unstructured data
• Unstructured data represents around 80% of data.
• It often includes text and multimedia content.
• Examples include e-mail messages, word processing documents, videos, photos, audio
files, presentations, webpages and many other kinds of business documents.
• Unstructured data is everywhere.
• In fact, most individuals and organizations conduct their lives around unstructured
data.
• Just as with structured data, unstructured data is either machine generated or human
generated.
49. Unstructured data
Here are some examples of machine-generated unstructured data:
• Satellite images: This includes weather data or the data that the government captures in its satellite
surveillance imagery. Just think about Google Earth, and you get the picture.
• Photographs and video: This includes security, surveillance, and traffic video.
• Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic data.
The following list shows a few examples of human-generated unstructured data:
• Social media data: This data is generated from social media platforms such as YouTube, Facebook,
Twitter, LinkedIn, and Flickr.
• Mobile data: This includes data such as text messages and location information.
• Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or
Instagram.
50. Facets of data: Natural Language
• Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific
data science techniques and linguistics.
• The natural language processing community has had success in
entity recognition, topic recognition, summarization, and sentiment
analysis, but models trained in one domain don’t generalize well to
other domains.
51. Facets of data: Graph-based or Network Data
• In graph theory, a graph is a mathematical structure to model pairwise
relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• Graph structures use nodes, edges, and properties to represent
and store graph data. Graph-based data is a natural way to
represent social networks.
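A small social network can be sketched in plain Python as an adjacency list, where each node (person) maps to the set of nodes it is connected to. The names and the two helper functions are made up for illustration.

```python
# Adjacency-list representation: each node maps to the set of its neighbours.
# The people and friendships are hypothetical.
friends = {
    "Asha": {"Ravi", "Meera"},
    "Ravi": {"Asha", "Meera", "John"},
    "Meera": {"Asha", "Ravi"},
    "John": {"Ravi"},
}

def mutual_friends(graph, a, b):
    """Nodes adjacent to both a and b."""
    return graph[a] & graph[b]

def degree(graph, node):
    """Number of edges touching a node, i.e. how many friends a person has."""
    return len(graph[node])

print(mutual_friends(friends, "Asha", "John"))
print(degree(friends, "Ravi"))
```

Questions such as "who are the mutual friends?" or "who is the best-connected person?" are about relationships between objects, which is why graph data suits them better than a flat table.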
53. Facets of data: Audio, Image & Video
• Audio, image, and video are data types that pose specific challenges to a
data scientist.
• MLBAM (Major League Baseball Advanced Media) announced in 2014
that they’ll increase video capture to approximately 7 TB per game for the
purpose of live, in-game analytics. High-speed cameras at stadiums will
capture ball and athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
Streaming Data
• Streaming data is data that is generated continuously by thousands
of data sources, which typically send in the data records
simultaneously and in small sizes (on the order of kilobytes).
• Examples include log files generated by customers using your mobile or
web applications, online game activity, “What’s trending” on Twitter, live
sporting or music events, and the stock market.
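The key difference from batch data is that streaming records are processed one at a time as they arrive, without waiting for the full data set. The sketch below imitates this with Python generators; the `event_stream` function is a stand-in for a real live source, and the events are invented.

```python
from collections import Counter

def event_stream():
    """Stand-in for a live source; yields one small record at a time."""
    events = [
        {"user": "u1", "action": "login"},
        {"user": "u2", "action": "click"},
        {"user": "u1", "action": "click"},
        {"user": "u1", "action": "logout"},
    ]
    yield from events

def running_counts(stream):
    """Update the summary after every record instead of storing all records."""
    counts = Counter()
    for event in stream:
        counts[event["action"]] += 1
        yield dict(counts)  # snapshot of the counts seen so far

snapshots = list(running_counts(event_stream()))
for snapshot in snapshots:
    print(snapshot)
```

Each printed snapshot is the state of the summary after one more event, which is how a "what's trending" style counter would evolve over time.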