Copyright © SAS Institute Inc. All rights reserved.
CSC1202
Fundamentals of Data Science
Lecture 4: Data Acquisition
These slides are adapted with permission from SAS Introduction to Data Science Course Materials
Previously… We considered the Data Science Process
A question
- What is the goal?
- Do you need to classify, estimate, describe?
- Do you have the proper data?
- What actions are planned?
Collect the data
- Which data are relevant?
- How many sources are involved?
- Do you have access to the data?
- Do you have privacy issues?
- Will the data be available in production?
Explore the data
- Are there anomalies or patterns?
- How do the data look?
- Do you have too many or too few variables?
- Do you need to impute/transform the data?
- Do you need to aggregate/create the data?
Model the data
- Train different models (algorithms and approaches).
- Validate all the models.
- Test all the models.
- Select the best model according to the question/goal.
- Score the champion model.
An answer
- What did you learn?
- Can you explain the answer with the model?
- Can you tell a story?
- Can you deploy the model in time?
Next: Collecting the Data
A question
- What is the goal?
- Do you need to classify, estimate, describe?
- Do you have the proper data?
- What actions are planned?
Collect the data
- Which data are relevant?
- How many sources are involved?
- Do you have access to the data?
- Do you have privacy issues?
- Will the data be available in production?
Sources of Data
Types of Data
Ethical and Privacy Issues
Lecture 4: Data Acquisition
Topic Learning Outcomes
By the end of this topic, students should be able to:
1. Explain sources of business data
2. Discuss the importance of considering ethical and privacy issues in
collecting and using data
Data Sources
Data for analytics can come from a variety of sources:
- Organizational Databases
- Social Media
- Publicly Available Datasets
- Sensor Data
- Surveys and Market Research
Organizational Databases
Organizational databases contain data related to a business and its
operations:
- Products
- Customers
- Transactions
- Suppliers
- Promotions
- Employees
Authorized users can access these data by querying the database, usually
using SQL (Structured Query Language).
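As a minimal sketch of what querying a database with SQL can look like, the Python example below builds a small in-memory SQLite database and runs one query against it. The table name `customers`, its columns and the sample rows are made up for illustration; a real organizational database would have its own schema and access controls.

```python
import sqlite3


def customers_in_city(conn, city):
    """Return the names of customers in the given city, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name",
        (city,),  # parameterized query: the value is passed separately from the SQL text
    )
    return [row[0] for row in cursor.fetchall()]


# Hypothetical data: an in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Aisha", "Kuala Lumpur"), ("Ben", "Penang"), ("Chong", "Kuala Lumpur")],
)

kl_customers = customers_in_city(conn, "Kuala Lumpur")
print(kl_customers)
```

The `SELECT ... WHERE ...` statement is the SQL part; Python is only used here to send the query and collect the rows.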
Social Media Data
- Data obtained from an organization's own social media channels on how users share, view or engage with the content. Channel owners can obtain analytics data for their own channels using social media analytics tools.
- Publicly available information shared by social media users on their own channels. Social listening refers to scraping social media sites for content.
Publicly Available Datasets
There are many publicly available open data sets:
- Government data, such as
  - Malaysia Open Data Portal (data.gov.my)
  - Malaysian Department of Statistics (dosm.gov.my)
  - European Data (data.europa.eu)
- Kaggle (Kaggle.com)
- Google Dataset Search (https://datasetsearch.research.google.com/)
- Datahub.io
The data is usually downloadable in text format, as comma-separated values (CSV files).
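A CSV file is plain text with one record per line and values separated by commas. The sketch below reads such data with Python's standard `csv` module. The column names and values are placeholders rather than real figures, and the text is held in a string so the example runs on its own; a downloaded file would be opened with `open(path, newline="")` instead.

```python
import csv
import io

# Placeholder CSV content (not real data): a header row, then one record per line.
csv_text = """region,year,population
RegionA,2020,100
RegionB,2020,40
"""


def read_csv_records(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


records = read_csv_records(csv_text)
for record in records:
    # Every value is read as a string; convert numbers explicitly before calculating.
    print(record["region"], int(record["population"]))
```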
Sensor Data
Sensor data comes from IoT (Internet of Things) data sources as a data stream.
A streaming data pipeline has to be set up for the data to flow from the source to the destination.
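The idea of a pipeline can be sketched with Python generators: readings flow one at a time from a source, through a processing step, into a destination. This is only a toy illustration. The sensor name, the readings and the Fahrenheit-to-Celsius step are assumptions, and a real pipeline would typically rely on dedicated streaming infrastructure rather than an in-memory list.

```python
def sensor_source(readings):
    """Source: yield sensor readings one at a time, as a stream would deliver them."""
    for reading in readings:
        yield reading


def to_celsius(stream):
    """Processing step: convert each reading from Fahrenheit to Celsius."""
    for reading in stream:
        yield {
            "sensor": reading["sensor"],
            "celsius": (reading["fahrenheit"] - 32) * 5 / 9,
        }


def run_pipeline(readings, destination):
    """Connect source -> processing -> destination and return the destination."""
    for item in to_celsius(sensor_source(readings)):
        destination.append(item)  # the destination here is just a list
    return destination


# Hypothetical readings from a made-up sensor "s1".
raw = [{"sensor": "s1", "fahrenheit": 32.0}, {"sensor": "s1", "fahrenheit": 212.0}]
stored = run_pipeline(raw, [])
print(stored)
```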
Surveys and Market Research
Organizations may also conduct surveys to
collect data for analysis or perform market
research.
Some companies specialize in data collection and market research; organizations may pay for
their services or for the data they have collected.
Sources of Data
Types of Data
Ethical and Privacy Issues
Lecture 4: Data Acquisition
Types of Data
Structured Data
Structured data is data that have been recorded in a specific format.
Example: New and Trending Games on Steam (from store.steampowered.com)
Structured data is easy to display and store in a table:
- Game Title
- Date Released
- Price
- Discount
- Platform
- Genre
- Rating
- Volume sold
Types of Data
Unstructured Data
Unstructured data does not have a pre-defined data model.
- Video and audio content
- Images
- Textual descriptions
Describing Data
Data items represent a single attribute, or feature, of an object.
Three important ways of describing data items:
- Variable: the name of the data item
- Data Type: the description of the type of data item
- Value: the actual contents, or value, of the data item
Describing Data
Data items are usually stored in a table format:
GameID   GameTitle                     ReleaseDate  Price    Rating
1692392  Feudal Fantasy Incremental    17/02/2023   RM4.27   Teen
1693823  GameDev Life Simulator        18/02/2023   RM49.00  E10+
1682912  Going Deep                    16/02/2023   RM13.99  Teen
3690293  Horror Adventure              18/02/2023   RM26.75  Teen
1691032  IBIS AM                       17/02/2023   RM5.69   Mature
1702983  Maze (The Amazing Labyrinth)  20/02/2023   RM26.75  Everyone
1698732  Mountain Alpaca               19/02/2023   RM5.69   Everyone
1704391  Parasomnia                    20/02/2023   RM8.50   Mature
Data types:
ID       Character                     Datetime     Numeric  Character
- Variables: the columns (GameID, GameTitle, ReleaseDate, Price, Rating)
- Records (or observations, cases, instances): the rows
- Values: the contents of each cell
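The same ideas can be sketched in Python: each field of a class is a variable with a name and a data type, each object is a record, and each field's content is a value. The class name `Game`, the field names and the choice of Python types are assumptions made for this illustration; the two records are taken from the table above.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Game:
    game_id: str        # ID
    game_title: str     # Character
    release_date: date  # Datetime
    price_rm: float     # Numeric (price in RM)
    rating: str         # Character


# Two records (rows) from the table.
games = [
    Game("1692392", "Feudal Fantasy Incremental", date(2023, 2, 17), 4.27, "Teen"),
    Game("1693823", "GameDev Life Simulator", date(2023, 2, 18), 49.00, "E10+"),
]

# Variables are the field names; each one has a data type.
variable_names = [f.name for f in fields(Game)]
for f in fields(Game):
    print(f.name, f.type.__name__)

# A value is the content of one variable in one record.
first_price = games[0].price_rm
print(first_price)
```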
Categories and Measures
Data items can generally be classified into two types: categories and measures.
Categories and Measures
Category Data Items
Variables with character and datetime data types are treated as categories.
These variables have distinct values which are used to group the records, for example by country or by month.
These values can also be summarized to find the count of each value, or the mode:
- 10 records with "December" Hire Date
- "Sales" is the most frequently occurring Department
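Counting category values and finding the mode can be sketched with Python's `collections.Counter`. The list of departments below is made-up sample data, not taken from any real dataset.

```python
from collections import Counter

# Hypothetical Department values, one per employee record.
departments = ["Sales", "IT", "Sales", "HR", "Sales", "IT"]

# Count how many records fall into each category.
counts = Counter(departments)

# The mode is the most frequently occurring value.
mode_value, mode_count = counts.most_common(1)[0]

print(dict(counts))
print(mode_value, mode_count)
```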
Categories and Measures
Measure Data Items
Numeric data items are treated as measures.
These are data items whose values can be used in calculations.
These values are usually summarized by finding, for example, the mean, sum or standard deviation:
- Mean Annual Salary of $50,000
- Sum of Total Orders
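These summaries can be sketched with Python's built-in `sum` and the standard `statistics` module. The three salary figures are made-up sample values, and `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is one of two common conventions.

```python
import statistics

# Hypothetical annual salaries for three employees.
annual_salaries = [40000, 50000, 60000]

total = sum(annual_salaries)                 # sum of all values
mean = statistics.mean(annual_salaries)      # arithmetic mean
stdev = statistics.stdev(annual_salaries)    # sample standard deviation (n - 1)

print(total, mean, stdev)
```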
Sources of Data
Types of Data
Ethical and Privacy Issues
Lecture 4: Data Acquisition
Ethics and Privacy in Data Science
A data science project may have an impact on society
in terms of
• Ethical issues
• Privacy issues
Ethical Issues
Definition
Ethics
1. a set of moral principles : a theory or
system of moral values
2. the principles of conduct governing an
individual or a group
Source: “Ethic.” Merriam-Webster.com Dictionary, Merriam-Webster,
https://www.merriam-webster.com/dictionary/ethic. Accessed 19 Feb. 2023.
Ethical Issues in Data Science
- Denying access to social services to individuals who criticize government policies
- Classifying neighborhoods as "safe" or "unsafe" may affect the livelihood of business owners
- Determining whether to hire a new employee by analysing their social media content
- Using customer data from an online contest to market new products
Ethics
Recognizing an Ethical Issue
Informal methods:
1. Is there someone who would like it to be kept quiet?
2. Would you tell your mother?
3. Would you talk about it on YouTube?
4. If you advertised it, would people admire or criticize your organization?
5. What does your instinct tell you?
Ethics
Recognizing an Ethical Issue
Formal methods:
1. Does the act violate corporate policy?
2. Does it violate a corporate or professional code of conduct or ethics?
3. Does it violate the “Golden Rule”?
   - Treat others the way you wish them to treat you.
Professional Code of Ethics
Data scientists may adhere to a professional code of conduct based on their
membership in professional associations:
- The Malaysia Board of Technologists (MBOT) has a Code of Ethics for
Technologists and Technicians, which covers general professional practice
for technical professionals
- The Data Science Association has a more specialized Data Science Code
of Professional Conduct geared towards data science professionals
Responsible Data Science
Bruce and Fleming (2021) highlight in their book "Responsible
Data Science" that "data science projects can go awry, when
the predictions made by statistical and machine learning
algorithms turn out to be not just wrong, but biased and unfair
in ways that cause harm".
- Bias: an algorithm makes predictions for people of one group
systematically differently than for others
- Unfairness: an algorithm makes predictions that are
disadvantageous to certain people
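One simple, partial check for the kind of bias defined above is to compare how often each group receives the favourable prediction. The sketch below does this for made-up predictions and two hypothetical groups, "A" and "B". A difference in rates (sometimes called a demographic parity difference) is a signal to investigate further, not proof of unfairness on its own, and this is only one of several possible checks.

```python
from collections import defaultdict


def positive_rate_by_group(predictions):
    """Return the share of favourable predictions (label 1) for each group.

    predictions: list of (group, predicted_label) pairs, with labels 0 or 1.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, label in predictions:
        totals[group] += 1
        positives[group] += label
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical model outputs: 1 = favourable outcome, 0 = unfavourable outcome.
predictions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 0), ("B", 1), ("B", 0), ("B", 0),
]

rates = positive_rate_by_group(predictions)
gap = abs(rates["A"] - rates["B"])  # difference in favourable-prediction rates
print(rates, gap)
```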
Ethics in the Data Science Process
Ethical issues should be considered through every stage of the data science
process:
- A question: consider the objective of the project and issues of bias and unfairness. Will it benefit all members of society equally? Is the intention to exploit others' weaknesses for profit?
- Collect the data: consider issues of privacy and representation of all groups.
- Explore the data: protect data confidentiality and accuracy while processing and exploring the data.
- Model the data: ensure that the model that is developed and trained is not biased against certain groups of people.
- An answer: will the use of the model result in an unfair advantage or disadvantage for certain groups?
Data Protection
In order to protect individual data privacy, governments have
implemented data protection laws:
- The European GDPR (General Data Protection Regulation) is
applicable to all European Union member countries
- An interactive map by DLA Piper shows other countries with data
protection laws
Malaysia’s PDPA
Malaysia implemented the Personal Data Protection Act 2010 (PDPA) to ensure
information security by all organizations that perform data processing
relating to commercial transactions in Malaysia.
The Act aims to ensure that organisations:
- Explain the purpose of data collection
- Seek consent for data collection
- Establish data protection policies
Malaysia’s PDPA
Three roles are defined in the PDPA:
- Data User: a licensed organization or individual who processes, has control over or authorizes the processing of personal data
- Data Subject: an individual who is the subject of personal data
- Data Processor: any person, other than an employee of the Data User, who processes the personal data on behalf of the Data User
Personal Data
Personal data in the PDPA means information
- in respect of commercial transactions
- directly or indirectly related to a data subject
- who is identified or identifiable from that information, or from that and
other information in the possession of the data user.
Sensitive personal data means any information related to:
- Physical or mental health or condition
- Political opinions
- Religious or other similar beliefs
- Offence records
Personal Data Protection Principles
- General Principle: prohibits a data user from processing a data subject's personal data without his/her consent
- Notice and Choice Principle: requires a data user to inform a data subject of how the personal data is being used and to provide a means of giving consent
- Disclosure Principle: prohibits the disclosure of personal data without consent
- Security Principle: obliges the data user to protect the personal data
- Retention Principle: personal data is not to be retained longer than necessary
- Data Integrity Principle: the data user is responsible for taking reasonable steps to ensure the personal data is accurate and complete
- Access Principle: the data subject has the right to access and correct his/her own data
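As a rough illustration of how the General and Retention principles might show up in a data pipeline, the sketch below processes a record only if consent was given and the data is still within a retention period. Everything here is an assumption made for the example: the `PersonalRecord` class, its fields and the 365-day retention period are hypothetical, and the number is an arbitrary placeholder, not a figure from the PDPA.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PersonalRecord:
    subject_id: str       # identifies the data subject
    consent_given: bool   # whether the data subject consented to processing
    collected_on: date    # when the personal data was collected


def may_process(record, today, retention_days):
    """Allow processing only with consent and within the retention period."""
    within_retention = today - record.collected_on <= timedelta(days=retention_days)
    return record.consent_given and within_retention


today = date(2023, 2, 19)
records = [
    PersonalRecord("s1", True, date(2023, 1, 1)),   # consent, recent
    PersonalRecord("s2", False, date(2023, 1, 1)),  # no consent
    PersonalRecord("s3", True, date(2020, 1, 1)),   # consent, but too old
]

# 365 days is an arbitrary placeholder for "no longer than necessary".
allowed = [r.subject_id for r in records if may_process(r, today, retention_days=365)]
print(allowed)
```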
The Rights of Data Subjects
- Right of Access to Personal Data: a data subject can request information on the personal data that is being processed
- Right to Correct Personal Data: a data subject can request that personal data be corrected if it is misleading, inaccurate or outdated
- Right to Withdraw Consent: a data subject may request to withdraw consent for the processing of personal data
- Right to Prevent Processing: a data subject may request the data user not to begin, or to cease, processing personal data
  - that is causing or likely to cause damage or distress
  - for the purpose of direct marketing
[Source: pdp.gov.my]
Ethics in Data Science
Case Study:
You are working on a data science project for a non-governmental
organization that collects donations. You would like to collect information
about donors and the amounts that they have donated to various causes. You
hope that with the information about how much they have donated and
how often, you will be able to run targeted marketing campaigns
and identify potential donors who will make more donations in the future.
Ethical and Privacy Issues
Data science process: a question, collect the data, explore the data, model the data, an answer.
Discuss what will need to be done for the data science project described.
What are some ethical and privacy issues that need to be
considered? How would you address them?
Summary
Identifying sources of data and performing data collection are critical to the
data science project.
Data scientists have to take into consideration
- The sources of data they are using
- Whether they have access to the data
- Whether the data has been collected ethically and with regard to
personal data protection laws.
References
Bruce, P.C. and Fleming, G. (2021). Responsible Data Science. Wiley.
Personal Data Protection Commissioner Malaysia (n.d.). What you need to know? Personal Data Protection Act 2010. Department of Personal Data Protection. https://www.pdp.gov.my/jpdpv2/
Pierson, L. (2021). Data Science For Dummies. For Dummies.
Van Der Velden, J. (2021). Introduction to Data Science Course Notes. SAS Institute.
