Introduction to Data Science
Chapter Two
Topics Covered
An Overview of Data Science
Data and Information
Data Types and Representation
Data Processing Cycle
Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)
Basic Concepts of Big Data
2.1 Overview of Data Science
What is Data Science?
A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Alternatively, it is the field of study that combines programming skills with knowledge of mathematics and statistics to extract meaningful insights from data.
Cont.
Data science is much more than simply analyzing data.
It offers a range of roles and requires a range of skills.
It is a science that draws on different professions, combining mathematics, statistics, computer programming, etc.
Examples of data:
Your notebook
Prices of items in a supermarket
Files in a computer, etc.
Cont. . .
Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
To be a successful data professional in today's market, one must advance past the traditional skills of analyzing large amounts of data, data mining, and programming.
Data Science Experts/Scientists?
Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
They use industry knowledge, contextual understanding, and skepticism of existing assumptions to uncover solutions to business challenges.
Skills needed for a data scientist are statistics and linear algebra, as well as programming knowledge.
They must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns.
2.2 Data and Information
Data?
A representation of raw or unprocessed facts, figures, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
It is not used for decision making.
Raw data has no inherent pattern.
Data can be represented with the help of:
Alphabets (A-Z, a-z)
Digits (0-9)
Special characters (+, -, *, /, >, <, =, etc.)
Information?
Interpreted data, created from organized, structured, and processed data, which has some meaningful value for the receiver.
It is organized, processed, structured, and analyzed data.
It is used for decision-making purposes.
Principles of information: processed data must satisfy the following:
Timeliness: information should be available when required.
Accuracy: information should be accurate.
Completeness: information should be complete.
Summary: Data vs. Information
Data: described as unprocessed or raw facts and figures.
Information: described as processed data.
Data: cannot help in decision making.
Information: can help in decision making.
Data: raw material that can be organized, structured, and interpreted to create useful information systems.
Information: interpreted data, created from organized, structured, and processed data in a particular context.
Data: an example of data is a student's test score.
Information: the average score of a class is information derived from the given data.
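The test-score example above can be made concrete with a small Python sketch (the scores are hypothetical):

```python
# Data: raw, unprocessed test scores of individual students
scores = [72, 85, 90, 64, 78]

# Information: processing the data in context yields the class average
average = sum(scores) / len(scores)
print(f"Class average: {average:.1f}")  # now useful for decision making
```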
2.3 Data Processing Cycle
The data processing cycle (figure):
Input: data prepared in some convenient form for processing, e.g., for electronic computers.
Processing: changing the data into a useful form, e.g., calculating CGPA.
Output: collecting the results of processing.
Storage: the produced information needs to be stored for future usage.
Data processing is the restructuring or reordering of data by people or machines to increase its usefulness.
It is the set of operations used to transform data into useful information.
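A minimal Python sketch of the cycle, using the CGPA example from the figure with hypothetical grade points and credit hours:

```python
# Input: data prepared in a convenient form for processing
courses = [(4.0, 3), (3.0, 4), (3.5, 2)]  # (grade point, credit hours)

# Processing: change the data into a useful form (calculate CGPA)
total_points = sum(gp * cr for gp, cr in courses)
total_credits = sum(cr for _, cr in courses)
cgpa = total_points / total_credits

# Output: the result of processing
print(f"CGPA: {cgpa:.2f}")

# Storage: keep the produced information for future usage
with open("cgpa.txt", "w") as f:
    f.write(str(cgpa))
```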
2.4 Data Types and Their Perspectives
Common data types include:
Integers (int): used to store whole numbers, mathematically known as integers.
Booleans (bool): used to store one of two values: true or false (high or low).
Characters (char): used to store a single character (numeric, alphabetic, or symbol).
Floating-point numbers (float): used to store real numbers.
Alphanumeric strings (string): used to store a combination of characters and numbers.
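As a minimal illustration of these types in Python (note that Python has no separate char type, so a one-character string stands in for it):

```python
age = 25               # int: a whole number
is_valid = True        # bool: one of two values, True or False
grade = "A"            # char: a single character (1-character string in Python)
gpa = 3.75             # float: a real number
student_id = "DS2023"  # string: a combination of characters and numbers

print(type(age), type(is_valid), type(grade), type(gpa), type(student_id))
```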
Data Types and Data Analytics Representation
Structured Data:
Has a pre-defined data model and is straightforward to analyze.
Takes a tabular format, with relationships between different rows and columns.
E.g., Excel files or SQL databases.
Semi-structured Data:
Does not conform to the formal structure of a data model, but contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data.
Known as a self-describing structure.
For example: JSON and XML.
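As an illustration, here is a short Python sketch parsing a self-describing JSON record; the record itself is hypothetical:

```python
import json

# A self-describing record: tags (keys) separate semantic elements,
# and nesting enforces a hierarchy of records and fields
raw = '{"student": {"name": "Abebe", "scores": [85, 90, 78]}}'
record = json.loads(raw)

print(record["student"]["name"])    # navigate the hierarchy via the tags
print(record["student"]["scores"])  # a nested field
```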
Cont. . .
Unstructured Data
Does not have a predefined data model and is not organized in a pre-defined manner.
Examples: audio or video files, or NoSQL databases.
Metadata: data about data; it provides additional information about a specific set of data.
It is one of the most important elements for big data analysis and big data solutions.
E.g., a photograph's metadata describes when and where the photo was taken.
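As a concrete illustration, a photograph's metadata might look like the following Python dictionary; the field names and values are hypothetical:

```python
# Hypothetical EXIF-style metadata: data about the photo, not the photo itself
photo_metadata = {
    "filename": "IMG_0421.jpg",
    "taken_at": "2023-05-14T09:32:00",  # when the photo was taken
    "location": (9.03, 38.74),          # where (latitude, longitude)
    "camera": "Phone-XYZ",
    "resolution": "4032x3024",
}
print(photo_metadata["taken_at"], photo_metadata["location"])
```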
2.5 Data Value Chain
The information flow within a big data system, seen as a series of steps needed to generate useful insights from data.
The data value chain includes the following steps (a small sketch follows the list):
1. Data acquisition: the process of gathering, filtering, and cleaning data before any data analysis can be carried out.
2. Data analysis: making raw data amenable to use in decision making; it involves exploring, transforming, and modeling data and extracting useful information.
3. Data curation: the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
It includes creation of content, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
4. Data storage: storing the processed data.
5. Data usage: using the processed data to make decisions.
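A minimal Python sketch of the value chain end to end, with hypothetical data; each step maps to one stage:

```python
import json

# 1. Acquisition: gather, filter, and clean raw records
raw = ["85", "90", "", "78", "bad"]
cleaned = [int(x) for x in raw if x.isdigit()]  # drop empty/invalid values

# 2. Analysis: explore and transform the data to extract useful information
average = sum(cleaned) / len(cleaned)

# 3. Curation: classify and validate for quality and accessibility
curated = {"scores": sorted(cleaned), "average": round(average, 2)}

# 4. Storage: persist the processed data
with open("scores.json", "w") as f:
    json.dump(curated, f)

# 5. Usage: read it back to support a decision
with open("scores.json") as f:
    print(json.load(f)["average"])
```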
Use cases of Data Science (figure)
Application domains of Data Science (figure)
2.6 Basic Concepts of Big Data
Big data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Characteristics of Big Data
Big data can be characterized by:
1. Volume: a large amount of data; massive datasets.
2. Velocity: data is live streaming or in motion (rapidity); the speed at which data moves through the system.
3. Variety: data comes in many different forms from diverse sources (structured, unstructured, text).
4. Veracity: can we trust the data? How accurate is it? Uncertainty due to data inconsistency, incompleteness, etc.
Characteristics of Big Data (figure):
• Volume: the amount of data from myriad sources; large amounts of data, up to zettabytes (massive datasets).
• Velocity: the speed at which data are generated; data is live streaming or in motion; real time.
• Variety: the types of data; data comes in many different forms from diverse sources.
• Veracity: data trustworthiness (the degree to which big data can be trusted); data accuracy (how accurate is it?).
• The figure also notes the way in which the big data can be used and formatted, to whom the data are accessible, and the business value, uses, and purpose of the data collected.
The 4 Vs of Big Data (figure)
Five major use cases of Big Data
Big data exploration or investigation
Enhanced customer view
Security / intelligence extension
Operations analysis
Data warehouse augmentation
Clustered Computing
Individual computers are often inadequate for handling big data at most stages.
Clustered computing is a group of computers connected through a LAN (local area network) that work together and behave like a single system: in effect, a computer made up of computers.
It is used to better address the high storage and computational needs of big data.
Cont. . .
The four nodes are connected through software to share loads, and they perform like a single unit.
This makes it possible to maximize processing power, which improves speed when analyzing big data.
We can search, extract, or allocate data from all nodes by accessing only one node, because each node has relationships with the others.
Each node has backups (duplications).
Benefits of Clustered Computing
The benefits of combining the resources of many smaller machines are (see the sketch after this list):
1. Resource pooling: combining the available storage space, CPU, or memory to get high-speed operations and transactions.
2. High availability: a cluster provides varying levels of fault tolerance and availability guarantees; if one machine fails, we can get the data from another machine, so no data is lost or wasted.
3. Easy scalability: the cluster is scalable by adding additional machines.
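As a toy illustration of resource pooling on a single machine, the following Python sketch uses worker processes to stand in for cluster nodes; it is a simplified analogy, not real cluster software:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Simulate one node processing its share of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the dataset into 4 chunks, one per "node" (worker process)
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:   # pool the CPU resources of 4 workers
        partial_sums = pool.map(process_chunk, chunks)
    print(sum(partial_sums))          # combine the partial results
```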
Examples of Scaling Clustered Computing (figure)
2.7 Hadoop and its Ecosystem
An open-source framework intended to make interaction with big data easier.
It allows clustering multiple computers to analyze massive data sets in parallel more quickly.
The four key characteristics of Hadoop:
Economical: ordinary computers can be used for data processing.
Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
Scalable: it is easily scalable, both horizontally and vertically.
Flexible: you can store as much structured and unstructured data as you need and decide to use it later.
The 4 core components of Hadoop and its Ecosystem
The four core components of Hadoop are Data Management, Data Access, Data Processing, and Data Storage (figure: The Hadoop Ecosystem).
Cont. . .
The ecosystem comprises the following components:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database, etc.
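To illustrate the MapReduce model named above, here is a pure-Python sketch of the classic word count, simulating the map, shuffle, and reduce phases in memory; it is not actual Hadoop code:

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map phase: emit (word, 1) pairs from each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key (word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 3, ...}
```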
The Big Data Life Cycle with Hadoop
Stage 1: Ingesting data into Hadoop. Data is ingested or transferred into Hadoop from various sources such as relational databases, other systems, or local files.
Stage 2: Processing. In this stage, the data is stored and processed.
Stage 3: Computing and analyzing data. The data is analyzed and processed using open-source frameworks such as Pig, Hive, and Impala.
Stage 4: Visualizing the results. The analyzed data can be accessed by users.
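A minimal pure-Python sketch of these four stages, with a hypothetical in-memory CSV source standing in for the real inputs; an actual deployment would use Hadoop tools for each stage rather than plain Python:

```python
import csv, io, statistics

# Stage 1 - Ingest: read records from a (hypothetical) local CSV source
source = io.StringIO("student,score\nAbebe,85\nSara,90\nKebede,78\n")
records = list(csv.DictReader(source))

# Stage 2 - Process/store: clean the data and keep the fields we need
scores = [int(r["score"]) for r in records]

# Stage 3 - Compute/analyze: derive summary information
average = statistics.mean(scores)

# Stage 4 - Visualize/serve: make the result accessible to users
print(f"Class average: {average:.1f}")
```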
Review Questions
1. Discuss the difference between Big data and Data Science.
2. Briefly discuss the Big data life cycle.
3. List and explain Big data application domains with
example.
4. What is Clustered Computing? Explain its advantages.
Thank you!
