Chapter Two
Data science
Chapter Contents
 Describe what data science is and the role of data scientists.
 Differentiate data and information.
 Describe the data processing life cycle.
 Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
 Understand the basics of Big Data.
 Describe the purpose of the Hadoop ecosystem components.
Data science
• Data science is now one of the most influential fields around.
• Companies and enterprises are investing heavily in acquiring
data science talent, creating ever more viable roles in the
data science industry.
• Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-
structured and unstructured data.
• Example: The data involved in buying a box of cereal
from the store or supermarket
Data science vs. data scientist
• Data science is defined as the extraction of
actionable knowledge directly from data through a
process of discovery, hypothesis formulation, and
analytical hypothesis testing.
• It is the process of producing, or helping to produce,
a tool, method, or other product that derives
intelligence from datasets too large or complex to
handle with conventional means.
Cont’d…
• A data scientist (a job title) is a person who engages
in systematic activity to acquire knowledge from
data.
• In a more restricted sense, a data scientist may
refer to an individual who uses the scientific
method on existing data.
• Data Scientists perform research toward a more
comprehensive understanding of products,
systems, or nature, including physical,
mathematical and social realms.
Role of a data scientist
• Apply advanced skills in analyzing large amounts
of data, data mining, and programming.
• Processed and filtered data are handed to data
scientists, who feed them into analytics programs,
machine learning models, and statistical methods to
generate results used in predictive analysis and
other fields.
• Explore the data for hidden (cryptic) patterns in
order to extract proper insights.
Algorithms
• An algorithm is a set of instructions designed
to perform a specific task.
• This can be a simple process, such as
multiplying two numbers, or a complex
operation, such as playing a compressed
video file.
• Search engines use proprietary algorithms to
display the most relevant results from their
search index for specific queries.
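• As a minimal sketch in Python (the function below is illustrative, not taken from the slides), here is a simple algorithm that multiplies two numbers by repeated addition:

def multiply(a, b):
    # Multiply two non-negative integers by repeated addition.
    result = 0
    for _ in range(b):   # add 'a' to the result, 'b' times
        result += a
    return result

print(multiply(6, 7))  # 42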
Data vs. Information
• Data
– Can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be
suitable for communication, interpretation, or processing, by
human or electronic machines.
– It can be described as unprocessed facts and figures
– It is represented with the help of characters such as alphabets
(A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >,
=, etc.).
• Information
– The processed data on which decisions and actions are based
– Information is interpreted data; created from organized,
structured, and processed data in a particular context
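• A small Python sketch (with hypothetical sales figures) of raw data being turned into information by processing it in context:

# Data: unprocessed daily sales figures (hypothetical values)
daily_sales = [120, 95, 140, 110, 160]

# Information: the data processed and given context for a decision
average = sum(daily_sales) / len(daily_sales)
print(f"Average daily sales this week: {average:.1f} units")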
Data processing cycle
• Data processing is the conversion of raw data to
meaningful information through a process.
• Data is manipulated to produce results that lead
to a resolution of a problem or improvement of
an existing situation.
• The process includes activities like data
entry/input, calculation/process, output and
storage
Cont’d…
• Input is the task where verified data is coded or
converted into machine readable form so that it
can be processed through a computer. Data entry
is done through the use of a keyboard, digitizer,
scanner, or data entry from an existing source.
Input → Processing → Output
Cont’d…
• Processing is the stage at which the data is subjected to various
means and methods of manipulation; it is the point where a
computer program is executed, holding the program code and its
current activity.
• Output and interpretation is the stage where the processed
information is transmitted to the user. Output is presented to
users in various formats, such as a printed report, audio, video,
or display on a monitor.
• Storage is the last stage in the data processing cycle, where data,
instruction and information are held for future use. The
importance of this cycle is that it allows quick access and
retrieval of the processed information, allowing it to be passed
on to the next stage directly, when needed.
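• The four stages can be sketched as a toy Python pipeline (the file name and calculation are illustrative only):

def data_processing_cycle():
    raw = input("Enter comma-separated numbers: ")       # Input
    numbers = [float(x) for x in raw.split(",")]
    total = sum(numbers)                                  # Processing
    print("Sum of the entered numbers:", total)           # Output
    with open("results.txt", "a") as f:                   # Storage
        f.write(f"{numbers} -> {total}\n")

data_processing_cycle()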
Data types
• A data type is a way to tell the compiler which kind of data (integer,
character, float, etc.) is to be stored and, consequently, how much
memory to allocate.
• Equivalently, a data type tells the compiler that a memory cell x may
only hold bit values within some range y; it restricts the compiler
from storing anything outside that value range.
• Common data types include:
– Integer (int): stores whole numbers, mathematically known
as integers
– Boolean (bool): stores a value restricted to one of two values:
true or false
– Character (char): stores a single character
– Floating-point number (float): stores real numbers
– Alphanumeric string (string): stores a combination of
characters and numbers
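• In Python, the same categories can be illustrated as follows (Python has no separate char type, so a one-character string stands in for it; the variable names are made up):

age = 25                # int: a whole number
is_student = True       # bool: one of two values, True or False
grade = "A"             # char: a single character (a length-1 string in Python)
gpa = 3.75              # float: a real (floating-point) number
student_id = "ETS0123"  # string: a combination of characters and digits

print(type(age), type(is_student), type(grade), type(gpa), type(student_id))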
Data representation
• Types are an abstraction that lets us model things in
categories; a type is largely a mental construct.
• All computers represent data as nothing more than
strings of ones and zeroes.
• For those ones and zeroes to convey any meaning,
they need to be contextualized.
• Data types provide that context.
– E.g. 01100001 can be read as the integer 97 or as the
character 'a', depending on the declared type.
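• A short Python check of the example above, showing that the same bit pattern 01100001 is the integer 97 or the character 'a' depending on how it is interpreted:

bits = "01100001"
as_int = int(bits, 2)   # interpret the bits as an integer   -> 97
as_char = chr(as_int)   # interpret that value as a character -> 'a'
print(as_int, as_char)  # 97 a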
Data types from a data analytics perspective
• Data analytics (DA) is the process of examining data
sets in order to draw conclusions about the information
they contain, increasingly with the aid of specialized
systems and software.
• From a data analytics point of view, it is important
to understand that there are three common data types
or structures:
– Structured,
– Semi-structured, and
– Unstructured data types
Structured data
• Structured data is data that adheres to a pre-defined
data model and is therefore straightforward to
analyze.
• Structured data covers all data that can be stored in a
SQL database in tables with rows and columns. It has
relational keys and can easily be mapped into
pre-designed fields.
• Structured data is highly organized information that
loads neatly into a relational database.
• Structured data is relatively simple to enter, store,
query, and analyze, but it must be strictly defined in
terms of field name and type.
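• A minimal structured-data sketch using Python's built-in sqlite3 module (the table and field names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO customers (name, age) VALUES (?, ?)", ("Abebe", 30))
conn.commit()

# Every row conforms to the pre-defined fields (id, name, age).
for row in conn.execute("SELECT id, name, age FROM customers"):
    print(row)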
Semi structured data
• Semi-structured data is a form of structured data that
does not conform with the formal structure of data
models associated with relational databases or other
forms of data tables, but nonetheless, contains tags or
other markers to separate semantic elements and
enforce hierarchies of records and fields within the
data.
• Semi-structured data is information that doesn’t reside
in a relational database but that does have some
organizational properties that make it easier to analyze.
• Examples of semi-structured data: CSV, XML, and JSON
documents are semi-structured, and NoSQL databases are
also considered semi-structured.
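• A small JSON example parsed with Python's json module (the record itself is invented); the tags/keys give the data structure even though there is no fixed relational schema:

import json

record = '{"name": "Sara", "skills": ["Python", "SQL"], "address": {"city": "Addis Ababa"}}'
data = json.loads(record)   # keys act as markers separating semantic elements
print(data["name"], data["address"]["city"])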
Unstructured data
• Unstructured data is information that either does not have
a predefined data model or is not organized in a pre-
defined manner.
• Unstructured data may have its own internal structure but
does not fit neatly into a spreadsheet or database.
• Most business interactions, in fact, are unstructured in
nature.
• Today more than 80% of the data generated is
unstructured.
• The fundamental challenge of unstructured data sources is
that they are difficult for nontechnical business users and
data analysts alike to unbox, understand, and prepare for
analytic use.
Metadata – Data about data
• Metadata is data about data. Data that describes other data.
• It provides additional information about a specific set of
data.
• Metadata summarizes basic information about data, which
can make finding and working with particular instances of
data easier.
• For example, author, date created and date modified and
file size are examples of very basic document metadata.
• Having the ability to filter through that metadata makes it
much easier for someone to locate a specific document.
• In the context of databases, metadata would be information on
tables, views, columns, arguments, etc.
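• File-system metadata can be inspected with Python's os module; a sketch, assuming a file named example.txt exists in the working directory:

import os, time

info = os.stat("example.txt")           # metadata about the file, not its content
print("Size (bytes):", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))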
Data value chain
• The Data Value Chain is introduced to describe the
information flow within a big data system as a series
of steps needed to generate value and useful insights
from data.
– Data acquisition, data analysis, data curation, data
storage, data usage
• Data acquisition is the process of digitizing data from
the world around us so it can be displayed, analyzed,
and stored in a computer. It is the process of bringing
data that has been created by a source outside the
organization into the organization for production use.
Cont’d…
• Data analysis is the process of evaluating data
using analytical and logical reasoning to examine
each component of the data provided. Data from
various sources is gathered, reviewed, and then
analyzed to form some sort of finding or
conclusion.
• Data analytics is the process of extracting information
from data in order to make a decision and subsequently
act on it.
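• A tiny analysis sketch using Python's statistics module (the gathered figures are hypothetical), turning raw measurements into a finding that supports a decision:

import statistics

response_times = [1.2, 0.9, 1.5, 2.1, 1.1]   # data gathered from various sources
mean = statistics.mean(response_times)
stdev = statistics.stdev(response_times)
print(f"Mean = {mean:.2f} s, standard deviation = {stdev:.2f} s")
if mean > 1.0:
    print("Finding: responses are slow on average -> act to optimize")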
Cont’d…
• The Big Data Value Chain identifies the following key high-
level activities: data acquisition, data analysis, data curation,
data storage, and data usage.
Big data
• No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it.
Cont’d…
• Big data is the term for a collection of data sets so
large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications.
• In other words, data in the range of hundreds of terabytes
(TB) to petabytes (PB) falls into the category of Big Data.
• However, Big Data is not just about the amount of data;
what matters is what organizations do with the data.
• Big Data is analyzed for insights that lead to
better decisions.
Cont’d…
• Big Data is associated with the concept of the 3 Vs:
volume, velocity, and variety. Big data is
characterized by these and more:
– Volume: large amounts of data (zettabytes /
massive datasets)
– Velocity: Data is live streaming or in motion
– Variety: data comes in many different forms
from diverse sources
– Veracity: can we trust the data? How
accurate is it? etc.
Cluster computing
• Cluster computing addresses the latest developments in
the fields that support High Performance Distributed
Computing (HPDC).
• Clustering approaches such as HPC IaaS and HPC PaaS
have been identified; these are more expensive and
difficult to set up and maintain than a single
computer.
• In HPDC environments, parallel and/or
distributed computing techniques are applied to
the solution of computationally intensive
applications across networks of computers.
Cont’d…
• A “computer cluster” basically refers to a set of
connected computers working together.
• The cluster represents one system and the
objective is to improve performance.
• The computers are generally connected in a LAN
(Local Area Network).
• So, when this cluster of computers works to
perform some tasks and gives an impression of
only a single entity, it is called “cluster
computing”.
Cont’d…
• Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:
– Resource Pooling:
• Combining the available storage space to hold data is a
clear benefit, but CPU and memory pooling are also
extremely important. Processing large datasets requires
large amounts of all three of these resources.
• Object pooling is a technique for keeping a group of
objects (the pool) in memory.
• Whenever a new object is needed, the pool is checked
first; if an object is available it is reused. This reuse
of objects and system resources improves the
scalability of a program (see the sketch below).
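• A simplified object-pool sketch in Python (the class and method names are illustrative only):

class ObjectPool:
    # Keeps released objects so they can be reused instead of re-created.
    def __init__(self, factory):
        self._factory = factory
        self._pool = []

    def acquire(self):
        # Reuse a pooled object if one is available, otherwise create a new one.
        return self._pool.pop() if self._pool else self._factory()

    def release(self, obj):
        # Return the object to the pool for later reuse.
        self._pool.append(obj)

pool = ObjectPool(factory=dict)
a = pool.acquire()    # newly created
pool.release(a)
b = pool.acquire()    # reused from the pool
print(a is b)         # True: the same object was reused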
Cont’d…
• High Availability: In computing, the term availability is used to
describe the period of time when a service is available, as well as
the time required by a system to respond to a request made by a
user. High availability is a quality of a system or component that
assures a high level of operational performance for a given period
of time.
• Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software failures
from affecting access to data and processing. This becomes
increasingly important as we continue to emphasize the
importance of real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group. This means the system
can react to changes in resource requirements without expanding
the physical resources on a machine.
Hadoop and its ecosystem
• Hadoop is an open-source framework intended to make
interaction with big data easier. It is a framework that allows for
the distributed processing of large datasets across clusters of
computers using simple programming models.
• The four key characteristics of Hadoop are:
– Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
– Reliable: It is reliable as it stores copies of the data on
different machines and is resistant to hardware failure.
– Scalable: It is easily scalable, both horizontally and vertically.
A few extra nodes help in scaling up the framework.
– Flexible: It is flexible and you can store as much structured
and unstructured data as you need to and decide to use them
later.
Cont’d…
• Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm
libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
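• MapReduce is easiest to see through the classic word-count example. The sketch below uses two plain Python scripts in the style accepted by Hadoop Streaming (which lets any executable act as mapper and reducer); it is an illustrative sketch, not a complete Hadoop job:

# mapper.py - emit "word<TAB>1" for every word read from standard input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sum the counts; Hadoop delivers the mapper output sorted by key
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")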
Big data life cycle
Thank you
