Chapter Two
Data science
Chapter Contents
 Describe what data science is and the role of data scientists.
 Differentiate data and information.
 Describe the data processing life cycle.
 Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
 Understand the basics of Big Data.
 Describe the purpose of the Hadoop ecosystem components.
Data science
• Data science is now one of the most influential fields around.
• Companies and enterprises are investing heavily in acquiring
data science talent, creating ever more viable roles in the
data science industry.
• Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-
structured and unstructured data.
• Example: The data involved in buying a box of cereal
from the store or supermarket
Data science vs. data scientist
• Data science is defined as the extraction of
actionable knowledge directly from data through a
process of discovery, hypothesis formulation, and
analytical hypothesis testing.
• It is the process of producing, or helping to produce,
a tool, method, or other product that derives
intelligence from datasets too large or complex to
handle with conventional means.
Cont’d…
• A data scientist (a job title) is a person who engages
in systematic activity to acquire knowledge from
data.
• In a more restricted sense, a data scientist may
refer to an individual who uses the scientific
method on existing data.
• Data Scientists perform research toward a more
comprehensive understanding of products,
systems, or nature, including physical,
mathematical and social realms.
Role of a data scientist
• Apply advanced skills in analyzing large amounts
of data, data mining, and programming.
• Processed and filtered data are handed to data
scientists, who feed them into analytics programs,
machine learning models, and statistical methods to
generate results used in predictive analysis and
other fields.
• Explore the data for hidden (cryptic) patterns in
order to extract proper insights.
Algorithms
• An algorithm is a set of instructions designed
to perform a specific task.
• This can be a simple process, such as
multiplying two numbers, or a complex
operation, such as playing a compressed
video file.
• Search engines use proprietary algorithms to
display the most relevant results from their
search index for specific queries.
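• As a minimal sketch in Python (the function below is illustrative, not taken from the slides), here is a simple algorithm that multiplies two numbers by repeated addition:

def multiply(a, b):
    # Multiply two non-negative integers by repeated addition.
    result = 0
    for _ in range(b):   # add 'a' to the result, 'b' times
        result += a
    return result

print(multiply(6, 7))  # 42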
Data vs. Information
• Data
– Can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be
suitable for communication, interpretation, or processing, by
human or electronic machines.
– It can be described as unprocessed facts and figures
– It is represented with the help of characters such as alphabets
(A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >,
=, etc.).
• Information
– The processed data on which decisions and actions are based
– Information is interpreted data; created from organized,
structured, and processed data in a particular context
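• A small Python sketch (with hypothetical sales figures) of raw data being turned into information by processing it in context:

# Data: unprocessed daily sales figures (hypothetical values)
daily_sales = [120, 95, 140, 110, 160]

# Information: the data processed and given context for a decision
average = sum(daily_sales) / len(daily_sales)
print(f"Average daily sales this week: {average:.1f} units")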
Data processing cycle
• Data processing is the conversion of raw data to
meaningful information through a process.
• Data is manipulated to produce results that lead
to a resolution of a problem or improvement of
an existing situation.
• The process includes activities like data
entry/input, calculation/process, output and
storage
Cont’d…
• Input is the task where verified data is coded or
converted into machine readable form so that it
can be processed through a computer. Data entry
is done through the use of a keyboard, digitizer,
scanner, or data entry from an existing source.
Input → Processing → Output
Cont’d…
• Processing is the stage at which the data is subjected to various
means and methods of manipulation; it is the point where a
computer program is executed, holding the program code and its
current activity.
• Output and interpretation is the stage where the processed
information is transmitted to the user. Output is presented to
users in various formats, such as a printed report, audio, video,
or display on a monitor.
• Storage is the last stage in the data processing cycle, where data,
instruction and information are held for future use. The
importance of this cycle is that it allows quick access and
retrieval of the processed information, allowing it to be passed
on to the next stage directly, when needed.
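• The four stages can be sketched as a toy Python pipeline (the file name and calculation are illustrative only):

def data_processing_cycle():
    raw = input("Enter comma-separated numbers: ")       # Input
    numbers = [float(x) for x in raw.split(",")]
    total = sum(numbers)                                  # Processing
    print("Sum of the entered numbers:", total)           # Output
    with open("results.txt", "a") as f:                   # Storage
        f.write(f"{numbers} -> {total}\n")

data_processing_cycle()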
Data types
• A data type is a way to tell the compiler which kind of data (integer,
character, float, etc.) is to be stored and, consequently, how much
memory to allocate.
• Equivalently, a data type tells the compiler that a memory cell x may
only hold bit values within some range y; it restricts the compiler
from storing anything outside that value range.
• Common data types include:
– Integer (int): stores whole numbers, mathematically known
as integers
– Boolean (bool): stores a value restricted to one of two values:
true or false
– Character (char): stores a single character
– Floating-point number (float): stores real numbers
– Alphanumeric string (string): stores a combination of
characters and numbers
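• In Python, the same categories can be illustrated as follows (Python has no separate char type, so a one-character string stands in for it; the variable names are made up):

age = 25                # int: a whole number
is_student = True       # bool: one of two values, True or False
grade = "A"             # char: a single character (a length-1 string in Python)
gpa = 3.75              # float: a real (floating-point) number
student_id = "ETS0123"  # string: a combination of characters and digits

print(type(age), type(is_student), type(grade), type(gpa), type(student_id))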
Data representation
• Types are an abstraction that lets us model things in
categories; a type is largely a mental construct.
• All computers represent data as nothing more than
strings of ones and zeroes.
• For those ones and zeroes to convey any meaning,
they need to be contextualized.
• Data types provide that context.
– E.g. 01100001 can be read as the integer 97 or as the
character 'a', depending on the declared type.
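• A short Python check of the example above, showing that the same bit pattern 01100001 is the integer 97 or the character 'a' depending on how it is interpreted:

bits = "01100001"
as_int = int(bits, 2)   # interpret the bits as an integer   -> 97
as_char = chr(as_int)   # interpret that value as a character -> 'a'
print(as_int, as_char)  # 97 a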
Data types from a data analytics perspective
• Data analytics (DA) is the process of examining data
sets in order to draw conclusions about the information
they contain, increasingly with the aid of specialized
systems and software.
• From a data analytics point of view, it is important
to understand that there are three common data types
or structures:
– Structured,
– Semi-structured, and
– Unstructured data types
Structured data
• Structured data is data that adheres to a pre-defined
data model and is therefore straightforward to
analyze.
• Structured data covers all data that can be stored in a
SQL database in tables with rows and columns. It has
relational keys and can easily be mapped into
pre-designed fields.
• Structured data is highly organized information that
loads neatly into a relational database.
• Structured data is relatively simple to enter, store,
query, and analyze, but it must be strictly defined in
terms of field name and type.
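• A minimal structured-data sketch using Python's built-in sqlite3 module (the table and field names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO customers (name, age) VALUES (?, ?)", ("Abebe", 30))
conn.commit()

# Every row conforms to the pre-defined fields (id, name, age).
for row in conn.execute("SELECT id, name, age FROM customers"):
    print(row)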
Semi structured data
• Semi-structured data is a form of structured data that
does not conform with the formal structure of data
models associated with relational databases or other
forms of data tables, but nonetheless, contains tags or
other markers to separate semantic elements and
enforce hierarchies of records and fields within the
data.
• Semi-structured data is information that doesn’t reside
in a relational database but that does have some
organizational properties that make it easier to analyze.
• Examples of semi-structured data: CSV, XML, and JSON
documents are semi-structured, and NoSQL databases are
also considered semi-structured.
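• A small JSON example parsed with Python's json module (the record itself is invented); the tags/keys give the data structure even though there is no fixed relational schema:

import json

record = '{"name": "Sara", "skills": ["Python", "SQL"], "address": {"city": "Addis Ababa"}}'
data = json.loads(record)   # keys act as markers separating semantic elements
print(data["name"], data["address"]["city"])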
Unstructured data
• Unstructured data is information that either does not have
a predefined data model or is not organized in a pre-
defined manner.
• Unstructured data may have its own internal structure but
does not fit neatly into a spreadsheet or database.
• Most business interactions, in fact, are unstructured in
nature.
• Today more than 80% of the data generated is
unstructured.
• The fundamental challenge of unstructured data sources is
that they are difficult for nontechnical business users and
data analysts alike to unbox, understand, and prepare for
analytic use.
Metadata – Data about data
• Metadata is data about data. Data that describes other data.
• It provides additional information about a specific set of
data.
• Metadata summarizes basic information about data, which
can make finding and working with particular instances of
data easier.
• For example, author, date created and date modified and
file size are examples of very basic document metadata.
• Having the ability to filter through that metadata makes it
much easier for someone to locate a specific document.
• In the context of databases, metadata would be information on
tables, views, columns, arguments, etc.
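• File-system metadata can be inspected with Python's os module; a sketch, assuming a file named example.txt exists in the working directory:

import os, time

info = os.stat("example.txt")           # metadata about the file, not its content
print("Size (bytes):", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))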
Data value chain
• The Data Value Chain is introduced to describe the
information flow within a big data system as a series
of steps needed to generate value and useful insights
from data.
– Data acquisition, data analysis, data curation, data
storage, data usage
• Data acquisition is the process of digitizing data from
the world around us so it can be displayed, analyzed,
and stored in a computer. It is the process of bringing
data that has been created by a source outside the
organization into the organization for production use.
Cont’d…
• Data analysis is the process of evaluating data
using analytical and logical reasoning to examine
each component of the data provided. Data from
various sources is gathered, reviewed, and then
analyzed to form some sort of finding or
conclusion.
• Data analytics is the process of extracting information
from data in order to make a decision and subsequently
act on it.
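• A tiny analysis sketch using Python's statistics module (the gathered figures are hypothetical), turning raw measurements into a finding that supports a decision:

import statistics

response_times = [1.2, 0.9, 1.5, 2.1, 1.1]   # data gathered from various sources
mean = statistics.mean(response_times)
stdev = statistics.stdev(response_times)
print(f"Mean = {mean:.2f} s, standard deviation = {stdev:.2f} s")
if mean > 1.0:
    print("Finding: responses are slow on average -> act to optimize")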
Cont’d…
• The Big Data Value Chain identifies the following key high-
level activities: data acquisition, data analysis, data curation,
data storage, and data usage.
Big data
• No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it.
Cont’d…
• Big data is the term for a collection of data sets so
large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications.
• In other words, data in the range of hundreds of terabytes
(TB) to petabytes (PB) falls into the category of Big Data.
• However, Big Data is not just about the amount of data;
what matters is what organizations do with the data.
• Big Data is analyzed for insights that lead to
better decisions.
Cont’d…
• Big Data is associated with the concept of the 3 Vs:
volume, velocity, and variety. Big data is
characterized by these and more:
– Volume: large amounts of data (zettabytes /
massive datasets)
– Velocity: Data is live streaming or in motion
– Variety: data comes in many different forms
from diverse sources
– Veracity: can we trust the data? How
accurate is it? etc.
Cluster computing
• Cluster computing addresses the latest developments in
the fields that support High Performance Distributed
Computing (HPDC).
• Clustering approaches such as HPC IaaS and HPC PaaS
have been identified; these are more expensive and
difficult to set up and maintain than a single
computer.
• In HPDC environments, parallel and/or
distributed computing techniques are applied to
the solution of computationally intensive
applications across networks of computers.
Cont’d…
• A “computer cluster” basically refers to a set of
connected computers working together.
• The cluster represents one system and the
objective is to improve performance.
• The computers are generally connected in a LAN
(Local Area Network).
• So, when this cluster of computers works to
perform some tasks and gives an impression of
only a single entity, it is called “cluster
computing”.
Cont’d…
• Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:
– Resource Pooling:
• Combining the available storage space to hold data is a
clear benefit, but CPU and memory pooling are also
extremely important. Processing large datasets requires
large amounts of all three of these resources.
• Object pooling is a technique for keeping a group of
objects (the pool) in memory.
• Whenever a new object is needed, the pool is checked
first; if an object is available it is reused. This reuse
of objects and system resources improves the
scalability of a program (see the sketch below).
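• A simplified object-pool sketch in Python (the class and method names are illustrative only):

class ObjectPool:
    # Keeps released objects so they can be reused instead of re-created.
    def __init__(self, factory):
        self._factory = factory
        self._pool = []

    def acquire(self):
        # Reuse a pooled object if one is available, otherwise create a new one.
        return self._pool.pop() if self._pool else self._factory()

    def release(self, obj):
        # Return the object to the pool for later reuse.
        self._pool.append(obj)

pool = ObjectPool(factory=dict)
a = pool.acquire()    # newly created
pool.release(a)
b = pool.acquire()    # reused from the pool
print(a is b)         # True: the same object was reused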
Cont’d…
• High Availability: In computing, the term availability is used to
describe the period of time when a service is available, as well as
the time required by a system to respond to a request made by a
user. High availability is a quality of a system or component that
assures a high level of operational performance for a given period
of time.
• Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software failures
from affecting access to data and processing. This becomes
increasingly important as we continue to emphasize the
importance of real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group. This means the system
can react to changes in resource requirements without expanding
the physical resources on a machine.
Hadoop and its ecosystem
• Hadoop is an open-source framework intended to make
interaction with big data easier. It is a framework that allows for
the distributed processing of large datasets across clusters of
computers using simple programming models.
• The four key characteristics of Hadoop are:
– Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
– Reliable: It is reliable as it stores copies of the data on
different machines and is resistant to hardware failure.
– Scalable: It is easily scalable, both horizontally and vertically.
A few extra nodes help in scaling up the framework.
– Flexible: It is flexible and you can store as much structured
and unstructured data as you need to and decide to use them
later.
Cont’d…
• Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm
libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
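• MapReduce is easiest to see through the classic word-count example. The sketch below uses two plain Python scripts in the style accepted by Hadoop Streaming (which lets any executable act as mapper and reducer); it is an illustrative sketch, not a complete Hadoop job:

# mapper.py - emit "word<TAB>1" for every word read from standard input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sum the counts; Hadoop delivers the mapper output sorted by key
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")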
Big data life cycle
Thank you
