Chapter Two:
Data science
Abey B. (MSc)
Unity University
Faculty of Computing
Department of Computer Science
November 2, 2024
Abey B. (UU) Introduction to Emerging Technologies November 2, 2024 1 / 50
What is Data Science?
Data science has become one of the most widely discussed topics today.
Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from
▶ Structured Data
▶ Semi-structured Data
▶ Unstructured Data
Data science is the art of gaining knowledge from data, or of
extracting meaning from large databases.
Data science can be defined as the extraction of actionable knowledge
directly from data through a process that uses different
methods.
A data scientist (a job title) is a person who engages in a systematic
activity to acquire knowledge from data.
What is an Algorithm?
An algorithm is a set of instructions designed to perform a
specific task.
An algorithm is a finite sequence of steps that solves a particular
problem.
For a computer, an algorithm is the set of commands it must follow
to perform calculations or other problem-solving operations.
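As a minimal illustration (not taken from the original slides), the definition above can be made concrete with a small Python function. The function name `find_largest` and the sample list are hypothetical; the point is only that the steps are finite, ordered, and solve one specific problem.

```python
def find_largest(numbers):
    """Return the largest value in a non-empty list."""
    # Step 1: assume the first item is the largest seen so far.
    largest = numbers[0]
    # Step 2: compare every remaining item with the current largest.
    for value in numbers[1:]:
        if value > largest:
            largest = value
    # Step 3: after a finite number of comparisons, report the result.
    return largest


print(find_largest([12, 45, 7, 31]))  # prints 45
```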
Data vs. Information
Data
Data is raw, unorganized facts that need to be processed.
Each piece of data is a small fact that means little on its own.
It can be described as unprocessed facts and figures.
It is represented with characters such as letters (A-Z,
a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
There are two main types of data:
1 Quantitative data
▶ It is provided in numerical form, like the weight, volume, or cost of an
item.
2 Qualitative data
▶ It is descriptive, but non-numerical, like the name, sex, or eye color of
a person.
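The distinction can be sketched in Python by treating numeric fields as quantitative and everything else as qualitative. This is a simplification for illustration: the record and its field names are hypothetical, and some digit-only values (a phone number, for example) are really qualitative even though they look numeric.

```python
# Hypothetical record describing one item; field names are illustrative only.
record = {"name": "Laptop", "color": "silver", "weight_kg": 1.4, "cost": 950}


def split_by_kind(rec):
    """Separate numeric (quantitative) fields from descriptive (qualitative) ones."""
    quantitative = {}
    qualitative = {}
    for field, value in rec.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            quantitative[field] = value
        else:
            qualitative[field] = value
    return quantitative, qualitative


quantitative, qualitative = split_by_kind(record)
print(quantitative)  # {'weight_kg': 1.4, 'cost': 950}
print(qualitative)   # {'name': 'Laptop', 'color': 'silver'}
```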
Information
Information is interpreted data: it is created from organized, structured,
and processed data in a particular context.
In short, information is data that has been processed, organized, and structured.
Difference Between Data and Information
Data Processing Cycle
The data processing cycle is the set of operations used to
transform data into useful information.
As the term suggests, the data processing cycle is a sequence of steps or
operations for processing data, i.e., turning raw data into a
usable form.
Data processing is the restructuring or reordering of data by
people or machines to increase its usefulness and add value for
a particular purpose.
Stages of data processing:
Input
▶ After collection, the raw data needs to be fed into the cycle for
processing.
▶ This is the first step and is called input.
Processing
▶ Once the input is provided, the raw data is processed by a suitable or
selected processing method.
▶ This is the most important step, as it produces the processed data
in the form of output, which will be used further.
Output
▶ This is the outcome: the raw data provided in the first stage has
now been processed, so it is useful, provides information,
and is no longer called data.
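The three stages can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the function name `process_scores` and the sample exam scores are hypothetical and do not come from the slides.

```python
def process_scores(raw_scores):
    """Turn raw score strings (input) into a small summary (output)."""
    # Input: raw, unorganized facts, here strings as they might be collected.
    # Processing: convert the strings to numbers and compute summary values.
    scores = [int(s) for s in raw_scores]
    average = sum(scores) / len(scores)
    # Output: information that can support a decision.
    return {"count": len(scores), "average": average, "highest": max(scores)}


raw_scores = ["72", "85", "90", "65"]
summary = process_scores(raw_scores)
print(summary)  # {'count': 4, 'average': 78.0, 'highest': 90}
```

On their own, the four strings say little; the summary is what a teacher could actually act on.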
Data types and their representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for instance,
▶ a data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
Data types can be classified from two perspectives:
1 Data types from a data analytics perspective
2 Data types from a computer programming perspective
Data types from a data analytics perspective
From a data analytics point of view, it is important to understand
that there are three common types of data:
1 Structured data
2 Semi-structured data
3 Unstructured data
Data types from Data Analytics perspective
Structured Data
Structured data refers to data that is organized and formatted in
a specific way to make it easily readable and understandable by
both humans and machines.
Structured data is stored in a table format, with a relationship
between the different rows and columns.
Structured data is highly valuable because it can be easily
searched, queried, and analyzed using various tools and
techniques.
Common examples of structured data are
▶ Excel files
▶ SQL databases
Each of these has structured rows and columns that can be sorted
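To show why rows and columns make data easy to search and sort, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `students`, its columns, and the sample rows are assumptions made for this example.

```python
import sqlite3

# An in-memory SQL database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Abebe", 82), ("Sara", 91), ("Kebede", 67)],
)

# Because every row has the same columns, filtering and sorting is one query.
rows = conn.execute(
    "SELECT name, score FROM students WHERE score >= 80 ORDER BY score DESC"
).fetchall()
print(rows)  # [('Sara', 91), ('Abebe', 82)]
conn.close()
```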
Unstructured Data
Unstructured data is information that either does not have a
predefined data model or is not organized in a predefined
manner.
Unstructured data may have its own internal structure, but it does
not fit neatly into a spreadsheet or database.
An estimated 80% to 90% of the data generated and collected by organizations
is unstructured.
▶ Its volume is growing rapidly, many times faster than the rate
of growth for structured databases.
Examples of unstructured data include
▶ Audio files
▶ Video files
▶ Entertainment data
▶ Sensor data, etc.
Semi-structured Data
Semi-structured data is a type of data that is not purely
structured, but also not completely unstructured.
It contains some level of organization or structure, but does not
conform to a rigid schema or data model.
Semi-structured data contains tags or other markers to separate
semantic elements.
Semi-structured data is information that does not reside in a
relational database but has some organizational properties
that make it easier to analyze.
Examples of semi-structured data include
▶ JSON
▶ XML
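The "tags or other markers" are easiest to see in a concrete case. The sketch below parses the same hypothetical student record from JSON and from XML using Python's standard `json` and `xml.etree.ElementTree` modules; the field names and values are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two semi-structured formats.
json_text = '{"name": "Sara", "courses": ["Math", "Physics"]}'
xml_text = (
    "<student><name>Sara</name>"
    "<course>Math</course><course>Physics</course></student>"
)

# JSON: keys act as markers that label each value.
person = json.loads(json_text)
print(person["name"], person["courses"])

# XML: opening and closing tags mark each semantic element.
root = ET.fromstring(xml_text)
xml_name = root.find("name").text
xml_courses = [course.text for course in root.findall("course")]
print(xml_name, xml_courses)
```

Neither format forces every record to have the same fields, which is what separates it from a rigid table schema.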
Data types from Computer programming perspective
From a computer programming perspective, common data types
include
1 Integers (int)
2 Booleans (bool)
3 Characters (char)
4 Floating-point numbers (float)
5 Alphanumeric strings (string)
Integers (int)
▶ Used to store whole numbers
▶ Mathematically known as integers
▶ Examples of integers are 0, 1, 2, 3, and 4.
Booleans (bool)
▶ Used to represent data restricted to one of two values:
true or false
Characters (char)
▶ Used to store a single character
Floating-point numbers (float)
▶ Used to store real numbers
▶ A floating-point number is a positive or negative number with a
fractional part, written with a decimal point (for example, 3.14 or -0.5).
Alphanumeric strings (string)
▶ Used to store a combination of characters and numbers
▶ A string is a sequence of characters enclosed in double
quotes "..."
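The five types can be seen side by side in a short Python sketch. The variable names and values are hypothetical. Note one assumption about the language chosen here: Python has no separate `char` type, so a single character is simply a string of length one (languages such as C or Java do have a distinct `char`).

```python
age = 21                # int: a whole number
is_enrolled = True      # bool: one of two values, True or False
grade = "A"             # a single character; in Python this is a str of length 1
gpa = 3.75              # float: a number with a fractional part
student_id = "UU2024"   # str: a mix of letters and digits

# Print each value together with the name of its type.
for value in (age, is_enrolled, grade, gpa, student_id):
    print(repr(value), type(value).__name__)
```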
What is Metadata?
Metadata is data about data.
It provides additional information about a specific set of data.
In a set of photographs, for example, metadata could describe
when and where the photos were taken.
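File systems keep metadata too. As a rough sketch, the Python code below writes a small temporary file and then reads its size and last-modified time with `os.stat`; the file name `note.txt` and its contents are made up for the example.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "note.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello data science")  # this text is the data itself

    # os.stat returns data about the file, not the file's contents.
    info = os.stat(path)
    size_bytes = info.st_size
    modified = time.ctime(info.st_mtime)

print("size in bytes:", size_bytes)
print("last modified:", modified)
```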
What is the Data Value Chain?
The Data Value Chain describes the information flow within a big data
system as a series of steps needed to generate value and useful
insights from data.
The Big Data Value Chain identifies the following key high-level
activities:
1 Data Acquisition
2 Data Analysis
3 Data Curation
4 Data Storage
5 Data Usage
Data Acquisition
It is the process of gathering, filtering, and cleaning data before
it is put in a data warehouse or any other storage solution on which
data analysis can be carried out.
Data acquisition is one of the major big data challenges in terms
of infrastructure requirements.
Data analysis
Data analysis is the process of evaluating data using analytical
and logical reasoning to examine each component of the data
provided.
Data from various sources is gathered, reviewed, and then
analyzed to form some sort of finding or conclusion
Data Curation
It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective
usage.
Data curation is performed by expert curators (such as scientific curators
and data annotators) who are responsible for improving the
accessibility and quality of data.
Data storage
It is the persistence and management of data in a scalable way
that satisfies the needs of applications that require fast access to
the data.
It can also be defined as a way of keeping information in
storage for use by a computer.
Data usage
It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis within
the business activity.
What is Big Data?
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
Big data refers to the large, diverse sets of information that grow
at ever-increasing rates.
What matters is not only the amount of data but what
organizations do with it.
Big data is analyzed for insights that lead to better decisions.
Big data is characterized by 3V
1 Volume
2 Velocity
3 Variety
Big data is characterized by 3V
Volume
Big data is huge.
Traditional data is measured in familiar sizes like
megabytes, gigabytes, and terabytes,
while big data is stored in petabytes and zettabytes.
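To give a feel for these sizes, the sketch below computes how the decimal (SI) units relate, where each step up is a factor of 1000. The helper name `bytes_in` is hypothetical, and the exabyte, which sits between the petabyte and the zettabyte, is included only to keep the sequence complete.

```python
# Decimal (SI) storage units, each 1000 times larger than the previous one.
UNITS = [
    "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte",
]


def bytes_in(unit):
    """Number of bytes in one decimal unit, e.g. 1 megabyte = 1000**2 bytes."""
    return 1000 ** (UNITS.index(unit) + 1)


print(bytes_in("petabyte") // bytes_in("terabyte"))   # 1000
print(bytes_in("zettabyte") // bytes_in("petabyte"))  # 1000000
```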
Velocity
Velocity refers to the speed at which data is generated.
The data is increasing at a very fast rate.
Sensors and social media platforms all continuously generate
enormous volumes of data.
It is estimated that the volume of data will double every 2
years.
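Doubling every 2 years means the volume after t years is the starting volume multiplied by 2 raised to the power t/2. A minimal sketch of that arithmetic follows; it assumes the doubling rate stays constant, and the function name `projected_volume` is hypothetical.

```python
def projected_volume(initial, years, doubling_period=2):
    """Volume after `years` if it doubles once every `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)


# Starting from 1 unit of data, ten years means five doublings: 2**5 = 32.
print(projected_volume(1, 10))  # 32.0
```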
Variety
Variety means that data comes in many different forms from diverse sources.
Variety refers to heterogeneous sources and the nature of
data: structured, semi-structured, and unstructured.
What is Clustered Computing?
Cluster computing refers to several computers linked on a
network and operating as a single entity.
Each computer that is linked to the network is known as a node.
Cluster computing provides solutions to difficult problems
by providing faster speed and enhanced data integrity.
The connected computers perform operations together
The cluster represents one system and the objective is to
improve performance.
Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits:
▶ Resource Pooling
▶ High Availability
▶ Easy Scalability
Hadoop and its Ecosystem
Cluster membership and resource allocation can be handled by
software like Hadoop’s
Hadoop is an open-source framework intended to make
interaction with big data easier
Hadoop is a framework that allows for the distributed processing
of large datasets across clusters of computers using simple
programming models.
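The slides do not name a specific programming model; the best known one associated with Hadoop is MapReduce. The sketch below imitates its idea in plain Python on a single machine: each chunk of text stands in for data held on a different node, a map step emits (word, 1) pairs, and a reduce step adds them up. It does not use any Hadoop API, and the function names and sample text are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Each string stands in for a block of data stored on a different node.
chunks = ["big data is big", "data is valuable"]

mapped = []
for chunk in chunks:
    mapped.extend(map_phase(chunk))  # on a real cluster these run in parallel

word_counts = reduce_phase(mapped)
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```

Because each map call needs only its own chunk, the work can be spread over many ordinary machines, which is the property the definition above relies on.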
The four key characteristics of Hadoop are:
1 Economical
▶ Hadoop systems are highly economical, as ordinary computers can be
used for data processing.
2 Reliable
▶ It is reliable, as it stores copies of the data on different machines
and is resistant to hardware failure.
3 Scalable
▶ It is easily scalable, both horizontally and vertically.
4 Flexible
▶ It is flexible: you can store as much structured and
unstructured data as you need.
Big Data Life Cycle with Hadoop
1 Ingesting data into the system
▶ The first stage of big data processing is ingestion.
▶ The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
2 Processing the data in storage
▶ In this stage, the data is stored and processed.
3 Computing and analyzing data
▶ The data is analyzed by processing frameworks
4 Visualizing the results
▶ In this stage, the analyzed data can be accessed by users
End of Chapter Two
Thank You!!!!