WOLAITA SODO UNIVERSITY
Department of IT
Introduction to Emerging Technologies
(EMTE1011/1012)
By: Mesele G.
Feb 2, 2021
CH-2: Overview of Data Science
What are data and information?
• Data
• It can be defined as a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for communication,
interpretation, or processing by humans or electronic machines.
• It can be described as unprocessed facts and figures.
• Information
• It is the processed data on which decisions and actions are based.
• It is data that has been processed into a form that is meaningful to the
recipient and is of real or perceived value in the current or the
prospective action or decision of the recipient.
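The distinction between data and information can be illustrated with a minimal Python sketch. The exam marks, the `summarize` function, and the pass mark of 60 are hypothetical values chosen only for illustration; they do not come from the course material.

```python
from statistics import mean

# Hypothetical raw data: unprocessed exam marks (facts and figures).
raw_marks = [72, 85, 64, 90, 58]


def summarize(marks, pass_mark=60):
    """Process raw marks into information that supports a decision."""
    return {
        "average": mean(marks),
        "highest": max(marks),
        "passed": sum(1 for m in marks if m >= pass_mark),
    }


info = summarize(raw_marks)
print(info)
```

The list of marks on its own is data; the summary is information because it is in a form a teacher could act on.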
What is data science?
• It is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and
insights from structured, semi-structured, and unstructured data.
• It is much more than simply analyzing data.
Data processing Cycle
• Data processing is the restructuring or reordering of data by people or
machines to increase its usefulness and add value for a particular purpose.
• It consists of the following basic steps:
• input
• processing and
• output
• Input: in this step, the input data is prepared in some convenient form for
processing. The form will depend on the processing machine. For example, when
electronic computers are used, the input data can be recorded on any one of
several types of storage medium, such as a hard disk, CD, flash disk, and so on.
• Processing: in this step, the input data is changed to produce
data in a more useful form. For example, interest can be
calculated on deposit to a bank, or a summary of sales for the
month can be calculated from the sales orders.
• Output: at this stage, the result of the preceding processing
step is collected. The particular form of the output data
depends on the use of the data. For example, output data may
be payroll for employees.
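The three steps of the cycle can be sketched in Python using the bank-interest example. The record format, the function names, and the 5% rate are assumptions made for this illustration, not part of the original slides.

```python
def read_input(lines):
    """Input: prepare raw text records as (account, balance) pairs."""
    records = []
    for line in lines:
        account, balance = line.split(",")
        records.append((account.strip(), float(balance)))
    return records


def process(records, annual_rate):
    """Processing: compute one year of simple interest on each deposit."""
    return [(acct, bal, bal * annual_rate) for acct, bal in records]


def write_output(results):
    """Output: collect the results as human-readable report lines."""
    return [
        f"{acct}: balance={bal:.2f}, interest={interest:.2f}"
        for acct, bal, interest in results
    ]


# Hypothetical input data and interest rate.
raw_lines = ["A-001, 1000", "A-002, 2500"]
report = write_output(process(read_input(raw_lines), annual_rate=0.05))
for line in report:
    print(line)
```

Each function corresponds to one step, so the output of one step is the input of the next.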
Data types and their representation
Data types from a computer programming perspective
• Almost all programming languages explicitly include the notion of a data
type, although different languages may use different terminology.
• Common data types include:
• Integers (int) - used to store whole numbers, mathematically known as integers
• Booleans (bool) - used to represent values restricted to one of two values: true or false
• Characters (char) - used to store a single character
• Floating-point numbers (float) - used to store real numbers
• Alphanumeric strings (string) - used to store a combination of
characters and numbers.
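As a rough illustration, the data types above map onto Python's built-in types as follows. The variable names and values are hypothetical. Note that Python has no separate char type, so a one-character string stands in for it here.

```python
count = 42              # integer (int): a whole number
is_valid = True         # Boolean (bool): either True or False
grade = "A"             # character: represented as a one-character str in Python
price = 19.99           # floating-point number (float): a real number
student_id = "IT2021"   # alphanumeric string (str): letters and digits together

values = {
    "count": count,
    "is_valid": is_valid,
    "grade": grade,
    "price": price,
    "student_id": student_id,
}
for name, value in values.items():
    # type(value).__name__ gives the name of the value's data type.
    print(f"{name}: {value!r} is of type {type(value).__name__}")
```

Other languages, such as C or Java, declare these types explicitly and do provide a distinct char type.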
Data types from a data analytics perspective
• There are three common types of data or data structures.
1. Structured data
• It adheres to a pre-defined data model and is therefore straightforward to analyze.
It conforms to a tabular format with a relationship between the different rows
and columns (for example, Excel files or SQL databases).
2. Semi-structured data
• It is a form of structured data that does not conform to the formal structure of data models
associated with relational databases. It is also known as a self-describing structure
(examples: JSON and XML).
3. Unstructured data
• It does not have a predefined data model or is not organized in a
predefined manner.
• It is typically text-heavy but may contain data such as dates, numbers, and
facts as well.
• Common examples of unstructured data include audio files, video files, and
NoSQL databases.
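The three categories can be contrasted with a small Python sketch that uses only the standard library. The sample CSV, JSON, and free-text contents are hypothetical and serve only to show how each kind of data is read.

```python
import csv
import io
import json

# Structured: tabular rows and columns that follow a fixed header.
csv_text = "name,age\nAbebe,21\nSara,19\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; records need not share the same fields.
json_text = (
    '[{"name": "Abebe", "age": 21},'
    ' {"name": "Sara", "email": "sara@example.com"}]'
)
records = json.loads(json_text)

# Unstructured: free text with no predefined model; facts must be extracted.
note = "Sara joined on 2021-02-02 and scored 90."
words = note.split()

print(rows[0]["age"])       # every CSV row has the same columns
print(sorted(records[1]))   # each JSON record describes its own fields
print(len(words))           # free text can only be split into tokens here
```

The CSV reader can rely on the header, the JSON records carry their own field names, and the free text needs further processing before any field can be identified.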
Metadata
• The last category of data type is metadata.
• Metadata is data about data.
• It is one of the most important elements
for big data analysis and big data
solutions.
• It provides additional information about
a specific set of data.
• For example, in a set of photographs,
metadata could describe when and where
the photos were taken.
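The photograph example can be sketched in Python as follows. The `PhotoMetadata` class, its fields, and the sample file names, dates, and places are hypothetical; real photo files usually carry such metadata in embedded formats that this sketch does not read.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PhotoMetadata:
    """Data about a photo, kept separately from the image data itself."""
    filename: str
    taken_at: datetime
    location: str


# Hypothetical metadata for a small set of photographs.
photos = [
    PhotoMetadata("img_001.jpg", datetime(2021, 2, 2, 9, 30), "Wolaita Sodo"),
    PhotoMetadata("img_002.jpg", datetime(2021, 1, 15, 14, 0), "Addis Ababa"),
    PhotoMetadata("img_003.jpg", datetime(2021, 2, 10, 8, 0), "Wolaita Sodo"),
]


def taken_in(items, location):
    """Answer 'where' questions using metadata only, without opening any image."""
    return [p.filename for p in items if p.location == location]


# Answer a 'when' question the same way.
february = [p.filename for p in photos if p.taken_at.month == 2]

print(taken_in(photos, "Wolaita Sodo"))
print(february)
```

Both queries are answered from the metadata alone, which is why metadata matters so much when the underlying data is large.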
Data Value Chain
• The Data Value Chain describes the information flow within a big data system as a series of
steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
activities:
• Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it
is put in a data warehouse or any other storage solution on which
data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms
of infrastructure requirements.
• Data Analysis
• It is concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage.
• Data analysis involves exploring, transforming, and modeling
data.
• Data Curation
• It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements
for its effective usage.
• Data curation processes can be categorized into
different activities such as content creation, selection,
classification, transformation, validation, and
preservation.
• Data curation is performed by expert curators who are
responsible for improving the accessibility and
quality of data.
• Data Storage
• It is the management of data in a scalable way that satisfies the
needs of applications that require fast access to the data.
• Relational Database Management Systems (RDBMS) have
been the main, and almost the only, storage solution
for nearly 40 years.
• However, as data volumes and complexity grow, they become
less suitable for big data scenarios.
• NoSQL technologies have been designed with the scalability goal
in mind and present a wide range of solutions based on alternative
data models.
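The difference between a relational data model and an alternative data model can be hinted at with a short Python sketch. It uses the standard-library `sqlite3` module for the relational side, and a plain dictionary merely stands in for a document-style NoSQL store. The table, keys, and records are hypothetical.

```python
import sqlite3

# Relational model: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO students (id, name, age) VALUES (?, ?, ?)",
    [(1, "Abebe", 21), (2, "Sara", 19)],
)
adults = conn.execute("SELECT name FROM students WHERE age >= 20").fetchall()
conn.close()

# Document-style model: each record is looked up by key and may have
# different fields. A dict is used here only as a stand-in for a real store.
documents = {
    "student:1": {"name": "Abebe", "age": 21},
    "student:2": {"name": "Sara", "courses": ["EMTE1011"]},
}

print(adults)
print(documents["student:2"].get("age", "unknown"))
```

The sketch shows only the modeling difference. It says nothing about scalability, which depends on how a real system distributes and replicates the data.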
• Data Usage
• It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance
competitiveness through the reduction of costs,
increased added value, or any other parameter that can
be measured against existing performance criteria.
What is Big Data?
• Big data is a blanket term for the non-traditional strategies and
technologies needed to gather, organize, process, and extract insights from
large datasets.
• The problem of working with data that exceeds the computing
power or storage of a single computer is not new.
• Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
• In this context, a large dataset means a dataset too large to reasonably
process or store with traditional tooling or on a single computer.
• The common scale of big datasets is constantly shifting and may vary
significantly from organization to organization. Big data is characterized by the 3Vs
and more:
1. Volume: large amounts of data (zettabytes, massive datasets)
2. Velocity: data is live-streaming or in motion
3. Variety: data comes in many different forms from diverse sources
4. Veracity: can we trust the data? How accurate is it?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
• To better address the high storage and computational needs of big data,
computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold
data is a clear benefit, but CPU and memory pooling are also
extremely important.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding machines to the group.
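Horizontal scaling can be illustrated with a simplified Python sketch that spreads records across the machines of a cluster. The node names, the record keys, and the hash-modulo placement rule are assumptions made for this example. Real cluster software uses more elaborate placement and replication schemes.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for a key using a stable hash and simple modulo placement."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping from each node to the keys it would store."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


# Hypothetical records and cluster sizes.
keys = [f"record-{i}" for i in range(1000)]
small = distribute(keys, ["node-1", "node-2"])
large = distribute(keys, ["node-1", "node-2", "node-3", "node-4"])

for name, cluster in (("2 nodes", small), ("4 nodes", large)):
    print(name, {node: len(held) for node, held in cluster.items()})
```

Adding machines lowers the number of records each node has to hold, which is the core idea behind pooling storage and scaling out.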
• Using clusters requires a solution for managing
cluster membership, coordinating resource
sharing, and scheduling actual work on individual
nodes.
• Cluster membership and resource allocation
can be handled by software like Hadoop's
YARN.
Hadoop and its Ecosystem
• Hadoop is an open-source framework that allows for the distributed
processing of large datasets across clusters of computers.
• The four key characteristics of Hadoop are:
1. Economical: Its systems are highly economical, as ordinary
computers can be used for data processing.
2. Reliable: It is reliable, as it stores copies of the data on
different machines and is resistant to hardware failure.
3. Scalable: It is easily scalable, both horizontally and
vertically. A few extra nodes help in scaling up the
framework.
4. Flexible: It is flexible, and you can store as much structured
and unstructured data as you need and decide to use it
later.
• Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage. It is continuously growing to
meet the needs of big data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm
libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Managing the cluster
• Oozie: Job scheduling
Big Data Life Cycle with Hadoop
1. Ingesting data into the system:
• The first stage is Ingest.
• The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas
Flume transfers event data.
2. Processing the data in storage:
• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and
in the NoSQL distributed database, HBase. Spark and
MapReduce perform the data processing.
3. Computing and analyzing data:
• The data is analyzed by processing frameworks such as Pig, Hive,
and Impala.
• Pig converts the data using map and reduce operations and then analyzes
it.
• Hive is also based on map and reduce programming and is
most suitable for structured data.
4. Visualizing the results:
• The fourth stage is Access, which is performed by tools such as
Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
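The map and reduce programming model mentioned above can be sketched in plain Python with a word count. This is a single-machine illustration of the idea only. It does not use Hadoop, Pig, or Hive, and the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could be mapped on a different node.
lines = ["big data needs big clusters", "data moves to clusters"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread over many machines, which is what a framework like MapReduce arranges.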
By: Mesele Introduction to Emerging Technologies (EMTE1011/1012) 56/ 1
By: Mesele Introduction to Emerging Technologies (EMTE1011/1012) 143/ 1

More Related Content

What's hot

History of Ethiopia & the Horn Unit 1 (1).pptx
History of Ethiopia & the Horn Unit 1 (1).pptxHistory of Ethiopia & the Horn Unit 1 (1).pptx
History of Ethiopia & the Horn Unit 1 (1).pptxTeamireabDesta
 
Introduction to emerging technology
Introduction to emerging technologyIntroduction to emerging technology
Introduction to emerging technologyBiniam Behailu
 
4_5825656793769971825.pptx
4_5825656793769971825.pptx4_5825656793769971825.pptx
4_5825656793769971825.pptxWakjiraTesfaye
 
Emerging Technologies Module.pdf
Emerging Technologies Module.pdfEmerging Technologies Module.pdf
Emerging Technologies Module.pdfwishutubeMelhik
 
Anthropology Chapter 2 PPT.pdf
Anthropology Chapter 2 PPT.pdfAnthropology Chapter 2 PPT.pdf
Anthropology Chapter 2 PPT.pdfShegawAtirsaw
 
Civic Education All Slides.pptx
Civic Education All Slides.pptxCivic Education All Slides.pptx
Civic Education All Slides.pptxmahamedYusuf5
 
moral & civic education.pdf
moral & civic education.pdfmoral & civic education.pdf
moral & civic education.pdfShwaBeli
 
Management Information System
Management Information SystemManagement Information System
Management Information SystemZeinul Haleem
 
Accounting Information System ch 01 ppt1
Accounting Information System ch 01 ppt1Accounting Information System ch 01 ppt1
Accounting Information System ch 01 ppt1Eloise Mae Hersane
 
The impact of information in society
The impact of information in society The impact of information in society
The impact of information in society Abrar Almjaly
 
Lesson 1 - Introduction to Emerging Technologies.pptx
Lesson 1 -  Introduction to Emerging Technologies.pptxLesson 1 -  Introduction to Emerging Technologies.pptx
Lesson 1 - Introduction to Emerging Technologies.pptxRizaJeanMAcanto
 
01 introduction to information technology
01 introduction to information technology01 introduction to information technology
01 introduction to information technologyDinesh Gunathilaka
 
Analog and Digital Computers
Analog and Digital ComputersAnalog and Digital Computers
Analog and Digital Computersiampencilbox
 
Computer fundamental
Computer fundamentalComputer fundamental
Computer fundamentalactacademy
 
Social Anthropology course material - Copy.pptx
Social Anthropology course material - Copy.pptxSocial Anthropology course material - Copy.pptx
Social Anthropology course material - Copy.pptxSebehadinKedir
 

What's hot (20)

History of Ethiopia & the Horn Unit 1 (1).pptx
History of Ethiopia & the Horn Unit 1 (1).pptxHistory of Ethiopia & the Horn Unit 1 (1).pptx
History of Ethiopia & the Horn Unit 1 (1).pptx
 
Introduction to emerging technology
Introduction to emerging technologyIntroduction to emerging technology
Introduction to emerging technology
 
4_5825656793769971825.pptx
4_5825656793769971825.pptx4_5825656793769971825.pptx
4_5825656793769971825.pptx
 
Emerging Technologies Module.pdf
Emerging Technologies Module.pdfEmerging Technologies Module.pdf
Emerging Technologies Module.pdf
 
Anthropology Chapter 2 PPT.pdf
Anthropology Chapter 2 PPT.pdfAnthropology Chapter 2 PPT.pdf
Anthropology Chapter 2 PPT.pdf
 
Civic Education All Slides.pptx
Civic Education All Slides.pptxCivic Education All Slides.pptx
Civic Education All Slides.pptx
 
Information society
Information societyInformation society
Information society
 
moral & civic education.pdf
moral & civic education.pdfmoral & civic education.pdf
moral & civic education.pdf
 
Management Information System
Management Information SystemManagement Information System
Management Information System
 
ArqueologíA Pdf
ArqueologíA PdfArqueologíA Pdf
ArqueologíA Pdf
 
Accounting Information System ch 01 ppt1
Accounting Information System ch 01 ppt1Accounting Information System ch 01 ppt1
Accounting Information System ch 01 ppt1
 
The impact of information in society
The impact of information in society The impact of information in society
The impact of information in society
 
Lesson 1 - Introduction to Emerging Technologies.pptx
Lesson 1 -  Introduction to Emerging Technologies.pptxLesson 1 -  Introduction to Emerging Technologies.pptx
Lesson 1 - Introduction to Emerging Technologies.pptx
 
01 introduction to information technology
01 introduction to information technology01 introduction to information technology
01 introduction to information technology
 
Electronic Resources
Electronic ResourcesElectronic Resources
Electronic Resources
 
Analog and Digital Computers
Analog and Digital ComputersAnalog and Digital Computers
Analog and Digital Computers
 
Information system
Information systemInformation system
Information system
 
Computer fundamental
Computer fundamentalComputer fundamental
Computer fundamental
 
Data, knowledge and information
Data, knowledge and informationData, knowledge and information
Data, knowledge and information
 
Social Anthropology course material - Copy.pptx
Social Anthropology course material - Copy.pptxSocial Anthropology course material - Copy.pptx
Social Anthropology course material - Copy.pptx
 

Similar to ET Ch - 2.pptx

Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfJerichoGerance
 
MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)Krishan Pareek
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEditor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEditor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEditor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEditor IJCATR
 
BI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business businessBI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business businessJawaherAlbaddawi
 
What are the benefits of learning ETL Development and where to start learning...
What are the benefits of learning ETL Development and where to start learning...What are the benefits of learning ETL Development and where to start learning...
What are the benefits of learning ETL Development and where to start learning...kzayra69
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesIJTET Journal
 

Similar to ET Ch - 2.pptx (20)

Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
 
AtomicDBCoreTech_White Papaer
AtomicDBCoreTech_White PapaerAtomicDBCoreTech_White Papaer
AtomicDBCoreTech_White Papaer
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)
 
ch2 DS.pptx
ch2 DS.pptxch2 DS.pptx
ch2 DS.pptx
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
 
ijcatr04081001
ijcatr04081001ijcatr04081001
ijcatr04081001
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
BI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business businessBI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business business
 
What are the benefits of learning ETL Development and where to start learning...
What are the benefits of learning ETL Development and where to start learning...What are the benefits of learning ETL Development and where to start learning...
What are the benefits of learning ETL Development and where to start learning...
 
many-task computing
many-task computingmany-task computing
many-task computing
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024

ET Ch - 2.pptx

  • 1. WOLAITA SODO UNIVERSITY Department of IT Introduction to Emerging Technologies (EMTE1011/1012) By: Mesele G. Feb 2, 2021
  • 2. CH-2: Overview of Data Science What are data and information? • Data • It can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines. • It can be described as unprocessed facts and figures. • Information • It is the processed data on which decisions and actions are based. • It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective action or decision.
  • 3. Overview of Data Science What is data science? • It is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data. • It is much more than simply analyzing data.
  • 4. Overview of Data Science Data Processing Cycle • Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose. • It consists of the following basic steps: • input • processing • output • Input: in this step, the input data is prepared in some convenient form for processing. The form will depend on the processing machine. For example, when electronic computers are used, the input data can be recorded on any one of several types of storage medium, such as a hard disk, CD, or flash disk.
  • 5. Overview of Data Science Data Processing Cycle • Processing: in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on a bank deposit, or a summary of sales for the month can be calculated from the sales orders. • Output: at this stage, the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data. For example, the output data may be a payroll for employees.
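The input, processing, and output steps can be sketched with the bank-interest example from the slide. This is only an illustration: the function names, the customer records, and the 5% rate are hypothetical values chosen for the sketch, not part of the course material.

```python
# A minimal sketch of the input -> processing -> output cycle.
# All names and figures below are hypothetical, for illustration only.

def read_input(raw_records):
    """Input step: put raw text records into a convenient form for processing."""
    prepared = []
    for line in raw_records:
        name, balance = line.split(",")
        prepared.append((name.strip(), float(balance)))
    return prepared


def process(deposits, annual_rate):
    """Processing step: compute one year of simple interest on each deposit."""
    return [(name, balance, balance * annual_rate) for name, balance in deposits]


def write_output(results):
    """Output step: collect the results in a form suited to the reader."""
    return [
        f"{name}: balance={balance:.2f}, interest={interest:.2f}"
        for name, balance, interest in results
    ]


raw = ["Abebe, 1000", "Sara, 2500"]
report = write_output(process(read_input(raw), annual_rate=0.05))
for line in report:
    print(line)
```

Keeping the three steps in separate functions mirrors the cycle: the processing step does not need to know where the input came from or how the output will be presented.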
  • 6. Data types and their representation Data types from a computer programming perspective • Almost all programming languages explicitly include the notion of a data type, though different languages may use different terminology. • Common data types include: • Integers (int): used to store whole numbers, mathematically known as integers • Booleans (bool): used to represent values restricted to one of two options: true or false • Characters (char): used to store a single character • Floating-point numbers (float): used to store real numbers • Alphanumeric strings (string): used to store a combination of characters and numbers.
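As a rough illustration of the types listed above, the sketch below shows how they look in Python. The variable names and values are made up for the example. Note that Python has no separate char type, so a single character is simply a string of length 1; other languages, such as C or Java, do distinguish the two.

```python
# Hypothetical values illustrating the common data types in Python.
count = 42               # int: a whole number
is_enrolled = True       # bool: one of two values, True or False
grade = "A"              # Python has no char type; this is a str of length 1
average = 3.75           # float: a real number (floating-point approximation)
course_code = "EMTE1011" # str: a combination of letters and digits

for value in (count, is_enrolled, grade, average, course_code):
    # type(value).__name__ gives the name of the type Python assigned
    print(repr(value), "is of type", type(value).__name__)
```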
  • 7. Data types and their representation Data types from a data analytics perspective • There are three common data types or structures. 1. Structured data • It adheres to a pre-defined data model and is therefore straightforward to analyze. It conforms to a tabular format with a relationship between the different rows and columns (for example, Excel files or SQL databases). 2. Semi-structured data • It is a form of structured data that does not conform to the formal structure of data models associated with relational databases. It is also known as a self-describing structure. (Examples: JSON and XML.) 3. Unstructured data • It does not have a predefined data model or is not organized in a predefined manner. • It is typically text-heavy but may contain data such as dates, numbers, and facts as well. • Common examples of unstructured data include audio files, video files, or NoSQL databases.
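The difference between the three structures can be seen in how much work it takes to get at a value. The sketch below uses Python's standard csv and json modules; the sample records are hypothetical and exist only to illustrate the contrast.

```python
import csv
import io
import json

# Structured: tabular rows and columns that follow a fixed, pre-defined layout.
structured_text = "id,name,age\n1,Abebe,21\n2,Sara,23\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but no fixed table schema.
# One record may carry fields (such as a list of phones) that others lack.
semi_structured_text = '{"id": 3, "name": "Kebede", "phones": ["0911", "0922"]}'
record = json.loads(semi_structured_text)

# Unstructured: free text with no predefined model. Any dates or numbers
# inside it must be found by further analysis; here we only split into words.
unstructured_text = "Sara called on 2 Feb 2021 and asked about her order."
words = unstructured_text.split()

print(rows[1]["name"])      # a column lookup by name
print(record["phones"][1])  # a nested lookup by key and position
print(words)
```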
  • 8. Data types and their representation Metadata • The last category of data type is metadata. • Metadata is data about data. • It is one of the most important elements for big data analysis and big data solutions. • It provides additional information about a specific set of data. • For example, in a set of photographs, metadata could describe when and where the photos were taken.
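The photograph example can be sketched as follows. The `Photo` class, its field names, and the sample records are hypothetical; the point is only that the metadata (when and where) lets us select photos without looking at the image data itself.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    filename: str  # identifies the data itself (the image file)
    taken_on: str  # metadata: when the photo was taken (ISO date string)
    taken_at: str  # metadata: where the photo was taken


# Hypothetical sample records, for illustration only.
photos = [
    Photo("img_001.jpg", "2021-02-02", "Wolaita Sodo"),
    Photo("img_002.jpg", "2021-02-03", "Addis Ababa"),
    Photo("img_003.jpg", "2021-02-02", "Wolaita Sodo"),
]


def photos_taken(all_photos, place, date):
    """Select photos using only their metadata, never the pixels."""
    return [
        p.filename
        for p in all_photos
        if p.taken_at == place and p.taken_on == date
    ]


matches = photos_taken(photos, "Wolaita Sodo", "2021-02-02")
print(matches)
```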
  • 9. Data Value Chain • The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data. • The Big Data Value Chain identifies the following key high-level activities:
  • 10. Data Value Chain • Data Acquisition • It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out. • Data acquisition is one of the major big data challenges in terms of infrastructure requirements. • Data Analysis • It is concerned with making the acquired raw data amenable to use in decision-making as well as domain-specific usage. • Data analysis involves exploring, transforming, and modeling data.
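On a very small scale, the acquisition and analysis activities can be sketched like this. The sensor-style readings and the `acquire` function are hypothetical; real big data acquisition involves distributed infrastructure that this single-machine sketch does not attempt to show.

```python
from statistics import mean

# Hypothetical raw readings as they might arrive from a source:
# stray whitespace, missing values, and non-numeric markers included.
raw_readings = [" 21.5", "n/a", "22.0 ", "", "19.5"]


def acquire(raw):
    """Gather, clean, and filter readings before they are stored or analyzed."""
    cleaned = []
    for item in raw:
        text = item.strip()              # clean: remove stray whitespace
        try:
            cleaned.append(float(text))  # keep only values that parse as numbers
        except ValueError:
            continue                     # filter: drop unusable entries
    return cleaned


values = acquire(raw_readings)

# Analysis: a simple summary computed on the cleaned data.
average = mean(values)
print(values, average)
```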
  • 11. Data Value Chain • Data Curation • It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage. • Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation. • Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
  • 12. Data Value Chain • Data Storage • It is the management of data in a scalable way that satisfies the needs of applications that require fast access to the data. • Relational Database Management Systems (RDBMS) have been the main, and almost the only, storage solution for nearly 40 years. • However, they become unsuitable for big data scenarios as data volumes and complexity grow. • NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
  • 13. Data Value Chain • Data Usage • It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. • Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
  • 14. What is Big Data? • Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and draw insights from large datasets. • The problem of working with data that exceeds the computing power or storage of a single computer is not new. • Big data is the term for a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. • In this context, a large dataset means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
  • 15. What is Big Data? • The common scale of big datasets is constantly shifting and may vary significantly from organization to organization. • Big data is characterized by the 3 Vs and more: 1. Volume: large amounts of data (zettabytes, massive datasets) 2. Velocity: data is live-streaming or in motion 3. Variety: data comes in many different forms from diverse sources 4. Veracity: can we trust the data? How accurate is it?
  • 16. Clustered Computing and Hadoop Ecosystem Clustered Computing • Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. • To better address the high storage and computational needs of big data, computer clusters are a better fit. • Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits: • Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important. • High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. • Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group.
  • 17. Clustered Computing and Hadoop Ecosystem Clustered Computing • Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes. • Cluster membership and resource allocation can be handled by software like Hadoop's YARN.
  • 18. Clustered Computing and Hadoop Ecosystem Hadoop and its Ecosystem • Hadoop is an open-source framework that allows for the distributed processing of large datasets across clusters of computers. • The four key characteristics of Hadoop are: 1. Economical: Its systems are highly economical, as ordinary computers can be used for data processing. 2. Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure. 3. Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework. 4. Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide to use it later.
  • 19. Clustered Computing and Hadoop Ecosystem Hadoop and its Ecosystem • Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage. It is continuously growing to meet the needs of big data. • It comprises the following components and many others: • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming-based data processing • Spark: In-memory data processing • Pig, Hive: Query-based processing of data services • HBase: NoSQL database • Mahout, Spark MLlib: Machine learning algorithm libraries • Solr, Lucene: Searching and indexing • ZooKeeper: Managing the cluster • Oozie: Job scheduling
  • 20. Clustered Computing and Hadoop Ecosystem Hadoop and its Ecosystem
  • 21. Big Data Life Cycle with Hadoop 1. Ingesting data into the system: • The first stage is Ingest. • The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files. • Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data. 2. Processing the data in storage: • The second stage is Processing. • In this stage, the data is stored and processed. • The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. • Spark and MapReduce perform the data processing.
  • 22. Big Data Life Cycle with Hadoop 3. Computing and analyzing data: • The data is analyzed by processing frameworks such as Pig, Hive, and Impala. • Pig converts the data using map and reduce operations and then analyzes it. • Hive is also based on the map and reduce programming model and is most suitable for structured data. 4. Visualizing the results: • The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search. • In this stage, the analyzed data can be accessed by users.
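The map and reduce programming model mentioned above can be illustrated with a word count, the usual introductory example. The sketch below runs in a single Python process and does not use Hadoop or any of its APIs; the function names and input lines are hypothetical. On a real cluster, the map and reduce steps would run in parallel on many nodes, and the framework would handle the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input lines, for illustration only.
lines = ["big data needs big clusters", "data clusters scale"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model a good fit for clustered computing.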