INTRODUCTION TO EMERGING TECHNOLOGIES
(EMTE1012)
CHAPTER – 2
INTRODUCTION TO DATA SCIENCE
Learning Objectives
 Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.
Data Science
What is Data Science?
 Data science is much more than simply analyzing data.
 Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
 It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.
Data Scientists
 Data scientists possess a strong quantitative background in:
 statistics and linear algebra
 programming knowledge
 data warehousing, mining, and modeling to build and analyze
algorithms.
Data
 Data can be described as unprocessed facts and figures.
 It can exist in any form.
 It is a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Information
 Information is data that has been given meaning by way of relational
connection.
 It is the processed data on which decisions and actions are based.
 Information is interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing
 Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 Data processing consists of three basic steps: input, processing, and output.
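To make the cycle concrete, here is a minimal sketch in Python (the slides name no particular language; Python is used for all examples in this chapter) walking through the three steps:

```python
# The three basic steps of the data processing cycle in one tiny program.
# A hypothetical example for illustration only.

# Input: unprocessed facts and figures, here raw scores typed as text.
raw = input("Enter scores separated by spaces: ")  # e.g. "70 85 90"

# Processing: restructure the raw text into a more useful numeric form.
scores = [int(s) for s in raw.split()]
average = sum(scores) / len(scores)

# Output: information on which a decision or action can be based.
print(f"Average score: {average:.1f}")
```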
Data
 Data can take many forms, including numbers, text, symbols, images, sound, electromagnetic waves, etc.
 These are typically divided into two broad categories:
• Qualitative and
• Quantitative
Quantitative Data
 Quantitative data consist of numeric records.
 Generally, such data are extensive and relate to:
• Physical properties of phenomena (such as length, height, distance, weight, area, volume), and
• Non-physical characteristics of phenomena (such as social class, educational attainment, quality-of-life rankings).
 Such data can be analyzed using a variety of descriptive and inferential statistics and used as inputs to predictive and simulation models.
Qualitative Data
 Qualitative data deals with descriptions: non-numeric records that can be observed and recorded but not measured.
 Such data are typically analyzed by categorizing and interpreting them, often with the help of visualizations.
Data Types
 In computer science and computer programming, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 Almost all programming languages explicitly include the notion of
data type, though different languages may use different terminology.
Common data types include:
 Integers (int): whole numbers
 Booleans (bool): true or false values
 Characters (char): a single character
 Floating-point numbers (float): real numbers
 Alphanumeric strings (string): a combination of characters and numbers
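A minimal Python sketch (an illustration; the slides prescribe no language) showing each of these types as a literal value:

```python
# Each of the data types listed above, written as Python literals.
age = 25                  # integer (int): a whole number
is_enrolled = True        # Boolean (bool): true or false
grade = "A"               # character: Python has no separate char type,
                          # so a one-character string stands in
gpa = 3.75                # floating-point number (float): a real number
student_id = "EMTE1012"   # alphanumeric string (str): letters and digits

# type() reports how the interpreter will treat each value.
for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```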
 From a data analytics point of view, it is important to understand that there are three common data structures: structured, semi-structured, and unstructured.
Structured
 Can be easily organized, stored, and transferred in a defined data model.
 Can be processed, searched, queried, combined, and analyzed.
 Managed using Structured Query Language (SQL).
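As a sketch of what "structured" means in practice, the following uses Python's built-in sqlite3 module as a stand-in relational database; the table and column names are invented for illustration:

```python
# Structured data: rows and columns conforming to a fixed schema,
# created and queried with SQL via Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Abebe", 3.8), (2, "Sara", 3.5)],
)

# Because the schema is defined, the data can be searched and queried.
for name, gpa in conn.execute("SELECT name, gpa FROM students WHERE gpa > 3.6"):
    print(name, gpa)  # Abebe 3.8
conn.close()
```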
Semi-Structured
 A mix of unstructured and structured data.
 Loosely structured data that have no predefined data model/schema and thus cannot be held in a relational database.
 Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 Common examples of semi-structured data are JSON and XML.
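A small Python sketch of semi-structured data using the standard json module; the field values are invented for illustration:

```python
# Semi-structured data: JSON keys act as the tags that separate semantic
# elements and nest records hierarchically, but no schema is enforced.
import json

raw = '{"name": "Abebe", "courses": ["EMTE1012"], "address": {"city": "Addis Ababa"}}'
record = json.loads(raw)          # parse text into nested Python objects
print(record["address"]["city"])  # Addis Ababa

# A second record may drop or add fields freely; nothing forbids it.
other = json.loads('{"name": "Sara", "phone": "0911-000000"}')
print(other.get("courses", "no courses field"))  # no courses field
```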
Unstructured Data
 Information that either does not have a predefined data model or is not organized in a predefined manner.
 A much bigger percentage of all the data in our world is unstructured than structured.
 Cannot be contained in a row-column database and does not have an associated data model.
 Common examples of unstructured data include audio and video files.
 Usually stored in data lakes, NoSQL databases, applications, and data warehouses. (A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.)
Metadata
 Metadata is data about data.
 It is one of the most important elements for Big Data analysis and big data solutions.
 It provides additional information about a specific set of data.
 In a set of photographs, for example, metadata could describe when
and where the photos were taken.
 The metadata then provides fields for dates and locations which, by
themselves, can be considered as structured data.
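A hypothetical sketch of that photo example in Python: the metadata fields are structured and queryable even though the image content is not (file names and fields are made up):

```python
# Metadata sketch: the pixels of each photo are unstructured, but the
# metadata describing them (dates, locations) behaves like structured
# data and can be queried directly.
photos = [
    {"file": "img_001.jpg", "taken": "2023-05-01", "place": "Addis Ababa"},
    {"file": "img_002.jpg", "taken": "2023-06-12", "place": "Bahir Dar"},
]

# Filter on the metadata fields without ever opening an image file.
june = [p["file"] for p in photos if p["taken"].startswith("2023-06")]
print(june)  # ['img_002.jpg']
```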
Data Value Chain
 Describes the process of data creation and reuse.
 Made up of a series of subsystems, each with inputs, transformation processes, and outputs.
 In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.
Data Acquisition
 Data acquisition is the process of gathering, filtering, and cleaning data.
 It is one of the major big data challenges in terms of infrastructure requirements.
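A toy Python sketch of the gather-filter-clean sequence; the sensor records and field names are assumptions, not a real pipeline:

```python
# Acquisition sketch: gather raw records, filter out unusable ones, and
# clean what remains.
raw_readings = [
    {"sensor": "t1", "temp": " 23.4 "},
    {"sensor": "t2", "temp": ""},      # missing reading: filtered out below
    {"sensor": "t3", "temp": "21.9"},
]

cleaned = [
    {"sensor": r["sensor"], "temp": float(r["temp"])}  # clean: text -> number
    for r in raw_readings
    if r["temp"].strip()                               # filter: drop empty readings
]
print(cleaned)  # [{'sensor': 't1', 'temp': 23.4}, {'sensor': 't3', 'temp': 21.9}]
```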
Data Analysis
 Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information.
 Data analysis involves:
 exploring,
 transforming, and
 modeling data.
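The three activities in miniature, using only Python's standard library (statistics.linear_regression requires Python 3.10+; the numbers are made up for illustration):

```python
# Exploring, transforming, and modeling a tiny dataset.
import statistics

hours_studied = [2, 4, 6, 8, 10]
exam_scores = [55, 62, 71, 80, 88]

# Explore: summarize with descriptive statistics.
print("mean score:", statistics.mean(exam_scores))  # 71.2

# Transform: nothing to reshape here, but real data usually needs it first.

# Model: fit a simple linear model, score ~ hours studied.
slope, intercept = statistics.linear_regression(hours_studied, exam_scores)
print(f"predicted score after 7 hours: {slope * 7 + intercept:.1f}")
```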
Data Curation
 Data Curation is the active management of data over its life cycle to
ensure that it meets the necessary data quality requirements for its
effective usage.
 Data curation is the organization and integration of data collected from
various sources.
 Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
 It involves annotation, publication, and presentation of the data such that its value is maintained over time and it remains available for reuse and preservation.
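A minimal Python sketch of curation as validate-then-annotate; the quality rules and field names are assumptions for illustration, not a standard:

```python
# Curation sketch: validate records against simple quality rules and
# annotate them with provenance so they remain usable over time.
from datetime import date

def curate(record: dict, source: str) -> dict | None:
    # Validation: reject records failing basic quality checks.
    if not record.get("name") or not 0 <= record.get("gpa", -1) <= 4.0:
        return None
    # Annotation: attach provenance metadata for later reuse.
    return {**record, "source": source, "curated_on": date.today().isoformat()}

print(curate({"name": "Abebe", "gpa": 3.8}, "registrar_export"))
print(curate({"name": "", "gpa": 3.0}, "registrar_export"))  # None (rejected)
```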
Data Storage
 Data Storage is the persistence and management of data.
 Relational Database Management Systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
 NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
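As a sketch of one such alternative model, here is the key-value pattern reduced to a Python dict; real key-value stores such as Redis add persistence, replication, and scaling on top of this access pattern:

```python
# The simplest NoSQL data model, a key-value store, reduced to its core
# access pattern. A dict only shows the idea, not a real storage system.
store = {}

store["user:1"] = {"name": "Abebe", "gpa": 3.8}  # put: any value shape
store["page:home"] = "<html>...</html>"          # values share no schema

print(store.get("user:1"))   # get by key is the primary query path
print(store.get("user:99"))  # None: absent keys are simply missing
```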
Data Usage
 The process of exploring data (browsing and lookup) and of exploratory search (finding correlations, making comparisons, running what-if scenarios, etc.).
Big Data
 Big Data is not simply a large amount of data.
 Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
 Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
 The Big Data definition is based on the three Vs:
 Volume: the size of the data (how big it is)
 Velocity: how fast the data is being generated
 Variety: the variation of data types, including source, format, and structure (data can be unstructured, semi-structured, or structured)
Sources of Big Data
 The data explosion is driven by new technologies that generate and collect vast amounts of data.
 These sources include:
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world
Clustered Computing
 Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
 High Availability: Clusters can provide
varying levels of fault tolerance and
availability guarantees to prevent hardware
or software failures from affecting access to
data and processing.
 Easy Scalability: Clusters make it easy to
scale horizontally by adding additional
machines to the group.
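The pooling idea in miniature, on one machine: Python's standard library can spread independent work across CPU cores, and a cluster extends the same pattern across machines. This is an analogy only, not Hadoop code:

```python
# Resource pooling in miniature: spread independent work units across a
# pool of local CPU cores. A cluster applies the same pattern across
# many machines instead of cores.
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk: str) -> int:
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data is big", "clusters pool resources", "scale out not up"]
    with ProcessPoolExecutor() as pool:  # one worker process per core by default
        counts = list(pool.map(count_words, chunks))
    print(sum(counts))  # partial results combined after parallel work: 11
```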
 Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
 Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another
Resource Negotiator).
 YARN allows the data stored in HDFS (the Hadoop Distributed File System) to be processed by various data processing engines.
Hadoop
 Open-source software from the Apache Software Foundation to store and process large non-relational data.
 It is a scalable and fault-tolerant system
for processing large datasets across a
cluster of commodity servers.
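Hadoop itself is written in Java, but its Streaming interface lets any executable act as the mapper or reducer, so the classic word-count job can be sketched in Python. This is a generic illustration, not code from these slides; in practice the two functions live in separate scripts passed to the hadoop-streaming JAR:

```python
# Word count for Hadoop Streaming. Hadoop pipes HDFS data through these
# as two separate executables; both are shown in one listing for brevity.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")
```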
Four characteristics of Hadoop
 Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
THANK YOU
