INTRODUCTION TO EMERGING TECHNOLOGIES
(EMTE1012)
CHAPTER – 2
INTRODUCTION TO DATA SCIENCE
Learning Objectives
 Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.
Data Science
What is Data Science?
 Data science is much more than simply analyzing data.
 Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
 It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.
Data Scientists
 Data scientists possess a strong quantitative background in:
 statistics and linear algebra
 programming knowledge
 data warehousing, mining, and modeling to build and analyze
algorithms.
Data
 Data can be described as unprocessed facts and figures.
 It can exist in any form.
 It is a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Information
 Information is data that has been given meaning by way of relational
connection.
 It is the processed data on which decisions and actions are based.
 Information is interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing
 Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 Data processing consists of three basic steps: input, processing, and output.
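To make the cycle concrete, here is a minimal sketch in Python (the slides name no particular language; Python is used for all examples in this chapter) walking through the three steps:

```python
# The three basic steps of the data processing cycle in one tiny program.
# A hypothetical example for illustration only.

# Input: unprocessed facts and figures, here raw scores typed as text.
raw = input("Enter scores separated by spaces: ")  # e.g. "70 85 90"

# Processing: restructure the raw text into a more useful numeric form.
scores = [int(s) for s in raw.split()]
average = sum(scores) / len(scores)

# Output: information on which a decision or action can be based.
print(f"Average score: {average:.1f}")
```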
Data
 Data can take many forms, including numbers, text, symbols, images, sound, electromagnetic waves, etc.
 These are typically divided into two broad categories:
• Qualitative and
• Quantitative
Quantitative Data
 Quantitative data consist of numeric records.
 Generally, such data are extensive and relate to:
• Physical properties of phenomena (such as length, height, distance, weight, area, volume), and
• Non-physical characteristics of phenomena (such as social class, educational attainment, quality-of-life rankings).
 Such data can be analyzed using a variety of descriptive and inferential statistics and used as inputs to predictive and simulation models.
Qualitative Data
 Qualitative data deals with descriptions: non-numeric records that can be observed and recorded but not measured.
 Such data are typically analyzed by categorizing and interpreting them, often with the help of visualizations.
Data Types
 In computer science and computer programming, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 Almost all programming languages explicitly include the notion of
data type, though different languages may use different terminology.
Common data types include:
 Integers (int): whole numbers
 Booleans (bool): true or false values
 Characters (char): a single character
 Floating-point numbers (float): real numbers
 Alphanumeric strings (string): a combination of characters and numbers
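A minimal Python sketch (an illustration; the slides prescribe no language) showing each of these types as a literal value:

```python
# Each of the data types listed above, written as Python literals.
age = 25                  # integer (int): a whole number
is_enrolled = True        # Boolean (bool): true or false
grade = "A"               # character: Python has no separate char type,
                          # so a one-character string stands in
gpa = 3.75                # floating-point number (float): a real number
student_id = "EMTE1012"   # alphanumeric string (str): letters and digits

# type() reports how the interpreter will treat each value.
for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```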
 From a data analytics point of view, it is important to understand that there are three common data structures: structured, semi-structured, and unstructured.
Structured
 Can be easily organized, stored, and transferred in a defined data model.
 Can be processed, searched, queried, combined, and analyzed.
 Managed using Structured Query Language (SQL).
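As a sketch of what "structured" means in practice, the following uses Python's built-in sqlite3 module as a stand-in relational database; the table and column names are invented for illustration:

```python
# Structured data: rows and columns conforming to a fixed schema,
# created and queried with SQL via Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Abebe", 3.8), (2, "Sara", 3.5)],
)

# Because the schema is defined, the data can be searched and queried.
for name, gpa in conn.execute("SELECT name, gpa FROM students WHERE gpa > 3.6"):
    print(name, gpa)  # Abebe 3.8
conn.close()
```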
Semi-Structured
 A mix of unstructured and structured data.
 Loosely structured data that have no predefined data model/schema and thus cannot be held in a relational database.
 Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 Common examples of semi-structured data are JSON and XML.
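A small Python sketch of semi-structured data using the standard json module; the field values are invented for illustration:

```python
# Semi-structured data: JSON keys act as the tags that separate semantic
# elements and nest records hierarchically, but no schema is enforced.
import json

raw = '{"name": "Abebe", "courses": ["EMTE1012"], "address": {"city": "Addis Ababa"}}'
record = json.loads(raw)          # parse text into nested Python objects
print(record["address"]["city"])  # Addis Ababa

# A second record may drop or add fields freely; nothing forbids it.
other = json.loads('{"name": "Sara", "phone": "0911-000000"}')
print(other.get("courses", "no courses field"))  # no courses field
```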
Unstructured Data
 Information that either does not have a predefined data model or is not organized in a predefined manner.
 A much bigger percentage of all the data in our world is unstructured than structured.
 Cannot be contained in a row-column database and does not have an associated data model.
 Common examples of unstructured data include audio and video files.
 Usually stored in data lakes, NoSQL databases, applications, and data warehouses. (A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.)
Metadata
 Metadata is data about data.
 It is one of the most important elements for Big Data analysis and big data solutions.
 It provides additional information about a specific set of data.
 In a set of photographs, for example, metadata could describe when
and where the photos were taken.
 The metadata then provides fields for dates and locations which, by
themselves, can be considered as structured data.
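A hypothetical sketch of that photo example in Python: the metadata fields are structured and queryable even though the image content is not (file names and fields are made up):

```python
# Metadata sketch: the pixels of each photo are unstructured, but the
# metadata describing them (dates, locations) behaves like structured
# data and can be queried directly.
photos = [
    {"file": "img_001.jpg", "taken": "2023-05-01", "place": "Addis Ababa"},
    {"file": "img_002.jpg", "taken": "2023-06-12", "place": "Bahir Dar"},
]

# Filter on the metadata fields without ever opening an image file.
june = [p["file"] for p in photos if p["taken"].startswith("2023-06")]
print(june)  # ['img_002.jpg']
```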
Data Value Chain
 Describes the process of data creation and reuse.
 Made up of a series of subsystems, each with inputs, transformation processes, and outputs.
 In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.
Data Acquisition
 Data acquisition is the process of gathering, filtering, and cleaning data.
 It is one of the major big data challenges in terms of infrastructure requirements.
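A toy Python sketch of the gather-filter-clean sequence; the sensor records and field names are assumptions, not a real pipeline:

```python
# Acquisition sketch: gather raw records, filter out unusable ones, and
# clean what remains.
raw_readings = [
    {"sensor": "t1", "temp": " 23.4 "},
    {"sensor": "t2", "temp": ""},      # missing reading: filtered out below
    {"sensor": "t3", "temp": "21.9"},
]

cleaned = [
    {"sensor": r["sensor"], "temp": float(r["temp"])}  # clean: text -> number
    for r in raw_readings
    if r["temp"].strip()                               # filter: drop empty readings
]
print(cleaned)  # [{'sensor': 't1', 'temp': 23.4}, {'sensor': 't3', 'temp': 21.9}]
```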
Data Analysis
 Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information.
 Data analysis involves:
 exploring,
 transforming, and
 modeling data.
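The three activities in miniature, using only Python's standard library (statistics.linear_regression requires Python 3.10+; the numbers are made up for illustration):

```python
# Exploring, transforming, and modeling a tiny dataset.
import statistics

hours_studied = [2, 4, 6, 8, 10]
exam_scores = [55, 62, 71, 80, 88]

# Explore: summarize with descriptive statistics.
print("mean score:", statistics.mean(exam_scores))  # 71.2

# Transform: nothing to reshape here, but real data usually needs it first.

# Model: fit a simple linear model, score ~ hours studied.
slope, intercept = statistics.linear_regression(hours_studied, exam_scores)
print(f"predicted score after 7 hours: {slope * 7 + intercept:.1f}")
```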
Data Curation
 Data Curation is the active management of data over its life cycle to
ensure that it meets the necessary data quality requirements for its
effective usage.
 Data curation is the organization and integration of data collected from
various sources.
 Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
 It involves annotation, publication, and presentation of the data such that its value is maintained over time and it remains available for reuse and preservation.
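A minimal Python sketch of curation as validate-then-annotate; the quality rules and field names are assumptions for illustration, not a standard:

```python
# Curation sketch: validate records against simple quality rules and
# annotate them with provenance so they remain usable over time.
from datetime import date

def curate(record: dict, source: str) -> dict | None:
    # Validation: reject records failing basic quality checks.
    if not record.get("name") or not 0 <= record.get("gpa", -1) <= 4.0:
        return None
    # Annotation: attach provenance metadata for later reuse.
    return {**record, "source": source, "curated_on": date.today().isoformat()}

print(curate({"name": "Abebe", "gpa": 3.8}, "registrar_export"))
print(curate({"name": "", "gpa": 3.0}, "registrar_export"))  # None (rejected)
```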
Data Storage
 Data Storage is the persistence and management of data.
 Relational Database Management Systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
 NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
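As a sketch of one such alternative model, here is the key-value pattern reduced to a Python dict; real key-value stores such as Redis add persistence, replication, and scaling on top of this access pattern:

```python
# The simplest NoSQL data model, a key-value store, reduced to its core
# access pattern. A dict only shows the idea, not a real storage system.
store = {}

store["user:1"] = {"name": "Abebe", "gpa": 3.8}  # put: any value shape
store["page:home"] = "<html>...</html>"          # values share no schema

print(store.get("user:1"))   # get by key is the primary query path
print(store.get("user:99"))  # None: absent keys are simply missing
```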
Data Usage
 The process of exploring data (browsing and lookup) and of exploratory search (finding correlations, making comparisons, running what-if scenarios, etc.).
Big Data
 Big Data is not simply a large amount of data.
 Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
 Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
 The Big Data definition is based on the three Vs:
 Volume: the size of the data (how big it is)
 Velocity: how fast the data is being generated
 Variety: the variation of data types, including source, format, and structure (data can be unstructured, semi-structured, or structured)
Sources of Big Data
 The data explosion is driven by new technologies that generate and collect vast amounts of data.
 These sources include:
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world
Clustered Computing
 Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
 High Availability: Clusters can provide
varying levels of fault tolerance and
availability guarantees to prevent hardware
or software failures from affecting access to
data and processing.
 Easy Scalability: Clusters make it easy to
scale horizontally by adding additional
machines to the group.
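The pooling idea in miniature, on one machine: Python's standard library can spread independent work across CPU cores, and a cluster extends the same pattern across machines. This is an analogy only, not Hadoop code:

```python
# Resource pooling in miniature: spread independent work units across a
# pool of local CPU cores. A cluster applies the same pattern across
# many machines instead of cores.
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk: str) -> int:
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data is big", "clusters pool resources", "scale out not up"]
    with ProcessPoolExecutor() as pool:  # one worker process per core by default
        counts = list(pool.map(count_words, chunks))
    print(sum(counts))  # partial results combined after parallel work: 11
```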
 Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
 Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another
Resource Negotiator).
 YARN allows the data stored in HDFS (the Hadoop Distributed File System) to be processed by various data processing engines.
Hadoop
 Open-source software from the Apache Software Foundation to store and process large non-relational data.
 It is a scalable and fault-tolerant system
for processing large datasets across a
cluster of commodity servers.
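Hadoop itself is written in Java, but its Streaming interface lets any executable act as the mapper or reducer, so the classic word-count job can be sketched in Python. This is a generic illustration, not code from these slides; in practice the two functions live in separate scripts passed to the hadoop-streaming JAR:

```python
# Word count for Hadoop Streaming. Hadoop pipes HDFS data through these
# as two separate executables; both are shown in one listing for brevity.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")
```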
Four characteristics of Hadoop
 Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
THANK YOU
