CHAPTER 2
Data
Introduction
 Data science is now one of the most influential topics around.
 Companies and enterprises are investing heavily in attracting data science talent, creating ever more viable roles in the data science industry.
 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Example: the data involved in buying a box of cereal from a store or supermarket.
Data Science vs Data Scientist
 Data Science is defined as the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis analysis.
 It is the process of producing, or helping to produce, a tool, method, or other product that derives intelligence from datasets that are too large to handle by conventional means.
Data Science vs Data Scientist
 A data scientist (a job title) is a person engaged in systematic activity to acquire knowledge from data.
 In a more restricted sense, a data scientist may refer to an individual who applies the scientific method to existing data.
 Data scientists perform research toward a more comprehensive understanding of products, systems, or nature, including the physical, mathematical, and social realms (environments or spheres).
Role of a Data Scientist
 Apply advanced skills in analyzing large amounts of data, data mining, and programming.
 Processed and filtered data are handed to them and then fed into analytics programs and machine learning models with statistical methods, generating results that are later used in predictive analysis and other fields.
 Explore the data for cryptic (hidden) patterns to procure (obtain) proper insights.
Data Science
 The scientific method requires data in order to iterate towards a more convincing hypothesis.
 Science doesn’t exist without data.
 Data scientists possess a strong quantitative background in statistics and linear algebra, along with programming knowledge focused on data warehousing, mining, and modeling to build and analyze algorithms.
Data vs. Information
 Data
 Can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as letters of the alphabet (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Information
 The processed data on which decisions and actions are based
 Information is interpreted data; created from organized, structured,
and processed data in a particular context
Data Processing Cycle
 Data processing is the conversion of raw data to meaningful
information through a process.
 Data is manipulated to produce results that lead to a resolution
of a problem or improvement of an existing situation.
 The process includes activities like data entry/input,
calculation/process, output and storage
 Input is the task where verified data is coded or converted into
machine readable form so that it can be processed through a
computer. Data entry is done through the use of a keyboard,
digitizer, scanner, or data entry from an existing source.
Data Processing Cycle
 Processing is the stage where the data is subjected to various means and methods of manipulation; it is the point where a computer program is being executed, and that program (its code and current activity) operates on the data.
 Output and interpretation is the stage where processed
information is now transmitted to the user. Output is presented
to users in various report formats like printed report, audio,
video, or on monitor.
 Storage is the last stage in the data processing cycle, where
data, instruction and information are held for future use. The
importance of this cycle is that it allows quick access and
retrieval of the processed information, allowing it to be passed
on to the next stage directly, when needed.
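The cycle can be illustrated with a minimal Python sketch; the record, the calculation, and the output file name are made up for the example.

import json

# Input: raw data is captured and converted into machine-readable form.
raw_record = "cereal,4.99,2"                      # e.g. typed in or scanned at a checkout
name, price, qty = raw_record.split(",")          # parse the text into fields

# Processing: the data is manipulated to produce a meaningful result.
total = float(price) * int(qty)                   # compute the purchase total

# Output: the processed information is presented to the user.
print(f"{name}: {qty} units, total {total:.2f}")

# Storage: the result is kept so it can be retrieved and reused later.
with open("purchases.json", "w") as f:            # hypothetical output file
    json.dump({"item": name, "total": total}, f)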
Data types
 A data type is a way to tell the compiler which kind of data (integer, character, float, etc.) is to be stored and, consequently, how much memory to allocate.
 A data type is also a way to tell the compiler that a given cell x in memory may only hold a bit value within some range y; it restricts the compiler from storing anything outside that value range.
 Common data types include:
 Integers (int): used to store whole numbers, mathematically known as integers
 Booleans (bool): used to represent a value restricted to one of two options: true or false
 Characters (char): used to store a single character
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and numbers
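For illustration only, here is how the common types above look in Python; note that Python infers types at run time and has no separate char type, so this is an analogy to the compiled-language description rather than an exact match.

count = 42                  # int: whole number
is_valid = True             # bool: restricted to True or False
initial = "a"               # a one-character string stands in for a char
price = 19.99               # float: real (floating-point) number
label = "Item42"            # str: alphanumeric string of characters and digits

for value in (count, is_valid, initial, price, label):
    print(type(value).__name__, value)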
Data representation
 Types are an abstraction that lets us model things in categories; a type is largely a mental construct.
 All computers represent data as nothing more than strings of ones and zeros.
 In order for those ones and zeros to convey any meaning, they need to be contextualized.
 Data types provide that context.
 E.g. 01100001
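To make the example concrete: the same bit pattern 01100001 is the unsigned integer 97 when read as a number, and the letter 'a' when read as an ASCII character. A quick check in Python:

bits = "01100001"

as_int = int(bits, 2)        # interpret the bits as an unsigned integer
as_char = chr(as_int)        # interpret the same bits as an ASCII character

print(as_int)    # 97
print(as_char)   # 'a'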
Data types from Data Analytics perspective
 Data analytics (DA) is the process of examining data sets to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.
 From a data analytics point of view, it is important to understand that there are three common types of data types or structures:
 Structured (clearly defined and represented)
 Unstructured (often qualitative, and often difficult to organize and load into a database)
 Semi-structured (includes characteristics of both)
Structured Data
 Structured data is data that adheres to a pre-defined
data model and is therefore straightforward to
analyze.
 Structured data covers all data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can be easily mapped into pre-designed fields.
 Structured data is highly organized information that loads neatly into a relational database.
 Structured data is relatively simple to enter, store,
query, and analyze, but it must be strictly defined in
terms of field name and type.
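A minimal sketch of structured data using Python's built-in sqlite3 module; the table, columns, and rows below are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")                 # throw-away in-memory database
cur = conn.cursor()

# Structured data: fields are strictly defined by name and type up front.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Abebe", "Addis Ababa"), (2, "Sara", "Bahir Dar")],
)

# Because the structure is predefined, querying is straightforward.
for row in cur.execute("SELECT name, city FROM customers ORDER BY id"):
    print(row)

conn.close()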
Unstructured Data
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined
manner.
 Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database.
 Most business interactions, in fact, are unstructured in nature.
 Today more than 80% of the data generated is unstructured.
 The fundamental challenge of unstructured data sources is that
they are difficult for nontechnical business users and data
analysts alike to unbox, understand, and prepare for analytic
use.
Semi structured Data
 Semi-structured data is a form of structured data that does not
conform with the formal structure of data models associated
with relational databases or other forms of data tables, but
nonetheless, contains tags or other markers to separate
semantic elements and enforce hierarchies of records and
fields within the data.
 Semi-structured data is information that doesn’t reside in a
relational database but that does have some organizational
properties that make it easier to analyze.
 Examples of semi-structured data: CSV (comma-separated values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation) documents are semi-structured documents, and NoSQL databases are considered semi-structured.
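A small sketch of a semi-structured JSON document (the records are invented): tags name the semantic elements and the nesting gives hierarchy, but there is no fixed relational schema, and the second record simply omits a field.

import json

# Tags ("name", "orders", ...) separate semantic elements; nesting enforces hierarchy,
# but there is no fixed table schema: the second record has no "email" field at all.
document = """
[
  {"name": "Abebe", "email": "abebe@example.com", "orders": [{"item": "cereal", "qty": 2}]},
  {"name": "Sara", "orders": []}
]
"""

records = json.loads(document)
for record in records:
    print(record["name"], len(record["orders"]), "order(s)")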
Metadata – Data about Data
 Metadata is data about data: data that describes other data.
 It provides additional information about a specific set of data.
 Metadata summarizes basic information about data, which can
make finding and working with particular instances of data
easier.
 For example, author, date created, date modified, and file size are examples of very basic document metadata.
 Having the ability to filter through that metadata makes it much
easier for someone to locate a specific document.
 In the context of databases, metadata would be information on tables, views, columns, arguments, etc.
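As a small illustration, basic file metadata such as size and modification date can be read in Python with os.stat; the file name below is hypothetical.

import os
import datetime

path = "report.docx"                       # hypothetical document

if os.path.exists(path):
    info = os.stat(path)                   # metadata describes the file, not its contents
    print("size (bytes):", info.st_size)
    print("last modified:", datetime.datetime.fromtimestamp(info.st_mtime))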
Data value chain
 The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data.
 Its main steps are data acquisition, data analysis (turning data into useful representations such as charts, maps, etc.), data curation or grouping (the organization and integration of data collected from various sources), data storage, and data usage.
 Data acquisition is the process of digitizing data from the world around us so it can be displayed, analyzed, and stored in a computer. It is the process of bringing data that has been created by a source outside the organization into the organization, for production use.
Data value chain
 Data analysis is the process of evaluating
data using analytical and logical reasoning
to examine each component of the data
provided. Data from various sources is
gathered, reviewed, and then analyzed to
form some sort of finding or conclusion.
 Data analytics is the process of finding information in data in order to make a decision and subsequently act on it.
Data value chain
 Data curation is about managing data throughout
its lifecycle. Collecting, organizing, cleaning and
much more are included in data curation. Data
curators manage the data through various stages and
make the data usable for data analysts and scientists.
 Data storage is defined as a way of keeping information in storage media for use by a computer. An example of data storage is a folder for storing Microsoft Word documents.
 Data usage is the amount of data (things like
images, movies, photos, videos, and other files) that
you send, receive, download and/or upload.
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
Big data
 Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.
 In other words, data in the range of hundreds of terabytes (TB) or petabytes (PB) falls under Big Data; 1 TB = 1000 GB and 1 PB = 1000 TB.
 But it is not just about the amount of data; what matters is what organizations do with the data.
 Big Data is analyzed for insights that lead to better
decisions.
Big data
 Big Data is associated with the concept of the 3 Vs: volume, velocity, and variety. Big data is characterized by the 3 Vs and more:
 Volume: large amounts of data, up to zettabytes/massive datasets.
1 ZB = 1 trillion GB (10^21 bytes, i.e. a 1 followed by 21 zeros).
 Velocity: data is live, streaming, or in motion.
 Variety: data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it? How reliable is its source? etc.
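The unit relationships above can be verified with a few lines of arithmetic (decimal units assumed):

# Decimal storage units, as used in the definitions above.
GB = 10**9        # 1 gigabyte  = 10^9 bytes
TB = 10**12       # 1 terabyte  = 1000 GB
PB = 10**15       # 1 petabyte  = 1000 TB
ZB = 10**21       # 1 zettabyte = 10^21 bytes

print(TB // GB)   # 1000            -> 1 TB = 1000 GB
print(PB // TB)   # 1000            -> 1 PB = 1000 TB
print(ZB // GB)   # 1000000000000   -> 1 ZB = one trillion GB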
Clustered Computing
 Cluster computing addresses the latest results in the fields that support High Performance Distributed Computing (HPDC): computers working together over a network as a single entity to achieve a common goal.
 Clustering approaches include HPC (high performance computing), IaaS (infrastructure as a service), and PaaS (platform as a service), which are more expensive and more difficult to set up and maintain than a single computer.
 In HPDC (High Performance Distributed Computing) environments, parallel and/or distributed computing techniques are applied to the solution of computationally intensive applications across networks of computers.
Clustered Computing
 “Computer cluster” basically refers to a set of connected computers working together.
 The cluster represents one system and the
objective is to improve performance.
 The computers are generally connected in a LAN
(Local Area Network).
 So, when this cluster of computers works to
perform some tasks and gives an impression of
only a single entity, it is called “cluster
computing”.
Clustered Computing
 Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
 Resource Pooling: (grouping resources together)
 Combining the available storage space to hold data is a clear benefit, but
CPU and memory pooling are also extremely important. Processing large
datasets requires large amounts of all three of these resources.
 Object pooling is a technique for keeping a group of objects (called the pool) in memory so they can be reused.
 Whenever a new object needs to be created, the pool is checked first; if an available object exists, it is reused. This provides reusability of objects and system resources and improves the scalability of the program. (A minimal sketch follows after this list.)
 It is similar to cloud resource pooling, where resources such as storage, memory, processing, and bandwidth are pooled.
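A minimal object-pool sketch in Python; the Connection class is a stand-in for any object that is expensive to create, and the pool size is arbitrary.

class Connection:
    """Stand-in for an object that is expensive to create."""
    def __init__(self, conn_id):
        self.conn_id = conn_id

class ObjectPool:
    def __init__(self, size):
        # Pre-create the objects once and keep them in the pool (the "pool storage").
        self._free = [Connection(i) for i in range(size)]

    def acquire(self):
        # Reuse an available object instead of constructing a new one.
        return self._free.pop() if self._free else None

    def release(self, obj):
        # Return the object to the pool so it can be reused later.
        self._free.append(obj)

pool = ObjectPool(size=2)
conn = pool.acquire()        # taken from the pool, not newly constructed
print(conn.conn_id)
pool.release(conn)           # back in the pool for the next caller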
Clustered Computing
 High Availability: In computing, the term availability is used to
describe the period of time when a service is available, as well as the
time required by a system to respond to a request made by a user.
High availability is a quality of a system or component that assures a
high level of operational performance for a given period of time.
 Clusters can provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting
access to data and processing. This becomes increasingly important as real-time analytics receives ever more emphasis.
 Easy Scalability(expanding by adding resources to the system):
Clusters make it easy to scale horizontally by adding additional
machines to the group. This means the system can react to changes in
resource requirements without expanding the physical resources on a
machine.
Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make interaction with big data easier. It is a framework that allows the distributed processing of large datasets across clusters of computers using simple programming models. (It stores and processes large datasets ranging from gigabytes to petabytes of data, and its ecosystem includes multiple components that support big data processing.)
 The four key characteristics of Hadoop (software library) are:
 Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: It is flexible and you can store as much structured and
unstructured data as you need to and decide to use them later.
 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
 It is continuously growing to meet the needs of Big Data.
 It comprises the following components and many others:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based Data Processing
 Spark: In-Memory data processing
Hadoop and its Ecosystem
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL Database (not relational)
○ Mahout, Spark MLlib: Machine Learning algorithm libraries
○ Solr, Lucene: Searching and Indexing
○ ZooKeeper: Managing the cluster
○ Oozie: Job Scheduling
Big Data Life Cycle with Hadoop
Ingesting data into the system
■The first stage of Big Data processing is Ingest. The data is ingested or transferred into Hadoop from various sources such as relational databases, systems, or local files. Sqoop (a transfer tool) transfers data from an RDBMS to HDFS (Hadoop Distributed File System), whereas Flume (another tool) transfers event data (information about changes that occur in the data at a specific time).
Processing the data in storage
■The second stage is Processing. In this stage, the data is stored and processed. The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
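A minimal PySpark sketch of this processing stage, assuming Spark is installed and that a text file already exists at the hypothetical HDFS path shown; the flatMap/map/reduceByKey chain is the classic MapReduce-style word count.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical HDFS path; replace with a real location in your cluster.
lines = sc.textFile("hdfs:///user/demo/events.txt")

counts = (lines.flatMap(lambda line: line.split())     # map: split lines into words
               .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()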
Big Data Life Cycle with Hadoop
Computing and analyzing data :
• The third stage is to Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala. Pig converts the data using map and reduce and then analyzes it. Hive is also based on map and reduce programming and is most suitable for structured data.
Big Data Life Cycle
Determines who has authority and control over data and how these data assets may be used.
End of chapter 2
THANKS
