6. Course Outcomes:
CO1: Describe the fundamentals of Big Data analytics.
CO2: Extrapolate the Hadoop framework and the Hadoop Distributed File System.
CO3: Explore the concepts of NoSQL using MongoDB and Cassandra for Big Data.
CO4: Demonstrate the MapReduce programming model for big data processing along with Hadoop tools.
CO5: Explore Machine Learning algorithms for real-world big data, along with analytics of web content and social networks with relevant visualization tools.
8. In today’s discussion…
Introduction to data
Types of data
Current trend
Data and Big data
Big data vs. small data
Tools and techniques
9. Introduction to data
Example:
10, 25, …, Bantakal, 18CS72, sode-edu.in
Anything else?
Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
10. Types of Data
Structured
Unstructured
Semi-structured
11. Structured data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
12. Unstructured
Any data with an unknown form or structure is classified as unstructured data.
Example:
13. Semi-structured
Semi-structured data can contain both forms of data (structured and unstructured).
Examples of semi-structured data:
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
15. How large is your data?
What is the maximum file size you have dealt with so far?
Movies/files/streaming video that you have used?
What is the maximum download speed you get to retrieve data stored in distant locations?
How fast is your computation?
How much time does it take just to transfer the data, process it and get the result?
17. Sources of data
“Every day, we create 2.5 quintillion bytes of data: so much that 90% of the data in the world today has been created in the last two years alone. This explosion of information is known as ‘Big Data’.”
The data come from several sources:
… to name a few!
19. Now data is Big data!
No single standard definition!
‘Big-data’ is similar to ‘Small-data’, but bigger
…but having bigger data consequently requires different approaches: techniques, tools and architectures
…to solve new problems
…and, of course, in a better way
Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms,
and analytics to manage it and extract value and hidden knowledge from it…
20. Characteristics of Big data: 5V’s
5 V's of Big Data:
Volume
Velocity
Variety
Veracity
Value
21. Volume (the scale):
Volume in Big Data represents the amount of data. In today's world data is processed in various formats such as Word, Excel and PDF documents, and sometimes as audio and video. These data can be in structured, unstructured or semi-structured format. Recent social media platforms produce a tremendous amount of data, which is difficult for an organization to handle. To handle this huge amount of data, organizations should implement modern business intelligence tools that capture the data in an effective form and are cost-efficient for the organization.
22. Velocity (the speed):
Velocity refers to the rate/speed at which data is generated. This is primarily due to the Internet of Things (IoT), mobile data, social media and other factors. At least 3.8 million searches per minute, 228 million per hour, 5.6 billion per day and about 2 trillion per year are now being conducted.
23. Variety (data type):
Big Data can be structured, unstructured or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Quasi-structured data: textual data with inconsistent formats that can be formatted with effort, time and some tools.
24. Veracity:
Veracity is the degree of trustworthiness of the data, i.e., how reliable the data is.
There are many ways to filter or translate the data; veracity concerns being able to handle and manage such data efficiently, which is essential in business development.
For example, Facebook posts with hashtags.
25. Value:
Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process and analyze.
27. Big data vs. small data
Big data is more real-time in nature than traditional applications.
Big data architecture:
Traditional architectures (e.g., Exadata, Teradata) are not well suited to big data applications.
Massively parallel processing, scale-out architectures are well suited to big data applications.
28. Challenges ahead…
The bottleneck is in technology: new architectures, algorithms and techniques are needed.
It is also in technical skills: experts are needed to use the new technology and deal with Big Data.
Who are the major players in the world of Big Data?
34. Big data knows everything!
35. Job Opportunities in Big Data
Data Analysts
analyze and interpret data, visualize it, and build reports to help make better business decisions.
Data Scientists
mine data by assessing data sources and use algorithms and Machine Learning techniques.
Data Architects
design database systems and tools.
Database Managers
control database system performance, perform troubleshooting, and upgrade hardware and software.
Big Data Engineers
design, maintain, and support Big Data solutions.
37. Questions of the day…
1. What are the smallest and largest units for measuring the size of data?
2. How big is a quintillion?
3. Give examples of the smallest and the largest entities of data.
4. Give FIVE parameters with which data can be categorized as (i) simple, (ii) moderately complex and (iii) complex.
38. Questions of the day…
5. What types of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
39. Big data types (by IBM)
Social Networks and web data
Transactions data and Business Process data
Customer master data
Machine generated data
Human generated data
40. Big Data Classification: based on characteristics for designing data architecture for processing and analytics
42. Scalability and Parallel Processing
Scalability is the capability to handle growing amounts of data and a growing number of database clients, either by adding more hardware resources or by optimization and more efficient usage of the existing resources.
Scalability enables an increase or decrease in the capacity of data storage, processing and analytics.
In short, you need to build scalability into the hardware architecture and database selection, and can (for the most part) maximize performance later, during the database design and deployment phase.
System capability needs to grow with increasing workloads; when the workload and complexity exceed the system's capacity, scale it up or scale it out.
43. Scalability Options
Assuming you need to scale your system, there are two options:
scaling up
scaling out
44. Scale UP
Resources such as CPU, network, and storage are
common targets for scaling up.
The goal is to increase the resources supporting your
application to reach or maintain adequate
performance.
In a hardware-centric world, this might mean adding
a larger hard drive to a computer for increased
storage capacity.
It might mean replacing the entire computer with a
machine that has more CPU and a more performant
network interface.
45. Scale OUT
The scale-out option implies a
distributed system whereby additional
machines are added to a cluster to
provide additional capacity. It's often
more likely to yield a linear increase in
scalability, although not necessarily
increased performance.
46. Analytics Scalability to Big Data
Vertical scalability means scaling up the given system resources and increasing the system’s
analytics, reporting and visualization capabilities.
Ex: designing the algorithm according to the architecture that uses resources efficiently.
Horizontal scalability means increasing the number of systems working in coherence and
scaling out the workload.
Ex: using more resources and distributing the storage and processing task in parallel.
Note: alternative ways of scaling analytics processing up and out deploy Massively Parallel Processing platforms (MPPs), cloud, grid, cluster and distributed computing software.
47. Massively Parallel Processing Platforms (MPPs)
Massively parallel processing (MPP) is a
collaborative processing of the same
program using two or more processors.
By using different processors, speed can be
dramatically increased.
For example, imagine a popular insurance company with millions of
customers. As the number of customers increases, so does the customer
data. Even if the firm uses parallel processing, they may experience a delay in
processing customer data. Assume a data analyst is running a query against
100 million rows of a database. If the organization uses a massively parallel
processing system with 1000 nodes, each node bears only 1/1000 of the computational load.
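A minimal sketch of that idea on a single machine (worker processes stand in for MPP nodes; the counts are scaled down so the sketch runs quickly):

from multiprocessing import Pool

N_WORKERS = 8            # stands in for the "1000 nodes" in the example
N_ROWS = 1_000_000       # stands in for the 100 million rows

def scan_partition(bounds):
    """Scan one partition of row ids and return a partial result."""
    start, end = bounds
    return sum(1 for row_id in range(start, end) if row_id % 97 == 0)

if __name__ == "__main__":
    step = N_ROWS // N_WORKERS
    partitions = [(i * step, (i + 1) * step) for i in range(N_WORKERS)]
    with Pool(N_WORKERS) as pool:
        partials = pool.map(scan_partition, partitions)   # each worker bears 1/N of the load
    print("matching rows:", sum(partials))                # combine the partial results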
48.
Parallelization of tasks can be done at several levels (a short sketch of the first two levels follows this list):
• Distributing separate tasks onto separate threads on the same CPU.
• Distributing separate tasks onto separate CPUs on the same computer.
• Distributing separate tasks onto separate computers.
There are several types of MPP database architectures:
• Distributed Computing Model
• Cloud Computing
• Grid and Cluster Computing
• Volunteer Computing
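A minimal Python sketch of the first two parallelization levels above; the task function is illustrative, not from the slides:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def task(n):
    return sum(i * i for i in range(n))

jobs = [200_000] * 4

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:    # separate threads on the same CPU
        thread_results = list(pool.map(task, jobs))
    with ProcessPoolExecutor(max_workers=4) as pool:   # separate CPUs on the same computer
        process_results = list(pool.map(task, jobs))
    print(thread_results == process_results)           # both levels give the same answers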
49. Distributed Computing Model
It uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by a high-speed network.
50. Cloud computing
A type of internet-based computing that provides shared processing resources and data to computers and other devices on demand.
One of the best approaches for data processing, to perform parallel and distributed computing.
Offers high data security compared to other distributed technologies.
51. Cloud resources
Amazon Web Services (AWS)
Elastic Compute Cloud (EC2)
Microsoft Azure or Apache CloudStack
Amazon Simple Storage Service (S3)
52. Cloud computing features
1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad network access
Cloud services can be accessed from anywhere and at any time through the
internet
53. Cloud service types
1. Infrastructure as a Service (IaaS):
Providing access to resources such as hard disks, network connections, database storage, data centres and virtual server space (a short storage-access sketch follows this list).
Ex: AWS EC2, Rackspace, Google Compute Engine
2. Platform as a Service (PaaS):
Providing a runtime environment to allow developers to build applications and services.
Ex: Windows Azure (mostly used as PaaS), Force.com
3. Software as a Service (SaaS):
Providing software applications as a service to end users.
Ex: BigCommerce, Google Apps, Salesforce, Dropbox
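A hedged sketch of using IaaS-style cloud storage from code: uploading a local file to Amazon S3 with boto3. The bucket and file names are hypothetical, and the call assumes AWS credentials are already configured on the machine:

import boto3

s3 = boto3.client("s3")
s3.upload_file("sales_2023.csv",        # local file (hypothetical)
               "my-analytics-bucket",   # bucket name (hypothetical)
               "raw/sales_2023.csv")    # object key inside the bucket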
54. Grid and Cluster computing
Grid Computing:
Distributed computing, in which a group of computers from several locations are connected with each
other to achieve a common task.
Grid: a group of computers that may be spread over remote locations.
This type of computing provides large-scale resource sharing which is flexible, coordinated and secure among its users.
For example, a research team might analyze weather patterns in the North Atlantic region, while
another team analyzes the south Atlantic region, and both results can be combined to deliver a
complete picture of Atlantic weather patterns
55. Features of Grid computing
Similar to cloud computing
Scalable
Distributed network for resource integration
56. Drawbacks of Grid Computing
Single point of failure
Storage capacity varies with the number of users, instances and the
amount of data transferred at a given time
57. Cluster computing
Group of computers connected by a network to accomplish the same task.
Used mainly for load balancing
59. Volunteer Computing
Volunteers are organizations or members who own personal computers.
They provide computing resources to important projects that use resources to
do distributed computing and/or storage
Volunteer Computing: uses computing resources of the volunteers
60. Issues of volunteer computing systems
Heterogeneity of the volunteered computers
Drop-outs from the network over time
Their sporadic availability
Incorrect results from volunteers are unaccountable, as the volunteers are anonymous
61. Designing Data Architecture
“Big data architecture is the logical and/or physical layout/structure of how big data will be stored,
accessed and managed within a big data or IT environment.”
Architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.
Data processing architecture consists of five layers:
(i) identification of data sources
(ii) acquisition, ingestion, extraction, pre-processing and transformation of data
(iii) data storage in files, servers, clusters or the cloud
(iv) data processing
(v) data consumption
65. Managing data for Analysis
Data managing means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
Data Management functions include:
67. Data Sources
Applications, programs and tools use data.
Sources can be external, such as sensors, trackers, web logs, computer
system logs and feeds.
Sources can be machines, which source data from data-creating programs.
Data sources can be (i) structured, (ii) semi-structured, (iii) multi-structured or unstructured.
68. Structured data sources
The source may be on the same computer running a program or a networked
computer.
Examples of structured data sources are SQL server, MySQL, Oracle DBMS, file
collection directory at a server.
A structured source has a defined name, which a process uses to identify it.
Ex: a name identifying stored data on student grades during processing could be studentname_data_grades.
Then, what could be the name of the data source?
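A hedged sketch of querying such a structured source with SQLite; the table, columns and values are hypothetical, only the source name comes from the slide:

import sqlite3

conn = sqlite3.connect("studentname_data_grades.db")
conn.execute("CREATE TABLE IF NOT EXISTS grades (usn TEXT, subject TEXT, grade TEXT)")
conn.execute("INSERT INTO grades VALUES ('4MW19CS001', '18CS72', 'A')")

# The fixed column names are what make this source 'structured'.
for row in conn.execute("SELECT usn, subject, grade FROM grades"):
    print(row)
conn.close()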
69. Unstructured data sources
Distributed over high-speed networks.
The data need high-velocity processing, as the sources are distributed file systems.
The sources are of file types such as .txt and .csv (comma-separated values).
Data may be key-value pairs, such as hash key-value pairs.
Data may have internal structures, such as in e-mails, Facebook pages, Twitter pages, etc.
Data sources can be sensors, sensor networks, signals from machines, devices and controllers of different types in industry, M2M communication and GPS systems.
70. Data Quality
Data quality is the measure of how well suited a data set is to serve its specific
purpose.
High-quality data is data with the five R's.
71. Data Integrity
Data integrity refers to the fact that data must be reliable and accurate over its
entire lifecycle.
WHY IS DATA INTEGRITY IMPORTANT?
You need constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.
Organizations collect more and more data and it has become a priority to secure
and maintain the integrity of this data. Without integrity and accuracy, your
data is worthless.
72. Examples of data quality problems
Noise
Outliers
Missing values
Duplicate data
73. Noisy data
For objects, noise is considered an extraneous object.
For attributes, noise refers to modification of original values.
Here, noise refers to measurement error in data values, which could be random or systematic.
74. Outliers
Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
Could indicate “interesting” cases, or could indicate errors in the data
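A minimal sketch of flagging outliers with a simple z-score rule; the values reuse the numbers from the earlier example plus one obvious outlier, and the threshold of 2 is a choice for this tiny sample, not a standard:

import statistics

values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0, 5000.0]
mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations away from the mean.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)   # 5000.0 is far from the rest, so it is flagged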
75. Missing values
Reasons for missing values
Information is not collected (e.g., people decline to give their age)
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Ways to handle missing values
Eliminate entities with missing values
Estimate attributes with missing values
Ignore the missing values during analysis
Replace with all possible values (weighted by their probabilities)
Impute missing values
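A minimal pandas sketch of two of the handling options listed above (eliminating entities and imputing values); the column names and values are illustrative:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [35, np.nan, 29, 26],
                   "income": [52000, 61000, np.nan, 45000]})

dropped = df.dropna()                             # eliminate entities with missing values
filled = df.fillna(df.mean(numeric_only=True))    # impute missing values with the column mean
print(dropped)
print(filled)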
76. Duplicate data
Data set may include data entities that are duplicates, or almost duplicates
of one another
Major issue when merging data from heterogeneous sources
Example: same person with multiple email addresses.
Data cleaning
Finding and dealing with duplicate entities
Finding and correcting measurement error
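A minimal pandas sketch of finding duplicate entities after merging sources; the data are illustrative (the names reuse the earlier XML example):

import pandas as pd

df = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@a.com", "seema@b.com", "satish@a.com"],
})

# Duplicates on 'name' point at the same person with multiple email addresses.
print(df[df.duplicated(subset="name", keep=False)])

deduplicated = df.drop_duplicates(subset="name", keep="first")
print(deduplicated)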
77. Data Preprocessing
It is an important step at the ingestion layer.
It is a must before data mining, analytics or before running machine
learning algorithms.
Pre-processing needs are:
78. Data cleaning
The process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Example:
Correcting the grade outliers or mistakenly entered values
79. Important terminologies in data cleaning
Data cleaning tools
Data enrichment
Data editing
Data reduction
Data wrangling
Data formats used during pre-processing
80. Data cleaning tools
Data cleaning is done before data mining.
Data cleaning tools help in refining and structuring data into usable
data.
Example:
OpenRefine
DataCleaner
81.
Data Enrichment: refers to operations or processes which refine, enhance or improve the raw data.
Data Editing: the process of reviewing and adjusting the acquired datasets; it controls data quality.
Editing methods are: interactive, selective, automatic, aggregating and distribution.
82.
Data Reduction: enables the transformation of acquired information into an ordered, correct and simplified form, and enables ingestion of meaningful data in the datasets.
Basic concept: reduce the multitudinous amount of data and use only the meaningful parts.
Uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.
83.
Data wrangling: the process of transforming and mapping the data so that the results of analysis are appropriate and valuable.
Example: mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
84. Data formats used during Pre-processing
85. CSV format
A CSV file is a plain-text file which stores table data of numbers and text.
Each line of a CSV file is a data record.
Each record consists of one or more fields, separated by commas.
CSV files are most often encountered in spreadsheets and databases.
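A minimal sketch of reading such a file with Python's csv module; the file name and column names are hypothetical:

import csv

with open("grades.csv", newline="") as f:
    reader = csv.DictReader(f)       # the first line is taken as the header row
    for record in reader:            # each line is one record
        print(record["usn"], record["grade"])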
87. Activity: find out the differences between CSV and Excel file formats.
88. Data format conversions
Preprocessing is needed for data-format conversions.
A number of different applications, services and tools need data in one specific format only.
Preprocessing before their usage, or before storage on cloud services, is a must.
90. The diagram shows data pre-processing, data mining, analysis, visualization and the data store.
The data are exported to cloud services.
The results are integrated at the enterprise server or data warehouse.
91. Cloud services
The services can be accessed through a cloud client, such as a web browser, an SQL client or another client.
95. Example 2: BigQuery cloud service on Google Cloud Platform
96. Data Storage and Management: Traditional Systems
A data store with structured or semi-structured data.
SQL:
An RDBMS uses SQL, which is a language for viewing or changing databases.
SQL does the following:
97. Data Storage and Analysis
99. Distributed database management system (DDBMS)
A collection of logically interrelated databases at multiple systems over a computer network.
Features of DDBMS are:
100. In-memory column-format data
Data in a column are kept together in memory in columnar format.
Allows faster data retrieval when only a few columns in a table need to be selected for querying.
A single memory access therefore loads many values of the column.
Used in OLAP.
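A minimal sketch (not a real OLAP engine, illustrative data) of why the columnar layout helps when a query touches only one column:

import numpy as np

# Row format: one record after another.
rows = [(1, 450.0, "KA"), (2, 980.0, "MH"), (3, 120.0, "KA")]

# Column format: all values of a column kept together in one array.
amounts = np.array([450.0, 980.0, 120.0])   # the 'amount' column
states = np.array(["KA", "MH", "KA"])        # the 'state' column

# A query that needs only 'amount' scans one contiguous array ...
print(amounts.sum())
# ... instead of stepping through every row tuple.
print(sum(r[1] for r in rows))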
101. Use of in-memory column formats in OLAP
OLAP (Online Analytical Processing) in real time is fast when using in-memory column-format tables: the CPU accesses all values of a column in a single access to the columnar in-memory data store, which enables real-time analytics.
102. In-memory row-format data
A row format in memory allows much faster data processing during OLTP (Online Transaction Processing).
104. Enterprise data-store server and data warehouse
Enterprise data, after the data cleaning process, integrate with the server data at the warehouse.
The enterprise data server uses data from several distributed sources, which store data using various technologies.
All data merge using an integration tool.
107. Big Data NoSQL, or Not Only SQL
NoSQL stands for 'No SQL' or 'Not Only SQL'.
NoSQL databases are semi-structured; a Big Data store uses NoSQL.
They do not integrate with applications using SQL.
NoSQL is also used in cloud data stores.
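A hedged sketch of working with a document-oriented NoSQL store (MongoDB via pymongo); it assumes a MongoDB server running on localhost, and the database and collection names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
people = client["bigdata_demo"]["people"]

# Documents need not share a fixed schema; no SQL is involved.
people.insert_one({"name": "Prashant Rao", "age": 35})
people.insert_one({"name": "Seema R.", "age": 41, "city": "Mumbai"})

for doc in people.find({"age": {"$gt": 30}}):
    print(doc["name"])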
110. Terminologies
Consistency: all copies have the same value, as in traditional DBs.
Availability: at least one copy is available in case a partition becomes inactive or fails.
Partition: parts which are active but may not cooperate, as in distributed DBs.
114. Big Data Platform
Supports large datasets and volumes of data, where the data generate at a higher velocity, in more varieties or with higher veracity.
Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
116. Hadoop
A Big Data platform consists of Big Data storage, servers, and data management and BI software.
Storage can deploy HDFS or NoSQL data stores such as HBase, MongoDB and Cassandra.
HDFS is an open-source storage system: a scaling, self-managing and self-healing file system.
117. Hadoop
A scalable and reliable parallel computing platform that manages Big Data distributed databases.
118. Hadoop-based Big Data environment
119. Mesos
Mesos v0.9 is a resource-management platform which enables sharing of cluster nodes by multiple frameworks and which is compatible with an open analytics stack.
120. Big Data Stack
A stack consists of a set of software components and data-store units.
Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud.
Uses a cluster of high-performance machines.
121. Tools for Big Data environment
122. Big Data Analytics
Data analysis is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
125. Berkeley Data Analytics Stack (BDAS)
Layers: Infrastructure, Storage, Data Processing, Application.
Resource Management: share infrastructure across frameworks (multi-programming for datacenters).
Data Management: efficient data sharing across frameworks.
Data Processing: in-memory processing; trade between time, quality, and cost.
Application: new apps such as AMP-Genomics and Carat.
126. Why BDAS?
Easy to combine batch, streaming, and interactive computations
A single execution model that supports all computation models
Easy to develop sophisticated algorithms
High-level abstractions for graph-based and ML algorithms
Compatible with the existing open-source ecosystem (Hadoop/HDFS)
Interoperates with existing storage and input formats (e.g., HDFS, Hive, Flume, …)
Supports existing execution models (e.g., Hive, GraphLab)
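A hedged sketch of the first point, using Apache Spark (the processing engine at the core of the Berkeley stack; naming it here is an assumption, and the file path is hypothetical). One cached, in-memory dataset serves both a batch-style word count and an interactive follow-up query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

lines = spark.read.text("hdfs:///data/notes.txt")          # hypothetical HDFS path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count().cache()              # kept in memory

counts.orderBy("count", ascending=False).show(10)            # batch-style result
counts.filter("word = 'hadoop'").show()                      # interactive follow-up on the same data
spark.stop()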
129. Big Data Analytics in Detection of Marketing Fraud
Fraud means deliberate deception.
Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to third parties, marketing products of compromised quality, etc.
Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.
The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability.
130. Big Data and Healthcare
133. Healthcare analytics using big data can facilitate the following
Provision of value-based and customer centric healthcare.
Utilizing the ‘Internet of Things’ for health care.
Preventing fraud, waste and abuse in the healthcare industry, and reducing healthcare costs.
Improving outcomes.
Monitoring patients in real time.
135. Findings of Big Data in Medicine
Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.
Big data creates patterns and models by data mining and helps in better understanding and research.
Wearable devices provide data, which the devices record during active as well as inactive periods.
138. Key reasons to take into account while using big data to improve the results of a digital marketing campaign:
Data visualization tools
Use of historical data
Target consumers
Crowdsourcing
Web mining
The real power of big data is the ability to forecast clients' needs and hence offer real value.
Note (on BDAS, slide 125): at the application layer, new, real applications such as AMP-Genomics (a genomics pipeline) and Carat are built; building real applications drives the features and design of the lower layers.