6. Course Outcomes:
CO1: Describe the fundamentals of Big Data analytics.
CO2: Extrapolate the Hadoop framework and the Hadoop Distributed File System.
CO3: Explore the concepts of NoSQL using MongoDB and Cassandra for Big Data.
CO4: Demonstrate the MapReduce programming model for big data processing along with Hadoop tools.
CO5: Explore Machine Learning algorithms for real-world big data, along with analytics of web content and social networks with relevant visualization tools.
8. In today’s discussion…
Introduction to data
Types of data
Current trend
Data and Big data
Big data vs. small data
Tools and techniques
9. Introduction to data
Example:
10, 25, …, Bantakal, 18CS72, sode-edu.in
Anything else?
Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
10. Types of Data
Structured
Unstructured
Semi-structured
11. Structured data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
12. Unstructured
Any data with an unknown form or structure is classified as unstructured data.
Example:
13. Semi-structured
Semi-structured data can contain both forms of data (structured and unstructured).
Examples of semi-structured data:
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
15. How large is your data?
What is the maximum file size you have dealt with so far?
Movies/files/streaming video that you have used?
What is the maximum download speed you get to retrieve data stored in distant locations?
How fast is your computation?
How much time does it take just to transfer the data, process it and get the result?
17. Sources of data
“Every day, we create 2.5 quintillion bytes of data: so much that 90% of the data in the world today has been created in the last two years alone. This explosion of information is known as ‘Big Data’.”
The data come from several sources:
… to name a few!
19. Now data is Big data!
No single standard definition!
‘Big-data’ is similar to ‘Small-data’, but bigger
…but having bigger data consequently requires different approaches: techniques, tools and architectures
…to solve new problems
…and, of course, in a better way
Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms,
and analytics to manage it and extract value and hidden knowledge from it…
20. Characteristics of Big data: 5V’s
5 V's of Big Data:
Volume
Velocity
Variety
Veracity
Value
21. Volume (the scale):
Volume in Big Data represents the amount of data. In today's world data is processed in various formats such as Word, Excel and PDF documents, and sometimes as audio and video. These data can be in structured, unstructured or semi-structured format. Recent social media platforms produce a tremendous amount of data, which is difficult for an organization to handle. To handle this huge amount of data, organizations should implement modern business intelligence tools that capture the data in an effective form and are cost-efficient for the organization.
22. Velocity (the speed):
Velocity refers to the rate/speed at which data is generated. This is primarily due to the Internet of Things (IoT), mobile data, social media and other factors. At least 3.8 million searches per minute, 228 million per hour, 5.6 billion per day and about 2 trillion per year are now being conducted.
23. Variety (data type):
Big Data can be structured, unstructured or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Quasi-structured data: textual data with inconsistent formats that can be formatted with effort, time and some tools.
24. Veracity:
Veracity is the degree of trustworthiness of the data, i.e., how reliable the data is.
There are many ways to filter or translate the data; veracity concerns being able to handle and manage such data efficiently, which is essential in business development.
For example, Facebook posts with hashtags.
25. Value:
Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process and analyze.
27. Big data vs. small data
Big data is more real-time in nature than traditional applications.
Big data architecture:
Traditional architectures (e.g., Exadata, Teradata) are not well suited to big data applications.
Massively parallel processing, scale-out architectures are well suited to big data applications.
28. Challenges ahead…
The bottleneck is in technology: new architectures, algorithms and techniques are needed.
It is also in technical skills: experts are needed to use the new technology and deal with Big Data.
Who are the major players in the world of Big Data?
34. Big data knows everything!
35. Job Opportunities in Big Data
Data Analysts
analyze and interpret data, visualize it, and build reports to help make better business decisions.
Data Scientists
mine data by assessing data sources and use algorithms and Machine Learning techniques.
Data Architects
design database systems and tools.
Database Managers
control database system performance, perform troubleshooting, and upgrade hardware and software.
Big Data Engineers
design, maintain, and support Big Data solutions.
37. Questions of the day…
1. What are the smallest and largest units for measuring the size of data?
2. How big is a quintillion?
3. Give examples of the smallest and the largest entities of data.
4. Give FIVE parameters with which data can be categorized as (i) simple, (ii) moderately complex and (iii) complex.
38. Questions of the day…
5. What types of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
39. Big data types (by IBM)
Social Networks and web data
Transactions data and Business Process data
Customer master data
Machine generated data
Human generated data
40. Big Data Classification: based on characteristics for designing data architecture for processing and analytics
42. Scalability and Parallel Processing
Scalability is the capability to handle growing amounts of data and a growing number of database clients, either by adding more hardware resources or by optimization and more efficient usage of the existing resources.
Scalability enables an increase or decrease in the capacity of data storage, processing and analytics.
In short, you need to build scalability into the hardware architecture and database selection, and can (for the most part) maximize performance later, during the database design and deployment phase.
System capability needs to grow with increasing workloads; when the workload and complexity exceed the system's capacity, scale it up or scale it out.
43. Scalability Options
Assuming you need to scale your system, there are two options:
scaling up
scaling out
44. Scale UP
Resources such as CPU, network, and storage are
common targets for scaling up.
The goal is to increase the resources supporting your
application to reach or maintain adequate
performance.
In a hardware-centric world, this might mean adding
a larger hard drive to a computer for increased
storage capacity.
It might mean replacing the entire computer with a
machine that has more CPU and a more performant
network interface.
45. Scale OUT
The scale-out option implies a
distributed system whereby additional
machines are added to a cluster to
provide additional capacity. It's often
more likely to yield a linear increase in
scalability, although not necessarily
increased performance.
46. Analytics Scalability to Big Data
Vertical scalability means scaling up the given system resources and increasing the system’s
analytics, reporting and visualization capabilities.
Ex: designing the algorithm according to the architecture that uses resources efficiently.
Horizontal scalability means increasing the number of systems working in coherence and
scaling out the workload.
Ex: using more resources and distributing the storage and processing task in parallel.
Note: alternative ways of scaling analytics processing up and out deploy Massively Parallel Processing platforms (MPPs), cloud, grid, cluster and distributed computing software.
47. Massively Parallel Processing Platforms (MPPs)
Massively parallel processing (MPP) is a
collaborative processing of the same
program using two or more processors.
By using different processors, speed can be
dramatically increased.
For example, imagine a popular insurance company with millions of
customers. As the number of customers increases, so does the customer
data. Even if the firm uses parallel processing, they may experience a delay in
processing customer data. Assume a data analyst is running a query against
100 million rows of a database. If the organization uses a massively parallel
processing system with 1000 nodes, each node bears only 1/1000 of the computational load.
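A minimal sketch of that idea on a single machine (worker processes stand in for MPP nodes; the counts are scaled down so the sketch runs quickly):

from multiprocessing import Pool

N_WORKERS = 8            # stands in for the "1000 nodes" in the example
N_ROWS = 1_000_000       # stands in for the 100 million rows

def scan_partition(bounds):
    """Scan one partition of row ids and return a partial result."""
    start, end = bounds
    return sum(1 for row_id in range(start, end) if row_id % 97 == 0)

if __name__ == "__main__":
    step = N_ROWS // N_WORKERS
    partitions = [(i * step, (i + 1) * step) for i in range(N_WORKERS)]
    with Pool(N_WORKERS) as pool:
        partials = pool.map(scan_partition, partitions)   # each worker bears 1/N of the load
    print("matching rows:", sum(partials))                # combine the partial results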
48.
Parallelization of tasks can be done at several levels (a short sketch of the first two levels follows this list):
• Distributing separate tasks onto separate threads on the same CPU.
• Distributing separate tasks onto separate CPUs on the same computer.
• Distributing separate tasks onto separate computers.
There are several types of MPP database architectures:
• Distributed Computing Model
• Cloud Computing
• Grid and Cluster Computing
• Volunteer Computing
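A minimal Python sketch of the first two parallelization levels above; the task function is illustrative, not from the slides:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def task(n):
    return sum(i * i for i in range(n))

jobs = [200_000] * 4

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:    # separate threads on the same CPU
        thread_results = list(pool.map(task, jobs))
    with ProcessPoolExecutor(max_workers=4) as pool:   # separate CPUs on the same computer
        process_results = list(pool.map(task, jobs))
    print(thread_results == process_results)           # both levels give the same answers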
49. Distributed Computing Model
It uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by a high-speed network.
50. Cloud computing
A type of internet-based computing that provides shared processing resources and data to computers and other devices on demand.
One of the best approaches for data processing, to perform parallel and distributed computing.
Offers high data security compared to other distributed technologies.
51. Cloud resources
Amazon Web Services (AWS)
Elastic Compute Cloud (EC2)
Microsoft Azure or Apache CloudStack
Amazon Simple Storage Service (S3)
52. Cloud computing features
1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad network access
Cloud services can be accessed from anywhere and at any time through the
internet
53. Cloud service types
1. Infrastructure as a Service (IaaS):
Providing access to resources such as hard disks, network connections, database storage, data centres and virtual server space (a short storage-access sketch follows this list).
Ex: AWS EC2, Rackspace, Google Compute Engine
2. Platform as a Service (PaaS):
Providing a runtime environment to allow developers to build applications and services.
Ex: Windows Azure (mostly used as PaaS), Force.com
3. Software as a Service (SaaS):
Providing software applications as a service to end users.
Ex: BigCommerce, Google Apps, Salesforce, Dropbox
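A hedged sketch of using IaaS-style cloud storage from code: uploading a local file to Amazon S3 with boto3. The bucket and file names are hypothetical, and the call assumes AWS credentials are already configured on the machine:

import boto3

s3 = boto3.client("s3")
s3.upload_file("sales_2023.csv",        # local file (hypothetical)
               "my-analytics-bucket",   # bucket name (hypothetical)
               "raw/sales_2023.csv")    # object key inside the bucket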
54. Grid and Cluster computing
Grid Computing:
Distributed computing, in which a group of computers from several locations are connected with each
other to achieve a common task.
Grid: a group of computers that may be spread over remote locations.
This type of computing provides large-scale resource sharing which is flexible, coordinated and secure among its users.
For example, a research team might analyze weather patterns in the North Atlantic region, while
another team analyzes the south Atlantic region, and both results can be combined to deliver a
complete picture of Atlantic weather patterns
55. Features of Grid computing
Similar to cloud computing
Scalable
Distributed network for resource integration
56. Drawbacks of Grid Computing
Single point of failure
Storage capacity varies with the number of users, instances and the
amount of data transferred at a given time
57. Cluster computing
Group of computers connected by a network to accomplish the same task.
Used mainly for load balancing
59. Volunteer Computing
Volunteers are organizations or members who own personal computers.
They provide computing resources to important projects that use resources to
do distributed computing and/or storage
Volunteer Computing: uses computing resources of the volunteers
60. Issues of volunteer computing systems
Heterogeneity of the volunteered computers
Drop-outs from the network over time
Their sporadic availability
Incorrect results from volunteers are unaccountable, as the volunteers are anonymous
61. Designing Data Architecture
“Big data architecture is the logical and/or physical layout/structure of how big data will be stored,
accessed and managed within a big data or IT environment.”
Architecture logically defines how the big data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.
Data processing architecture consists of five layers:
(i) identification of data sources
(ii) acquisition, ingestion, extraction, pre-processing and transformation of data
(iii) data storage in files, servers, clusters or the cloud
(iv) data processing
(v) data consumption
65. Managing data for Analysis
Data managing means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
Data Management functions include:
67. Data Sources
Applications, programs and tools use data.
Sources can be external, such as sensors, trackers, web logs, computer
system logs and feeds.
Sources can be machines, which source data from data-creating programs.
Data sources can be (i) structured, (ii) semi-structured, (iii) multi-structured or unstructured.
68. Structured data sources
The source may be on the same computer running a program or a networked
computer.
Examples of structured data sources are SQL server, MySQL, Oracle DBMS, file
collection directory at a server.
A structured source has a defined name, which a process uses to identify it.
Ex: a name identifying stored data on student grades during processing could be studentname_data_grades.
Then, what could be the name of the data source?
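A hedged sketch of querying such a structured source with SQLite; the table, columns and values are hypothetical, only the source name comes from the slide:

import sqlite3

conn = sqlite3.connect("studentname_data_grades.db")
conn.execute("CREATE TABLE IF NOT EXISTS grades (usn TEXT, subject TEXT, grade TEXT)")
conn.execute("INSERT INTO grades VALUES ('4MW19CS001', '18CS72', 'A')")

# The fixed column names are what make this source 'structured'.
for row in conn.execute("SELECT usn, subject, grade FROM grades"):
    print(row)
conn.close()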
69. Unstructured data sources
Distributed over high-speed networks.
The data need high-velocity processing, as the sources are distributed file systems.
The sources are of file types such as .txt and .csv (comma-separated values).
Data may be key-value pairs, such as hash key-value pairs.
Data may have internal structures, such as in e-mails, Facebook pages, Twitter pages, etc.
Data sources can be sensors, sensor networks, signals from machines, devices and controllers of different types in industry, M2M communication and GPS systems.
70. Data Quality
Data quality is the measure of how well suited a data set is to serve its specific
purpose.
High-quality data is data with the five R's.
71. Data Integrity
Data integrity refers to the fact that data must be reliable and accurate over its
entire lifecycle.
WHY IS DATA INTEGRITY IMPORTANT?
You need constant access to quality data. Data integrity is important as it guarantees and secures the searchability and traceability of your data to its original source.
Organizations collect more and more data and it has become a priority to secure
and maintain the integrity of this data. Without integrity and accuracy, your
data is worthless.
72. Examples of data quality problems
Noise
Outliers
Missing values
Duplicate data
73. Noisy data
For objects, noise is considered an extraneous object.
For attributes, noise refers to modification of original values.
Here, noise refers to measurement error in data values, which could be random or systematic.
74. Outliers
Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
Could indicate “interesting” cases, or could indicate errors in the data
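A minimal sketch of flagging outliers with a simple z-score rule; the values reuse the numbers from the earlier example plus one obvious outlier, and the threshold of 2 is a choice for this tiny sample, not a standard:

import statistics

values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0, 5000.0]
mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations away from the mean.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)   # 5000.0 is far from the rest, so it is flagged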
75. Missing values
Reasons for missing values
Information is not collected (e.g., people decline to give their age)
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Ways to handle missing values
Eliminate entities with missing values
Estimate attributes with missing values
Ignore the missing values during analysis
Replace with all possible values (weighted by their probabilities)
Impute missing values
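A minimal pandas sketch of two of the handling options listed above (eliminating entities and imputing values); the column names and values are illustrative:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [35, np.nan, 29, 26],
                   "income": [52000, 61000, np.nan, 45000]})

dropped = df.dropna()                             # eliminate entities with missing values
filled = df.fillna(df.mean(numeric_only=True))    # impute missing values with the column mean
print(dropped)
print(filled)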
76. Duplicate data
Data set may include data entities that are duplicates, or almost duplicates
of one another
Major issue when merging data from heterogeneous sources
Example: same person with multiple email addresses.
Data cleaning
Finding and dealing with duplicate entities
Finding and correcting measurement error
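A minimal pandas sketch of finding duplicate entities after merging sources; the data are illustrative (the names reuse the earlier XML example):

import pandas as pd

df = pd.DataFrame({
    "name":  ["Seema R.", "Seema R.", "Satish Mane"],
    "email": ["seema@a.com", "seema@b.com", "satish@a.com"],
})

# Duplicates on 'name' point at the same person with multiple email addresses.
print(df[df.duplicated(subset="name", keep=False)])

deduplicated = df.drop_duplicates(subset="name", keep="first")
print(deduplicated)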
77. Data Preprocessing
It is an important step at the ingestion layer.
It is a must before data mining, analytics or before running machine
learning algorithms.
Pre-processing needs are:
78. Data cleaning
The process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Example:
Correcting the grade outliers or mistakenly entered values
79. Important terminologies in data cleaning
Data cleaning tools
Data enrichment
Data editing
Data reduction
Data wrangling
Data formats used during pre-processing
80. Data cleaning tools
Data cleaning is done before data mining.
Data cleaning tools help in refining and structuring data into usable
data.
Example:
OpenRefine
DataCleaner
81.
Data Enrichment: refers to operations or processes which refine, enhance or improve the raw data.
Data Editing: the process of reviewing and adjusting the acquired datasets; it controls data quality.
Editing methods are: interactive, selective, automatic, aggregating and distribution.
82.
Data Reduction: enables the transformation of acquired information into an ordered, correct and simplified form, and enables ingestion of meaningful data in the datasets.
Basic concept: reduce the multitudinous amount of data and use only the meaningful parts.
Uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.
83.
Data wrangling: the process of transforming and mapping the data so that the results of analysis are appropriate and valuable.
Example: mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
84. Data formats used during Pre-processing
85. CSV format
A CSV file is a plain-text file which stores table data of numbers and text.
Each line of a CSV file is a data record.
Each record consists of one or more fields, separated by commas.
CSV files are most often encountered in spreadsheets and databases.
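A minimal sketch of reading such a file with Python's csv module; the file name and column names are hypothetical:

import csv

with open("grades.csv", newline="") as f:
    reader = csv.DictReader(f)       # the first line is taken as the header row
    for record in reader:            # each line is one record
        print(record["usn"], record["grade"])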
87. Activity: find out the differences between CSV and Excel file formats.
88. Data format conversions
Preprocessing is needed for data-format conversions.
A number of different applications, services and tools need data in one specific format only.
Preprocessing before their usage, or before storage on cloud services, is a must.
90. The diagram shows data pre-processing, data mining, analysis, visualization and the data store.
The data are exported to cloud services.
The results are integrated at the enterprise server or data warehouse.
91. Cloud services
The services can be accessed through a cloud client, such as a web browser, an SQL client or another client.
95. Example 2: BigQuery cloud service on Google Cloud Platform
96. Data Storage and Management: Traditional Systems
A data store with structured or semi-structured data.
SQL:
An RDBMS uses SQL, which is a language for viewing or changing databases.
SQL does the following:
97. Data Storage and Analysis
99. Distributed database management system (DDBMS)
A collection of logically interrelated databases at multiple systems over a computer network.
Features of DDBMS are:
100. In-memory column-format data
Data in a column are kept together in memory in columnar format.
Allows faster data retrieval when only a few columns in a table need to be selected for querying.
A single memory access therefore loads many values of the column.
Used in OLAP.
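A minimal sketch (not a real OLAP engine, illustrative data) of why the columnar layout helps when a query touches only one column:

import numpy as np

# Row format: one record after another.
rows = [(1, 450.0, "KA"), (2, 980.0, "MH"), (3, 120.0, "KA")]

# Column format: all values of a column kept together in one array.
amounts = np.array([450.0, 980.0, 120.0])   # the 'amount' column
states = np.array(["KA", "MH", "KA"])        # the 'state' column

# A query that needs only 'amount' scans one contiguous array ...
print(amounts.sum())
# ... instead of stepping through every row tuple.
print(sum(r[1] for r in rows))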
101. Use of in-memory column formats in OLAP
OLAP (Online Analytical Processing) in real time is fast when using in-memory column-format tables: the CPU accesses all values of a column in a single access to the columnar in-memory data store, which enables real-time analytics.
102. In-memory row-format data
A row format in memory allows much faster data processing during OLTP (Online Transaction Processing).
104. Enterprise data-store server and data warehouse
Enterprise data, after the data cleaning process, integrate with the server data at the warehouse.
The enterprise data server uses data from several distributed sources, which store data using various technologies.
All data merge using an integration tool.
107. Big Data NoSQL, or Not Only SQL
NoSQL stands for 'No SQL' or 'Not Only SQL'.
NoSQL databases are semi-structured; a Big Data store uses NoSQL.
They do not integrate with applications using SQL.
NoSQL is also used in cloud data stores.
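A hedged sketch of working with a document-oriented NoSQL store (MongoDB via pymongo); it assumes a MongoDB server running on localhost, and the database and collection names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
people = client["bigdata_demo"]["people"]

# Documents need not share a fixed schema; no SQL is involved.
people.insert_one({"name": "Prashant Rao", "age": 35})
people.insert_one({"name": "Seema R.", "age": 41, "city": "Mumbai"})

for doc in people.find({"age": {"$gt": 30}}):
    print(doc["name"])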
110. Terminologies
Consistency: all copies have the same value, as in traditional DBs.
Availability: at least one copy is available in case a partition becomes inactive or fails.
Partition: parts which are active but may not cooperate, as in distributed DBs.
114. Big Data Platform
Supports large datasets and volumes of data, where the data generate at a higher velocity, in more varieties or with higher veracity.
Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
116. Hadoop
A Big Data platform consists of Big Data storage, servers, and data management and BI software.
Storage can deploy HDFS or NoSQL data stores such as HBase, MongoDB and Cassandra.
HDFS is an open-source storage system: a scaling, self-managing and self-healing file system.
117. Hadoop
A scalable and reliable parallel computing platform that manages Big Data distributed databases.
118. Hadoop-based Big Data environment
119. Mesos
Mesos v0.9 is a resource-management platform which enables sharing of cluster nodes by multiple frameworks and which is compatible with an open analytics stack.
120. Big Data Stack
A stack consists of a set of software components and data-store units.
Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud.
Uses a cluster of high-performance machines.
121. Tools for Big Data environment
122. Big Data Analytics
Data analysis is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
125. Berkeley Data Analytics Stack (BDAS)
Layers: Infrastructure, Storage, Data Processing, Application.
Resource Management: share infrastructure across frameworks (multi-programming for datacenters).
Data Management: efficient data sharing across frameworks.
Data Processing: in-memory processing; trade between time, quality, and cost.
Application: new apps such as AMP-Genomics and Carat.
126. Why BDAS?
Easy to combine batch, streaming, and interactive computations
A single execution model that supports all computation models
Easy to develop sophisticated algorithms
High-level abstractions for graph-based and ML algorithms
Compatible with the existing open-source ecosystem (Hadoop/HDFS)
Interoperates with existing storage and input formats (e.g., HDFS, Hive, Flume, …)
Supports existing execution models (e.g., Hive, GraphLab)
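A hedged sketch of the first point, using Apache Spark (the processing engine at the core of the Berkeley stack; naming it here is an assumption, and the file path is hypothetical). One cached, in-memory dataset serves both a batch-style word count and an interactive follow-up query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

lines = spark.read.text("hdfs:///data/notes.txt")          # hypothetical HDFS path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count().cache()              # kept in memory

counts.orderBy("count", ascending=False).show(10)            # batch-style result
counts.filter("word = 'hadoop'").show()                      # interactive follow-up on the same data
spark.stop()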
129. Big Data Analytics in Detection of Marketing Fraud
Fraud means deliberate deception.
Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to third parties, marketing products of compromised quality, etc.
Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.
The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability.
130. Big Data and Healthcare
133. Healthcare analytics using big data can facilitate the following
Provision of value-based and customer centric healthcare.
Utilizing the ‘Internet of Things’ for health care.
Preventing fraud, waste and abuse in the healthcare industry, and reducing healthcare costs.
Improving outcomes.
Monitoring patients in real time.
135. Findings of Big Data in Medicine
Big data analytics deploys large volumes of data to identify and derive intelligent predictive models about individuals.
Big data creates patterns and models by data mining and helps in better understanding and research.
Wearable devices provide data, which the devices record during active as well as inactive periods.
138. Key reasons to take into account while using big data to improve the results of a digital marketing campaign:
Data visualization tools
Use of historical data
Target consumers
Crowdsourcing
Web mining
The real power of big data is the ability to forecast clients' needs and hence offer real value.
Note (on BDAS, slide 125): at the application layer, new, real applications such as AMP-Genomics (a genomics pipeline) and Carat are built; building real applications drives the features and design of the lower layers.