Syllabus and Introduction
SHRINIVASA, Dept. of CSE, SMVITM Bantakal
Course Outcomes:
 CO1: Describe the fundamentals of Big Data analytics.
 CO2: Extrapolate Hadoop framework and Hadoop Distributed File system.
 CO3: Explore the concepts of NoSQL using MongoDB and Cassandra for Big
Data.
 CO4: Demonstrate the MapReduce programming model for the big data
processing along with Hadoop tools.
 CO5: Explore Machine Learning algorithms for real world big data along with
analytics of web contents and Social Networks with relevant visualization tools.
In today’s discussion…
 Introduction to data
 Types of data
 Current trend
 Data and Big data
 Big data vs. small data
 Tools and techniques
Introduction to data
 Example:
10, 25, …, Bantakal, 18CS72, sode-edu.in
Anything else?
 Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
Types of Data
 Structured
 Unstructured
 Semi-structured
Structured data
 Any data that can be stored, accessed and processed in a fixed format is termed
'structured' data.
Unstructured
 Any data whose form or structure is unknown is classified as unstructured data.
 Example:
Semi-structured
 Semi-structured data can contain both forms of data (structured and unstructured).
 Example of semi-structured data
 Personal data stored in an XML file:
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
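The tags in the records above are what make the data semi-structured: a program can pick out the fields without a fixed table schema. A minimal sketch in Python, assuming the four records are wrapped in a hypothetical <people> root element (the slide shows no root, and XML needs one to be well-formed):

```python
import xml.etree.ElementTree as ET

# The four <rec> elements from the slide. The <people> wrapper below is an
# assumption added only so that the text parses as one well-formed document.
records = """
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
"""
root = ET.fromstring("<people>" + records + "</people>")

# Turn each tagged record into a dictionary, converting age to a number.
people = [
    {
        "name": rec.findtext("name"),
        "gender": rec.findtext("gender"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]

average_age = sum(p["age"] for p in people) / len(people)
print(people[0]["name"], average_age)
```

Once the tags are parsed, the same records can be queried like structured data, for example to compute the average age.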
How large is your data?
 What is the maximum file size you have dealt with so far?
 Movies/files/streaming video that you have used?
 What is the maximum download speed you get?
 To retrieve data stored in distant locations?
 How fast is your computation?
 How much time does it take just to transfer the data from you, process it and get the
result?
Growth of data
Sources of data
 Every day, we create 2.5 quintillion bytes of data,
 so much that 90% of the data in the world today has been created in the last two years alone. This
explosion of information is known as “Big Data”.
 The data come from several sources :
etc. …… to name a few!
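To get a feel for the figure quoted above, the sketch below converts 2.5 quintillion bytes into larger units. It assumes the short-scale meaning of quintillion (10^18) and decimal SI prefixes (1 KB = 1000 bytes); the function name human_readable is only illustrative.

```python
# Decimal (SI) unit names, each 1000 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes, base=1000):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < base or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= base

QUINTILLION = 10**18                  # assumption: short scale
daily_bytes = 25 * QUINTILLION // 10  # 2.5 quintillion bytes per day

print(human_readable(daily_bytes))
```

Under these assumptions the daily volume works out to 2.5 exabytes.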
Social Media:
Now data is Big data!
 No single standard definition!
 ‘Big-data’ is similar to ‘Small-data’, but bigger
…but having data bigger consequently requires different approaches
 techniques, tools and architectures
…to solve: new problems
…and, of course, in a better way
Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms,
and analytics to manage it and extract value and hidden knowledge from it…
Characteristics of Big data: 5V’s
 5 V's of Big Data:
 Volume
 Velocity
 Variety
 Veracity
 Value
Volume (the scale):
 Volume in Big Data represents the amount of data. Today, data is
processed in various formats such as Word, Excel and PDF documents, and
sometimes audio and video. These data can be in structured,
unstructured or semi-structured format. Social media
platforms produce a tremendous amount of data, which is difficult for
an organization to handle. To handle this huge amount of data,
organizations should implement modern business intelligence tools
that capture the data effectively and cost-efficiently.
Velocity (the speed):
 Velocity refers to the rate/speed at which data is generated.
 This is primarily due to the Internet of Things (IoT), mobile data, social
media, and other factors. At least 2 trillion searches are now conducted
each year, which is about 5.6 billion searches per day, 228 million
searches per hour, and 3.8 million searches per minute.
Variety (data type):
 Big Data can be structured, unstructured or semi-structured, and is
collected from different sources. In the past, data was collected only
from databases and spreadsheets, but these days data
comes in an array of forms, such as PDFs, emails, audio, social media posts, photos,
videos, etc.
Quasi-structured data: textual data with
inconsistent formats, which can be
formatted using some tools, with effort and
time.
Veracity:
 Degree of trustworthiness of data is the veracity of data.
 Veracity means how reliable the data is. There are many ways to filter or
translate the data. Veracity is about being able to handle and
manage data efficiently. Big Data is also essential in business
development.
 For example, Facebook posts with hashtags.
Value:
 Value is an essential characteristic of big data. It is not just any data that we
process or store. It is valuable and reliable data that we store, process
and analyze.
Big data vs. small data
 Big data is more real-time in nature than
traditional applications
 Big data architecture
 Traditional architectures are not well-suited for big
data applications (e.g. Exadata, Teradata)
 Massively parallel processing, scale-out
architectures are well-suited for big data
applications
Challenges ahead…
 The Bottleneck is in technology
 New architecture, algorithms, techniques are needed
 Also in technical skills
 Experts in using the new technology and dealing with Big data
Who are the major players in the world of
Big data?
Big data players
Major players…
 Google
 Hadoop
 MapReduce
 Mahout
 Apache HBase
 Cassandra
Tools available
 NoSQL
 Databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
 MapReduce
 Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie,
Greenplum
 Storage
 S3, HDFS, GDFS
 Servers
 EC2, Google App Engine, Elastic Beanstalk, Heroku
 Processing
 R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
Predictions & Recommendations
Big data knows everything..!!
Job Opportunities in Big Data
 Data Analysts
 analyze and interpret data, visualize it, and build reports to help make better business decisions.
 Data Scientists
 mine data by assessing data sources and use algorithms and Machine Learning techniques.
 Data Architects
 design database systems and tools.
 Database Managers
 control database system performance, perform troubleshooting, and upgrade hardware and software.
 Big Data Engineers
 design, maintain, and support Big Data solutions.
END OF INTRODUCTION
Questions of the day…
1. What are the smallest and largest units for measuring the size of data?
2. How big is a quintillion?
3. Give examples of the smallest and the largest entities of data.
4. Give FIVE parameters with which data can be categorized as i) simple,
ii) moderately complex and iii) complex.
Questions of the day…
5. What type of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
Big data types (by IBM)
 Social Networks and web data
 Transactions data and Business Process data
 Customer master data
 Machine generated data
 Human generated data
Big Data Classification : based on characteristics for
designing data architecture for processing and analytics
Scalability and Parallel Processing
 Scalability is the capability to handle growing amounts of data and growing number of
database clients either by adding more hardware resources or by optimization and more
efficient usage of the existing resources.
 Scalability enables increase or decrease in the capacity of data storage, processing and
analytics.
 In short, you need to build scalability into the hardware architecture and database
selection, and can (for the most part) maximize performance later — during the
database design and deployment phase.
 System capacity needs to grow with increased workloads. When the workload
and complexity exceed the system capacity, scale it up or scale it out.
Scalability Options
 Assuming you need to scale your system, there are two options:
 scaling up
 scaling out
Scale UP
 Resources such as CPU, network, and storage are
common targets for scaling up.
 The goal is to increase the resources supporting your
application to reach or maintain adequate
performance.
 In a hardware-centric world, this might mean adding
a larger hard drive to a computer for increased
storage capacity.
 It might mean replacing the entire computer with a
machine that has more CPU and a more performant
network interface.
Scale OUT
 The scale-out option implies a
distributed system whereby additional
machines are added to a cluster to
provide additional capacity. It's often
more likely to yield a linear increase in
scalability, although not necessarily
increased performance.
Analytics Scalability to Big Data
 Vertical scalability means scaling up the given system resources and increasing the system’s
analytics, reporting and visualization capabilities.
 Ex: designing the algorithm according to the architecture that uses resources efficiently.
 Horizontal scalability means increasing the number of systems working in coherence and
scaling out the workload.
 Ex: using more resources and distributing the storage and processing task in parallel.
 Note: Alternative ways of scaling up and scaling out the processing of analytics software and big data
analytics deploy Massively Parallel Processing Platforms (MPPs), cloud, grid, clusters and
distributed computing software.
Massively Parallel Processing Platforms(MPPs)
 Massively parallel processing (MPP) is a
collaborative processing of the same
program using two or more processors.
 By using different processors, speed can be
dramatically increased.
 For example, imagine a popular insurance company with millions of
customers. As the number of customers increases, so does the customer
data. Even if the firm uses parallel processing, it may experience a delay in
processing customer data. Assume a data analyst is running a query against
100 million rows of a database. If the organization uses a massively parallel
processing system with 1000 nodes, each node has to bear only 1/1000 of the
computational load, that is, about 100,000 rows.
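The arithmetic of the insurance example can be sketched as follows. This is only an illustration, not an MPP database: the helper names rows_per_node and partition are hypothetical, and Python threads stand in for the separate nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def rows_per_node(total_rows, nodes):
    """Split total_rows as evenly as possible across the given number of nodes."""
    base, extra = divmod(total_rows, nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]

def partition(rows, nodes):
    """Cut a list of rows into one contiguous chunk per node."""
    chunks, start = [], 0
    for size in rows_per_node(len(rows), nodes):
        chunks.append(rows[start:start + size])
        start += size
    return chunks

# The slide's example: 100 million rows spread over 1000 nodes.
shares = rows_per_node(100_000_000, 1000)
print(shares[0])  # rows handled by one node

# Toy version of the same idea: 4 workers each sum their own chunk,
# then the partial results are combined into the final answer.
rows = list(range(1, 1001))
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, partition(rows, 4)))
total = sum(partial_sums)
print(total)
```

Each worker touches only its own chunk, which is why adding nodes reduces the load per node; the final combining step is small by comparison.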
Parallelization of tasks can be done at several levels:
• Distributing separate tasks onto separate threads on the same CPU.
• Distributing separate tasks onto separate CPUs on the same computer.
• Distributing separate tasks onto separate computers.
There are several types of MPP database architectures:
• Distributed Computing Model
• Cloud Computing
• Grid and Cluster Computing
• Volunteer Computing
Distributed Computing Model
 It uses cloud, grid or clusters, which process and analyze big and large
datasets on distributed computing nodes connected by high speed
network.
Cloud computing
 Type of internet-based computing that provides shared processing
resources and data to the computers and other devices on demand.
 One of the best approaches to data processing for performing parallel and
distributed computing
 Offers high data security compared to other distributed technologies
Cloud resources
 Amazon Web Services (AWS)
 Elastic Compute Cloud (EC2)
 Microsoft Azure or Apache CloudStack
 Amazon Simple Storage Service (S3)
Cloud computing features
1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad network Access
 Cloud services can be accessed from anywhere and at any time through the
internet
Cloud services types
1. Infrastructure as a Service(IaaS):
 Providing access to resources, such as hard disks, network connections, database
storage, data centres and virtual server space.
 Ex: AWS EC2, Rackspace, Google Compute Engine
2. Platform as a Service(PaaS):
 Providing runtime environment to allow developers to build applications and services.
 Ex: Windows Azure (mostly used as PaaS), Force.com
3. Software as a Service(SaaS):
 Providing software applications as a service to end-users
 Ex: BigCommerce, Google Apps, Salesforce, Dropbox
Grid and Cluster computing
 Grid Computing:
 Distributed computing in which a group of computers from several locations are connected with each
other to achieve a common task.
 Grid: a group of computers that may be spread across remote locations
 This type of computing provides large-scale resource sharing that is flexible, coordinated and secure
among its users.
 For example, a research team might analyze weather patterns in the North Atlantic region, while
another team analyzes the south Atlantic region, and both results can be combined to deliver a
complete picture of Atlantic weather patterns
Features of Grid computing
 Similar to cloud computing
 Scalable
 Distributed network for resource integration
Drawbacks of Grid Computing
 Single point of failure
 Storage capacity varies with the number of users, instances and the
amount of data transferred at a given time
Cluster computing
 Group of computers connected by a network to accomplish the same task.
 Used mainly for load balancing
Difference between Cluster and Grid Computing:
Volunteer Computing
 Volunteers are organizations or members who own personal computers.
 They provide computing resources to important projects that use resources to
do distributed computing and/or storage
 Volunteer Computing: uses computing resources of the volunteers
Issues of volunteer computing systems
 Heterogeneity of the volunteered computers
 Drop-outs from the network over time
 Sporadic availability of the volunteered computers
 Incorrect results from volunteers cannot be held to account, as the volunteers are anonymous
Designing Data Architecture
“Big data architecture is the logical and/or physical layout/structure of how big data will be stored,
accessed and managed within a big data or IT environment.”
 Architecture logically defines how big data solution will work, the core components(hardware,
database, software, storage) used, flow of information, security and more.
 Data processing architecture consists of 5 layers:
 (i) identification of data sources
 (ii) acquisition, ingestion, extraction, pre-processing, transformation of data
 (iii) data storage at files, servers, cluster or cloud
 (iv) data processing
 (v) data consumption
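The five layers listed above can be pictured as a chain of functions, each feeding the next. The sketch below is a toy illustration under stated assumptions: the sensor names, the readings and every function name are hypothetical, and an in-memory dictionary stands in for files, servers, clusters or cloud storage.

```python
def identify_sources():
    # Layer 1: identify data sources (hypothetical raw sensor readings).
    return {"sensor_a": ["21.5", " 22.0", "n/a"], "sensor_b": ["19.0", "20.5 "]}

def ingest(sources):
    # Layer 2: acquire, pre-process and transform the raw text into numbers,
    # dropping readings that cannot be parsed.
    records = []
    for name, raw_values in sources.items():
        for raw in raw_values:
            try:
                records.append((name, float(raw.strip())))
            except ValueError:
                continue
    return records

def store(records):
    # Layer 3: data storage (a dict stands in for files, cluster or cloud).
    storage = {}
    for name, value in records:
        storage.setdefault(name, []).append(value)
    return storage

def process(storage):
    # Layer 4: data processing (here, a mean per source).
    return {name: sum(vals) / len(vals) for name, vals in storage.items()}

def consume(results):
    # Layer 5: data consumption (a small text report).
    return [f"{name}: mean={mean:.2f}" for name, mean in sorted(results.items())]

report = consume(process(store(ingest(identify_sources()))))
print(report)
```

Keeping each layer as a separate step makes it possible to swap one layer, for example the storage, without touching the others.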
Managing data for Analysis
 Data management means enabling, controlling, protecting, delivering and
enhancing the value of data and information assets.
 Data Management functions include:
Data Sources
 Applications, programs and tools use data.
 Sources can be external, such as sensors, trackers, web logs, computer
system logs and feeds.
 Sources can be machines, which source data from data-creating programs.
 Data sources can be i) structured, ii) semi-structured, or iii) multi-structured
or unstructured.
Structured data sources
 The source may be on the same computer running a program or a networked
computer.
 Examples of structured data sources are SQL server, MySQL, Oracle DBMS, file
collection directory at a server.
 A data source name is a defined name that a process uses to identify the source.
 Ex: for a name that identifies the stored student-grades data during processing, the
name could be studentname_data_grades.
Then, what could be the name of the data source?
Unstructured data sources
 Distributed over high-speed networks.
 The data need high-velocity processing, as the sources are from a distributed file system.
 The sources are of file types such as .txt and .csv (comma-separated values).
 Data may be stored as key-value pairs, such as hash key-value pairs.
 Data may have internal structure, as in e-mail, Facebook pages, Twitter pages, etc.
 Data sources can be sensors, sensor networks, signals from machines, devices,
controllers of different types in the industry M2M communication and the GPS system.
Data Quality
 Data quality is the measure of how well suited a data set is to serve its specific
purpose.
 High-quality data can be characterized by the five R’s.
Data Integrity
 Data integrity refers to the fact that data must be reliable and accurate over its
entire lifecycle.
WHY IS DATA INTEGRITY IMPORTANT?
 You need constant access to quality data. Data integrity is
important as it guarantees and secures the searchability and traceability of your
data to its original source.
 Organizations collect more and more data and it has become a priority to secure
and maintain the integrity of this data. Without integrity and accuracy, your
data is worthless.
Examples of data quality problems
 Noise
 Outliers
 Missing values
 Duplicate data
Noisy data
 For objects, noise is considered an extraneous object.
 For attributes, noise refers to modification of original values.
Here, noise refers to measurement error in data values.
It could be random error or systematic error.
Outliers
 Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set.
 Could indicate “interesting” cases, or could indicate errors in the data
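One common way to flag candidate outliers is the 1.5 × IQR rule; it is not named in the slides and is used here only as an illustration. The grade values below are made up, with 7 standing in for a mistyped entry.

```python
import statistics

# Hypothetical grades; 7 is probably a typing error for 70 or 77.
grades = [72, 68, 75, 70, 71, 69, 74, 7]

# Quartiles of the data (statistics.quantiles returns Q1, Q2, Q3 for n=4).
q1, _median, q3 = statistics.quantiles(grades, n=4)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [g for g in grades if g < lower_fence or g > upper_fence]
print(outliers)
```

A flagged value is only a candidate: as the slide notes, it may be an error to correct or an interesting case to keep.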
Missing values
 Reasons for missing values
 Information is not collected (e.g., people decline to give their age)
 Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
 Ways to handle missing values
 Eliminate entities with missing values
 Estimate attributes with missing values
 Ignore the missing values during analysis
 Replace with all possible values (weighted by their probabilities)
 Impute missing values
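Two of the options above, eliminating entities with missing values and imputing them, can be sketched in a few lines. The records are hypothetical, None marks a missing age, and mean imputation is just one simple choice of estimate.

```python
# Hypothetical records; None marks an age that was not collected.
records = [
    {"name": "A", "age": 35},
    {"name": "B", "age": None},
    {"name": "C", "age": 29},
    {"name": "D", "age": 26},
]

# Option 1: eliminate entities with missing values.
complete = [r for r in records if r["age"] is not None]

# Option 2: impute the missing value with the mean of the known ages.
known_ages = [r["age"] for r in complete]
mean_age = sum(known_ages) / len(known_ages)
imputed = [
    dict(r, age=r["age"] if r["age"] is not None else mean_age)
    for r in records
]

print(len(complete), mean_age, imputed[1])
```

Elimination shrinks the dataset, while imputation keeps every entity at the cost of introducing an estimated value.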
Duplicate data
 Data set may include data entities that are duplicates, or almost duplicates
of one another
 Major issue when merging data from heterogeneous sources
 Example: same person with multiple email addresses.
 Data cleaning
 Finding and dealing with duplicate entities
 Finding and correcting measurement error
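The "same person with multiple email addresses" case can be sketched as below. It assumes, as a deliberate simplification, that two records refer to the same person when their names match after lower-casing and collapsing spaces; the names and addresses are made up.

```python
def normalize(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

# Hypothetical records merged from two sources.
records = [
    ("Prashant Rao", "prao@example.com"),
    ("prashant  rao", "prashant.rao@example.org"),
    ("Seema R.", "seema@example.com"),
]

# Group all email addresses under one normalized name per person.
merged = {}
for name, email in records:
    merged.setdefault(normalize(name), set()).add(email)

print(merged)
```

Real deduplication usually needs more evidence than a name match, since different people can share a name.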
Data Preprocessing
 It is an important step at the ingestion layer.
 It is a must before data mining, analytics or before running machine
learning algorithms.
 Pre-processing needs are:
Data cleaning
 Process of removing or correcting incomplete , incorrect,
inaccurate or irrelevant parts of the data after detecting
them.
 Example:
 Correcting the grade outliers or mistakenly entered values
Important terminologies in data cleaning
• Data cleaning tools
• Data enrichment
• Data editing
• Data reduction
• Data wrangling
• Data formats used during pre-processing
Data cleaning tools
 Data cleaning is done before data mining.
 Data cleaning tools help in refining and structuring data into usable
data.
 Example:
 OpenRefine
 DataCleaner
Data Enrichment:
Refers to operations or processes which refine, enhance or improve raw data.
Data Editing:
Process of reviewing and adjusting the acquired datasets.
Controls data quality.
Editing methods are:
• Interactive
• Selective
• Automatic
• Aggregating
• Distribution
Data Reduction:
Basic concept: reduce a multitudinous amount of data and use the meaningful parts.
Enables the transformation of acquired information into an ordered, correct and simplified form.
Enables ingestion of meaningful data in the datasets.
Uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.
Data wrangling:
Process of transforming and mapping the data.
Results from analysis are then appropriate and valuable.
Example: Mapping converts data into another format, which makes it valuable for analytics and data visualizations.
Data formats used during Pre-processing
CSV format
Refers to a plain text file which stores tabular data of numbers and text.
Each line of a CSV file is a data record.
Each record consists of one or more fields, separated by commas.
CSV files are most commonly encountered in spreadsheets and databases.
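The record and field structure described above can be seen by reading a small CSV text with Python's standard csv module. The column names and rows are hypothetical; the third record shows that a field containing a comma must be quoted.

```python
import csv
import io

# Hypothetical CSV text: a header line followed by one record per line.
csv_text = (
    "usn,name,grade\n"
    "001,Asha,A\n"
    "002,Ravi,B\n"
    '003,"Rao, Prashant",A\n'
)

# DictReader uses the first line as field names and yields one dict per record.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(rows[2]["name"])
```

Using a CSV parser instead of splitting each line on commas keeps quoted fields intact.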
Example :CSV
Activity:
Find out the differences between CSV and Excel file formats.
Data format conversions
 Preprocessing is needed for data-format conversions.
 Many applications, services and tools accept only a specific
data format.
 Preprocessing is a must before the data is used or stored on cloud services.
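As an illustration of a data-format conversion, the sketch below turns CSV text into JSON for a tool that accepts only JSON. The column names and the decision to convert age to an integer are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical CSV input, reusing the personal-data fields from earlier slides.
csv_text = "name,gender,age\nPrashant Rao,Male,35\nSeema R.,Female,41\n"

# Read each CSV record as a dict, then fix the type of the numeric field,
# because CSV stores every value as text.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["age"] = int(row["age"])

# Serialize the records as a JSON array.
json_text = json.dumps(rows, indent=2)
print(json_text)
```

The type conversion step matters: without it, the JSON output would carry ages as strings.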
Data store export to cloud
From the diagram:
 The diagram shows data pre-processing, data mining, analysis, visualization
and the data store.
 The data are exported to cloud services.
 The results are integrated at the enterprise server or data warehouse.
Cloud services
The services can be accessed through a cloud client, such as a web browser, SQL or other client.
Data store export from machines, files, computers, web servers and web services
Export of data to AWS and Rackspace Clouds: Example
Example 2: BigQuery cloud service at google cloud platform
Data Storage and Management: Traditional Systems
 Data Store with structured or semi-structured data.
 SQL
 RDBMS uses SQL.
 It is a language for viewing or changing databases.
 SQL does the following
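As a small illustration of viewing and changing a database with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name student_grades and its rows are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for an RDBMS server.
conn = sqlite3.connect(":memory:")

# Define a table and insert a few rows.
conn.execute("CREATE TABLE student_grades (name TEXT PRIMARY KEY, grade INTEGER)")
conn.executemany(
    "INSERT INTO student_grades VALUES (?, ?)",
    [("Asha", 82), ("Ravi", 67), ("Meera", 91)],
)

# Change existing data, then remove rows that match a condition.
conn.execute("UPDATE student_grades SET grade = 70 WHERE name = ?", ("Ravi",))
conn.execute("DELETE FROM student_grades WHERE grade < ?", (75,))

# View the data that remains, highest grade first.
remaining = conn.execute(
    "SELECT name, grade FROM student_grades ORDER BY grade DESC"
).fetchall()
conn.close()
print(remaining)
```

The same statements (CREATE, INSERT, UPDATE, DELETE, SELECT) work with little or no change on other relational databases.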
Data Storage and Analysis
DDBMS, enterprise data-store server and data warehouse
Distributed database management system(DDBMS)
 Collection of logically interrelated databases at multiple systems over a computer
network.
 Features of DDBMS are:
In-memory column format data
 Allows faster data retrieval when only a few columns in a table need to be selected for
querying
 Data in a column are kept together in memory in columnar format
 A single memory access therefore loads many values of the column
 Used in OLAP
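The difference between row and column layouts can be sketched with plain Python containers. This only illustrates the access pattern, not a real in-memory database: the table and its values are hypothetical, and array("d", ...) is used because it stores a column's numbers contiguously.

```python
from array import array

# Row format: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "north", "sales": 100.0},
    {"id": 2, "region": "south", "sales": 250.0},
    {"id": 3, "region": "north", "sales": 150.0},
]

# Column format: all values of one column are kept together.
columns = {
    "id": array("q", (r["id"] for r in rows)),
    "region": [r["region"] for r in rows],
    "sales": array("d", (r["sales"] for r in rows)),
}

# An analytical query such as "total sales" needs only one column.
# In row format every record must be visited to pick out one field;
# in column format the query reads a single contiguous sequence.
total_from_rows = sum(r["sales"] for r in rows)
total_from_columns = sum(columns["sales"])
print(total_from_rows, total_from_columns)
```

Both layouts give the same answer; the column layout simply touches less unrelated data for queries over a few columns, which is the OLAP case.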
Use of In-memory column formats in OLAP
OLAP (Online Analytical Processing) in real-time transaction processing is fast when using in-memory column format tables.
Enables real-time analytics.
The CPU accesses many values of a column in a single memory access when data is stored in memory in columnar format.
In-memory row format data
 A row format in-memory allows much faster data processing during OLTP (Online Transaction Processing)
Enterprise data-store server and data warehouse
Enterprise data, after the data cleaning process, integrate with the server data at the warehouse.
The enterprise data server uses data from several distributed sources, which store data using various technologies.
All data merge using an integration tool.
Enterprise data integration and management with big data
Big Data storage
Big Data NoSQL or Not Only SQL
 NoSQL DBs store semi-structured data
 Big Data stores use NoSQL
 NoSQL stands for No SQL or Not Only SQL.
 They do not integrate with applications using SQL
 NoSQL is also used in cloud data stores
Features of NoSQL
Terminologies
Consistency: All copies have the same value, as in traditional DBs.
Availability: At least one copy is available in case a partition becomes inactive or fails.
Partition: Parts which are active but may not cooperate, as in distributed DBs.
Coexistence of Big Data, NoSQL and traditional data stores
Various data sources and
examples of usages and tools
BIG DATA PLATFORM
Supports large datasets and volumes of data.
The data are generated at a higher velocity, in more varieties or in higher veracity.
Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
Bigdata platform should provide tools and services for
Hadoop
 A Big Data platform consists of Big Data storage, servers, and data
management and BI software
 Storage can deploy HDFS or NoSQL data stores, such as HBase,
MongoDB and Cassandra.
 HDFS is an open-source storage system
 It is a scaling, self-managing and self-healing file system
Hadoop
A scalable and reliable parallel computing platform
Manages Big Data distributed databases
Hadoop based Big data environment
Mesos
 Mesos v0.9 is a resource management platform which enables sharing of
cluster nodes by multiple frameworks and which has compatibility with an
open analytics stack
Big Data Stack
 A stack consists of a set of software components and data store units.
 Applications, ML algorithms, Analytics and visualization tools use Big
Data Stack(BDS) at a cloud service, such as Amazon EC2, Azure or
private cloud
 Uses cluster of High Performance machines
Tools for Big Data environment
Big Data Analytics
Data analysis is a process of inspecting, cleaning, transforming, and modelling data with the
goal of discovering useful information, informing conclusions, and supporting decision-making.
Phases in Analytics
Traditional and Big Data analytics architecture reference model
Berkeley Data Analytics Stack (BDAS)
Layers: Infrastructure, Storage, Data Processing, Application
Resource Management
• Share infrastructure across frameworks (multi-programming for datacenters)
Data Management
• Efficient data sharing across frameworks
Data Processing
• in-memory processing
• trade between time, quality, and cost
Application
• New apps: AMP-Genomics, Carat, …
Why BDAS..!!?
 Easy to combine batch, streaming, and interactive computations
 Single execution model that supports all computation models
 Easy to develop sophisticated algorithms
 High level abstractions for graph based, and ML algorithms
 Compatible with existing open source ecosystem (Hadoop/HDFS)
 Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ..)
 Support existing execution models (e.g., Hive, GraphLab)
Big Data in Marketing and Sales
Big Data Analytics in Detection of Marketing Fraud
 Fraud means someone deliberately deceiving others
 Ex: mortgaging the same assets to multiple financial institutions, compromising
customer data and transferring customer information to a third party, marketing products
of compromised quality, etc.
 Banks and financial services firms use analytics to differentiate fraudulent
interactions from legitimate business transactions.
 The analytics systems suggest immediate actions, such as blocking irregular
transactions, which stops fraud before it occurs and improves profitability.
Big Data and Healthcare
Healthcare analytics using big data can facilitate the following
 Provision of value-based and customer centric healthcare.
 Utilizing the ‘Internet of Things’ for health care.
 Preventing fraud, waste and abuse in the healthcare industry, and reducing
healthcare costs.
 Improving outcomes.
 Monitoring patients in real time.
Findings of Big Data in Medicine
 Big data analytics uses large volumes of data to derive intelligent predictive models about individuals.
 Big data creates patterns and models through data mining, which helps in better understanding and research.
 Data from wearable devices can be used; these devices record data during active as well as inactive periods.
Key factors to take into account when using big data to improve the results of a digital marketing campaign
 Data visualization tools
 Use of historical data
 Target consumers
 Crowdsourcing
 Web mining
The real power of big data is the ability to forecast clients’ needs and hence offer them genuine value.
End of Module 1
BDA_Module1.pptx

  • 2. 2 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 3. 3 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 4. 4 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 5. 5 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 6. Course Outcomes:  CO1: Describe the fundamentals of Big Data analytics.  CO2: Extrapolate Hadoop framework and Hadoop Distributed File system.  CO3: Explore the concepts of NoSQL using MongoDB and Cassandra for Big Data.  CO4: Demonstrate the MapReduce programming model for the big data processing along with Hadoop tools.  CO5: Explore Machine Learning algorithms for real world big data along with analytics of web contents and Social Networks with relevant visualization tools. 6 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 7. xcd5mzi 7 SHRINIVASA, Dept. of CSE, SMVITM Bantakal xcd5mzi xcd5mzi Display
  • 8. In today’s discussion…  Introduction to data  Types of data  Current trend  Data and Big data  Big data vs. small data  Tools and techniques 8 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 9. Introduction to data  Example: 10, 25, …, Bantakal, 18CS72, sode-edu.in Anything else?  Data vs. Information 100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0 Is there any information? 9 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 10. Types of Data  Structured  Unstructured  Semi-structured 10 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 11. Structured data  Any data that can be stored, accessed and processed in the form of fixed format is termed as a 'structured' data. 11 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 12. Unstructured  Any data with unknown form or the structure is classified as unstructured data.  Example: 12 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 13. Semi-structured  Semi-structured data can contain both the forms of data.  Examples Of Semi-structured Data  Personal data stored in an XML file- <rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec> <rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec> <rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec> <rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec> 13 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 14. 14 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 15. How large your data is?  What is the maximum file size you have dealt so far?  Movies/files/streaming video that you have used?  What is the maximum download speed you get?  To retrieve data stored in distant locations?  How fast your computation is?  How much time to just transfer from you, process and get result? 15 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 16. Growth of data 16 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 17. Sources of data  “Every day, we create 2.5 quintillion bytes of data  So much that 90% of the data in the world today has been created in the last two years alone. This explosion of information is known as “Big Data,”  The data come from several sources : etc. …… to name a few! 17 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 18. Social Media: SHRINIVASA, Dept. of CSE, SMVITM Bantakal 18
  • 19. Now data is Big data!  No single standard definition!  ‘Big-data’ is similar to ‘Small-data’, but bigger …but having data bigger consequently requires different approaches  techniques, tools and architectures …to solve: new problems …and, of course, in a better way Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 19 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 20. Characteristics of Big data: 5V’s  5 V's of Big Data:  Volume  Velocity  Variety  Veracity  Value SHRINIVASA, Dept. of CSE, SMVITM Bantakal 20
  • 21. Volume(The scale):  Volume in Big data represents the amount of data. In today’s world data is being processed in various formats like word, excel, pdf format, and sometimes in audio and video. These data can be structured , unstructured or semi-structured format. The recent social media platforms produce a tremendous amount of data which is difficult to handle by the organization. To handle this huge amount of data organizations should implement modern business intelligence tools which will capture this data in an effective form and which will be cost- efficient for the organization. SHRINIVASA, Dept. of CSE, SMVITM Bantakal 21
  • 22. Velocity(The speed):  Velocity refers to the rate/speed at which data is getting generated.  This is primarily due to the Internet of Things (IoT), mobile data, social media, and other factors. At least 2 trillion searches each year, 3.8 million searches per minute, 228 million searches per hour, and 5.6 billion searches per day are now being conducted. SHRINIVASA, Dept. of CSE, SMVITM Bantakal 22
  • 23. Variety(Data type):  Big Data can be structured, unstructured, and semi-structured that are being collected from different sources. Data will only be collected from databases and sheets in the past, But these days the data will comes in array forms, that are PDFs, Emails, audios, SM posts, photos, videos, etc. SHRINIVASA, Dept. of CSE, SMVITM Bantakal 23 Quasi-structured data:The data format contains textual data with inconsistent data formats that are formatted with effort and time with some tools.
  • 24. Veracity:  Degree of trustworthiness of data is the veracity of data.  Veracity means how much the data is reliable. It has many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. Big Data is also essential in business development.  For example, Facebook posts with hashtags. SHRINIVASA, Dept. of CSE, SMVITM Bantakal 24
  • 25. Value:  Value is an essential characteristic of big data. It is not the data that we process or store. It is valuable and reliable data that we store, process, and also analyze.  SHRINIVASA, Dept. of CSE, SMVITM Bantakal 25
  • 26. 26 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 27. Big data vs. small data  Big data is more real-time in nature than traditional applications  Big data architecture  Traditional architectures are not well-suited for big data applications (e.g. Exa-data, Tera-data)  Massively parallel processing, scale out architectures are well-suited for big data applications 27 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 28. Challenges ahead…  The Bottleneck is in technology  New architecture, algorithms, techniques are needed  Also in technical skills  Experts in using the new technology and dealing with Big data Who are the major players in the world of Big data? 28 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 29. Big data players 29 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 30. Major players…  Google  Hadoop  MapReduce  Mahout  Apache Hbase  Cassandra 30 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 31. Tools available  NoSQL  Databases MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper  MapReduce  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum  Storage  S3, HDFS, GDFS  Servers  EC2, Google App Engine, Elastic, Beanstalk, Heroku  Processing  R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop 31 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 32. 32 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 33. Predictions & Recommendations 33 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 34. Big data knows everything..!! 34 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 35. Job Opportunities in Big Data  Data Analysts  analyze and interpret data, visualize it, and build reports to help make better business decisions.  Data Scientists  mine data by assessing data sources and use algorithms and Machine Learning techniques.  Data Architects  design database systems and tools.  Database Managers  control database system performance, perform troubleshooting, and upgrade hardware and software.  Big Data Engineers  design, maintain, and support Big Data solutions. 35 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 36. END OF INTRODUCTION 36 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 37. Questions of the day… 1. What is the smallest and largest units of measuring size of data? 2. How big a Quintillion measure is? 3. Give the examples of a smallest the largest entities of data. 4. Give FIVE parameters with which data can be categorized as i) simple, ii) Moderately complex and iii) complex? 37 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 38. Questions of the day… 5. What type of data are involved in the following applications? 1. Weather forecasting 2. Mobile usage of all customers of a service provider 3. Anomaly (e.g. fraud) detection in a bank organization 4. Person categorization, that is, identifying a human 5. Air traffic control in an airport 38 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 39. Big data types (by IBM)  Social Networks and web data  Transactions data and Business Process data  Customer master data  Machine generated data  Human generated data 39 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 40. Big Data Classification : based on characteristics for designing data architecture for processing and analytics 40 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 41. 41 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 42. Scalability and Parallel Processing  Scalability is the capability to handle growing amounts of data and growing number of database clients either by adding more hardware resources or by optimization and more efficient usage of the existing resources.  Scalability enables increase or decrease in the capacity of data storage, processing and analytics.  In short, you need to build scalability into the hardware architecture and database selection, and can (for the most part) maximize performance later — during the database design and deployment phase.  System capability needs increment with the increased workloads. When the workload and complexity exceed the system capacity, scale it up and scale it down. 42 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 43. Scalability Options  Assuming you need to scale your system, there are two options:  scaling up  scaling out 43 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 44. Scale UP  Resources such as CPU, network, and storage are common targets for scaling up.  The goal is to increase the resources supporting your application to reach or maintain adequate performance.  In a hardware-centric world, this might mean adding a larger hard drive to a computer for increased storage capacity.  It might mean replacing the entire computer with a machine that has more CPU and a more performant network interface. 44 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 45. Scale OUT  The scale-out option implies a distributed system whereby additional machines are added to a cluster to provide additional capacity. It's often more likely to yield a linear increase in scalability, although not necessarily increased performance. 45 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 46. Analytics Scalability to Big Data  Vertical scalability means scaling up the given system resources and increasing the system’s analytics, reporting and visualization capabilities.  Ex: designing the algorithm according to the architecture that uses resources efficiently.  Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload.  Ex: using more resources and distributing the storage and processing task in parallel.  Note: Alternative ways for scaling up and out processing of analytics software and big data analytics deploy the Massively Parallel Processing Platforms(MPPs), cloud, grid, clusters and distributed computing software. 46 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 47. Massively Parallel Processing Platforms(MPPs)  Massively parallel processing (MPP) is a collaborative processing of the same program using two or more processors.  By using different processors, speed can be dramatically increased.  For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, they may experience a delay in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000 computational load. 47 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 48. 48 Parallelization of tasks can be done at several levels: • Distributing separate tasks on to separate threads on same CPU. • Distributing separate tasks onto separate CPUs on the same computer. • Distributing separate tasks onto separate computer. There are several types of MPP database architectures • Distributed Computing Model • Cloud Computing • Grid and Cluster Computing • Volunteer Computing SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 49. Distributed Computing Model  It uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high speed network. 49 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 50. Cloud computing  Type of internet-based computing that provides shared processing resources and data to the computers and other devices on demand.  One of the best approach for data processing to perform parallel and distributed computing  Offers high data security compared to other distributed technologies 50 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 51. Cloud resources  Amazon Web Service(AWS)  Elastic Compute Cloud(EC2)  Microsoft Azure or Apache CloudStack  Amazon Simple Storage Service(S3) 51 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 52. Cloud computing features 1. On-demand service 2. Resource pooling 3. Scalability 4. Accountability 5. Broad network Access  Cloud services can be accessed from anywhere and at any time through the internet 52 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 53. Cloud services types 1. Infrastructure as a Service(IaaS):  Providing access to resources, such as hard disks, network connections, databases storage , data centre and virtual service space.  Ex: AWS EC2, Rackspace, Google Compute Engine 2. Platform as a Service(PaaS):  Providing runtime environment to allow developers to build applications and services.  Ex: Windows Azure (mostly used as PaaS), Force.com 3. Software as a Service(SaaS):  Providing software applications as a service to end-users  Ex: BigCommerce, Google Apps, Salesforce, Dropbox 53 SHRINIVASA, Dept. of CSE, SMVITM Bantakal
  • 54. Grid and Cluster computing
       Grid computing:
           Distributed computing in which a group of computers from several locations are connected with each other to achieve a common task.
           Grid: a group of computers that may be spread across remote locations.
           Provides large-scale resource sharing that is flexible, coordinated and secure among its users.
           For example, one research team might analyze weather patterns in the North Atlantic region while another team analyzes the South Atlantic region, and both results can be combined to deliver a complete picture of Atlantic weather patterns.
  • 55. Features of Grid computing
       Similar to cloud computing
       Scalable
       Distributed network for resource integration
  • 56. Drawbacks of Grid computing
       Single point of failure
       Storage capacity varies with the number of users, instances and the amount of data transferred at a given time
  • 57. Cluster computing
       A group of computers connected by a network to accomplish the same task.
       Used mainly for load balancing.
  • 58. Difference between Cluster and Grid Computing
  • 59. Volunteer computing
       Volunteers are organizations or individuals who own personal computers.
       They provide computing resources to important projects, which use those resources for distributed computing and/or storage.
       Volunteer computing: computing that uses the resources of these volunteers.
  • 60. Issues of volunteer computing systems
       Heterogeneity of the volunteered computers
       Drop-outs from the network over time
       Sporadic availability
       Volunteers are anonymous, so they cannot be held accountable for incorrect results
  • 61. Designing Data Architecture
      “Big data architecture is the logical and/or physical layout/structure of how big data will be stored, accessed and managed within a big data or IT environment.”
       The architecture logically defines how a big data solution will work, the core components used (hardware, database, software, storage), the flow of information, security and more.
       A data processing architecture consists of 5 layers:
           (i) identification of data sources
           (ii) acquisition, ingestion, extraction, pre-processing and transformation of data
           (iii) data storage in files, servers, clusters or the cloud
           (iv) data processing
           (v) data consumption
  • 65. Managing data for analysis
       Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
       Data management functions include:
  • 67. Data Sources
       Applications, programs and tools use data.
       Sources can be external, such as sensors, trackers, web logs, computer system logs and feeds.
       Sources can be machines, which source data from data-creating programs.
       Data sources can be (i) structured, (ii) semi-structured, (iii) multi-structured or unstructured.
  • 68. Structured data sources
       The source may be on the same computer that runs the program or on a networked computer.
       Examples of structured data sources are SQL Server, MySQL, Oracle DBMS, and a file collection directory on a server.
       A data source name is a defined name that a process uses to identify the source.
       Ex: a name that identifies the stored student-grade data during processing could be studentname_data_grades.
       Question: then what could be the name of the data source?
  • 69. Unstructured data sources
       Distributed over high-speed networks.
       The data need high-velocity processing, as the sources are on a distributed file system.
       The sources are files of types such as .txt and .csv (comma-separated values).
       Data may be key-value pairs, such as hash key-value pairs.
       Data may have internal structure, as in e-mail, Facebook pages, Twitter pages, etc.
       Data sources can be sensors, sensor networks, signals from machines, devices, controllers of different types in industrial M2M communication, and the GPS system.
  • 70. Data Quality
       Data quality is the measure of how well suited a data set is to serve its specific purpose.
       High-quality data can be described as data with the five R’s.
  • 71. Data Integrity
       Data integrity means that data must be reliable and accurate over its entire lifecycle.
      Why is data integrity important?
       You need constant access to quality data. Data integrity is important because it guarantees and secures the searchability and traceability of your data to its original source.
       Organizations collect more and more data, and securing and maintaining the integrity of this data has become a priority. Without integrity and accuracy, your data is worthless.
  • 72. Examples of data quality problems
       Noise
       Outliers
       Missing values
       Duplicate data
  • 73. Noisy data
       For objects, noise is an extraneous object.
       For attributes, noise is a modification of the original values.
       Here, noise refers to measurement error in data values. It can be a random error or a systematic error.
  • 74. Outliers
       Outliers are data objects whose characteristics are considerably different from those of most other data objects in the data set.
       They could indicate “interesting” cases, or they could indicate errors in the data.
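One common way to flag candidate outliers is the interquartile-range (IQR) rule. The sketch below is illustrative only: the rule itself, the multiplier `k=1.5`, the function name `iqr_outliers` and the sample grades are assumptions, not something the slides prescribe.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical grade data; 410 looks like a mistyped entry.
grades = [62, 65, 68, 70, 71, 73, 75, 78, 80, 410]
outliers = iqr_outliers(grades)
print(outliers)
```

Whether a flagged value is an error or an "interesting" case still has to be decided by someone who knows the data.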
  • 75. Missing values
       Reasons for missing values
           Information is not collected (e.g., people decline to give their age)
           Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
       Ways to handle missing values
           Eliminate entities with missing values
           Estimate attributes with missing values
           Ignore the missing values during analysis
           Replace with all possible values (weighted by their probabilities)
           Impute missing values
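Two of the strategies above, eliminating entities and imputing values, can be sketched in a few lines. The records, the field name `age` and the choice of mean imputation are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 35},
    {"name": "B", "age": None},
    {"name": "C", "age": 29},
]

def drop_missing(rows, field):
    """Eliminate entities whose `field` is missing."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Replace missing `field` values with the mean of the known ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

dropped = drop_missing(records, "age")
imputed = impute_mean(records, "age")
print(len(dropped), imputed[1]["age"])
```

Dropping rows loses information, while imputing keeps every entity at the cost of introducing estimated values.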
  • 76. Duplicate data
       A data set may include data entities that are duplicates, or almost duplicates, of one another.
       A major issue when merging data from heterogeneous sources.
       Example: the same person with multiple email addresses.
       Data cleaning
           Finding and dealing with duplicate entities
           Finding and correcting measurement error
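The "same person with multiple email addresses" case can be sketched as follows. Matching on a normalized name is a simplifying assumption (different people can share a name), and the helper names and sample records are hypothetical.

```python
def normalize(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Merge rows that share a normalized name, collecting all emails."""
    merged = {}
    for row in rows:
        key = normalize(row["name"])
        entry = merged.setdefault(
            key, {"name": row["name"].strip(), "emails": set()}
        )
        entry["emails"].add(row["email"].lower())
    return list(merged.values())

people = [
    {"name": "Seema R.", "email": "seema@example.com"},
    {"name": "seema  r.", "email": "seema.r@example.org"},
    {"name": "Satish Mane", "email": "satish@example.com"},
]
unique_people = merge_duplicates(people)
print(len(unique_people))
```

In practice a more reliable key, or fuzzy matching over several attributes, is usually needed when merging heterogeneous sources.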
  • 77. Data Preprocessing
       An important step at the ingestion layer.
       A must before data mining, analytics or running machine learning algorithms.
       Pre-processing needs are:
  • 78. Data cleaning
       The process of detecting and then removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data.
       Example: correcting grade outliers or mistakenly entered values.
  • 79. Important terminologies in data cleaning
       Data cleaning tools
       Data enrichment
       Data editing
       Data reduction
       Data wrangling
       Data formats used during pre-processing
  • 80. Data cleaning tools
       Data cleaning is done before data mining.
       Data cleaning tools help in refining and structuring data into usable data.
       Examples: OpenRefine, DataCleaner
  • 81. Data enrichment and data editing
       Data enrichment: operations or processes that refine, enhance or improve raw data.
       Data editing: the process of reviewing and adjusting the acquired datasets. It controls data quality.
       Editing methods are:
           Interactive
           Selective
           Automatic
           Aggregating
           Distribution
  • 82. Data reduction
       Enables the transformation of acquired information into an ordered, correct and simplified form.
       Enables ingestion of meaningful data into the datasets.
       Basic concept: reduce a multitudinous amount of data and use the meaningful parts.
       Uses editing, scaling, coding, sorting, collating, smoothing, interpolating and preparing tabular summaries.
  • 83. Data wrangling
       The process of transforming and mapping the data, so that the results from analysis are appropriate and valuable.
       Example: mapping converts data into another format, which makes it valuable for analytics and data visualization.
  • 84. Data formats used during pre-processing
  • 85. CSV format
       A plain text file that stores tabular data of numbers and text.
       Each line of a CSV file is a data record.
       Each record consists of one or more fields, separated by commas.
       CSV files are most commonly encountered in spreadsheets and databases.
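A minimal sketch of reading CSV records with Python's standard `csv` module. The column names and the two records are made up for illustration (they reuse the names from the semi-structured XML example earlier in the deck), and the text is read from an in-memory string rather than a file.

```python
import csv
import io

# One header line, then one record per line; fields are comma-separated.
csv_text = "name,gender,age\nPrashant Rao,Male,35\nSeema R.,Female,41\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)                    # each row maps column name -> string
ages = [int(r["age"]) for r in rows]   # CSV carries no types; convert explicitly

print(rows[0]["name"], ages)
```

Every field arrives as text, which is one reason format conversion and type checking belong in pre-processing.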
  • 86. Example: CSV
  • 87. Activity: Find out the differences between the CSV and Excel file formats.
  • 88. Data format conversions
       Data-format conversions need pre-processing.
       Many applications, services and tools accept data only in a specific format.
       Pre-processing is a must before the data are used or stored on cloud services.
  • 89. Data store export to cloud
  • 90. From the diagram
       The diagram shows data pre-processing, data mining, analysis, visualization and the data store.
       The data are exported to cloud services.
       The results are integrated at the enterprise server or data warehouse.
  • 91. Cloud services
       The services can be accessed through a cloud client, such as a web browser, SQL or another client.
  • 92. Data store export from machines, files, computers, web servers and web services
  • 93. Export of data to AWS and Rackspace clouds: Example
  • 95. Example 2: BigQuery cloud service on Google Cloud Platform
  • 96. Data Storage and Management: Traditional Systems
       Data store with structured or semi-structured data.
       SQL
           An RDBMS uses SQL.
           It is a language for viewing or changing databases.
           SQL does the following:
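To make "viewing or changing databases" concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `grades`, its columns and the sample rows are assumptions chosen for illustration; the statements cover defining a table, inserting rows, updating a row and querying.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk

# Define structure, insert data, change data, then view it.
conn.execute("CREATE TABLE grades (student TEXT PRIMARY KEY, grade INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?)", [("asha", 82), ("ravi", 74)]
)
conn.execute("UPDATE grades SET grade = ? WHERE student = ?", (78, "ravi"))
result = conn.execute(
    "SELECT student, grade FROM grades ORDER BY student"
).fetchall()
conn.close()

print(result)
```

The `?` placeholders let the driver pass values separately from the SQL text, which avoids building queries by string concatenation.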
  • 97. Data Storage and Analysis
  • 98. DDBMS, enterprise data-store server and data warehouse
  • 99. Distributed database management system (DDBMS)
       A collection of logically interrelated databases at multiple systems over a computer network.
       Features of a DDBMS are:
  • 100. In-memory column-format data
       Allows faster data retrieval when only a few columns of a table need to be selected for a query.
       Data in a column are kept together in memory in columnar format.
       A single memory access therefore loads many values of the column.
       Used in OLAP.
  • 101. Use of in-memory column formats in OLAP
       OLAP (Online Analytical Processing) in real time is fast when using in-memory column-format tables.
       Enables real-time analytics.
       With columnar in-memory storage, the CPU fetches many values of a column in a single memory access.
  • 102. In-memory row-format data
       An in-memory row format allows much faster data processing during OLTP (Online Transaction Processing).
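The difference between the two layouts can be sketched with plain Python containers. This is only a model of the idea: real in-memory stores work on contiguous memory, not Python lists, and the table contents and variable names are made up.

```python
# Row format: each record is kept together (suits OLTP, which reads or
# writes whole records).
rows = [
    {"id": 1, "region": "north", "sales": 100.0},
    {"id": 2, "region": "south", "sales": 250.0},
    {"id": 3, "region": "north", "sales": 150.0},
]

# Column format: all values of one column are kept together (suits OLAP,
# which scans a few columns across many records).
columns = {key: [r[key] for r in rows] for key in rows[0]}

total_row_format = sum(r["sales"] for r in rows)  # visits every record
total_col_format = sum(columns["sales"])          # visits one column only

print(total_row_format, total_col_format)
```

Both layouts give the same answer; the columnar one simply avoids touching the `id` and `region` values when only `sales` is needed.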
  • 104. Enterprise data-store server and data warehouse
       Enterprise data, after the data cleaning process, are integrated with the server data at the warehouse.
       The enterprise data server uses data from several distributed sources, which store data using various technologies.
       All data are merged using an integration tool.
  • 105. Enterprise data integration and management with big data
  • 106. Big Data storage
  • 107. Big Data NoSQL or Not Only SQL
       NoSQL stands for No SQL or Not Only SQL.
       NoSQL databases are semi-structured.
       Big Data stores use NoSQL.
       NoSQL stores do not integrate with applications that use SQL.
       NoSQL is also used in cloud data stores.
  • 108. Features of NoSQL
  • 110. Terminologies
       Consistency: all copies have the same value, as in traditional DBs.
       Availability: at least one copy is available in case a partition becomes inactive or fails.
       Partition: parts that are active but may not cooperate, as in distributed DBs.
  • 111. Coexistence of Big Data, NoSQL and traditional data stores
  • 112. Various data sources and examples of usages and tools
  • 114. Big Data platform
       Supports large datasets and large volumes of data.
       The data are generated at higher velocity, in more varieties or with higher veracity.
       Managing Big Data requires large resources: MPPs (massively parallel processing systems), cloud, parallel processing and specialized tools.
  • 115. A Big Data platform should provide tools and services for:
  • 116. Hadoop
       A Big Data platform consists of Big Data storage, servers, and data management and BI software.
       Storage can deploy HDFS or NoSQL data stores such as HBase, MongoDB and Cassandra.
       HDFS is an open-source storage system.
       It is a scaling, self-managing and self-healing file system.
  • 117. Hadoop
       A scalable and reliable parallel computing platform.
       Manages Big Data distributed databases.
  • 118. Hadoop-based Big Data environment
  • 119. Mesos
       Mesos v0.9 is a resource management platform that enables multiple frameworks to share cluster nodes and is compatible with an open analytics stack.
  • 120. Big Data Stack
       A stack consists of a set of software components and data store units.
       Applications, ML algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2 or Azure, or on a private cloud.
       Uses a cluster of high-performance machines.
  • 121. Tools for Big Data environment
  • 122. Big Data Analytics
       Data analysis is a process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making.
  • 123. Phases in Analytics
  • 124. Traditional and Big Data analytics architecture reference model
  • 125. Berkeley Data Analytics Stack (BDAS)
       Layers: Infrastructure, Storage, Data Processing, Application
       Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
       Data Management: efficient data sharing across frameworks
       Data Processing: in-memory processing; trade between time, quality, and cost
       Application: new apps such as AMP-Genomics, Carat, …
  • 126. Why BDAS?
       Easy to combine batch, streaming, and interactive computations
       Single execution model that supports all computation models
       Easy to develop sophisticated algorithms
       High-level abstractions for graph-based and ML algorithms
       Compatible with the existing open-source ecosystem (Hadoop/HDFS)
       Interoperates with existing storage and input formats (e.g., HDFS, Hive, Flume, ...)
       Supports existing execution models (e.g., Hive, GraphLab)
  • 127. Big Data in Marketing and Sales
  • 129. Big Data Analytics in Detection of Marketing Fraud
       Fraud means someone deliberately deceiving others.
       Ex: mortgaging the same assets to multiple financial institutions, compromising customer data and transferring customer information to a third party, marketing products of compromised quality, etc.
       Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions.
       The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability.
  • 130. Big Data and Healthcare
  • 133. Healthcare analytics using big data can facilitate the following
       Provision of value-based and customer-centric healthcare.
       Utilizing the ‘Internet of Things’ for healthcare.
       Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs.
       Improving outcomes.
       Monitoring patients in real time.
  • 135. Findings of Big Data in Medicine
       Big data analytics deploys large volumes of data to identify and derive intelligence, using predictive models about individuals.
       Big data creates patterns and models through data mining and helps in better understanding and research.
       Deploying data from wearable devices, which record data during active as well as inactive periods.
  • 138. Key points to take into account when using big data to improve the results of a digital marketing campaign
       Data visualization tools
       Use of historical data
       Target consumers
       Crowdsourcing
       Web mining
      The real power of big data is the ability to forecast clients’ needs and hence offer veracious value.
  • 139. End of Module 1

Editor's Notes

  1. … in two fundamental aspects… At the application layer we build new, real applications such as AMP-Genomics, a genomics pipeline, and Carat, an application I’m going to talk about soon. Building real applications allows us to drive the features and design of the lower layers.