3. Data Analytics (MCA19304)
COURSE OUTCOMES
AT THE END OF THE COURSE THE STUDENT SHOULD BE ABLE TO:
1. DEVELOP AND MAINTAIN RELIABLE, SCALABLE SYSTEMS USING APACHE HADOOP
2. WRITE MAPREDUCE-BASED APPLICATIONS
3. DIFFERENTIATE BETWEEN CONVENTIONAL SQL AND NOSQL
4. ANALYZE AND DEVELOP BIG DATA SOLUTIONS USING HIVE AND PIG
UNIT I
• DISTRIBUTED FILE SYSTEM AND ITS ISSUES
• INTRODUCTION TO BIG DATA
• BIG DATA CHARACTERISTICS
• TYPES OF BIG DATA
• TRADITIONAL VS. BIG DATA APPROACH
• BIG DATA APPLICATIONS
6. Distributed file system and its issues
• A single machine with 4 hard disks (4 I/O channels) holds 1 TB of data; each channel reads at 100 MB/s. Processing the data takes about 45 minutes.
• For faster processing:
• Divide the data and store it on 5 machines with the same configuration as above.
– If all machines process their share of the data in parallel, processing takes 45/5 = 9 minutes.
• Processing is 5 times faster than on a single machine.
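The arithmetic above can be sketched as a quick back-of-the-envelope calculation (a minimal sketch; the 1 TB size, 4 channels and 100 MB/s speed are taken from the example):

```python
# Back-of-the-envelope: why splitting 1 TB across 5 machines speeds up processing.
DATA_TB = 1
CHANNELS_PER_MACHINE = 4
CHANNEL_MBPS = 100  # MB/s per I/O channel, as in the example

data_mb = DATA_TB * 1024 * 1024                       # 1 TB in MB
machine_mbps = CHANNELS_PER_MACHINE * CHANNEL_MBPS    # 400 MB/s per machine

single_minutes = data_mb / machine_mbps / 60          # one machine reads everything
five_minutes = single_minutes / 5                     # each machine reads 1/5 of the data

print(round(single_minutes, 1))  # ~43.7, quoted as ~45 min in the slide
print(round(five_minutes, 1))    # ~8.7, quoted as ~9 min
```

The exact figure comes out slightly under 45 minutes; the slide rounds it for readability.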
8. Distributed file system and its issues
Each machine has its own local (physical) file system where data is stored, i.e. where you create folders, subfolders and so on.
A distributed file system is not physical; it is a virtual or logical file system.
Hadoop uses a DFS.
DFS libraries are installed on every machine and run as a separate process on each machine.
Together, these processes create a virtual layer over the physical file systems beneath them.
This virtual layer is called the distributed file system.
Distributed File System
9. Distributed file system and its issues
• A virtual file system is software, i.e. a set of programs, in effect a set of commands.
• Example: dfs -copy <source file> <destination file>
• dfs -copy file1 file2
• This reads file1, which is distributed across 5 machines (say A, B, C, D, E); the user has no idea where each part of the file is.
• The path is a virtual path; it does not exist on any one machine.
• Any DFS follows a master-slave architecture.
10. DFS
Master Machine
Slave Machines
In the figure, the upper machine is the master and the lower 5 are slaves.
Data is split and stored on the slave machines.
The master does not store any data; it stores only metadata.
The master knows how each file is divided into blocks (file-to-block mapping) and how the blocks are distributed over the slave machines (block-to-slave mapping).
Data can only be accessed via the master, since only the master knows the actual location of each block on the slaves.
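The master's metadata can be pictured as two plain lookup tables (a hypothetical sketch; the block and machine names are invented for illustration):

```python
# Hypothetical master metadata: file -> blocks, and block -> slave machines.
file_to_blocks = {
    "file1": ["blk_1", "blk_2", "blk_3", "blk_4", "blk_5"],
}
block_to_slaves = {
    "blk_1": ["A"], "blk_2": ["B"], "blk_3": ["C"],
    "blk_4": ["D"], "blk_5": ["E"],
}

def locate(path):
    """Resolve a virtual path to (block, machines) pairs, as the master would."""
    return [(blk, block_to_slaves[blk]) for blk in file_to_blocks[path]]

# The client asks the master for "file1"; it never sees this mapping itself.
print(locate("file1"))
```

A client command like `dfs -copy file1 file2` would go through a lookup like `locate()` first, then read each block from the slave that holds it.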
11. HDFS
• While reading data, if any node fails, the client may get only partial data.
• To overcome this, a replication factor is set when configuring HDFS. If the replication factor = 2, every block is replicated (copied to two places), i.e. 2 copies are maintained for each block.
• If one node fails, the block can be accessed from another node. The data is transmitted to the machine (server) where the program is running.
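A replication factor of 2 can be sketched as follows (hypothetical block and node names; real HDFS placement also considers rack locality, which this round-robin sketch ignores):

```python
from itertools import cycle

# Place 2 replicas of every block on distinct nodes (simple round-robin sketch).
REPLICATION_FACTOR = 2
nodes = ["A", "B", "C", "D", "E"]
blocks = ["blk_1", "blk_2", "blk_3"]

node_cycle = cycle(nodes)
placement = {}
for blk in blocks:
    replicas = []
    while len(replicas) < REPLICATION_FACTOR:
        n = next(node_cycle)
        if n not in replicas:       # never put both copies on the same node
            replicas.append(n)
    placement[blk] = replicas

print(placement)  # {'blk_1': ['A', 'B'], 'blk_2': ['C', 'D'], 'blk_3': ['E', 'A']}

# If node 'A' fails, blk_1 is still readable from its surviving replica.
survivors = [n for n in placement["blk_1"] if n != "A"]
print(survivors)  # ['B']
```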
12. Features of DFS
Transparency :
Structure transparency –
There is no need for the client to know about the number or locations of file servers and the
storage devices.
Access transparency –
Both local and remote files should be accessible in the same manner.
Naming transparency –
Once a name is given to a file, it should not change when the file is moved from one node to another.
13. Features of DFS
• Replication transparency –
If a file is replicated on multiple nodes, the existence of the copies and their locations should be hidden from clients.
User mobility :
It will automatically bring the user’s home directory to the node where the
user logs in.
• Performance :
Performance is measured by the average amount of time needed to satisfy client requests.
• This time covers the CPU time + the time taken to access secondary storage +
the network access time.
14. Features of DFS
Simplicity and ease of use :
The user interface of a file system should be simple and the number of commands in the file should be
small.
High availability :
A Distributed File System should be able to continue operating in case of partial failures such as a link failure, a
node failure, or a storage drive crash.
A highly available and adaptable distributed file system should have multiple independent file servers
controlling multiple independent storage devices.
Scalability :
Since growing the network by adding new machines or joining two networks together is routine, the
distributed system will inevitably grow over time. As a result, a good distributed file system should be
built to scale quickly as the number of nodes and users in the system grows. Service should not be
substantially disrupted as the number of nodes and users grows.
15. Features of DFS
High reliability :
A file system should create backup copies of key files that can be used if the originals are lost.
Many file systems employ stable storage as a high-reliability strategy.
Data integrity :
Multiple users frequently share a file system.
The integrity of data saved in a shared file must be guaranteed by the file system.
That is, concurrent access requests from many users who are competing for access to the
same file must be correctly synchronized using a concurrency control method.
Atomic transactions are a high-level concurrency management mechanism for data
integrity that is frequently offered to users by a file system.
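The concurrency-control idea can be shown with a simple lock that serializes writers (a minimal sketch of the concept, not how any particular distributed file system implements it):

```python
import threading

# Shared "file" plus a lock: concurrent updates from many threads stay consistent.
lock = threading.Lock()
shared_file = {"counter": 0}

def append_record():
    for _ in range(1000):
        with lock:                      # only one writer inside at a time
            shared_file["counter"] += 1

threads = [threading.Thread(target=append_record) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_file["counter"])  # 4000 -- no updates were lost
```

Without the lock, two writers could read the same old value and overwrite each other's update; this lost-update problem is exactly what concurrency control (and, at a higher level, atomic transactions) prevents.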
16. Features of DFS
Heterogeneity :
Users of heterogeneous distributed systems have the option of using multiple computer
platforms for different purposes.
Security :
To safeguard the information contained in the file system from unwanted & unauthorized
access, security mechanisms must be implemented.
A distributed file system should be secure so that its users may trust that their data will be
kept private.
17. Issues with DFS
In a Distributed File System, nodes and connections need to be secured; security is therefore a concern.
There is a possibility of loss of messages and data in the network while they move
from one node to another.
Database connection in a Distributed File System is complicated.
Handling a database is also harder in a Distributed File System than in
a single-user system.
There is a chance of overloading if all nodes try to send data at once.
30. Types of Big Data
• Structured
Structured data includes all data that can be stored in tabular columns.
Relational databases are examples of structured data.
It is easy to make sense of relational databases.
Most modern computer programs are able to make sense of structured data.
31. Types of Big Data
Unstructured
• Unstructured data refers to data that lacks any specific form or structure.
• It cannot be stored in a spreadsheet or fitted into tabular databases.
• Examples of unstructured data include audio, video, and email, which together make up a large share
of big data today.
32. Types of Big Data
Semi-structured
• Semi-structured data has characteristics of both structured and unstructured data.
• These data sets have some structure, but it may still not be possible to sort or process
the data directly because that structure is not fixed.
• Examples include XML data and JSON files.
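A JSON record illustrates semi-structured data well: it has structure (keys, nesting) but no fixed tabular schema, so fields can vary from record to record (the records below are hypothetical, for illustration only):

```python
import json

# Two records share some keys but not others: structured in form,
# yet they do not fit a fixed set of table columns.
records = [
    '{"id": 1, "name": "Asha", "emails": ["asha@example.com"]}',
    '{"id": 2, "name": "Ravi", "phone": "555-0100", "tags": {"vip": true}}',
]

for raw in records:
    rec = json.loads(raw)
    # Missing fields must be handled explicitly: there is no fixed schema.
    print(rec["name"], rec.get("phone", "no phone on record"))
```

A relational table would force every row to have the same columns; here each record carries only the fields it needs, which is why such data is called semi-structured.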
33. Traditional Vs. Big Data
• 1. Traditional data
• Traditional data is structured data, maintained by businesses of all sizes, from very small to
large organizations.
• In traditional database systems, a centralized database architecture is used to store and maintain the data in
a fixed format or in fixed fields in a file.
• Structured Query Language (SQL) is used to manage and access the data.
• 2. Big data :
• Big data deals with data sets too large or complex to manage with traditional data-processing
application software.
• It deals with large volumes of structured, semi-structured and unstructured data, characterized by volume,
velocity, variety, veracity and value.
• Big data refers not only to a large amount of data but to extracting meaningful information by analyzing
huge, complex data sets.
34. Traditional Data vs. Big Data
S.No. | TRADITIONAL DATA | BIG DATA
01. | Generated at the enterprise level. | Generated both outside and at the enterprise level.
02. | Volume ranges from gigabytes to terabytes. | Volume ranges from petabytes to zettabytes or exabytes.
03. | Deals with structured data. | Deals with structured, semi-structured and unstructured data.
04. | Generated per hour, per day or less often. | Generated more frequently, mainly per second.
05. | Source is centralized and managed in centralized form. | Source is distributed and managed in distributed form.
06. | Data integration is very easy. | Data integration is very difficult.
07. | Normal system configuration can process the data. | High system configuration is required to process the data.
35. (continued)
08. | The size of the data is very small. | The size is much larger than traditional data.
09. | Traditional database tools suffice for database operations. | Special database tools are required for database operations.
10. | Normal functions can manipulate the data. | Special functions are needed to manipulate the data.
11. | Data model is strict-schema based and static. | Data model is flat-schema based and dynamic.
12. | Data is stable, with known interrelationships. | Data is not stable, with unknown relationships.
13. | Data is in manageable volume. | Data is in huge volume, which becomes unmanageable.
14. | Easy to manage and manipulate. | Difficult to manage and manipulate.
15. | Sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Sources include social media, device data, sensor data, video, images, audio, etc.
36. Applications of Big Data
• Big data in retail
• Big data in healthcare
• Big data in education
• Big data in e-commerce
• Big data in media and entertainment
• Big data in finance
• Big data in the travel industry
• Big data in telecom
• Big data in automobiles