Data Analytics (KIT-601)
Unit-5: Frame Works and Visualization &
Introduction to R
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-5 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who
made their course contents freely available or contributed directly or indirectly. Feel free to use this
study material for your own academic purposes. For any query, communication can be made through this
email: shyam0058@gmail.com.
April 28, 2024
2. Data Analytics (KIT 601)
Course Outcome ( CO) Bloom’s Knowledge Level (KL)
At the end of course , the student will be able to
CO 1 Discuss various concepts of data analytics pipeline K1, K2
CO 2 Apply classification and regression techniques K3
CO 3 Explain and apply mining techniques on streaming data K2, K3
CO 4 Compare different clustering and frequent pattern mining algorithms K4
CO 5 Describe the concept of R programming and implement analytics on Big data using R. K2,K3
DETAILED SYLLABUS (3-0-0)
Unit I (08 lectures) – Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics. Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases of data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization.
Unit II (08 lectures) – Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis & nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.
Unit III (08 lectures) – Mining Data Streams: Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, case studies – real-time sentiment analysis, stock market predictions.
Unit IV (08 lectures) – Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-euclidean space, clustering for streams and parallelism.
Unit V (08 lectures) – Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual data analysis techniques, interaction techniques, systems and applications. Introduction to R – R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Textbooks and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Part I: Frame Works and Visualization
1 Frameworks and visualization
Frameworks and visualization are two important aspects of software development that are commonly used
together to create applications that are powerful, flexible, and user-friendly.
A framework is a pre-existing set of tools, libraries, and code structures that provide a foundation for
building software applications. Frameworks are designed to simplify the development process by providing
reusable components that can be customized and configured to meet the specific needs of a project.
There are many popular frameworks available for various programming languages, such as Django for
Python, Ruby on Rails for Ruby, and Angular for JavaScript.
Visualization refers to the process of creating graphical representations of data and information. Visu-
alizations are used to help people better understand complex information and to communicate insights and
ideas more effectively.
Visualization can be done using a variety of tools and technologies, such as charts, graphs, diagrams, and
maps. These visualizations can be created using programming languages like Python or JavaScript, or with
specialized tools like Tableau or Power BI.
Frameworks and visualization can be used together to create powerful applications that incorporate data
analysis, reporting, and visualization. For example, a web application built on the Django framework could
use JavaScript visualization libraries like D3.js or Highcharts.js to create interactive charts and graphs that
help users understand complex data.
1.1 MapReduce, Hadoop
Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment. The Hadoop architecture has two basic components. The first is HDFS (Hadoop Distributed File System), the storage layer, which allows data of various formats to be stored across a cluster; the components of Hadoop and the working of MapReduce are illustrated in Figures 1 and 2.
The second is YARN, which handles resource management in Hadoop and allows parallel processing over the data stored across HDFS. MapReduce is the core component for data processing in the Hadoop framework.
Figure 1: Components of Hadoop Framework.
Figure 2: Working of MapReduce.
MapReduce is a processing technique built on the divide-and-conquer principle. It consists of two tasks: Map and Reduce. The Map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs). The Reduce task then takes the output of the Map task as its input, combines those tuples into a smaller set of tuples, and produces the final result. The working of MapReduce is illustrated in Figure 2.
Figure 3: Working of Hadoop Framework.
1.1.1 How the MapReduce Algorithm Works
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing. The data goes through the following phases:
• Input Splits: In this phase, the input data set is divided into fixed-size pieces called input splits.
• Mapping: This is the first computation phase in the execution of a MapReduce program. Each input split is processed in parallel by a map task, which performs the required computation on it. The output of the Map function is a set of key-value pairs, for example of the form <word, frequency>.
• Shuffling: The Shuffle function is also known as the “Combine” function. It performs the following two sub-steps:
– Merging
– Sorting
This phase consumes the output of the mapping phase and performs these two sub-steps on every key-value pair.
– The merging step combines all key-value pairs that have the same key.
– The sorting step takes the input from the merging step and sorts all key-value pairs by key.
Finally, the Shuffle function passes a sorted list of <Key, List<Value>> pairs to the next step.
• Reducing: In this phase, the output values from the shuffling phase are aggregated. It combines the values for each key and returns a single output value per key. In short, this phase summarizes the complete dataset.
Let’s understand this with an example. Consider the following input data for your MapReduce program:
“Welcome to Hadoop Class Hadoop is good, Hadoop is bad”
The final output of the MapReduce word-count task is shown in Table 1.
Table 1: Final output of the MapReduce task.
bad 1
class 1
good 1
hadoop 3
is 2
to 1
welcome 1
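To make the phases concrete, the following is a minimal base-R sketch that simulates the Map, Shuffle, and Reduce steps for the word-count example above; it is an illustration of the idea only, not code that runs on a Hadoop cluster.

# Simulated MapReduce word count in base R (illustration only, not Hadoop).
input <- "Welcome to Hadoop Class Hadoop is good, Hadoop is bad"

# Map: emit a <word, 1> pair for every word (punctuation stripped, lower-cased)
words  <- tolower(unlist(strsplit(input, "[^A-Za-z]+")))
mapped <- lapply(words, function(w) list(key = w, value = 1))

# Shuffle: merge values that share a key, then sort by key
values_by_key <- split(sapply(mapped, `[[`, "value"),
                       sapply(mapped, `[[`, "key"))
values_by_key <- values_by_key[order(names(values_by_key))]

# Reduce: aggregate the list of values for each key into a single count
reduced <- sapply(values_by_key, sum)
print(reduced)
#     bad   class    good  hadoop      is      to welcome
#       1       1       1       3       2       1       1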
1.2 Pig
Apache Pig is a platform for analyzing large datasets that are stored in Hadoop Distributed File System
(HDFS). The Pig platform provides a high-level language called Pig Latin, which is used to write data
processing programs that are executed on Hadoop clusters.
The architecture of Pig consists of the following components:
1. Pig Latin Parser: This component is responsible for parsing the Pig Latin scripts written by users
and converting them into a series of logical execution plans.
2. Logical Plan Generator: This component generates a logical execution plan from the parsed Pig
Latin script. The logical plan represents the data flow operations required to execute the script.
3. Optimization and Compilation: This component optimizes the logical execution plan generated
by the previous component and compiles it into a physical execution plan that can be executed on
Hadoop.
4. Execution Engine: This component executes the physical execution plan on the Hadoop cluster,
processing the data stored in HDFS and generating output.
5. UDFs: User-Defined Functions (UDFs) are custom functions that can be written in Java, Python or
any other language supported by Hadoop, and can be integrated with Pig to perform custom data
processing operations.
Overall, the architecture of Pig provides a scalable, efficient and flexible platform for analyzing large datasets
in Hadoop, and is widely used in big data processing applications.
1.3 Hive
Hive is a data warehouse software built on top of Hadoop that allows for querying and analysis of large
datasets stored in Hadoop Distributed File System (HDFS). Hive provides a SQL-like interface called HiveQL
(HQL) that enables users to write queries against the data stored in Hadoop, without needing to know how
to write MapReduce programs.
Hive architecture consists of the following components:
1. Metastore: This component stores metadata about the data stored in HDFS, such as the schema and
the location of tables.
2. Driver: This component accepts HiveQL queries, manages the life cycle of each query, and coordinates with the compiler and execution engine to run it on the Hadoop cluster.
3. Compiler: This component parses the HiveQL queries, converts them into logical and physical exe-
cution plans, and optimizes them for execution on Hadoop.
4. Execution Engine: This component executes the compiled MapReduce programs on the Hadoop
cluster, processing the data stored in HDFS and generating output.
5. UDFs: User-Defined Functions (UDFs) are custom functions that can be written in Java, Python or
any other language supported by Hadoop, and can be integrated with Hive to perform custom data
processing operations.
Overall, Hive provides a powerful and flexible platform for analyzing large datasets stored in Hadoop, using
a familiar SQL-like interface that is easy to use for users who are familiar with SQL.
1.4 HBase
HBase is a column-oriented NoSQL database built on top of Hadoop Distributed File System (HDFS). It is
designed to handle large volumes of structured and semi-structured data in a distributed environment.
HBase architecture consists of the following components:
1. RegionServer: This component manages regions of data stored in HDFS, and provides read and
write access to the data stored in those regions.
2. HMaster: This component manages the assignment of regions to RegionServers, handles schema
changes and metadata management, and provides monitoring and administration of the HBase cluster.
3. ZooKeeper: This component provides coordination services for distributed systems, such as leader
election, configuration management, and synchronization.
4. HDFS: HBase uses HDFS as its underlying storage layer for storing data, and it stores data in HDFS
files called HFiles.
5. Clients: HBase provides client libraries for Java, Python, and other languages, which can be used to
interact with the HBase cluster and perform read and write operations on the data stored in HBase.
HBase provides several features that make it well-suited for handling large volumes of data, including auto-
matic sharding, high write throughput, and support for transactions. HBase is commonly used in applications
that require real-time access to large volumes of data, such as social media platforms, e-commerce websites,
and financial trading systems.
1.5 MapR
MapR is a data platform that provides a complete set of data management, storage, and processing services
for big data applications. MapR is built on top of Apache Hadoop and extends it with additional features
and capabilities.
MapR architecture consists of the following components:
1. MapR-FS: This component is a distributed file system that provides scalable and reliable storage for
big data. MapR-FS is designed to be highly available and fault-tolerant, and it provides advanced
features such as snapshots, mirroring, and multi-tenancy.
2. MapR-DB: This component is a NoSQL database that provides real-time access to data stored in
MapR-FS. MapR-DB supports a wide range of data models, including key-value, JSON, and binary
formats, and it provides features such as automatic sharding, replication, and secondary indexing.
3. MapR Streams: This component is a messaging system that allows for real-time processing of data streams. MapR Streams is compatible with the Apache Kafka API and provides advanced features such as global replication, message-level security, and integrated management.
4. MapR Analytics: This component provides a set of tools for processing and analyzing data stored
in MapR-FS and MapR-DB. MapR Analytics includes support for Apache Spark, Apache Drill, and
other popular big data processing frameworks.
5. MapR Control System: This component provides a centralized management console for the MapR
platform, allowing administrators to monitor and manage the entire system from a single interface.
Overall, MapR provides a comprehensive platform for managing and processing big data, with advanced
features and capabilities that make it well-suited for enterprise-level applications.
1.6 Sharding
Sharding is a technique used in distributed database systems to partition data across multiple nodes or
servers, in order to improve scalability, availability, and performance.
In a sharded database, data is divided into smaller subsets called shards or partitions, which are dis-
tributed across multiple servers. Each server is responsible for storing and processing a specific subset of
data. This allows the system to handle larger amounts of data and more concurrent requests, while also
improving fault tolerance and reducing single points of failure.
Sharding can be done in different ways, depending on the specific needs of the application and the
database system being used. Some common sharding techniques include:
1. Range-based sharding: In this technique, data is partitioned based on a specific range of values,
such as a range of dates, geographical locations, or customer IDs. Each shard is responsible for storing
data within a specific range.
2. Hash-based sharding: In this technique, data is partitioned based on a hash function applied to
a specific field or set of fields. The hash function maps each record to a specific shard, based on the
result of the hash.
3. Round-robin sharding: In this technique, records are assigned to shards in a rotating, circular fashion. This technique is simple and spreads records evenly across shards, but because placement ignores the content of the records, related data can end up on different shards, which makes targeted lookups and range queries less efficient.
Sharding can provide significant benefits for large-scale distributed database systems, but it can also in-
troduce additional complexity and management overhead. Properly designing and implementing a sharded
database requires careful planning and consideration of factors such as data distribution, fault tolerance,
and performance.
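As a toy illustration of the hash-based technique described above, the base-R sketch below assigns record keys to one of four hypothetical shards using a simple character-sum hash; a production system would use a stronger hash function and would also handle replication and rebalancing.

# Toy hash-based sharding in base R (illustration only).
n_shards <- 4

shard_for <- function(key, n = n_shards) {
  h <- sum(utf8ToInt(as.character(key)))  # crude hash: sum of character codes
  (h %% n) + 1                            # shard ids 1 .. n
}

customer_ids <- c("C1001", "C1002", "C1003", "C1004", "C1005")
data.frame(key = customer_ids,
           shard = sapply(customer_ids, shard_for),
           row.names = NULL)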
1.7 NoSQL databases
NoSQL databases are a type of database management system that do not use traditional SQL-based relational
data models. NoSQL databases are designed to handle large amounts of unstructured, semi-structured, and
structured data in a distributed and scalable manner.
NoSQL databases are generally classified into four categories:
1. Document-based databases: These databases store and manage data in the form of documents, typ-
ically using a JSON or BSON data model. Examples of document-based databases include MongoDB,
Couchbase, and Amazon DocumentDB.
2. Key-value stores: These databases store and manage data as key-value pairs, similar to a hash table.
Examples of key-value stores include Redis, Riak, and Amazon DynamoDB.
3. Column-family stores: These databases store and manage data as columns and column families,
similar to a table in a relational database. Examples of column-family stores include Apache Cassandra
and HBase.
4. Graph databases: These databases store and manage data as nodes and edges in a graph data model,
allowing for efficient traversal and analysis of complex relationships. Examples of graph databases
include Neo4j, ArangoDB, and Amazon Neptune.
NoSQL databases offer several advantages over traditional relational databases, including:
1. Scalability: NoSQL databases are designed to scale horizontally across multiple servers, allowing
them to handle large volumes of data and high levels of concurrency.
2. Flexibility: NoSQL databases can handle different types of data, including unstructured and semi-
structured data, which can be difficult to manage in a traditional relational database.
3. Performance: NoSQL databases can provide high performance for specific types of queries, such as
those that require complex data processing or real-time analysis.
NoSQL databases have become increasingly popular in recent years, particularly for applications that require
high scalability, flexibility, and performance. However, NoSQL databases also have some disadvantages, such
as less robust consistency guarantees and a lack of standardization across different database types.
1.8 S3
Amazon S3 (Simple Storage Service) is a scalable, secure, and durable cloud storage service offered by
Amazon Web Services (AWS). S3 allows users to store and retrieve data from anywhere on the internet,
using a simple web interface, API, or command-line tools.
S3 provides several benefits over traditional on-premises storage solutions, including:
1. Scalability: S3 can scale to store and retrieve any amount of data, from a few gigabytes to multiple
petabytes, and it can handle millions of requests per second.
2. Durability: S3 is designed to provide 99.999999999% (11 nines) durability for stored objects, using
multiple layers of redundancy and automatic error correction.
3. Security: S3 provides strong encryption for data in transit and at rest, and it offers access controls
and permissions to ensure that only authorized users can access data.
4. Cost-effectiveness: S3 provides a pay-as-you-go pricing model, with no upfront costs or minimum
usage requirements.
S3 can be used for a wide range of use cases, including:
1. Data backup and recovery: S3 can be used to store backup copies of data, and it can be configured
to automatically replicate data to multiple regions for disaster recovery purposes.
2. Content delivery: S3 can be used to store and distribute content, such as images, videos, and other
media files, through a content delivery network (CDN).
3. Big data analytics: S3 can be used as a data lake to store large amounts of data for analytics
purposes, and it can be integrated with other AWS services, such as Amazon EMR, to process and
analyze data.
4. Application hosting: S3 can be used to store and serve static web content, such as HTML pages
and JavaScript files, for web applications.
Overall, S3 is a highly scalable and durable cloud storage service that offers a wide range of features and
benefits for storing, managing, and accessing data in the cloud.
1.9 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system that is designed to store large
amounts of data across many commodity hardware nodes in a Hadoop cluster. HDFS is one of the core
components of the Hadoop ecosystem, and it provides a fault-tolerant and scalable solution for storing and
processing big data.
HDFS works by dividing large data files into smaller blocks and distributing them across multiple nodes
in a cluster. Each block is replicated across several nodes to ensure data durability and availability in case
of node failures. HDFS uses a master/slave architecture, where a NameNode serves as the master node and
manages the file system metadata, while multiple DataNodes serve as slave nodes and store the actual data
blocks.
HDFS provides several benefits over traditional file systems, including:
1. Scalability: HDFS can store and process petabytes or even exabytes of data across thousands of
nodes in a Hadoop cluster, allowing it to handle large-scale data processing workloads.
2. Fault tolerance: HDFS is designed to be fault-tolerant, with data replication and block-level check-
sums to ensure data durability and availability in case of node failures.
3. Data locality: HDFS is optimized for data locality, which means that data processing tasks can be
executed on the same nodes where the data is stored, minimizing network overhead and improving
performance.
4. Open source: HDFS is an open-source software project that is maintained by the Apache Software
Foundation, which means that it is freely available and can be customized and extended by developers.
HDFS is commonly used in conjunction with other Hadoop ecosystem tools, such as MapReduce, HBase,
and Spark, to process and analyze large data sets.
1.10 Visualization
Visualization refers to the graphical representation of data and information, often in the form of charts,
graphs, maps, and other visual aids. The purpose of visualization is to make complex data and information
more accessible and understandable to users, by presenting it in a visually appealing and interactive format.
Visualization can be used for a wide range of applications, including:
1. Data exploration: Visualization can be used to explore and analyze large data sets, by visually
representing patterns, trends, and relationships in the data.
2. Data communication: Visualization can be used to communicate complex data and information to
non-expert audiences, by presenting it in a clear and intuitive format.
3. Decision making: Visualization can be used to support decision making processes, by providing
decision makers with actionable insights and information.
4. Storytelling: Visualization can be used to tell compelling stories and narratives based on data and
information, by presenting it in a visually engaging and interactive format.
There are many different types of visualization techniques and tools, including:
1. Charts and graphs: These are the most common forms of visualization, and they include bar charts,
line charts, scatter plots, and many others.
2. Maps: Maps can be used to visualize geographic data, such as the distribution of population, resources,
or economic activity.
3. Infographics: Infographics are visual representations of data and information that are designed to
communicate complex concepts in a clear and engaging way.
4. Dashboards: Dashboards are interactive visualizations that allow users to explore and analyze data
in real time, by displaying key performance indicators and other metrics.
Overall, visualization is a powerful tool for understanding and communicating data and information, and it
is increasingly being used in a wide range of fields, including business, science, healthcare, and journalism.
1.10.1 Visual data analysis techniques
Visual data analysis techniques are used to explore and analyze data visually, by creating charts, graphs,
maps, and other visual aids that help to identify patterns, trends, and relationships in the data. Some
common visual data analysis techniques include:
1. Scatter plots: Scatter plots are used to display the relationship between two variables, by plotting
each data point on a graph with one variable on the x-axis and the other variable on the y-axis.
2. Heat maps: Heat maps are used to display the density of data points across a two-dimensional space,
by using colors to represent the intensity of the data.
3. Box plots: Box plots are used to display the distribution of data across different categories or groups,
by showing the range, median, and quartiles of the data.
4. Network diagrams: Network diagrams are used to visualize complex relationships between entities
or nodes in a network, by using nodes and edges to represent the entities and their connections.
5. Geographic maps: Geographic maps are used to display data based on their location, by using colors
or symbols to represent the data points on a map.
6. Time series charts: Time series charts are used to display changes in data over time, by plotting
the data on a graph with time on the x-axis and the data value on the y-axis.
7. Bubble charts: Bubble charts are used to display data in three dimensions, by using a third variable
to determine the size of the data point on a two-dimensional graph.
8. Histograms: Histograms are used to display the distribution of data across a single variable, by
grouping the data into intervals and displaying the frequency of data points within each interval.
Overall, visual data analysis techniques provide a powerful way to explore and analyze complex data sets,
by presenting the data in a way that is easy to understand and interpret. By using visual data analysis
techniques, data analysts and scientists can quickly identify patterns, trends, and relationships in the data,
and make data-driven decisions that can drive business and scientific outcomes.
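Several of these techniques can be produced directly in base R; the short sketch below uses the built-in mtcars data set to draw a scatter plot, a histogram, and a box plot.

# Base-R sketches of three common visual analysis techniques (mtcars is built in).
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot: weight vs fuel economy")

hist(mtcars$mpg, breaks = 8,
     xlab = "Miles per gallon", main = "Histogram of mpg")

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Box plot of mpg by cylinder count")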
1.10.2 Interaction techniques
Interaction techniques are used in visual data analysis to enable users to interact with data visualizations
and explore the underlying data in more detail. Some common interaction techniques include:
1. Zooming and panning: These techniques allow users to zoom in on specific areas of a visualization
or pan across different parts of the visualization to explore it in more detail.
2. Brushing and linking: These techniques allow users to highlight specific data points or regions of a
visualization and see how they relate to other parts of the visualization or other visualizations.
3. Filtering and selection: These techniques allow users to select specific data points or subsets of
data based on specific criteria or filters, to explore specific parts of the data in more detail.
4. Tooltips and annotations: These techniques provide additional information about specific data
points or regions of a visualization, by displaying tooltips or annotations when users hover over or click
on specific parts of the visualization.
5. Interactivity and animation: These techniques allow users to interact with visualizations in real
time, by using sliders, buttons, or other interactive elements to modify the visualization parameters or
animate the data over time.
Overall, interaction techniques provide a powerful way to explore and analyze data in a more dynamic and
interactive way, by enabling users to manipulate and interact with visualizations to gain deeper insights
into the underlying data. By using interaction techniques, data analysts and scientists can quickly identify
patterns, trends, and relationships in the data, and make data-driven decisions that can drive business and
scientific outcomes.
1.10.3 Systems and applications
Systems and applications are two different types of software that are used in computing.
A system is a collection of software components and hardware that work together to perform a specific
task or set of tasks. Systems are often designed to provide an underlying infrastructure for applications
to run on. Examples of systems include operating systems, database management systems, and network
systems.
An application, on the other hand, is a software program that is designed to perform a specific task or
set of tasks, often for end-users. Applications are built on top of systems, and they rely on the underlying
infrastructure provided by the system to operate. Examples of applications include word processors, web
browsers, and email clients.
Both systems and applications are essential components of modern computing. Systems provide the
underlying infrastructure and services that enable applications to run, while applications provide the user-
facing interfaces and functionality that end-users interact with directly. Together, systems and applications
enable us to perform a wide range of tasks, from simple word processing to complex data analysis and
machine learning.
Part II: Introduction to R
1 Introduction to R
R is a programming language and software environment that is widely used for statistical computing, data
analysis, and visualization. It was developed in the early 1990s by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand.
R provides a wide range of statistical and graphical techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis, classification, clustering, and more. R is also highly extensible,
with a large number of packages and libraries available for specialized tasks such as machine learning, text
analysis, and image processing.
One of the key features of R is its flexibility and ease of use. R provides a simple and intuitive syntax
that is easy to learn and use, even for those without a strong programming background. R also has a large
and active community of users and developers, who contribute to the development of packages and provide
support and resources for users.
R can be used in a variety of settings, including academic research, data analysis and visualization,
and industry applications. Some common use cases for R include analyzing large data sets, creating data
visualizations, and building predictive models for machine learning and data science applications.
Overall, R is a powerful and versatile tool for statistical computing, data analysis, and visualization, with a large and active community of users and developers. Whether you are a student, researcher, data analyst, or data scientist, R provides a flexible and powerful environment for exploring and analyzing data.
1.1 R graphical user interfaces
R provides a number of graphical user interfaces (GUIs) that can make it easier to work with R for those
who are new to the language or who prefer a more visual approach to data analysis. Some popular R GUIs
include:
1. RStudio: RStudio is a free and open-source integrated development environment (IDE) for R that
provides a modern and user-friendly interface. It includes a code editor, console, debugging tools, and
data visualization capabilities.
2. RKWard: RKWard is another free and open-source GUI for R that provides a range of features for
data analysis, including a spreadsheet-like data editor, syntax highlighting, and built-in support for
common statistical tests.
3. Jupyter Notebooks: Jupyter Notebooks is a web-based tool that provides an interactive environment
for working with data and code. It supports multiple programming languages, including R, and provides
a flexible and customizable interface for data analysis.
4. Tinn-R: Tinn-R is a lightweight and customizable GUI for R that provides a simple interface for
working with R scripts and data files.
5. Emacs + ESS: Emacs is a powerful text editor that can be used with the Emacs Speaks Statistics
(ESS) package to provide an integrated environment for R development and data analysis.
Overall, R provides a wide range of GUIs that can make it easier to work with R, depending on your
preferences and needs. Whether you prefer a modern and user-friendly interface like RStudio, or a more
lightweight and customizable GUI like Tinn-R, there is likely a GUI that will meet your needs.
1.2 Data import and export
In R, data import and export are essential tasks for data analysis. R provides a variety of functions for
importing and exporting data from a wide range of file formats. Some common data import/export functions
in R include:
1. read.csv() and write.csv(): These functions are used to import and export data in CSV (Comma
Separated Values) format. CSV is a commonly used file format for storing tabular data.
2. read.table() and write.table(): These functions are used to import and export data in a variety of
text-based formats, including CSV, TSV (Tab Separated Values), and other delimited text formats.
3. read.xlsx() and write.xlsx(): These functions (provided by add-on packages such as openxlsx or xlsx) are used to import and export data in Excel (.xlsx) format. Excel is a widely used spreadsheet application, and being able to import and export data from Excel is an important task in data analysis.
4. readRDS() and saveRDS(): These functions are used to import and export R objects in binary
format. R objects can be complex data structures, and saving them in binary format can be a more
efficient way of storing and sharing data than text-based formats.
5. read.spss() and write.foreign(): The foreign package provides read.spss() to import data from SPSS files and write.foreign() to export data for use in SPSS (the haven package's read_sav() and write_sav() are a more modern alternative). SPSS is a statistical software package that is commonly used in social science research.
6. dbReadTable() and dbWriteTable(): The DBI package provides these functions, along with dbGetQuery(), to import data from and export data to SQL databases. SQL is a popular database language, and being able to interact with SQL databases is an important task in data analysis.
These are just a few examples of the many data import/export functions available in R. The specific function
used will depend on the file format and data source being used. Overall, R provides a wide range of tools
for importing and exporting data, making it a versatile and powerful tool for data analysis.
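A few of these import/export functions in action, using hypothetical file names purely for illustration:

# Hypothetical file names, for illustration only.
df <- read.csv("sales.csv", stringsAsFactors = FALSE)     # import a CSV file
write.csv(df, "sales_copy.csv", row.names = FALSE)        # export to CSV

tsv <- read.table("log.tsv", header = TRUE, sep = "\t")   # tab-separated text

saveRDS(df, "sales.rds")      # save any R object in compact binary form
df2 <- readRDS("sales.rds")   # load it back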
1.3 Attribute and data types
In data analysis, attributes refer to the characteristics or properties of a variable, while data types refer to
the format in which data is stored. In R, there are several common attribute types and data types that are
used in data analysis.
1. Attribute Types:
• Names: The names attribute specifies the names of the variables in a dataset.
• Class: The class attribute specifies the type of data stored in a variable (e.g. numeric, character, factor).
• Dimensions: The dimensions attribute specifies the dimensions of a dataset (e.g. number of rows and columns).
• Factors: Factors are a specific type of attribute that represent categorical data with levels.
2. Data Types:
• Numeric: Numeric data types represent numerical values (e.g. integers, decimals, etc.). Numeric data types can be further divided into integer (e.g. 1, 2, 3) and floating-point (e.g. 1.2, 3.14) types.
• Character: Character data types represent text data (e.g. "hello", "world").
• Logical: Logical data types represent boolean values (TRUE or FALSE).
• Date and time: Date and time data types represent date and time values (e.g. "2023-04-14", "15:30:00").
• Factor: Factor data types represent categorical data with levels (e.g. "Male", "Female").
• Complex: Complex data types represent complex numbers with real and imaginary parts.
Understanding attribute types and data types is important in data analysis as they can affect the way data is
stored, manipulated, and analyzed. By correctly specifying the attribute types and data types of variables in
a dataset, data analysts can ensure that they are working with the appropriate data types and can perform
accurate and efficient analyses.
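The following base-R snippet illustrates the common data types and how their attributes can be inspected:

x <- c(1.5, 2, 3)                         # numeric (floating-point)
n <- 42L                                  # integer
s <- c("hello", "world")                  # character
b <- TRUE                                 # logical
d <- as.Date("2023-04-14")                # date
g <- factor(c("Male", "Female", "Male"))  # factor (categorical with levels)

class(x); class(g)      # class attribute of a variable
levels(g)               # levels of a factor
m <- matrix(1:6, nrow = 2)
dim(m)                  # dimensions attribute: 2 3
names(iris)             # names attribute of a built-in data frame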
1.4 Descriptive statistics
Descriptive statistics are a set of methods used to describe and summarize important features of a dataset.
These methods provide a way to organize and analyze large amounts of data in a meaningful way. Some
common descriptive statistics include:
1. Measures of central tendency: These statistics give an idea of where the data is centered. The
most commonly used measures of central tendency are the mean, median, and mode.
2. Measures of variability: These statistics give an idea of how spread out the data is. The most
commonly used measures of variability are the range, variance, and standard deviation.
3. Frequency distributions: These show the frequency of each value in a dataset.
4. Percentiles: These divide the dataset into equal parts based on a percentage. For example, the 50th
percentile is the value that separates the top 50% of values from the bottom 50%.
5. Box plots: These show the distribution of a dataset and identify outliers.
6. Histograms: These show the distribution of a dataset by dividing it into bins and counting the
number of values in each bin.
7. Scatter plots: These show the relationship between two variables by plotting them on a graph.
Descriptive statistics are important in data analysis as they provide a way to summarize and understand
large amounts of data. They can also help identify patterns and relationships within a dataset. By using
descriptive statistics, data analysts can gain insights into the characteristics of the data and make informed
decisions about how to further analyze it.
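In R, most of these descriptive statistics are available as built-in functions; a quick sketch using the built-in mtcars data set:

mean(mtcars$mpg)      # central tendency
median(mtcars$mpg)
range(mtcars$mpg)     # variability
var(mtcars$mpg)
sd(mtcars$mpg)
quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))  # percentiles
table(mtcars$cyl)     # frequency distribution of a discrete variable
summary(mtcars$mpg)   # min, quartiles, median, mean, max in one call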
1.5 Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing and visualizing data in order to summarize its
main characteristics and identify patterns and relationships within the data. The goal of EDA is to generate
hypotheses, test assumptions, and provide a basis for more in-depth analysis.
EDA involves several steps, including:
1. Data collection: This involves obtaining the data from various sources and ensuring that it is in the
appropriate format for analysis.
2. Data cleaning: This involves identifying and correcting errors, missing values, and outliers in the
data.
3. Data visualization: This involves creating various graphs and charts to visualize the data and identify
patterns and relationships.
4. Summary statistics: This involves calculating summary statistics such as means, medians, standard
deviations, and variances to describe the central tendency and variability of the data.
5. Hypothesis testing: This involves testing hypotheses about the data using statistical methods.
6. Machine learning: This involves applying machine learning algorithms to the data in order to identify
patterns and relationships and make predictions.
EDA is an important step in data analysis as it provides a basis for further analysis and helps ensure that
the data is appropriate for the analysis being performed. By understanding the main characteristics of the
data and identifying patterns and relationships, analysts can make informed decisions about how to proceed
with their analysis and generate hypotheses that can be tested using more advanced statistical methods.
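A minimal EDA pass in base R might look like the sketch below, which uses the built-in airquality data set (which contains missing values) to combine cleaning, summary statistics, and quick visual checks:

df <- airquality                     # built-in data set with some NAs
summary(df)                          # summary statistics; also reveals NAs
colSums(is.na(df))                   # where the missing values are
df_clean <- na.omit(df)              # simple cleaning: drop incomplete rows

pairs(df_clean[, c("Ozone", "Wind", "Temp")])   # pairwise relationships
cor(df_clean$Ozone, df_clean$Temp)              # quantify one relationship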
1.6 Visualization before analysis
Visualization before analysis is an important step in data analysis. It involves creating visual representations
of the data in order to gain a better understanding of its characteristics and identify patterns and relation-
ships. By visualizing the data before performing any analysis, analysts can gain insights that may not be
apparent from a simple numerical analysis.
There are several benefits of visualizing data before analysis, including:
1. Identifying outliers: Visualizations can help identify outliers, which are data points that fall far
outside the typical range of values. Outliers can significantly affect the results of an analysis, and
visualizing the data can help analysts identify them and determine whether they should be included
or excluded from the analysis.
2. Understanding the distribution of data: Visualizations can help analysts understand the distri-
bution of the data, including its shape, spread, and skewness. This can help them choose appropriate
statistical methods for analysis.
3. Identifying relationships between variables: Visualizations can help identify relationships be-
tween variables, such as correlations or trends. This can help analysts determine which variables to
include in their analysis and how to model the relationship between them.
4. Communicating results: Visualizations can be used to communicate results to stakeholders in a clear
and concise manner. By presenting data in a visually appealing way, analysts can help stakeholders
understand the main insights and implications of the analysis.
In summary, visualizing data before analysis is an important step in data analysis that can help analysts
gain insights, identify outliers, understand the distribution of data, and communicate results to stakeholders.
1.7 Analytics for unstructured data
Analytics for unstructured data refers to the process of analyzing and extracting insights from non-tabular,
unstructured data sources such as text, images, audio, and video. Unstructured data is typically generated at
a high volume, velocity, and variety, making it difficult to analyze using traditional data analysis techniques.
There are several analytics techniques that can be used to analyze unstructured data, including:
1. Natural Language Processing (NLP): NLP is a field of study that focuses on the interaction
between human language and computers. It involves using algorithms to extract meaning and insights
from unstructured text data, including sentiment analysis, topic modeling, and entity extraction.
2. Image and video analytics: Image and video analytics involve using computer vision techniques
to extract insights from visual data. This can include facial recognition, object detection, and image
segmentation.
3. Speech and audio analytics: Speech and audio analytics involve using signal processing techniques
to extract insights from audio data, such as speech recognition, speaker identification, and emotion
detection.
4. Machine learning: Machine learning algorithms can be used to analyze unstructured data by learning
from patterns and relationships in the data. This can include techniques such as clustering, classifica-
tion, and regression.
To analyze unstructured data effectively, it is important to have a robust infrastructure for data storage,
processing, and analysis. This may involve using distributed computing platforms such as Hadoop and Spark,
as well as specialized software tools for data preprocessing, feature extraction, and model development.
In summary, analytics for unstructured data involves using specialized techniques and tools to extract
insights from non-tabular, unstructured data sources. By analyzing unstructured data, organizations can
gain valuable insights into customer sentiment, product feedback, market trends, and other areas of interest.
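As a very small taste of text analytics in base R, the sketch below tokenizes two example reviews and scores them against a tiny, made-up sentiment lexicon; real NLP work would use dedicated packages and much larger lexicons or models.

# Toy sentiment scoring (made-up lexicon, for illustration only).
reviews  <- c("Great product, works great", "Poor battery and poor support")
positive <- c("great", "good", "excellent")
negative <- c("poor", "bad", "terrible")

score_review <- function(txt) {
  tokens <- tolower(unlist(strsplit(txt, "[^A-Za-z]+")))
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

sapply(reviews, score_review)   # 2 for the first review, -2 for the second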
Printed Page: 1 of 2
Subject Code: KIT601
Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data platform. 1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data analysis. 2
(b) Given data = {2,3,4,5,6,7; 1,5,3,6,7,8}. Compute the principal component using the PCA algorithm. 2
Printed Page: 2 of 2
Subject Code: KIT601
Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count the number of distinct elements in a data stream. 3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence c). 4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5