Data Analytics (KIT-601)
Unit-5: Frame Works and Visualization &
Introduction to R
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-5 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who
made their course contents freely available or contributed directly or indirectly. Feel free to use this
study material for your own academic purposes. For any query, communication can be made through this
email: shyam0058@gmail.com.
April 28, 2024
2. Data Analytics (KIT 601)
Course Outcome ( CO) Bloom’s Knowledge Level (KL)
At the end of course , the student will be able to
CO 1 Discuss various concepts of data analytics pipeline K1, K2
CO 2 Apply classification and regression techniques K3
CO 3 Explain and apply mining techniques on streaming data K2, K3
CO 4 Compare different clustering and frequent pattern mining algorithms K4
CO 5 Describe the concept of R programming and implement analytics on Big data using R. K2,K3
DETAILED SYLLABUS (3-0-0)
Unit I (08 lectures) – Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics. Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases of data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization.
Unit II (08 lectures) – Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis & nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.
Unit III (08 lectures) – Mining Data Streams: Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-time Analytics Platform (RTAP) applications, case studies – real-time sentiment analysis, stock market predictions.
Unit IV (08 lectures) – Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-euclidean space, clustering for streams and parallelism.
Unit V (08 lectures) – Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual data analysis techniques, interaction techniques, systems and applications. Introduction to R – R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Textbooks and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Part I: Frame Works and Visualization
1 Frameworks and visualization
Frameworks and visualization are two important aspects of software development that are commonly used
together to create applications that are powerful, flexible, and user-friendly.
A framework is a pre-existing set of tools, libraries, and code structures that provide a foundation for
building software applications. Frameworks are designed to simplify the development process by providing
reusable components that can be customized and configured to meet the specific needs of a project.
There are many popular frameworks available for various programming languages, such as Django for
Python, Ruby on Rails for Ruby, and Angular for JavaScript.
Visualization refers to the process of creating graphical representations of data and information. Visu-
alizations are used to help people better understand complex information and to communicate insights and
ideas more effectively.
Visualization can be done using a variety of tools and technologies, such as charts, graphs, diagrams, and
maps. These visualizations can be created using programming languages like Python or JavaScript, or with
specialized tools like Tableau or Power BI.
Frameworks and visualization can be used together to create powerful applications that incorporate data
analysis, reporting, and visualization. For example, a web application built on the Django framework could
use JavaScript visualization libraries like D3.js or Highcharts.js to create interactive charts and graphs that
help users understand complex data.
1.1 MapReduce, Hadoop
Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment. The Hadoop architecture has two basic components. The first is HDFS (Hadoop Distributed File System), the storage layer, which allows data of various formats to be stored across a cluster; the components of Hadoop and the working of MapReduce are illustrated in Figures 1 and 2.
The second is YARN, which handles resource management in Hadoop and allows parallel processing over the data stored across HDFS. MapReduce is the core component for data processing in the Hadoop framework.
Figure 1: Components of Hadoop Framework.
Figure 2: Working of MapReduce.
MapReduce is a processing technique built on the divide-and-conquer principle. It consists of two tasks: Map and Reduce. The Map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs). The Reduce task then takes the output of the Map task as its input, combines those tuples into a smaller set of tuples, and produces the final result. The working of MapReduce is illustrated in Figure 2.
Figure 3: Working of Hadoop Framework.
1.1.1 How the MapReduce Algorithm Works
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing. The data goes through the following phases:
• Input Splits: In this phase, the input data set is divided into fixed-size pieces called input splits.
• Mapping: This is the first computation phase in the execution of a MapReduce program. Each input split is processed in parallel by a map task, which performs the required computation on it. The output of the Map function is a set of key-value pairs, for example of the form <word, frequency>.
• Shuffling: The Shuffle function is also known as the “Combine” function. It performs the following two sub-steps:
– Merging
– Sorting
This phase consumes the output of the mapping phase and performs these two sub-steps on every key-value pair.
– The merging step combines all key-value pairs that have the same key.
– The sorting step takes the input from the merging step and sorts all key-value pairs by key.
Finally, the Shuffle function passes a sorted list of <Key, List<Value>> pairs to the next step.
• Reducing: In this phase, the output values from the shuffling phase are aggregated. It combines the values for each key and returns a single output value per key. In short, this phase summarizes the complete dataset.
Let’s understand this with an example. Consider the following input data for your MapReduce program:
“Welcome to Hadoop Class Hadoop is good, Hadoop is bad”
The final output of the MapReduce word-count task is shown in Table 1.
Table 1: Final output of the MapReduce task.
bad 1
class 1
good 1
hadoop 3
is 2
to 1
welcome 1
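To make the phases concrete, the following is a minimal base-R sketch that simulates the Map, Shuffle, and Reduce steps for the word-count example above; it is an illustration of the idea only, not code that runs on a Hadoop cluster.

# Simulated MapReduce word count in base R (illustration only, not Hadoop).
input <- "Welcome to Hadoop Class Hadoop is good, Hadoop is bad"

# Map: emit a <word, 1> pair for every word (punctuation stripped, lower-cased)
words  <- tolower(unlist(strsplit(input, "[^A-Za-z]+")))
mapped <- lapply(words, function(w) list(key = w, value = 1))

# Shuffle: merge values that share a key, then sort by key
values_by_key <- split(sapply(mapped, `[[`, "value"),
                       sapply(mapped, `[[`, "key"))
values_by_key <- values_by_key[order(names(values_by_key))]

# Reduce: aggregate the list of values for each key into a single count
reduced <- sapply(values_by_key, sum)
print(reduced)
#     bad   class    good  hadoop      is      to welcome
#       1       1       1       3       2       1       1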
1.2 Pig
Apache Pig is a platform for analyzing large datasets that are stored in Hadoop Distributed File System
(HDFS). The Pig platform provides a high-level language called Pig Latin, which is used to write data
processing programs that are executed on Hadoop clusters.
The architecture of Pig consists of the following components:
1. Pig Latin Parser: This component is responsible for parsing the Pig Latin scripts written by users
and converting them into a series of logical execution plans.
2. Logical Plan Generator: This component generates a logical execution plan from the parsed Pig
Latin script. The logical plan represents the data flow operations required to execute the script.
3. Optimization and Compilation: This component optimizes the logical execution plan generated
by the previous component and compiles it into a physical execution plan that can be executed on
Hadoop.
4. Execution Engine: This component executes the physical execution plan on the Hadoop cluster,
processing the data stored in HDFS and generating output.
5. UDFs: User-Defined Functions (UDFs) are custom functions that can be written in Java, Python or
any other language supported by Hadoop, and can be integrated with Pig to perform custom data
processing operations.
Overall, the architecture of Pig provides a scalable, efficient and flexible platform for analyzing large datasets
in Hadoop, and is widely used in big data processing applications.
1.3 Hive
Hive is a data warehouse software built on top of Hadoop that allows for querying and analysis of large
datasets stored in Hadoop Distributed File System (HDFS). Hive provides a SQL-like interface called HiveQL
(HQL) that enables users to write queries against the data stored in Hadoop, without needing to know how
to write MapReduce programs.
Hive architecture consists of the following components:
1. Metastore: This component stores metadata about the data stored in HDFS, such as the schema and
the location of tables.
2. Driver: This component accepts HiveQL queries, manages the life cycle of each query, and coordinates with the compiler and execution engine to run it on the Hadoop cluster.
3. Compiler: This component parses the HiveQL queries, converts them into logical and physical exe-
cution plans, and optimizes them for execution on Hadoop.
4. Execution Engine: This component executes the compiled MapReduce programs on the Hadoop
cluster, processing the data stored in HDFS and generating output.
5. UDFs: User-Defined Functions (UDFs) are custom functions that can be written in Java, Python or
any other language supported by Hadoop, and can be integrated with Hive to perform custom data
processing operations.
Overall, Hive provides a powerful and flexible platform for analyzing large datasets stored in Hadoop, using
a familiar SQL-like interface that is easy to use for users who are familiar with SQL.
1.4 HBase
HBase is a column-oriented NoSQL database built on top of Hadoop Distributed File System (HDFS). It is
designed to handle large volumes of structured and semi-structured data in a distributed environment.
HBase architecture consists of the following components:
1. RegionServer: This component manages regions of data stored in HDFS, and provides read and
write access to the data stored in those regions.
2. HMaster: This component manages the assignment of regions to RegionServers, handles schema
changes and metadata management, and provides monitoring and administration of the HBase cluster.
3. ZooKeeper: This component provides coordination services for distributed systems, such as leader
election, configuration management, and synchronization.
4. HDFS: HBase uses HDFS as its underlying storage layer for storing data, and it stores data in HDFS
files called HFiles.
5. Clients: HBase provides client libraries for Java, Python, and other languages, which can be used to
interact with the HBase cluster and perform read and write operations on the data stored in HBase.
HBase provides several features that make it well-suited for handling large volumes of data, including auto-
matic sharding, high write throughput, and support for transactions. HBase is commonly used in applications
that require real-time access to large volumes of data, such as social media platforms, e-commerce websites,
and financial trading systems.
1.5 MapR
MapR is a data platform that provides a complete set of data management, storage, and processing services
for big data applications. MapR is built on top of Apache Hadoop and extends it with additional features
and capabilities.
MapR architecture consists of the following components:
1. MapR-FS: This component is a distributed file system that provides scalable and reliable storage for
big data. MapR-FS is designed to be highly available and fault-tolerant, and it provides advanced
features such as snapshots, mirroring, and multi-tenancy.
2. MapR-DB: This component is a NoSQL database that provides real-time access to data stored in
MapR-FS. MapR-DB supports a wide range of data models, including key-value, JSON, and binary
formats, and it provides features such as automatic sharding, replication, and secondary indexing.
3. MapR Streams: This component is a messaging system that allows for real-time processing of data streams. MapR Streams is compatible with the Apache Kafka API and provides advanced features such as global replication, message-level security, and integrated management.
4. MapR Analytics: This component provides a set of tools for processing and analyzing data stored
in MapR-FS and MapR-DB. MapR Analytics includes support for Apache Spark, Apache Drill, and
other popular big data processing frameworks.
5. MapR Control System: This component provides a centralized management console for the MapR
platform, allowing administrators to monitor and manage the entire system from a single interface.
Overall, MapR provides a comprehensive platform for managing and processing big data, with advanced
features and capabilities that make it well-suited for enterprise-level applications.
1.6 Sharding
Sharding is a technique used in distributed database systems to partition data across multiple nodes or
servers, in order to improve scalability, availability, and performance.
In a sharded database, data is divided into smaller subsets called shards or partitions, which are dis-
tributed across multiple servers. Each server is responsible for storing and processing a specific subset of
data. This allows the system to handle larger amounts of data and more concurrent requests, while also
improving fault tolerance and reducing single points of failure.
Sharding can be done in different ways, depending on the specific needs of the application and the
database system being used. Some common sharding techniques include:
1. Range-based sharding: In this technique, data is partitioned based on a specific range of values,
such as a range of dates, geographical locations, or customer IDs. Each shard is responsible for storing
data within a specific range.
2. Hash-based sharding: In this technique, data is partitioned based on a hash function applied to
a specific field or set of fields. The hash function maps each record to a specific shard, based on the
result of the hash.
3. Round-robin sharding: In this technique, records are assigned to shards in a rotating, circular fashion. This technique is simple and spreads records evenly across shards, but because placement ignores the content of the records, related data can end up on different shards, which makes targeted lookups and range queries less efficient.
Sharding can provide significant benefits for large-scale distributed database systems, but it can also in-
troduce additional complexity and management overhead. Properly designing and implementing a sharded
database requires careful planning and consideration of factors such as data distribution, fault tolerance,
and performance.
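As a toy illustration of the hash-based technique described above, the base-R sketch below assigns record keys to one of four hypothetical shards using a simple character-sum hash; a production system would use a stronger hash function and would also handle replication and rebalancing.

# Toy hash-based sharding in base R (illustration only).
n_shards <- 4

shard_for <- function(key, n = n_shards) {
  h <- sum(utf8ToInt(as.character(key)))  # crude hash: sum of character codes
  (h %% n) + 1                            # shard ids 1 .. n
}

customer_ids <- c("C1001", "C1002", "C1003", "C1004", "C1005")
data.frame(key = customer_ids,
           shard = sapply(customer_ids, shard_for),
           row.names = NULL)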
1.7 NoSQL databases
NoSQL databases are a type of database management system that do not use traditional SQL-based relational
data models. NoSQL databases are designed to handle large amounts of unstructured, semi-structured, and
structured data in a distributed and scalable manner.
NoSQL databases are generally classified into four categories:
1. Document-based databases: These databases store and manage data in the form of documents, typ-
ically using a JSON or BSON data model. Examples of document-based databases include MongoDB,
Couchbase, and Amazon DocumentDB.
2. Key-value stores: These databases store and manage data as key-value pairs, similar to a hash table.
Examples of key-value stores include Redis, Riak, and Amazon DynamoDB.
3. Column-family stores: These databases store and manage data as columns and column families,
similar to a table in a relational database. Examples of column-family stores include Apache Cassandra
and HBase.
4. Graph databases: These databases store and manage data as nodes and edges in a graph data model,
allowing for efficient traversal and analysis of complex relationships. Examples of graph databases
include Neo4j, ArangoDB, and Amazon Neptune.
NoSQL databases offer several advantages over traditional relational databases, including:
1. Scalability: NoSQL databases are designed to scale horizontally across multiple servers, allowing
them to handle large volumes of data and high levels of concurrency.
2. Flexibility: NoSQL databases can handle different types of data, including unstructured and semi-
structured data, which can be difficult to manage in a traditional relational database.
3. Performance: NoSQL databases can provide high performance for specific types of queries, such as
those that require complex data processing or real-time analysis.
NoSQL databases have become increasingly popular in recent years, particularly for applications that require
high scalability, flexibility, and performance. However, NoSQL databases also have some disadvantages, such
as less robust consistency guarantees and a lack of standardization across different database types.
1.8 S3
Amazon S3 (Simple Storage Service) is a scalable, secure, and durable cloud storage service offered by
Amazon Web Services (AWS). S3 allows users to store and retrieve data from anywhere on the internet,
using a simple web interface, API, or command-line tools.
S3 provides several benefits over traditional on-premises storage solutions, including:
1. Scalability: S3 can scale to store and retrieve any amount of data, from a few gigabytes to multiple
petabytes, and it can handle millions of requests per second.
2. Durability: S3 is designed to provide 99.999999999% (11 nines) durability for stored objects, using
multiple layers of redundancy and automatic error correction.
3. Security: S3 provides strong encryption for data in transit and at rest, and it offers access controls
and permissions to ensure that only authorized users can access data.
4. Cost-effectiveness: S3 provides a pay-as-you-go pricing model, with no upfront costs or minimum
usage requirements.
S3 can be used for a wide range of use cases, including:
1. Data backup and recovery: S3 can be used to store backup copies of data, and it can be configured
to automatically replicate data to multiple regions for disaster recovery purposes.
2. Content delivery: S3 can be used to store and distribute content, such as images, videos, and other
media files, through a content delivery network (CDN).
3. Big data analytics: S3 can be used as a data lake to store large amounts of data for analytics
purposes, and it can be integrated with other AWS services, such as Amazon EMR, to process and
analyze data.
4. Application hosting: S3 can be used to store and serve static web content, such as HTML pages
and JavaScript files, for web applications.
Overall, S3 is a highly scalable and durable cloud storage service that offers a wide range of features and
benefits for storing, managing, and accessing data in the cloud.
1.9 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system that is designed to store large
amounts of data across many commodity hardware nodes in a Hadoop cluster. HDFS is one of the core
components of the Hadoop ecosystem, and it provides a fault-tolerant and scalable solution for storing and
processing big data.
HDFS works by dividing large data files into smaller blocks and distributing them across multiple nodes
in a cluster. Each block is replicated across several nodes to ensure data durability and availability in case
of node failures. HDFS uses a master/slave architecture, where a NameNode serves as the master node and
manages the file system metadata, while multiple DataNodes serve as slave nodes and store the actual data
blocks.
HDFS provides several benefits over traditional file systems, including:
1. Scalability: HDFS can store and process petabytes or even exabytes of data across thousands of
nodes in a Hadoop cluster, allowing it to handle large-scale data processing workloads.
2. Fault tolerance: HDFS is designed to be fault-tolerant, with data replication and block-level check-
sums to ensure data durability and availability in case of node failures.
3. Data locality: HDFS is optimized for data locality, which means that data processing tasks can be
executed on the same nodes where the data is stored, minimizing network overhead and improving
performance.
4. Open source: HDFS is an open-source software project that is maintained by the Apache Software
Foundation, which means that it is freely available and can be customized and extended by developers.
HDFS is commonly used in conjunction with other Hadoop ecosystem tools, such as MapReduce, HBase,
and Spark, to process and analyze large data sets.
1.10 Visualization
Visualization refers to the graphical representation of data and information, often in the form of charts,
graphs, maps, and other visual aids. The purpose of visualization is to make complex data and information
more accessible and understandable to users, by presenting it in a visually appealing and interactive format.
Visualization can be used for a wide range of applications, including:
1. Data exploration: Visualization can be used to explore and analyze large data sets, by visually
representing patterns, trends, and relationships in the data.
2. Data communication: Visualization can be used to communicate complex data and information to
non-expert audiences, by presenting it in a clear and intuitive format.
3. Decision making: Visualization can be used to support decision making processes, by providing
decision makers with actionable insights and information.
4. Storytelling: Visualization can be used to tell compelling stories and narratives based on data and
information, by presenting it in a visually engaging and interactive format.
There are many different types of visualization techniques and tools, including:
1. Charts and graphs: These are the most common forms of visualization, and they include bar charts,
line charts, scatter plots, and many others.
2. Maps: Maps can be used to visualize geographic data, such as the distribution of population, resources,
or economic activity.
3. Infographics: Infographics are visual representations of data and information that are designed to
communicate complex concepts in a clear and engaging way.
4. Dashboards: Dashboards are interactive visualizations that allow users to explore and analyze data
in real time, by displaying key performance indicators and other metrics.
Overall, visualization is a powerful tool for understanding and communicating data and information, and it
is increasingly being used in a wide range of fields, including business, science, healthcare, and journalism.
1.10.1 Visual data analysis techniques
Visual data analysis techniques are used to explore and analyze data visually, by creating charts, graphs,
maps, and other visual aids that help to identify patterns, trends, and relationships in the data. Some
common visual data analysis techniques include:
1. Scatter plots: Scatter plots are used to display the relationship between two variables, by plotting
each data point on a graph with one variable on the x-axis and the other variable on the y-axis.
2. Heat maps: Heat maps are used to display the density of data points across a two-dimensional space,
by using colors to represent the intensity of the data.
3. Box plots: Box plots are used to display the distribution of data across different categories or groups,
by showing the range, median, and quartiles of the data.
4. Network diagrams: Network diagrams are used to visualize complex relationships between entities
or nodes in a network, by using nodes and edges to represent the entities and their connections.
5. Geographic maps: Geographic maps are used to display data based on their location, by using colors
or symbols to represent the data points on a map.
6. Time series charts: Time series charts are used to display changes in data over time, by plotting
the data on a graph with time on the x-axis and the data value on the y-axis.
7. Bubble charts: Bubble charts are used to display data in three dimensions, by using a third variable
to determine the size of the data point on a two-dimensional graph.
8. Histograms: Histograms are used to display the distribution of data across a single variable, by
grouping the data into intervals and displaying the frequency of data points within each interval.
Overall, visual data analysis techniques provide a powerful way to explore and analyze complex data sets,
by presenting the data in a way that is easy to understand and interpret. By using visual data analysis
techniques, data analysts and scientists can quickly identify patterns, trends, and relationships in the data,
and make data-driven decisions that can drive business and scientific outcomes.
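Several of these techniques can be produced directly in base R; the short sketch below uses the built-in mtcars data set to draw a scatter plot, a histogram, and a box plot.

# Base-R sketches of three common visual analysis techniques (mtcars is built in).
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot: weight vs fuel economy")

hist(mtcars$mpg, breaks = 8,
     xlab = "Miles per gallon", main = "Histogram of mpg")

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Box plot of mpg by cylinder count")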
1.10.2 Interaction techniques
Interaction techniques are used in visual data analysis to enable users to interact with data visualizations
and explore the underlying data in more detail. Some common interaction techniques include:
1. Zooming and panning: These techniques allow users to zoom in on specific areas of a visualization
or pan across different parts of the visualization to explore it in more detail.
2. Brushing and linking: These techniques allow users to highlight specific data points or regions of a
visualization and see how they relate to other parts of the visualization or other visualizations.
3. Filtering and selection: These techniques allow users to select specific data points or subsets of
data based on specific criteria or filters, to explore specific parts of the data in more detail.
4. Tooltips and annotations: These techniques provide additional information about specific data
points or regions of a visualization, by displaying tooltips or annotations when users hover over or click
on specific parts of the visualization.
5. Interactivity and animation: These techniques allow users to interact with visualizations in real
time, by using sliders, buttons, or other interactive elements to modify the visualization parameters or
animate the data over time.
Overall, interaction techniques provide a powerful way to explore and analyze data in a more dynamic and
interactive way, by enabling users to manipulate and interact with visualizations to gain deeper insights
into the underlying data. By using interaction techniques, data analysts and scientists can quickly identify
patterns, trends, and relationships in the data, and make data-driven decisions that can drive business and
scientific outcomes.
1.10.3 Systems and applications
Systems and applications are two different types of software that are used in computing.
A system is a collection of software components and hardware that work together to perform a specific
task or set of tasks. Systems are often designed to provide an underlying infrastructure for applications
to run on. Examples of systems include operating systems, database management systems, and network
systems.
An application, on the other hand, is a software program that is designed to perform a specific task or
set of tasks, often for end-users. Applications are built on top of systems, and they rely on the underlying
infrastructure provided by the system to operate. Examples of applications include word processors, web
browsers, and email clients.
Both systems and applications are essential components of modern computing. Systems provide the
underlying infrastructure and services that enable applications to run, while applications provide the user-
facing interfaces and functionality that end-users interact with directly. Together, systems and applications
enable us to perform a wide range of tasks, from simple word processing to complex data analysis and
machine learning.
Part II: Introduction to R
1 Introduction to R
R is a programming language and software environment that is widely used for statistical computing, data
analysis, and visualization. It was developed in the early 1990s by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand.
R provides a wide range of statistical and graphical techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis, classification, clustering, and more. R is also highly extensible,
with a large number of packages and libraries available for specialized tasks such as machine learning, text
analysis, and image processing.
One of the key features of R is its flexibility and ease of use. R provides a simple and intuitive syntax
that is easy to learn and use, even for those without a strong programming background. R also has a large
and active community of users and developers, who contribute to the development of packages and provide
support and resources for users.
R can be used in a variety of settings, including academic research, data analysis and visualization,
and industry applications. Some common use cases for R include analyzing large data sets, creating data
visualizations, and building predictive models for machine learning and data science applications.
Overall, R is a powerful and versatile tool for statistical computing, data analysis, and visualization, with a large and active community of users and developers. Whether you are a student, researcher, data analyst, or data scientist, R provides a flexible and powerful environment for exploring and analyzing data.
1.1 R graphical user interfaces
R provides a number of graphical user interfaces (GUIs) that can make it easier to work with R for those
who are new to the language or who prefer a more visual approach to data analysis. Some popular R GUIs
include:
1. RStudio: RStudio is a free and open-source integrated development environment (IDE) for R that
provides a modern and user-friendly interface. It includes a code editor, console, debugging tools, and
data visualization capabilities.
2. RKWard: RKWard is another free and open-source GUI for R that provides a range of features for
data analysis, including a spreadsheet-like data editor, syntax highlighting, and built-in support for
common statistical tests.
3. Jupyter Notebooks: Jupyter Notebooks is a web-based tool that provides an interactive environment
for working with data and code. It supports multiple programming languages, including R, and provides
a flexible and customizable interface for data analysis.
4. Tinn-R: Tinn-R is a lightweight and customizable GUI for R that provides a simple interface for
working with R scripts and data files.
5. Emacs + ESS: Emacs is a powerful text editor that can be used with the Emacs Speaks Statistics
(ESS) package to provide an integrated environment for R development and data analysis.
Overall, R provides a wide range of GUIs that can make it easier to work with R, depending on your
preferences and needs. Whether you prefer a modern and user-friendly interface like RStudio, or a more
lightweight and customizable GUI like Tinn-R, there is likely a GUI that will meet your needs.
1.2 Data import and export
In R, data import and export are essential tasks for data analysis. R provides a variety of functions for
importing and exporting data from a wide range of file formats. Some common data import/export functions
in R include:
1. read.csv() and write.csv(): These functions are used to import and export data in CSV (Comma
Separated Values) format. CSV is a commonly used file format for storing tabular data.
2. read.table() and write.table(): These functions are used to import and export data in a variety of
text-based formats, including CSV, TSV (Tab Separated Values), and other delimited text formats.
3. read.xlsx() and write.xlsx(): These functions (provided by add-on packages such as openxlsx or xlsx) are used to import and export data in Excel (.xlsx) format. Excel is a widely used spreadsheet application, and being able to import and export data from Excel is an important task in data analysis.
4. readRDS() and saveRDS(): These functions are used to import and export R objects in binary
format. R objects can be complex data structures, and saving them in binary format can be a more
efficient way of storing and sharing data than text-based formats.
5. read.spss() and write.foreign(): The foreign package provides read.spss() to import data from SPSS files and write.foreign() to export data for use in SPSS (the haven package's read_sav() and write_sav() are a more modern alternative). SPSS is a statistical software package that is commonly used in social science research.
6. dbReadTable() and dbWriteTable(): The DBI package provides these functions, along with dbGetQuery(), to import data from and export data to SQL databases. SQL is a popular database language, and being able to interact with SQL databases is an important task in data analysis.
These are just a few examples of the many data import/export functions available in R. The specific function
used will depend on the file format and data source being used. Overall, R provides a wide range of tools
for importing and exporting data, making it a versatile and powerful tool for data analysis.
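A few of these import/export functions in action, using hypothetical file names purely for illustration:

# Hypothetical file names, for illustration only.
df <- read.csv("sales.csv", stringsAsFactors = FALSE)     # import a CSV file
write.csv(df, "sales_copy.csv", row.names = FALSE)        # export to CSV

tsv <- read.table("log.tsv", header = TRUE, sep = "\t")   # tab-separated text

saveRDS(df, "sales.rds")      # save any R object in compact binary form
df2 <- readRDS("sales.rds")   # load it back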
1.3 Attribute and data types
In data analysis, attributes refer to the characteristics or properties of a variable, while data types refer to
the format in which data is stored. In R, there are several common attribute types and data types that are
used in data analysis.
1. Attribute Types:
• Names: The names attribute specifies the names of the variables in a dataset.
• Class: The class attribute specifies the type of data stored in a variable (e.g. numeric, character, factor).
• Dimensions: The dimensions attribute specifies the dimensions of a dataset (e.g. number of rows and columns).
• Factors: Factors are a specific type of attribute that represent categorical data with levels.
2. Data Types:
• Numeric: Numeric data types represent numerical values (e.g. integers, decimals, etc.). Numeric data types can be further divided into integer (e.g. 1, 2, 3) and floating-point (e.g. 1.2, 3.14) types.
• Character: Character data types represent text data (e.g. "hello", "world").
• Logical: Logical data types represent boolean values (TRUE or FALSE).
• Date and time: Date and time data types represent date and time values (e.g. "2023-04-14", "15:30:00").
• Factor: Factor data types represent categorical data with levels (e.g. "Male", "Female").
• Complex: Complex data types represent complex numbers with real and imaginary parts.
Understanding attribute types and data types is important in data analysis as they can affect the way data is
stored, manipulated, and analyzed. By correctly specifying the attribute types and data types of variables in
a dataset, data analysts can ensure that they are working with the appropriate data types and can perform
accurate and efficient analyses.
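The following base-R snippet illustrates the common data types and how their attributes can be inspected:

x <- c(1.5, 2, 3)                         # numeric (floating-point)
n <- 42L                                  # integer
s <- c("hello", "world")                  # character
b <- TRUE                                 # logical
d <- as.Date("2023-04-14")                # date
g <- factor(c("Male", "Female", "Male"))  # factor (categorical with levels)

class(x); class(g)      # class attribute of a variable
levels(g)               # levels of a factor
m <- matrix(1:6, nrow = 2)
dim(m)                  # dimensions attribute: 2 3
names(iris)             # names attribute of a built-in data frame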
1.4 Descriptive statistics
Descriptive statistics are a set of methods used to describe and summarize important features of a dataset.
These methods provide a way to organize and analyze large amounts of data in a meaningful way. Some
common descriptive statistics include:
1. Measures of central tendency: These statistics give an idea of where the data is centered. The
most commonly used measures of central tendency are the mean, median, and mode.
2. Measures of variability: These statistics give an idea of how spread out the data is. The most
commonly used measures of variability are the range, variance, and standard deviation.
3. Frequency distributions: These show the frequency of each value in a dataset.
4. Percentiles: These divide the dataset into equal parts based on a percentage. For example, the 50th
percentile is the value that separates the top 50% of values from the bottom 50%.
5. Box plots: These show the distribution of a dataset and identify outliers.
6. Histograms: These show the distribution of a dataset by dividing it into bins and counting the
number of values in each bin.
7. Scatter plots: These show the relationship between two variables by plotting them on a graph.
Descriptive statistics are important in data analysis as they provide a way to summarize and understand
large amounts of data. They can also help identify patterns and relationships within a dataset. By using
descriptive statistics, data analysts can gain insights into the characteristics of the data and make informed
decisions about how to further analyze it.
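In R, most of these descriptive statistics are available as built-in functions; a quick sketch using the built-in mtcars data set:

mean(mtcars$mpg)      # central tendency
median(mtcars$mpg)
range(mtcars$mpg)     # variability
var(mtcars$mpg)
sd(mtcars$mpg)
quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))  # percentiles
table(mtcars$cyl)     # frequency distribution of a discrete variable
summary(mtcars$mpg)   # min, quartiles, median, mean, max in one call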
1.5 Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing and visualizing data in order to summarize its
main characteristics and identify patterns and relationships within the data. The goal of EDA is to generate
hypotheses, test assumptions, and provide a basis for more in-depth analysis.
EDA involves several steps, including:
1. Data collection: This involves obtaining the data from various sources and ensuring that it is in the
appropriate format for analysis.
2. Data cleaning: This involves identifying and correcting errors, missing values, and outliers in the
data.
3. Data visualization: This involves creating various graphs and charts to visualize the data and identify
patterns and relationships.
4. Summary statistics: This involves calculating summary statistics such as means, medians, standard
deviations, and variances to describe the central tendency and variability of the data.
5. Hypothesis testing: This involves testing hypotheses about the data using statistical methods.
6. Machine learning: This involves applying machine learning algorithms to the data in order to identify
patterns and relationships and make predictions.
EDA is an important step in data analysis as it provides a basis for further analysis and helps ensure that
the data is appropriate for the analysis being performed. By understanding the main characteristics of the
data and identifying patterns and relationships, analysts can make informed decisions about how to proceed
with their analysis and generate hypotheses that can be tested using more advanced statistical methods.
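A minimal EDA pass in base R might look like the sketch below, which uses the built-in airquality data set (which contains missing values) to combine cleaning, summary statistics, and quick visual checks:

df <- airquality                     # built-in data set with some NAs
summary(df)                          # summary statistics; also reveals NAs
colSums(is.na(df))                   # where the missing values are
df_clean <- na.omit(df)              # simple cleaning: drop incomplete rows

pairs(df_clean[, c("Ozone", "Wind", "Temp")])   # pairwise relationships
cor(df_clean$Ozone, df_clean$Temp)              # quantify one relationship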
1.6 Visualization before analysis
Visualization before analysis is an important step in data analysis. It involves creating visual representations
of the data in order to gain a better understanding of its characteristics and identify patterns and relation-
ships. By visualizing the data before performing any analysis, analysts can gain insights that may not be
apparent from a simple numerical analysis.
There are several benefits of visualizing data before analysis, including:
1. Identifying outliers: Visualizations can help identify outliers, which are data points that fall far
outside the typical range of values. Outliers can significantly affect the results of an analysis, and
visualizing the data can help analysts identify them and determine whether they should be included
or excluded from the analysis.
2. Understanding the distribution of data: Visualizations can help analysts understand the distri-
bution of the data, including its shape, spread, and skewness. This can help them choose appropriate
statistical methods for analysis.
3. Identifying relationships between variables: Visualizations can help identify relationships be-
tween variables, such as correlations or trends. This can help analysts determine which variables to
include in their analysis and how to model the relationship between them.
4. Communicating results: Visualizations can be used to communicate results to stakeholders in a clear
and concise manner. By presenting data in a visually appealing way, analysts can help stakeholders
understand the main insights and implications of the analysis.
In summary, visualizing data before analysis is an important step in data analysis that can help analysts
gain insights, identify outliers, understand the distribution of data, and communicate results to stakeholders.
1.7 Analytics for unstructured data
Analytics for unstructured data refers to the process of analyzing and extracting insights from non-tabular,
unstructured data sources such as text, images, audio, and video. Unstructured data is typically generated at
a high volume, velocity, and variety, making it difficult to analyze using traditional data analysis techniques.
There are several analytics techniques that can be used to analyze unstructured data, including:
1. Natural Language Processing (NLP): NLP is a field of study that focuses on the interaction
between human language and computers. It involves using algorithms to extract meaning and insights
from unstructured text data, including sentiment analysis, topic modeling, and entity extraction.
2. Image and video analytics: Image and video analytics involve using computer vision techniques
to extract insights from visual data. This can include facial recognition, object detection, and image
segmentation.
3. Speech and audio analytics: Speech and audio analytics involve using signal processing techniques
to extract insights from audio data, such as speech recognition, speaker identification, and emotion
detection.
4. Machine learning: Machine learning algorithms can be used to analyze unstructured data by learning
from patterns and relationships in the data. This can include techniques such as clustering, classifica-
tion, and regression.
To analyze unstructured data effectively, it is important to have a robust infrastructure for data storage,
processing, and analysis. This may involve using distributed computing platforms such as Hadoop and Spark,
as well as specialized software tools for data preprocessing, feature extraction, and model development.
In summary, analytics for unstructured data involves using specialized techniques and tools to extract
insights from non-tabular, unstructured data sources. By analyzing unstructured data, organizations can
gain valuable insights into customer sentiment, product feedback, market trends, and other areas of interest.
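As a very small taste of text analytics in base R, the sketch below tokenizes two example reviews and scores them against a tiny, made-up sentiment lexicon; real NLP work would use dedicated packages and much larger lexicons or models.

# Toy sentiment scoring (made-up lexicon, for illustration only).
reviews  <- c("Great product, works great", "Poor battery and poor support")
positive <- c("great", "good", "excellent")
negative <- c("poor", "bad", "terrible")

score_review <- function(txt) {
  tokens <- tolower(unlist(strsplit(txt, "[^A-Za-z]+")))
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

sapply(reviews, score_review)   # 2 for the first review, -2 for the second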
Printed Page: 1 of 2
Subject Code: KIT601
Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data platform. 1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data analysis. 2
(b) Given data = {2,3,4,5,6,7; 1,5,3,6,7,8}. Compute the principal component using the PCA algorithm. 2
Printed Page: 2 of 2
Subject Code: KIT601
Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count the number of distinct elements in a data stream. 3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence c). 4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5