Big data refers to data sets, or combinations of data sets, whose volume, complexity, and rate of growth make them difficult to capture, manage, process, or analyse with traditional tools and technologies. Big data is a relatively new concept that encompasses huge quantities of data, social media analytics, and real-time data. Over the past decade, considerable effort and study have gone into developing proficient tools for performing various big data tasks. Because these collections of datasets are large and complex, they are difficult to process with traditional data processing applications, which makes the development of dedicated big data tools all the more necessary. In this survey, a broad collection of big data tools is illustrated and analysed with respect to their salient features.
A real-time big data sentiment analysis for Iraqi tweets using Spark Streaming – journalBEEI
The scale of data streaming in social networks, such as Twitter, is increasing exponentially. Twitter is one of the most important and suitable big data sources for machine learning research in terms of analysis, prediction, knowledge extraction, and opinion mining. People use the Twitter platform daily to express their opinions, a fundamental factor that influences their behavior. In recent years, the flow of Iraqi-dialect content has increased, especially on the Twitter platform. Sentiment analysis for different dialects and opinion mining has become a hot topic in data science research. In this paper, we develop a real-time analytic model for sentiment analysis and opinion mining of Iraqi tweets using Spark Streaming, and also create a dataset for researchers in this field. The Twitter handle Bassam AlRawi is the case study here. The new method is well suited to present-day machine learning applications and fast online prediction.
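As an illustration of what such a streaming pipeline can look like, the following minimal PySpark Structured Streaming sketch tags incoming tweet text with a toy keyword lexicon. The socket source, port, and lexicon are illustrative assumptions; the paper's actual Iraqi-dialect model and Twitter ingestion path are not reproduced here.

    # Hypothetical sketch: keyword-based sentiment tagging with PySpark
    # Structured Streaming. The socket source and tiny lexicon are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("TweetSentimentSketch").getOrCreate()

    # Read raw tweet text, one tweet per line, from a local socket (assumed source).
    tweets = (spark.readStream.format("socket")
              .option("host", "localhost").option("port", 9999).load())

    POSITIVE = {"good", "great", "happy"}   # toy lexicon, illustration only
    NEGATIVE = {"bad", "sad", "angry"}

    def label(text):
        words = set((text or "").lower().split())
        if words & POSITIVE:
            return "positive"
        if words & NEGATIVE:
            return "negative"
        return "neutral"

    label_udf = udf(label, StringType())
    scored = tweets.withColumn("sentiment", label_udf(tweets["value"]))

    # Print each labelled micro-batch to the console.
    query = scored.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()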
Big data refers to large volumes of structured, unstructured and semi-structured data that are difficult to manage and costly to store. Using exploratory analysis techniques to understand such raw data, while carefully balancing the benefits in terms of storage and retrieval techniques, is an essential part of Big Data. The research discusses MapReduce issues, a framework for the MapReduce programming model, and its implementation. The paper includes the analysis of Big Data using MapReduce techniques and the identification of a required document from a stream of documents. Identifying a required document is part of securing a stream of documents in the cyber world. The document may be significant in business, medical, social, or terrorism contexts.
Big Data Processing with Hadoop: A Review – IRJET Journal
1. This document provides an overview of big data processing with Hadoop. It defines big data and describes the challenges of volume, velocity, variety and variability.
2. Traditional data processing approaches are inadequate for big data due to its scale. Hadoop provides a distributed file system called HDFS and a MapReduce framework to address this.
3. HDFS uses a master-slave architecture with a NameNode and DataNodes to store and retrieve file blocks. MapReduce allows distributed processing of large datasets across clusters through mapping and reducing functions.
The document provides an overview of Hadoop and its core components. It discusses:
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
- The two core components of Hadoop are HDFS for distributed storage, and MapReduce for distributed processing. HDFS stores data reliably across machines, while MapReduce processes large amounts of data in parallel.
- Hadoop can operate in three modes - standalone, pseudo-distributed and fully distributed. The document focuses on setting up Hadoop in standalone mode for development and testing purposes on a single machine.
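To make the map and reduce roles mentioned above concrete, here is a minimal, single-machine Python sketch of the classic word-count pattern. It is only an illustration of the logical phases; a real Hadoop job distributes many map and reduce tasks across the cluster's DataNodes.

    # Toy, in-memory illustration of the MapReduce word-count pattern.
    from collections import defaultdict

    def map_phase(document):
        # Map: emit (word, 1) for every word in an input split.
        for word in document.split():
            yield word.lower(), 1

    def shuffle_phase(pairs):
        # Shuffle: group values by key, as Hadoop does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: sum the counts for each word.
        return {word: sum(counts) for word, counts in groups.items()}

    docs = ["big data needs big tools", "hadoop processes big data"]
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(shuffle_phase(pairs)))   # e.g. {'big': 3, 'data': 2, ...}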
An Efficient Approach for Clustering High Dimensional Data – IJSTA
The document discusses clustering high dimensional data using an efficient approach called "Big Data Clustering using k-Mediods BAT Algorithm" (KMBAT). KMBAT simultaneously considers all data points as potential exemplars and exchanges real-valued messages between data points until a high-quality set of exemplars and corresponding clusters emerges. It is demonstrated on Facebook user profile data stored in an HDInsight Hadoop cluster. KMBAT finds better clustering solutions than other methods in less time for high dimensional big data.
Big Data Mining Using Very-Large-Scale Data Processing Platforms – IJERA Editor
Big Data consists of large-volume, complex, growing data sets with multiple, heterogeneous sources. With the tremendous development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological and biomedical sciences. The MapReduce programming model, with its parallel processing ability, can be used to analyse such large-scale data. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google's MapReduce, or its open-source equivalent Hadoop, is a powerful tool for building such applications.
International Journal of Engineering Research and Development (IJERD) – IJERD Editor
This document provides an overview of data visualization and Tableau features. It discusses the shortcomings of traditional business intelligence tools, the business case for visual analytics, and Tableau's software ecosystem. The document also describes Tableau's desktop workspace, including the start page, data connections, worksheets, and features like dashboards, collaboration, data sources, visualizations, and security options. Tableau is presented as a tool that improves decision-making by enabling interactive data exploration and visualization.
1) The document discusses using k-means clustering to analyze big data. K-means is an algorithm that partitions data into k clusters based on similarity.
2) It provides background on big data characteristics like volume, variety, and velocity. It also discusses challenges of heterogeneous, decentralized, and evolving data.
3) The document proposes applying k-means clustering to big data to map data into clusters according to its properties in a fast and efficient manner. This allows statistical analysis and knowledge extraction from large, complex datasets.
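As a small, concrete companion to the k-means description above, the sketch below clusters a toy two-feature dataset with scikit-learn. The data and the choice of k=2 are assumptions for illustration, not the paper's experimental setup.

    # Minimal k-means example with scikit-learn; toy data and k=2 are assumed.
    import numpy as np
    from sklearn.cluster import KMeans

    # Each row is one record with two numeric features.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("labels:", kmeans.labels_)             # cluster index of each record
    print("centroids:", kmeans.cluster_centers_)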
Challenging issues of spatio-temporal data mining – Alexander Decker
This document discusses the challenging issues of spatio-temporal data mining. It begins with an introduction to spatio-temporal databases and how they differ from traditional databases by managing moving objects and their locations over time. It then provides an overview of spatial data mining and temporal data mining before focusing on spatio-temporal data mining, which aims to analyze large databases containing both spatial and temporal information. The document outlines some of the key challenges in applying traditional data mining techniques to spatio-temporal data due to its continuous and correlated nature.
The document describes an experiment comparing three big data analysis platforms: Apache Hive, Apache Spark, and R. Seven identical analyses of clickstream data were performed on each platform, and the time taken to complete each operation was recorded. The results showed that Spark was faster for queries involving transformations of big data, while R was faster for operations involving actions on big data. The document provides details on the hardware, software, data, and specific analytical tasks used in the experiment.
Big Data Mining - Classification, Techniques and Issues – Karan Deep Singh
The document discusses big data mining and provides an overview of related concepts and techniques. It describes how big data is characterized by large volume, variety, and velocity of data that is difficult to manage with traditional methods. Common techniques for big data mining discussed include NoSQL databases, MapReduce, and Hadoop. Some challenges of big data mining are also mentioned, such as dealing with high volumes of unstructured data and limitations of traditional databases in handling diverse and continuously growing data sources.
This document discusses dimensionality reduction techniques for data mining. It introduces principal component analysis (PCA) as an unsupervised linear algorithm that reduces the dimensionality of a dataset while retaining most of its information. PCA finds new variables, or principal components, that are smaller in number than the original variables. It provides a geometric and algebraic description of PCA. The document also describes a proposed data mining system architecture to examine a university course database. The architecture includes components for data warehousing, online analytical processing tools, and a graphical interface.
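A brief sketch of the PCA step described above, using scikit-learn on synthetic data; the input shape and the choice of two components are placeholders, not the proposed system's configuration.

    # Minimal PCA example: project 10-dimensional records onto 2 components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))      # 100 records, 10 original variables

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)    # 100 records, 2 principal components
    print(X_reduced.shape)
    print("explained variance ratio:", pca.explained_variance_ratio_)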
Mankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986, as per a report by the University of Southern California. Storing and monitoring this data in widely distributed environments 24/7 is a huge task for global service organizations. These datasets require high processing power, which cannot be offered by traditional databases because the data is stored in an unstructured format. Although one can use the MapReduce paradigm to solve this problem using Java-based Hadoop, it cannot provide maximum functionality. These drawbacks can be overcome using Hadoop Streaming techniques that allow users to define non-Java executables for processing these datasets. This paper proposes a THESAURUS model that allows a faster and easier form of business analysis.
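To illustrate the Hadoop Streaming idea (non-Java executables acting as the map and reduce steps), here is a hedged word-count sketch in Python. In a real job the mapper and reducer would usually be two separate scripts, and the path to the streaming jar varies by installation; the invocation shown in the comment is only an example.

    # Word count written for Hadoop Streaming: reads stdin, writes stdout.
    # Example invocation (jar path and options are installation-specific):
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #       -mapper "python wc.py map" -reducer "python wc.py reduce"
    import sys

    def mapper(lines):
        for line in lines:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer(lines):
        # Hadoop delivers mapper output sorted by key, so equal words are adjacent.
        current, count = None, 0
        for line in lines:
            word, value = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(value)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        role = sys.argv[1] if len(sys.argv) > 1 else "map"
        (mapper if role == "map" else reducer)(sys.stdin)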
Authentic and Anonymous Data Sharing with Data Partitioning in Big Data – rahulmonikasharma
Hadoop is a framework for the transformation and analysis of very large data. This paper presents a distributed approach to data storage with the help of the Hadoop Distributed File System (HDFS). This scheme overcomes the drawbacks of other data storage schemes, as it stores the data in a distributed format, so there is little chance of data loss. HDFS stores the data in the form of replicas, which is advantageous when a node fails: the user can easily recover from data loss, unlike other storage systems where lost data cannot be recovered. We propose an ID-based ring signature scheme to provide secure data sharing across the network, so that only authorized persons have access to the data. The system is made more resistant to attack with the help of the Advanced Encryption Standard (AES). Even if an attacker succeeds in obtaining the source data, they are unable to decode it.
Data mining involves using algorithms to automatically find patterns in large datasets. It is used to make predictions about future trends and behaviors to help companies make proactive decisions. The document discusses the history and evolution of data mining, from early data collection and storage to today's powerful algorithms and massive databases. Common data mining techniques are also outlined.
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
An introduction to Data Mining by Kurt Thearling – Pim Piepers
An Introduction to Data Mining Discovering hidden value in your data warehouse By Kurt Thearling Overview Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today’s business environment as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS – aciijournal
This document discusses big data summarization, including the challenges and potential solutions. It presents a framework for big data summarization with four main stages: 1) data clustering to group similar documents, 2) data generalization to abstract data to a higher conceptual level, 3) semantic term identification to identify metadata for more efficient data representation, and 4) evaluation of the summaries. Key challenges addressed include initializing clustering methods, selecting attributes to control generalization, and ensuring semantic associations in representations. Solutions proposed are detailed assessments of clustering initialization methods and statistical approaches for clustering, generalization and term identification.
A Review on Classification of Data Imbalance using Big Data – IJMIT JOURNAL
Classification is one of the data mining functions; it assigns items in a collection to target categories in order to provide more accurate predictions and analysis. Classification using supervised learning aims to identify the category of the class into which new data will fall. With the advancement of technology and the increase in real-time data generated from various sources such as the Internet, IoT and social media, processing such data becomes more demanding and challenging. One such processing challenge is data imbalance. In an imbalanced dataset, majority classes dominate minority classes, causing machine learning classifiers to be biased towards the majority classes; most classification algorithms then predict all test data as belonging to the majority classes. In this paper, the authors analyse data imbalance models using big data and classification algorithms.
A Comprehensive Study of the Big Data Environment and its Challenges – ijceronline
Big Data is a data analysis methodology enabled by recent advances in technology and architecture. Big data is a massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques. This paper provides insight into Big Data and discusses its nature and definition, which includes features such as Volume, Velocity, and Variety. This paper also examines the sources of big data generation, the tools available for processing large volumes of varied data, the applications of big data, and the challenges involved in handling big data.
This document provides an introduction and overview of the INF2190 - Data Analytics course. It discusses the instructor, Attila Barta, details on where and when the course will take place. It then provides definitions and history of data analytics, discusses how the field has evolved with big data, and references enterprise data analytics architectures. It contrasts traditional vs. big data era data analytics approaches and tools. The objective of the course is described as providing students with the foundation to become data scientists.
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I... – IRJET Journal
This document discusses big data, its data structures, the Hadoop Distributed File System (HDFS) architecture, and data replication in HDFS. It begins by defining big data and describing its four main data structures: structured, semi-structured, quasi-structured, and unstructured. It then provides details on the master/slave architecture of HDFS and the roles of the NameNode and DataNodes. It explains how data is replicated across multiple DataNodes for reliability and how the replicas are placed on different racks to prevent total data loss if a rack fails. The importance of data replication in HDFS for availability, reliability, and fault tolerance is also highlighted.
IRJET- Systematic Review: Progression Study on BIG DATA articles – IRJET Journal
This document provides a systematic review of research articles on big data analysis. It analyzed 64 articles published between 2014-2018 from IEEE Explorer and Google Scholar databases. Key findings include: the number of published articles has increased each year, reflecting the growing importance of big data; experimental and case study articles accounted for 25 of the analyzed papers; 19 articles were ultimately selected for review, with 11 from Google Scholar and 8 from IEEE Explorer. The review aims to provide an overview of current research progress on big data analysis techniques.
This document discusses big data tools and management at large scales. It introduces Hadoop, an open-source software framework for distributed storage and processing of large datasets using MapReduce. Hadoop allows parallel processing of data across thousands of nodes and has been adopted by large companies like Yahoo!, Facebook, and Baidu to manage petabytes of data and perform tasks like sorting terabytes of data in hours.
An elastic, effective, intelligent, and graceful networking architecture is needed to process massive data. Moreover, existing network architectures are largely incapable of handling such huge data: massive data pushes network resources to their limits, resulting in network congestion, poor performance, and degraded user experience. This work presents the current state-of-the-art research challenges and potential solutions for big data networking. More specifically, it presents the state of networking problems related to massive data in terms of requirements, capacity, operation, and data handling; it also introduces the MapReduce and Hadoop architectures together with their research requirements, fabric networks, and software-defined networks used in today's rapidly growing digital world, and compares and contrasts them to identify relevant drawbacks and solutions.
The document describes the rubrics module in Moodle, which allows grading student work using rubrics. It explains how to configure an activity to use a rubric, with options to create, edit, and delete rubrics. It indicates that a rubric contains a description of the task, the dimensions to be evaluated, a scale of achievement levels, and a table to structure it. The module converts the grade to the scale of the selected rubric.
The students Kike, Gabriel, Alfredo, and Martha discuss the upcoming elections at their school. They plan to vote for the candidate Rodrigo Ramírez because he promises to lower the prices of the required books and keep the bathrooms cleaner. When Rodrigo arrives to ask for their support, the students offer to convince others to vote for him in order to secure his candidacy.
This document outlines Aisling Meadow Homestead's goals and site features. The homestead aims to be self-sustainable through permaculture and organic methods, providing for its family and the community. Located on 15 acres in Hudson Valley, NY, the 5-acre homestead site features a pasture, beehives, chickens, a garden, and forest access. It faces challenges like predator risks and lack of water access and storage, but sees opportunities such as creating a pond, rainwater catchment, honey and egg sales, composting, and installing solar power. Photos further depict the homestead site.
Wireless Sensor Networks (WSNs) are a promising field for research. As the use of this technology increases, proper security must be provided, so ensuring the security of data and message communication, and controlling the use of data in a WSN, is of great importance. Because sensor networks interact with sensitive data and operate in hostile, unattended areas, these security concerns should be addressed from the time of system design. The paper presents a modified MoteSec security protocol, a security mechanism for wireless sensor networks. In this protocol, a hash-function-based approach is used to detect replay attacks. For data access control, a key-lock matching method, i.e. a memory data access control policy, is used to prevent unauthorized data access. An encoding and reconstruction scheme is used to identify attackers, and flooding attacks are detected by comparing data rates. There is currently extensive research in the area of wireless sensor network security.
Keywords: secure communication architecture, wireless sensor network security.
2 - The image of the doctor and the nurse (image of doctors & nurses) - Part 2 – May Haddad MD, MPH
The image of the doctor and the nurse
in "صبحية صحوية" – Part 2
Excerpts from "صبحية صحوية" – prepared by Dr. May Haddad
may_haddad@hotmail.com
June 2012 to June 2013
This illustrated article continues the first article, "The image of the doctor and the nurse in صبحية صحوية – 1", which covered excerpts from the period November 2011 to May 2012.
Ethics and Human Relations in the conduct of the university professor – Flavio Farah
The article contains a set of recommendations on the university professor's relationship with students. These recommendations stem from a non-authoritarian stance and are based on the author's knowledge, practice, and reflections on Ethics and Human Relations. Experience shows that the practices recommended below do not cause the professor to lose authority, as might appear at first sight; on the contrary, they cause the professor to be respected by the students.
Rhalf C. Llamera is seeking a position as a mechanical technician, machinist, or operator. He has over 13 years of experience in these roles for companies in the Philippines and Saudi Arabia, including as a mechanical technician for Arcelormittal Tubular Products Jubail from 2013 to present. His duties have included maintenance, repairs, inspections, and preventative maintenance on various machines and equipment. He is proficient in both mechanical and electrical troubleshooting.
A Comprehensive Study on Big Data Applications and Challenges – ijcisjournal
Big Data has gained much interest from the academia and the IT industry. In the digital and computing
world, information is generated and collected at a rate that quickly exceeds the boundary range. As
information is transferred and shared at light speed on optic fiber and wireless networks, the volume of
data and the speed of market growth increase. However, the fast growth of such large data generates numerous challenges, such as the rapid growth of data, transfer speed, diverse data, and security.
Even so, Big Data is still in its early stage, and the domain has not been reviewed in general. Hence, this
study expansively surveys and classifies an assortment of attributes of Big Data, including its nature,
definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a
data life cycle that uses the technologies and terminologies of Big Data. Map/Reduce is a programming model for efficient distributed computing that works well with semi-structured and unstructured data. It is a simple model, yet well suited to many applications such as log processing and web index building.
This document discusses big data and tools for analyzing big data. It defines big data as very large data sets that are difficult to analyze using traditional data management systems due to the volume, variety and velocity of the data. It describes characteristics of big data including the 5 V's: volume, variety, velocity, veracity and value. It also discusses some common tools for analyzing big data, including MapReduce, Apache Hadoop, Apache Mahout and Apache Spark. Finally, it briefly mentions the potential of quantum computing for big data analysis by being able to process exponentially large amounts of data simultaneously.
IRJET - Survey Paper on Map Reduce Processing using HADOOP – IRJET Journal
This document summarizes a survey paper on MapReduce processing using Hadoop. It discusses how big data is growing rapidly due to factors like the internet and social media. Traditional databases cannot handle big data. Hadoop uses MapReduce and HDFS to store and process extremely large datasets across commodity servers in a distributed manner. HDFS stores data in a distributed file system, while MapReduce allows parallel processing of that data. The paper describes the MapReduce process and its core functions like map, shuffle, reduce. It explains how Hadoop provides advantages like scalability, cost effectiveness, flexibility and parallel processing for big data.
A Survey on Data Mapping Strategy for data stored in the storage cloud – NavNeet KuMar
This document describes a method for processing large amounts of data stored in cloud storage using Hadoop clusters. Data is uploaded to cloud storage by users and then processed using MapReduce on Hadoop clusters. The method involves storing data in the cloud for processing and then running MapReduce algorithms on Hadoop clusters to analyze the data in parallel. The results are then stored back in the cloud for users to download. An architecture is proposed involving a controller that directs requests to Hadoop masters which coordinate nodes to perform mapping and reducing of data according to the algorithm implemented.
The document discusses big data testing using the Hadoop platform. It describes how Hadoop, along with technologies like HDFS, MapReduce, YARN, Pig, and Spark, provides tools for efficiently storing, processing, and analyzing large volumes of structured and unstructured data distributed across clusters of machines. These technologies allow organizations to leverage big data to gain valuable insights by enabling parallel computation of massive datasets.
Social Media Market Trender with Dache Manager Using Hadoop and Visualization... – IRJET Journal
This document proposes using Apache Hadoop and a data-aware cache framework called Dache to analyze large amounts of social media data from Twitter in real-time. The goals are to overcome limitations of existing analytics tools by leveraging Hadoop's ability to handle big data, improve processing speed through Dache caching, and provide visualizations of trends. Data would be grabbed from Twitter using Flume, stored in HDFS, converted to CSV format using MapReduce, analyzed using Dache to optimize Hadoop jobs, and visualized using tools like Tableau. The system aims to efficiently analyze social media trends at low cost using open source tools.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
This document summarizes a talk on using big data driven solutions to combat COVID-19. It discusses how big data preparation involves ingesting, cleansing, and enriching data from various sources. It also describes common big data technologies used for storage, mining, analytics and visualization including Hadoop, Presto, Kafka and Tableau. Finally, it provides examples of research projects applying big data and AI to track COVID-19 cases, model disease spread, and optimize health resource utilization.
Big data - what, why, where, when and how – bobosenthil
The document discusses big data, including what it is, its characteristics, and architectural frameworks for managing it. Big data is defined as data that exceeds the processing capacity of conventional database systems due to its large size, speed of creation, and unstructured nature. The architecture for managing big data is demonstrated through Hadoop technology, which uses a MapReduce framework and open source ecosystem to process data across multiple nodes in parallel.
DOCUMENT SELECTION USING MAPREDUCE, Yenumula B Reddy and Desmond Hill – ClaraZara1
Big data refers to large volumes of structured, unstructured and semi-structured data that are difficult to manage and costly to store. Using exploratory analysis techniques to understand such raw data, while carefully balancing the benefits in terms of storage and retrieval techniques, is an essential part of Big Data. The research discusses MapReduce issues, a framework for the MapReduce programming model, and its implementation. The paper includes the analysis of Big Data using MapReduce techniques and the identification of a required document from a stream of documents. Identifying a required document is part of securing a stream of documents in the cyber world. The document may be significant in business, medical, social, or terrorism contexts.
A Review Paper on Big Data and Hadoop for Data Science – ijtsrd
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it has become a complete subject that involves various tools, techniques and frameworks. Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Mr. Ketan Bagade | Mrs. Anjali Gharat | Mrs. Helina Tandel "A Review Paper on Big Data and Hadoop for Data Science" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-1, December 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29816.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/29816/a-review-paper-on-big-data-and-hadoop-for-data-science/mr-ketan-bagade
We are in the age of big data, which involves the collection of large datasets. Managing and processing large data sets is difficult with existing traditional database systems. Hadoop and MapReduce have become among the most powerful and popular tools for big data processing. Hadoop MapReduce, a powerful programming model, is used for analysing large sets of data with parallelization, fault tolerance and load balancing; its other features are that it is elastic, scalable and efficient. MapReduce is combined with the cloud to form a framework for the storage, processing and analysis of massive machine-maintenance data in a cloud computing environment.
Memory Management in BigData: A Perpective View – ijtsrd
The requirement to perform complicated statistical analysis of big data in institutions of engineering, scientific research, health care, commerce, banking and computer research is immense. However, the widely used desktop software packages such as R, Excel, Minitab and SPSS limit a researcher's ability to deal with big data, while big data analytic tools like IBM BigInsight, HP Vertica, SAP HANA and Pentaho come with expensive licenses. Apache Hadoop is an open-source distributed computing framework that uses commodity hardware. With this project, I intend to combine Apache Hadoop and the R software to develop an analytic platform that stores big data (using open-source Apache Hadoop) and performs statistical analysis (using the open-source R software). Due to the limitations of vertically scaling a computing unit, data storage is handled by several machines, and so analysis becomes distributed over all these machines. Apache Hadoop is what comes in handy in this environment. To store the massive quantities of data required by researchers, we can use commodity hardware and perform analysis in a distributed environment. Bhavna Bharti | Prof. Avinash Sharma "Memory Management in BigData: A Perpective View" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2 | Issue-4, June 2018, URL: http://www.ijtsrd.com/papers/ijtsrd14436.pdf http://www.ijtsrd.com/engineering/computer-engineering/14436/memory-management-in-bigdata-a-perpective-view/bhavna-bharti
How do data analysts work with big data and distributed computing frameworks.pdf – Soumodeep Nanee Kundu
This document discusses how data analysts work with big data and distributed computing frameworks. It begins by defining big data and explaining the challenges it poses in terms of volume, velocity, variety and veracity. It then explores how distributed computing frameworks like Hadoop, Spark and Storm enable data analysts to process large and complex datasets in parallel across clusters of machines. The role of data analysts is described as collecting, analyzing and interpreting data to generate insights. Analysts collaborate with data engineers and scientists, using tools like Python, SQL and Tableau. Examples are given of how big data is applied in domains like healthcare, IoT and fraud detection. Case studies on Google's PageRank and Twitter analytics are also summarized.
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT – ijwscjournal
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the increasing demand to handle vast quantities of data is daunting and on the rise. One of the recent processing models with a more efficient and intuitive solution to rapidly process large amount of data in parallel is called MapReduce. It is a framework defining a template approach of programming to perform large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation based on several processors. It hides the complexity of writing parallel and distributed programming code. This paper provides a comprehensive systematic review and analysis of large-scale dataset processing and dataset handling challenges and
requirements in a cloud computing environment using the MapReduce framework and its open-source implementation Hadoop. We define requirements for MapReduce systems to perform large-scale data processing, and we propose the MapReduce framework along with one implementation of it on Amazon Web Services. At the end of the paper, we present an experiment running a MapReduce system in a cloud environment. This paper shows that MapReduce is one of the best techniques for processing large datasets; it can also help developers perform parallel and distributed computation in a cloud environment.
Big Data Analytics: Recent Achievements and New Challenges – Editor IJCATR
In the era of Big Data, data is generated by everything around us at all times. Every digital process and social media exchange produces it; systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming
velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics
capabilities and skills. Big data has become an important issue for a large number of research areas such as data mining,
machine learning, computational intelligence, information fusion, the semantic Web, and social networks. The combination of
big data technologies and traditional machine learning algorithms has generated new and interesting challenges in other areas
as social media and social networks. These new challenges are focused mainly on problems such as data processing, data
storage, data representation, and how data can be used for pattern mining, analysing user behaviours, and visualizing and
tracking data, among others. In this paper, discussion about the new concept big data and data analytic their concept, tools
and methodologies that is designed to allow for efficient data mining and information sharing fusion from social media and of
the new applications and frameworks that are currently appearing under the “umbrella” of the social networks, social media
and big data paradigms.
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data... - IRJET Journal
This document compares and analyzes various tools for data mining and big data mining. It discusses traditional open source data mining tools like Orange, R, Weka, Shogun, Rapid Miner and KNIME. Each tool has different capabilities for data preprocessing, machine learning algorithms, visualization, platforms and programming languages. The document aims to help researchers select the most appropriate data mining tool for their needs and research.
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools - IRJET Journal
This document provides an overview of big data analytics approaches and tools. It begins with an abstract discussing the need to evaluate different methodologies and technologies based on organizational needs to identify the optimal solution. The document then reviews literature on big data analytics tools and techniques, and evaluates challenges faced by small vs large organizations. Several big data application examples across industries are presented. The document also introduces concepts of big data including the 3Vs (volume, velocity, variety), describes tools like Hadoop, Cloudera and Cassandra, and discusses scaling big data technologies based on an organization's requirements.
This document discusses analyzing fire department call data from San Francisco using HiveQL and MapReduce. The authors cleaned the data, loaded it into HDFS, and performed queries and analysis. They found that Hive queries took less time than custom MapReduce programs for the same queries on this dataset. Visualizations of query results were created using JFreeCharts. The goal was to help improve fire department resource allocation and response based on patterns in call volume, location, and time.
The Big Data Importance – Tools and their Usage - IRJET Journal
This document discusses big data, tools for analyzing big data, and opportunities that big data analytics provides. It begins by defining big data and its key characteristics of volume, variety and velocity. It then discusses tools for storing, managing and processing big data like Hadoop, MapReduce and HDFS. Finally, it outlines how big data analytics can be applied across different domains to enable new insights and informed decision making through analyzing large datasets.
Analysis of Various Tools in Big Data Scenario
Amarbir Singh, Palwinder Singh
Department of Computer Science, Guru Nanak Dev University, Amritsar
International Journal of Computer Techniques, Volume 3, Issue 2, Mar-Apr 2016, ISSN: 2394-2231, http://www.ijctjournal.org
Abstract:
Big data is a term which refers to those data sets or combinations of data sets whose volume, complexity, and rate of growth make them difficult to capture, manage, process or analyse with traditional tools and technologies. Big data is a relatively new concept that covers huge quantities of data, social media analytics and real-time data. In recent years, many efforts and studies have been carried out to develop proficient tools for performing various tasks on big data. Because such large and complex collections of datasets are difficult to process with traditional data processing applications, the need to produce various big data tools becomes all the more pressing. In this survey, a varied collection of big data tools is illustrated and analysed along with their salient features.
Keywords — Big data, Big data analytics, MapReduce, Data analysis, Hadoop.
I. INTRODUCTION
Big data refers to vast quantities of data from which value is extracted through capture and analysis, something made possible by innovative architectures and technologies. Nowadays, platforms such as traffic management systems and personal tracking devices such as mobile phones produce position-specific data, which has emerged as a novel source of big data. The growth of big data has, in turn, driven the development of data-demanding technologies. Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in an information/data explosion, and it is very challenging to analyse data of this size effectively using prevailing traditional techniques. Meanwhile, big data has become one of the most prominent technologies on the market and can bring vast benefits to business organizations. Understanding it is essential, because adopting and adapting to it raises several issues and challenges. The concept of big data deals with datasets that continue to grow rapidly and become hard to handle using current database management concepts and tools. Data capture, sharing, analytics, search, storage and visualization are among the related difficulties in big data. Many challenges arise from the several properties of big data: volume, variety, velocity, veracity, variability, value and complexity. Volume represents the size of the data, i.e., how large the data is. Variety makes the data big in another sense: data comes from various sources and can be of structured, unstructured and semi-structured types. Velocity describes the speed of data streaming and of data processing. Veracity represents the unreliability inherent in some sources of data [1]. Variability refers to the variation in the data flow rates. Big data are often characterized by relatively "low value density", that is, the data received in its original form usually has a low value relative to its volume; however, a high value can be obtained by analyzing large volumes of such data. Complexity refers to the fact that big data are generated through a myriad of sources. This imposes a critical challenge: the need to connect, match, cleanse and transform data received from different sources.
On the other hand, scalability, real-time analytics, unstructured data and fault tolerance are among the several challenges involved in managing huge data. Obviously, the amount and kind of data stored and created, i.e., images, audio, text information and so on, can vary from one sector or industry to another. The main aim of big data analytics is to apply advanced analytic techniques to very large, heterogeneous datasets whose sizes range from terabytes to zettabytes and whose types span structured and unstructured as well as batch and streaming data. Big data is relevant for data sets whose size or type is beyond the capability of traditional relational databases to capture, manage and process with low latency. The resulting challenges have led to the emergence of powerful big data tools. From a practical perspective, the graphical interfaces used in big data analytics tools lead to more efficient, faster and better decisions, which is why they are widely preferred by analysts, business users and researchers [2].
II. IMPORTANCE OF BIG DATA
Big data enables organizations to accomplish several objectives [3]: it helps support real-time decisions and it brings together various types of information that can be used in decision making. Beyond that, big data helps organizations explore and analyze information, improve business outcomes and manage risk, now and in the future.
III. MAP-REDUCE
MapReduce is an algorithm design pattern that
originated in the functional programming world. It
consists of three steps. First, you write a mapper
function or script that goes through your input data
and outputs a series of keys and values to use in
calculating the results. The keys are used to cluster
together bits of data that will be needed to calculate
a single output result. The unordered list of keys
and values is then put through a sort step that
ensures that all the fragments that have the same
key are next to one another in the file. The reducer
stage then goes through the sorted output and
receives all of the values that have the same key in
a contiguous block [4]. That may sound like a very
roundabout way of building your algorithms, but its
prime virtue is that it removes unplanned random
accesses, with all scattering and gathering handled
in the sorting phase. Even on single machines, this
boosts performance, thanks to the increased locality
of memory accesses, but it also allows the process
to be split across a large number of machines easily,
by dealing with the input in many independent
chunks and partitioning the data based on the key.
Hadoop is the best-known public system for
running MapReduce algorithms, but many modern
databases, such as MongoDB, also support it as an
option. It’s worthwhile even in a fairly traditional
system, since if you can write your query in a
MapReduce form, you’ll be able to run it efficiently
on as many machines as you have available. In the
traditional relational database world, all processing
happens after the information has been loaded into
the store, using a specialized query language on
highly structured and optimized data structures. The
approach pioneered by Google, and adopted by
many other web companies, is to instead create a
pipeline that reads and writes to arbitrary file
formats, with intermediate results being passed
between stages as files, with the computation
spread across many machines. Typically based
around the Map-Reduce approach to distributing
work, this approach requires a whole new set of
tools, which are described below.
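To make the map, sort and reduce steps just described concrete, the following minimal sketch simulates the pattern in plain Python for a word-count job. The input lines are invented for illustration, and a real framework would split the same logic across many machines rather than a single list in memory.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data tools", "big data analytics", "hadoop tools"]  # stand-in input

# Map step: emit a (key, value) pair for every fragment of input.
mapped = [(word, 1) for line in lines for word in line.split()]

# Sort step: bring identical keys next to each other, which is exactly what
# the MapReduce shuffle/sort phase guarantees before the reducers run.
mapped.sort(key=itemgetter(0))

# Reduce step: each key arrives with all of its values in one contiguous block.
results = {key: sum(v for _, v in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

print(results)  # {'analytics': 1, 'big': 2, 'data': 2, 'hadoop': 1, 'tools': 2}
```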
A. Hadoop
Originally developed by Yahoo! as a clone of
Google’s MapReduce infrastructure, but
subsequently open sourced, Hadoop takes care of
running your code across a cluster of machines. Its
responsibilities include chunking up the input data,
sending it to each machine, running your code on
each chunk, checking that the code ran, passing any
results either on to further processing stages or to
the final output location, performing the sort that
occurs between the map and reduce stages and
sending each chunk of that sorted data to the right
machine, and writing debugging information on
each job’s progress, among other things. As you
might guess from that list of requirements, it’s quite
a complex system, but thankfully it has been battle-
tested by a lot of users [5]. There’s a lot going on
under the hood, but most of the time, as a developer,
you only have to supply the code and data, and it
just works. Its popularity also means that there’s a
large ecosystem of related tools, some that make
writing individual processing steps easier, and
others that orchestrate more complex jobs that
require many inputs and steps. As a novice user, the
best place to get started is by learning to write a
streaming job in your favorite scripting language,
since that lets you ignore the gory details of what’s
going on behind the scenes. As a mature project,
one of Hadoop’s biggest strengths is the collection
of debugging and reporting tools it has built in.
Most of these are accessible through a web
interface that holds details of all running and
completed jobs and lets you drill down to the error
and warning log files.
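As a sketch of the kind of "streaming job in your favorite scripting language" mentioned above, the two small scripts below implement a word count for Hadoop Streaming in Python: the mapper reads raw lines from stdin and emits tab-separated key/value pairs, and the reducer receives those pairs already sorted by key. The file names and paths are placeholders chosen for illustration.

```python
#!/usr/bin/env python
# mapper.py - reads raw text lines from stdin, emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py - receives lines sorted by key and totals the counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The pair can be tested locally with an ordinary shell pipeline (cat input.txt | python mapper.py | sort | python reducer.py) and then submitted to a cluster through the hadoop-streaming jar, passing the scripts as the mapper and reducer; the exact jar path depends on the installation.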
B. Hive
With Hive, you can program Hadoop jobs using
SQL. It’s a great interface for anyone coming from
the relational database world, though the details of
the underlying implementation aren’t completely
hidden. You do still have to worry about some
differences in things like the most optimal way to
specify joins for best performance and some
missing language features. Hive does offer the
ability to plug in custom code for situations that
don’t fit into SQL, as well as a lot of tools for
handling input and output. To use it, you set up
structured tables that describe your input and output,
issue load commands to ingest your files, and then
write your queries as you would in any other
relational database [6]. Do be aware, though, that
because of Hadoop’s focus on large-scale processing,
the latency may mean that even simple jobs take
minutes to complete, so it’s not a substitute for a
real-time transactional database.
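As a rough sketch of what "writing your queries as you would in any other relational database" can look like from application code, the snippet below assumes a running HiveServer2 endpoint and the third-party PyHive client, neither of which is covered by the paper; the table name and connection details are likewise illustrative assumptions. In practice a CREATE TABLE and LOAD DATA statement would have ingested the files first, as described above.

```python
from pyhive import hive  # third-party client, assumed to be installed

# Host, port, database and the web_logs table are illustrative assumptions.
conn = hive.Connection(host="hive-server", port=10000, database="default")
cur = conn.cursor()

# Hive compiles this query into one or more Hadoop jobs behind the scenes,
# which is why even a simple aggregation can take minutes on a busy cluster.
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status
    ORDER BY hits DESC
""")
for status, hits in cur.fetchall():
    print(status, hits)
```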
C. Pig
The Apache Pig project is a procedural data
processing language designed for Hadoop. In
contrast to Hive’s approach of writing logic-driven
queries, with Pig you specify a series of steps to
perform on the data. It’s closer to an everyday
scripting language, but with a specialized set of
functions that help with common data processing
problems. It’s easy to break text up into component
ngrams, for example, and then count up how often
each occurs. Other frequently used operations, such
as filters and joins, are also supported. Pig is
typically used when your problem (or your
inclination) fits with a procedural approach, but you
need to do typical data processing operations, rather
than general purpose calculations [7]. Pig has been
described as “the duct tape of Big Data” for its
usefulness there, and it is often combined with
custom streaming code written in a scripting
language for more general operations.
D. Cascading
Most real-world Hadoop applications are built of
a series of processing steps, and Cascading
lets you define that sort of complex workflow as a
program. You lay out the logical flow of the data
pipeline you need, rather than building it explicitly
out of Map-Reduce steps feeding into one another.
To use it, you call a Java API, connecting objects
that represent the operations you want to perform
into a graph. The system takes that definition, does
some checking and planning, and executes it on
your Hadoop cluster [8]. There are a lot of built-in
objects for common operations like sorting,
grouping, and joining, and you can write your own
objects to run custom processing code.
E. Cascalog
Cascalog is a functional data processing
interface written in Clojure. Influenced by the old
Datalog language and built on top of the Cascading
framework, it lets you write your processing code at
a high level of abstraction while the system takes
care of assembling it into a Hadoop job. It makes it
easy to switch between local execution on small
amounts of data to test your code and production
jobs on your real Hadoop cluster [9]. Cascalog
inherits the same approach of input and output taps
and processing operations from Cascading, and the
functional paradigm seems like a natural way of
specifying data flows. It’s a distant descendant of
the original Clojure wrapper for Cascading,
cascading-clojure.
F. Mrjob
Mrjob is a framework that lets you write the
code for your data processing, and then
transparently run it either locally, on Elastic
MapReduce, or on your own Hadoop cluster.
Written in Python, it doesn’t offer the same level of
abstraction or built-in operations as the Java-based
Cascading. The job specifications are defined as a
series of map and reduce steps, each implemented
as a Python function [10]. It is great as a framework
for executing jobs, even allowing you to attach a
debugger to local runs to really understand what’s
happening in your code.
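A minimal sketch of such a job is shown below, using mrjob's MRJob base class with the map and reduce steps written as Python methods; the job name and input file are placeholders for illustration.

```python
from mrjob.job import MRJob

class MRWordFrequency(MRJob):
    """Counts how often each word appears across all input lines."""

    def mapper(self, _, line):
        # Map step: emit (word, 1) for every token in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce step: sum the counts that arrived for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordFrequency.run()
```

Saved as, say, word_frequency.py, the script runs locally with "python word_frequency.py input.txt", and the same file can be pointed at a Hadoop cluster or Elastic MapReduce with the -r hadoop or -r emr runner options, which is the transparent switching the section describes.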
G. Caffeine
Even though no significant technical information
has been published on it, I’m including Google’s
Caffeine project, as there’s a lot of speculation that
it’s a replacement for the MapReduce paradigm.
From reports and company comments, it appears
that Google is using a new version of the Google
File System that supports smaller files and
distributed masters. It also sounds like the company
has moved away from the batch processing
approach to building its search index, instead using
a dynamic database approach to make updating
faster. There’s no indication that Google’s come up
with a new algorithmic approach that’s as widely
applicable as MapReduce, though I am looking
forward to hearing more about the new architecture.
H. S4
Yahoo! initially created the S4 system to make
decisions about choosing and positioning ads, but
the company open sourced it after finding it useful
for processing arbitrary streams of events. S4 lets
you write code to handle unbounded streams of
events, and runs it distributed across a cluster of
machines, using the ZooKeeper framework to
handle the housekeeping details. You write data
sources and handlers in Java, and S4 handles
broadcasting the data as events across the system,
load-balancing the work across the available
machines. It’s focused on returning results fast,
with low latency, for applications like building near
real-time search engines on rapidly changing
content [11]. This sets it apart from Hadoop and the
general MapReduce approach, which involves
synchronization steps within the pipeline, and so
some degree of latency. One thing to be aware of is
that S4 uses UDP and generally offers no delivery
guarantees for the data that is passing through the
pipeline. It usually seems possible to adjust queue
sizes to avoid data loss, but it does put the burden
of tuning to reach the required level of reliability on
the application developer.
I. MapR
MapR is a commercial distribution of Hadoop
aimed at enterprises. It includes its own file system
that is a replacement for HDFS, along with other
tweaks to the framework, like distributed name
nodes for improved reliability. The new file system
aims to offer increased performance, as well as
easier backups and compatibility with NFS to make
it simpler to transfer data in and out. The
programming model is still the standard Hadoop
one; the focus is on improving the infrastructure
surrounding the core framework to make it more
appealing to corporate customers [12].
J. Acunu
Like MapR, Acunu is a new low-level data
storage layer that replaces the traditional file system,
though its initial target is Cassandra rather than
Hadoop. By writing a kernel-level key/value store
called Castle, which has been open-sourced, the
creators are able to offer impressive speed boosts in
many cases. The data structures behind the
performance gains are also impressive. Acunu also
offers some of the traditional benefits of a
commercially supported distribution, such as
automatic configuration and other administration
tools.
K. Flume
One very common use of Hadoop is taking web
server or other logs from a large number of
machines, and periodically processing them to pull
out analytics information. The Flume project is
designed to make the data gathering process easy
and scalable, by running agents on the source
machines that pass the data updates to collectors,
which then aggregate them into large chunks that
can be efficiently written as HDFS files. It’s usually
set up using a command-line tool that supports
common operations, like tailing a file or listening
on a network socket, and has tunable reliability
guarantees that let you trade off performance and
the potential for data loss.
L. Kafka
Kafka is a comparatively new project for
sending large numbers of events from producers to
consumers. Originally built to connect LinkedIn’s
website with its backend systems, it’s somewhere
between S4 and Flume in its functionality. Unlike
S4, it’s persistent and offers more safeguards for
delivery than Yahoo!’s UDP-based system, but it
tries to retain its distributed nature and low latency
[13]. It can be used in a very similar way to Flume,
keeping its high throughput, but with a more
flexible system for creating multiple clients and an
underlying architecture that’s more focused on
parallelization. Kafka relies on ZooKeeper to keep
track of its distributed processing.
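For a feel of the producer/consumer model described above, here is a small sketch using the kafka-python client, a later library that the article does not cover; the broker address, topic name and message payloads are assumptions made for the example.

```python
from kafka import KafkaProducer, KafkaConsumer  # kafka-python client, assumed installed

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "page-views"        # assumed topic name

# Producer side: publish a few events to the topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
for event in [b'{"user": 1, "url": "/home"}', b'{"user": 2, "url": "/docs"}']:
    producer.send(TOPIC, event)
producer.flush()

# Consumer side: read the events back from the earliest offset, and give up
# after five seconds of silence so the script terminates cleanly.
consumer = KafkaConsumer(TOPIC,
                         bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)
```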
M. Azkaban
The trickiest part of building a working system
using these new data tools is the integration.
The individual services need to be tied together into
sequences of operations that are triggered by your
business logic, and building that plumbing is
surprisingly time consuming. Azkaban is an open
source project from LinkedIn that lets you define
what you want to happen as a job flow, possibly
made up of many dependent steps, and then handles
a lot of the messy housekeeping details. It keeps
track of the log outputs, spots errors and emails
about errors as they happen, and provides a friendly
web interface so you can see how your jobs are
getting on. Jobs are created as text files, using a
very minimal set of commands, with any
complexity expected to reside in the
Unix commands or Java programs that the step calls.
N. Oozie
Oozie is a job control system that’s similar to
Azkaban, but exclusively focused on Hadoop. This
isn’t as big a difference as you might think, since
most Azkaban uses I’ve run across have also been
for Hadoop, but it does mean that Oozie’s
integration is a bit tighter, especially if you’re using
the Yahoo! distribution of both. Oozie also supports
a more complex language for describing job flows,
allowing you to make runtime decisions about
exactly which steps to perform, all described in
XML files [14]. There’s also an API that you can
use to build your own extensions to the system’s
functionality. Compared to Azkaban, Oozie’s
interface is more powerful but also more complex,
so which you choose should depend on how much
you need Oozie’s advanced features.
O. Greenplum
Though not strictly a NoSQL database, the
Greenplum system offers an interesting way of
combining a flexible query language with
distributed performance. Built on top of the
Postgres open source database, it adds in a
distributed architecture to run on a cluster of
multiple machines, while retaining the standard
SQL interface. It automatically shards rows across
machines, by default based on a hash of a table’s
primary key, and works to avoid data loss both by
using RAID drive setups on individual servers and
by replicating data across machines. It’s normally
deployed on clusters of machines with
comparatively fast processors and large amounts of
RAM, in contrast to the pattern of using commodity
hardware that’s more common in the web world.
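Because Greenplum retains a Postgres-derived SQL interface, it can be reached from an ordinary PostgreSQL driver. The sketch below uses psycopg2 with made-up connection details and an illustrative table, and shows the DISTRIBUTED BY clause Greenplum uses to control how rows are sharded across the cluster.

```python
import psycopg2  # Greenplum speaks the PostgreSQL protocol, so a Postgres driver works

# Host, database, user and the table schema are illustrative assumptions.
conn = psycopg2.connect(host="gp-master", dbname="analytics", user="gpadmin")
cur = conn.cursor()

# Rows are hash-distributed across the cluster's segments by user_id.
cur.execute("""
    CREATE TABLE page_views (
        user_id   BIGINT,
        url       TEXT,
        viewed_at TIMESTAMP
    ) DISTRIBUTED BY (user_id)
""")
conn.commit()

# The aggregation runs in parallel on every segment and the results are
# gathered back through the master, but the SQL itself is ordinary.
cur.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY 2 DESC LIMIT 10")
print(cur.fetchall())
```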
IV. CONCLUSION
Early interest in big data analytics focused
primarily on business and social data sources, such
as e-mail, videos, tweets, Facebook posts, reviews,
and Web behavior. The scope of interest in big data
analytics is growing to include data from intelligent
systems, such as in-vehicle infotainment, kiosks,
smart meters, and many others, and device sensors
at the edge of networks—some of the largest-
volume, fastest-streaming, and most complex big
data. So, in order to handle this growing complexity,
this paper provides an in-depth analysis of various
tools which are used in the current big data scenario.
ACKNOWLEDGMENT
The authors wish to thank Guru Nanak Dev
University, Amritsar, for the guidance and support.
REFERENCES
[1] Mark Troester, "Big Data Meets Big Data Analytics", http://www01.ibm.com/software/in/data/bigdata/, 2013, retrieved 10/02/14.
[2] Garlasu, D.; Sandulescu, V.; Halcu, I.; Neculoiu, G., "A Big Data Implementation Based on Grid Computing", Grid Computing, 2013; http://www.slideshare.net/HarshMishra3/harsh-big-data-seminar-report
[3] http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
[4] Sagiroglu, S.; Sinanc, D., "Big Data: A Review", 20-24 May 2013.
[5] Mukherjee, A.; Datta, J.; Jorapur, R.; Singhvi, R.; Haloi, S.; Akram, W., "Shared Disk Big Data Analytics with Apache Hadoop", 18-22 Dec. 2012.
[6] http://dashburst.com/infographic/big-data-volume.variety-velocity/
[7] Mark Troester, "Big Data Meets Big Data Analytics", www.sas.com/resources/.../WR46345.pdf, 2013, retrieved 10/02/14.
[8] Chris Deptula, "With All of the Big Data Tools, What Is the Right One for Me", www.openbi.com/blogs/chris%20Deptula, retrieved 08/02/14.
[9] Brian Runciman, "IT NOW Big Data Focus, Autumn 2013", www.bcs.org, 2013, retrieved 03/02/14.
[10] Neil Raden, "Big Data Analytics Architecture", www.teradata.com, 2012, retrieved 15/03/14.
[11] K. Bakshi, "Considerations for Big Data: Architecture and Approach", IEEE Aerospace Conference, Big Sky, Montana, March 2012.
[12] Sagiroglu, S.; Sinanc, D., "Big Data: A Review", 20-24 May 2013.
[13] Grosso, P.; de Laat, C.; Membrey, P., "Addressing Big Data Issues in Scientific Data Infrastructure", 20-24 May 2013.
[14] Divyakant Agrawal (UC Santa Barbara), Philip Bernstein (Microsoft), Elisa Bertino, et al., "Challenges and Opportunities with Big Data", 2012, www.cra.org/ccc/../BigdataWhitepaper.pd..., retrieved 15/03/14.