This document reviews the use of cloud computing for big data analytics in bioinformatics. It discusses how high-throughput sequencing technologies have led to an exponential increase in biological data that is challenging for small labs and institutions to store and analyze. The document then provides an overview of cloud computing and its applications in bioinformatics, including data storage, software tools, and platforms. A key application discussed is Crossbow, which can align sequencing reads and identify SNPs from genome data in hours using Amazon EC2 cloud computing resources at a relatively low cost. Overall, the review finds that cloud computing improves the processing and storage of large and growing bioinformatics datasets.
IRJET - Systematic Review: Progression Study on BIG DATA Articles (IRJET Journal)
This document provides a systematic review of research articles on big data analysis. It analyzed 64 articles published between 2014 and 2018 from the IEEE Xplore and Google Scholar databases. Key findings include: the number of published articles has increased each year, reflecting the growing importance of big data; experimental and case study articles accounted for 25 of the analyzed papers; and 19 articles were ultimately selected for review, 11 from Google Scholar and 8 from IEEE Xplore. The review aims to provide an overview of current research progress on big data analysis techniques.
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S... (ijcsit)
To support its large-scale, multidisciplinary scientific research efforts across all the university campuses and by research personnel spread over every corner of the state, the state of Nevada needs to build and leverage its own cyberinfrastructure. Following the well-established as-a-service model, this state-wide cyberinfrastructure, which consists of data acquisition, data storage, advanced instruments, visualization, computing and information processing systems, and people, all seamlessly linked together through a high-speed network, is designed and operated to deliver the benefits of Cyberinfrastructure-as-a-Service (CaaS). There are three major service groups in this CaaS: (i) supporting infrastructural services that comprise sensors, computing/storage/networking hardware, operating systems, management tools, virtualization, and the message passing interface (MPI); (ii) data transmission and storage services that provide connectivity to various big data sources, as well as cached and stored datasets in a distributed storage backend; and (iii) processing and visualization services that give users access to rich processing and visualization tools and packages essential to various scientific research workflows. Built on commodity hardware and open-source software packages, the Southern Nevada Research Cloud (SNRC) and a data repository in a separate location constitute a low-cost solution for delivering all these services around CaaS. The service-oriented architecture and implementation of the SNRC are geared to encapsulate as much of the detail of big data processing and cloud computing as possible away from end users; scientists only need to learn and access an interactive web-based interface to conduct their collaborative, multidisciplinary, data-intensive research. The capability and easy-to-use features of the SNRC are demonstrated through a use case that derives a solar radiation model from a large data set by regression analysis.
COMPLETE END-TO-END LOW COST SOLUTION TO A 3D SCANNING SYSTEM WITH INTEGRATED... (ijcsit)
3D reconstruction is a computer vision technique with a wide range of applications in areas such as object recognition, city modelling, virtual reality, physical simulations, video games, and special effects. Previously, performing a 3D reconstruction required specialized hardware. Such systems were often very expensive and available only for industrial or research purposes. With the rising availability of high-quality, low-cost 3D sensors, it is now possible to design inexpensive, complete 3D scanning systems. The objective of this work was to design an acquisition and processing system that can perform 3D scanning and reconstruction of objects seamlessly. The goal also included making the 3D scanning process fully automated by building a turntable and integrating it with the software, so that the user can perform a full 3D scan with only a press of a few buttons in our dedicated graphical user interface. Three main steps take us from acquisition of point clouds to the finished reconstructed 3D model. First, our system acquires point cloud data of a person or object using an inexpensive camera sensor. Second, it aligns and converts the acquired point cloud data into a watertight mesh of good quality. Third, it exports the reconstructed model to a 3D printer to obtain a proper 3D print of the model.
Privacy Preserving Aggregate Statistics for Mobile Crowdsensing (IJSRED)
This document summarizes a research paper on preserving privacy in mobile crowdsensing applications. It discusses how crowdsourced data can be aggregated and mined for valuable information but also risks disclosing sensitive user information. The paper proposes a new framework that introduces multiple agents between users and an untrusted server to help preserve location privacy of both workers and tasks in spatial crowdsourcing applications. Users upload sensed data to random agents, who then aggregate and perturb the statistics before further aggregation to publish overall statistics to third parties while protecting individual privacy through differential privacy techniques. The framework aims to enable privacy-preserving participation in crowdsourcing without relying on any single trusted entity.
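The summary above does not reproduce the paper's protocol, but its core privacy step, perturbing an aggregate with noise before publication, is the standard Laplace mechanism from differential privacy. A minimal sketch follows; the function names, the counting query, and the epsilon value are illustrative assumptions, not details from the paper.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a zero-mean Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturbed_count(values, threshold, epsilon=1.0):
    # Sensitivity of a counting query is 1: adding or removing one user's
    # reading changes the count by at most 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for v in values if v > threshold)
    return true_count + laplace_noise(1.0 / epsilon)

# An agent aggregates readings from its users, perturbs the statistic,
# and only the noisy value is forwarded for further aggregation.
readings = [21.3, 35.8, 29.1, 40.2, 18.7]
print(perturbed_count(readings, threshold=30.0, epsilon=0.5))
```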
The document describes a student project titled "Error Detection in Big Data on Cloud". The project aims to develop a time-efficient approach for detecting and correcting errors in large sensor data stored on the cloud. If errors are found, the approach also involves error recovery and storing the corrected data in its original format. The proposed method uses algorithms like cyclic redundancy check, Hamming code, and secure hash algorithm to detect and locate errors in big data sets efficiently. Design documents like data flow diagrams, use case diagrams and class diagrams were created to plan the system architecture and implementation of the project.
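As a rough illustration of the checksum-based detection the project describes, the sketch below fingerprints data blocks with CRC-32 and SHA-256 from Python's standard library and reports which stored blocks no longer match; the block layout and the recovery step are assumptions, not the project's actual design.

```python
import hashlib
import zlib

def fingerprint(block: bytes) -> tuple:
    # CRC-32 is cheap and catches most bit errors; SHA-256 confirms
    # integrity in the rare case a corrupted block's CRC still matches.
    return zlib.crc32(block), hashlib.sha256(block).hexdigest()

def find_corrupt_blocks(blocks, stored_fingerprints):
    # Compare each stored block against the fingerprint recorded at
    # ingest time; return the indices of blocks that need recovery.
    return [i for i, (blk, fp) in enumerate(zip(blocks, stored_fingerprints))
            if fingerprint(blk) != fp]

original = [b"sensor-batch-0", b"sensor-batch-1", b"sensor-batch-2"]
fps = [fingerprint(b) for b in original]
received = [original[0], b"sensor-batch-X", original[2]]  # one block corrupted
print(find_corrupt_blocks(received, fps))  # -> [1]
```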
A time-efficient approach for detecting errors in big sensor data on cloud (Nexgen Technology)
IRJET - Improving Data Availability by using VPC Strategy in Cloud Environ... (IRJET Journal)
This document discusses improving data availability in cloud environments using virtual private cloud (VPC) strategies and data replication strategies (DRS). It proposes using VPC to define private networks in public clouds and deploying cloud resources into those private networks for improved security and control. It also proposes using DRS to store multiple copies of data across different nodes to increase data availability, reduce bandwidth usage, and provide fault tolerance. The proposed approach identifies popular data files for replication, selects the best storage sites based on factors like request frequency, failure probability, and storage usage, and decides when to replace replicas to optimize resource usage. A simulation showed this hybrid VPC and DRS approach improved performance metrics like response time, network usage, and load balancing compared to ...
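The paper's exact selection formula is not given in this summary; the sketch below illustrates one plausible way to score candidate storage sites on the three factors named above (request frequency, failure probability, storage usage), with purely illustrative weights.

```python
def site_score(request_freq, failure_prob, storage_used,
               w_freq=0.5, w_fail=0.3, w_store=0.2):
    # Higher request frequency argues for a replica at the site;
    # higher failure probability and fuller storage argue against it.
    # The weights are illustrative, not the paper's calibration.
    return w_freq * request_freq - w_fail * failure_prob - w_store * storage_used

sites = {
    "node-a": dict(request_freq=0.9, failure_prob=0.05, storage_used=0.70),
    "node-b": dict(request_freq=0.4, failure_prob=0.20, storage_used=0.30),
    "node-c": dict(request_freq=0.8, failure_prob=0.10, storage_used=0.95),
}
best = max(sites, key=lambda name: site_score(**sites[name]))
print(best)  # the site chosen to host the next replica
```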
This document proposes a novel real-time application to identify bursty local areas related to emergency topics using geotagged tweets. It uses three techniques: Naive Bayes classification to extract tweets related to emergency topics, density-based spatiotemporal clustering to group tweets by location and time into clusters, and Kleinberg's burst detection algorithm to identify clusters with higher tweet volumes indicating emergencies. The application was implemented as a web and mobile interface. It was tested in a real-time weather monitoring system using Twitter data and successfully detected bursty local areas for weather emergencies.
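A faithful implementation of Kleinberg's burst detection models event rates with a state machine; as a compact stand-in, the sketch below flags a cluster as bursty when its latest tweet count far exceeds its recent baseline. All names and thresholds here are illustrative assumptions, not the paper's parameters.

```python
from collections import deque

def is_bursty(counts, window=6, factor=3.0):
    # counts: per-interval tweet counts for one spatiotemporal cluster.
    # Flag a burst when the latest interval exceeds `factor` times the
    # mean of the preceding `window` intervals (a crude stand-in for
    # Kleinberg's state-machine burst detector).
    history = list(counts)[-(window + 1):-1]
    if not history:
        return False
    baseline = sum(history) / len(history)
    return counts[-1] > factor * max(baseline, 1.0)

cluster_counts = deque([4, 5, 3, 6, 4, 5, 31], maxlen=50)
print(is_bursty(cluster_counts))  # -> True: candidate emergency area
```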
IRJET - Cost Effective Workflow Scheduling in Bigdata (IRJET Journal)
This document summarizes research on cost-effective workflow scheduling in big data environments. It discusses a Pointer Gossip Content Addressable Network Montage Framework that allows automatic selection of target clouds, uniform access to clouds, and workflow data management while meeting service level agreements at lowest cost. The framework is evaluated using a real scientific workflow application in different deployment scenarios. Results show it can execute workflows with expected performance and quality at lowest cost.
Architectures for Data Commons (XLDB 15 Lightning Talk) (Robert Grossman)
These are the slides from a 5 minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
The advent of Big Data has brought new processing and storage challenges, which are often addressed by distributed processing. Distributed systems are inherently dynamic and unstable, so it is realistic to expect that some resources will fail during use. Load balancing and task scheduling are important factors in the performance of parallel applications; hence the need to design load-balancing algorithms adapted to grid computing. In this paper, we propose a dynamic, hierarchical load-balancing strategy at two levels: intra-scheduler load balancing, to avoid use of the large-scale communication network, and inter-scheduler load balancing, to regulate the load of the whole system. The strategy improves the average response time of CLOAK-Reduce application tasks with minimal communication. We focus on three performance indicators, namely response time, process latency, and running time of MapReduce tasks.
The International Refereed Journal of Engineering and Science (IRJES) is a peer-reviewed online journal for professionals and researchers in the field of computer science. Its main aim is to resolve emerging and outstanding problems revealed by recent social and technological change. IRJES provides a platform for researchers to present and evaluate their work from both theoretical and technical aspects and to share their views.
Big data and open access: a collision course for science (Beth Plale)
This document summarizes key points from a keynote talk on big data and open access given at the 2nd International LSDMA Symposium. It discusses how open access to data, along with open cleaning and analysis of data, can help address major societal challenges related to climate change, food security, and new economies. However, rapid growth in data and expectations is creating tensions between enabling new discoveries and addressing privacy concerns. Technical solutions exist but cultural and policy changes may also be needed to fully realize the benefits of open data while protecting privacy. The document provides several examples of open data initiatives related to text analysis, urban science, and atmospheric observations.
Efficient Tree-based Aggregation and Processing Time for Wireless Sensor Netw... (CSCJournals)
Tree-based data aggregation suffers from increased data delivery time because parents must wait for the data from their leaves. In this paper, we propose an Efficient Tree-based Aggregation and Processing Time (ETAPT) algorithm using the Appropriate Data Aggregation and Processing Time (ADAPT) metric. A tree structure is built out from the sink, electing sensors with the highest degree of connectivity as parents; the others are considered leaves. Given the maximum acceptable latency, ETAPT takes into account the position of parents, their number of leaves, and the depth of the tree in order to allocate an optimal ADAPT time to parents with more leaves, increasing data aggregation gain and ensuring enough time to process data from leaves. Simulations were performed to validate ETAPT. The results show that ETAPT provides a higher data aggregation gain, with lower energy consumption and end-to-end delay, compared to Aggregation Time Control (ATC) and Data Aggregation Supported by Dynamic Routing (DASDR).
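The paper's ADAPT formula also accounts for parent position and tree depth; the sketch below captures only the leaf-count idea, splitting the maximum acceptable latency among parents in proportion to their number of leaves. It is an assumption-laden illustration, not the published algorithm.

```python
def allocate_processing_time(parents, max_latency):
    # parents: {parent_id: number_of_leaves}. Give parents with more
    # leaves a larger slice of the latency budget so they have enough
    # time to aggregate all child readings (a sketch of the ADAPT idea,
    # not the paper's exact formula, which also uses position and depth).
    total_leaves = sum(parents.values()) or 1
    return {pid: max_latency * leaves / total_leaves
            for pid, leaves in parents.items()}

print(allocate_processing_time({"p1": 8, "p2": 2, "p3": 5}, max_latency=1.5))
```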
This document summarizes a seminar presentation on big data analytics. It reviews 25 research papers published between 2011-2014 on issues related to big data analysis, real-time big data analysis using Hadoop in cloud computing, and classification of big data using tools and frameworks. The review process involved a 5-stage analysis of the papers. Key issues identified include big data analysis, real-time analysis using Hadoop in clouds, and classification using tools like Hadoop, MapReduce, HDFS. Promising solutions discussed are MapReduce Agent Mobility framework, PuntStore with pLSM index, IOT-StatisticDB statistical database mechanism, and visual clustering analysis.
This document summarizes a proposed enhancement to the OpenStack Nova scheduler to incorporate network factors into virtual machine scheduling decisions. The current Nova scheduler only considers CPU, memory, and storage utilization when placing VMs, but not network utilization or connectivity. The proposed enhancement adds a network filter and weighting to Nova's filtering scheduler. It would check network interface status and bandwidth when initially placing VMs to ensure connectivity. It would also enable dynamic VM migration if a host's network card fails. This aims to optimize VM placement and improve performance by considering network factors.
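Nova's real filter scheduler is implemented as plugin classes; the standalone sketch below only mirrors the filter-then-weigh control flow with a hypothetical network filter and bandwidth weigher, using made-up host fields rather than Nova's actual host-state API.

```python
def network_filter(hosts, required_bw_mbps):
    # Keep only hosts whose NIC is up and has enough free bandwidth,
    # mirroring the proposed network filter (host fields are illustrative).
    return [h for h in hosts if h["nic_up"] and h["free_bw_mbps"] >= required_bw_mbps]

def weigh(hosts):
    # Prefer the host with the most spare bandwidth among the survivors.
    return max(hosts, key=lambda h: h["free_bw_mbps"])

hosts = [
    {"name": "compute-1", "nic_up": True,  "free_bw_mbps": 400},
    {"name": "compute-2", "nic_up": False, "free_bw_mbps": 900},
    {"name": "compute-3", "nic_up": True,  "free_bw_mbps": 750},
]
print(weigh(network_filter(hosts, required_bw_mbps=300))["name"])  # compute-3
```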
Building the Pacific Research Platform: Supernetworks for Big Data Science (Larry Smarr)
The document summarizes Dr. Larry Smarr's presentation on building the Pacific Research Platform (PRP) to enable big data science across research universities on the West Coast. The PRP provides 100-1000 times more bandwidth than today's internet to support research fields from particle physics to climate change. In under 2 years, the prototype PRP has connected researchers and datasets across California through optical networks and is now expanding nationally and globally. The next steps involve adding machine learning capabilities to the PRP through GPU clusters to enable new discoveries from massive datasets.
Frequency and similarity aware partitioning for cloud storage based on space-time utility maximization model (redpel dot com)
1. Nebula is NASA's open source cloud computing platform built using OpenStack that provides on-demand access to computing resources and storage for large datasets.
2. It allows NASA researchers to run computationally intensive tasks in virtual machines and store huge datasets over 100 terabytes in size.
3. The document discusses Nebula's architecture, services, and case studies of its use at various NASA research centers to support activities like processing images from Mars missions.
- The Pacific Research Platform (PRP) interconnects campus DMZs across multiple institutions to provide high-speed connectivity for data-intensive research.
- The PRP utilizes specialized data transfer nodes called FIONAs that provide disk-to-disk transfer speeds of 10-100Gbps.
- Early applications of the PRP include distributing telescope data between UC campuses, connecting particle physics experiments to computing resources, and enabling real-time wildfire sensor data analysis.
The huge volume of text documents available on the internet has made it difficult for users to find valuable information, so efficient applications for extracting knowledge of interest from textual documents are vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed for designing and developing a system that analyses and extracts useful patterns from text documents. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in the data set. A Vector Space Model (VSM) is then used to represent the dataset. The system was implemented in two main phases. In phase 1, the clustering analysis process groups documents into several clusters; in phase 2, an information retrieval process ranks clusters against the user query in order to retrieve the relevant documents from the specific clusters deemed relevant. The results are evaluated with recall and precision (P@5, P@10) over the retrieved results: P@5 was 0.660 and P@10 was 0.655.
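The paper's own weighting and clustering details are not reproduced in this abstract; the sketch below shows the generic VSM machinery it builds on, TF-IDF vectors compared by cosine similarity to rank documents against a query. Corpus and tokenization are illustrative.

```python
import math
from collections import Counter

def doc_freq(docs):
    # Number of documents each term appears in.
    return Counter(t for doc in docs for t in set(doc.split()))

def tfidf(text, df, n):
    # TF-IDF weight for each term present in the corpus vocabulary.
    tf = Counter(text.split())
    return {t: f * math.log(n / df[t]) for t, f in tf.items() if t in df}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["big data cloud storage", "cloud workflow scheduling cost",
        "text mining frequent patterns"]
df, n = doc_freq(docs), len(docs)
vectors = [tfidf(d, df, n) for d in docs]
query = tfidf("cloud storage", df, n)
ranking = sorted(range(n), key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranking)  # document indices ordered by relevance to the query
```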
Grid computing is the sharing of computer resources from multiple administrative domains to achieve common goals. It allows for independent, inexpensive access to high-end computational capabilities. Grid computing federates resources like computers, data, software and other devices. It provides a single login for users to access distributed resources for tasks like drug discovery, climate modeling and other data-intensive applications. Current grids are used for distributed supercomputing, high-throughput computing, on-demand computing and other methods. Grids benefit scientists, engineers and other users who need to solve large problems or collaborate globally.
A Survey on Neural Network Based Minimization of Data Center in Power Consump... (IJSTA)
This document summarizes a survey on using neural networks to minimize data center power consumption. It discusses how load prediction using neural networks can balance server loads and optimize energy usage by selectively powering down unused servers. The document reviews several related works applying neural network prediction to schedule virtual machines and transition servers to low-power sleep states. The survey concludes neural network models trained on historical load data can accurately predict future loads to enable reliable and optimized load balancing across servers.
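The surveyed works use neural networks; purely to illustrate the predict-then-power-down control loop, the sketch below substitutes a one-step linear autoregression fitted to historical load samples. This is a deliberate simplification, not one of the surveyed models.

```python
def fit_lag1(loads):
    # Least-squares fit of load[t] ~ a * load[t-1] + b, a minimal
    # stand-in for the neural-network load predictors in the survey.
    xs, ys = loads[:-1], loads[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) or 1e-9
    a = cov / var
    return a, my - a * mx

history = [52, 55, 61, 58, 64, 70, 68, 75]   # CPU load samples (%)
a, b = fit_lag1(history)
predicted = a * history[-1] + b
# If the predicted load stays below a threshold, surplus servers can be
# put to sleep; otherwise they are woken ahead of the demand spike.
print(round(predicted, 1))
```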
The public library as a welfare actor? #VI2016 Velfærdens Innovationsdag (Michel Steen-Hansen)
That is the question we ask at Velfærdens Innovationsdag (Welfare Innovation Day) 2016, organized by Mandag Morgen on 28 January.
I will give a talk and have, for once, limited myself to a single PowerPoint slide. My background as a historian shows through, but the point here is precisely that public libraries have continuously adapted to societal developments and citizens' needs. You can see the slide here, but it probably needs a few words of context, so stop by.
http://biblioteksdebat.blogspot.dk/2016/01/folkebibliotek-som-velfrdsaktr.html
"Lecture on Social Media Presence", a two-days workshop for 200 students of Product and Fashion Design at Domus Academy, Milan.
Under the guide of Researcher and Lecturer Paolo Ferrarini, Italian Correspondent of Cool Hunting - in collaboration with Anna Chiara Maggiolini.
Three little pigs by CEPR CRUZ DEL CAMPO (missmariatxu)
A piñata is a decorated container that holds candy or small toys. It is broken at parties, usually as part of a game where children try to break it open to retrieve the treats inside. Piñatas originated in Mexico and are traditionally part of celebrations for holidays like Christmas or birthdays.
The document discusses information and communication technology (ICT) systems and how they are used. An ICT system includes hardware, software, data, and users. ICT systems are used in offices, shops, factories, aircraft, ships, and fields like communications, medicine, and farming. The document also discusses word processing software, desktop publishing software, graphics software, and how different types of software are suited for different tasks.
[Study] When mothers take up zero waste (Dynvibe)
"Zero waste" has been a fast-growing trend in France since the beginning of 2016. Organic living, the "slow life", vegetarianism, and several best-selling books on the subject appear to have sparked an awareness focused on fighting waste and rationalizing consumption.
On social networks, these new adherents form an increasingly active community, exchanging advice and ideas for everyday actions.
Analysis of this population's profiles shows that the approach is no longer reserved for people already versed in ecology but now reaches a much wider audience. Mothers in particular are often the ones who put "zero waste" into practice in their households.
Of course, this trend still concerns only a limited population for now, but it is spreading very quickly, while brands and retailers still seem largely unprepared to respond.
Dynvibe analysed more than 1,000 social media conversations to decipher this population's motivations and the concrete ways zero waste is applied in French households.
A Comprehensive Study on Big Data Applications and Challenges (ijcisjournal)
Big Data has gained much interest from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that quickly exceeds the boundary range. As information is transferred and shared at light speed over optic fiber and wireless networks, the volume of data and the speed of market growth increase. Conversely, the fast growth rate of such large data generates copious challenges, such as the rapid growth of data, transfer speed, diverse data, and security. Even so, Big Data is still in its early stage, and the domain has not been reviewed in general. Hence, this study expansively surveys and classifies an assortment of attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Map/Reduce is a programming model for efficient distributed computing that works well with semi-structured and unstructured data; it is a simple model, yet good for many applications such as log processing and web index building.
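To make the Map/Reduce model concrete for the log-processing case mentioned above, here is a single-process sketch; in a real deployment the map calls run in parallel over input splits and a shuffle groups pairs by key before the reduce. The log format is an assumption for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (status_code, 1) for each log line, e.g. "GET /index 200".
    parts = line.split()
    yield parts[-1], 1

def reduce_phase(pairs):
    # Sum the counts for each key, as a reducer would after the
    # shuffle stage groups pairs by key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = ["GET /index 200", "GET /api 500", "POST /api 200", "GET /x 404"]
print(reduce_phase(chain.from_iterable(map_phase(l) for l in logs)))
# -> {'200': 2, '500': 1, '404': 1}
```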
SYSTEMS USING WIRELESS SENSOR NETWORKS FOR BIG DATA (CSEIJJournal)
Wireless sensor networks (WSNs) are continually developing in the big data world and are widely employed in many aspects of life. In the monitoring region, the WSN gathers, analyses, and sends information about the detected item. In recent years, WSNs have also made important strides in critical data protection, traffic monitoring, and climate-change detection. Wireless sensor networks are rich big data contributors, providing a significant amount of data from numerous sensor nodes in large-scale networks, which are among the numerous potential datasets. However, unlike traditional wireless networks, WSNs suffer from significant constraints in communication and data dependability due to cluster constraints. This paper gives a detailed assessment of cutting-edge research on integrating WSNs into big data systems. Potential networking, deployment, and scientific problems are presented and discussed in the context of the study topics and aims. Finally, unresolved issues are addressed in order to discuss interesting future research possibilities.
Systems using Wireless Sensor Networks for Big Data (civejjour)
(1) Wireless sensor networks are being increasingly used in big data systems to gather and transmit large amounts of data from numerous sensor nodes. However, WSNs face constraints like limited communication bandwidth and power that make them unsuitable for traditional big data architectures.
(2) The document discusses challenges of integrating WSNs into big data systems and outlines several unresolved issues including lack of compatibility between different IoT devices, mobility of sensors, supporting real-time communications, and addressing security and privacy concerns with increased connectivity and data volumes.
(3) It analyzes recent research that has evaluated WSN properties and protocols for big data. However, issues around individuality, cooperation, mobility, fog computing, delays, security
Big Data in Bioinformatics & the Era of Cloud Computing (IOSR Journals)
This document discusses the challenges of big data in bioinformatics and how cloud computing can address them. It notes that high-throughput experiments are generating huge amounts of biological data from fields like genomics and proteomics. Storing and analyzing this "big data" requires massive computational resources that are costly for individual organizations. However, cloud computing provides elastic, on-demand access to storage and processing power at an affordable cost. This allows bioinformatics data to be securely stored and shared on the cloud to enable collaborative analysis and overcome issues of data transfer, storage limitations, and infrastructure maintenance.
This document proposes a middleware called MSOAH-IoT to address heterogeneity issues in IoT applications. The middleware is based on a service-oriented architecture and uses REST APIs to collect data from heterogeneous sensors. It introduces heterogeneous networking interfaces and has been tested on gateways running different operating systems. The middleware aims to support various smart objects using different networking interfaces and OS systems while unifying various data formats. It is implemented on a Raspberry Pi gateway to manage communications at the network edge and handle heterogeneity issues.
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt... (IJECEIAES)
With the advent of Big Data analytics, the healthcare system is increasingly adopting analytical services that ultimately generate a massive load of highly unstructured data. We reviewed existing systems and found few solutions that address the problems of data variety, data uncertainty, and data speed together. It is important that error-free data arrive at analytics. Existing systems offer single-handed solutions on single platforms. We therefore introduce an integrated framework that can address all three problems in one execution. Using synthetic healthcare big data, we carried out an investigation and found that our proposed system, using a deep learning architecture, offers better optimization of computational resources. The study outcome offers comparatively better response time and a higher accuracy rate than the existing optimization techniques widely practiced in the literature.
This document discusses mobile data analytics and big data analysis using cloud computing. It introduces how massive amounts of data are generated from mobile devices and networks. Big data analysis tools like Hadoop, Spark and Storm are described for processing large datasets. Challenges of mobile big data analytics include limited device resources and connectivity. The document discusses how cloud computing addresses these issues by providing scalable infrastructure and handling computation and storage remotely. It also covers security considerations for cloud-based mobile analytics.
Resource Consideration in Internet of Things: A Perspective View (ijtsrd)
Ubiquitous computing and its applications at different levels of abstraction are made possible mainly by virtualization. Most of its applications are becoming pervasive with each passing day, with the growing trend of embedding computational and networking capabilities in everyday objects used by common people. Virtualization provides many opportunities for research in IoT, since most IoT applications are resource constrained. Therefore, there is a need for an approach to manage the resources of the IoT ecosystem. Virtualization is one such approach, and it can play an important role in maximizing resource utilization and managing the resources of IoT applications. This paper presents a survey of virtualization and the Internet of Things, and also discusses the role of virtualization in IoT resource management. Rishikesh Sahani | Prof. Avinash Sharma, "Resource Consideration in Internet of Things: A Perspective View", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23694.pdf
Paper URL: https://www.ijtsrd.com/computer-science/world-wide-web/23694/resource-consideration-in-internet-of-things-a-perspective-view/rishikesh-sahani
SPECIAL SECTION ON HEALTHCARE BIG DATA (ChereCheek752)
SPECIAL SECTION ON HEALTHCARE BIG DATA
Received August 16, 2016, accepted September 11, 2016, date of publication September 26, 2016,
date of current version October 15, 2016.
Digital Object Identifier 10.1109/ACCESS.2016.2613278
Mobile Cloud Computing Model and Big Data Analysis for Healthcare Applications
LO’AI A. TAWALBEH1,2, (Senior Member, IEEE), RASHID MEHMOOD3, (Senior Member, IEEE),
ELHADJ BENKHLIFA4, AND HOUBING SONG5, (Member, IEEE)
1Department of Computer Engineering, Jordan University of Science and Technology, Irbid 22110, Jordan
2Department of Computer Engineering, Umm Al-Qura University, Mecca 21955, Saudi Arabia
3High Performance Computing Centre, King Abdulaziz University, Jeddah 21589, Saudi Arabia
4Staffordshire University, Stoke-on-Trent ST4 2DE, U.K.
5Department of Electrical and Computer Engineering, West Virginia University, Morgantown, WV 26506, USA
Corresponding author: L. A. Tawalbeh ([email protected])
This work was supported by the Long-Term National Science Technology and Innovation Plan (LT-NSTIP) under Grant 13-ELE2527-10
and in part by the King Abdulaziz City for Science and Technology (KACST), Saudi Arabia.
ABSTRACT: Mobile devices are increasingly becoming an indispensable part of people's daily life, facilitating the performance of a variety of useful tasks. Mobile cloud computing integrates mobile and cloud computing to expand their capabilities and benefits and to overcome their limitations, such as limited memory, CPU power, and battery life. Big data analytics technologies enable extracting value from data having four Vs: volume, variety, velocity, and veracity. This paper discusses networked healthcare and the role of mobile cloud computing and big data analytics in its enablement. The motivation and development of networked healthcare applications and systems is presented, along with the adoption of cloud computing in healthcare. A cloudlet-based mobile cloud-computing infrastructure for healthcare big data applications is described. The techniques, tools, and applications of big data analytics are reviewed. Conclusions are drawn concerning the design of networked healthcare systems using big data and mobile cloud-computing technologies, and an outlook on networked healthcare is given.
INDEX TERMS: Healthcare systems, big data analytics, mobile cloud computing, cloudlet infrastructure, health applications.
I. INTRODUCTION
Recently, there have been many advances in information and communication technologies that are transforming the world; the world is increasingly becoming a small neighborhood. Among these technologies are cloud computing, wireless communications (3G/4G/5G), and the competitive mobile device industry. Mobile devices can provide a variety of services to facilitate our living style [1]. They are integrated into our daily routine to help perform a variety of tasks such as location determination, time management, image processing, booking hotels, selling and buying onli ...
BIG DATA IN CLOUD COMPUTING REVIEW AND OPPORTUNITIES (ijcsit)
Big Data is used in decision-making processes to gain useful insights hidden in the data for business and engineering. At the same time, it presents challenges in processing; cloud computing has helped advance big data by providing computational, networking, and storage capacity. This paper presents a review of the opportunities and challenges of transforming big data using cloud computing resources.
Cloud middleware and services-a systematic mapping reviewjournalBEEI
Cloud computing currently plays a crucial role in the delivery of vital information technology services. A unique aspect of cloud computing is the cloud middleware and other related entities that support applications and networks. Cloud middleware and services at all levels constitute a specific field of research that needs analysis and survey papers to elucidate possible study limitations. The purpose of this paper is to perform a systematic mapping of studies that capture cloud computing middleware, stacks, tools and services. The methodology adopted for this study is a systematic mapping review. The results showed that, on the contribution facet, more papers were published with tool, model, method and process at 18.10%, 13.79%, 6.03% and 8.62% respectively. In addition, in terms of tools, evaluation and solution research had the largest numbers of articles, with 14.17% and 26.77% respectively. A striking feature of the systematic map is the high number of articles in solution research with respect to all aspects of the features applied in the studies. This study showed clearly that there are gaps in cloud computing middleware and delivery services that would interest researchers and industry professionals desirous of research in this area.
2005-03-17 Air Quality Cluster TechTrackRudolf Husar
The document discusses a federated information system called Dvoy that aims to integrate heterogeneous air quality data from different sources and provide uniform access. It does this through the use of wrappers that encapsulate data sources and mediators implemented as web services that resolve logical heterogeneity and allow for standardized querying of multidimensional data cubes. The system uses mediators and wrappers based on previous research to overcome issues of data access, translation and merging across different source schemas and formats.
The document discusses using a mediator-based architecture and web services to provide uniform access to distributed heterogeneous air quality data sources. Key points include:
- A mediator server can homogenize data coding/formatting from various sources, allowing users to access data through a simple universal interface while minimizing changes to data providers.
- Services like Dvoy use mediators and wrappers to resolve technical and logical heterogeneity across sources and provide multidimensional querying of spatial-temporal data cubes.
- This approach facilitates data sharing and integration for improved analysis to address challenges from secondary pollutants and more participatory management of air quality.
TOP CITED UBICOMPUTING ARTICLES IN 2013 - International Journal of Ubiquitous...ijujournal
International Journal of Ubiquitous Computing (IJU) is a quarterly open access peer-reviewed journal that provides an excellent international forum for sharing knowledge and results in the theory, methodology and applications of ubiquitous computing. The current information age is witnessing a dramatic use of digital and electronic devices in the workplace and beyond. Ubiquitous computing presents rather arduous requirements of robustness, reliability and availability to the end user, and has received significant and sustained research interest in terms of designing and deploying large-scale, high-performance computational applications in real life. The aim of the journal is to provide a platform for researchers and practitioners from both academia and industry to meet and share cutting-edge developments in the field.
1) The document discusses challenges researchers face in accessing and using Earth science data from different sources and the potential for web services to help address these challenges.
2) It proposes a service-oriented architecture where data and processing are distributed across a peer-to-peer network and users can compose chains of services to transform raw data into useful knowledge.
3) Key assertions are that web services technologies are promising for building applications to access and analyze distributed Earth science resources, but issues around service discovery, semantics, and dynamic behavior need to be addressed for their full potential to be realized.
IRJET- Swift Retrieval of DNA Databases by Aggregating QueriesIRJET Journal
This document summarizes a research paper that proposes a new method for securely sharing and querying genomic DNA sequences stored in the cloud without violating privacy. The method builds on existing frameworks by offering deterministic results with zero error probability, and a scheme that is twice as fast but uses twice the storage space, which is preferable given cloud storage pricing. The encoding of the data supports a richer set of query types beyond exact matching, including counting matches, logical OR matches, handling ambiguities, threshold queries, and concealing results from the decrypting server. Linear and logistic regression algorithms are used to analyze the data. The literature review discusses previous work on securely sharing genomic data and transforming protocols to ensure accountability without compromising privacy.
An Algorithm to synchronize the local database with cloud DatabaseAM Publications
Since the cloud computing [1] platform is widely accepted by industry, a variety of applications are designed to target cloud platforms. Database as a Service (DaaS) is one of the powerful platforms of cloud computing. There are many research issues in the DaaS platform, one of which is data synchronization. Many approaches have been suggested in the literature for synchronising a local database from within the cloud environment. Unfortunately, very little work is available in the literature on synchronising a cloud database from within the local database. The aim of this paper is to provide an algorithm to solve the problem of data synchronization from a local database to a cloud database.
ISSN 2223-7062 Proceedings and report of the 7th UbuntuNet Alliance annual conference, 2014, pp. 217-230
Cloud-Based Big Data Analytics in
Bioinformatics: A Review
Cephas MAWERE¹, Kudakwashe ZVAREVASHE², Thamari SENGUDZWA³, Tendai PADENGA⁴
¹ Harare Institute of Technology, School of Industrial Sciences & Technology, Biotechnology Dept., P.O. Box BE277 Belvedere, Harare, Zimbabwe
Tel: +263 772 384985, +263 4 741 422-58, Fax: +263 4 741 406, Email: cmawere@gmail.com
² Harare Institute of Technology, School of Information Science, IT Dept., P.O. Box BE277 Belvedere, Harare, Zimbabwe
Tel: +263 733 784352, +263 4 741 422-36, Fax: +263 4 741 406, Email: kzvare@gmail.com
³ Harare Institute of Technology, School of Industrial Sciences & Technology, Biotechnology Dept., P.O. Box BE277 Belvedere, Harare, Zimbabwe
Tel: +263 774 088979, +263 4 741 422-57, Fax: +263 4 741 406, Email: thamaryc@gmail.com
⁴ Harare Institute of Technology, Dean of the School of Information Science, P.O. Box BE277 Belvedere, Harare, Zimbabwe
Tel: +263 775 958368, +263 4 741 422-36, Fax: +263 4 741 406, Email: tepadenga@gmail.com
Abstract
The significant advances in high-throughput sequencing technologies over the last decade
have led to an exponential explosion of biological data. Consequently, bioinformatics is
encountering unprecedented challenges in data storage and analysis, memory allocation,
computational power and time complexity. It is therefore becoming increasingly difficult
for small laboratories, and even some large institutions, to establish and maintain
infrastructures for data processing. This has forced bioinformatics to take a leap forward
from in-house computing to cloud computing to address these issues. In this paper, the
authors thus explore the current applications of big data analytics in bioinformatics on the
cloud. The findings in each use case reveal that cloud computing remarkably improves the
processing and storage of the continuously growing data in bioinformatics. Though there
are still some bottlenecks to solve, cloud computing promises to provide a lightweight
environment in which to develop pipelines for prognosis, diagnosis, drug discovery and
personalized medicine. The knowledge generated by this review will help scientists who
are facing challenges with big data to resort to cloud computing as an alternative solution
and get results faster, without the need to install, maintain or update expensive software.
This paper also serves to bring to scientists' awareness the current technologies that are
used to manage data.
Keywords: High-throughput Sequencing Technologies; Bioinformatics; Data Processing;
In-House Computing; Cloud Computing; Big Data Analytics.
1. Introduction
In the last seven years, more scientific data has been generated than in the entire preceding
human history (Hide, W., 2012). This has been triggered by recent advances in high-throughput
technologies such as Next Generation Sequencing (NGS), genotyping, Single Nucleotide
Polymorphism (SNP) discovery, gene expression, proteomics, and virtual screening
(Blankenburg, L., et al., 2009). For example, GenBank has seen DNA sequencing data
doubling every 14 months over the last decade, as attested by Figures 1-3. As of 21 October
2014, the number of sequence records had reached 10^8, while the number of bases had
catapulted to 10^11 and 10^12 for GenBank and WGS data respectively (GenBank, 2014).
Figure 1: Sequence data from GenBank statistics [3]
Figure 2: GenBank statistics on number of bases [3]
Figure 3: Whole Genome Shotgun (WGS) base data statistics [3]
Inevitably, biological data has exploded beyond the capacity of computers to capture, store,
analyse, search, share, visualize and transfer it for processing in the required time (Monica,
N. & Kumar, R., 2013). Such big data has made it daunting for small laboratories, hospitals
and even some large institutions to establish and maintain infrastructures for data processing
(Dai, L., et al., 2012). This has forced bioinformatics to leap forward from in-house computing
to cloud computing, which promises to address these issues.
But what is big data? The term big data is defined as "a collection of data sets so large,
varied, frequently updated and complex that it becomes difficult to process using on-hand
data management tools" (Movirtu, 2014a; Movirtu, 2014b). The process of analysing such
data to reveal hidden patterns and correlations is called 'big data analytics' (Implementa,
2014). This research therefore seeks to explore the current applications of big data
analytics in bioinformatics on the cloud. In each of the applications, the authors will
highlight how cloud computing has improved the processing of such exponentially growing
biological data.
The paper is thus composed of five further sections. Section 2 gives an overview of cloud
computing and its application in bioinformatics. Section 3 elaborates the applications of
bioinformatics on the cloud for each area of high-throughput technology, and section 4
brings out the advantages and challenges of big data analytics on the cloud. Finally, the
conclusion and recommendations sum it up in sections 5 and 6 respectively.
2. Cloud Computing in Bioinformatics
2.1 Overview of cloud computing
The buzzword 'cloud computing' was inspired by the cloud symbol that is often used to
represent the Internet in flowcharts. In essence, cloud computing is a form of computation
that exploits the full potential of multiple computers and delivers dynamically allocated
virtual resources via the Internet (Telco Review, 2014). These hosted resources include
storage, computation, applications, servers and networks. Such services can be accessed
on demand (in a pay-as-you-go model) by any user via Web Application Programming
Interfaces (APIs), without the user needing to know where the services are hosted or how
they are delivered.
Such virtualization of data in the cloud enables 'snapshots' of large data sets to be moved
from point to point within the cloud at high transfer rates, without tying particular
machines or storage devices to either the source or the destination of the transfer (Butte,
A. J. & Dudley, J. T., 2010). This has enabled Internet-based companies like Google,
Microsoft and Amazon, whose cloud architectures can harness petabyte scales of data, to
offer on-demand services to tens of thousands of users simultaneously (Lin, Y., et al., 2013).
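As a concrete illustration of this on-demand model, below is a minimal sketch of provisioning and then releasing a single virtual machine through Amazon EC2's Web API using Python's boto3 library. The AMI ID is a placeholder and credentials/region are assumed to be configured in the environment, so this shows the pattern rather than a ready-to-run job.

```python
# Minimal sketch: rent a machine on demand via a Web API (boto3 + EC2).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-00000000",   # placeholder machine image, not a real AMI
    InstanceType="t2.micro",  # billed per hour of use: pay-as-you-go
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("Launched", instance_id)

# When the job is done, releasing the resource stops the billing.
ec2.terminate_instances(InstanceIds=[instance_id])
```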
2.2 Bioinformatics resources on the cloud
Bioinformatics clouds involve a large variety of services for big data analytics.
Generally, these clouds fall into four categories (Figure 4): Data as a Service (DaaS),
Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service
(IaaS) (Stanoevska-Slabeva, K. & Wozniak, T., 2010). DaaS enables dynamic data access
on demand and provides up-to-date data that are accessible by a wide range of devices
connected over the Web. An example is Amazon Web Services (AWS), which provides a
centralized repository of public data sets such as 1000 Genomes.
Figure 4: Illustration of bioinformatics resources on the cloud (Dai, L., et al., 2012; Stanoevska-Slabeva, K. & Wozniak, T., 2010)
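To illustrate DaaS in practice, the following sketch lists a handful of objects from the publicly readable 1000genomes bucket on Amazon S3 using boto3; the anonymous-client pattern is standard boto3 usage, while the bucket layout itself is assumed here rather than documented in this paper.

```python
# Minimal sketch: read a public DaaS repository without credentials.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: AWS public data sets such as 1000 Genomes
# can be read without an AWS account.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few objects from the public 1000genomes bucket.
resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```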
IaaS offers a full computing infrastructure in the form of virtualized resources such as CPUs
and operating systems via the Internet. On the other hand, PaaS provides an environment for
users to develop, test and deploy cloud applications, where computing resources automatically
scale to match application demand. Commonly used platforms include the Amazon Elastic
Compute Cloud (EC2), Windows Azure and the Google App Engine. This research, however,
focuses on SaaS, which delivers software services online and facilitates remote access to
available bioinformatics software tools via the Internet.
3. Applications of bioinformatics on the cloud
In particular, this paper reviews the use of cloud computing to address big data generated in
bioinformatics areas such as (i) sequence mapping, (ii) sequence alignment, (iii) sequence
analysis, (iv) peak calling for ChIP-seq data, (v) identification of epistatic interactions of
SNPs, and (vi) drug discovery. These applications run on the Amazon and Windows Azure
platforms using parallel implementations of Hadoop's MapReduce, as illustrated by the
example of CrossBow in Figure 5.
3.1 CrossBow for SNP calling and read mapping
Figure 5: Calling SNPs with CrossBow Russell, J. (2012)
The MapReduce pipeline in Crossbow comprises a Map step, a Sort step and a Reduce
step. The Map step uses the Bowtie algorithm (based on the Burrows-Wheeler Transform,
BWT) to map sequence reads (short DNA sequences about 25-250 base pairs (bp) long)
in parallel to the reference genome. The resulting alignments are then aggregated so that
all alignments on the same chromosome or locus are grouped together and sorted by
position. The sorted alignments are then scanned to identify SNPs within each region
using the SOAPsnp (Short Oligonucleotide Analysis Package for SNP) algorithm. The
results are stored via the Hadoop Distributed File System (HDFS) and then archived in
SOAPsnp format (Russell, J., 2012).
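To make the three-stage flow concrete, below is a minimal, illustrative sketch in Hadoop Streaming style. It is not Crossbow's actual code: `align_read` and `call_snps` are hypothetical stand-ins for the Bowtie and SOAPsnp stages, and Hadoop's shuffle performs the Sort step between the two functions.

```python
# Illustrative sketch of Crossbow's map -> sort -> reduce flow in
# Hadoop Streaming style; align_read and call_snps are hypothetical
# stand-ins for Bowtie and SOAPsnp, not the real tools.
import sys

def align_read(line):
    # Stand-in for Bowtie: pretend the input already names the locus.
    chrom, pos, seq = line.split()
    return chrom, int(pos), seq

def call_snps(chrom, alignments):
    # Stand-in for SOAPsnp: scan one sorted chromosome group for variants.
    print(f"{chrom}: scanned {len(alignments)} alignments")

def mapper():
    # Map step: emit one (chromosome, position, alignment) record per
    # read; Hadoop sorts these by key between the two stages.
    for line in sys.stdin:
        chrom, pos, seq = align_read(line)
        print(f"{chrom}\t{pos}\t{seq}")

def reducer():
    # Reduce step: records arrive sorted, so alignments for the same
    # chromosome are contiguous and can be scanned group by group.
    current, group = None, []
    for line in sys.stdin:
        chrom, pos, seq = line.rstrip("\n").split("\t")
        if chrom != current and group:
            call_snps(current, group)
            group = []
        current = chrom
        group.append((int(pos), seq))
    if group:
        call_snps(current, group)
```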
In a benchmark run on the Amazon EC2 cloud, 2.7 billion reads were genotyped in less than
3 hours using a 320-CPU cluster at a cost of $85. Below is a snapshot of Crossbow on the
cloud showing how one can genotype, or call SNPs from, a genome.
Figure 6: A snapshot of CrossBow on the Amazon EC2 (Gurtowski, J., et al., 2013). On this graphical user
interface (GUI), any user can have a job processed as a pay-as-you-go service by specifying the processing
details in the textboxes.
3.2 CloudBurst
Cloudburst is a new highly sensitive parallel read- mapping algorithm optimized for mapping
next-generation sequence data to reference genomes. Also, it uses the open-source Hadoop
implementation of MapReduce to parallelize execution using multiple compute nodes (Schatz,
M. C., 2009). In a case study done by Schatz in 2008, it was noted that CloudBurst scales
linearly as the number of reads increases, and with near linear parallel speedup as the size of
the number of processors increases (Schatz, M. C., 2009). In a 24- processor core configuration,
CloudBurst is up to 30 times faster than RMAP executing on a single core, while computing
an identical set of alignments.
Figure 7: Running time on an EC2 High-Medium Instance cluster. Comparison of CloudBurst running
times while scaling the size of the cluster for mapping 7 million reads to human chromosome 22 with at
most four mismatches. The 96-core cluster is 3.5x faster than the 24-core cluster (Schatz, M. C., 2009).
Using a larger remote EC2 cluster with 96 cores, CloudBurst improved performance more
than 100-fold over a single core. The running time for typical jobs mapping millions of
short reads to the human genome was thereby reduced from hours to mere minutes, as
shown in Figure 7.
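As a quick check on these numbers, the short sketch below computes speedup and parallel efficiency from the relative running times reported above; the timing values come from the text, and everything else is illustrative.

```python
# Speedup/efficiency arithmetic for the CloudBurst scaling figures.
def speedup(t_slow, t_fast):
    return t_slow / t_fast

# The 96-core cluster is reported as 3.5x faster than the 24-core one,
# so take relative times 3.5 (24 cores) and 1.0 (96 cores):
s = speedup(3.5, 1.0)
print(s)                 # 3.5, versus an ideal 96/24 = 4.0
print(s / (96 / 24))     # 0.875 parallel efficiency: near-linear scaling
```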
3.2.1 Genome-Wide Epistatic Interactions
Following high-throughput SNP genotyping, the complex relationship between genotype and
phenotype can now be unravelled on the cloud. As the cardinality of SNP combinations
grows, a dynamic clustering approach is used to detect high-order genome-wide epistatic
interactions. In one case study, the speed-up of the dynamic clustering method (DCHE) was
evaluated on the Windows Azure platform on datasets ranging in size from 10 000 to 100 000
SNPs, with the number of computing nodes varied from 1 to 40 in steps of 5 (Figure 8).
Figure 8: Genome-wide epistatic interactions. Computing nodes are sampled from 1 to 40 at intervals of 5.
The curves show the speed-up for M = 10 000, M = 50 000 and M = 100 000, with sample size fixed at 800
(Guo, X., 2014).
The results showed that when there is a limited number of SNPs in the dataset, for instance
M = 10 000, the speed-up curve is lower. This is because a dataset of that size along the SNP
dimension can easily be processed by the stand-alone version of the program in a couple of
minutes (Guo, X., 2014). However, as the size of the dataset increases, the speed-up improves
(the larger-M curves in Figure 8). The cloud implementation of DCHE therefore performs
very well with respect to speed-up.
3.3 Virtual Screening
Virtual screening is an established time- and cost-effective alternative to in vitro
experiments for initial hit finding in the early stages of drug discovery (Certara, 2013). A
virtual screening experiment using Surflex-Dock was performed in the Amazon compute
cloud to find lead structures that bind to the target under investigation. It consisted of
375 000 ligand-receptor dockings run on 160 cores, and the experiment was completed in
under 9 hours at a total cost of less than $20.
Figure 9: Surflex-Dock on Amazon EC2. One c1.xlarge EC2 instance performs like a modern four-core
desktop computer (~90 hours); an EC2 cluster of 40 c1.xlarge instances performs the same task in 2 hours
(Certara, 2013).
Prior to this, a benchmark study docking 100 000 structures into a single receptor
conformation was carried out (Figure 9). A single c1.xlarge EC2 instance, with its eight
cores, was found to deliver nearly the same performance as a reasonably modern desktop
computer with four cores (~90 hours), even though it has less CPU power per core. EC2
clusters of 20 and 40 c1.xlarge instances performed the same task in 4 and 2 hours,
respectively (Certara, 2013).
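The back-of-envelope sketch below reproduces this scaling from the ~90-hour single-instance baseline quoted above; the hourly price used is a hypothetical figure for illustration, not a quoted AWS rate.

```python
# Back-of-envelope scaling for the Surflex-Dock benchmark above.
BASELINE_HOURS = 90             # one c1.xlarge, 100 000 dockings (from text)
PRICE_PER_INSTANCE_HOUR = 0.50  # hypothetical USD rate, for illustration only

for n_instances in (1, 20, 40):
    hours = BASELINE_HOURS / n_instances   # assumes near-linear scaling
    cost = hours * n_instances * PRICE_PER_INSTANCE_HOUR
    print(f"{n_instances:>2} instances: ~{hours:.1f} h, ~${cost:.0f}")
# Under linear scaling the cost stays flat while wall-clock time shrinks:
# that elasticity is the argument made throughout this section.
```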
3.4 Peak calling
Chromatin immunoprecipitation (ChIP), coupled with massively parallel short-read
sequencing (seq), is used to probe chromatin dynamics (Feng, X., et al., 2011). PeakRanger
is an algorithm used to resolve closely spaced peaks at both punctate sites, such as
transcription factor binding sites, and broad regions, such as histone modification marks.
In a use case carried out by Feng et al. (2011), the MapReduce version of PeakRanger
demonstrated that, on a fixed number of nodes with increasing data set sizes, the execution
time was shorter and grew more slowly than that of the regular single-processor version.
For example, the cloud version processed a 14 Gb dataset of 192 million reads in less than
5 minutes, more than 10 times faster than the original PeakRanger (Figure 10a).
Figure 10: Performance of PeakRanger in cloud parallel computing. A) test with a fixed number of
nodes and data sets of increasing size; B) test with increasing numbers of nodes and data sets of fixed
size (Feng, X., et al., 2011).
In a second test, the scaling of runtime against the number of nodes was examined. As
expected, the runtime decreased rapidly until the number of nodes equalled the number of
chromosomes; adding further nodes provided no additional benefit (Feng, X., et al., 2011).
Thus, PeakRanger on the cloud provides marked improvements in run time and is
therefore suitable for high-throughput environments.
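The plateau can be reproduced with a toy model: when work is partitioned by chromosome, wall-clock time is bounded below by the largest per-chromosome workload, so nodes beyond the number of chromosomes sit idle. The chromosome "sizes" below are made-up illustrative numbers, not data from the study.

```python
# Toy model of per-chromosome partitioning (illustrative numbers only).
chrom_work = {"chr1": 8.0, "chr2": 7.5, "chr3": 6.5, "chr22": 1.5}

def est_runtime(n_nodes):
    # Greedy longest-first assignment of chromosomes to nodes; the
    # busiest node determines the wall-clock time.
    loads = [0.0] * n_nodes
    for w in sorted(chrom_work.values(), reverse=True):
        loads[loads.index(min(loads))] += w
    return max(loads)

for n in (1, 2, 4, 8):
    print(n, est_runtime(n))
# Beyond n = 4 (the number of chromosomes here) the runtime is stuck
# at max(chrom_work.values()), matching the plateau in Figure 10B.
```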
4. Advantages and Challenges of Cloud Computing
4.1 Benefits of using bioinformatics on the cloud
From the use cases just explored on cloud implementations of bioinformatics tools, cloud
computing has the following advantages:
- It provides a unified, location-independent platform for data and computation. It is thus available even to small labs, which can harness its power for big data analytics in bioinformatics.
- Its pay-per-use model gives access to any user.
- It is flexible; virtual resources can be shared.
- The costs incurred are affordable and relatively cheap for processing big biological data.
- It provides near-linear scaling of software tools with an increasing number of nodes.
- It harnesses the power of parallel execution provided by MapReduce.
- It is time-effective; execution is faster than on a local in-house cluster.
- It eliminates the need for local installation and eases software maintenance and updates, providing up-to-date cloud-based services.
4.2 Bottlenecks of bioinformatics on the cloud
Nevertheless, the following challenges are commonly faced on the cloud for big data
analytics:
- Security: trustability issues. How confidential is the data? There is also limited control over remote storage; security updates are installed by vendors and the data is stored in a redundant manner. Data can be encrypted to secure it against eavesdropping during transmission (Forer, L., et al., 2012).
- Data transfer: large data sets need to be moved to the cloud, yet upload speeds are slow because of low bandwidth. Some vendors have therefore offered to ship data on hard drives (Forer, L., et al., 2012). A promising solution is the use of high-speed transfer technologies such as data compression, peer-to-peer data distribution, and Aspera's fasp high-speed file transfer technology, which speeds file transfers over the Web and outperforms traditional technologies such as FTP and HTTP [http://www.asperasoft.com/en/technology/fasp_overview_1/fasp_technology_overview_1]; a back-of-envelope illustration follows this list.
- Scalability: parallelization is needed, and it depends on the algorithm itself whether MapReduce suits the use case and desired effects.
- Usability: bioinformatics involves pipelines of scripts and command-line interfaces. Setting up nodes in the cloud involves a lot of command-line work and some deeper understanding, and adapting new applications to the cloud still requires technical expertise.
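To see why transfer so often dominates, the sketch below estimates upload time from dataset size and link bandwidth; the 500 GB size, 100 Mbps uplink, and 3:1 compression ratio are illustrative assumptions, not figures from the cited work.

```python
# Rough transfer-time arithmetic behind the data-transfer bottleneck.
def upload_hours(dataset_gb, mbps, compression_ratio=1.0):
    bits = dataset_gb * 8e9 / compression_ratio   # effective bits to send
    return bits / (mbps * 1e6) / 3600

# A 500 GB sequencing run over a 100 Mbps uplink:
print(f"{upload_hours(500, 100):.1f} h raw")          # ~11.1 h
print(f"{upload_hours(500, 100, 3.0):.1f} h at 3:1")  # ~3.7 h compressed
# At these rates, shipping disks or using protocols like fasp that
# saturate the link becomes attractive for multi-terabyte data sets.
```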
5. Conclusion
The significant advances in high-throughput technologies like NGS have led to an explosion
of biological data, giving rise to bottlenecks in its storage and processing. This has forced
bioinformatics to shift its focus from in-house computing to cloud computing to address
these big data challenges. SaaS was the primary focus of our review, with emphasis on
software that can be used on the cloud to address big data challenges in bioinformatics.
Cloud computing of big data in bioinformatics can be done on several platforms, such as
Amazon EC2, Windows Azure and Google App Engine. Most of these platforms implement
the open-source MapReduce framework for parallel execution, which, in addition to the
other benefits discussed, makes the cloud preferable to in-house computing. Based on the
findings from the explored use cases, cloud-based resources make big data meaningful and
usable upon processing.
6. Recommendations and Future Works
High-speed transfer technologies can be used in future to aid big data transfer. Despite the
challenges existing in cloud computing, future efforts must also be devoted to opening
publicly accessible bioinformatics clouds to the whole scientific community, to accelerate
diagnosis, prognosis, drug discovery and personalized medicine. Bioinformatics clouds
should integrate both data and software tools, and a lightweight programming environment
should be provided to help in the development of pipelines for data analysis.
References
Blankenburg, L., Haberland, L., Elvers, H-D., Tannert, C. & Jandrig, B. (2009) 'High-Throughput Omics Technologies: Potential Tools for the Investigation of Influences of EMF on Biological Systems' Curr Genomics. 10 (2) pp. 86-92.
Butte, A. J. & Dudley, J. T. (2010) 'In silico Research in the Era of Cloud Computing' Nature Biotechnology. 28 (11) pp. 1181-1185.
Certara. (2013) ‘Cloud Computing Enables Cost Effective Virtual Screening’ pp. 1-4.
http://www.certara.com/ [accessed 29 August 2014].
Dai, L., Xiao, J., Zhang, Z., Gao, X. & Guo, Y. (2012) ‘Bioinformatics Clouds for Big Data
Manipulation’ Biology Direct. 7 (43) pp. 1-7.
Feng, X., Stein, L. & Grossman, R. (2011) 'PeakRanger: A Cloud-Enabled Peak Caller for ChIP-seq Data' BMC Bioinformatics. 12 (139) pp. 1-11.
Forer, L., Schonherr, S., Weissensteiner, H., et al. (2012) 'Cloud Computing' Computational Medicine. pp. 27-36.
GenBank. (2014) 'GenBank Statistics' http://www.ncbi.nlm.nih.gov/genbank/statistics [accessed 21 October 2014]
Guo, X., Meng, Y., Yu, N., & Pan, Y. (2014) ‘Cloud computing for Detecting High-Order
Genome-Wide Epistatic Interaction via Dynamic Clustering’ BMC Bioinformatics. 15 (102)
pp. 1471-2105.
Gurtowski, J., Schatz, M., & Langmead, B. (2013) http://bio-cloud-1449786154.us-east-1.elb.amazonaws.com/cgi-bin/crossbow.pl [accessed 8 November 2014]
Hide, W. (2012) ‘The Promise of Big Data’. HSPH News.
http://www.hsph.harvard.edu/news/magazine/spr12-big-data-tb-health-costs. [accessed 31
August 2014].
Implementa. (2014) http://www.implementa.com/solutions [accessed 10 June 2014]
Lin, Y., Yu, C., & Lin, Y. (2013) ‘Enabling Large-Scale Biomedical Analysis in the Cloud’
BioMed Research International. Volume 2013 pp. 1-6.
Monica, N. & Kumar, R. (2013) 'Survey on Big Data by Coordinating MapReduce to Integrate Variety of Data' International Journal of Science & Research. 2 (11) pp. 236-242.
Movirtu. (2014a) Movirtu Virtual SIM Solutions. http://www.movirtu.com/#!manyme/c1aq3 [accessed 10 September 2014]
Movirtu. (2014b) Press Release. http://www.movirtu.com/#!090614-airtel-movirtuagreement/c140h [accessed 10 September 2014]
Russell, J. (2012) ‘Hadoop’s Rise in Life Sciences’ Bio-IT World. pp. 1-7.
Schatz, M. C. (2009) 'CloudBurst: Highly Sensitive Read Mapping with MapReduce' Bioinformatics. 25 (11) pp. 1363-1369.
Stanoevska-Slabeva, K. & Wozniak, T. (2010) 'Cloud Basics - An Introduction to Cloud Computing'. Cited in: Dai, L., Xiao, J., Zhang, Z., Gao, X. & Guo, Y. (2012) 'Bioinformatics Clouds for Big Data Manipulation' Biology Direct. 7 (43) pp. 1-7.
Telco Review. (2014) 'Will the virtual SIM break the telco stranglehold?' http://techday.com/telco-review/news/will-the-virtual-sim-break-the-telco-strangle-hold/195914/ [accessed 28 October 2014]
Biographies
Cephas Mawere: Attained his BTech degree in Biotechnology at CUT, Zimbabwe in 2010.
Completed his MTech in Bioinformatics at JNTUH, India in September 2014. He is a HIT
Lecturer and Researcher. His research interests encompass the fields of Big Data, Cloud
Computing, Information Security and Computer-Aided Drug Design.
Kudakwashe Zvarevashe: Attained his BSc degree in Information Systems at MSU,
Zimbabwe in 2010. Completed his MTech in Information Technology at JNTUH, India in
September 2014. He is a HIT Lecturer and Researcher. His research interests are in the areas
of Big Data, Information security, Cloud Computing and Web Services.