This document outlines Project A which involves analyzing a Twitter dataset using tools on a virtual machine. Students are assigned to take a subset of Twitter user profile data, clean it, import it into a MongoDB database on their Jetstream VM, geolocate the user profiles, and visually display the results. The project aims to teach students how to set up a cloud VM, use MongoDB, and manipulate analysis tools. Students must complete the work individually and submit a report by October 31, 2016, discussing what they learned about setting up VMs, using MongoDB, and producing visualized results from complex data.
Management, Access, and Use of Big and Complex Data
Fall 2016 Indiana University

Project A: Twitter Dataset Analysis

Assigned: 11 Oct 2016
Due Date: 31 Oct 2016
Project group size: 1
Deliverables: project report, and additional evidence of work done
Submit where: Canvas
Discuss where: project specific thread on Piazza
In this project you will be taking a Twitter data set through a set of steps that could be seen as a manually executed data pipeline – the data pipeline concept taught in the first couple of lessons of the course. You will use a subset of the Twitter data, just the user profile portion, 10,000 of them. You will invoke tools to clean the data and import it into the MongoDB database that is running on your Jetstream VM. From there you will invoke additional tools to geolocate the user profiles and display the geolocated user profiles visually.

The tutorials that the AIs have delivered over the first four weeks of the semester should have you well prepared to begin this project with ease. The tutorial material is available to you in the Files directory on Canvas:
Tutorial material:
Canvas>Files>Project Material>Generate SSH Key.pdf
Canvas>Files>Project Material>Jetstream Tutorial.pdf
Canvas>Files>Project Material>Transfer files into Instance.pdf
From this project you will have learned how to:
• Set up and use a cloud hosted virtual machine
• Use a MongoDB database
• Manipulate software tools that are given to you or that you find at cloud hosted sites to produce a specific visualized result
Students will work alone on this project. That is, project group size = 1. The point of contact for the projects is your Associate Instructor (AI), who can be reached through Edx/Piazza. Discussion about the project will take place in a Piazza discussion group set up for the project. You will turn in a report and a couple of other pieces of your work; these are described at the end of the project. You will have until 31 Oct 2016 to complete the work.
1. Dataset

The dataset we will use in this project is a portion of a Twitter dataset that was created by researchers at the University of Illinois [1][2] from Twitter data that they collected in May 2011 and cleaned so that, for instance, tweets have user profiles and the following relationships have no dangling followers (followers with no profiles). We obtained the dataset in August 2014. It is free and open for use, but uses of the dataset beyond this classroom project must include this citation:
Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin Chen-Chuan Chang: Towards social user profiling: unified and discriminative influence model for inferring home locations. KDD 2012: 1023-1031
The full dataset contains Twitter data of three types:
• "Following" relationships: follower/followee relationships; 284 million of these relationships amongst 20 million users
• User profiles: profiles for 3 million users, selected from the 20 million users who have at least 10 relationships in the following network
• Tweets: at most 500 tweets for each of the 140 thousand users who, among the 3 million users, have their locations in their profiles
For Project A, you will use just a subset of the user profiles dataset, containing 10,000 user profiles.

Data File: users_10000.txt
File Size (approx.): 730 KB
Location: Canvas>>Files>>Project Materials
Subject: User Profiles
Description: 10,000 user profiles
2. Setting up your environment

You are required to carry out your project in a virtual machine that you set up and use on Jetstream. If you prefer to work on another platform, you must still demonstrate that your project runs on the Jetstream virtual machine.
2.1 Create and configure Virtual Machine (VM)

Through the tutorials, we have taken you through setting up your own VM for your use, and creating and copying your public key to your VM. If you have not yet set up your Jetstream virtual machine, see the following tutorials on Canvas. Giving your VM a copy of your public key is needed so that you can copy files to your VM. It is also needed if you plan to use a command line shell that is not the Atmosphere shell that Jetstream gives you through its web interface.

Canvas>>Files>>Jetstream Tutorial
Canvas>>Files>>Generate SSH Key.pdf
The VM you configure should be a small one. Do not use a larger VM for this project, as it will eat up compute resources unnecessarily. Be sure to shut down your VM when you are done for the day. If you fail to, it will consume compute resources unnecessarily and reduce your quota.

Name of image to use: 'I535-I435-B669 Project A'
Image style: m1.tiny (1 CPU, 2 GB memory)
Inside your VM, you will find MongoDB already installed. You will need to disable Access Control and Authentication in the configuration file of MongoDB.
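A minimal sketch of what that edit might look like, assuming your image reads its MongoDB configuration from /etc/mongod.conf in YAML form (the path, format, and whether a security section is present at all may differ on your VM):

# /etc/mongod.conf
# Comment out (or remove) the security section so that access
# control and authentication are disabled:
#security:
#  authorization: enabled

After saving the change, restart MongoDB so it takes effect; on most Linux images that is something like:

sudo service mongod restart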
2.2 Set up VM with Pipeline Tools and Dataset

You will need on your virtual machine the cleaning and import tools that you will be needing for this project. To get started you will need a project directory (another name for folder) in which to work. From within your VM, type the command below to create the project directory. This project directory will be used for your tools and dataset.
mkdir Project
When you log into your VM, you will be in the parent folder /home/your_username. To move to the folder Project, type
cd Project
To move back up a level in the directory hierarchy, type
cd ..
Once you have a project directory set up in your VM, you'll need to get the dataset and tools into that directory.

Tools (the software): I590-TwitterProjectCode.tar.gz
Dataset: users_10000.txt
Location of both: Canvas>Files>Project Material
Tutorial on getting tools and dataset into VM: Canvas>Files>Project Material>Transfer files into Instance.pdf
2.3 Build Pipeline Tools in VM so they can run

You will need to build the pipeline tools before you can use them. That's because the tools are scripts that invoke Java code, and you'll need to build the Java code (which means compiling the Java code into executable form).
The two files for both tools and data ('I590-TwitterProjectCode.tar.gz' and 'users_10000.txt') should be in your Project directory at this point. Your command prompt should be in that directory too. To verify that your command prompt is there, type 'pwd' and you will see the current directory:

/home/your_username/Project
Extract the tools from their zipped and tarred package:

tar -zxf I590-TwitterProjectCode.tar.gz
cd I590-TwitterProjectCode
Notice that the second command had you change to a deeper directory. Verify that you are in the directory /home/your_username/Project/I590-TwitterProjectCode by typing at the command line

pwd

From this command you should see: /home/username/Project/I590-TwitterProjectCode
Before building and deploying the code, take a look at the configuration file 'build.properties' by typing
cat build.properties
Inspect the file to ensure it looks as it does below. We suggest you use the VI editor to open the file; a quick cheat sheet for VI is below. You will need to edit both lines (the locations of project.base.dir and java.home) as is appropriate for your VM. Hint: you may just need to replace "username" with your login name.
# $Id: build.properties
# @author: Yuan Luo
# Configuration properties for building I590-TwitterProjectCode
project.base.dir=/home/username/Project/I590-TwitterProjectCode
java.home=/usr/bin
VI editor cheat sheet
type vi filename         open file
type i                   insert or add text
hit <Esc>, type :wq      save and quit
hit <Esc>, type :q!      force quit without save
hit <Esc>, type :w       save
Now you are ready to build the Java software. Your cursor should still be in the directory /home/username/Project/I590-TwitterProjectCode. Type
ant
Check to see if the build was successful by typing at the command line
ls build
If you see a new directory called "classes", then you know the build was successful. Great work! If not, please ask the AIs for help.
2.4 Using the Tools: Running the Data Pipeline

You will be the master orchestrator of your data pipeline. In other words, you will be manually invoking each tool instead of having the pipeline invoke each in turn automatically.
Your pipeline has three tools. While they look like simple scripts, when you open one of them up, you will see that they make calls to Java code. That's why you had to build the software. All your tools are in the directory called "bin".

Reformat.sh          converts the data encoding to another encoding
Import_mongodb.sh    imports Twitter user profiles into MongoDB
QueryAndUpdate.sh    queries the Google geocoding API
Below is the directory structure of the software package. Comments are delimited by parentheses.

I590-TwitterProjectCode
├── bin         (contains scripts (executables); generated after code deployment)
├── build       (build directory, generated during code compile time)
│   ├── classes (class files generated by the java compiler)
│   │   ├── google
│   │   ├── mongodb
│   │   └── util
│   └── lib     (core jar file for scripts in bin)
├── config      (configuration file: config.properties)
├── data        (empty directory, put your data here)
├── input       (query criteria file, query.json, needed for finding docs in MongoDB)
├── lib         (third party dependency library jars)
├── log         (empty directory, log files go here)
├── src         (source code, broken down in next diagram)
│   ├── google
│   ├── mongodb
│   └── util
└── templates   (template files and deploy script; the deploy script generates platform-dependent scripts and outputs them to bin during code deployment)
And the "src" directory, broken down in more detail, is organized as follows:

src
├── google
│   └── GeoCodingClient.java    (returns geocoding results from Google)
├── mongodb
│   ├── Config.java             (extracts parameters from the configuration file)
│   └── MongoDBOperations.java  (selects documents that satisfy given query criteria; updates docs by adding geocode)
└── util                        (utility classes)
    ├── Base64.java             (encoder/decoder)
    ├── PropertyReader.java     (helper class for reading .properties files)
    └── UrlSigner.java          (OAuth 2.0 helper class)
The tools you have are simple but powerful. As said earlier, most are in the form of scripts that invoke Java code. You will execute the analysis as a series of four steps, described below.

Data pipeline task 1: reformat the data
The raw txt file of user profiles is encoded in ISO-8859-1 format. This is a format that the MongoDB NoSQL store does not accept, a common problem. So you will need to convert the txt file into the UTF-8 format that MongoDB accepts. You need to do this before you can store the Twitter user profiles into the MongoDB database.

Reformat the user profile Twitter dataset from ISO-8859-1 to UTF-8 format by running the following reformatting script that is in your bin directory, naming the output file as the second argument:
./bin/reformat.sh <input file> <output file>
(You should first move the users_10000.txt data file from the Project directory to the I590-TwitterProjectCode directory:

mv users_10000.txt I590-TwitterProjectCode/

And then the sample command could be:

./bin/reformat.sh users_10000.txt user_10000.tsv)
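To sanity-check the conversion, the standard file utility will usually report the encoding it detects (a quick check, not part of the pipeline; the exact wording of its output varies by system):

file users_10000.txt user_10000.tsv

Expect something along the lines of "ISO-8859 text" for the input and "UTF-8 Unicode text" for the output.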
Use the vi editor to open the file you created. Add the following line as the first line of the newly reformatted Twitter data file (it becomes the "header line", something MongoDB understands). Be sure that you use tabs to split the fields. If you would rather do this from the command line, see the sketch after the field list.

user_id user_name friend_count follower_count status_count favorite_count account_age user_location
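A minimal command-line sketch that prepends the header (it assumes your reformatted file is named user_10000.tsv; the \t escapes produce the required tab characters):

# write the header line, with fields separated by real tabs
printf 'user_id\tuser_name\tfriend_count\tfollower_count\tstatus_count\tfavorite_count\taccount_age\tuser_location\n' > header.txt
# put the header in front of the data
cat header.txt user_10000.tsv > tmp.tsv && mv tmp.tsv user_10000.tsv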
Data pipeline task 2: Import the data into MongoDB

The tab-separated values (tsv) file could be imported directly into MongoDB; however, it would have no structure. Adding a header line (i.e., field names for each field) allows MongoDB to give structure to each record. The internal format for a MongoDB object is a binary form of JSON (www.json.org).
To import the converted user profile data into MongoDB, run the script ./bin/import_mongodb.sh. The script accepts four parameters:

<db name>             name of database into which data should go
<collection name>     name of collection in database into which data should go
<import file type>    import file type
<file name>           name of tsv file that has data for import
For example:
./bin/import_mongodb.sh projectA profile tsv user_10000.tsv
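This is ordinary MongoDB tooling under the hood; assuming the script simply wraps the stock mongoimport utility (peek at the script to confirm), a roughly equivalent direct command would be:

mongoimport --db projectA --collection profile --type tsv --headerline --file user_10000.tsv

The --headerline flag tells mongoimport to take the field names from the first line of the file, which is exactly why you added the header line in task 1.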
Data Pipeline Task 3: Query and Update the User Profile Collection

The Twitter user profile permits a Twitter user to input arbitrary text as their location, meaning user locations can be anything. Through the QueryAndUpdate.sh tool you will access the Google geocoding API to validate user locations and extract a valid latitude/longitude for each user location. If you are interested in what the Google geocoding API does, take a look here: https://developers.google.com/maps/documentation/geocoding
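To get a feel for what the tool does on your behalf, a raw geocoding request looks roughly like this (a sketch of the keyless, anonymous form that matches the option you will use below; the address value here comes from a profile's user_location field):

https://maps.googleapis.com/maps/api/geocode/json?address=El+Paso,Tx.

The JSON response carries the coordinates under results[0].geometry.location (lat/lng), which the tool writes back into the matching MongoDB document.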
You are now ready to run geolocation on the user profiles in MongoDB to add lat/lon geolocation information to the user profiles. You will need the configuration file and the query criteria file, which can be found on your VM by following the code package tree structure that we gave you above. The <db name> and <collection name> are the same names you used in the previous step. Note that the tool will exit when the number of Google geocoding queries reaches the daily limit.
Simple but workable software for doing the geocoding is QueryAndUpdate.sh. It's a script, but you should peek at the Java code that the script invokes to see how it works. The Java code is at src/google/GeoCodingClient.java (see tree structure above).
QueryAndUpdate.sh allows you to specify an authentication option in the configuration file that you can find in the config directory (see tree structure above). While Google provides three authentication options, you will use the anonymous user option:
• Anonymous user: limited to making 2500 (or a little more) geocoding queries per day. To use this option, leave all authentication configuration parameters blank. This means you will need to run your tool 4 times over 4 days to finish geocoding all 10,000 user profiles. This workaround is simpler than the other authentication options.
./bin/QueryAndUpdate.sh <configuration file> <db name>
<collection name> <query criteria file> <log file>
For example:
./bin/QueryAndUpdate.sh config/config.properties projectA profile
input/query.json test1.log
A sample of the geocode information that is added by the Google geocoding service is given below:

{
  "geocode" : {
    "formatted_address" : "Noel N5, Kitimat-Stikine D, BC V0J, Canada",
    "location" : { "lat" : 57.4755555, "lng" : -132.3597222 }
  }
}
The example below shows how to query for a record from within MongoDB. The example queries for the user profile with user_id 100008949. From there it updates the record to add the geolocation information. Finally, another query is issued showing that the update was successful.
$ mongo
> use projectA
switched to db projectA
> db.profile.find({user_id:100008949})
{ "_id" : ObjectId("5415fc01d77bc408f1397df5"), "user_id" :
NumberLong(100008949), "user_name" : "esttrellitta",
"friend_count" : 264, "follower_count" : 44, "status_count" :
6853, "favorite_count" : 0, "account_age" : "28 Dec 2009 18:01:42
GMT", "user_location" : "El Paso,Tx." }
>
>db.profile.update({user_id:100008949},{$set: {geolocation
:{formatted_address: "El Paso, TX, USA", location:{lat:
31.7775757, lng:-106.6359219}}}})
>
> db.profile.find({user_id:100008949})
{ "_id" : ObjectId("5415fc01d77bc408f1397df5"), "user_id" :
NumberLong(100008949), "user_name" : "esttrellitta",
"friend_count" : 264, "follower_count" : 44, "status_count" :
6853, "favorite_count" : 0, "account_age" : "28 Dec 2009 18:01:42
GMT", "user_location" : "El Paso,Tx.", "geolocation" : {
"formatted_address" : "El Paso, TX, USA", "location" : { "lat" :
31.7775757, "lng" : -106.6359219 } } }
QueryAndUpdate.sh uses the find method to query MongoDB. A sample query criterion used in the find method is this:

{
  "geocode": {"$exists": false}
}
Additional reference for the query criteria is here:
http://docs.MongoDB.org/manual/core/crud-introduction/#query

To check for results, use database commands to query MongoDB directly:
http://docs.mongodb.org/manual/reference/command
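For instance, a quick way to see how far the geocoding has progressed is to count documents from the mongo shell (a sketch; it assumes the tool adds a field named geocode, as in the query criteria sample above — inspect one updated document to confirm the field name in your data):

$ mongo
> use projectA
> db.profile.count({geocode: {$exists: true}})   // profiles already geocoded
> db.profile.count({geocode: {$exists: false}})  // profiles still to do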
Data Pipeline Step 4: Visualization

The final step in the data pipeline is to visualize selected user profiles and their geo-locations using Google Maps. You will select a subset of user profiles (no less than 50) and plot them.
To visualize the geo-location corrected user profile dataset, you will need to export the user names and long/lat coordinates to a csv file and reformat it (again!!) to conform with the format that the Google chart JavaScript library can use. An example of such a format is below; note that the first row in the "arrayToDataTable" format gives the field names. This should help you get the lat/long in the right places. You will hand in a screen shot of the visualization of your data.
var data = google.visualization.arrayToDataTable([
['Lat', 'Long', 'Name'],
[37.4232, -122.0853, 'Work'],
[37.4289, -122.1697, 'University'],
[37.6153, -122.3900, 'Airport'],
[37.4422, -122.1731, 'Shopping']
]);
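For orientation, a minimal HTML page built around that data structure might look like the sketch below. This is an assumption, not the required solution: it uses the Google Charts 'map' package, and Google may require a Maps API key (supplied via the mapsApiKey load option) for the map to render:

<html>
  <head>
    <script src="https://www.gstatic.com/charts/loader.js"></script>
    <script>
      // Load the Google Charts map package, then draw once it is ready.
      google.charts.load('current', {packages: ['map']});
      google.charts.setOnLoadCallback(drawMap);
      function drawMap() {
        var data = google.visualization.arrayToDataTable([
          ['Lat', 'Long', 'Name'],
          [37.4232, -122.0853, 'Work']  // ...replace with your 50 profiles
        ]);
        var map = new google.visualization.Map(document.getElementById('map_div'));
        map.draw(data, {showTooltip: true, showInfoWindow: true});
      }
    </script>
  </head>
  <body>
    <div id="map_div" style="width: 900px; height: 500px"></div>
  </body>
</html>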
The original TSV dataset contains 10,000 profiles. After inserting and updating these profiles, most now have real-world geo-locations. Your final task is to query MongoDB to extract 50 user profiles with their geo-location information, and plot the geolocations on a map.
You will create an html file, which has 50 user profiles with their geo-locations in the 'arrayToDataTable' data structure as above. You then open this html file in your browser to visualize the result.
For more information, see:
https://docs.mongodb.com/manual/reference/program/mongoexport/
https://developers.google.com/chart/interactive/docs/gallery/map
https://developers.google.com/maps/documentation/javascript/tutorial
Visualization API: https://developers.google.com/chart/interactive/docs/reference
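One way to pull the 50 profiles out of MongoDB is mongoexport, per the first link above. A minimal sketch (the nested field paths assume the geocode data landed under geolocation.location, as in the update example earlier; adjust the field names and query to match your actual documents):

mongoexport --db projectA --collection profile --type=csv \
  --fields user_name,geolocation.location.lat,geolocation.location.lng \
  --query '{"geolocation": {"$exists": true}}' \
  --limit 50 --out profiles.csv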
3. Deliverables

Submit the following through Canvas:
1. The exported portion of your MongoDB dataset in tab separated value form. The dataset will include only those profiles that you chose to plot.
2. The html file that underlies the Google map picture of your selected region.
3. A written report that:
   a. Lists all sources of help that you consulted.
   b. Answers the following questions:
      i. How many locations were you able to validate (i.e., geolocate)? What is the remaining number? Give suggestions for resolving those that you were not able to resolve.
      ii. List ways in which you think this pipeline could be improved, including other tools that could be used.
References

[1] Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin Chen-Chuan Chang: Towards social user profiling: unified and discriminative influence model for inferring home locations. KDD 2012: 1023-1031
[2] https://wiki.cites.illinois.edu/wiki/display/forward/Dataset-UDI-TwitterCrawl-Aug2012
[3] MongoDB Reference: http://docs.MongoDB.org/manual/reference
[4] Instructions to dump the MongoDB db: http://docs.MongoDB.org/manual/reference/program/mongodump