Submit Search
Upload
App Engine Application for Detecting Similar Files in Google Drive
•
0 likes
•
4 views
IRJET Journal
Follow
https://www.irjet.net/archives/V10/i1/IRJET-V10I120.pdf
Read less
Read more
Engineering
Report
Share
Report
Share
1 of 5
Download now
Download to read offline
Recommended
Design and Monitoring Performance of Digital Properties
Design and Monitoring Performance of Digital Properties
IRJET Journal
D017372538
D017372538
IOSR Journals
A Generic Open Source Framework for Auto Generation of Data Manipulation Comm...
A Generic Open Source Framework for Auto Generation of Data Manipulation Comm...
iosrjce
Analyzing Optimal Practises for Web Frameworks
Analyzing Optimal Practises for Web Frameworks
IRJET Journal
Enhancing Password Manager Chrome Extension through Multi Authentication and ...
Enhancing Password Manager Chrome Extension through Multi Authentication and ...
ijtsrd
Company Visitor Management System Report.docx
Company Visitor Management System Report.docx
fantabulous2024
File Repository on GAE
File Repository on GAE
lynneblue
Developing apps with techstack wp-dm
Developing apps with techstack wp-dm
Actian Corporation
Recommended
Design and Monitoring Performance of Digital Properties
Design and Monitoring Performance of Digital Properties
IRJET Journal
D017372538
D017372538
IOSR Journals
A Generic Open Source Framework for Auto Generation of Data Manipulation Comm...
A Generic Open Source Framework for Auto Generation of Data Manipulation Comm...
iosrjce
Analyzing Optimal Practises for Web Frameworks
Analyzing Optimal Practises for Web Frameworks
IRJET Journal
Enhancing Password Manager Chrome Extension through Multi Authentication and ...
Enhancing Password Manager Chrome Extension through Multi Authentication and ...
ijtsrd
Company Visitor Management System Report.docx
Company Visitor Management System Report.docx
fantabulous2024
File Repository on GAE
File Repository on GAE
lynneblue
Developing apps with techstack wp-dm
Developing apps with techstack wp-dm
Actian Corporation
IRJET- Training and Placement Database Management System
IRJET- Training and Placement Database Management System
IRJET Journal
IRJET-Towards a Methodology for the Development of Plug-In
IRJET-Towards a Methodology for the Development of Plug-In
IRJET Journal
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
IRJET Journal
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays
IRJET - Multitenancy using Cloud Computing Features
IRJET - Multitenancy using Cloud Computing Features
IRJET Journal
Nyc mule soft_meetup_13_march_2021
Nyc mule soft_meetup_13_march_2021
NeerajKumar1965
IRJET- Cross User Bigdata Deduplication
IRJET- Cross User Bigdata Deduplication
IRJET Journal
vinay-mittal-new
vinay-mittal-new
Vinay Mittal
Full Stack Web Development: Vision, Challenges and Future Scope
Full Stack Web Development: Vision, Challenges and Future Scope
IRJET Journal
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET Journal
D033017020
D033017020
ijceronline
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET Journal
Why Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single Repository
Kapil Mohan
why google stores billions of lines of code in a single repository
why google stores billions of lines of code in a single repository
mustafa sarac
Which Application Modernization Pattern Is Right For You?
Which Application Modernization Pattern Is Right For You?
Apigee | Google Cloud
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020
Alaina Carter
Cc unit 5
Cc unit 5
Dr. Radhey Shyam
Yii Framework in the RAD context + Mashup demo built on YII
Yii Framework in the RAD context + Mashup demo built on YII
George-Leonard Chetreanu
Effective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaS
IRJET Journal
Flaw less coding and authentication of user data using multiple clouds
Flaw less coding and authentication of user data using multiple clouds
IRJET Journal
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
IRJET Journal
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
IRJET Journal
More Related Content
Similar to App Engine Application for Detecting Similar Files in Google Drive
IRJET- Training and Placement Database Management System
IRJET- Training and Placement Database Management System
IRJET Journal
IRJET-Towards a Methodology for the Development of Plug-In
IRJET-Towards a Methodology for the Development of Plug-In
IRJET Journal
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
IRJET Journal
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays
IRJET - Multitenancy using Cloud Computing Features
IRJET - Multitenancy using Cloud Computing Features
IRJET Journal
Nyc mule soft_meetup_13_march_2021
Nyc mule soft_meetup_13_march_2021
NeerajKumar1965
IRJET- Cross User Bigdata Deduplication
IRJET- Cross User Bigdata Deduplication
IRJET Journal
vinay-mittal-new
vinay-mittal-new
Vinay Mittal
Full Stack Web Development: Vision, Challenges and Future Scope
Full Stack Web Development: Vision, Challenges and Future Scope
IRJET Journal
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET Journal
D033017020
D033017020
ijceronline
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET Journal
Why Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single Repository
Kapil Mohan
why google stores billions of lines of code in a single repository
why google stores billions of lines of code in a single repository
mustafa sarac
Which Application Modernization Pattern Is Right For You?
Which Application Modernization Pattern Is Right For You?
Apigee | Google Cloud
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020
Alaina Carter
Cc unit 5
Cc unit 5
Dr. Radhey Shyam
Yii Framework in the RAD context + Mashup demo built on YII
Yii Framework in the RAD context + Mashup demo built on YII
George-Leonard Chetreanu
Effective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaS
IRJET Journal
Flaw less coding and authentication of user data using multiple clouds
Flaw less coding and authentication of user data using multiple clouds
IRJET Journal
Similar to App Engine Application for Detecting Similar Files in Google Drive
(20)
IRJET- Training and Placement Database Management System
IRJET- Training and Placement Database Management System
IRJET-Towards a Methodology for the Development of Plug-In
IRJET-Towards a Methodology for the Development of Plug-In
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
MACHINE LEARNING AUTOMATIONS PIPELINE WITH CI/CD
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
apidays London 2023 - API Green Score, Yannick Tremblais & Julien Brun, Green...
IRJET - Multitenancy using Cloud Computing Features
IRJET - Multitenancy using Cloud Computing Features
Nyc mule soft_meetup_13_march_2021
Nyc mule soft_meetup_13_march_2021
IRJET- Cross User Bigdata Deduplication
IRJET- Cross User Bigdata Deduplication
vinay-mittal-new
vinay-mittal-new
Full Stack Web Development: Vision, Challenges and Future Scope
Full Stack Web Development: Vision, Challenges and Future Scope
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
D033017020
D033017020
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache Pig
Why Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single Repository
why google stores billions of lines of code in a single repository
why google stores billions of lines of code in a single repository
Which Application Modernization Pattern Is Right For You?
Which Application Modernization Pattern Is Right For You?
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020
Cc unit 5
Cc unit 5
Yii Framework in the RAD context + Mashup demo built on YII
Yii Framework in the RAD context + Mashup demo built on YII
Effective Information Flow Control as a Service: EIFCaaS
Effective Information Flow Control as a Service: EIFCaaS
Flaw less coding and authentication of user data using multiple clouds
Flaw less coding and authentication of user data using multiple clouds
More from IRJET Journal
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
IRJET Journal
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
IRJET Journal
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
IRJET Journal
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
IRJET Journal
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
IRJET Journal
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
IRJET Journal
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
IRJET Journal
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
IRJET Journal
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
IRJET Journal
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
IRJET Journal
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
IRJET Journal
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
IRJET Journal
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
IRJET Journal
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
IRJET Journal
React based fullstack edtech web application
React based fullstack edtech web application
IRJET Journal
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
IRJET Journal
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
IRJET Journal
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
IRJET Journal
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
IRJET Journal
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
IRJET Journal
More from IRJET Journal
(20)
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
React based fullstack edtech web application
React based fullstack edtech web application
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Recently uploaded
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
João Esperancinha
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
M Maged Hegazy, LLM, MBA, CCP, P3O
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
soniya singh
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
rehmti665
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
purnimasatapathy1234
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
slot gacor bisa pakai pulsa
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
SIVASHANKAR N
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
hassan khalil
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
humanexperienceaaa
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
ranjana rawat
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Low Rate Call Girls In Saket, Delhi NCR
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
rakeshbaidya232001
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
SIVASHANKAR N
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
Tsuyoshi Horigome
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
ranjana rawat
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
srsj9000
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur High Profile
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
pranjaldaimarysona
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
ranjana rawat
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
Call Girls in Nagpur High Profile
Recently uploaded
(20)
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
App Engine Application for Detecting Similar Files in Google Drive
1.
International Research Journal
of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 95 App Engine Application for Detecting Similar Files in Google Drive Vijay Raj1, Namita Bilagi 2, G S Nagaraja3 1 Student at Department of Computer Science and Engineering, RV College of Engineering, Bengaluru, Karnataka 2 Student at Department of Computer Science and Engineering, RV College of Engineering, Bengaluru, Karnataka 3 Professor and Associate Dean at Department of Computer Science and Engineering, RV College of Engineering, Bengaluru, Karnataka ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Duplicate files are rulingoverstoragefordecades. Especially when the storage is meant to backup for multiple users belonging to similar group like family or friends. In this paper we identify and list similar files present in one of the most popular clouds back up storage ‘google drive’. The drive contains large amounts of different files that would ideally be stored by a family in their cloud backup storage. In this application we develop a python code and deploy it in Google App Engine, to analyze similar files in the drive. MD5 hashing approach is used to identify these similar files in the large quantity of data. Key Words: App Engine, Google Drive, MD5, Proximity measure, Replicate file, , Security, Unique files I INTRODUCTION A Similar file has many meaning, such as date modified is different, same file but copied so the date created for that file is different but content is the same or content itself is different [1][3]. These files occupymanygigatoterrabytesof storage, identifying these files especially in personal storage can benefit users saving money. Let us get to know some modules and API used: 1.1 MD5 The MD5 (message-digest algorithm) is widely used hash function that is cryptographically broken. Input: Arbitrary length message. Output: Hash code of size 128-bit. The input message is divided into 512-bit blocks (eighteen 32-bit words) for each chunk. The message must be an multiple of 512-bit blocks else padding is used. One MD5 operation is made up of 64 of them, arranged in four rounds of 16 operations each. Each iteration of the nonlinear function F employs a different function. Ki is a 32- bit constant that is unique for each operation, and Mi is for a 32-bit block of the message input. s stands for a left bit rotation by s. MD5 algorithm operates on a 128-bit state, then divided into four 32-bit words, denoted as A, B, C, and D. 1) Fig.1 MD5 Algorithm[4] Below are initialized to certain fixed constants. A = x67452301 B = 0xEFCDAB89 C = 0x98BADCFE D = 0x10325376 The main algorithm then uses each 512-bitmessage Ablock each time to change the state. Single message block is processed in 4 steps known as rounds, each of which consists of 16 operations based on the non-linearfunction F, modular addition, and left rotation. The fig.1 above shows single round's of operationsthatmessagegoesthrough.Each round uses a different one of the four potential functions F. 1.2 Google App Engine Google App Engine is PaaS model. Web application development and hosting are both handled bythe extremely scalable Google App Engine. The programs are built to accommodate numerous users at once without affecting
2.
International Research Journal
of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 96 overall performance. GAE can be used to construct cloud applications for service providers to offer their products[2]. App Engine characteristics [11]: • Popular language: build application using various language such as Java, Python, Ruby, PHP etc. • Open and flexible: Programmers can install custom libraries and framework on app enginebysupplying docker container. • Powerful application diagnostics: use Google’s cloud monitoring and cloud logging tool to check the health and performance of GAE application. Fix bugs quickly using debugger and error reporting tools. • Application versioning: host and manage different versions of same a. You can choose between two Python language environmentsonAppEngine[11].Bothenvironmentsallow you to utilize serving technology to build your web,mobile, and IoT applications rapidly and with no operational cost. They both scale quickly and effectively to manage increasing demand. Despite the similarities between the two environments, thereare somesignificantdifferences as well. Table 1. Standard vs flexible environment in App Engine 1.3 App engine admin API Google App Engine Admin API is a RESTful API formanaging your App Engine applications. Many of the App Engine administrative tasks can be found in the Google Cloud dashboard and can be accessed to the Admin API. Admin API allow us manage App Engine applications in a way that is best suited for your environment or process. 1.4 Drive API Google Drive is not a brand-new product by Google. It is a rebranded version of Google Docs that users may still use in their familiar ways while adding a new, optional storage capability. A marketingploywasusedtochangetheproduct's name in order to draw customers to the expanding cloud storage sector. In contrast toGoogleDocs'previousemphasis on applications, Google Drive's nameconveystheimpression that its major role is the cloud file storage service[9]. Fig 2. Drive API flowchart 1.5 Google OAuth 2.0 The OAuth 2.0 protocol is used by Google APIs for authentication and authorization. Web server, client-side, installation, and limited-input device applications are some examples of popular OAuth 2.0 events that Google supports. Users are only authenticated when they accept the terms that are provided to them on a user consent screen when using OAuth 2.0 for authentication. Applications that use OAuth 2.0 and satisfy one or more of the verification criteria are verified by Google. II Literature Review Sif[1] is a tool used and all similar files in a large file system are located by this tool . Even though two files may otherwise be completely different, they are deemed comparable if they share a significant number of similar pieces. For instance, a file might be a realignment of another file or it might contain another file, possibly with some alterations. Findingallgroupsofcomparablefilestakesabout 500MB to 1GB of processing time per hour, even for only a 25% similarity. At a quick post-processingstage,theusercan choose the degree of similarityandanumberofotherspecific factors. In file management, information collecting ,data compression, file synchronization, program reuse and plagiarism detection, the application of sif tool can be found. At Hewlett-Packard[3], millions of technical related support documents are housed in the various repositories. Such collectionsareperiodicallymergedandgroomedaspart
3.
International Research Journal
of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 97 of content management. During the process, it is crucial to spot and eliminatesupport papers thatare essentiallycopies of newer versions. By doing this, the collection's quality is enhanced, chaff is removed from search results, and consumer satisfaction is enhanced.TheyPresentamethodforsearching across hugedocumentrepositoriestolocatecomparablefiles. It works when the byte stream is chuncked toidentifyunique signatures that might be present in several files. Clusters of connected files can be found after an investigation of the file- chunk graph. Scalability can be considerably improved by using an optional bipartite graph splitting technique[3]. Anand Bhalerao and Ambika Pawar's [5] work discusses the data deduplication technology background study, including different forms of deduplication and its performance metrics. Various chunking strategies are also discussed with algorithms used in the data deduplication process. The comparisionand taxonomy of several chunking algorithms are described in the paper. The research on chunking algorithms' benefits and drawbacks for potential future research topics is summarised in the paper's conclusion. Mr. Pushpendra Singh Tomar, Dr. Maneesh Shreevastava[6] their work on detecting duplicates using MD5 hash on various web files like pdf, html, etc. In their paper removing duplicate documents isessentialtolowering runtimeand increasing search accuracy. There arebillionsof unique URLs that are currently being retrieved by search engine crawlers, of which hundreds of millions are copies in some way. As a result, speedy replicate detection speeds up indexing andsearches.400millionexactreplicasof1.2billion URLs were discovered using an MD5 hash after investigation by one vendor. To support the system some amount of hardware needed is decreassed and indexing time is much reduced when collection volumes are reduced by tens of percentage points. Researchers at Google[7] publishedapaperonimproving google drive experience, suchas UX changes whereuserscan access recent files, most used files using machine learning techniques. They also discuss about scaling of this feature to business, But the paper majorly focuses on UXoptimizations. Most of papers focus upon finding similar files in large database from different companies. Example: HP uses chunking and matching for detecting similar files[3]. But there very few focused upon personal storagededuplication. Although the methods can be applied on user level storage but have not targeted on popular cloud storages. websites. Table 2. Data set overview All these files are uploaded to google drive, some files are duplicated to get accurate results.Thedatascrappedarevery similar to that of a back up of personal files of a family. The files varied in various parameters such as name of the files, date created, modified, author etc. but if the content of the files are the same their md5 check sum would match. 3.2 Implementation The application is implemented using mainly 4 modules and API: I. OAuth II. Drive API III. MD5 IV. App.yaml for App engine III METHODOLOGY 3.1 DATA SET The data is obtained by applyingwebscrappingonnumerous
4.
International Research Journal
of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 98 Fig 3. Workflow Flowchart The application is implemented using following steps: 1. Google OAuth consent is needed to access google drive as prerequisite of google drive API. OAuth generates a “secret.json” file which we need for our code. 2. Google Drive API takes secret.jsonasinputandallowsuser to authenticate the developer to access their drives. After a successful authentication session, drive API generates “token.json”. 3. Now we are ready to access the files in the drive. For finding similar files we need to get the md5 hash of each file. In google drive when we upload the files, google automatically generates md5 hash for each file and attaches it to the file as metadata. 4. After we access md5 hash, we use brute force approach to string match md5 hash of other files to find similar files. 5. Now to turn the code into an app engine code, we need to create two files: App.yaml : which setstheenvironmentofapp engineserver. For example: if we use python then we have to set runtime: python3. We can also set flexible environment or standard environment, number of vm scaling instances. Requirement.txt: this file contains all the necessary packages that are used in our code. 6. Now the code is ready to be deployed on app engine, but before that we need to enable few things: billing account, App Engine API and Cloud build API. Then using gcloudshell command “ gcloud app deploy” to deploy. 7. To view the results of similar files in google drive visit the URL provided in the cloud shell after deploying. IV RESULT AND ANALYSIS It is important to note that MD5 hash is generated by google drive itself , when user uploads the files drive automatically generated a MD5 hash for each file, this is used for various google drive user experience features. Table 3. P value for similar and dissimilar file types FileType Numberof dissimilar files Numberof Similar files P(similarity) P(dissimilarity) PDF 565 147 0.206 0.793 Images 3233 823 0.203 0.797 Documents 1074 157 0.121 0.872 MP3 168 32 0.16 0.84 MP4 132 30 0.185 0.814 Otherfiles 775 89 0.103 0.896 Total: 5947 1278 0.177 0.823 Measures for different types of similar and dissimilar files can be represented as: P(file similarity in a storage) = Total number of similar / Total number of files in storage P(file dissimilarity in a storage) = (Total number of files in storage - total number of similar) / Total number of files in storage. Below table and graph gives the comparison for number of similar files and dissimilar files. Fig 4. comparison of number of similar and dissimilar files
5.
International Research Journal
of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 99 We can notice that images have most number of replicate data, this is normal to expect from a familybackupsincethey share lot pictures from each family member.Documentsand pdf files have similar number of replicate files, this maybe due to syncing activity betweenvariousapplication,example google cloud and MS office syncing the same files. MP3 has one of the least similar files count. MP4 has 40 similar files, MP4 files are one of the large files compared to others, making MP4 duplicate data take up lot of storage in google drive. Example single movie in MP4 format may be around 1.2 GB (depending on resolution and codecs). Other filescan be classified as miscellaneous, such as the data from an mobile application back up data (.bin, .dat, etc). V CONCLUSION AND FUTURE SCOPE User data have comparably many similar files within the storage. So we built an application code to deploy on Google App Engine to find these similar files. Using App engine we can build scalable web applications and along with Google drive access integration. Detecting similar files in very large amount of data files is not a barrier anymore thanks to App engine scalable computation power. MD5 is a renowned crypto authentication technique, which gives hash of file to match with other files hash to know if it’s the same. Moreover, this application can be used for deduplication purpose, applied on google drive to free large amount ofstorage occupied by duplicate files. There arevery few application that can perform deduplication on google drive, moreover detecting similar files in popular cloud storage like google drive will save money for many users by not buying extra storage. This applicationcanalsobeusedon large data back upcenters, where large files are chunkedand hashes are generated to compare. V REFERENCES [1] Manber, U., 1994, January. Finding Similar Files in a Large File System. In Usenix winter (Vol. 94, pp. 1-10). [2] Severance, C., 2009. Using Google App Engine: Building Web applications. " O'Reilly Media, Inc.". [3] Forman, G., Eshghi, K. and Chiocchetti, S., 2005,August. Finding similar files in large document repositories. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (pp. 394-400). [4] Gupta, S., Goyal, N. and Aggarwal, K., 2014. A review of comparative study of md5 and ssh security algorithm. International Journal of Computer Applications, 104(14t) [5] Bhalerao, A. and Pawar, A., 2017, May. A survey: On data deduplication for efficiently utilizing cloud storage for big data backups. In 2017 international conference on trends in electronics and informatics (ICEI)(pp.933- 938). IEEE. [6] Tomar, P.S. and Shreevastava, M., 2011. The study of detecting replicate documents usingMD5hashfunction. International Journal of Advanced Computer Research, 1(2), p.14.William Stalling7, Fourth Edition, Cryptography and Network Security (Various Hash Algorithms). [7] Qinlu He, Zhanhuai Li and Xiao Zhang, "Data deduplication techniques," 2010 International Conference on Future Information Technology and Management Engineering, 2010, pp. 430-433, doi: 10.1109/FITME.2010.5656539. [8] Chen, S.J., Qin, Z., Wilson, Z., Calaci, B., Rose, M., Evans, R., Abraham, S., Metzler, D., Tata, S. and Colagrosso, M., 2020, August. Improving recommendation quality in google drive. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2900-2908). [9] Gallaway, T.O. and Starkey, J., 2013. Google Drive. The Charleston Advisor, 14(3), pp.16-19. [10] 10. Fan, J. and Xie, W., 1999. Some notes on similarity measure and proximity measure. Fuzzy sets and systems, 101(3), pp.403-412. [11] Google.com. Developer’s Guide. http://code.google.com/appengine/docs/
Download now