Your SlideShare is downloading. ×
Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

156
views

Published on

With the emergence of high-performance computing instances in the cloud, massive scale computations have become available to technically every organization. Digital libraries typically employ a …

With the emergence of high-performance computing instances in the cloud, massive scale computations have become available to technically every organization. Digital libraries typically employ a data-intensive infrastructure, but given the resources, advanced services based on data and text mining could be developed. A fundamental issue is the ease of development and integration of such services. We demonstrate the feasibility by providing a case study on a visual machine learning algorithm with MapReduce running in the cloud in a small cluster.


0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
156
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Peter Wittek Swedish School of Library and Information Science University of Boras˚ 29/11/11
  • 2. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case StudyOutline 1 Workflows and Digital Libraries 2 Computational Requirements of Digital Libraries 3 A Workflow in Cloud HPC 4 Experimental Results 5 Open Issues 6 Conclusions
  • 3. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Workflows and Digital LibrariesFundamental Issues in Digital Libraries Digital objects remain authentic and accessible for future re-use Component and management failures Natural disasters Attacks Materials resulting from digital reformatting Information that is born-digital and has no analog counterpart How do you facilitate future re-use?
  • 4. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Workflows and Digital LibrariesDigital libraries and HPC? Advantages: No need for upfront investment Go beyond full-text search Machine learning Pattern matching Social media and graph mining
  • 5. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Computational Requirements of Digital LibrariesDigital Preservation Future-proofing document collections Emulation Migration Workflows are often tremendously compute-intensive
  • 6. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Computational Requirements of Digital LibrariesMachine Learning and Advanced Services A step towards digital curation SaaS approach to digital curation Indexing by Lucene/Nutch Collection-level metadata extraction by Mahout Topic modelling Visualization
  • 7. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study A Workflow in Cloud HPCA Middleware Architecture Support MapReduce Engine Policy Services: Enforcement -Document processes -Context Archival search Storage -Data Interface mining Middleware Grid or Cloud Storage Grid or Cloud Computing A middleware to make adoption by DL practitioners easier Moving towards computational science
  • 8. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study A Workflow in Cloud HPCMapReduce Frameworks Language Distributed MapReduce Hadoop Java Yes/Custom Pure MR-MPI C/C++ Yes/MPI Mixed Phoenix++ C++ No Pure Table: Comparison of some open source MapReduce frameworks
  • 9. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study A Workflow in Cloud HPCAdapting SOM to Digital Libraries Batch formulation of update formula MapReduce version in MR-MPI Needs to work on sparse structures Sparse topic model by random indexing
  • 10. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Experimental ResultsVisual clusters in a 17th-century Corpus
  • 11. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Experimental ResultsHardware and software AWS Cluster Instances 8 cores per instance 24Gbyte of RAM Linux environment OpenMPI
  • 12. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Experimental ResultsRunning time 12000 10000 8000 Time (s) 6000 4000 2000 SOM (RI) SOM (LSI) 0
  • 13. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Experimental ResultsSpeed-up 8 7 6 5 Speed-up 4 3 2 SOM (RI) SOM (LSI) Linear 1
  • 14. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study Open IssuesObstacles to Adoption Persistence and high-reliability MapReduce Not just a technological issue Service-level agreement Particularly problematic Another EU FP7 project working on it: SLA@SOI Niche for alternative cloud providers Difficulty of integration
  • 15. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study ConclusionsOngoing work Use GPU-aware MapReduce Language Distributed MapReduce Mars C/C++ No Pure GPMR C/C++ Yes/MPI Mixed Table: Comparison of GPU-aware open source MapReduce frameworks Adapting MR-MPI Speed-up: 10x
  • 16. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study ConclusionsAcknowledgment Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project. http://shaman-ip.eu/shaman/ Additional funding has been received from Amazon Web Services. http://aws.amazon.com/
  • 17. Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study ConclusionsSummary Cloud and HPC: a solution looking for a problem Digital libraries Computational requirements Expertise Complexity and integration Contact: peterwittek@acm.org

×