
Storing and distributing data


An overview talk about the decisions I've made so far in architecting the BHL clustered, distributed storage filesystem. It covers the background of the proposal, the proof of concept, a current status report, and thoughts on future implementations and uses.

Published in: Technology


  1. storing and distributing data
     Phil Cryer, open source systems architect and accidental tourist, BHL/Mobot, Saint Louis, MO
  2. data problem?
  3. data problem!
  4. data storage
     • more data is constantly being created and saved
     • storage has not kept up with Moore’s Law
     • expanding SANs is expensive (not sustainable)
     • backups are for disaster recovery, not redundancy or failover
  5. bhl data storage
     • approximately 24 TB of data, cannot host locally
     • currently stored at Internet Archive (IA)
     • lack of control for editing and updating
     • single point of failure
  6. bhl data delivery
     • approximately 24 TiB of data, cannot host locally
     • currently stored at Internet Archive (IA)
     • lack of control for editing and updating
     • single point of failure
     • save metadata locally at Mobot to remix and serve
     • the rest is sourced and served from IA for delivery
     • no control over delivery (CDN, delivery network)
     • far from ideal
  7. proposal
  8. Use Linux and open source software running on commodity hardware to create a scalable, distributed filesystem.
  9. distributed filesystems
     • write once, read anywhere
     • replication, fault tolerance and redundancy
     • error correction
     • scalable horizontally
  10. distributed filesystems
      • many open source options available to try
      • narrowed down to three to evaluate
  11. GlusterFS
  12. GlusterFS
      • a clustered filesystem capable of scaling to several petabytes
      • open source software that runs on commodity hardware
      • easy to install and manage (runs in userspace)
      • very flexible and customizable
      • offers seamless expansion and updating
  13. proof of concept
  14. proof of concept
      • a six box cluster with servers running Debian GNU/Linux
      • using GlusterFS as the distributed filesystem
      • populated it and synced data with a remote cluster run by Anthony (data 616 <==> MBL)
      • we simulated hardware failures
      • ran map/reduce jobs (distributed computing)
      • defined procedures, configurations and build scripts
  15. approved
      • who could support the cluster?
      • what hardware would we build the cluster with?
      • where would the first cluster live?
      • how would data get into and out of the cluster, and what about syncing, serving, etc.?
  16. ^ Anthony Goddard
  17. how to populate
      • we could request that a server populated at IA be shipped to us
      • they could send us raw disks filled with our data
      • we could ‘download’ the data
      • transfer data through “the cloud”
  18. cluster status
      • currently 20,114 books (few more days to go)
      • 22 TiB of disk space used
      • shared across 2 GlusterFS nodes
      • but it hasn’t all been so simple...
  19. cluster status
      • six 5U servers hosted at MBL
      • each with 8 GB of RAM
      • 24 × 1.5 TB drives in each server (RAID 5)
      • over 100 TB of usable storage
      • faster connection than ever before (delivery)
      • ultimately the cluster will be split into 2 sets of 3, each in a different building, for further redundancy
  20. $246,000 (graph from Backblaze, http://www.backblaze.com)
  21. complex, or not
      • while our example uses new, faster commodity hardware...
      • it could run on any hardware that can run Linux
      • chain old, outdated computers together
      • build your own cluster for next to nothing (host it in your basement)
      • can solve infrastructure funding issues, provides a working proof of concept before diving in
  22. architecture
  23. data syncing
      • how would data be synced across the clusters?
      • working on “a Dropbox on steroids” (@chrisfreeland)
      • I had an idea of how this would work and set up a test
      • then started a thread on my blog about this - see http://bit.ly/pc-dropbox
      • from this, more options were mentioned: existing open source solutions like Lsyncd, rsync, Openduckbill
      • next, testing between Mobot and the cluster
  24. fedora-commons
  25. fedora-commons
      • Fedora-commons is open source digital repository software (not a Linux distribution)
      • accounts for all additions and changes, so it provides built-in version control
      • provides disaster recovery
      • open standards to mesh with future file formats
      • provides open sharing services such as OAI-PMH
  26. other avenues
  27. Duraspace
  28. Duraspace
      • BHL is participating in a pilot for Duraspace with the New York Public Library
      • Duraspace would provide a link for publishers and cloud providers
      • pilot to show the feasibility of hosting in “the cloud”
      • testing the use of application servers (image server, taxonfinder, etc.) running in “the cloud”
      • pilot to show the feasibility of distributing data globally through “the cloud”
  29. distributed computing
  30. distributed computing
      • map/reduce frameworks (Hadoop, Disco, others)
      • Disco has a GlusterFS plugin
      • make existing data more useful
      • image and OCR re-processing, taxonfinder
      • distributed web servers, geo-load balancing
      • identifier resolution pools...?
  31. seed box(es)
  32. seed box(es)
      • reliable, easy to maintain hardware we can distribute
      • the first would be a seed box; once it is on the network, adding other nodes is a matter of booting off the CD
      • GlusterSP (storage platform) makes this easier (GUI)
  33. sharing the code
      • our code and configurations are all open source and hosted on Google Code - bhl-bits http://code.google.com/p/bhl-bits/
      • our projects server shares detailed instructions on how we’ve built our cluster for others to use http://projects.biodiversitylibrary.org/
      • we have mailing lists: bhl-tech, bhl-bits (announce)
      • ask questions, get involved
  34. code http://code.google.com/p/bhl-bits
      email phil.cryer@mobot.org
      slides http://bit.ly/pc-slides
      twitter @fak3r
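
The hardware figures on the second cluster status slide (six servers, 24 drives of 1.5 TB each, RAID 5, "over 100 TB of usable storage") can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes one RAID 5 array per server and two-way replication across servers, neither of which the slides state. Under those assumptions the estimate lands close to the quoted figure, but the real layout may differ.

```python
def usable_capacity_tb(servers, drives_per_server, drive_tb, replicas):
    """Estimate usable capacity in decimal TB.

    Assumptions (illustrative, not taken from the slides):
    - each server holds a single RAID 5 array, which gives up
      one drive's worth of space to parity
    - every file is stored `replicas` times across servers
    """
    if drives_per_server < 3:
        raise ValueError("RAID 5 needs at least three drives")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    per_server_tb = (drives_per_server - 1) * drive_tb
    return servers * per_server_tb / replicas


# Raw capacity before RAID parity or replication.
print(6 * 24 * 1.5)                                # 216.0
# Usable capacity with the assumed two-way replication.
print(usable_capacity_tb(6, 24, 1.5, replicas=2))  # 103.5
```

Filesystem overhead and the difference between TB and TiB would lower the number further, so treat it as an upper bound for the assumed layout.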

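
The data syncing slide names rsync as one of the existing open source options. As a rough sketch of what a one-way sync between two clusters might look like, the snippet below builds an rsync command and defaults to a dry run. The function names, the mount path, and the remote host are hypothetical placeholders, not the BHL configuration; only standard rsync options are used.

```python
import subprocess


def build_rsync_command(source_dir, dest, dry_run=True):
    """Return the rsync argument list for a one-way mirror.

    --archive   preserve permissions, times, and symlinks
    --compress  compress data in transit
    --delete    remove files on the destination that no longer
                exist on the source (use with care)
    --dry-run   only report what would change
    """
    cmd = ["rsync", "--archive", "--compress", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    cmd += [source_dir.rstrip("/") + "/", dest]
    return cmd


def sync(source_dir, dest, dry_run=True):
    """Run the sync and raise CalledProcessError if rsync fails."""
    return subprocess.run(build_rsync_command(source_dir, dest, dry_run), check=True)


if __name__ == "__main__":
    # Hypothetical mount point and remote host, for illustration only.
    print(build_rsync_command("/mnt/glusterfs/books",
                              "mirror.example.org:/mnt/glusterfs/books"))
```

A tool such as Lsyncd, also mentioned on the slide, watches for filesystem changes and triggers this kind of transfer automatically instead of relying on a scheduled run.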