Learning Lessons: Building a CMS on top of NoSQL technologies

Presentation for Berlin Buzzwords conference on Lily, our open source content repository based on HBase and SOLR

Learning Lessons: Building a CMS on top of NoSQL technologies Presentation Transcript

  • 1. Learning Lessons Building a content repository on top of NoSQL Technologies IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 2. hello, I’m @stevenn from @outerthought
  • 3. This story is about
  • 4. Complexity » [chart: complexity of the software architecture plotted against age, with releases 1.0, 2.0 and 3.0]
  • 5. Complexity » [chart: complexity and user interest plotted against age, with releases 1.0, 2.0 and 3.0]
  • 6. We Prefer Sophistication » the challenge for us was to scale ... without dropping features
  • 7. The typical CMS ‘architecture’ database (+opt. filesystem) (+ opt. full-text indexes)
  • 8. The typical CMS ‘architecture’ application cache database (+opt. filesystem) (+ opt. full-text indexes)
  • 9. The typical CMS ‘architecture’ more cache application cache database (+opt. filesystem) (+ opt. full-text indexes)
  • 10. The typical CMS ‘architecture’ even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes)
  • 11. The typical CMS ‘architecture’ client even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes)
  • 12. The typical CMS ‘architecture’ client (+cache) even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes)
  • 13. What we found hard to scale » access control » facet browsing » all the nifty stuff people were using our software for » ... anything that required random access to in-memory-cache data for computations
  • 14. Beyond the ‘scaling’ problem » three-prong data layer fs » result set merging (between MySQL & Lucene) » happened in appcode/memory » ‘transactions’, set operations = hard
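
The result set merging mentioned on slide 14 can be pictured with a minimal Python sketch. This is not the authors' code: the two inputs merely stand in for a hypothetical MySQL result and a hypothetical Lucene hit list, and every name here is an assumption made for illustration. It shows why the merge is awkward to scale: the complete structured result has to be held in application memory before the full-text hits can be filtered.

```python
def merge_results(sql_ids, fulltext_hits):
    """Intersect a structured-query result with scored full-text hits.

    sql_ids: document ids matched by the (hypothetical) SQL query.
    fulltext_hits: (doc_id, score) pairs from the (hypothetical) full-text index.
    Returns the ids present in both results, best score first.
    """
    allowed = set(sql_ids)  # the whole SQL result sits in application memory
    merged = [(doc_id, score) for doc_id, score in fulltext_hits if doc_id in allowed]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return [doc_id for doc_id, _ in merged]


sql_ids = [1, 2, 3, 5, 8]
fulltext_hits = [(8, 0.4), (4, 0.9), (2, 0.7), (5, 0.1)]
print(merge_results(sql_ids, fulltext_hits))  # [2, 8, 5]
```

Document 4 has the best score but is dropped because the structured query did not match it; paging, counting and access control all have to be recomputed on top of such a merged list.
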
  • 15. Beyond the three-prong problem » errrr..... “Failover” ..... ?
  • 16. If we were able to add more nodes ... » True Distribution: scalability, availability, performance ... in the line of fire
  • 17. Solution I » do MORE inside the database
  • 18. Functional
  • 19. Functional
  • 20. Infrastructural » more database!
  • 21. even more database!
  • 22. let’s add message buses!
  • 23. [rotated slide text, garbled in extraction; legible fragments: “JDBC”, “JMS”, “RMI!”, “over”, “stuff!”]
  • 24. http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html
  • 25. Business Development 101 » [chart: user interest against budget]
  • 26. Solution II » [chart: sophistication against ability to cope, with releases 1.0, 2.0 and 3.0; labels: mysql, nosql?]
  • 27. Enter The Cambrian Explosion » Cassandra, NoSQL, neo4j
  • 28. Requirements, phase I » automatic scaling to large data sets » fault-tolerance: replication, automatic handling of failing nodes » a flexible data model supporting sparse data » runs on commodity hardware » efficient random access to data » open source, with the ability to participate in the development and thus drive the direction of the project » some preference for a Java-based solution
  • 29. Requirements, phase II » After careful consideration, we realized the important choices were also: » consistency: no chance of having two conflicting versions of a row » atomic updates of a single row, single-row transactions » bonus points for MapReduce integration » e.g. full-text index rebuilding
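
The consistency and single-row atomicity requirements on slide 29 can be sketched as a compare-and-set on one row. The following is a toy in-memory model in Python, not HBase code: the `RowStore` class and its `check_and_put` method are hypothetical names chosen for this illustration, and a single lock stands in for whatever the real store does per row.

```python
import threading


class RowStore:
    """Toy in-memory table with atomic single-row updates (illustrative only)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, row_key):
        with self._lock:
            return dict(self._rows.get(row_key, {}))

    def check_and_put(self, row_key, column, expected, new_values):
        """Apply new_values to the row only if row[column] == expected.

        The check and the write happen under one lock, so two writers can
        never both succeed against the same expected value.
        """
        with self._lock:
            row = self._rows.setdefault(row_key, {})
            if row.get(column) != expected:
                return False
            row.update(new_values)
            return True


store = RowStore()
# First write: the row does not exist yet, so the expected version is None.
first = store.check_and_put("doc1", "version", None, {"version": 1, "title": "draft"})
# A writer holding a stale version loses instead of creating a conflicting row.
stale = store.check_and_put("doc1", "version", None, {"version": 1, "title": "other"})
second = store.check_and_put("doc1", "version", 1, {"version": 2, "title": "final"})
print(first, stale, second, store.get("doc1"))
```

Because every field of a document lives in one row, a guarantee of this kind on a single row is enough to rule out two conflicting versions of that document, without needing multi-row transactions.
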
  • 30. That brought us to HBase, which bought us: » a data model where some column families keep all versions and others do not, which fits our CMS document model very well » ordered tables with the ability to do range scans on them, which allows building scalable indexes on top of them » HDFS, a convenient place to store large blobs » Apache license and community, a familiar environment for us
  • 31. » OK, so now we had a data store !
  • 32. » However, content repository = store + search » ouch!
  • 33. That was easy! (however ...)
  • 34. Search ponderings » CMS = two types of search » structured search » numbers, strings » based on logic (SQL, anyone?) » information retrieval (or: full-text search) » text » based on statistics
  • 35. Search ponderings » All of that, at scale
  • 36. Structured Search » HBase Indexing Library » idea from Google App Engine datastore indexes » http://code.google.com/appengine/articles/index_building.html » content table (rowkey, col, col): (A, val3, foo6), (B, val2, foo7) » index table (rowkey, col), rows in key order: val2-B, val3-A
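
The index-table idea on slide 36 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HBase Indexing Library: `OrderedTable`, `put_record` and `find_by_col1` are hypothetical names, a sorted in-memory list stands in for a table ordered by row key, and the `-` separator and the `"\uffff"` upper bound are simplifications that only work for the plain string values used here.

```python
import bisect


class OrderedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, row_key, row):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def scan(self, start, stop):
        """Yield (row_key, row) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


content = OrderedTable()
index = OrderedTable()


def put_record(row_key, col1, col2):
    content.put(row_key, {"col1": col1, "col2": col2})
    # Index row key = indexed value + content row key, as on the slide.
    index.put(f"{col1}-{row_key}", {"target": row_key})


def find_by_col1(value):
    prefix = f"{value}-"
    # Assumption: "\uffff" sorts after every character used in row keys here.
    return [row["target"] for _, row in index.scan(prefix, prefix + "\uffff")]


put_record("A", "val3", "foo6")
put_record("B", "val2", "foo7")
index_keys = [key for key, _ in index.scan("", "\uffff")]
print(index_keys)            # ['val2-B', 'val3-A']
print(find_by_col1("val3"))  # ['A']
```

A lookup by value becomes a short range scan over the index table instead of a scan over all content. Note that the sketch writes the content row and the index row as two separate steps and never removes stale index entries on update, which is exactly the kind of gap that makes a reliable mechanism for updating secondary indexes necessary.
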
  • 37. Full-text / IR search » Lucene? » no sharding (for scale) » no replication (for availability) » batched index updates (not real-time)
  • 38. Beyond Lucene » Katta » scalable architecture, however only search, no indexing » Elastic Search » very young (sorry) » hbasene et al. » stores inverted index in HBase, might not scale all features » SOLR » widely used, schema, facets, query syntax, cloud branch » More info: http://lilycms.org/lily/prerelease/technology.html
  • 39. ... + ... = ? » Easy! Or?
  • 40. Remember distribution ? Remember secondary indexes ? ➙ Need for reliable queuing
  • 41. Connecting things » we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR) » indexing, reindexing, mass reindexing (M/R) » we needed a reliable method of updating HBase secondary indexes » all of that eventually to run distributed » distribution means coping with failure
  • 42. Solution » ACMEMessageQueue ? Bzzzzzt. We wanted fault-safe HBase persistence for the queues. Also for ease of administration. » ➙ WAL & Queue implemented on top of HBase tables
  • 43. WAL / Queue » WAL » guaranteed execution of synchronous actions » call doesn’t return before secondary action finishes » e.g. update secondary actions » if all goes well, size = #concurrent ops » Queue » triggering of async actions » e.g. (re)index (updated) record with SOLR back-end » size depends on speed of back-end process » will be useful/made available outside of Lily context as well!
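
The split between WAL and Queue on slide 43 can be modelled with a short Python sketch. This is a simplified illustration, not Lily's implementation: the `WalAndQueue` class, its method names and the in-memory dict and deque are all assumptions (the slides state that the real WAL and queue live in HBase tables), and replay in `recover` presumes that the secondary actions are idempotent.

```python
from collections import deque


class WalAndQueue:
    """Toy model of the WAL / Queue split described on the slide."""

    def __init__(self, sync_actions, async_actions):
        self.sync_actions = sync_actions    # must finish before put() returns
        self.async_actions = async_actions  # run later by a worker
        self.wal = {}                       # pending operations: op_id -> record
        self.queue = deque()                # messages awaiting async processing
        self._next_id = 0

    def put(self, record):
        op_id = self._next_id
        self._next_id += 1
        self.wal[op_id] = record            # 1. log the intent first
        for action in self.sync_actions:    # 2. run secondary actions synchronously
            action(record)
        self.queue.append(record)           # 3. hand over to the async side
        del self.wal[op_id]                 # 4. completed: the WAL entry can go
        return op_id

    def recover(self):
        """Replay operations that were logged but never completed."""
        for op_id in sorted(self.wal):
            record = self.wal[op_id]
            for action in self.sync_actions:
                action(record)
            self.queue.append(record)
            del self.wal[op_id]

    def drain(self):
        """Process queued messages; queue size depends on how fast this runs."""
        while self.queue:
            record = self.queue.popleft()
            for action in self.async_actions:
                action(record)


secondary_index = {}
search_index = []


def update_secondary_index(record):
    secondary_index[record["title"]] = record["id"]


def index_in_search_backend(record):
    search_index.append(record["id"])


store = WalAndQueue([update_secondary_index], [index_in_search_backend])
store.put({"id": "A", "title": "hello"})
queued_before_drain = len(store.queue)
store.drain()
print(secondary_index, queued_before_drain, search_index, len(store.wal))
```

When `put` returns, the secondary index is already up to date and the WAL is empty again, which matches the slide's remark that the WAL only holds as many entries as there are concurrent operations. The search back-end catches up whenever `drain` runs, so the queue grows or shrinks with the speed of that back-end.
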
  • 44. The Sum » Lily model (records & fields) » mapped onto HBase (=storage) » indexed and searchable through SOLR » using a WAL/Queue mechanism implemented in HBase » runtime based on Kauri » with client/server comms via Avro
  • 45. Architecture
  • 46. Architecture
  • 47. Roadmap » Today = release of learning material (architecture, model, API, Javadoc) ➥ www.lilycms.org ➥ bit.ly/lilyprerelease » Mid July = ‘proof of architecture’ release (Nearly there!) » from there on, ca. 3-monthly releases leading up to Lily 1.0
  • 48. bit.ly/lilyprerelease
  • 49. License » Apache
  • 50. Business model » Consulting, mentoring, turn-key projects » Strong focus on partner relations » targeting vertical markets » geographic coverage » SaaS offerings » Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data » Not: OLAP
  • 51. More ? » @outerthought » www.lilycms.org/lily/prerelease.html