Learning Lessons: Building a CMS on top of NoSQL technologies

Presentation for Berlin Buzzwords conference on Lily, our open source content repository based on HBase and SOLR

Published in: Technology

  1. Learning Lessons: Building a content repository on top of NoSQL Technologies » IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  2. hello, I’m @stevenn from @outerthought
  3. This story is about
  4. Complexity » [chart labels: complexity, age, software architecture, 1.0, 2.0, 3.0]
  5. Complexity » [chart labels: complexity, age, user interest, 1.0, 2.0, 3.0]
  6. We Prefer Sophistication » the challenge for us was to scale ... without dropping features
  7. The typical CMS ‘architecture’ » database (+ opt. filesystem) (+ opt. full-text indexes)
  8. The typical CMS ‘architecture’ » application cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  9. The typical CMS ‘architecture’ » more cache » application cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  10. The typical CMS ‘architecture’ » even more cache » more cache » application cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  11. The typical CMS ‘architecture’ » client » even more cache » more cache » application cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  12. The typical CMS ‘architecture’ » client (+ cache) » even more cache » more cache » application cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  13. What we found hard to scale » access control » facet browsing » all the nifty stuff people were using our software for » ... anything that required random access to in-memory-cache data for computations
  14. Beyond the ‘scaling’ problem » three-prong data layer (MySQL, Lucene, fs) » result set merging (between MySQL & Lucene) happened in appcode/memory » ‘transactions’, set operations = hard
  15. Beyond the three-prong problem » errrr..... “Failover” ..... ?
  16. If we were able to add more nodes ... » True Distribution » scalability, availability, performance ... in the line of fire
  17. Solution 1 » do MORE inside the database
  18. Functional
  19. Functional
  20. Infrastructural » more database!
  21. even more database!
  22. let’s add message busses!
  23. [rotated slide text, partly unrecoverable; legible fragments: stuff!, JDBC, JMS, RMI, w00t]
  24. http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html
  25. Business Development 101 » [chart labels: user interest, budget]
  26. Solution II » [chart labels: sophistication, ability to cope, 1.0, 2.0, 3.0, mysql, nosql?]
  27. Enter The Cambrian Explosion » NoSQL » Cassandra » neo4j
  28. Requirements, phase I » automatic scaling to large data sets » fault-tolerance: replication, automatic handling of failing nodes » a flexible data model supporting sparse data » runs on commodity hardware » efficient random access to data » open source, with the ability to participate in the development and thus drive the direction of the project » some preference for a Java-based solution
  29. Requirements, phase II » After careful consideration, we realized the important choices were also: » consistency: no chance of having two conflicting versions of a row » atomic updates of a single row, single-row transactions » bonus points for MapReduce integration (e.g. full-text index rebuilding)
  30. That brought us to HBase, which bought us: » a data model where some column families keep all versions and others do not, which fits our CMS document model very well » ordered tables with the ability to do range scans on them, which allows building scalable indexes on top of them » HDFS, a convenient place to store large blobs » Apache license and community, a familiar environment for us
  31. OK, so now we had a data store!
  32. However, content repository = store + search! » ouch
  33. That was easy! (however...)
  34. Search ponderings » CMS = two types of search » structured search: numbers, strings; based on logic (SQL, anyone?) » information retrieval (or: full-text search): text; based on statistics
  35. Search ponderings » All of that, at scale
  36. Structured Search » HBase Indexing Library » idea from Google App Engine datastore indexes » http://code.google.com/appengine/articles/index_building.html » content table (rowkey, col, col): A, val3, foo6; B, val2, foo7 » index table (rowkey, col; ordered by rowkey): val2-B; val3-A
  37. Full-text / IR search » Lucene? » no sharding (for scale) » no replication (for availability) » batched index updates (not real-time)
  38. Beyond Lucene » Katta: scalable architecture, however only search, no indexing » Elastic Search: very young (sorry) » hbasene et al.: stores inverted index in HBase, might not scale all features » SOLR: widely used, schema, facets, query syntax, cloud branch » More info: http://lilycms.org/lily/prerelease/technology.html
  39. ? + = Easy! Or?
  40. Remember distribution ? Remember secondary indexes ? ➙ Need for reliable queuing
  41. Connecting things » we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR) » indexing, reindexing, mass reindexing (M/R) » we need a reliable method of updating HBase secondary indexes » all of that eventually to run distributed » distribution means coping with failure
  42. Solution » ACMEMessageQueue ? Bzzzzzt. We wanted fault-safe HBase persistence for the queues. Also for ease of administration. » ➙ WAL & Queue implemented on top of HBase tables
  43. WAL / Queue » WAL: guaranteed execution of synchronous actions » call doesn’t return before secondary action finishes » e.g. update secondary actions » if all goes well, size = #concurrent ops » Queue: triggering of async actions » e.g. (re)index (updated) record with SOLR back-end » size depends on speed of back-end process » will be useful/made available outside of Lily context as well!
  44. The Sum » Lily model (records & fields) » mapped onto HBase (=storage) » indexed and searchable through SOLR » using a WAL/Queue mechanism implemented in HBase » runtime based on Kauri » with client/server comms via Avro
  45. Architecture
  46. Architecture
  47. Roadmap » Today = release of learning material (architecture, model, API, Javadoc) ➥ www.lilycms.org ➥ bit.ly/lilyprerelease » Mid July = ‘proof of architecture’ release (Nearly there!) » from there on, ca. 3-monthly releases leading up to Lily 1.0
  48. bit.ly/lilyprerelease
  49. License » Apache
  50. Business model » Consulting, mentoring, turn-key projects » Strong focus on partner relations » targeting vertical markets » geographic coverage » SaaS offerings » Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data » Not: OLAP
  51. More? » @outerthought » www.lilycms.org/lily/prerelease.html
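
The HBase slide mentions column families that keep all versions next to families that do not. A minimal sketch of that idea follows, using a plain in-memory structure rather than the HBase client API. The class name `ColumnFamily`, the `max_versions` parameter, and the example field names are all hypothetical and only serve the illustration.

```python
class ColumnFamily:
    """In-memory stand-in for one column family (hypothetical, not the HBase API)."""

    def __init__(self, max_versions=None):
        # None means "keep every version"; otherwise keep only the newest N.
        assert max_versions is None or max_versions >= 1
        self.max_versions = max_versions
        self._cells = {}  # (row, column) -> list of (version, value), oldest first

    def put(self, row, column, version, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((version, value))
        versions.sort(key=lambda cell: cell[0])
        if self.max_versions is not None:
            # Drop everything except the newest max_versions entries.
            del versions[:-self.max_versions]

    def get(self, row, column, version=None):
        """Return the newest value at or below `version` (the latest if None)."""
        for cell_version, value in reversed(self._cells.get((row, column), [])):
            if version is None or cell_version <= version:
                return value
        return None


# A family that keeps history and one that only keeps the latest value.
versioned = ColumnFamily()
non_versioned = ColumnFamily(max_versions=1)

versioned.put("doc1", "title", 1, "Draft")
versioned.put("doc1", "title", 2, "Final")
non_versioned.put("doc1", "owner", 1, "alice")
non_versioned.put("doc1", "owner", 2, "bob")
```

In this sketch an old document version stays readable through `versioned.get("doc1", "title", version=1)`, while the non-versioned family only ever answers with the latest value.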
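
The Structured Search slide shows an index table whose row keys combine the indexed value with the content row key (val2-B, val3-A), so that a range scan over the ordered index answers a value query. The sketch below imitates that with a sorted in-memory table. `OrderedTable`, `put_record`, `find_by_col1_range`, and the "-" separator are assumptions made for the example, not the HBase Indexing Library API.

```python
import bisect


class OrderedTable:
    """Tiny in-memory stand-in for a table ordered by row key (hypothetical)."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> dict of columns

    def put(self, rowkey, columns):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey].update(columns)

    def scan(self, start, stop):
        """Yield (rowkey, columns) for start <= rowkey < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


content = OrderedTable()
index = OrderedTable()


def put_record(rowkey, col1, col2):
    content.put(rowkey, {"col1": col1, "col2": col2})
    # Index row key = indexed value + separator + content row key.
    # A real index needs an unambiguous encoding and must also handle
    # updates and deletes; this sketch ignores both.
    index.put(f"{col1}-{rowkey}", {"target": rowkey})


def find_by_col1_range(low, high):
    """Content row keys with low <= col1 < high, ordered by col1 value."""
    return [cols["target"] for _, cols in index.scan(f"{low}-", f"{high}-")]


# The two rows from the slide.
put_record("A", "val3", "foo6")
put_record("B", "val2", "foo7")
```

Because the index is ordered by value, the query never touches the content table until it knows which rows match, which is what makes this kind of index scale with the table.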
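
The WAL / Queue slide separates synchronous secondary actions, which must finish before the call returns, from asynchronous actions that are only triggered. One way to picture that split is sketched below. This is an illustration under stated assumptions, not Lily's implementation: the class `SketchRepository`, its fields, and the `crash_after_primary` switch are hypothetical, and everything lives in memory instead of HBase tables.

```python
from collections import deque


class SketchRepository:
    """Hypothetical in-memory model of a WAL plus an async queue."""

    def __init__(self):
        self.records = {}     # primary storage: record id -> value
        self.by_value = {}    # secondary index: value -> set of record ids
        self.wal = {}         # op id -> (record id, old value, new value)
        self.queue = deque()  # record ids waiting for async indexing
        self._next_op = 0

    def update(self, record_id, value, crash_after_primary=False):
        op_id = self._next_op
        self._next_op += 1
        old = self.records.get(record_id)
        self.wal[op_id] = (record_id, old, value)  # 1. log the intent
        self.records[record_id] = value            # 2. primary write
        if crash_after_primary:
            raise RuntimeError("simulated crash before secondary action")
        self._finish(op_id)                        # 3. synchronous part

    def _finish(self, op_id):
        record_id, old, new = self.wal[op_id]
        # Secondary index update; safe to repeat when replayed.
        if old is not None:
            self.by_value.get(old, set()).discard(record_id)
        self.by_value.setdefault(new, set()).add(record_id)
        self.queue.append(record_id)  # trigger the async action
        del self.wal[op_id]           # the WAL entry is done

    def recover(self):
        """Replay WAL entries that a failed call left behind."""
        for op_id in sorted(self.wal):
            record_id, _, new = self.wal[op_id]
            self.records[record_id] = new
            self._finish(op_id)

    def drain_queue(self, index_fn):
        """Hand queued record ids to an async back-end, e.g. a search indexer."""
        while self.queue:
            index_fn(self.queue.popleft())
```

In this model the WAL only holds operations that are in flight, so when nothing fails its size tracks the number of concurrent operations, while the queue grows or shrinks with the speed of whatever `drain_queue` feeds.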
