Apereo OAE - Bootcamp


The Apereo Open Academic Environment is a platform that focusses on group collaboration between researchers, students and lecturers, and strongly embraces openness, creation, re-use, re-mixing and discovery of content, people and groups.

How does Apereo OAE work? OAE targets a large scale and a multi-tenant cloud-compatible deployment model, where a single installation can host multiple institutions at the same time.

This presentation provides a very detailed overview of the overall architecture and the different components and technologies. We will take a closer look into all of the following components and how they are being used:

- Node.js
- OAE Widgets
- Apache Cassandra
- ElasticSearch
- Redis
- Nginx

We also talk about the approach used for continuous nightly performance testing and how we are validating the desired (horizontal) scalability. Details around back-end and UI unit testing, code coverage and security testing are shared as well.

Published in: Education, Technology

  1. Apereo OAE Bootcamp, San Diego 2013
     Wednesday, 12 June 13
  2. Topics
     1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
     Or something else?
  3. “Supporting academic collaboration”
  4. (no text)
  5. Project goals
     • Multi-tenant platform
     • Cloud-ready
     • SaaS
     • Used at large scale
  6. Project goals
     • Maintainable
     • Extendable
     • Integrate-able
  7. Solid foundation
     Modern, not exotic
  8. July 1, 2013
     1st production release
  9. Multi-tenancy
  10. Multi-tenancy
  11. Multi-tenancy
  12. Multi-tenancy
  13. Multi-tenancy
      • Market is heading
      • Support multiple institutions at same time
      • Multi-tenancy+
      • Easily created, maintained and configured
  14. Performance!
      • Ability to scale horizontally
      • Evidence based
      • Continuous
  15. Topics
      1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
  16. OAE Architecture
      The Apereo OAE project is made up of 2 distinct source code platforms:
      • “Hilary”
        • Server-side RESTful web platform that exposes the OAE services
        • Written entirely using server-side JavaScript in Node.js
      • “3akai-ux”
        • A client-side / browser platform that provides the HTML, JavaScript and CSS that make up the browser UI of the application
  17. OAE Architecture
  18. Hilary System Architecture
  19. Application Servers
      • Written in server-side JavaScript, run in Node.js
      • Node.js used by: eBay, LinkedIn, Storify, Trello
      • Light-weight, single-threaded, event-driven platform that processes IO asynchronously (non-blocking)
      • Uses callbacks and an event queue to stash work to be done after IO or other heavy processes complete
      • App servers can be configured into functional specializations:
        • User Request Processor
        • Activity Processor
        • Search Indexer
        • Preview Processor
  20. Apache Cassandra
      • Authoritative data source
      • Provides high-availability and fault-tolerance without trading away performance
      • Regarding the CAP theorem, Cassandra favours Availability and Partition Tolerance over Consistency; however, consistency is tunable at the query level (we almost always use “quorum”)
      • Uses a ring topology to shard data across nodes, with configurable replication levels
      • No RDBMS?
        • Cassandra gives more flexibility with incremental scalability in a cloud environment
        • Flexible scaling helps to overcome unpredictable growth of multi-tenant systems
        • Medium-to-long term options for replicating data to multiple data-centers for localizing both reads and writes
      • Used by: Netflix, eBay, Twitter
  21. ElasticSearch
      • Lucene-backed search platform
      • Built for masterless incremental scaling and high-availability
      • Powers Hilary search, including library, related content, group members and memberships
      • Exposes HTTP RESTful APIs for indexing and querying documents
      • RESTful query interface uses JSON-based Query DSL
      • Used by: GitHub, FourSquare, StackOverflow, WordPress
  22. RabbitMQ
      • Message queue platform written in Erlang
      • Used for distributing tasks to specialized application server instances
      • Supports active-active queue mirroring for high availability
      • Used by: Joyent
  23. Redis
      • Serves a variety of functions:
        • Broadcast messaging (can move to RabbitMQ)
        • Locking
        • Caching of basic user profiles
        • Holds volatile activity aggregation data
      • Comes with no managed clustering solution (yet), but has slave replication for active fail-over
      • Some clients manage master-slave switching and distributed reads for you
      • Used by: Twitter, Instagram, StackOverflow, Flickr
  24. Etherpad
      • Open Source collaborative editing application written in Node.js
      • Originally developed by Google and Mozilla
      • Licensed under Apache License v2
      • Powers collaborative document editing in OAE
  25. Nginx
      • HTTP and reverse-proxy server
      • Used to distribute load to application servers and etherpad servers, and to stream file downloads
      • Useful rate-limiting features based on source IP
      • Used by: Netflix, WordPress.com
  26. Topics
      1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
  27. Clustering Cassandra
      • Cassandra uses a partitioned ring topology to distribute data
      • Nodes are given a numeric token which determines their “location” in the ring
      • When data rows are read from / written to Cassandra, the row key is hashed using a “Partitioner”, which determines what node holds the row’s data
      • “Replication Strategy” is used to determine which nodes will hold replicas of which rows
        • E.g. “Simple Strategy” will use the N - 1 nodes clockwise around the ring from the primary node
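
The ring placement described on this slide can be sketched in a few lines of Node.js. This is only an illustration of the idea, not Cassandra's implementation: the 32-bit token space, the MD5-based `tokenFor` helper and the node names `db0` to `db2` are assumptions made for the example.

```javascript
const crypto = require('crypto');

// Hash a row key to a numeric token in [0, 2^32).
// Assumption: real partitioners use a far larger token space.
function tokenFor(rowKey) {
  return crypto.createHash('md5').update(rowKey).digest().readUInt32BE(0);
}

// nodes: [{ name, token }]. Returns the primary node followed by the next
// (replicationFactor - 1) nodes clockwise around the ring.
function replicasFor(rowKey, nodes, replicationFactor) {
  const ring = [...nodes].sort((a, b) => a.token - b.token);
  const token = tokenFor(rowKey);
  // The primary is the first node whose token is >= the key's token,
  // wrapping around to the start of the ring when there is none.
  let primary = ring.findIndex((node) => node.token >= token);
  if (primary === -1) primary = 0;
  const count = Math.min(replicationFactor, ring.length);
  const replicas = [];
  for (let i = 0; i < count; i++) {
    replicas.push(ring[(primary + i) % ring.length].name);
  }
  return replicas;
}

// Hypothetical three-node ring with evenly spaced tokens.
const nodes = [
  { name: 'db0', token: 0 },
  { name: 'db1', token: 1431655765 },
  { name: 'db2', token: 2863311530 }
];

console.log(replicasFor('u:cam:bert', nodes, 2));
```

With a replication factor of 2, each row lands on its primary node and on the next node clockwise, which is the "N - 1 nodes clockwise" rule from the slide.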
  28. Clustering Cassandra (cont’d)
  29. Clustering Cassandra (cont’d)
      • Query consistency specified at request time
        • ALL - All nodes must respond successfully
        • LOCAL_QUORUM - (RF/2 + 1) nodes (in the datacenter) must respond successfully
        • EACH_QUORUM - (RF/2 + 1) nodes (in all datacenters) must respond successfully
        • ONE - Only one node must respond successfully
      • Therefore, if you write with QUORUM then read with QUORUM, then results should always be consistent
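
The reason a QUORUM write followed by a QUORUM read is consistent comes down to simple arithmetic: the set of replicas written and the set of replicas read must share at least one node. A small sketch, with function names chosen for this example only:

```javascript
// Replicas that must respond for a quorum: floor(RF / 2) + 1.
function quorum(replicationFactor) {
  return Math.floor(replicationFactor / 2) + 1;
}

// A read is guaranteed to touch at least one replica that holds the latest
// write when the two sets must overlap, i.e. when W + R > RF.
function readsSeeLatestWrite(writeReplicas, readReplicas, replicationFactor) {
  return writeReplicas + readReplicas > replicationFactor;
}

const rf = 3;
console.log(quorum(rf));                                       // 2
console.log(readsSeeLatestWrite(quorum(rf), quorum(rf), rf));  // true  (2 + 2 > 3)
console.log(readsSeeLatestWrite(1, 1, rf));                    // false (ONE + ONE)
```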
  30. Clustering ElasticSearch
      • ElasticSearch shards data into a configurable number of shards
      • “Number of Replicas” can be configured at runtime, which determines how many replicas of each shard should exist
      • Shard replicas are distributed among the nodes in the cluster
      • Shard is identified by hash of the document id
  31. Clustering ElasticSearch (cont’d)
  32. Clustering Etherpad
      • Short and sweet: It doesn’t really cluster
      • Data is stored in Cassandra, but active sessions must all share the same etherpad server
      • Configure number of etherpad servers and their hosts in Hilary, and configure Nginx to proxy to the appropriate server
      • Server is selected based on a numeric hash of the content item id
      • No high availability. If an etherpad server goes down, those sessions are disconnected :( But etherpad content is retained
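
A minimal sketch of picking an etherpad server from a numeric hash of the content item id. The host names, the example content id and the choice of SHA-1 are assumptions for illustration; the slides do not show which hash Hilary actually uses.

```javascript
const crypto = require('crypto');

// Hypothetical host list; in OAE this would come from Hilary's configuration.
const etherpadHosts = [
  'etherpad0.example.org',
  'etherpad1.example.org',
  'etherpad2.example.org'
];

// Map a content item id to exactly one etherpad server, so that every
// session editing the same document ends up on the same host.
function etherpadHostFor(contentId, hosts) {
  const hash = crypto.createHash('sha1').update(contentId).digest().readUInt32BE(0);
  return hosts[hash % hosts.length];
}

console.log(etherpadHostFor('c:cam:some-document', etherpadHosts));
```

Because the mapping depends on the number of hosts, changing that number can move a document to a different server, which fits the slide's point that this is sharding rather than true clustering.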
  33. Clustering Etherpad (cont’d)
  34. Clustering RabbitMQ
      • Uses a Master-Slave Active/Active queue mirroring policy for redundancy
      • We set a policy of ha-mode=all on all OAE queues
        • Ensures messages are replicated to all queues
        • Ensures all subscribed consumers receive messages
      • Since all nodes are active peers, when a node fails, the consumer simply reconnects
  35. Clustering Search Indexers
      • Search Indexers are regular application nodes configured to consume search index tasks
        • INDEX_UPDATE, INDEX_DELETE
      • Offloads fetching and processing of indexing tasks to nodes that don’t impact request latency
  36. Clustering Activity Processors
      • Regular application nodes that
        • Receive and route activity tasks
        • Collect and aggregate routed activities
      • May be configured with dedicated Redis server for aggregation
      • Aggregation is the process of deeming 2 or more activities “similar” and grouping them into a single activity
      • Maintains temporary information in Redis to keep track of what activities have occurred recently and aggregate them on-the-fly
      • Due to concurrency issues when aggregating new activities into feeds, routed activities are sharded into “concurrency buckets”
        • Avoids duplicate activities in streams, while not completely serializing the aggregation process
      • Bucket is selected based on a hash of
        • Activity Stream ID (e.g., user or group id)
        • Activity Type (e.g., share content, add to group, etc...)
  37. Clustering Activity Processors (cont’d)
      • Activity0 and Activity1 were serialized into bucket 0 to avoid concurrency collisions in “user A”s stream
      • A bucket is only collected by one activity processor at a time
      • An activity processor can concurrently collect multiple buckets (max concurrent buckets is configurable)
      • Number of buckets is configurable (is 3 in this example)
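
The concurrency buckets can be pictured as follows. This is a sketch under assumptions: the string hash, the `#` separator and the field names `streamId` and `activityType` are invented for the example and are not taken from Hilary.

```javascript
// Simple deterministic 32-bit string hash (illustrative stand-in).
function hashString(value) {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// The bucket depends only on the activity stream id and the activity type.
function bucketFor(streamId, activityType, numberOfBuckets) {
  return hashString(streamId + '#' + activityType) % numberOfBuckets;
}

// Group routed activities by bucket. Each bucket is collected by one
// processor at a time, so the activities inside it are handled serially.
function groupByBucket(routedActivities, numberOfBuckets) {
  const buckets = Array.from({ length: numberOfBuckets }, () => []);
  for (const routed of routedActivities) {
    const index = bucketFor(routed.streamId, routed.activityType, numberOfBuckets);
    buckets[index].push(routed);
  }
  return buckets;
}

const routed = [
  { id: 'Activity0', streamId: 'u:cam:userA', activityType: 'content-share' },
  { id: 'Activity1', streamId: 'u:cam:userA', activityType: 'content-share' },
  { id: 'Activity2', streamId: 'u:cam:userB', activityType: 'content-comment' }
];
console.log(groupByBucket(routed, 3));
```

Two activities of the same type routed to the same stream always share a bucket, so they can never be aggregated concurrently by two different processors.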
  38. Clustering Preview Processor
      • Regular application node that is specialized to handle preview processing tasks
        • GENERATE_PREVIEWS
      • Offloads the CPU and memory-intensive process of generating previews to machines that don’t impact user request latency
  39. Topics
      1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
      Still with us?
  40. Hilary design and extension points
      • Common patterns
      • Search
      • Activities
      • File storage
      • Preview Processor
  41. Search producers / transformers
      • Search producers
        • Generate documents that need to go in the index
      • Search transformers
        • Transform query results coming back from ES into something the UI can use
  42. Search Producers
      • Produce documents that can be indexed/stored by ElasticSearch
        • Simple JSON document
        • A search document contains the full profile (i.e., no data is hidden)
      • Run on separate Search Indexing servers
  43. Search Producers - workflow
  44. Search Transformers
      • Transform an ElasticSearch result into something the UI can use
        • Hide sensitive user data (if necessary)
        • Add thumbnail URLs
      • Run on the application servers
  45. (no text)
  46. Custom search queries
      • Exposed as a REST API
      • Ex:
        • /api/search/general
        • /api/search/content-library
        • /api/search/<search name>
  47. Custom search queries
      // GET http://cam.oae.com/api/search/custom-foo?q=bar
      var SearchAPI = require('oae-search');
      SearchAPI.registerSearch('custom-foo', function(ctx, opts, callback) {
          // The query you need to write
          // opts.q = "bar"
          var query = …
          // Access-scope the results
          var filter = ..
          callback(null, SearchUtil.createQuery(query, filter, opts));
      });
  48. Activities
      • Follows the activitystrea.ms spec
      • Each activity has:
        • an object (content item “presentation.ppt”)
        • an actor (Branden Visser)
        • a target (Bert Pareyn)
        • a verb (to share)
      • End up in an activity stream
      • Generated by separate activity servers
  49. Activities
      • Activity Seeds
      • Activity Producers
      • Activity Routers
      • Activity Transformers
  50. Activities and notifications
      • Notifications are “special” activities that were routed to a separate activity stream
      • E-mails can be sent out for notifications
      • Piggy-backing notifications on activities gives you free aggregation
  51. ActivityAPI.registerActivityType('content-comment', {
          groupBy: [{target: true}],
          notifications: {
              email: true,
              emailTemplateModule: 'oae-content',
              emailTemplateId: 'notify-content-comment'
          }
      });
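
One way to read `groupBy: [{target: true}]` is that activities of the same type with the same target collapse into a single aggregated activity. The sketch below illustrates that reading only; `aggregateKey`, `aggregate` and the flat activity shape are hypothetical and do not reproduce Hilary's aggregator.

```javascript
// Build an aggregation key from the activity type plus the entity
// fields that the groupBy rule marks as shared.
function aggregateKey(activity, groupBy) {
  const parts = [activity.activityType];
  for (const field of ['actor', 'object', 'target']) {
    if (groupBy[field]) parts.push(field + '=' + activity[field]);
  }
  return parts.join('|');
}

// Collapse activities that share a key into one aggregate that
// collects the distinct actors, objects and targets.
function aggregate(activities, groupBy) {
  const aggregates = new Map();
  for (const activity of activities) {
    const key = aggregateKey(activity, groupBy);
    if (!aggregates.has(key)) {
      aggregates.set(key, {
        activityType: activity.activityType,
        actors: new Set(),
        objects: new Set(),
        targets: new Set()
      });
    }
    const entry = aggregates.get(key);
    entry.actors.add(activity.actor);
    entry.objects.add(activity.object);
    entry.targets.add(activity.target);
  }
  return [...aggregates.values()];
}

// Two users comment on the same content item, a third on another one.
const comments = [
  { activityType: 'content-comment', actor: 'u:cam:branden', object: 'comment-1', target: 'c:cam:doc1' },
  { activityType: 'content-comment', actor: 'u:cam:bert', object: 'comment-2', target: 'c:cam:doc1' },
  { activityType: 'content-comment', actor: 'u:cam:simon', object: 'comment-3', target: 'c:cam:doc2' }
];
console.log(aggregate(comments, { target: true }));
```

Under this reading, the two comments on `c:cam:doc1` become a single "Branden and Bert commented" style activity, and the same grouping is what a notification e-mail would inherit.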
  52. Activity seeds
      • When something happens an activity seed is created and sent out to RabbitMQ
      • Contains the data for the Activity Servers to produce the persisted activities and generate the routes
  53. ContentAPI.on('content-comment', function(ctx, comment) {
          // Create the actor, object and target objects for the activity
          ..
          // Construct the activity seed
          ..
          // Submit to RabbitMQ
          ActivityAPI.postActivity(ctx, activitySeed);
      });
      Local events get offloaded to RabbitMQ
  54. Activity Producers
      • Produce the persisted entity that should be stored for each activity
      • Each entity should hold all the data necessary for producing routes and transforming into UI-friendly data
      • Should try to be compact, as activities will be de-normalized and an entity will be saved per stream (each user has at least 2 streams, so this is a lot of data)
      • Produced on separate activity servers
  55. Activities
      // Persisted activity entity
      {
          "oae:activityType": "content-share",       // Required
          "published": "2011-02-10T15:04:55Z",       // Required
          "verb": "share",                           // Required
          "actor": { <ProducedEntity> },
          "object": { <ProducedEntity> },
          "target": { <ProducedEntity> }
      }
  56. // Transformed activity entity
      {
          "oae:activityType": "content-share",
          "published": "2011-02-10T15:04:55Z",
          "verb": "share",
          "actor": {
              "objectType": "user",
              "id": "http://my.oae.org/api/user/u:oae:mrvisser",
              "displayName": "Branden Visser",
              "url": "http://some.oae.org/~u:oae:mrvisser",
              "image": { .. }
          },
          "object": {
              "objectType": "content",
              "oae:contentType": "file",
              "oae:mimeType": "image/png",
              "id": "http://my.oae.org/content/contentId",
              "url": "http://my.oae.org/content/contentId",
              "displayName": "Super cool image",
              "image": { .. }
          },
          "target": {
              "objectType": "user",
              "id": "http://my.oae.org/user/u:cam:bert",
              "url": "http://my.oae.org/~u:cam:bert",
              "image": { .. }
          }
      }
  57. Activity routes
      • Activities can be routed to “activity streams”
        • Each user and group has an activity stream, users also have a notification stream
      • Routing is the process of taking an activity and determining who should receive the activity
      • A route is a simple string with the ID of the principal to which the activity should be delivered
      • Routed on separate activity servers
  58. Activity Routes - Propagation
      • Permissions/privacy in activities
      • Possible values:
        • ANY: The produced entity data can be routed to any route, i.e., the entity is public or logged in
        • ROUTES: The entity can only be propagated to the activity routes specified by the router
        • SPECIFY: Specify additional routes the entity should be routed to
      • ex: Branden adds Bert to the Private group “OAE-Team”
        • Actor: Branden, Object: “OAE-Team”-group, Target: Bert
        • The default routing would generate an activity for all the managers of the OAE-Team group, so a route needs to be added for Bert as well
  59. ActivityAPI.registerActivityRouter('content-comment', function(...) {
          // Generate routes for all the content managers/viewers
          ...
          // Generate routes for all the recent contributors
          ...
          // return the generated routes
      });
  60. Activity Transformers
      • Transform persisted activities into activitystrea.ms compliant results
      • Add OAE specific data that can be consumed by the UI
        • ex: More images
  61. ActivityAPI.registerActivityEntityTransformer('content-comment', function(ctx, entities) {
          // Add thumbnail URLs
          ...
          // Add replies to comments
          // (This data is available in the entity itself, it just needs to be cleaned up.)
          ..
          // return
      });
  62. File storage
      • New file storage backends can be plugged in
      • Available storage backends:
        • Local disk storage / Mounted NFS
        • Amazon S3
      • Try not to serve actual file bodies with Hilary
  63. Preview Processor
      • Generates thumbnail and large readable preview images of content in the system
      • Uses the REST API to interact with OAE
        • Isolation from the application server
      • Informed of new content to process by RabbitMQ
        • Allows for multiple processors, prevents dropped messages
  64. Preview Processor
      • Existing processors for:
        • Collabdocs - Uses webshot to take a browser screenshot of the etherpad
        • Images - GraphicsMagick used to make various sized thumbnails
        • Office Docs and PDFs - Individual page snapshots using LibreOffice and GraphicsMagick, so the whole document may be viewed in the browser
        • Arbitrary links
          • Flickr, SlideShare, Vimeo, YouTube: Special handling for REST APIs to fetch display name, description and preview images directly
          • Other links: Uses webshot to take a screenshot of links for which it doesn’t have a specific handler
      • Creating custom processors
        • New processors can be added in new NPM modules
        • Uses a registration pattern to hook in to the Preview Processor
        • Flexible meta-data to back custom widgets
  65. Storage interface
      • get
      • store
      • remove
      • getDownloadLink
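
To make the four operations concrete, here is a toy in-memory backend. The slide lists only the method names, so the argument lists, return values and the `/api/download/` path are all assumptions; a real backend would also be asynchronous rather than returning values directly.

```javascript
// Hypothetical in-memory storage backend exposing the four operations
// named on the slide. Signatures are invented for illustration.
class InMemoryStorage {
  constructor() {
    this.files = new Map();
  }

  // Save a file body under a storage URI and return that URI.
  store(uri, body) {
    this.files.set(uri, Buffer.from(body));
    return uri;
  }

  // Return the stored body, or null when the URI is unknown.
  get(uri) {
    return this.files.has(uri) ? this.files.get(uri) : null;
  }

  // Delete a file; returns true when something was removed.
  remove(uri) {
    return this.files.delete(uri);
  }

  // Return a link that another component could serve, instead of
  // streaming the body through the application server.
  getDownloadLink(uri) {
    return '/api/download/' + encodeURIComponent(uri);
  }
}

const storage = new InMemoryStorage();
storage.store('local:c/cam/doc1', 'hello');
console.log(storage.get('local:c/cam/doc1').toString());
console.log(storage.getDownloadLink('local:c/cam/doc1'));
```

Keeping `getDownloadLink` separate from `get` is what lets a backend hand file delivery to something else, in line with the earlier advice not to serve file bodies from Hilary.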
  66. Topics
      1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
  67. Performance testing
      • Workflow
      • Model loader
      • Tsung tests
      • Analysis
      • Cases
  68. Workflow
      1. Generate data
      2. Load data into the system
      3. Tsung tests
      4. Circonus
      5. Analysis
  69. Workflow
      • Establish a baseline by cycling through testing, analysis and improvement
      • Iterate-and-improve
      • Try and maintain an acceptable baseline when adding new features
  70. OAE model loader
      • NodeJS tool that generates and loads data
      • Tries to reflect real life scenarios
        • ex: 30% of the members of a group should be managers
  71. Model loader - Generation
      • Generates re-runnable JSON files that define data to be loaded into the system
      • All data is based on predefined distributions and can be tweaked
      • Supports:
        • Users
        • Groups
        • Content
          • Links / Collaborative documents / Files
        • Discussions
  72. Model loader - loading
      • Loads the data into the system
      • Writes the generated IDs to disk so they can be re-used in the Tsung tests
  73. Tsung
      • Tsung is an Erlang distributed load-testing tool
      • Able to simulate thousands of concurrent users
      • Used to stress test the application
  74. Tsung
      • Takes an XML file that defines the HTTP requests and fires them off at the server
      • Sessions
        • Each session has a probability of execution
        • Contains: Transactions, dynamic variables
      • Transactions
        • Contains: Requests, Thinktime
      • Requests
  75. <session name="general_interest_term_anon" probability="17" type="ts_http">
          <setdynvars sourcetype="file" fileid="users.csv" delimiter=";" order="random">
              <var name="users_id" />
              <var name="users_username" />
              <var name="users_password" />
          </setdynvars>
          <transaction name="tx_login">
              <request subst="true">
                  <dyn_variable name="loggedin_user_id" jsonpath="$.id"/>
                  <http url="/api/auth/login" method="POST" version="1.1" contents="username=%%_users_username%%&amp;password=%%_users_password%%">
                      <http_header name="Referer" value="/" />
                  </http>
              </request>
          </transaction>
          <thinktime value="4" random="true"/>
          <transaction name="tx_general_search_search">
              <request subst="true">
                  <http url="/api/search/general/all?q=%%_search_term_30%%&amp;from=10&amp;size=10" method="GET" version="1.1" ></http>
              </request>
          </transaction>
          ...
      </session>
  76. node-oae-tsung
      • Tsung’s XML syntax is hard and boring, and the XML files can get huge
        • => Automate it
      • NodeJS tool to generate the XML file
      • Uses the generated IDs from the model loader data load
      • Contains 3 layers:
        • API
        • Tests
        • Suites
  77. node-oae-tsung API
      • Each method in the Tsung API represents a high-level user action against your UI
      • Performs the REST API requests that would get called when a user performs that “click” / action / page visit
      • Encapsulates those requests into a “transaction”
      • ex: Loading the group members page would do a request to:
        • GET /api/me
        • GET /api/group/<group>/members
  78. node-oae-tsung API
      // “Show the members of a group”-page
      var members = module.exports.members = function(session, group) {
          var tx = session.addTransaction('group_members');
          tx.addRequest('GET', '/api/me');
          tx.addRequest('GET', '/api/group/' + group + '/members');
      };
  79. node-oae-tsung test case
      • Each test case describes a possible user session by executing a number of “transactions” / API methods:
        • User logs in
        • User searches for groups
        • User visits a group
        • Thinks a bit
        • Searches for users to add
        • Adds some users
        • Performs a general search
        • Visits a content item
        • Shares that content item with a group
        • Logs out
      • End result is a Tsung “session” object that contains many transactions and “think times”
  80. module.exports.test = function(runner, probability) {
          probability = probability || 100;
          // Create a new session.
          var session = runner.addSession('add_group_users', probability);
          var user = User.login(session, '%%_group_add_users_manager_username%%', '%%_group_add_users_manager_passwo…
          Group.profile(session, '%%_group_add_users_group_id%%');
          session.think(2);
          // Go to the members list
          Group.members(session, groupId);
          session.think(6);
          // Add 2 users
          var update = {
              '%%_group_add_users_user_0%%': 'member',
              '%%_group_add_users_user_1%%': 'member'
          };
          Group.updateMembers(session, groupId, update);
          session.think(2);
          ...
      }
  81. node-oae-tsung suite
      • Contains a list of test cases / sessions you want to include in your test suite (i.e., Tsung XML file)
      • Has an optional probability option to control the session distribution
      • Standard test suite file:
          general_interest_term_anon,15
          general_interest_term_auth,50
          general_interest_content_auth,5
          general_interest_group_auth,7
          general_interest_user_auth,10
          private_groups_interest,40
          study_group_content,40
          edit_content,40
          edit_group,15
          add_content_users,10
          add_content_groups,5
          add_group_users,10
          add_group_groups,5
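
The numbers in the standard suite file add up to 252 rather than 100, so they read more like relative weights than percentages. How node-oae-tsung turns them into Tsung session probabilities is not shown in the slides; the sketch below is one plausible approach, with `parseSuite` and `toPercentages` as hypothetical helpers and a default weight of 100 borrowed from the `probability || 100` line in the test case above.

```javascript
// Parse "name,weight" lines; a missing weight defaults to 100.
function parseSuite(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [name, weight] = line.split(',');
      return { name: name.trim(), weight: weight === undefined ? 100 : Number(weight) };
    });
}

// Scale the weights so that the resulting probabilities sum to 100.
function toPercentages(sessions) {
  const total = sessions.reduce((sum, session) => sum + session.weight, 0);
  return sessions.map((session) => ({
    name: session.name,
    probability: (session.weight / total) * 100
  }));
}

const suite = parseSuite('general_interest_term_anon,15\nedit_group,15\nadd_group_users,10\n');
console.log(toPercentages(suite));
```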
  82. Putting it together
      • Need to create a new mix of sessions and distributions? Just create a new suite file.
      • Need to create new sessions that do different things? Just create new test cases that use the API methods.
      • Did web requests in your application change on certain pages? Just update the requests executed in the API methods, all test cases / suites update with it.
      • New feature in your application? Create the API methods for the new actions, incorporate into test cases.
      • Did your UI get completely overhauled? Oops... remodel your API methods from scratch.
  83. Tsung
      • Take the tsung.xml file and run it
      • Depending on the session lengths, can take a couple of hours
      • Generates graphs
  84. Setup
      • 1 nginx load balancer (0.5GB / 1 CPU)
      • 2 app nodes (0.5GB / 1 CPU)
      • 3 db nodes (8GB / 2 CPU)
      • 1 redis node (0.5GB / 1 CPU)
      • 1 search node (0.5GB / 1 CPU)
  85. Transactions
  86. Request latency
  87. Transactions / sec
  88. Arrival rate of new users
  89. Simultaneous users
  90. HTTP Requests / sec
  91. Circonus Telemetry
      • Circonus is a tool built by OmniTI to gather, graph and analyze data.
      • Allows for push/pull data entry
  92. Circonus Telemetry
      • Latencies
        • HTTP Request
        • Cassandra queries
        • Search queries
        • Permission checks
        • Activity collection / routing / delivery
      • Counts
        • Each API Call (ex: POST /api/user/create)
        • Cassandra queries (READ - WRITE)
        • Error counts
  93. Circonus Telemetry
      • Has multiple graph types (line, bar, histogram)
      • Allows you to overlay multiple data points
  94. Average latency for POST /api/*
  95. Histogram latency - POST /api/*
  96. So, does it scale?
      • Yes. We can scale the application horizontally by adding more nodes
      • Doubling the hardware roughly doubles the throughput
  97. Simultaneous users
  98. Requests / sec
  99. Case: Permissions
      • Identified as a key component that would impact performance the most
      • Permissions are propagated indirectly through group membership
        • Steve has access to ContentA via GroupC
        • Steve has access to GroupA via GroupC and GroupB
  100. Case: Permissions (cont’d)
      Attempt #1: First do a direct association check against the target. If unsuccessful, “explode” and store the denormalized group memberships for the user when the permission check is performed (if necessary). Group membership changes intelligently try to keep the denormalized groups list up to date to avoid invalidation. When doing a permission check, fetch the exploded list of groups and do a select against the target’s members with the indirect list of groups.
      Assumption #1: Most permission checks would be for direct access to resources, so the exploded check would not happen often
      Assumption #2: When needed, fetching the exploded group hierarchy would be quite fast: fetch one row from Cassandra
      Assumption #3: Selecting matches in the direct target members would be quite fast: querying a finite number of columns from one row would be fast
      Result: Baseline test peaked at about 300 requests per second. Scaling up to larger sizes would explode the resource costs unacceptably; try to do better.
  101. Case: Permissions (cont’d)
      What went wrong?
      For starters, D-Trace analysis shows our app servers are spending over 90% of their time serializing / deserializing Thrift bodies from Cassandra
  102. Case: Permissions (cont’d)
      How do we fix it?
      Maybe we’re querying exploded groups more than we assumed, and there is much more data in the exploded group memberships than we assumed.
      Attempt #2: Query the target resource’s direct members. If there are no groups, simply compare the direct members with the source user. If there are groups, explode the groups of the source user (if not already denormalized), and only query the group ids from the exploded memberships that are directly associated with the target resource. If no results, no access. If there are results, there is access.
      Assumption #1: Fetching full “exploded” groups is expensive, avoid it
      Assumption #2: Most permission checks are against resources that don’t have groups assigned as members, and there are generally far fewer groups assigned as members than groups a user will indirectly be a member of
      Result: Same baseline test peaked at 850 requests per second. That’s reasonable!
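
The order of checks in Attempt #2 can be sketched as below. Everything here is illustrative: `hasAccess`, `isGroup`, the `g:` id prefix for groups and the `explodeGroups` callback stand in for Hilary's real Cassandra queries, which the slides do not show.

```javascript
// Assumption: group ids start with "g:" and user ids with "u:".
function isGroup(principalId) {
  return principalId.startsWith('g:');
}

// directMembers: ids directly associated with the target resource.
// explodeGroups(userId): returns a Set of every group the user belongs to,
// directly or indirectly. This is the expensive step we try to avoid.
function hasAccess(userId, directMembers, explodeGroups) {
  // 1. Cheap check: is the user a direct member?
  if (directMembers.includes(userId)) return true;

  // 2. No groups on the resource: nothing can grant indirect access.
  const memberGroups = directMembers.filter(isGroup);
  if (memberGroups.length === 0) return false;

  // 3. Only now explode the user's memberships, and only look at the
  //    groups that are directly associated with the resource.
  const userGroups = explodeGroups(userId);
  return memberGroups.some((groupId) => userGroups.has(groupId));
}

// Steve is a member of GroupC, which is a member of ContentA.
const steveGroups = new Set(['g:cam:groupC']);
const lookup = (userId) => (userId === 'u:cam:steve' ? steveGroups : new Set());
console.log(hasAccess('u:cam:steve', ['u:cam:bert', 'g:cam:groupC'], lookup));
```

The gain comes from step 2: when a resource has no group members, which the slide assumes is the common case, the exploded memberships are never fetched at all.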
  103. Case: Activity
      Activity Aggregation is the process of collecting multiple “similar” activities into a single aggregated activity. In order to accomplish this, lots of “recent activities” need to be pooled together, and consulted for each routed activity.
      Attempt #1: Store the activity history and activity buckets in Cassandra
      Result: With 3 Cassandra nodes, it breaks down at a routed activity throughput of 135 per second during data-load. Equivalent to roughly 1.3 activities per second in our tests. Not acceptable.
  104. Case: Activity (cont’d)
      Problem #1: Cassandra latency reports in DataStax OpsCenter show Cassandra nodes surpassing 5000 requests per second, latency climbing to over 10 seconds and load on the Cassandra servers pushing 7.
      Solution #1: Move aggregation into Redis. The data is volatile and gets evicted over time anyway. Avoids cluster co-ordination gossip and disk I/O for a large part of the activity load.
      Problem #2: Initial tests showed the memory footprint easily skyrocketed over 8GB. Not acceptable.
      Solution #2: Normalize the aggregated entities in Redis rather than duplicate them for each activity aggregation entry.
      Result: With 3 Cassandra nodes and 3 activity servers, collection happens at ~1500 routed activities per second (translates to 15 activities per second). With 3 Cassandra nodes and 6 activity servers, collection was occurring at 2500 routed activities per second, which is as fast as the mass data-load was creating them. Memory footprint in all tests remains less than 4GB for the entire duration. Cassandra remains stable in all tests. Good to go!
  105. Topics
      1. Project Goals; 2. Hilary System Architecture; 3. Clustering; 4. Hilary Design and Extension Patterns; 5. Performance Testing; 6. Deployment and Automation; 7. UI Architecture; 8. Customization and Configuration; 9. Part 2: Hands on example
      Someone poke the guy sleeping in the back.
  106. Deployment and Automation
      • As you can imagine, many machines to manage. Current inventory:
        • 3x Cassandra
        • 2x Redis
        • 2x RabbitMQ
        • 4x Application + Indexer
        • 3x Preview Processor
        • 1x Activity Processor
        • 1x Nginx
        • 3x Etherpad
      • Performance testing with a cluster of 21 virtual machines
      • Additional scalability testing and verification with ~30 virtual machines
  107. Puppet
      • Use puppet to centralize machine configuration and prevent configuration drift
      • Collection of “Manifests” that define the state that the machine should be in based on its hostname / role:
        • What files should exist? What should their contents be?
        • What packages should be installed?
        • What services should be running, or stopped?
      • http://github.com/sakaiproject/puppet-hilary
      • All 20+ machines in the cluster have Puppet installed, and ask for “catalog” info (expected configuration state) from a single puppet master machine
      • Puppet Master knows how to determine the machine state from the manifests based on its host (e.g., db0 is a cassandra node, it should have cassandra, java, etc...)
      • Use puppetdb with “External Resources” to share machine-specific information with each other node in the cluster
  108. Hiera
      • Serves configuration information for puppet
      • JSON-based data-format, which can be inherited in a flexible manner
      • Keeps large complex configuration data as clean as possible
      {
          "classes": [
              "::oaeservice::hosts",
              "::oaeservice::firewall::open",
              "::oaeservice::rsyslog"
          ],
          "nodetype": "%{nodetype}",
          "nodesuffix": "%{nodesuffix}",
          "web_domain": "oae-performance.oaeproject.org"
  109. 109. MCollective• Provides parallel execution over a number of machines atone time• Start / Stop / Check status of services• Install / Remove / Check version of packages• Use puppet resource syntax to check adhoc machinefacts• Apply puppet manifests• Each cluster node subscribes to an ActiveMQ server toreceive commands. Central machine (the “client”) publishesthe command and waits for replyWednesday, 12 June 13
  110. 110. Slapchop • Missing piece: we need to create 21 machines of different specs in a cloud service, and somehow get MCollective on them • A tool we lovingly call slapchop • Define a JSON manifest that holds machine configs and instances • Run slapchop to create the machines in the Joyent cloud, start them, and get MCollective installed • Well, kind of... • Now you can log in to the MCollective client and run mco puppet apply • Well, kind of... • Go from an empty cloud to a working 21-machine cluster in ~15 minutes
  111. 111. Nagios • Provides monitoring and alerts • Cassandra health • Disk space • ElasticSearch JVM stats (memory usage, garbage collection) • Application server health • OS memory usage • Nginx health • RabbitMQ queue sizes • RabbitMQ health • Redis health • Nagios NRPE scripts deployed with Puppet
  112. 112. Nagios (cont’d)
  113. 113. Munin • We are using Munin for time-series OS resource statistics • Disk, network, Redis, load, memory • Deployed automatically through Puppet manifests on all nodes
  114. 114. Munin (cont’d)
  115. 115. Munin (cont’d)
  116. 116. Security • Machines deployed in a private VLAN; private interfaces are isolated • Public interface firewall completely closed on all nodes • Single bastion node with public SSH enabled, key-only authentication • Single Nginx node with public Web ports enabled • TCP SYN cookies enabled on publicly exposed machines to prevent SYN floods • Rate limiting of 50 req/s per source IP applied to Web API requests • Several XSS tests run, and all issues followed up • Using the OWASP jQuery plugin to XSS-filter user-created data • All infrastructure security deployed automatically with Puppet • Penetration testing performed by University of Mercia • UI vulnerability testing performed by SCIRT group
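The per-IP limit of 50 requests per second can be pictured with the following JavaScript sketch of a fixed-window counter. The slides do not say where or how the limit is enforced in the deployment, so this is only an illustration of the policy; `createRateLimiter` and the example addresses are hypothetical.

```javascript
// Fixed-window limiter: at most `limit` requests per source IP per window.
// `now` is passed in explicitly (milliseconds) to keep the sketch testable.
function createRateLimiter(limit, windowMs) {
  const windows = new Map(); // ip -> { start, count }
  return function allow(ip, now) {
    const current = windows.get(ip);
    if (!current || now - current.start >= windowMs) {
      // First request from this IP, or the previous window has expired.
      windows.set(ip, { start: now, count: 1 });
      return true;
    }
    current.count += 1;
    return current.count <= limit;
  };
}

const allow = createRateLimiter(50, 1000);
let accepted = 0;
for (let i = 0; i < 60; i++) {
  // 60 requests from one address within the same second.
  if (allow('203.0.113.7', 0)) {
    accepted += 1;
  }
}
console.log(accepted + ' of 60 requests accepted');
```

A fixed window is the simplest variant; it allows short bursts at window boundaries, which a sliding window or token bucket would smooth out.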
  117. 117. Topics 1. Project Goals 2. Hilary System Architecture 3. Clustering 4. Hilary Design and Extension Patterns 5. Performance Testing 6. Deployment and Automation 7. UI Architecture 8. Customization and Configuration 9. Part 2: Hands-on example
  118. 118. UI Architecture [Diagram: Hilary, 3akai-ux, Mobile UI, 3rd party integrations]
  119. 119. Core UI Architecture • JS frameworks • CSS framework • 3rd party plugins • OAE UI API • OAE CSS Components
  120. 120. Core frameworks • RequireJS • jQuery • underscore.js
  121. 121. RequireJS • File and module loader • Necessity to keep things modular • Optimisation built-in
  122. 122. RequireJS • Define modules • Load files and modules on the fly
  123. 123. jQuery • DOM manipulation • Cross-browser abstraction • Events • Pretty much everything
  124. 124. underscore.js • Utility toolbelt • Manipulate objects, arrays, etc.
  125. 125. CSS frameworks • Twitter Bootstrap • Font Awesome
  126. 126. Twitter Bootstrap • Re-usable, consistent CSS is hard • Most popular CSS framework • Documentation already there • Basic components, styles, etc. • Override where necessary
  127. 127. Twitter Bootstrap
  128. 128. Font Awesome • Icon font • No more images • Style with CSS • Skinning • Easy
  129. 129. Font Awesome
  130. 130. 3rd party plug-ins • jQuery plug-ins • Bootstrap plug-ins
  131. 131. 3rd party plug-ins • Autosuggest • History.js • Fileupload • Validate • Templates • etc.
  132. 132. OAE UI API • Wrapper for REST requests • Users • Profile • Groups • Content • Discussions • Search • Config
  133. 133. OAE UI API • Utilities • i18n • l10n • Widget loading • Template rendering • Notifications • XSS escaping • etc.
  134. 134. OAE CSS Components • Re-usable HTML fragments • OAE specific elements • Consistency • Design guidelines
  135. 135. Visibility icons: indicate visibility of groups, content, discussions, etc.
  136. 136. Large options
  137. 137. Thumbnails
  138. 138. Clips
  139. 139. Tiles
  140. 140. List items
  141. 141. Toolbox • JS frameworks • CSS framework • 3rd party plugins • OAE UI API • OAE CSS Components
  142. 142. Toolbox • JS frameworks • CSS framework • 3rd party plugins • OAE UI API • OAE CSS Components • WIDGET SDK
  143. 143. Putting it together
  144. 144. Widgets • Modular components • HTML Fragment • JavaScript • CSS • Config file • Loaded into DOM
  145. 145. Namespacing • Widgets share same container • Avoid clashes • Namespace: • HTML IDs • CSS classes • jQuery selectors
  146. 146. Widget JS • Require the APIs the widget needs • Return the function to be executed as the widget
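The shape described here (require the APIs, return a function) might look roughly like the sketch below. Everything in it is an assumption for illustration: the `define` function is a tiny stand-in for RequireJS so that the example runs under plain Node.js, the `oae.core` stub and its `describe` helper are hypothetical, and a real widget would render into the DOM rather than return an object.

```javascript
// Stand-in for an AMD loader so the sketch runs without a browser.
// In the UI, RequireJS would provide define() and resolve dependencies.
const registry = {
  // Hypothetical stub of the OAE UI API; not the real module contents.
  'oae.core': {
    describe: function (name) { return 'widget:' + name; }
  }
};

function define(dependencies, factory) {
  const resolved = dependencies.map(function (name) {
    return registry[name];
  });
  return factory.apply(null, resolved);
}

// The widget module: require the APIs it needs, return the widget function.
const helloWidget = define(['oae.core'], function (oae) {
  return function (uid) {
    // Namespace generated IDs with the widget name and instance uid so
    // that several widgets can share one container without clashes.
    return {
      containerId: 'hello-' + uid + '-container',
      label: oae.describe('hello')
    };
  };
});

console.log(helloWidget('w1'));
```

Returning a function rather than running code at load time lets the container create several independent instances of the same widget, each with its own uid.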
  147. 147. i18n • UI available in multiple languages • Standard .properties files • 2 types of bundles • Core bundles • Widget bundles
  148. 148. i18n • Translation priority: 1. Widget user language file 2. Widget default language file 3. Container user language file 4. Container default language file
  149. 149. i18n • __MSG__TRANSLATION_KEY__
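A minimal JavaScript sketch of how `__MSG__KEY__` placeholders could be resolved against bundles in priority order follows. It assumes that widget bundles are consulted before container bundles and the user's language before the default; the `translate` helper and all bundle contents are made up for this example and are not the OAE implementation.

```javascript
// Replace every __MSG__KEY__ token with the first translation found,
// walking the bundles from highest to lowest priority.
function translate(text, bundles) {
  return text.replace(/__MSG__([A-Z0-9_]+?)__/g, function (token, key) {
    for (const bundle of bundles) {
      if (Object.prototype.hasOwnProperty.call(bundle, key)) {
        return bundle[key];
      }
    }
    return token; // leave unknown keys visible so they are easy to spot
  });
}

// Hypothetical bundles, listed in priority order.
const widgetUser = { HELLO: 'Bonjour' };
const widgetDefault = { HELLO: 'Hello', BYE: 'Goodbye' };
const containerUser = { SAVE: 'Enregistrer' };
const containerDefault = { SAVE: 'Save', CANCEL: 'Cancel' };
const bundles = [widgetUser, widgetDefault, containerUser, containerDefault];

const translated = translate(
  '__MSG__HELLO__ / __MSG__BYE__ / __MSG__SAVE__ / __MSG__CANCEL__',
  bundles
);
console.log(translated);
```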
  150. 150. i18n • English • French • German • Italian • Spanish • Russian (Partial) • Chinese (Partial)
  151. 151. l10n • API methods for localizing: • Timezones • Date Formatting • Currency
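As an illustration of what these three kinds of localization involve, the sketch below uses the standard ECMAScript `Intl` API. This is not the OAE l10n API, whose method names are not given on the slide; the sample date, timezone, and amount are arbitrary.

```javascript
// 12 June 2013, 19:00 UTC.
const when = new Date(Date.UTC(2013, 5, 12, 19, 0, 0));

// Timezone plus date formatting: render the instant as a San Diego date.
const dateInSanDiego = new Intl.DateTimeFormat('en-US', {
  timeZone: 'America/Los_Angeles',
  year: 'numeric',
  month: 'long',
  day: 'numeric'
}).format(when);

// Currency formatting for a US English user.
const price = new Intl.NumberFormat('en-US', {
  style: 'currency',
  currency: 'USD'
}).format(1234.5);

console.log(dateInSanDiego); // June 12, 2013
console.log(price);          // $1,234.50
```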
  152. 152. UI templating • TrimPath • Avoids lots of DOM manipulation • Pass in JSON data • Supports if statements, for loops, etc.
  153. 153. UI templating • Templates are defined in between <!-- --> • oae.api.util.template().render(...)
  154. 154. UI templating • Template: <div id="example_template"><!-- <h4>Welcome ${firstName}.</h4> You are ${profile.age} years old. --></div> • Input: oae.api.util.template().render($("#example_template"), { "firstName": "John", "profile": { "placeofbirth": "Los Angeles", "age": 45 } }); • Result: <h4>Welcome John.</h4> You are 45 years old.
  155. 155. UI templating • Template: <div id="example_template"><!-- {if score >= 5} <h1>Congratulations, you have succeeded!</h1> {elseif score >= 0} <h1>Sorry, you have failed</h1> {else} <h1>You have cheated</h1> {/if} --></div> • Input: oae.api.util.template().render($("#example_template"), { "score": 6 }); • Result: <h1>Congratulations, you have succeeded!</h1>
  156. 156. UI templating • Template: <div id="example_template"><!-- {for conference in conferences} <div>${conference.name} (${conference.year})</div> {forelse} <div>No conferences have been organized</div> {/for} --></div> • Input: oae.api.util.template().render($("#example_template"), { "conferences": [ { "name": "Sakai San Diego", "year": 2013 }, { "name": "Sakai Atlanta", "year": 2012 } ] }); • Result: <div>Sakai San Diego (2013)</div> <div>Sakai Atlanta (2012)</div>
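The mechanics behind these examples (read the template out of an HTML comment, then substitute values from the JSON data) can be sketched in a few lines of JavaScript. This is not TrimPath and not the OAE `render` implementation: `renderSimple` is a hypothetical helper that handles only `${path}` substitution and ignores `{if}` and `{for}` blocks.

```javascript
// Extract the template body from between <!-- and -->, then replace each
// ${dotted.path} expression with the matching value from `data`.
function renderSimple(markup, data) {
  const match = /<!--([\s\S]*?)-->/.exec(markup);
  const template = match ? match[1] : markup;
  return template.replace(/\$\{([\w.]+)\}/g, function (token, path) {
    const value = path.split('.').reduce(function (obj, key) {
      return obj == null ? undefined : obj[key];
    }, data);
    return value === undefined ? '' : String(value);
  });
}

const markup =
  '<div id="example_template"><!--<h4>Welcome ${firstName}.</h4>' +
  'You are ${profile.age} years old.--></div>';

const html = renderSimple(markup, {
  firstName: 'John',
  profile: { placeofbirth: 'Los Angeles', age: 45 }
});
console.log(html);
```

Keeping the template inside a comment means the browser never renders the raw placeholders, while the markup still lives next to the HTML it belongs to.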
  157. 157. Topics 1. Project Goals 2. Hilary System Architecture 3. Clustering 4. Hilary Design and Extension Patterns 5. Performance Testing 6. Deployment and Automation 7. UI Architecture 8. Customization and Configuration 9. Part 2: Hands-on example. Or something else?
  158. 158. Customization and configuration • Administration UI • Global administration • Tenant administration • Manage production environment
  159. 159. Tenant management • Start, stop, edit tenants • Create new tenants
  160. 160. Tenant management
  161. 161. Tenant configuration • Configure the global tenant (whose settings individual tenants can override) or an individual tenant's configuration • Configure on the fly • Single Sign On integration • Default UI language • Default visibility settings • Data storage settings • etc.
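The relationship between global and per-tenant configuration can be pictured as a simple override merge, sketched below in JavaScript. The configuration keys, the tenant alias `cam`, and the `configFor` helper are hypothetical and do not reflect the actual OAE configuration schema.

```javascript
// Global defaults that apply to every tenant (hypothetical keys).
const globalConfig = { language: 'en_GB', visibility: 'public' };

// Per-tenant overrides, keyed by a hypothetical tenant alias.
const tenantOverrides = {
  cam: { language: 'fr_FR' }
};

// A tenant's effective configuration is the global configuration with
// that tenant's own values layered on top.
function configFor(tenantAlias) {
  return Object.assign({}, globalConfig, tenantOverrides[tenantAlias]);
}

console.log(configFor('cam'));   // language overridden, visibility inherited
console.log(configFor('other')); // falls back to the global values
```

Because the merge happens at lookup time, a change to either layer takes effect on the next lookup, which is one way to read "configure on the fly".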
  162. 162. Tenant configuration
  163. 163. Tenant skinning • Skinning UI • Skin entire application • Branding, colors, etc. • LESS • Component re-use
  164. 164. Tenant skinning
  165. 165. Tenant skinning
  166. 166. Tenant skinning
  167. 167. Extending with NPM • NPM - Node Package Manager • Dependency management, including remotely fetching custom modules from the NPM repository or GitHub • Stored inside the node_modules directory of your project • Usually a logical set of functionality (e.g., a back-end REST API, or a set of related widgets) • NPM modules in 3akai-ux are searched for custom widgets • NPM modules in Hilary whose names start with oae- are searched for an init.js that integrates them with the application container • New dependencies can be added to the package.json file • Changes to this file must be maintained with a patch, though :(
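The discovery rule for Hilary modules (names starting with `oae-`, each contributing an `init.js`) can be sketched as follows. This is an illustration of the naming convention only, not Hilary's actual loader; `findExtensionEntryPoints` and the example module names are hypothetical.

```javascript
// Given the directory names found under node_modules, return the init.js
// paths of the modules that follow the oae- naming convention.
function findExtensionEntryPoints(moduleNames, modulesDir) {
  return moduleNames
    .filter(function (name) { return name.indexOf('oae-') === 0; })
    .map(function (name) { return modulesDir + '/' + name + '/init.js'; });
}

const found = findExtensionEntryPoints(
  ['express', 'oae-example', 'oae-myuniversity', 'underscore'],
  'node_modules'
);
console.log(found);
```

A prefix convention like this means adding an extension needs no registration step beyond declaring the dependency.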
  168. 168. UI Release Processes • Grunt • Task-based build system implemented in JavaScript • Similar in theory of operation to Make, Rake • Rich ecosystem of plug-ins to do most tasks • Easy to implement a new task when a plug-in doesn’t exist yet • Used for running test suites, production builds, linting tools
  169. 169. UI Release Processes • Production Build • Optimizes the static assets to reduce bandwidth and request frequency, and to improve caching across versions • Require.js Optimization: • Concatenate JavaScript dependencies (reduces number of web requests significantly) • Minify / Uglify JavaScript files (reduces payload sizes significantly, even when gzip is enabled on the web server) • Hash optimization: • Hash the contents of static assets and append the result to the filename, then cache them indefinitely in the browser • When a file changes, the hash in its filename changes, which forces the updated asset to be reloaded • If a file does not change between versions, the client does not reload it until its cache is cleared
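The hash optimization step can be sketched with Node's built-in `crypto` module. The slides do not say which hash function, digest length, or filename layout the build uses, so SHA-1, the eight-character prefix, and the `name.hash.ext` pattern below are assumptions; `hashedName` is a hypothetical helper.

```javascript
const crypto = require('crypto');

// Build a cache-busting filename from the file's contents: the same
// contents always give the same name, any change gives a new one.
function hashedName(filename, contents) {
  const digest = crypto
    .createHash('sha1')
    .update(contents)
    .digest('hex')
    .slice(0, 8);
  const dot = filename.lastIndexOf('.');
  if (dot === -1) {
    return filename + '.' + digest;
  }
  return filename.slice(0, dot) + '.' + digest + filename.slice(dot);
}

console.log(hashedName('oae.core.js', 'var version = 1;'));
console.log(hashedName('oae.core.js', 'var version = 2;'));
```

Since the URL changes whenever the contents change, the web server can mark these files as cacheable indefinitely without risking stale assets.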
  170. 170. Developer Resources: Widget SDK • Contains help on creating widgets • Code best practices • Design style guide • UI and API documentation • Widget Builder • Examples
  171. 171. Developer Resources: Docs UI • UI that has documentation automatically generated from the docs in the Hilary and 3akai-ux source code • Accessible from the /docs path of any tenant
  172. 172. Topics 1. Project Goals 2. Hilary System Architecture 3. Clustering 4. Hilary Design and Extension Patterns 5. Performance Testing 6. Deployment and Automation 7. UI Architecture 8. Customization and Configuration 9. Part 2: Hands-on example. You do have Hilary installed, right?