Apereo OAE - Bootcamp



The Apereo Open Academic Environment is a platform that focusses on group collaboration between researchers, students and lecturers, and strongly embraces openness, creation, re-use, re-mixing and discovery of content, people and groups.

How does Apereo OAE work? OAE targets large-scale, multi-tenant, cloud-compatible deployments, where a single installation can host multiple institutions at the same time.

This presentation gives a detailed overview of the overall architecture and its components and technologies. We take a closer look at each of the following components and how they are used:

- Node.js
- OAE Widgets
- Apache Cassandra
- ElasticSearch
- Redis
- Nginx

We also discuss the approach used for continuous nightly performance testing and how we validate the desired (horizontal) scalability. Details of back-end and UI unit testing, code coverage and security testing are shared as well.



Usage Rights

© All Rights Reserved

    Apereo OAE - Bootcamp Presentation Transcript

    • Apereo OAE Bootcamp, San Diego 2013
      Wednesday, 12 June 13
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands-on example
      Or something else?
    • “Supporting academic collaboration”
    • Project goals
      • Multi-tenant platform
      • Cloud-ready
      • SaaS
      • Used at large scale
    • Project goals
      • Maintainable
      • Extendable
      • Integrate-able
    • Solid foundation: modern, not exotic
    • July 1, 2013: 1st production release
    • Multi-tenancy
    • Multi-tenancy
      • The market is heading towards multi-tenancy
      • Support multiple institutions at the same time
      • Multi-tenancy+
      • Easily created, maintained and configured
    • Performance!
      • Ability to scale horizontally
      • Evidence-based
      • Continuous
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands-on example
    • OAE Architecture
      The Apereo OAE project is made up of two distinct source code platforms:
      • “Hilary”
        • Server-side RESTful web platform that exposes the OAE services
        • Written entirely in server-side JavaScript, running on Node.js
      • “3akai-ux”
        • A client-side / browser platform that provides the HTML, JavaScript and CSS that make up the browser UI of the application
    • OAE Architecture
    • Hilary System Architecture
    • Application Servers
      • Written in server-side JavaScript, run in Node.js
      • Node.js used by: eBay, LinkedIn, Storify, Trello
      • Lightweight, single-threaded, event-driven platform that processes I/O asynchronously (non-blocking)
      • Uses callbacks and an event queue to stash work to be done after I/O or other heavy processes complete
      • App servers can be configured into functional specializations:
        • User Request Processor
        • Activity Processor
        • Search Indexer
        • Preview Processor
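To make the callback / event-queue model above concrete, here is a minimal, generic Node.js sketch (not Hilary code) that uses only the built-in `fs`, `os` and `path` modules; the function name `demoNonBlockingIo` is made up for illustration. The read is started, the calling code carries on, and the callback only runs from the event queue once the I/O has finished.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Records the order in which work runs, to show that the read callback is deferred
// until the I/O completes while the calling code carries on.
function demoNonBlockingIo() {
  const order = [];
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'oae-demo-'));
  const file = path.join(dir, 'data.txt');
  fs.writeFileSync(file, 'hello'); // setup only; synchronous on purpose

  return new Promise((resolve, reject) => {
    // Start the read; Node.js hands it off and returns immediately.
    fs.readFile(file, 'utf8', (err, contents) => {
      fs.rmSync(dir, { recursive: true, force: true }); // clean up the temp dir
      if (err) return reject(err);
      // Runs later, from the event queue, once the read has finished.
      order.push(`read done: ${contents}`);
      resolve(order);
    });
    // Runs first: the single thread was not blocked waiting for the disk.
    order.push('other work done while the read is pending');
  });
}

demoNonBlockingIo().then((order) => console.log(order));
```

Because the thread is never parked waiting on the disk, a single Node.js process can keep serving other requests while many such I/O operations are in flight.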
    • Apache Cassandra
      • Authoritative data source
      • Provides high availability and fault tolerance without trading away performance
      • Regarding the CAP theorem, Cassandra favours Availability and Partition tolerance over Consistency; however, consistency is tunable at the query level (we almost always use “quorum”)
      • Uses a ring topology to shard data across nodes, with configurable replication factors
      • No RDBMS?
        • Cassandra gives more flexibility with incremental scalability in a cloud environment
        • Flexible scaling helps to overcome the unpredictable growth of multi-tenant systems
        • Medium-to-long-term options for replicating data to multiple data centers, localizing both reads and writes
      • Used by: Netflix, eBay, Twitter
    • ElasticSearch
      • Lucene-backed search platform
      • Built for masterless incremental scaling and high availability
      • Powers Hilary search, including library, related content, group members and memberships
      • Exposes HTTP RESTful APIs for indexing and querying documents
      • RESTful query interface uses a JSON-based Query DSL
      • Used by: GitHub, Foursquare, StackOverflow, WordPress
    • RabbitMQ
      • Message queue platform written in Erlang
      • Used for distributing tasks to specialized application server instances
      • Supports active-active queue mirroring for high availability
      • Used by: Joyent
    • Redis
      • Fills a variety of roles:
        • Broadcast messaging (can move to RabbitMQ)
        • Locking
        • Caching of basic user profiles
        • Holds volatile activity aggregation data
      • Comes with no managed clustering solution (yet), but has slave replication for active fail-over
      • Some clients manage master-slave switching and distributed reads for you
      • Used by: Twitter, Instagram, StackOverflow, Flickr
    • Etherpad
      • Open Source collaborative editing application written in Node.js
      • Originally developed by Google and Mozilla
      • Licensed under Apache License v2
      • Powers collaborative document editing in OAE
    • Nginx
      • HTTP and reverse-proxy server
      • Used to distribute load to application servers and etherpad servers, and to stream file downloads
      • Useful rate-limiting features based on source IP
      • Used by: Netflix, WordPress.com
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands-on example
    • Clustering Cassandra
      • Cassandra uses a partitioned ring topology to distribute data
      • Nodes are given a numeric token which determines their “location” in the ring
      • When data rows are read from / written to Cassandra, the row key is hashed using a “Partitioner”, which determines which node holds the row’s data
      • A “Replication Strategy” is used to determine which nodes will hold replicas of which rows
        • E.g., “SimpleStrategy” places the remaining N - 1 replicas (for a replication factor of N) on the nodes that follow the primary node clockwise around the ring
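The token ring and SimpleStrategy placement can be sketched in a few lines of JavaScript. This is a toy model under stated assumptions: the MD5-based `tokenFor` stands in for Cassandra's real partitioners, and the node names and tokens are invented. The owner of a key is the first node whose token is greater than or equal to the key's token (wrapping around), and the remaining replicas are the next nodes clockwise.

```javascript
const crypto = require('crypto');

// Toy partitioner: maps a row key to a 32-bit token.
// Assumption: Cassandra's real partitioners (e.g. Murmur3Partitioner) use a different
// hash and token range; MD5 is used here only because it ships with Node.js.
function tokenFor(rowKey) {
  return crypto.createHash('md5').update(rowKey).digest().readUInt32BE(0);
}

// The primary replica is the first node (by token) whose token is >= the key's token,
// wrapping around to the lowest token; the next N - 1 nodes clockwise hold the other
// replicas, as SimpleStrategy does.
function replicasFor(rowKey, nodes, replicationFactor) {
  const ring = [...nodes].sort((a, b) => a.token - b.token);
  const token = tokenFor(rowKey);
  let primary = ring.findIndex((node) => node.token >= token);
  if (primary === -1) primary = 0; // past the highest token: wrap around
  const count = Math.min(replicationFactor, ring.length);
  const replicas = [];
  for (let i = 0; i < count; i++) {
    replicas.push(ring[(primary + i) % ring.length].name);
  }
  return replicas;
}

// Hypothetical three-node ring with evenly spaced tokens.
const nodes = [
  { name: 'db0', token: 0 },
  { name: 'db1', token: 1431655765 },
  { name: 'db2', token: 2863311530 },
];
console.log(replicasFor('u:cam:bert', nodes, 2));
```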
    • Clustering Cassandra (cont’d)
    • Clustering Cassandra (cont’d)
      • Query consistency is specified at request time (RF = replication factor; RF/2 is rounded down)
        • ALL - All replicas must respond successfully
        • QUORUM - (RF/2 + 1) replicas must respond successfully
        • LOCAL_QUORUM - (RF/2 + 1) replicas (in the local datacenter) must respond successfully
        • EACH_QUORUM - (RF/2 + 1) replicas (in each datacenter) must respond successfully
        • ONE - Only one replica must respond successfully
      • Therefore, if you write with QUORUM and then read with QUORUM, results should always be consistent: two quorums of the same replicas always overlap in at least one replica, so every read reaches a replica holding the latest write
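The quorum argument can be checked with a little arithmetic. The sketch below (helper names are illustrative) uses the standard condition that a read is guaranteed to see the latest write when the number of replicas read plus the number written exceeds the replication factor; two quorums always satisfy it, while ONE/ONE does not.

```javascript
// Replicas that must respond for a QUORUM request (RF/2 rounded down, plus one).
function quorum(replicationFactor) {
  return Math.floor(replicationFactor / 2) + 1;
}

// Read and write replica sets are guaranteed to overlap when R + W > RF.
function readsSeeLatestWrite(replicationFactor, writeReplicas, readReplicas) {
  return writeReplicas + readReplicas > replicationFactor;
}

const rf = 3;
console.log(quorum(rf));                                      // 2
console.log(readsSeeLatestWrite(rf, quorum(rf), quorum(rf))); // true  (QUORUM / QUORUM)
console.log(readsSeeLatestWrite(rf, 1, 1));                   // false (ONE / ONE)
```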
    • Clustering ElasticSearch
      • ElasticSearch splits each index into a number of shards, configured when the index is created
      • The “number of replicas” can be changed at runtime and determines how many replicas of each shard should exist
      • Shard replicas are distributed among the nodes in the cluster
      • The shard that holds a document is selected by a hash of the document id
    • Clustering ElasticSearch (cont’d)
    • Clustering Etherpad
      • Short and sweet: it doesn’t really cluster
      • Data is stored in Cassandra, but active sessions on a document must all share the same etherpad server
      • Configure the number of etherpad servers and their hosts in Hilary, and configure Nginx to proxy to the appropriate server
      • The server is selected based on a numeric hash of the content item id
      • No high availability: if an etherpad server goes down, its sessions are disconnected :( but etherpad content is retained
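A minimal sketch of picking an etherpad server from the content item id, assuming a hypothetical list of three hosts and a simple 31-multiplier string hash; Hilary's actual hash function and configuration format are not shown on the slides. The same hash-then-modulo idea is what places an ElasticSearch document on a shard, although ElasticSearch uses its own hash function.

```javascript
// Hypothetical server list; in OAE the etherpad hosts are configured in Hilary.
const etherpadServers = ['etherpad0:9001', 'etherpad1:9001', 'etherpad2:9001'];

// Simple string hash ("hash * 31 + charCode"), kept as an unsigned 32-bit integer.
// Assumption: any stable hash works, as long as Hilary and Nginx agree on it.
function numericHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Every user editing the same content item lands on the same server.
function etherpadServerFor(contentId) {
  return etherpadServers[numericHash(contentId) % etherpadServers.length];
}

console.log(etherpadServerFor('c:cam:abc123'));
```

Because the choice depends only on the content id, every editor of a given document ends up on the same server; the trade-off, as noted above, is that losing that server disconnects its sessions.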
    • Clustering Etherpad (cont’d)
    • Clustering RabbitMQ
      • Uses a master-slave, active/active queue mirroring policy for redundancy
      • We apply a policy of ha-mode=all to all OAE queues, which:
        • Ensures messages are replicated to every queue mirror
        • Ensures all subscribed consumers receive messages
      • Since all nodes are active peers, when a node fails, consumers simply reconnect
    • Clustering Search Indexers
      • Search Indexers are regular application nodes configured to consume search index tasks
        • INDEX_UPDATE, INDEX_DELETE
      • Offloads fetching and processing of indexing tasks to nodes that don’t impact request latency
    • Clustering Activity Processors
      • Regular application nodes that:
        • Receive and route activity tasks
        • Collect and aggregate routed activities
      • May be configured with a dedicated Redis server for aggregation
      • Aggregation is the process of deeming 2 or more activities “similar” and grouping them into a single activity
        • Maintains temporary information in Redis to keep track of which activities have occurred recently, and aggregates them on the fly
      • Due to concurrency issues when aggregating new activities into feeds, routed activities are sharded into “concurrency buckets”
        • Avoids duplicate activities in streams, without completely serializing the aggregation process
        • The bucket is selected based on a hash of:
          • Activity Stream ID (e.g., user or group id)
          • Activity Type (e.g., share content, add to group, etc.)
    • Clustering Activity Processors (cont’d)
      • Activity0 and Activity1 were serialized into bucket 0 to avoid concurrency collisions in user A’s stream
      • A bucket is only collected by one activity processor at a time
      • An activity processor can concurrently collect multiple buckets (max concurrent buckets is configurable)
      • The number of buckets is configurable (it is 3 in this example)
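The bucketing described above can be sketched as follows. Assumptions: the hash, `NUM_BUCKETS = 3` (matching the example) and the use of an in-process promise chain as the per-bucket "lock" are illustrative only; in OAE a bucket is collected by one activity processor at a time across the cluster, which an in-memory chain cannot provide.

```javascript
const NUM_BUCKETS = 3;

function numericHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Same stream + same activity type => same bucket, so those activities are
// aggregated one after another instead of racing each other.
function bucketFor(routedActivity) {
  const key = `${routedActivity.streamId}#${routedActivity.activityType}`;
  return numericHash(key) % NUM_BUCKETS;
}

// One promise chain per bucket: work within a bucket runs serially, while
// different buckets can proceed concurrently.
const bucketTails = new Map();

function collect(routedActivity, aggregate) {
  const bucket = bucketFor(routedActivity);
  const tail = bucketTails.get(bucket) || Promise.resolve();
  const next = tail.then(() => aggregate(routedActivity, bucket));
  bucketTails.set(bucket, next.catch(() => {})); // keep the chain alive after errors
  return next;
}

// Usage: collect({ streamId: 'u:cam:userA', activityType: 'content-share' }, aggregateFn);
```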
    • Clustering Preview Processor
      • Regular application node that is specialized to handle preview processing tasks
        • GENERATE_PREVIEWS
      • Offloads the CPU- and memory-intensive process of generating previews to machines that don’t impact user request latency
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands-on example
      Still with us?
    • Hilary design and extension points
      • Common patterns
        • Search
        • Activities
        • File storage
        • Preview Processor
    • Search producers / transformers
      • Search producers
        • Generate documents that need to go in the index
      • Search transformers
        • Transform query results coming back from ES into something the UI can use
    • Search Producers
      • Produce documents that can be indexed/stored by ElasticSearch
        • Simple JSON document
        • A search document contains the full profile (i.e., no data is hidden)
      • Run on separate Search Indexing servers
    • Search Producers - workflow
    • Search Transformers
      • Transform an ElasticSearch result into something the UI can use
        • Hide sensitive user data (if necessary)
        • Add thumbnail URLs
      • Run on the application servers
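A hypothetical search transformer for user results, illustrating the two responsibilities listed above. The field names (`visibility`, `publicAlias`, `thumbnailPath`), the `ctx.userId` property and the `/thumbnails/` URL prefix are all assumptions for this sketch, not Hilary's schema or API.

```javascript
// Hypothetical prefix under which thumbnails are served; not an actual Hilary URL.
const THUMBNAIL_URL_PREFIX = '/thumbnails/';

// ctx.userId: id of the signed-in viewer, or undefined for anonymous users (assumption).
function transformUserResult(ctx, doc) {
  const result = {
    id: doc.id,
    resourceType: doc.resourceType,
    visibility: doc.visibility,
  };

  // The indexed document holds the full profile, so decide what this viewer may see.
  const isSelf = Boolean(ctx.userId) && ctx.userId === doc.id;
  const visible =
    isSelf || doc.visibility === 'public' || (doc.visibility === 'loggedin' && Boolean(ctx.userId));
  if (visible) {
    result.displayName = doc.displayName;
    result.email = doc.email;
  } else {
    // Hide sensitive data: only the public alias is exposed.
    result.displayName = doc.publicAlias;
  }

  // Turn the stored thumbnail reference into a URL the browser can load.
  if (doc.thumbnailPath) {
    result.thumbnailUrl = THUMBNAIL_URL_PREFIX + encodeURIComponent(doc.thumbnailPath);
  }
  return result;
}
```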
    • Custom search queries
      • Exposed as a REST API, e.g.:
        • /api/search/general
        • /api/search/content-library
        • /api/search/<search name>
    • Custom search queries
      // GET http://cam.oae.com/api/search/custom-foo?q=bar
      var SearchAPI = require('oae-search');

      SearchAPI.registerSearch('custom-foo', function(ctx, opts, callback) {
          // The query you need to write
          // opts.q === 'bar'
          var query = …;

          // Access-scope the results
          var filter = …;

          // SearchUtil: OAE search utilities (its require is not shown on the slide)
          callback(null, SearchUtil.createQuery(query, filter, opts));
      });
    • Activities
      • Follow the activitystrea.ms spec
      • Each activity has:
        • an object (content item “presentation.ppt”)
        • an actor (Branden Visser)
        • a target (Bert Pareyn)
        • a verb (to share)
      • Activities end up in activity streams
      • Generated by separate activity servers
    • Activities
      • Activity Seeds
      • Activity Producers
      • Activity Routers
      • Activity Transformers
    • Activities and notifications
      • Notifications are “special” activities that are routed to a separate activity stream
      • E-mails can be sent out for notifications
      • Piggy-backing notifications on activities gives you aggregation for free
    • ActivityAPI.registerActivityType('content-comment', {
          groupBy: [{target: true}],
          notifications: {
              email: true,
              emailTemplateModule: 'oae-content',
              emailTemplateId: 'notify-content-comment'
          }
      });
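One way to read the `groupBy: [{target: true}]` registration above is: two activities of the same type are "similar" when they share a target. The sketch below implements that reading for a single `groupBy` entry; it illustrates the idea rather than Hilary's aggregation code, and the activity shape is assumed.

```javascript
// Build the key that decides whether two activities are "similar". groupBySpec is one
// entry of the groupBy array, e.g. {target: true}: same activity type and same target.
function aggregateKey(activity, groupBySpec) {
  const parts = [activity.activityType];
  for (const role of ['actor', 'object', 'target']) {
    if (groupBySpec[role]) {
      parts.push(`${role}=${activity[role] ? activity[role].id : ''}`);
    }
  }
  return parts.join('|');
}

// Fold activities into aggregates, collecting the actors and objects that differ.
function aggregate(activities, groupBySpec) {
  const aggregates = new Map();
  for (const activity of activities) {
    const key = aggregateKey(activity, groupBySpec);
    if (!aggregates.has(key)) {
      aggregates.set(key, {
        activityType: activity.activityType,
        target: activity.target,
        actors: new Set(),
        objects: new Set(),
      });
    }
    const agg = aggregates.get(key);
    agg.actors.add(activity.actor.id);
    agg.objects.add(activity.object.id);
  }
  return [...aggregates.values()];
}
```

With `{target: true}`, two people commenting on the same document collapse into one "X and Y commented on Z" entry, which is also what the notification stream then inherits.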
    • Activity seeds
      • When something happens, an activity seed is created and sent out to RabbitMQ
      • Contains the data the Activity Servers need to produce the persisted activities and generate the routes
    • ContentAPI.on('content-comment', function(ctx, comment) {
          // Create the actor, object and target objects for the activity
          ..
          // Construct the activity seed
          ..
          // Submit to RabbitMQ
          ActivityAPI.postActivity(ctx, activitySeed);
      });
      Local events get offloaded to RabbitMQ
    • Activity Producers
      • Produce the persisted entity that should be stored for each activity
        • Each entity should hold all the data necessary for producing routes and transforming into UI-friendly data
        • Should be as compact as possible, as activities are de-normalized and an entity is saved per stream (each user has at least 2 streams, so this is a lot of data)
      • Produced on separate activity servers
    • Activities
      // Persisted activity entity
      {
          "oae:activityType": "content-share",   // Required
          "published": "2011-02-10T15:04:55Z",   // Required
          "verb": "share",                       // Required
          "actor": { <ProducedEntity> },
          "object": { <ProducedEntity> },
          "target": { <ProducedEntity> }
      }
    • // Transformed activity entity
      {
          "oae:activityType": "content-share",
          "published": "2011-02-10T15:04:55Z",
          "verb": "share",
          "actor": {
              "objectType": "user",
              "id": "http://my.oae.org/api/user/u:oae:mrvisser",
              "displayName": "Branden Visser",
              "url": "http://some.oae.org/~u:oae:mrvisser",
              "image": { .. }
          },
          "object": {
              "objectType": "content",
              "oae:contentType": "file",
              "oae:mimeType": "image/png",
              "id": "http://my.oae.org/content/contentId",
              "url": "http://my.oae.org/content/contentId",
              "displayName": "Super cool image",
              "image": { .. }
          },
          "target": {
              "objectType": "user",
              "id": "http://my.oae.org/user/u:cam:bert",
              "url": "http://my.oae.org/~u:cam:bert",
              "image": { .. }
          }
      }
    • Activity routes
      • Activities can be routed to “activity streams”
        • Each user and group has an activity stream; users also have a notification stream
      • Routing is the process of taking an activity and determining who should receive it
      • A route is a simple string with the ID of the principal to which the activity should be delivered
      • Routed on separate activity servers
    • Activity Routes - Propagation
      • Permissions/privacy in activities
      • Possible values:
        • ANY: the produced entity data can be routed to any route, i.e., the entity is public or loggedin
        • ROUTES: the entity can only be propagated to the activity routes specified by the router
        • SPECIFY: specify additional routes the entity should be routed to
      • Example: Branden adds Bert to the private group “OAE-Team”
        • Actor: Branden, Object: “OAE-Team” group, Target: Bert
        • The default routing would generate an activity for all the managers of the OAE-Team group; a route for Bert needs to be added as well
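A minimal sketch of applying the propagation rules when delivering an activity, assuming each entity carries its own `routes` (the routes derived for that entity) and a `propagation` object of the form `{ type: 'any' | 'routes' | 'specify', routes }`. These shapes and function names are hypothetical; the example mirrors the OAE-Team case.

```javascript
// May this entity's data be delivered to `route`?
function canPropagateTo(route, entity) {
  const { type, routes: specified = [] } = entity.propagation;
  const ownRoutes = entity.routes || []; // routes derived for this entity (assumed field)
  switch (type) {
    case 'any': // public / loggedin entity
      return true;
    case 'routes': // only the entity's own routes
      return ownRoutes.includes(route);
    case 'specify': // the entity's own routes plus explicitly specified ones
      return ownRoutes.includes(route) || specified.includes(route);
    default:
      throw new Error(`Unknown propagation type: ${type}`);
  }
}

// Deliver the activity only to routes that every entity in it may propagate to.
function deliverableRoutes(activityRoutes, entities) {
  return activityRoutes.filter((route) => entities.every((entity) => canPropagateTo(route, entity)));
}

// Example: Branden adds Bert to the private group "OAE-Team".
const branden = { id: 'u:oae:mrvisser', propagation: { type: 'any' } };
const bert = { id: 'u:cam:bert', propagation: { type: 'any' } };
const oaeTeam = {
  id: 'g:oae:oae-team',
  routes: ['u:oae:mrvisser'], // the group's managers
  propagation: { type: 'specify', routes: ['u:cam:bert'] }, // also tell Bert
};
const candidateRoutes = ['u:oae:mrvisser', 'u:cam:bert', 'u:cam:someone-else'];
console.log(deliverableRoutes(candidateRoutes, [branden, oaeTeam, bert]));
// → [ 'u:oae:mrvisser', 'u:cam:bert' ]
```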
    • ActivityAPI.registerActivityRouter('content-comment', function(...) {
          // Generate routes for all the content managers/viewers
          ...
          // Generate routes for all the recent contributors
          ...
          // Return the generated routes
      });
    • Activity Transformers
      • Transform persisted activities into activitystrea.ms-compliant results
      • Add OAE-specific data that can be consumed by the UI
        • e.g., more images
    • ActivityAPI.registerActivityEntityTransformer('content-comment', function(ctx, entities) {
          // Add thumbnail URLs
          ...
          // Add replies to comments
          // (This data is available in the entity itself, it just needs to be cleaned up.)
          ..
          // return
      });
    • File storage
      • New file storage backends can be plugged in
      • Available storage backends:
        • Local disk storage / mounted NFS
        • Amazon S3
      • Try not to serve actual file bodies with Hilary
    • Preview Processor
      • Generates thumbnails and large, readable preview images of content in the system
      • Uses the REST API to interact with OAE
        • Isolation from the application server
      • Informed of new content to process by RabbitMQ
        • Allows for multiple processors, prevents dropped messages
    • Preview Processor
      • Existing processors for:
        • Collabdocs - uses webshot to take a browser screenshot of the etherpad
        • Images - GraphicsMagick is used to make various sized thumbnails
        • Office docs and PDFs - individual page snapshots using LibreOffice and GraphicsMagick, so the whole document may be viewed in the browser
        • Arbitrary links
          • Flickr, SlideShare, Vimeo, YouTube: special handling for their REST APIs to fetch the display name, description and preview images directly
          • Other links: uses webshot to take a screenshot of links for which it doesn’t have a specific handler
      • Creating custom processors
        • New processors can be added in new NPM modules
        • Uses a registration pattern to hook into the Preview Processor
        • Flexible metadata to back custom widgets
    • Storage interface
      • get
      • store
      • remove
      • getDownloadLink
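A hypothetical local-disk backend implementing the four operations of the storage interface. The Node-style callback signatures, the `local:` URI scheme and the `downloadPrefix` parameter are assumptions for illustration; only the operation names come from the slide. `getDownloadLink` returns a URL for the web server to serve, in line with the advice not to stream file bodies through Hilary.

```javascript
const fs = require('fs');
const path = require('path');

function createLocalStorage(rootDir, downloadPrefix) {
  // Map a storage URI such as "local:c/cam/abc/file.png" onto a path under rootDir.
  const toPath = (uri) => {
    const resolved = path.resolve(rootDir, uri.replace(/^local:/, ''));
    if (!resolved.startsWith(path.resolve(rootDir) + path.sep)) {
      throw new Error(`URI escapes storage root: ${uri}`); // block "../" tricks
    }
    return resolved;
  };

  return {
    store(key, buffer, callback) {
      const uri = `local:${key}`;
      const target = toPath(uri);
      fs.mkdir(path.dirname(target), { recursive: true }, (mkdirErr) => {
        if (mkdirErr) return callback(mkdirErr);
        fs.writeFile(target, buffer, (err) => callback(err, uri));
      });
    },
    get(uri, callback) {
      fs.readFile(toPath(uri), callback);
    },
    remove(uri, callback) {
      fs.unlink(toPath(uri), callback);
    },
    // A URL that the web server (e.g. Nginx) serves directly from disk.
    getDownloadLink(uri) {
      return `${downloadPrefix}/${uri.replace(/^local:/, '')}`;
    },
  };
}
```

Swapping in an Amazon S3 backend would keep the same four operations, with `getDownloadLink` returning a URL that S3 serves directly.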
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands-on example
    • Performance testing
      • Workflow
      • Model loader
      • Tsung tests
      • Analysis
      • Cases
    • Workflow
      1. Generate data
      2. Load data into the system
      3. Tsung tests
      4. Circonus
      5. Analysis
    • Workflow
      • Establish a baseline by cycling through testing, analysis and improvement
      • Iterate and improve
      • Try to maintain an acceptable baseline when adding new features
    • OAE model loader
      • Node.js tool that generates and loads data
      • Tries to reflect real-life scenarios
        • e.g., 30% of the members of a group should be managers
    • Model loader - Generation
      • Generates re-runnable JSON files that define the data to be loaded into the system
      • All data is based on predefined distributions, which can be tweaked
      • Supports:
        • Users
        • Groups
        • Content
          • Links / Collaborative documents / Files
        • Discussions
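A minimal sketch of re-runnable, distribution-driven generation: a seeded pseudo-random generator (mulberry32, a small well-known PRNG chosen here for brevity) produces identical output on every run with the same seed, and a group's members are split so that 30% are managers. The field names and JSON layout are assumptions, not the model loader's actual format.

```javascript
// mulberry32: a tiny seeded PRNG returning floats in [0, 1). Same seed, same sequence.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick `memberCount` distinct users and make `managerRatio` of them managers
// (30% by default, as in the example on the slide).
function generateGroup(random, groupIndex, userIds, memberCount, managerRatio = 0.3) {
  const managerCount = Math.round(memberCount * managerRatio);
  const pool = [...userIds];
  const members = {};
  for (let i = 0; i < memberCount && pool.length > 0; i++) {
    const [userId] = pool.splice(Math.floor(random() * pool.length), 1);
    members[userId] = i < managerCount ? 'manager' : 'member';
  }
  return { id: `group-${groupIndex}`, members };
}

// Generate a small data set and serialize it; writing this JSON to disk lets the
// exact same data be loaded again later.
const random = mulberry32(42);
const users = Array.from({ length: 100 }, (_, i) => `user-${i}`);
const groups = [0, 1, 2].map((i) => generateGroup(random, i, users, 20));
const json = JSON.stringify({ users, groups }, null, 2);
console.log(`${groups.length} groups, ${json.length} bytes of JSON`);
```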
    • Model loader - Loading
      • Loads the data into the system
      • Writes the generated IDs to disk so they can be re-used in the Tsung tests
    • Tsung
      • Tsung is a distributed load-testing tool written in Erlang
      • Able to simulate thousands of concurrent users
      • Used to stress-test the application
    • Tsung
      • Takes an XML file that defines the HTTP requests and fires them off at the server
      • Sessions
        • Each session has a probability of execution
        • Contain: transactions, dynamic variables
      • Transactions
        • Contain: requests, think time
      • Requests
    • <session name="general_interest_term_anon" probability="17" type="ts_http">
        <setdynvars sourcetype="file" fileid="users.csv" delimiter=";" order="random">
          <var name="users_id" />
          <var name="users_username" />
          <var name="users_password" />
        </setdynvars>
        <transaction name="tx_login">
          <request subst="true">
            <dyn_variable name="loggedin_user_id" jsonpath="$.id"/>
            <http url="/api/auth/login" method="POST" version="1.1" contents="username=%%_users_username%%&amp;password=%%_users_password%%">
              <http_header name="Referer" value="/" />
            </http>
          </request>
        </transaction>
        <thinktime value="4" random="true"/>
        <transaction name="tx_general_search_search">
          <request subst="true">
            <http url="/api/search/general/all?q=%%_search_term_30%%&amp;from=10&amp;size=10" method="GET" version="1.1" ></http>
          </request>
        </transaction>
        ...
      </session>
    • node-oae-tsung
      • Tsung’s XML syntax is hard and boring, and XML files can get huge
      • => Automate it
      • Node.js tool to generate the XML file
        • Uses the generated IDs from the model loader data load
      • Contains 3 layers:
        • API
        • Tests
        • Suites
    • node-oae-tsung API
      • Each method in the Tsung API represents a high-level user action against your UI
      • Performs the REST API requests that would get called when a user performs that “click” / action / page visit
      • Encapsulates those requests into a “transaction”
      • e.g., loading the group members page would make requests to:
        • GET /api/me
        • GET /api/group/<group>/members
    • node-oae-tsung API
      // “Show the members of a group” page
      var members = module.exports.members = function(session, group) {
          var tx = session.addTransaction('group_members');
          tx.addRequest('GET', '/api/me');
          tx.addRequest('GET', '/api/group/' + group + '/members');
      };
    • node-oae-tsung test case
      • Each test case describes a possible user session by executing a number of “transactions” / API methods:
        • User logs in
        • User searches for groups
        • User visits a group
        • Thinks a bit
        • Searches for users to add
        • Adds some users
        • Performs a general search
        • Visits a content item
        • Shares that content item with a group
        • Logs out
      • The end result is a Tsung “session” object that contains many transactions and “think times”
    • module.exports.test = function(runner, probability) {
          probability = probability || 100;

          // Create a new session.
          var session = runner.addSession('add_group_users', probability);
          var user = User.login(session, '%%_group_add_users_manager_username%%', '%%_group_add_users_manager_passwo …  // line truncated on the slide
          Group.profile(session, '%%_group_add_users_group_id%%');
          session.think(2);

          // Go to the members list
          Group.members(session, groupId);
          session.think(6);

          // Add 2 users
          var update = {
              '%%_group_add_users_user_0%%': 'member',
              '%%_group_add_users_user_1%%': 'member'
          };
          Group.updateMembers(session, groupId, update);
          session.think(2);
          ...
      }
    • node-oae-tsung suite
      • Contains a list of test cases / sessions you want to include in your test suite (i.e., Tsung XML file)
      • Has an optional probability option to control the session distribution
      • Standard test suite file:
        general_interest_term_anon,15
        general_interest_term_auth,50
        general_interest_content_auth,5
        general_interest_group_auth,7
        general_interest_user_auth,10
        private_groups_interest,40
        study_group_content,40
        edit_content,40
        edit_group,15
        add_content_users,10
        add_content_groups,5
        add_group_users,10
        add_group_groups,5
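As far as we know, Tsung expects the `probability` values of all sessions in a load to add up to 100, whereas the weights in the suite file above add up to 252. One way to bridge that, sketched below, is to scale the weights into integer percentages with the largest-remainder method. This is an assumption about what such a generator needs to do, not node-oae-tsung's actual code, and the default weight of 100 for lines without one is also assumed.

```javascript
// Parse "testName,weight" lines; the weight is optional (assumed default: 100).
function parseSuite(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [name, weight] = line.split(',');
      return { name, weight: weight === undefined ? 100 : Number(weight) };
    });
}

// Scale the weights to integer percentages that sum to exactly 100
// (largest-remainder method).
function toPercentages(entries) {
  if (entries.length === 0) return [];
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  const exact = entries.map((e) => (e.weight * 100) / total);
  const result = entries.map((e, i) => ({ name: e.name, probability: Math.floor(exact[i]) }));
  let missing = 100 - result.reduce((sum, e) => sum + e.probability, 0);
  // Hand the leftover points to the entries that lost the most to rounding down.
  const order = exact
    .map((value, i) => ({ i, remainder: value - Math.floor(value) }))
    .sort((a, b) => b.remainder - a.remainder);
  for (let k = 0; missing > 0; k++, missing--) {
    result[order[k % order.length].i].probability += 1;
  }
  return result;
}
```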
    • Putting it together
      • Need to create a new mix of sessions and distributions? Just create a new suite file.
      • Need to create new sessions that do different things? Just create new test cases that use the API methods.
      • Did web requests in your application change on certain pages? Just update the requests executed in the API methods; all test cases / suites update with them.
      • New feature in your application? Create the API methods for the new actions and incorporate them into test cases.
      • Did your UI get completely overhauled? Oops... remodel your API methods from scratch.
    • Tsung
      • Take the tsung.xml file and run it
      • Depending on the session lengths, this can take a couple of hours
      • Generates graphs
    • Setup
      • 1 Nginx load balancer (0.5GB / 1 CPU)
      • 2 app nodes (0.5GB / 1 CPU)
      • 3 db nodes (8GB / 2 CPU)
      • 1 Redis node (0.5GB / 1 CPU)
      • 1 search node (0.5GB / 1 CPU)
    • Transactions
    • Request latency
    • Transactions / sec
    • Arrival rate of new users
    • Simultaneous users
    • HTTP Requests / sec
    • Circonus Telemetry
      • Circonus is a tool built by OmniTI to gather, graph and analyze data
      • Allows for push/pull data entry
    • Circonus Telemetry
      • Latencies
        • HTTP Request
        • Cassandra queries
        • Search queries
        • Permission checks
        • Activity collection / routing / delivery
      • Counts
        • Each API call (e.g., POST./api/user/create)
        • Cassandra queries (READ - WRITE)
        • Error counts
    • Circonus Telemetry
      • Has multiple graph types (line, bar, histogram)
      • Allows you to overlay multiple data points
    • Average latency for POST /api/*
    • Histogram latency - POST /api/*
    • So, does it scale?
      • Yes. We can scale the application horizontally by adding more nodes
      • Doubling the hardware roughly doubles the throughput
    • Simultaneous users
    • Requests / sec
    • Case: Permissions
      • Identified as the key component that would impact performance the most
      • Permissions are propagated indirectly through group membership
        • Steve has access to ContentA via GroupC
        • Steve has access to GroupA via GroupC and GroupB
    • Case: Permissions (cont’d)
      Attempt #1: First do a direct association check against the target. If unsuccessful, “explode” and store the denormalized group memberships for the user when the permission check is performed (if necessary). Group membership changes intelligently try to keep the denormalized group list up to date to avoid invalidation.
      When doing a permission check, fetch the exploded list of groups and select against the target’s members with the indirect list of groups.
      Assumption #1: Most permission checks would be for direct access to resources, so the exploded check would not happen often.
      Assumption #2: When needed, fetching the exploded group hierarchy would be quite fast: fetch one row from Cassandra.
      Assumption #3: Selecting matches in the direct target members would be quite fast: querying a finite number of columns from one row would be fast.
      Result: The baseline test peaked at about 300 requests per second. Scaling up to larger sizes would explode the resource costs unacceptably, so we tried to do better.
    • Case: Permissions (cont’d)
      What went wrong?
      For starters, DTrace analysis showed our app servers spending over 90% of their time serializing / deserializing Thrift bodies from Cassandra.
    • Case: Permissions (cont’d)
      How do we fix it?
      Maybe we’re querying exploded groups more than we assumed, and there is much more data in the exploded group memberships than we assumed.
      Attempt #2: Query the target resource’s direct members. If there are no groups, simply compare the direct members with the source user. If there are groups, explode the groups of the source user (if not already denormalized), and only query the group ids from the exploded memberships that are directly associated to the target resource. If there are no results, there is no access; if there are results, there is access.
      Assumption #1: Fetching the full “exploded” groups is expensive; avoid it.
      Assumption #2: Most permission checks are against resources that don’t have groups assigned as members, and there are generally far fewer groups assigned as members than groups a user will indirectly be a member of.
      Result: The same baseline test peaked at 850 requests per second. That’s reasonable!
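Attempt #2 can be sketched with in-memory `Map`s standing in for Cassandra rows. The data shapes, the `g:` group-id prefix and the function names are hypothetical; the point is the order of checks: direct membership first, then an immediate "no" when the resource has no group members, and only otherwise a lookup of the resource's few member groups among the user's exploded memberships.

```javascript
// In-memory stand-ins for Cassandra rows (illustrative shapes, not Hilary's schema).
const directMembers = new Map(); // resourceId -> Set of principal ids (users and groups)
const groupParents = new Map();  // principalId -> Set of groups it is directly a member of
const explodedCache = new Map(); // userId -> Set of all groups, denormalized on demand

// Assumption: group ids start with "g:".
const isGroup = (principalId) => principalId.startsWith('g:');

// "Explode" the user's memberships: walk up the group graph once and cache the result.
// (This sketch never invalidates the cache; the slides describe keeping it up to date.)
function explodeGroups(userId) {
  if (explodedCache.has(userId)) return explodedCache.get(userId);
  const seen = new Set();
  const stack = [...(groupParents.get(userId) || [])];
  while (stack.length > 0) {
    const groupId = stack.pop();
    if (seen.has(groupId)) continue; // tolerate cycles and diamonds
    seen.add(groupId);
    stack.push(...(groupParents.get(groupId) || []));
  }
  explodedCache.set(userId, seen);
  return seen;
}

function hasAccess(userId, resourceId) {
  const members = directMembers.get(resourceId) || new Set();
  if (members.has(userId)) return true; // direct association
  const memberGroups = [...members].filter(isGroup);
  if (memberGroups.length === 0) return false; // common case: no groups, no explosion needed
  // Only test the few groups attached to the resource. In Cassandra this corresponds to
  // asking for just those column names in the user's exploded-memberships row.
  const exploded = explodeGroups(userId);
  return memberGroups.some((groupId) => exploded.has(groupId));
}
```

Under one reading of the "Case: Permissions" example (Steve in GroupB, GroupB in GroupC, GroupC in GroupA, GroupC a member of ContentA), `hasAccess` grants Steve access to both ContentA and GroupA.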
    • Case: Activity
      Activity aggregation is the process of collecting multiple “similar” activities into a single aggregated activity. In order to accomplish this, lots of “recent activities” need to be pooled together and consulted for each routed activity.
      Attempt #1: Store the activity history and activity buckets in Cassandra.
      Result: With 3 Cassandra nodes, it breaks down at a routed-activity throughput of 135 per second during the data load. That is equivalent to roughly 1.3 activities per second in our tests. Not acceptable.
    • Case: Activity (cont’d)
      Problem #1: Cassandra latency reports in DataStax OpsCenter show the Cassandra nodes surpassing 5000 requests per second, latency climbing to over 10 seconds, and load on the Cassandra servers pushing 7.
      Solution #1: Move aggregation into Redis. The data is volatile and gets evicted over time anyway. This avoids cluster coordination gossip and disk I/O for a large part of the activity load.
      Problem #2: Initial tests showed the memory footprint easily skyrocketing past 8GB. Not acceptable.
      Solution #2: Normalize the aggregated entities in Redis rather than duplicating them for each activity aggregation entry.
      Result: With 3 Cassandra nodes and 3 activity servers, collection happens at ~1500 routed activities per second (translating to 15 activities per second). With 3 Cassandra nodes and 6 activity servers, collection was occurring at 2500 routed activities per second, which is as fast as the mass data load was creating them. The memory footprint in all tests remains below 4GB for the entire duration. Cassandra remains stable in all tests. Good to go!
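A sketch of Solution #2: each entity is stored once under its own key, and aggregates hold only entity ids, which are resolved when the aggregate is collected. A `Map` of JSON strings stands in for Redis string keys, and the `entity:<id>` / `agg:<key>` key names are illustrative assumptions.

```javascript
// A Map of JSON strings stands in for Redis string keys (illustrative key names).
const store = new Map();

function saveEntity(entity) {
  // Stored once per entity, however many aggregates reference it.
  store.set(`entity:${entity.id}`, JSON.stringify(entity));
}

function addToAggregate(aggregateKey, activity) {
  saveEntity(activity.actor);
  saveEntity(activity.object);
  const raw = store.get(`agg:${aggregateKey}`);
  const agg = raw ? JSON.parse(raw) : { actorIds: [], objectIds: [] };
  if (!agg.actorIds.includes(activity.actor.id)) agg.actorIds.push(activity.actor.id);
  if (!agg.objectIds.includes(activity.object.id)) agg.objectIds.push(activity.object.id);
  // The aggregate keeps only ids, not copies of the entities.
  store.set(`agg:${aggregateKey}`, JSON.stringify(agg));
}

// Resolve the ids back into full entities when the aggregate is collected.
function loadAggregate(aggregateKey) {
  const agg = JSON.parse(store.get(`agg:${aggregateKey}`));
  const load = (id) => JSON.parse(store.get(`entity:${id}`));
  return { actors: agg.actorIds.map(load), objects: agg.objectIds.map(load) };
}
```

With duplication, a large actor entity routed to many streams is copied into every aggregate entry; with normalization it is stored once, and each aggregate only grows by the length of an id.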
    • Topics1. Project Goals2. Hilary System Architecture3. Clustering4. Hilary Design and Extension Patterns5. Performance Testing6. Deployment and Automation7. UI Architecture8. Customization and Configuration9. Part 2: Hands on exampleSomeone poke the guy sleeping in the back.Wednesday, 12 June 13
    • Deployment andAutomation• As you can imagine, many machines to manage. Current inventory:• 3x Cassandra• 2x Redis• 2x RabbitMQ• 4x Application + Indexer• 3x Preview Processor• 1x Activity Processor• 1x Nginx• 3x Etherpad• Performance testing with a cluster of 21 virtual machines• Additional scalability testing and verification with ~30 virtual machinesWednesday, 12 June 13
    • Puppet• Use puppet to centralize machine configuration and prevent configuration drift• Collection of “Manifests” that define the state that the machine should in based onits hostname / role:• What files should exist? What should their contents be?• What packages should be installed?• What services should be running, or stopped?• http://github.com/sakaiproject/puppet-hilary• All 20+ machines in cluster have Puppet installed, which ask for “catalog” info(expected configuration state) from a single puppet master machine• Puppet Master knows how to determine the machine state from the manifestsbased on its host (e.g., db0 is a cassandra node, it should have cassandra, java, etc...)• Use puppetdb with “External Resources” to share machine-specific informationwith each other node in the clusterWednesday, 12 June 13
    • Hiera• Serves configuration information for puppet• JSON-based data-format, which can be inherited in a flexible manner• Keeps large complex configuration data as clean as possible{"classes": ["::oaeservice::hosts","::oaeservice::firewall::open","::oaeservice::rsyslog"],"nodetype": "%{nodetype}","nodesuffix": "%{nodesuffix}","web_domain": "oae-performance.oaeproject.org"Wednesday, 12 June 13
    • MCollective• Provides parallel execution over a number of machines atone time• Start / Stop / Check status of services• Install / Remove / Check version of packages• Use puppet resource syntax to check adhoc machinefacts• Apply puppet manifests• Each cluster node subscribes to an ActiveMQ server toreceive commands. Central machine (the “client”) publishesthe command and waits for replyWednesday, 12 June 13
    • Slapchop
      • Missing piece: we need to create 21 machines of different specs in a cloud service, and somehow get MCollective onto them
      • A tool we lovingly call slapchop:
        • Define a JSON manifest that holds machine configs and instances
        • Run slapchop to create the machines in the Joyent cloud, start them, and get MCollective installed
          • Well, kind of...
        • Now you can log in to the MCollective client and run mco puppetapply
          • Well, kind of...
      • Go from an empty cloud to a working 21-machine cluster in ~15 minutes
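The manifest idea above can be sketched as a function that expands named machine specs and instance counts into one definition per host, ready to hand to a cloud API. The manifest shape (`machines`, `instances`, `spec`, `count`, `memoryGb`, `image`) is hypothetical and is not slapchop's actual format.

```javascript
// Hypothetical manifest: named machine specs plus how many instances of each.
const manifest = {
  machines: {
    cassandra: { memoryGb: 8, image: "ubuntu-12.04" },
    app: { memoryGb: 4, image: "ubuntu-12.04" },
  },
  instances: [
    { name: "db", spec: "cassandra", count: 3 },
    { name: "app", spec: "app", count: 4 },
  ],
};

// Expand the manifest into one concrete machine definition per host,
// numbering hosts from 0 (db0, db1, ...) so hostnames also encode the role.
function expandManifest({ machines, instances }) {
  const hosts = [];
  for (const group of instances) {
    const spec = machines[group.spec];
    if (!spec) {
      throw new Error(`Unknown spec "${group.spec}" for ${group.name}`);
    }
    for (let i = 0; i < group.count; i++) {
      hosts.push({ hostname: `${group.name}${i}`, ...spec });
    }
  }
  return hosts;
}
```

Naming hosts by role and index fits the Puppet approach above, where the Puppet master derives a machine's role from its hostname.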
    • Nagios
      • Provides monitoring and alerts:
        • Cassandra health
        • Disk space
        • ElasticSearch JVM stats (memory usage, garbage collection)
        • Application server health
        • OS memory usage
        • Nginx health
        • RabbitMQ queue sizes
        • RabbitMQ health
        • Redis health
      • Nagios NRPE scripts deployed with Puppet
    • Nagios (cont’d)
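A check such as the disk space check listed above follows the standard Nagios plugin convention. The exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and a single status line is printed, optionally followed by performance data after a `|`. The function below is a minimal sketch of that logic. The 80% / 90% thresholds and the `used_pct` label are assumptions, not the project's actual NRPE scripts.

```javascript
// Standard Nagios plugin return codes.
const STATUS = { OK: 0, WARNING: 1, CRITICAL: 2, UNKNOWN: 3 };

// Evaluate disk usage against warning/critical thresholds (percent used) and
// return the exit code plus the one-line status text a plugin would print.
function checkDiskUsage(usedBytes, totalBytes, warnPct = 80, critPct = 90) {
  if (!(totalBytes > 0)) {
    return { code: STATUS.UNKNOWN, text: "DISK UNKNOWN - total size not available" };
  }
  const pct = Math.round((usedBytes / totalBytes) * 100);
  // Performance data: 'label'=value[UOM];warn;crit
  const perfdata = `used_pct=${pct}%;${warnPct};${critPct}`;
  let label = "OK";
  if (pct >= critPct) label = "CRITICAL";
  else if (pct >= warnPct) label = "WARNING";
  return { code: STATUS[label], text: `DISK ${label} - ${pct}% used | ${perfdata}` };
}
```

A standalone plugin script would print `result.text` and then call `process.exit(result.code)`, so that NRPE can relay the state back to the Nagios server.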
    • Munin
      • We are using Munin for time-series OS resource statistics
        • Disk, network, Redis, load, memory
      • Deployed automatically to all nodes through Puppet manifests
    • Munin (cont’d)
    • Security
      • Machines deployed in a private VLAN; private interfaces are isolated
      • Public interface firewall completely closed on all nodes
      • Single bastion node with public SSH enabled, key-only authentication
      • Single Nginx node with public Web ports enabled
      • TCP SYN cookies enabled on publicly exposed machines to prevent SYN floods
      • Rate limiting of 50 req/s per source IP applied to Web API requests
      • Several XSS tests performed, with all issues followed up
      • Using the OWASP jQuery plugin for XSS filtering of user-created data
      • All infrastructure security deployed automatically with Puppet
      • Penetration testing performed by University of Mercia
      • UI vulnerability testing performed by SCIRT group
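The per-IP rate limit above can be illustrated with a token bucket. Each source IP may burst up to a fixed number of requests and is refilled at 50 tokens per second. This is a sketch of the concept only. Nginx enforces such limits with its own `limit_req` module, which uses a leaky-bucket approach. The burst size of 50 and the injectable clock are assumptions made for the example.

```javascript
// Per-source-IP token bucket: each IP may burst up to `burst` requests and is
// refilled at `ratePerSec` tokens per second. `now` is injectable for testing.
class RateLimiter {
  constructor(ratePerSec = 50, burst = 50, now = () => Date.now()) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.now = now;
    this.buckets = new Map(); // ip -> { tokens, last }
  }

  // Returns true if the request may proceed, false if it should be rejected.
  allow(ip) {
    const t = this.now();
    const bucket = this.buckets.get(ip) || { tokens: this.burst, last: t };
    // Refill proportionally to the time elapsed since the last request.
    const elapsedSec = (t - bucket.last) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSec * this.ratePerSec);
    bucket.last = t;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(ip, bucket);
    return allowed;
  }
}
```

Keeping one bucket per source IP means a single abusive client exhausts only its own budget, while other users' requests are unaffected.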
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands on example
    • UI Architecture (diagram: Hilary, 3akai-ux, Mobile UI, 3rd party integrations)
    • Core UI Architecture
      • JS frameworks
      • CSS framework
      • 3rd party plugins
      • OAE UI API
      • OAE CSS Components
    • Core frameworks
      • RequireJS
      • jQuery
      • underscore.js
    • RequireJS
      • File and module loader
      • Needed to keep things modular
      • Optimisation built-in
    • RequireJS
      • Define modules
      • Load files and modules on the fly
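The module-definition style above can be illustrated with a minimal registry. Modules are declared with a name, a dependency list and a factory function, using the same argument order as RequireJS's named `define(name, deps, factory)`. Dependencies are resolved first and each module is instantiated only once. This is a hypothetical sketch, not RequireJS. It does no file loading, and `requireModule` is an illustrative name chosen to avoid clashing with Node's `require`.

```javascript
// Minimal AMD-style registry: modules declare dependencies and a factory; the
// loader resolves dependencies first and caches each module's value.
const registry = new Map(); // name -> { deps, factory }
const cache = new Map(); // name -> module value

function define(name, deps, factory) {
  registry.set(name, { deps, factory });
}

function requireModule(name, loading = new Set()) {
  if (cache.has(name)) return cache.get(name);
  const entry = registry.get(name);
  if (!entry) throw new Error(`Module not defined: ${name}`);
  if (loading.has(name)) throw new Error(`Circular dependency: ${name}`);
  loading.add(name);
  const resolved = entry.deps.map((dep) => requireModule(dep, loading));
  loading.delete(name);
  const value = entry.factory(...resolved);
  cache.set(name, value);
  return value;
}

// Example modules: "greeting" depends on "util".
define("util", [], () => ({ upper: (s) => s.toUpperCase() }));
define("greeting", ["util"], (util) => ({ hello: (who) => `Hello ${util.upper(who)}` }));
```

`requireModule("greeting").hello("oae")` returns `"Hello OAE"`. Because values are cached, every module that depends on `util` receives the same instance.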
    • jQuery
      • DOM manipulation
      • Cross-browser abstraction
      • Events
      • Pretty much everything
    • underscore.js
      • Utility toolbelt
      • Manipulate objects, arrays, etc.
    • CSS frameworks
      • Twitter Bootstrap
      • Font Awesome
    • Twitter Bootstrap
      • Re-usable, consistent CSS is hard
      • Most popular CSS framework
      • Documentation already there
      • Basic components, styles, etc.
      • Override where necessary
    • Twitter Bootstrap
    • Font Awesome
      • Icon font
      • No more images
      • Style with CSS
      • Skinning
      • Easy
    • Font Awesome
    • 3rd party plug-ins
      • jQuery plug-ins
      • Bootstrap plug-ins
    • 3rd party plug-ins
      • Autosuggest
      • History.js
      • Fileupload
      • Validate
      • Templates
      • etc.
    • OAE UI API
      • Wrapper for REST requests:
        • Users
        • Profile
        • Groups
        • Content
        • Discussions
        • Search
        • Config
    • OAE UI API
      • Utilities:
        • i18n
        • l10n
        • Widget loading
        • Template rendering
        • Notifications
        • XSS escaping
        • etc.
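The XSS escaping utility listed above comes down to one idea: user-created text is encoded before it is inserted into HTML, so any markup it contains is displayed rather than executed. The function below is a minimal, hypothetical sketch of HTML-context escaping. It is not the OAE UI API's own implementation.

```javascript
// Characters that are significant in HTML markup and quoted attribute values.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escape user-created text before inserting it into HTML, so that markup in the
// input is shown as text instead of being interpreted by the browser.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

This covers HTML element content and quoted attributes only. Values placed into URLs, inline scripts or CSS need context-specific encoding, which is why dedicated libraries such as the OWASP plugin mentioned earlier exist.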
    • OAE CSS Components
      • Re-usable HTML fragments
      • OAE-specific elements
      • Consistency
      • Design guidelines
    • Visibility icons
      • Indicate visibility of groups, content, discussions, etc.
    • Large options
    • Thumbnails
    • Clips
    • Tiles
    • List items
    • Toolbox
      • JS frameworks
      • CSS framework
      • 3rd party plugins
      • OAE UI API
      • OAE CSS Components
    • Toolbox
      • JS frameworks
      • CSS framework
      • 3rd party plugins
      • OAE UI API
      • OAE CSS Components
      • Widget SDK
    • Putting it together
    • Widgets
      • Modular components:
        • HTML fragment
        • JavaScript
        • CSS
        • Config file
      • Loaded into the DOM
    • Namespacing
      • Widgets share the same container
      • Avoid clashes by namespacing:
        • HTML IDs
        • CSS classes
        • jQuery selectors
    • Widget JS
      • Require the APIs the widget needs
      • Return a function to be executed as the widget
    • i18n
      • UI available in multiple languages
      • Standard .properties files
      • 2 types of bundles:
        • Core bundles
        • Widget bundles
    • i18n: translation priority
      1. Widget user language file
      2. Widget default language file
      3. Container user language file
      4. Container default language file
    • i18n
      • __MSG__TRANSLATION_KEY__
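The translation priority and the `__MSG__KEY__` placeholder syntax above fit together as follows. Each placeholder is looked up in the bundles in priority order, and the first bundle that defines the key wins. The sketch below is hypothetical: the bundle contents are made up, and leaving unknown placeholders untouched is a design choice of this example, not necessarily what OAE does.

```javascript
// Hypothetical bundles, listed in the priority order from the slide.
const exampleBundles = [
  { TITLE: "Bonjour" }, // 1. widget user language file (e.g. French)
  { TITLE: "Hello", SAVE: "Save" }, // 2. widget default language file
  { CANCEL: "Annuler" }, // 3. container user language file
  { CANCEL: "Cancel", HELP: "Help" }, // 4. container default language file
];

// Return the translation from the first bundle that defines the key.
function translate(key, bundles) {
  for (const bundle of bundles) {
    if (Object.prototype.hasOwnProperty.call(bundle, key)) return bundle[key];
  }
  return undefined;
}

// Replace every __MSG__KEY__ placeholder; unknown keys are left as-is so that
// missing translations remain visible during development.
function translateMarkup(html, bundles) {
  return html.replace(/__MSG__([A-Za-z0-9_]+?)__/g, (match, key) => {
    const text = translate(key, bundles);
    return text === undefined ? match : text;
  });
}
```

With these bundles, `translateMarkup("<h1>__MSG__TITLE__</h1>", exampleBundles)` produces `"<h1>Bonjour</h1>"`. A key missing from the user's language falls back to the default language file.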
    • i18n
      • English
      • French
      • German
      • Italian
      • Spanish
      • Russian (Partial)
      • Chinese (Partial)
    • l10n
      • API methods for localizing:
        • Timezones
        • Date formatting
        • Currency
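The localization concerns listed above (timezones, date formatting, currency) can be illustrated with the standard `Intl` API built into JavaScript. This shows the concept only. It is not the OAE l10n API, whose method names are not given here. The helper names `formatDate` and `formatCurrency` are hypothetical.

```javascript
// Format a timestamp for a locale and an IANA timezone using the built-in Intl API.
function formatDate(date, locale, timeZone) {
  return new Intl.DateTimeFormat(locale, {
    timeZone,
    year: "numeric",
    month: "long",
    day: "numeric",
    hour: "2-digit",
    minute: "2-digit",
  }).format(date);
}

// Format an amount of money using a locale's conventions for a given currency.
function formatCurrency(amount, locale, currency) {
  return new Intl.NumberFormat(locale, { style: "currency", currency }).format(amount);
}
```

The same instant can fall on different calendar days depending on the timezone. 16:00 UTC on 12 June 2013 is still 12 June in Los Angeles, but already 13 June in Tokyo. This is why dates are stored in one form and localized only at display time.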
    • UI templating
      • TrimPath
      • Avoids lots of DOM manipulation
      • Pass in JSON data
      • Supports if statements, for loops, etc.
    • UI templating
      • Templates are defined inside HTML comments (<!-- -->)
      • oae.api.util.template().render(...)
    • UI templating
      • Template
          <div id="example_template"><!--
              <h4>Welcome ${firstName}.</h4>
              You are ${profile.age} years old.
          --></div>
      • Input
          oae.api.util.template().render($("#example_template"), {
              "firstName": "John",
              "profile": {
                  "placeofbirth": "Los Angeles",
                  "age": 45
              }
          });
      • Result
          <h4>Welcome John.</h4>
          You are 45 years old.
    • UI templating
      • Template
          <div id="example_template"><!--
              {if score >= 5}
                  <h1>Congratulations, you have succeeded!</h1>
              {elseif score >= 0}
                  <h1>Sorry, you have failed</h1>
              {else}
                  <h1>You have cheated</h1>
              {/if}
          --></div>
      • Input
          oae.api.util.template().render($("#example_template"), {"score": 6});
      • Result
          <h1>Congratulations, you have succeeded!</h1>
    • UI templating
      • Template
          <div id="example_template"><!--
              {for conference in conferences}
                  <div>${conference.name} (${conference.year})</div>
              {forelse}
                  <div>No conferences have been organized</div>
              {/for}
          --></div>
      • Input
          oae.api.util.template().render($("#example_template"), {
              "conferences": [
                  {"name": "Sakai San Diego", "year": 2013},
                  {"name": "Sakai Atlanta", "year": 2012}
              ]
          });
      • Result
          <div>Sakai San Diego (2013)</div>
          <div>Sakai Atlanta (2012)</div>
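To make the mechanics above concrete, here is a deliberately tiny renderer. It pulls the template out of the HTML comment and replaces `${path}` expressions with values from the JSON data. It is a hypothetical sketch, not TrimPath. It supports only plain dotted lookups (no `{if}` or `{for}` blocks, no arbitrary expressions), and HTML-escaping the inserted values is a design choice of this example.

```javascript
// Templates live inside an HTML comment; take the text between <!-- and -->.
function extractTemplate(html) {
  const match = /<!--([\s\S]*?)-->/.exec(html);
  return match ? match[1] : "";
}

// Resolve a dotted path such as "profile.age" against a data object.
function resolvePath(data, path) {
  return path.split(".").reduce((obj, part) => (obj == null ? undefined : obj[part]), data);
}

// Escape values so data cannot inject markup into the rendered output.
function escapeHtml(value) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return value.replace(/[&<>"']/g, (ch) => map[ch]);
}

// Replace every ${path} with the corresponding value; missing values render as "".
function renderTemplate(template, data) {
  return template.replace(/\$\{\s*([\w.]+)\s*\}/g, (_, path) => {
    const value = resolvePath(data, path);
    return value == null ? "" : escapeHtml(String(value));
  });
}
```

Rendering the first example's template with `{ firstName: "John", profile: { age: 45 } }` gives `<h4>Welcome John.</h4>You are 45 years old.` Keeping templates inside comments means the browser never renders or executes the raw template markup.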
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands on example
      Or something else?
    • Customization and configuration
      • Administration UI
      • Global administration
      • Tenant administration
      • Manage production environment
    • Tenant management
      • Start, stop, edit tenants
      • Create new tenants
    • Tenant management
    • Tenant configuration
      • Configure the global tenant (whose settings individual tenants can override) or individual tenant configurations
      • Configure on the fly
      • Single Sign On integration
      • Default UI language
      • Default visibility settings
      • Data storage settings
      • etc.
    • Tenant configuration
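The global-versus-tenant relationship above can be sketched as a two-level lookup. A tenant's own value wins when it is set; otherwise the global value applies. Changes take effect on the next lookup without a restart. The config keys, tenant aliases and function names below are hypothetical and do not reflect Hilary's actual configuration schema or storage.

```javascript
// Global defaults that apply to every tenant (keys and values are hypothetical).
const globalConfig = {
  defaultLanguage: "en_GB",
  defaultVisibility: "public",
};

// Per-tenant overrides, keyed by a hypothetical tenant alias.
const tenantOverrides = {
  gt: { defaultVisibility: "loggedin" },
};

// A tenant's effective value is its own override if set, else the global value.
function getTenantConfig(tenantAlias, key) {
  const overrides = tenantOverrides[tenantAlias] || {};
  return Object.prototype.hasOwnProperty.call(overrides, key) ? overrides[key] : globalConfig[key];
}

// Update a tenant's value "on the fly": the next lookup sees it immediately.
function setTenantConfig(tenantAlias, key, value) {
  tenantOverrides[tenantAlias] = { ...(tenantOverrides[tenantAlias] || {}), [key]: value };
}
```

Storing only the overrides per tenant keeps new tenants cheap to create. They inherit every global setting until an administrator changes something for them.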
    • Tenant skinning
      • Skinning UI
      • Skin the entire application
      • Branding, colors, etc.
      • LESS
      • Component re-use
    • Tenant skinning
    • Extending with NPM
      • NPM: Node Package Manager
      • Dependency management, including fetching custom modules remotely from the NPM registry or GitHub
      • Stored inside the node_modules directory of your project
      • A module is usually a logical set of functionality (e.g., a back-end REST API, or a set of related widgets)
      • NPM modules in 3akai-ux are searched for custom widgets
      • NPM modules in Hilary whose names start with oae- are searched for an init.js file, which integrates them into the application container
      • New dependencies can be added to the package.json file
      • Changes to this file must be maintained with a patch, though :(
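The module discovery rule above (names starting with `oae-` that ship an `init.js`) can be sketched with Node's `fs` and `path` modules. This is a hypothetical illustration of the convention, not Hilary's actual loader. It does not handle scoped packages, configuration, or asynchronous initialisation.

```javascript
const fs = require("fs");
const path = require("path");

// Find modules in node_modules whose names start with "oae-" and that contain
// an init.js file; these are the candidates to wire into the container.
function findOaeModules(projectDir) {
  const modulesDir = path.join(projectDir, "node_modules");
  if (!fs.existsSync(modulesDir)) return [];
  return fs
    .readdirSync(modulesDir)
    .filter((name) => name.startsWith("oae-"))
    .filter((name) => fs.existsSync(path.join(modulesDir, name, "init.js")))
    .sort()
    .map((name) => ({ name, initFile: path.join(modulesDir, name, "init.js") }));
}

// Load each module's init.js. A real container would also pass configuration
// in and wait for any asynchronous initialisation to complete.
function loadOaeModules(projectDir) {
  return findOaeModules(projectDir).map(({ name, initFile }) => ({ name, init: require(initFile) }));
}
```

Relying on a naming prefix plus a well-known entry file lets new functionality be added by installing a package, without editing a central list of modules.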
    • UI Release Processes
      • Grunt
        • Task-based build system implemented in JavaScript
        • Similar in theory of operation to Make and Rake
        • Rich ecosystem of plug-ins to do most tasks
        • Easy to implement a new task when a plugin doesn't exist yet
        • Used for running test suites, production builds and linting tools
    • UI Release Processes
      • Production build
        • Optimizes the static assets to reduce bandwidth and request frequency, and to optimize caching across versions
        • Require.js optimization:
          • Concatenate JavaScript dependencies (reduces the number of web requests significantly)
          • Minify / uglify JavaScript files (reduces payload sizes significantly, even when gzip is enabled on the web server)
        • Hash optimization:
          • Hash the contents of static assets, append the result to the filename, then cache them indefinitely in the browser
          • When a file changes, the hash in its filename changes, forcing browsers to reload the updated asset
          • If a file does not change across versions, clients never reload it until their cache is cleared
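The hash optimization above can be sketched with Node's built-in `crypto` module. The filename is derived from a digest of the file's contents, so it changes exactly when the contents change. The choice of SHA-1, the 10-character prefix and the `name.<hash>.ext` layout are assumptions of this example, not necessarily what the OAE build produces.

```javascript
const crypto = require("crypto");
const path = require("path");

// Build a cache-busting filename, e.g. "js/main.js" -> "js/main.<hash>.js".
// The hash depends only on the contents, so unchanged files keep their name
// across releases (and stay cached), while changed files get a new URL.
function hashedFilename(filename, contents, length = 10) {
  const digest = crypto.createHash("sha1").update(contents).digest("hex").slice(0, length);
  const ext = path.extname(filename);
  const base = filename.slice(0, filename.length - ext.length);
  return `${base}.${digest}${ext}`;
}
```

Because a given URL always refers to the same bytes, the web server can send far-future cache headers for these files. A build step then rewrites references in the HTML to point at the hashed names.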
    • Developer Resources: Widget SDK
      • Contains help on creating widgets
      • Code best practices
      • Design style guide
      • UI and API documentation
      • Widget Builder
      • Examples
    • Developer Resources: Docs UI
      • UI with documentation automatically generated from the docs in the Hilary and 3akai-ux source code
      • Accessible from the /docs path of any tenant
    • Topics
      1. Project Goals
      2. Hilary System Architecture
      3. Clustering
      4. Hilary Design and Extension Patterns
      5. Performance Testing
      6. Deployment and Automation
      7. UI Architecture
      8. Customization and Configuration
      9. Part 2: Hands on example
      You do have Hilary installed, right?