Mark will discuss building a storage and processing platform that takes in data as students perform homework, quizzes and tests, and stores it in a MongoDB database. The output is a set of visualizations that allow students and teachers to track their progress in a class.
3. 3
• Global Company with over 5,000 employees
• Now a Learning Science Company
• All content available digitally by Fall 2015
• Higher Ed system is Connect
• K-12 LMS is Engrade
• Adaptive systems: LearnSmart, SmartBook and ALEKS
McGraw-Hill Education
4. 4
• Global and Marine Seismologist
• Small College Physics Professor
• Oracle Database Administrator
• Head of IT Operations at MIT Sloan School
of Management
• Head of MHE Digital Platform Group’s
Analytics team’s Data Science group
• Systems Engineer on this project
My Background
5. 5
Motivation
MHE has several digital educational platforms, including Connect for Higher Ed and Engrade for K-12
Instrument platforms to send student/educator events in real time to a central system, the Learning Analytics Platform (LAP)
Ingest and store education events in data store
(MongoDB)
Analytics provides “insights” to students/educators
Introduction to LAP
7. 7
Standardized education events (Caliper)
Utilizes JSON-LD (linked data) format
Caliper uses Actor - Verb - Object tuple to
form learning events (ex: student – submit –
test)
Triggered from student/educator activity and
sent to LAP input API
IMS Caliper Format for Education
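As a rough illustration of the Actor - Verb - Object idea, the sketch below builds a simplified event in Python and serializes it to JSON. The helper `make_event`, the `example.edu` identifiers, the placeholder `@context` value and the field layout are assumptions made for this example. They are not taken from the Caliper specification or from the LAP.

```python
import json
from datetime import datetime, timezone


def make_event(actor_id, action, object_id, event_time=None):
    """Build a simplified actor-verb-object learning event (hypothetical layout)."""
    event_time = event_time or datetime.now(timezone.utc)
    return {
        # JSON-LD documents carry an @context; this URL is only a placeholder.
        "@context": "https://example.edu/hypothetical-context",
        "actor": {"id": actor_id, "type": "Person"},
        "action": action,
        "object": {"id": object_id, "type": "Assessment"},
        "eventTime": event_time.isoformat(),
    }


# Example from the slide: student - submit - test.
event = make_event(
    "https://example.edu/students/42",
    "Submitted",
    "https://example.edu/assessments/quiz-1",
    datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```

A real Caliper event carries more fields than this; the point here is only the three-part actor, action, object shape.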
14. 14
• JSON-LD input suggested a document store
• MongoDB accessible and well documented
• Provided needed performance and capacity
• Support from MongoDB Inc. (formerly 10gen)
• Six Month Development Support contract
• Dedicated consultants
• Ongoing support contract
Why MongoDB?
15. 15
Standardized education events (Caliper)
Caliper (JSON-LD) produced by triggers in
the Connect Oracle database
Triggered from student/educator activity and
sent to LAP input API
LAP then verifies input, transforms into
MongoDB schema, calculates aggregates,
and sends to data visualizations
Data Flow Through the LAP
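The verify, transform and aggregate stages can be pictured with a small in-memory sketch. Everything named here (`REQUIRED_FIELDS`, `verify`, `transform`, `aggregate`, the flat record layout and the per-assessment mean) is a hypothetical stand-in chosen for illustration, not the LAP's actual code or schema.

```python
REQUIRED_FIELDS = ("actor", "action", "object", "eventTime")  # assumed, not from the talk


def verify(event):
    """Reject events that lack any of the assumed required fields."""
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return event


def transform(event):
    """Flatten the event into a hypothetical storage-friendly record."""
    return {
        "student_id": event["actor"],
        "assessment_id": event["object"],
        "action": event["action"],
        "time": event["eventTime"],
        "score": event.get("score"),  # only graded events carry a score
    }


def aggregate(records):
    """Compute the mean score per assessment over records that have a score."""
    totals = {}
    for r in records:
        if r["score"] is None:
            continue
        s, n = totals.get(r["assessment_id"], (0.0, 0))
        totals[r["assessment_id"]] = (s + r["score"], n + 1)
    return {a: s / n for a, (s, n) in totals.items()}


raw_events = [
    {"actor": "s1", "action": "Graded", "object": "quiz-1",
     "eventTime": "2015-06-01T12:00:00Z", "score": 0.8},
    {"actor": "s2", "action": "Graded", "object": "quiz-1",
     "eventTime": "2015-06-01T12:05:00Z", "score": 0.6},
    {"actor": "s1", "action": "Started", "object": "quiz-2",
     "eventTime": "2015-06-02T09:00:00Z"},
]
records = [transform(verify(e)) for e in raw_events]
class_means = aggregate(records)
print(class_means)
```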
16. 16
Data Flow Through the LAP
Standardized education events (Caliper examples)
1. Assessment Created
2. Assessment Attempt Started
3. Assessment Attempt Submitted
4. Assessment Attempt Graded
An assessment is an on-line homework assignment, quiz or test
associated with a McGraw-Hill digital textbook.
21. 21
Constraints on developing schema
Several learning activities require multiple
Caliper events
• Example: student starts, submits, and is
graded to complete a quiz
No guarantee that external applications will
send events in chronological order
May receive duplicate events
Data Flow Through the LAP
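These constraints can be made concrete with a minimal sketch that ignores duplicate events and decides completeness from the set of actions received, so arrival order does not matter. The class `AttemptTracker`, the event fields `id`, `attempt_id` and `action`, and the action names are assumptions for this example only.

```python
REQUIRED_ACTIONS = {"started", "submitted", "graded"}  # mirrors the quiz example; names assumed


class AttemptTracker:
    """Collects events per attempt; tolerant of duplicates and arbitrary order."""

    def __init__(self):
        self._seen_ids = set()
        self._actions = {}  # attempt_id -> set of actions received so far

    def add(self, event):
        """Record an event; return True once the attempt has all required actions."""
        if event["id"] not in self._seen_ids:  # a repeated event id is ignored
            self._seen_ids.add(event["id"])
            self._actions.setdefault(event["attempt_id"], set()).add(event["action"])
        return self._actions.get(event["attempt_id"], set()) >= REQUIRED_ACTIONS


tracker = AttemptTracker()
incoming = [
    {"id": "e3", "attempt_id": "t1", "action": "graded"},     # arrives first, out of order
    {"id": "e1", "attempt_id": "t1", "action": "started"},
    {"id": "e1", "attempt_id": "t1", "action": "started"},    # duplicate
    {"id": "e2", "attempt_id": "t1", "action": "submitted"},
]
results = [tracker.add(e) for e in incoming]
print(results)  # [False, False, False, True]
```

Because completeness is a set comparison, the same final state is reached for any ordering of the four events.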
22. 22
MongoDB Schema – Version 0.1
V0.1
Two-collection schema model (student and class)
Class Collection describes the
class, section and assignments
Student Collection
• Assessment array updated when
attempt is complete
• All events for an activity
• Attempts for each activity in a sub-array
23. 23
V0.1
Problems
• Too embedded
• Difficult to update a student doc
• Query-logic-update
MongoDB Schema
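The query-logic-update pattern can be sketched in plain Python, with a dict standing in for the student collection. The document layout below is a hypothetical V0.1-style shape (attempts nested inside an assessments array), not the actual schema. The example shows one way the pattern can go wrong: two writers that read the same snapshot overwrite each other.

```python
import copy

# Hypothetical V0.1-style student document: attempts nested inside an assessments array.
store = {
    "class1:student1": {
        "assessments": [
            {"assessment_id": "quiz-1",
             "attempts": [{"attempt_id": "t1", "status": "started"}]},
        ]
    }
}


def read(doc_id):
    """Step 1 (query): fetch a private copy of the whole document."""
    return copy.deepcopy(store[doc_id])


def find_attempt(doc, assessment_id, attempt_id):
    """Step 2 (logic): walk the nested arrays in application code."""
    for assessment in doc["assessments"]:
        if assessment["assessment_id"] == assessment_id:
            for attempt in assessment["attempts"]:
                if attempt["attempt_id"] == attempt_id:
                    return attempt
    raise KeyError((assessment_id, attempt_id))


def write(doc_id, doc):
    """Step 3 (update): replace the whole document."""
    store[doc_id] = doc


# Two writers read the same snapshot before either writes back.
doc_a = read("class1:student1")
doc_b = read("class1:student1")
find_attempt(doc_a, "quiz-1", "t1")["status"] = "submitted"
find_attempt(doc_b, "quiz-1", "t1")["score"] = 0.9
write("class1:student1", doc_a)
write("class1:student1", doc_b)  # overwrites writer A's change

final = find_attempt(store["class1:student1"], "quiz-1", "t1")
print(final)
```

After both writes, the stored attempt has the score from writer B but still shows the old status, because writer A's change was lost.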
25. V0.2
Problems
• Still have query-logic-update
• Difficult to do atomically and maintain
deterministic state
MongoDB Schema Version 0.2
26. 26
MongoDB Schema Version 0.3
• Remove arrays altogether
• Replace arrays with assessment and attempt docs,
each of which contains several sub-docs
27. V0.3
Atomic updates now much easier
Save raw Caliper event in event collection
Only update student collection if all required events
are in event collection
MongoDB Schema Version 0.3
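One possible reading of "sub-docs instead of arrays" is that each assessment and attempt is addressed by its ID as a key, so a single `$set` with dotted paths can write it without first searching an array. The sketch below builds such filter and update documents. The field names `assessments` and `attempts`, the helper `build_attempt_update`, and the local `apply_set` stand-in are assumptions for illustration, not the production schema.

```python
def build_attempt_update(class_id, student_id, assessment_id, attempt_id, fields):
    """Return (filter, update) documents that set fields on one attempt sub-doc.

    Assumes IDs contain no '.' characters, since dots separate path segments.
    """
    prefix = f"assessments.{assessment_id}.attempts.{attempt_id}"
    update = {"$set": {f"{prefix}.{key}": value for key, value in fields.items()}}
    return {"_id": f"{class_id}:{student_id}"}, update


def apply_set(doc, update):
    """Tiny local stand-in for $set with dotted paths (illustration only)."""
    for path, value in update["$set"].items():
        *parents, leaf = path.split(".")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return doc


flt, upd = build_attempt_update(
    "class1", "student1", "quiz-1", "t1", {"status": "graded", "score": 0.9}
)
doc = {"_id": flt["_id"]}
apply_set(doc, upd)
apply_set(doc, upd)  # replaying the same update leaves the document unchanged
print(doc)
```

With PyMongo, the two documents could be passed to `collection.update_one(flt, upd, upsert=True)`, which applies the change in a single server-side operation. Replaying a duplicate event then rewrites the same values rather than appending a second array entry.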
28. 28
Query Utilization
• 3 basic queries to build visualization for CIS
• All student docs for current class
• All student docs for current student
• Class doc for current class
• All queries are on indexed parameters
• Student doc _id = class_id:student_id
• Class doc _id = class_id
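The three lookups can be sketched as filter documents built from the `_id` conventions above. The anchored `_id` prefix match for "all students in a class" and the separately indexed `student_id` field are assumptions of this sketch; the talk only states that all queries hit indexed parameters.

```python
import re


def student_doc_id(class_id, student_id):
    """Compose the student document _id as class_id:student_id."""
    return f"{class_id}:{student_id}"


def class_doc_filter(class_id):
    """Filter for the single class document (_id = class_id)."""
    return {"_id": class_id}


def students_in_class_filter(class_id):
    """Filter for all student docs in a class via an anchored _id prefix match."""
    return {"_id": {"$regex": "^" + re.escape(class_id + ":")}}


def docs_for_student_filter(student_id):
    """Filter for one student's docs, assuming an indexed student_id field exists."""
    return {"student_id": student_id}


print(student_doc_id("class1", "student1"))
print(class_doc_filter("class1"))
print(students_in_class_filter("class1"))
print(docs_for_student_filter("student1"))
```

With PyMongo these dicts would be passed to `find` or `find_one` on the relevant collection. The trailing colon in the prefix keeps `class1` from also matching `class10`.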
29. 29
Infrastructure
• All servers and storage are in AWS
• Backups done using EBS snapshots
• DB size estimated to grow about 500 [GB/year]
• Data size estimate small enough for un-sharded
cluster
• 3-member replica sets
• Write to primary, read from primary and secondaries
30. 30
Performance
• Estimated peak load 100 [events/sec] = 100 [kB/sec]
• Average load of 1,500,000 events/day
• Max of 2,500,000 events per day
• Initially planned on a sharded, replicated cluster, but this is not needed for now
• Added SQS Queue to handle periods of very high
load
• Upgraded from MongoDB 2.6 to 3.0 (~10x faster)
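A quick back-of-the-envelope check of these load figures, written as a short Python calculation. It uses only the numbers on the slide, plus two assumptions: decimal units (1 GB = 1e6 kB) and a 365-day year. Index overhead and storage compression are ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60

peak_events_per_sec = 100
peak_kb_per_sec = 100
avg_events_per_day = 1_500_000
max_events_per_day = 2_500_000

# Implied event size from the peak figures.
kb_per_event = peak_kb_per_sec / peak_events_per_sec

# Daily totals expressed as sustained per-second rates.
avg_events_per_sec = avg_events_per_day / SECONDS_PER_DAY
max_day_events_per_sec = max_events_per_day / SECONDS_PER_DAY

# How far the estimated peak sits above the average rate.
peak_to_avg = peak_events_per_sec / avg_events_per_sec

# Raw event volume per year at the average rate (1 GB = 1e6 kB assumed).
gb_per_year = avg_events_per_day * kb_per_event * 365 / 1e6

print(f"{kb_per_event:.1f} kB/event")
print(f"{avg_events_per_sec:.1f} events/sec on an average day")
print(f"{max_day_events_per_sec:.1f} events/sec on the busiest day")
print(f"peak is {peak_to_avg:.1f}x the average rate")
print(f"{gb_per_year:.1f} GB/year of raw events")
```

The result, roughly 550 GB of raw events per year, is in line with the growth estimate of about 500 [GB/year]. The peak rate is several times the daily average, which fits the decision to buffer bursts in a queue.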
31. 31
Conclusions
• We have a learning analytics platform in
production utilizing a MongoDB data store
• After several iterations we developed a
MongoDB schema which:
• Handles data coming in arbitrary order
with duplicates
• Performs one-step, atomic inserts
• Has high performance during peak loads