
Leveraging Flat Files from the Canvas LMS Data Portal at K-State



A lot of data are created in an LMS instance, and much of this can be analyzed for insight. In 2016, Instructure, the makers of Canvas, made their LMS data available to their customers through a data portal (updated monthly). This portal enables access to a number of flat files related to that particular instance. This presentation showcases how this big data was analyzed on a regular laptop with basic office software, to summarize Kansas State University’s use of the LMS. Methods for analysis include the following: basic descriptive statistics, survival analysis, computational linguistic analysis, and others.
The results are reported out with both numbers and data visualizations, including classic pie charts, line graphs, bar charts, mixed-charts, word clouds, and others. The findings provide some insights about how to approach the data, how to use a data dictionary, and other methods for extracting the data for awareness and practical decision-making. This work also is suggestive of next steps for more advanced analysis (using the flat files in a SQL database).
More information about this may be accessed at

Published in: Data & Analytics


  1. 1. Leveraging “Flat Files” from the Canvas LMS Data Portal (at K-State) SIDLIT 2017 | LMS Preconference Colleague 2 Colleague August 2, 2017
  2. 2. Presentation  A lot of data are created in an LMS instance, and much of this can be analyzed for insight. In 2016, Instructure, the makers of Canvas, made their LMS data available to their customers through a data portal (updated monthly). This portal enables access to a number of flat files related to that particular instance. This presentation showcases how this big data was analyzed on a regular laptop with basic office software, to summarize Kansas State University’s use of the LMS. Methods for analysis include the following: basic descriptive statistics, survival analysis, computational linguistic analysis, and others. 2
  3. 3. Presentation (cont.)  The results are reported out with both numbers and data visualizations, including classic pie charts, line graphs, bar charts, mixed-charts, word clouds, and others. The findings provide some insights about how to approach the data, how to use a data dictionary, and other methods for extracting the data for awareness and practical decision-making. This work also is suggestive of next steps for more advanced analysis (using the flat files in a SQL database).  More information about this experience may be accessed on SlideShare through an article download titled “Wrangling Big Data in a Small Tech Ecosystem” at in-a-small-tech-ecosystem (orig. from Oct. 2016). The original article “Wrangling Big Data in a Small Tech Ecosystem” is from C2C Digital Magazine. 3
  4. 4. Presentation Order  Canvas LMS at Kansas State University (K-State)  Canvas LMS Data Portal and Flat Files  The Summary Data  Some Practical Applications  Moving Forward with the Data 4
  5. 5. General Approach Framework  Approaches  An instructional design approach  What can enhance teaching and learning?  A researcher approach  What can enhance accurate data collection, usage, researcher awareness, and decision-making?  Using all data (every part!)  Using all basic software tools available on a regular machine Data Clients on a Campus  Faculty  Staff  System Administrators  Leaders  Students  Analysts 5
  6. 6. Canvas LMS at Kansas State University (K-State) 6
  7. 7. LMS History at K-State  Homegrown Learning Management System (LMS) (Axio Learning)  Informed by faculty, admin, and staff needs (IT Help Desk tickets, focus groups with faculty and staff)  Software updates rolled out annually with some patches in-between  Built mostly by K-State graduates and professional developers (often hired from student ranks)  Instructure’s Canvas LMS at K-State (2013 – present)  Availability of the data portal in 2016  Monthly updates of select data from the particular instance  Accessed at K-State in October 2016 7
  8. 8. An Early Brainstorm  Brainstorm beneficial questions (data queries) before exploring the data, so you’re not limited by the found data, and keep these in mind even after the initial data exploration. It is important to conceptualize what may be practically helpful through the informed imagination first.  It would be helpful to continue with the brainstorming as the data are explored. 8
  9. 9. Initial Brainstormed Questions  What can be reported out at various levels: university, college, department, course, and individual?  Is it possible to make observations about course design? Learner engagement (Discussions? Conversations?)? Advising? Technology usage (such as external tools)? Uses of the LMS site for non-course applications?  What sorts of manual-created courses exist, and how are these used? What percentage of the courses are these manual types of courses? 9
  10. 10. Initial Brainstormed Questions (cont.)  How closely is it possible to map the data of a learner’s trajectory? A group’s trajectory?  What are some attributes to use to identify various groups? Which attributes would be helpful? What sorts of group-specific questions may be asked?  For example, is it possible to identify high-performing groups vs. low-performing groups in order to run analytics to see what differences there may be between the two?  What may be understood about the learning going on in a particular course? A learning sequence?  Are there ways to understand effective support for learners and support for learning from this data? 10
  11. 11. Required Preliminary Understandings  Need to understand the front-end view of the LMS and its general uses on campus; otherwise, the back-end data view will be looking through a mirror darkly  Need to understand what terms are applied to the various types of data (because you want to be on the same page with the creators and users of the LMS)  Need to have experiences with the various analytical technologies applied to the particular data because various queries require different data processing and data structures  Will be applying the following: descriptive statistics, inferential statistics, direct data queries, linguistic analysis, survival analysis, sentiment analysis, topic modeling, and others  Will ultimately be applying more complex machine learning as well 11
  12. 12. Required Preliminary Understandings (cont.)  Need understandings of “states” of being for various objects in an LMS  Need ability to identify anomalies and the skills to interpret what these might mean  Need to know what data mean and where to dig deeper for more relevant information  Need to know where noise might enter a particular dataset or an analytical process…and to head off the introduction of or inclusion of noise 12
  13. 13. Canvas LMS Data Portal and “Flat Files” 13
  14. 14. Canvas Data Portal  Data updated once a month at the time (now daily)  Live dynamic data may be accessed via a higher level of service  Flat files exported from SQL servers (in compressed .gz format, to be extracted with 7-Zip after download)  Also known as table data (albeit without defined structural relationships between records and therefore “flat”)  May contain labeled data like numbers  May contain unstructured or semi-structured data like texts, names, messages, and others  Contain content data (messaging), trace data (interaction data), and some metadata (data about data, often riding on imagery and multimedia)  Data described in a formal data dictionary 14
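As a hedged illustration of working with the flat files described above, the following Python sketch reads one gzip-compressed table row by row using only the standard library. It assumes the files are tab-delimited, carry no header row, and mark empty values with `\N`; the function `read_flat_file`, the demo file, and its three columns are hypothetical stand-ins rather than the portal's documented layout, so the data dictionary should be checked for the real column order.

```python
import csv
import gzip
import os
import tempfile

NULL_MARKER = "\\N"  # assumption: how empty values appear in the export


def read_flat_file(path, column_names):
    """Yield one dict per row from a gzip-compressed, tab-delimited flat file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Map the null marker to None so empty values are explicit.
            values = [None if cell == NULL_MARKER else cell for cell in row]
            yield dict(zip(column_names, values))


# Build a tiny stand-in file so the sketch runs without real portal data.
demo_path = os.path.join(tempfile.mkdtemp(), "course_dim_demo.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8", newline="") as out:
    out.write("101\tIntro to Data\tavailable\n")
    out.write("102\t\\N\tdeleted\n")

demo_rows = list(read_flat_file(demo_path, ["id", "name", "workflow_state"]))
print(demo_rows)
```

Streaming rows this way keeps memory use low on a regular laptop, since the whole table never has to be loaded at once.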
  15. 15. “Flat Files” Strengths and Weaknesses Strengths  Manageable on a small-scale laptop  Can ask questions across several flat files Weaknesses  Lack relational data between the various flat files  Cannot query data effectively across the various data tables (because the relationships are not defined)  Lack access to identifier column  Lack access to the foreign key 15
  16. 16. Data Dictionary  A reference resource that describes particular data  Documentation of data captured in the Canvas Data warehouse  Helpful for understanding naming protocols of the various data types  The following is a verbatim example: Name: assignment_id | Type: bigint (big integer) | Description: Foreign key to the assignment the override is associated with. May be empty. 16
  17. 17. Data Dictionary 1.15.0 Facts  assignment_fact, assignment_group_fact, assignment_override_fact, assignment_override_user_fact, assignment_override_user_rollup_fact, communication_channel_fact, conversation_message_participant_fact, course_ui_navigation_item_fact, discussion_entry_fact, discussion_topic_fact, enrollment_fact, external_tool_activation_fact,  file_fact, grading_period_fact, group_fact, group_membership_fact, module_completion_requirement_fact, module_fact, module_item_fact, module_prerequisite_fact, module_progression_completion_requirement_fact, module_progression_fact, pseudonym_fact, quiz_fact, 17
  18. 18. Data Dictionary 1.15.0 Facts  quiz_question_answer_fact, quiz_question_fact, quiz_question_group_fact, quiz_submission_fact, quiz_submission_historical_fact, score_fact, submission_comment_fact, submission_comment_participant_fact, submission_fact, wiki_fact, wiki_page_fact 18
  19. 19. Data Dictionary 1.15.0 Dimension  account_dim, assignment_dim, assignment_group_dim, assignment_group_rule_dim, assignment_override_dim, assignment_override_user_dim, assignment_rule_dim, communication_channel_dim, conversation_dim, conversation_message_dim, course_dim, course_section_dim, course_ui_canvas_navigation_dim, course_ui_navigation_item_dim,  discussion_entry_dim, discussion_topic_dim, enrollment_dim, enrollment_rollup_dim, enrollment_term_dim, external_tool_activation_dim, file_dim, grading_period_dim, grading_period_group_dim, group_dim, group_membership_dim, module_completion_requirement_dim, module_dim, 19
  20. 20. Data Dictionary 1.15.0 Dimension  module_item_dim, module_prerequisite_dim, module_progression_completion_requirement_dim, module_progression_dim, pseudonym_dim, quiz_dim, quiz_question_answer_dim, quiz_question_dim, quiz_question_group_dim, quiz_submission_dim, quiz_submission_historical_dim,  role_dim, score_dim, submission_comment_dim, submission_comment_participant_dim, submission_dim, user_dim, wiki_dim, wiki_page_dim 20
  21. 21. Data Dictionary 1.15.0 Both  requests 21
  22. 22. The Summary Data at instance level 22
  23. 23. Order: First Data Visualizations and Then Light Text Commentary  The data visualizations come first…so that the audience may analyze the data to see what it says  The summary analyses come directly after the visualization, so there is a kind of debriefing 23
  24. 24. 24
  25. 25. Purposeful Blur and Block  Need to know how to protect against data leakage  Never share the underlying dataset  Never share unique identifiers  Always double check screen grabs against accidental inclusion of personally identifiable data (PII); use effective redaction if PII is viewable  When redacting, make sure that the redaction cannot be reversed (backwards iterated or some other strategy) and a person re-identified  Check that no metadata is riding with multimedia being released  Any personally identifiable information (PII) is obfuscated here  No granular level of data was captured in the article 25
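One possible way to act on the "never share unique identifiers" rule is to replace each identifier with a keyed hash before any table leaves the analyst's machine. The sketch below is an assumption-laden illustration, not the method used in this presentation: `pseudonymize` and `demo_key` are hypothetical names, and keyed hashing only reduces re-identification risk for as long as the key stays private.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Return a keyed SHA-256 digest that stands in for a unique identifier."""
    digest = hmac.new(secret_key, str(identifier).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability in reports


# Hypothetical key; in practice, keep it out of any shared materials.
demo_key = b"replace-with-a-private-random-key"
token_a = pseudonymize(1234567, demo_key)
token_b = pseudonymize(1234567, demo_key)
token_c = pseudonymize(7654321, demo_key)
print(token_a, token_c)
```

The same identifier always maps to the same token, so counts and joins still work on the pseudonymized column, while different identifiers get different tokens.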
  26. 26. 26
  27. 27. Workflow 1. Conceptualizing questions and applications of the data 2. Review of the dataset information 3. Data download 4. Data extraction 5. Data processing (cleaning) and analytics 6. Validating / invalidating the findings 7. Additional data analytics 8. Write-up for presentation 9. Data and informational materials archival 27
  28. 28. 1. About Courses 28
  29. 29. 29
  30. 30. Course Visibility  A majority of courses are not visible 30
  31. 31. 31
  32. 32. Course Workflow States  Claimed  Available  Deleted  Completed  Created 32
  33. 33. 2. About Course Sections 33
  34. 34. 34
  35. 35. Life Cycle State for Course Section  A majority active  A minority deleted 35
  36. 36. 36
  37. 37. Date Restriction Accesses for Course Sections  Non-defined (default) as the majority  Restricted section access (by learner name) to defined dates  Non-restricted (all participants in the course welcome) section access to defined dates 37
  38. 38. 38
  39. 39. Ability to Self-Enroll in a Section or Not  Undefined (default)  Can manually self-enroll  Must be assigned to a section 39
  40. 40. 3a. About Assignments 40
  41. 41. 41
  42. 42. Types of Assignments  None  Online_quiz  Online_upload  Assignment  On_paper, and others 42
  43. 43. 43
  44. 44. Time Features for Assignments  Half of assignments with no time allotment  Other half with time features  Due_at, no unlock_at, no lock_at  Due_at, lock_at, unlock_at (all three) 44
  45. 45. 45
  46. 46. Main Themes Auto-Identified in Assignment Names  Assignment  Discussion  Work  Quiz  Participation  Final  Class  Presentation  Exam  Attendance  Homework  Chapter  Questions  Activity 46
  47. 47. 47
  48. 48. Some Linguistic Features of the Assignment Titles and Descriptions  Analytic: 91.69  “Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-now, and narrative thinking”  Clout: 73.25  “perspective of high expertise” and confidence vs. “more tentative, humble, even anxious style”  Authentic: 11.83  “more honest, personal, and disclosing text” vs. “a more guarded, distanced form of discourse”  Tone: 64.98  “a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22) 48
  49. 49. 49
  50. 50. Delving into Topics of Interest  Identifying words (names, formulas, dates, symbols, etc.)-of-interest  Using NVivo 11 Plus to create word trees with the target term as the seeding topic  Ability to double-click on the respective branches to link back to the original source data files 50
  51. 51. 51
  52. 52. Unmuted or Muted Assignments  A majority unmuted assignments  A minority muted assignments 52
  53. 53. 53
  54. 54. Assignment Workflow States  A majority published (77%)  A smaller amount deleted (15%)  The smallest amount unpublished (8%) 54
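The percentages above are simple shares of a single column. Here is a minimal sketch of that calculation, assuming the `workflow_state` values have already been read into a list. The sample data is made up to mirror the reported 77/15/8 split, and `state_shares` is a hypothetical helper.

```python
from collections import Counter


def state_shares(states):
    """Return {state: percentage} rounded to whole percent, largest first."""
    counts = Counter(states)
    total = sum(counts.values())
    return {
        state: round(100 * count / total)
        for state, count in counts.most_common()
    }


# Made-up sample sized so the shares match the percentages on the slide.
demo_states = ["published"] * 77 + ["deleted"] * 15 + ["unpublished"] * 8
shares = state_shares(demo_states)
print(shares)
```

The same counting pattern applies to the other categorical columns summarized in this deck (course, quiz, and discussion workflow states, for example).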
  55. 55. 55
  56. 56. Survival Function of Assignments to Update  How long does it take before an assignment is updated?  At what point does an assignment seem to be “safe” against update?  What are some ways to understand assignments that are updated some 1,000 days after the date of creation?  Is it possible that some assignments were transferred over from a prior LMS through an LTI-enabled process that might have captured the very first moment of creation for that assignment? (“LTI” refers to the Learning Tools Interoperability standard created by the IMS Global Learning Consortium.) 56
  57. 57. 3b. About Submitted Assignments 57
  58. 58. 58
  59. 59. Grades Submittal Counts for Completed Assignments  A slice-in-time view  Roughly two-thirds graded  Roughly one-third not graded 59
  60. 60. 4. About Quizzes 60
  61. 61. 61
  62. 62. A Survey of Quiz Types  Assignment  Practice quiz  Graded survey  Survey  Affordances of the various quiz types change over time, so it is important to update on the various functions and capabilities even as one is looking at the data. 62
  63. 63. 63
  64. 64. Quiz Question Types in the LMS Instance  multiple_choice_question  true_false_question  essay_question  multiple_answers_question  short_answer_question, and others (in descending order) 64
  65. 65. 65
  66. 66. Quiz Question Workflow States  unpublished (default)  published  deleted  So a majority of quiz questions are created / drafted but held in reserve and not published.  What are some possible inferences that can be made from the instance-scale statistics and numbers? 66
  67. 67. 67
  68. 68. An Inclusive Scatterplot of Quiz Point Values  min-max range: 0 – 23,700 points per quiz  average quiz value: 33 points (without zeroes averaged in) and 28 points (with zeroes averaged in)  The 23,700 value occurred twice, which suggests that it might be purposeful. That huge number, though, pulls the curve and produces skew; in a normal research context, such an outlier would likely be omitted to remove that pull. A zoom-in would require going to the particular instructor and course. That might require a different approach to the data than described in this work, such as re-animating all the flat files in a SQL database and using unique identifiers to connect related data. 68
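The gap between the two averages, and the pull of a very large outlier, can be reproduced on a toy scale. The sketch below uses made-up point values; `summarize_points` and its `outlier_cutoff` parameter are hypothetical helpers rather than part of the original analysis.

```python
from statistics import mean


def summarize_points(points, outlier_cutoff=None):
    """Summarize quiz point values with and without zero-point quizzes."""
    if outlier_cutoff is not None:
        # Drop values above the cutoff before summarizing.
        points = [p for p in points if p <= outlier_cutoff]
    nonzero = [p for p in points if p != 0]
    return {
        "min": min(points),
        "max": max(points),
        "mean_with_zeroes": mean(points),
        "mean_without_zeroes": mean(nonzero),
    }


demo_points = [0, 0, 10, 20, 30, 23700]  # made-up values
full = summarize_points(demo_points)
trimmed = summarize_points(demo_points, outlier_cutoff=1000)
print(full)
print(trimmed)
```

Reporting both the full and the trimmed summaries side by side makes the effect of the outlier visible without silently discarding it.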
  69. 69. 69
  70. 70. Histogram of Quiz Point Values in LMS Instance (with a normal curve)  Frequency of point values for quizzes  Tendencies  Most at the lower number values 70
  71. 71. 71
  72. 72. Survival Curve of Deleted Quizzes in LMS Instance  Based on timestamp data, how long does it take for a deleted quiz to achieve “event” or be deleted (from its moment of creation)?  In this dataset, 22% of quizzes were deleted (14,769/66,366).  The time to deletion ranged from 0 to 813 days.  A survival analysis showed that the estimated mean survival time of the deleted quizzes was 23.6 days, with a lower bound of 22.7 and an upper bound of 24.4 in the 95% confidence interval; the standard error was .419.  The median survival time of the deleted quizzes was a low 2 days, which means that if a quiz is to be deleted, it usually happens fairly early.  The drop-off in the curve is steep but tapers off after several months. 72
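When every quiz in the set was eventually deleted (no censoring), the survival quantities reduce to simple descriptive calculations. The following sketch illustrates them under stated assumptions: timestamps formatted as `YYYY-MM-DD HH:MM:SS`, a normal-approximation 95% interval (mean ± 1.96 × standard error), and made-up durations. All function names are hypothetical. As a rough check on the reported figures, 23.6 ± 1.96 × 0.419 gives about 22.8 to 24.4 days, close to the bounds above.

```python
from datetime import datetime
from math import sqrt
from statistics import mean, median, stdev


def days_to_deletion(created_at, deleted_at, fmt="%Y-%m-%d %H:%M:%S"):
    """Whole days between creation and deletion timestamps."""
    delta = datetime.strptime(deleted_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days


def survival_summary(durations):
    """Mean, standard error, 95% interval, and median of time-to-deletion."""
    n = len(durations)
    avg = mean(durations)
    se = stdev(durations) / sqrt(n)
    return {
        "mean": avg,
        "standard_error": se,
        "ci_95": (avg - 1.96 * se, avg + 1.96 * se),
        "median": median(durations),
    }


def survival_curve(durations):
    """Fraction of quizzes still undeleted after each observed day."""
    n = len(durations)
    return {day: sum(d > day for d in durations) / n for day in sorted(set(durations))}


demo_durations = [0, 1, 2, 2, 3, 40, 120]  # made-up days to deletion
summary = survival_summary(demo_durations)
curve = survival_curve(demo_durations)
print(days_to_deletion("2016-01-01 08:00:00", "2016-01-03 09:30:00"))
print(summary["mean"], summary["median"])
```

Even in this toy data the mean sits far above the median, the same pattern as the 23.6-day mean and 2-day median reported on the slide: a few long-lived quizzes stretch the average.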
  73. 73. 73
  74. 74. One Minus Survival Function Curve for Deleted Quizzes in the LMS Instance  Shows how long a quiz survives before it is deleted from a set of quizzes that were ultimately deleted 74
  75. 75. 75
  76. 76. Hazard Function for Deleted Quizzes in the LMS Instance  All quizzes in the set were ultimately deleted.  This line graph shows the time-to-event, that is, how long after their respective creation dates quizzes were deleted in the LMS instance.  The hazard function curve sometimes shows particular time patterns of when a quiz is most at risk of deletion, but this curve only generally shows a steep rise initially and then a gradual achievement of time-to-event. 76
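A basic discrete-time version of a hazard function can be sketched as the share of still-existing quizzes that are deleted on each day. This is an illustrative simplification with made-up durations; the statistical package behind the slide may plot a cumulative or smoothed hazard instead, and `discrete_hazard` is a hypothetical helper.

```python
from collections import Counter


def discrete_hazard(durations):
    """Per-day hazard: deletions on that day / quizzes still present at its start."""
    events = Counter(durations)
    at_risk = len(durations)
    hazard = {}
    for day in sorted(events):
        hazard[day] = events[day] / at_risk
        at_risk -= events[day]  # deleted quizzes leave the risk set
    return hazard


demo_durations = [0, 0, 0, 0, 1, 1, 2, 5, 30, 200]  # made-up days to deletion
hazard_by_day = discrete_hazard(demo_durations)
print(hazard_by_day)
```

Because every quiz in the set is eventually deleted, the hazard on the last observed day is always 1.0, so the tail of such a curve should be read with caution.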
  77. 77. 5. About Discussion Boards 77
  78. 78. 78
  79. 79. Types of Discussion Boards: Announcement vs. Default  default (66%)  announcement (34%) 79
  80. 80. 80
  81. 81. Workflow States of Discussion Boards  Undefined  Active  Deleted  Unpublished 81
  82. 82. 82
  83. 83. Active vs. Deleted Discussion Board Entries (Replies)  Active discussion board entries  Deleted discussion board entries 83
  84. 84. 6. About Learner Submitted Files 84
  85. 85. 85
  86. 86. Handling of Learner Submissions in the LMS Instance  human_graded  not_graded  auto_graded 86
  87. 87. 87
  88. 88. Some Common Words from Comments Made on Submissions  Lots of encouraging words in comments made on submissions 88
  89. 89. 89
  90. 90. Submission Comment Participation Type  Admin  Submitter  Author  So administrators all comment on learner submissions, but not all authors or submitters comment. In other words, the creator of contents may submit the file without comment. 90
  91. 91. 7. About Uploaded Files 91
  92. 92. 92
  93. 93. Uploads and Revisions of Files to the LMS Instance by Year  A sense of the university’s transition to the LMS, over multiple years (so caution) 93
  94. 94. 94
  95. 95. Observed Uploaded File Types  .docx  .pdf  .jpg  .png  .pptx  .xlsx  .ppt  .zip  .dat  .xl  .mp4  .html  .txt  .mp3  .sdl  .csv  .rtf  .css  .m4v, and others 95
  96. 96. 96
  97. 97. Word Cloud of File Contents (from the Descriptions of File Contents)  What do the words say about what people have uploaded to the LMS system? 97
  98. 98. 98
  99. 99. High Frequency Word Counts in the File Names Set (as onegrams)  Final  Paper  Lab  2015  Chapter  2016  Lesson  2014  Exam  Reflection  Project  Syllabus  Report  Review  Lecture  Profile  Study  Week  Analysis  Essay, and others 99
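Unigram ("onegram") counts like these can be approximated with a few lines of standard-library Python. The sketch assumes file names are available as plain strings; the tokenizer, the tiny stopword set, and the sample names are illustrative choices, not the settings used in the original analysis.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "docx", "pdf", "pptx"}  # illustrative only


def unigram_counts(names, stopwords=STOPWORDS):
    """Count lowercase alphanumeric tokens across a list of names."""
    counts = Counter()
    for name in names:
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


demo_names = [
    "Final_Paper_2015.docx",
    "Lab 3 Report.pdf",
    "final exam review.pptx",
    "Chapter 2 Lab.docx",
]
term_counts = unigram_counts(demo_names)
print(term_counts.most_common(3))
```

Treating file extensions as stopwords keeps terms such as "docx" from crowding out content words; whether years such as "2015" should also be filtered depends on the question being asked.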
  100. 100. 8. About the Wikis and Wiki Pages 100
  101. 101. Wikis and Wiki Pages  A “wiki” in Canvas is a page with its history captured and able to be reinstituted (enabled by wiki software)  Pages may be interconnected  A page may be set as the home page  A page may be embedded in a modular sequence  A page may contain the MediaSite video  A page may contain any number of contents: imagery, iframes, videos, and other contents 101
  102. 102. 102
  103. 103. Parent Types for Wiki Pages in the LMS Instance  Course  Group  In other words, the administrators (instructors) of courses are the ones who create a majority of the pages. The learners in groups create fewer of the wiki pages.  Note that the sense of a “wiki” page is different here. 103
  104. 104. 104
  105. 105. Wiki Page Workflow  Null (default)  Active  Unpublished  Deleted  This needs more insight, but the data dictionary does not explain the different states and what they mean. For example, is a “null” wiki page published? Is an “active” wiki page something that is included in a sequence? Is a “deleted” wiki page recoverable or not? 105
  106. 106. 106
  107. 107. Word Frequency Word Cloud from Wiki Page Titles  Focuses on introductions, projects, research, teams, and others 107
  108. 108. 9. About Enrollment Role Types 108
  109. 109. About Enrollment Role Types (Role Name → Basic Role Type) Librarian → TAEnrollment; StudentEnrollment → StudentEnrollment; TeacherEnrollment → TeacherEnrollment; TAEnrollment → TAEnrollment; DesignerEnrollment → DesignerEnrollment; ObserverEnrollment → ObserverEnrollment; Grader → TAEnrollment; GradeObserver → TAEnrollment 109
  110. 110. University-Defined Roles and Capabilities  Some unique roles  Some shared roles 110
  111. 111. 111
  112. 112. Frequencies of Enrollment Roles  StudentEnrollment  TeacherEnrollment  StudentViewEnrollment  TAEnrollment  ObserverEnrollment  DesignerEnrollment (in descending order) 112
  113. 113. 113
  114. 114. Top Dozen Computer System Configurations for Accessing LMS Instance  …and others 114
  115. 115. 115
  116. 116. Request Types in the LMS Instance  GET (Read)  POST (Create)  PUT (Update / Replace)  HEAD (Retrieve headers only)  DELETE (Remove)  PATCH (Update, Modify) 116
  117. 117. 10. About Groups 117
  118. 118. 118
  119. 119. Group Names Frequency Word Cloud  Clone  Teaching  Design  Clinical  Plan  Final  Class  Learning  Ventilation, and others 119
  120. 120. 120
  121. 121. Moderator Status of Learners in Groups  not_moderator  is_moderator 121
  122. 122. 122
  123. 123. Learner Membership Status in Groups  Accepted  Deleted  No invited  No requested 123
  124. 124. 11. About Users and Workflow States 124
  125. 125. 125
  126. 126. User “Workflow” States in the LMS Instance  registered  pre_registered  deleted  creation_pending  The “creation_pending” may well refer to a process of approval for people to have access—for a level of security. 126
  127. 127. 127
  128. 128. Years of Origination of User Accounts  Initial exploration in 2013  Big push in 2014  New accounts in 2015 and 2016 indicating not only students but also employment churn and stragglers slow to change to a new LMS 128
  129. 129. 129
  130. 130. Retired Accounts = Registered False  2013 – early May 2017  Word frequency count from unigrams (so no full names represented as such)  First names more common and so better represented  One number removed in the “stopwords” list 130
  131. 131. 131
  132. 132. Created Pseudonyms  Pseudonyms = “logins associated with users”  Seems to be the connection between the LMS and various university information systems  Seems like partial data (extracted in May 2017) 132
  133. 133. 133
  134. 134. Current “States” of Pseudonyms  A majority of pseudonyms “active” vs. “deleted” 134
  135. 135. 12. About Course Level Grades (based on Enrollments) 135
  136. 136. 136
  137. 137. Numbers of Attempts for Latest Submitted Assignments  Null (no scores)  One  Two  Three  Four, etc. (in descending order) 137
  138. 138. 13. About Conversations (In-System Emails) 138
  139. 139. 139
  140. 140. Conversations with Media Objects Included  False  True  So when people use the email system inside Canvas, they do not generally attach media objects (like digital imagery, slideshows, audio, video, or other digital files). 140
  141. 141. 141
  142. 142. Conversations w/ or without Attachments  A majority of conversations are without attachments  A minority of conversations are with attachments 142
  143. 143. 143
  144. 144. Origins of Conversations / Messages  Human-generated conversations (the overwhelming majority)  System-generated messages 144
  145. 145. 145
  146. 146. Conversation Messages Word Frequency Count  482,339 conversation messages  Texts with 60,509,894 words  2/3 analyzed for textual contents (because of data size) 146
  147. 147. 147
  148. 148. Mass Conversation Message Contents  Analytic: 82.33  “Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-now, and narrative thinking”  Clout: 80.21  “perspective of high expertise” and confidence vs. “more tentative, humble, even anxious style”  Authentic: 26.41  “more honest, personal, and disclosing text” vs. “a more guarded, distanced form of discourse”  Tone: 66.24  “a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22) 148
  149. 149. 149
  150. 150. Messaging about “Human Drives” in the Mass Conversation Messages  Affiliation (2.35)  Power (2.19)  Achievement (1.46)  Reward (1.3)  Risk (0.37)  “The focus on affiliation and social identity seems reasonable, given the typical college age of learners. The "power" language may come from faculty speaking from positions of authority. The low level of focus on risk is intriguing here (maybe young learners are not thought to have developed the efficacy and confidence to take on uncontrolled risks?). Clearly, there is a role for theorizing and interpretation, even with computation-based analytics.” 150
  151. 151. 151
  152. 152. Sentiment Analysis of Sample of Conversation Messaging  A smaller sample of the conversation messages were analyzed for sentiment. This set consisted of 72,377 messages.  The automated observations of sentiment showed that there were two tendencies...either very positive or moderately negative (in terms of text categories).  In this software tool, it is possible to explore which texts were categorized to which categories of sentiment (very negative, moderately negative, moderately positive, or very positive) in the comparisons between the target text and the built-in sentiment dictionary.  In other words, the actual exploration of the content is possible through both machine reading and human close reading. 152
  153. 153. 153
  154. 154. Auto-Extracted Theme Based Hierarchy Chart of Conversation Messaging Sample (as a Treemap)  Class  Assignment  Time  Paper  Questions  Exam  Online  Group, etc. 154
  155. 155. 155
  156. 156. Auto-extracted Themes from Conversation Messaging Sample  These are in alphabetical order  The themes are listed in a human-readable way going clockwise around the pie (in a pie chart) 156
  157. 157. 157
  158. 158. Auto-Coded Theme-Based Hierarchy Chart of Topics and Subtopics from Conversation Messaging Sample (as a Sunburst Diagram)  This sunburst diagram—in the software—is somewhat interactive  This enables digging down into a Topic by double-clicking on it and seeing the subtopic contents there  If the sliver is too thin, a mouse hovering will result in the actual subtopic and the statistics and quant data available for viewing 158
  159. 159. 159
  160. 160. Contexts of “Help” in a Word Tree  It is possible to analyze the various contexts in which “help” was used in the conversation messaging in the prior word tree  In the software (NVivo 11 Plus), the word tree is interactive and is linked to the original sources where the word appears, so it is possible to achieve close reading of every use of “help” from the underlying dataset  The challenge is engaging a full dataset of millions of words 160
  161. 161. 14. About Third-Party External Tool Activations on the LMS Instance 161
  162. 162. 162
  163. 163. Numbers of External Tool Activations on the LMS Instance  External tool activations  Unique tool activations 163
  164. 164. 164
  165. 165. Named External Tool Activations in the K-State Canvas Instance  YouTube  Ted Ed  DropBox (with name variations)  Vimeo  Quizlet (with name variations)  MyOMLab  Khan Academy  Twitter  Flat World Knowledge  SlideShare  Yellowdig  SoftChalk Cloud (with name variations)  MyLab and Mastering  Educreations  Funbrain  Wikipedia, and others… 165
  166. 166. 166
  167. 167. External Tool Activations in 2013 (in alphabetical order)  Attendance Tool  Chat  CodeAcademy  Dropbox  Flat World Knowledge  Flickr Search  Graph Builder  Khan Academy  Learn LTI  McGraw-Hill Campus  Public Collections  SlideShare  SoftChalk Cloud  SoftChalk Cloud App  Ted Ed  Twitter  Vimeo  YouTube  Zoom 167
  168. 168. 168
  169. 169. External Tool Activations in 2014  There is an increase in both variety and number of external tool activations  No deeper analysis was applied, but it could be…as to the external tool types and the changing senses of needs 169
  170. 170. 170
  171. 171. External Tool Activations in 2015  There is an increase in both variety and number of external tool activations  No deeper analysis was applied, but it could be…as to the external tool types and the changing senses of needs 171
  172. 172. 172
  173. 173. External Tool Activations in 2016  There is an increase in both variety and number of external tool activations  No deeper analysis was applied, but it could be…as to the external tool types and the changing senses of needs 173
  174. 174. 15. About Course User Interface (UI) Navigation Item States 174
  175. 175. 175
  176. 176. Course User Interface Navigation Item State  Visible  Hidden  This refers to users’ ability to keep the pre-set functions in the left navigation of a course shell active or to place them in “hidden.”  There are “hidden” navigation element presets as well, which users may choose to activate. 176
  177. 177. Enablements and Limits re: the LMS Data Portal Data 177
  178. 178. 178
  179. 179. Delimiting the Analytics from the LMS Data Portal Data  The concept behind delimiting is to make conclusions more accurate by representing how confident one may be about the results.  As noted, there may be challenges and noise in the data from any step in the workflow…but there are inherent limits also to the various data analytics types—as shown in the visualization in the prior slide. 179
  180. 180. Some Practical Applications 180
  181. 181. Some Practical Applications  Self awareness (holding up a mirror to the campus for its use of its LMS)  Analytics  To improve usage of the LMS  To know what functions and features are desirable  To support learner usage  To support teaching and learning  To support non-teaching and learning approaches to the data  Decision-making  Instructional design  Administrative awareness, decision-making, funding, and others 181
  182. 182. Moving Forward with the Data 182
  183. 183. What are Ways to Go Beyond? Other Analytical Methods  Reconnecting the flat files as relational files in SQL server  Design of specific cross-file queries for data analytics  Applying more and varied computational text analysis  Engaging machine learning for patterns (such as decision trees for predictivity of classifications based on available information) Bringing in More Data  Comparing macro-level data with other instances of the Canvas LMS (such as with comparable institutions of higher education)  Using additional data to enable close-in reads (but without compromising people’s privacy)  Keep confidential information confidential 183
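To make the first two bullets concrete, here is a small sketch that loads two tables into a relational database and runs a cross-table query. It swaps in SQLite (via Python's built-in `sqlite3` module) for SQL Server purely for convenience. The table names come from the data dictionary list, but the columns and rows are simplified assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE course_dim (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE assignment_dim (
        id INTEGER PRIMARY KEY,
        course_id INTEGER REFERENCES course_dim (id),
        workflow_state TEXT
    );
    """
)
# Made-up rows standing in for data loaded from the flat files.
conn.executemany(
    "INSERT INTO course_dim VALUES (?, ?)",
    [(1, "Course A"), (2, "Course B")],
)
conn.executemany(
    "INSERT INTO assignment_dim VALUES (?, ?, ?)",
    [(10, 1, "published"), (11, 1, "deleted"), (12, 2, "published"), (13, 1, "published")],
)

# Cross-file query: published assignments per course, joined on the key.
query = """
    SELECT c.name, COUNT(*) AS published_assignments
    FROM assignment_dim AS a
    JOIN course_dim AS c ON c.id = a.course_id
    WHERE a.workflow_state = 'published'
    GROUP BY c.name
    ORDER BY c.name
"""
published_per_course = conn.execute(query).fetchall()
print(published_per_course)
conn.close()
```

Once the key relationships are declared, questions that span several flat files (for example, assignments per course or submissions per enrollment) become single queries rather than manual lookups.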
  184. 184. Some Early Lessons Learned
  185. 185. Assessing the Initial Haul of Biggish Data • Formulating askable questions • Analyzing the columnar data (and variables) • Understanding where the data comes from and how it is processed by Instructure • Analyzing the date data • Analyzing the textual data • Understanding ways to mix data in various datasets for enriched querying • Conceptualizing mixes of questions and potential findings based on the available data
  186. 186. Assessing the Initial Haul of Biggish Data (cont.) • Understanding the types of software that may be used to engage the data • Software enables cross-sectional base rate counts from flat files • Software enables cross-tabulation analysis and assessments of statistical significance (rarity of patterns) • Software enables finding patterns through machine learning (like applying decision trees to see what variables help determine classifications) • Software enables the identification of text-based patterns
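The cross-tabulation and significance bullet above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the software used in the presentation; the records, the column names `item` and `state`, and the helper names `crosstab` and `chi_square` are all hypothetical.

```python
from collections import Counter

def crosstab(records, row_key, col_key):
    """Count how often each pair of categories occurs together."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({pair[0] for pair in counts})
    cols = sorted({pair[1] for pair in counts})
    table = [[counts.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, table

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical records: navigation item by visibility state (made-up counts).
records = (
    [{"item": "quizzes", "state": "visible"}] * 30
    + [{"item": "quizzes", "state": "hidden"}] * 10
    + [{"item": "wiki", "state": "visible"}] * 10
    + [{"item": "wiki", "state": "hidden"}] * 30
)
rows, cols, table = crosstab(records, "item", "state")
stat, dof = chi_square(table)
print(rows, cols, table, stat, dof)
```

With one degree of freedom, a statistic above 3.841 is significant at the 0.05 level, so the made-up pattern here would count as a rare one. A statistics package would normally report the p-value directly.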
  187. 187. Some Early Lessons Learned • Data visualizations are only summary data, and it’s important to get to the actual underlying data to understand some dynamics. • It helps to theorize or hypothesize broadly to understand what may be going on with the observed empirical data. • It is always wise to “sanity check” data extractions and data processing to see what is going on. • It is important to understand the LMS data portal’s default settings and the rationales behind those defaults to make sure that they make sense for the particular context.
  188. 188. Some Early Lessons Learned (cont.) • Avoid double-counting for complex data with similar lead-in terms. • Watch for typing errors. • Do not ignore error messages; figure out why they are happening and deal with the issues. • Slow down the process, so you are certain of what is happening at every step. Be careful not to lose data. • Be careful about going to Excel, which has a limit of 1,048,576 rows (about 1.05 million) per worksheet. Be careful also of OS clipboards, which have 65,000-record limits. Do not let such limits stall the work and result in lost data. Go to MS Access or SQL Server first.
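One way to avoid running into the Excel row limit is to count the rows in a compressed flat file before opening it anywhere. The following is a sketch under stated assumptions: the file is gzip-compressed, UTF-8, and comma-delimited (pass a different `delimiter` if the extract is tab-delimited). The helper names and the demo file are hypothetical.

```python
import csv
import gzip
import os
import tempfile

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in Excel 2007 and later

def count_rows(path, delimiter=","):
    """Stream a gzip-compressed delimited file and count its records."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.reader(handle, delimiter=delimiter))

def fits_in_excel(path, delimiter=","):
    """True if every record would fit on a single Excel worksheet."""
    return count_rows(path, delimiter) <= EXCEL_MAX_ROWS

# Demo with a small, made-up file (a real extract may hold millions of rows).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    for i in range(1000):
        writer.writerow([i, f"course_{i}"])

row_count = count_rows(demo_path)
print(row_count, fits_in_excel(demo_path))
```

Because the file is streamed rather than loaded whole, the check uses little memory even on a regular laptop.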
  189. 189. Some Early Lessons Learned (cont.) • Use the LMS data portal “data dictionary” for the LMS data, but realize that it may be dated or incomplete or inaccurate. A particular instance of an LMS will be particular, so a general dictionary offers a general view, not a specific one. Use the data dictionary in an attentive way. • Realize that there are nuances in the data that may not be apparent initially. • With computational text analysis, oftentimes, foreign languages will get short shrift. There may be effective ways to address this. • With any sort of automation, there will be trade-offs. It is important to check findings against the data and conduct data queries on multiple software tools.
  190. 190. Some Early Lessons Learned (cont.) • Data is messy. It is entirely possible (even probable) for a process to be going smoothly when something has glitched with a data download. • Sometimes, no matter what is tried, a downloaded file cannot be imported for processing into either Microsoft Access or SQL Server. In that case, there may need to be a data “substitution” by extracting the “same-ish” set from the LMS data portal (days after the first set was extracted). • The assumption is that new data is appended to the end of the existing data, so if the file is the proper one, a “later” version still should be accurate. Depending on the data handling, though, that assumption may not be true. It will be important to check.
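The append assumption in the last bullet can be checked mechanically by testing whether the earlier extract is a row-for-row prefix of the later one. This is a minimal sketch, assuming both extracts are gzip-compressed delimited files that keep their row order; the function names and demo files are hypothetical.

```python
import csv
import gzip
import os
import tempfile

def read_rows(path, delimiter=","):
    """Stream rows from a gzip-compressed delimited file."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle, delimiter=delimiter)

def is_appended_version(earlier_path, later_path, delimiter=","):
    """True if every row of the earlier file appears, in order, at the start of the later file."""
    later_rows = read_rows(later_path, delimiter)
    for earlier_row in read_rows(earlier_path, delimiter):
        if next(later_rows, None) != earlier_row:
            return False
    return True

def write_rows(path, rows):
    """Write rows to a gzip-compressed CSV file (demo helper)."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)

# Made-up extracts: one properly appended, one with an earlier row changed.
folder = tempfile.mkdtemp()
first = os.path.join(folder, "first.csv.gz")
appended = os.path.join(folder, "appended.csv.gz")
rewritten = os.path.join(folder, "rewritten.csv.gz")
write_rows(first, [["1", "a"], ["2", "b"]])
write_rows(appended, [["1", "a"], ["2", "b"], ["3", "c"]])
write_rows(rewritten, [["1", "a"], ["2", "B"], ["3", "c"]])

print(is_appended_version(first, appended))
print(is_appended_version(first, rewritten))
```

If the portal reorders or rewrites rows between extracts, this check fails even when the content is otherwise equivalent, and a comparison keyed on record identifiers would be the better test.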
  191. 191. Some Early Lessons Learned (cont.) • Don’t just go with how software is designed. For example, with a word frequency count, don’t just go with the high counts, but analyze the “long tail” of the low counts. • The “power law” does often apply to word counts in language. The long tail shows something of outlier data in terms of single mentions (but you have to slog through misspellings, strange alphanumeric strings, and other noise first). • There are certain data visualizations that work better for certain types of data. • All data visualizations should be sufficiently labeled. • It helps to calculate not only raw numbers but percentages, where possible.
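The word-frequency and percentage points above can be illustrated with the standard library alone. This is a sketch, not the tool used in the presentation; the sample text, the simple tokenizing rule, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text and count each word (simple letters-and-apostrophes rule)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def long_tail(freqs, max_count=1):
    """Words mentioned at most max_count times, i.e. the low-count tail."""
    return sorted(word for word, count in freqs.items() if count <= max_count)

def with_percentages(freqs):
    """Pair each raw count with its share of all word occurrences."""
    total = sum(freqs.values())
    return [(word, count, round(100 * count / total, 1))
            for word, count in freqs.most_common()]

# Made-up sample text standing in for a text column from a flat file.
sample = "quiz quiz quiz discussion discussion syllabus rubric"
freqs = word_frequencies(sample)
print(with_percentages(freqs))
print(long_tail(freqs))
```

On real data the tail will contain misspellings and stray alphanumeric strings, so it still needs to be read by a person.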
  192. 192. Some Early Lessons Learned (cont.) • Data portals contain personally identifiable information (PII), so extra care has to be taken to ensure that people’s private information is neither misused nor leaked. • What is knowable depends on what other datasets one has access to and how one sets up the analyses… • It helps to know what is possible to know from the data (full universe) • It helps to know what is politically viable to ask and capture (subset) (people may ask for the moon) • It helps to use resources wisely to pursue asks that create constructive awareness and good decision-making (sub-subset) • Recording steps is important (in notes and in macros)…so everything can be repeated as needed.
  193. 193. To a Relational Database • Flat files are downloaded as compressed .gz files and extracted with 7-Zip to .csv files. • Microsoft offers SQL Server Express as a free tool but limits it to one CPU (up to 4 cores), 1 GB RAM, and a database size of 10 GB (“Limitations of SQL Server Express”). • Set this up on a dedicated machine, so the setup does not disrupt other work. • In shifting to SQL Server Express, the flat files have to be properly processed for the data to move without loss or other problems. • It may help to process the data first in MS Access (as long as the flat file data is not too large to handle in Access). Treat text columns as “Long Text,” not “Short Text.” Label date fields not as text but as “Date with Time.” The idea is to have the proper settings for appropriate receipt in SQL Server.
  194. 194. To a Relational Database (cont.) • Then, export the object from Access to Excel 2016 with the formatting and proper data structure. • If the table has more than 65,000 records, then MS Access is unable to export the data table.
  195. 195. To a Relational Database (cont.) • One option is to split the dataset in Access (Highlight the table -> go to Database Tools tab -> click Access Database -> Split database.) The problem with this is that a dataset will have to be split quite a few times to get each piece under the 65,000-record limit, and then after ingestion into SQL Server, any repeat data will have to be deleted. This path is too onerous to be helpful, especially with LMS data portal data, which can easily run into the millions of rows. • A more direct option follows on the next slide.
  196. 196. To a Relational Database (cont.) • When files are too large (anything over the 65,000 records that will fit in a clipboard), then it makes better sense to just clean the data on import in SQL Server. The sequence goes like this: .gz -> .csv (using 7-Zip) -> open SQL Server Management Studio -> import data (change “string [DT_STR]” columns to “text stream [DT_TEXT]”, so there is not a 50-character constraint on the columns), and the data import generally goes well. (This solution takes up more computer memory and is inelegant, but it solves the many issues that would crop up otherwise with a straight import without the data type adjustments.) • There is no import of column names in the first row. • In SQL Server Management Studio 17, go to Databases -> System Databases -> “master” database (right-click) -> Tasks -> Import Data … and specify that the original source is from Microsoft Excel. The flat files are now tables in the master database (under the default “dbo” schema). Do keep the original file names, for ease of reference.
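The .gz -> delimited file -> database sequence can also be scripted. The sketch below swaps in SQLite (bundled with Python) for SQL Server, purely to show the shape of the workflow: every column is declared as unbounded TEXT so long values are not cut at 50 characters, and column names are supplied by hand because the flat file has no header row. The table name, column names, and `load_flat_file` helper are hypothetical.

```python
import csv
import gzip
import os
import sqlite3
import tempfile

def load_flat_file(conn, path, table, columns, delimiter=","):
    """Load a headerless gzip-compressed delimited file into a table of TEXT columns."""
    col_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # A row with the wrong number of fields raises an error rather than loading silently.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

# Made-up flat file with one value far longer than 50 characters.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo_dim.csv.gz")
long_text = "x" * 200
with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows([["1", "Intro", long_text],
                                  ["2", "Advanced", "short"]])

conn = sqlite3.connect(":memory:")
loaded = load_flat_file(conn, path, "demo_dim", ["id", "name", "description"])
stored_length = conn.execute(
    "SELECT LENGTH(description) FROM demo_dim WHERE id = ?", ("1",)
).fetchone()[0]
print(loaded, stored_length)
```

Keeping the table name close to the original file name follows the ease-of-reference advice above; date columns would still need converting from text in a later step.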
  197. 197. To a Relational Database (cont.) • Re-indexing needed? • If so, the foreign keys may have to be reconnected to the correct primary keys for the relationships in a relational database to make sense and for SQL queries across the files to make sense. • Foreign keys point to primary keys in another table; they hold the identifier of the related record and so connect related data between tables (the same foreign key value may appear in many rows). • Primary keys are unique identifiers (and “reserved” against reuse in that sense), and they indicate unique records in data tables (and databases). • If not, it may be possible to run SQL queries by loading the tables with primary keys first and those with referring foreign keys second…but I am not there yet. Working on it.
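The primary key and foreign key relationship, and the "load the primary-key tables first" order, can be tried out on a small scale. This is a sketch using SQLite in place of SQL Server; the `courses` and `enrollments` tables and their columns are hypothetical stand-ins, not the actual Canvas table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The parent table carries the primary key; the child table refers to it.
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE enrollments ("
    "id INTEGER PRIMARY KEY, "
    "course_id INTEGER NOT NULL REFERENCES courses(id), "
    "role TEXT)"
)

# Load the table with primary keys first, then the table with referring foreign keys.
conn.executemany("INSERT INTO courses VALUES (?, ?)",
                 [(1, "Biology"), (2, "History")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?, ?)",
                 [(10, 1, "student"), (11, 1, "student"), (12, 2, "teacher")])

# A cross-table query: enrollment counts per course.
per_course = conn.execute(
    "SELECT c.name, COUNT(e.id) FROM courses c "
    "LEFT JOIN enrollments e ON e.course_id = c.id "
    "GROUP BY c.id ORDER BY c.id"
).fetchall()

# A row pointing at a course that does not exist is rejected.
try:
    conn.execute("INSERT INTO enrollments VALUES (13, 99, 'student')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(per_course, orphan_rejected)
```

Note that `course_id` 1 appears twice in the child table: a foreign key value repeats freely, while each primary key value appears once.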
  198. 198. To a Relational Database (cont.) • Proceed with a good basic text on SQL Server. Give it a good read-through before actually going too far into a project. (Experimentation is always good, but time wastage, not so much.) • If local support from a database administrator (DBA) is available, that would be optimal.
  199. 199. References • Pennebaker, J.W., Booth, R.J., Boyd, R.L., & Francis, M.E. (2015). Linguistic Inquiry and Word Count: LIWC2015. Operator’s Manual. Retrieved at https://s3-us-west-
  200. 200. Contact and Conclusion • Dr. Shalin Hai-Jew • iTAC • Kansas State University • 212 Hale / Farrell Library • 785-532-5262