A large amount of data is created in an LMS instance, and much of it can be analyzed for insight. In 2016, Instructure, the maker of Canvas, made its LMS data available to customers through a data portal (updated monthly). This portal provides access to a number of flat files related to that particular instance. This presentation shows how these big data were analyzed on a regular laptop with basic office software to summarize Kansas State University’s use of the LMS. Methods for analysis include basic descriptive statistics, survival analysis, computational linguistic analysis, and others.
The results are reported with both numbers and data visualizations, including classic pie charts, line graphs, bar charts, mixed charts, word clouds, and others. The findings provide some insights about how to approach the data, how to use a data dictionary, and other methods for extracting the data for awareness and practical decision-making. This work also suggests next steps for more advanced analysis (using the flat files in a SQL database).
More information about this may be accessed at http://scalar.usc.edu/works/c2c-digital-magazine-spring--summer-2017/wrangling-big-data-in-a-small-tech-ecosystem.
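Analyses like these can start small. The following sketch shows the kind of descriptive summary the abstract describes, using only the Python standard library; the file contents and column names here are hypothetical stand-ins, not the portal's actual schema:

```python
# Hypothetical sketch: summarizing one Canvas data-portal flat file with only
# the Python standard library. The column names below are assumptions for
# illustration, not the portal's actual schema.
import csv
import io
import statistics

# Stand-in for a small slice of a "requests"-style flat-file export.
sample_flat_file = io.StringIO(
    "course_id,user_id,interaction_seconds\n"
    "101,u1,120\n"
    "101,u2,300\n"
    "102,u3,45\n"
    "102,u4,600\n"
)

rows = list(csv.DictReader(sample_flat_file))
seconds = [int(r["interaction_seconds"]) for r in rows]

# Basic descriptive statistics over the extract.
summary = {
    "rows": len(rows),
    "courses": len({r["course_id"] for r in rows}),
    "mean_seconds": statistics.mean(seconds),
    "median_seconds": statistics.median(seconds),
}
print(summary)
```

The same pattern scales to the real monthly exports: read a flat file, select a numeric column, and summarize it before charting in office software.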
NCERT Class 10 First Flight Chapter 1: A Letter to God (PragyaC1)
Hi guys!
This is my PPT for Chapter 1 of the Class 10 First Flight textbook. I have made sure to include all the topics necessary in the chapter. It will be easily understood by you. Like for more chapter presentations in English and Science.
This is a book review of Faces in the Water, written by Ranjit Lal. It's a nice book for children that deals with the topic of female infanticide.
The K-State Online Canvas LMS Data Portal and Five Years of Activated Third-P... (Shalin Hai-Jew)
The presenter will introduce the K-State LMS data portal, share some of the insights available from it, and focus on one particular facet of this big data: the third-party apps that K-State faculty, administrators, and staff have activated, and what that says about how we're using Canvas.
Canvas LMS data portal for the Kansas State University instance
A data dictionary: Version 1.16.2 (https://portal.inshosteddata.com/docs)
Data extraction and processing
What it can tell us: (un)available data and information
Activated third-party tools in K-State Online Canvas LMS instance
Some caveats
What this says about what K-Staters (early adopters) are using
Practical applications of this third-party app activation data
Adding value to LMS data portal data
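The extraction-and-tally step in the outline above can be sketched as follows; the table shape and column names are assumptions for illustration, and the real schema lives in the portal's data dictionary:

```python
# Hypothetical sketch: tallying third-party (LTI) tool activations from a
# flat-file extract. The column names and tool names are invented for
# illustration; consult the data dictionary for the actual schema.
import csv
import io
from collections import Counter

# Stand-in rows resembling an external-tool activation table.
activations = io.StringIO(
    "tool_name,account_id,workflow_state\n"
    "Piazza,1,active\n"
    "Piazza,2,active\n"
    "Zoom,1,active\n"
    "OldTool,1,deleted\n"
)

# Count only currently active activations, skipping deleted ones.
counts = Counter(
    row["tool_name"]
    for row in csv.DictReader(activations)
    if row["workflow_state"] == "active"
)
print(counts.most_common())  # most frequently activated tools first
```

A tally like this, run over the real extract, is the raw material for the "what K-Staters are using" discussion above.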
Education must capitalize on the trend within technology toward big data. New types of data are becoming available. From evidence approaches to xAPI and the whole Training and Learning Architecture (TLA), big data is the foundation of it all.
Question answering has been a well-researched NLP area over recent years. It has become necessary for users to be able to query the variety of information available, be it structured or unstructured. In this paper, we propose a question-answering module which a) can consume a variety of data formats through a heterogeneous data pipeline that ingests data from product manuals, technical data forums, internal discussion forums, groups, etc.; b) addresses practical challenges faced in real-life situations by pointing to the exact segment of the manual or chat thread that can resolve a user query; and c) provides segments of text when deemed relevant, based on the user query and business context. Our solution provides a comprehensive and detailed pipeline composed of elaborate data ingestion, data parsing, indexing, and querying modules. It is capable of handling a plethora of data sources such as text, images, tables, community forums, and flow charts. Our studies, performed on a variety of business-specific datasets, demonstrate the necessity of custom pipelines like the proposed one for solving real-world document question-answering tasks.
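The indexing and querying modules the abstract names can be illustrated with a minimal inverted-index retriever. This is a generic sketch of the technique, not the paper's actual system, and the manual segments below are invented:

```python
# Minimal sketch: an inverted index over text segments that returns the exact
# segment best matching a query, as in "pointing to the exact segment of the
# manual." A toy illustration of the general technique only.
from collections import defaultdict

segments = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for one year.",
    "Connect the unit to Wi-Fi through the settings menu.",
]

# Build the inverted index: token -> set of segment ids containing it.
index = defaultdict(set)
for seg_id, text in enumerate(segments):
    for token in text.lower().split():
        index[token.strip(".,")].add(seg_id)

def answer(query: str) -> str:
    """Return the segment sharing the most tokens with the query."""
    scores = defaultdict(int)
    for tok in query.lower().split():
        for seg_id in index.get(tok, ()):
            scores[seg_id] += 1
    best = max(scores, key=scores.get) if scores else None
    return segments[best] if best is not None else "No match found."

print(answer("how do I reset the power"))
```

A production pipeline would replace the token-overlap score with proper ranking (e.g., TF-IDF or dense retrieval), but the index-then-query shape is the same.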
Toward a System Building Agenda for Data Integration (and Dat.docx (juliennehar)
Toward a System Building Agenda for Data Integration
(and Data Science)
AnHai Doan, Pradap Konda, Paul Suganthan G.C., Adel Ardalan, Jeffrey R. Ballard, Sanjib Das,
Yash Govind, Han Li, Philip Martinkus, Sidharth Mudgal, Erik Paulson, Haojun Zhang
University of Wisconsin-Madison
Abstract
We argue that the data integration (DI) community should devote far more effort to building systems,
in order to truly advance the field. We discuss the limitations of current DI systems, and point out that
there is already an existing popular DI “system” out there, which is PyData, the open-source ecosystem
of 138,000+ interoperable Python packages. We argue that rather than building isolated monolithic DI
systems, we should consider extending this PyData “system”, by developing more Python packages that
solve DI problems for the users of PyData. We discuss how extending PyData enables us to pursue an
integrated agenda of research, system development, education, and outreach in DI, which in turn can
position our community to become a key player in data science. Finally, we discuss ongoing work at
Wisconsin, which suggests that this agenda is highly promising and raises many interesting challenges.
1 Introduction
In this paper we focus on data integration (DI), broadly interpreted as covering all major data preparation steps
such as data extraction, exploration, profiling, cleaning, matching, and merging [10]. This topic is also known
as data wrangling, munging, curation, unification, fusion, preparation, and more. Over the past few decades, DI
has received much attention (e.g., [37, 29, 31, 20, 34, 33, 6, 17, 39, 22, 23, 5, 8, 36, 15, 35, 4, 25, 38, 26, 32, 19,
2, 12, 11, 16, 2, 3]). Today, as data science grows, DI is receiving even more attention. This is because many
data science applications must first perform DI to combine the raw data from multiple sources, before analysis
can be carried out to extract insights.
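As a toy illustration of one such DI step, the sketch below matches records across two sources using only the standard library. Real PyData DI packages are far richer, and the source lists and threshold here are arbitrary choices for the example:

```python
# Illustrative sketch of one DI step -- matching records across two sources --
# using only the Python standard library. This only shows the shape of the
# entity-matching problem; real DI packages offer much more.
from difflib import SequenceMatcher

source_a = ["Kansas State University", "Univ. of Wisconsin-Madison"]
source_b = ["kansas state univ", "university of wisconsin madison", "MIT"]

def similarity(x: str, y: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# Greedy match: for each record in A, take the best-scoring record in B,
# keeping it only if it clears a (hand-picked) threshold.
matches = []
for a in source_a:
    best = max(source_b, key=lambda b: similarity(a, b))
    if similarity(a, best) > 0.6:
        matches.append((a, best))
print(matches)
```

The hard parts of DI, as the paper argues, are everything around this core: blocking to avoid all-pairs comparison, cleaning, and combining matched records for analysis.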
Yet despite all this attention, today we do not really know whether the field is making good progress. The
vast majority of DI works (with the exception of efforts such as Tamr and Trifacta [36, 15]) have focused on
developing algorithmic solutions. But we know very little about whether these (ever-more-complex) algorithms
are indeed useful in practice. The field has also built mostly isolated system prototypes, which are hard to use and
combine, and are often not powerful enough for real-world applications. This makes it difficult to decide what
to teach in DI classes. Teaching complex DI algorithms and asking students to do projects using our prototype
systems can train them well for doing DI research, but are not likely to train them well for solving real-world DI
problems in later jobs. Similarly, outreach to real users (e.g., domain scientists) is difficult. Given that we have
Copyright 0000 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for
advertising or promotional purpose ...
Modern Database Management 12th Global Edition by Hoffer solution manual.docx (ssuserf63bd7)
https://qidiantiku.com/solution-manual-for-modern-database-management-12th-global-edition-by-hoffer.shtml
name: Solution manual for Modern Database Management, 12th Global Edition, by Hoffer
edition: 12th Global Edition
author: Hoffer
ISBN: ISBN-10: 0133544613 / ISBN-13: 9780133544619
type: solution manual
format: Word/zip
All chapters included
Focusing on what leading database practitioners say are the most important aspects of database development, Modern Database Management presents sound pedagogy and topics that are critical for the practical success of database professionals. The 12th Edition further facilitates learning with illustrations that clarify important concepts and new media resources that make some of the more challenging material more engaging. Also included are general updates and expanded material in the areas undergoing rapid change due to improved managerial practices, database design tools and methodologies, and database technology.
IWMW 2002: The Value of Metadata and How to Realise It (IWMW)
Workshop session at IWMW 2002 on "The Value of Metadata and How to Realise It" facilitated by Dennis Nicholson.
See http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2002/materials/nicholson/
The project is to ask college-related queries and get responses through a chatbot, an Artificial Conversational Entity. This system is a web application that provides answers to students' queries. Students just have to query through the chat bot; there is no specific format the user has to follow. This system helps students stay updated about college activities.
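A minimal sketch of the format-free matching idea might look like this; the questions and answers below are made up for illustration:

```python
# Toy sketch of a format-free FAQ chatbot: match a student's free-form
# message to a small FAQ by word overlap, so no fixed input format is
# required. The FAQ entries here are invented for illustration.
faq = {
    "when does the semester start": "Classes begin on August 21.",
    "how do i pay my fees": "Fees can be paid online through the student portal.",
    "where is the library": "The library is in the main campus building.",
}

def bot_reply(message: str) -> str:
    """Return the answer whose question shares the most words with the message."""
    words = {w.strip("?,.!") for w in message.lower().split()}
    best_q = max(faq, key=lambda q: len(words & set(q.split())))
    if not words & set(best_q.split()):
        return "Sorry, I don't know about that yet."
    return faq[best_q]

print(bot_reply("hey, where exactly is the library??"))
```

A real deployment would sit behind a web front end and use a stronger matcher, but the question-to-answer lookup is the core of the system described.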
Over recent years, big data, a huge amount of structured and unstructured data, has been generated from social networks. There is a need to extract valuable information from this social big data. Traditional analytic platforms need to be scaled up to analyze social big data in an efficient and timely manner. Sentiment analysis of social big data helps organizations by providing business insights based on public opinion. Sentiment analysis based on a multi-class classification scheme is oriented toward classifying text into more detailed sentiment labels. Multi-class classification with a single-tier architecture, where a single model is developed and trained on the entire labeled data, may increase the classification complexity. In this paper, a multi-tier sentiment analysis system on a big data analytics platform (MSABDP) is proposed to reduce multi-class classification complexity and efficiently analyze large-scale data sets. Hadoop is built for big data analytics; it is a good platform for managing large data at scale, and it can improve scalability and efficiency by adopting a distributed processing environment, since it is implemented using the MapReduce framework and the Hadoop Distributed File System (HDFS). The MSABDP is implemented by combining the SentiStrength lexicon and a learning-based classification scheme in a multi-tier architecture, run on a big data analytics platform to manage large data at scale. The proposed system collects a large amount of real Twitter data using Apache Flume, and this data was used for evaluation. The evaluation results show that the proposed multi-class classification system with a multi-tier architecture significantly improves classification accuracy over multi-class classification based on a single-tier architecture, by 7%.
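The lexicon-based tier of such a scheme can be sketched as follows. The tiny lexicon and label buckets below are invented for illustration; the real system combines a full SentiStrength lexicon with learned classifiers running on Hadoop:

```python
# Hedged sketch of lexicon-based multi-class sentiment scoring: words carry
# signed strengths, and the summed score is bucketed into one of several
# labels. The lexicon and buckets here are made up for illustration.
LEXICON = {"love": 3, "great": 2, "good": 1, "bad": -1, "awful": -2, "hate": -3}

# (low, high, label) score buckets, from most negative to most positive.
LABELS = [(-100, -2, "very negative"), (-1, -1, "negative"),
          (0, 0, "neutral"), (1, 1, "positive"), (2, 100, "very positive")]

def classify(tweet: str) -> str:
    """Sum word strengths, then map the total score to a sentiment label."""
    score = sum(LEXICON.get(w.strip("?,.!"), 0) for w in tweet.lower().split())
    for lo, hi, label in LABELS:
        if lo <= score <= hi:
            return label
    return "neutral"

print(classify("i love this great phone"))
print(classify("bad service"))
```

In a multi-tier design, a fast lexicon pass like this can route each text to a more detailed learned classifier, reducing the complexity of a single monolithic multi-class model.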
Module 3 SLP will introduce the basic concepts of computer network.docx (raju957290)
Module 3 SLP will introduce the basic concepts of computer networks. The IT infrastructure uses a mixture of computer hardware from different vendors. Large and complex databases that need central storage are found on mainframes or specialized servers, whereas smaller databases and parts of large databases are loaded on PCs and small servers. Client-server computing is often used to distribute more processing power to the desktop. The course materials take a look at the different types of networks that exist, with the primary focus on the LAN. The readings in computer networks continue with an introduction to the concept of layers, which is central to understanding how computer networks operate.
SLP Assignment Expectations
After reading the articles, please answer the following questions and prepare a PPT presentation with 10-12 slides, excluding cover slide and reference list slide.
What is the significance of telecommunications for organizations and society? What is a telecommunications system? What are the principal functions of all telecommunications systems? Briefly describe the company where these systems will be in place, and then explain your reasoning for its details.
Assignment Expectations
Your presentation will be evaluated on the following criteria:
Answers to the questions and the accompanying explanation must be given in 10-12 slides excluding cover and reference slides.
· Precision: You see what the module is all about and structure your paper accordingly. You draw on a range of sources and establish your understanding of the historical context of the question. You carry out the exercise as assigned or carefully explain the limitations that prevented you from completing some parts. (Running out of time isn’t generally considered an adequate limitation.)
· Clarity: Your answers are clear and show your good understanding of the topic. You see what the module is all about and structure your paper accordingly.
· Critical thinking: The paper incorporates your reactions, examples, and applications of the material to business and illustrates your reflective judgment and good understanding of the concepts. It is important to read the "Required Reading" in the Background material plus other sources you find relevant.
· Breadth and Depth: You provide informed commentary and analysis—simply repeating what your sources say does not constitute an adequate paper. The scope covered in your paper is directly related to the questions of the assignment and the learning outcomes of the module.
· Overall quality: You apply the professional language and terminology of systems design and analysis correctly and in context; you are familiar with this language and use it appropriately. Your paper is well written, and the references, where needed, are properly cited and listed (refer to the APA Purdue Online Writing Lab at https://owl.english.purdue.edu/owl/resource/560/01/) if you are uncertain about formats or other issues.
Long nonfiction chapters are not in style and may never have been. While average nonfiction book chapters run about 4,000 to 7,000 words, some are several times that upper bound. The usual explanation is that there is some irreducible complexity in what the chapter addresses that cannot be covered in shorter form. This slideshow explores some methods for writing longer chapters while still maintaining coherence, focus, and reader interest, and while using some technological tools to write and edit more efficiently.
Overcoming Reluctance to Pursuing Grant Funds in Academia (Shalin Hai-Jew)
Starting as an organization’s new grant writer can be a challenge, especially in a case where there has been a time lapse since the last one left. People get out of the habit of pursuing grant funds. This slideshow addresses some of the reasons for such reluctance and proposes some ways to mitigate these.
More Related Content
Similar to Leveraging Flat Files from the Canvas LMS Data Portal at K-State
Overcoming Reluctance to Pursuing Grant Funds in Academia (Shalin Hai-Jew)
Writing grants is one common way that those in institutions of higher education may acquire some funds—small and big, one-off and continuing—to conduct research, hire faculty and researchers and learners and others, update equipment, update or build up new buildings, and achieve other work. This slideshow explores some aspects of the work of grant writing in the present moment in higher education.
Contrasting My Beginner Folk Art vs. Machine Co-Created Folk Art with an Art-... (Shalin Hai-Jew)
The SARS-CoV-2 pandemic inspired several years of experimentation with common or folk art, involving mixed media, alcohol ink painting, and other explorations. Then, with the emergence of art-making generative AIs, there were further experiments, particularly with one that enables the generation of visuals from scanned art and photos, text prompts, style overlays, and text-based visual modifiers. While both types of artmaking are emotionally satisfying and helpful for stress management, there are some contrasting differences. This exploratory slideshow examines some of these differences to partially shed light on the informal usage of an art-making generative AI (artificial intelligence).
Creating Seeding Visuals to Prompt Art-Making Generative AIs (Shalin Hai-Jew)
Art-making generative AIs have come to the fore. A basic work pipeline typically involves starting with text prompts -> generated images. That image may be used to seed further iterations. Deep Dream Generator (DDG) enables the application of “modifiers” of various types (artist styles, visual adjectives, others) to be applied in addition to the text prompt.
Another approach involves beginning with a “seeding image,” a born-digital or digitized (born-analog) visual on which AI-generated art may be based for a multi-channel and multi-modal prompt. This slideshow provides some observations of how to think about seeding images, particularly in terms of how the DDG handles them, with its “algorithmic pareidolia” (“Deep Dream,” Wikipedia, July 3, 2023).
Human art-making is often about throwing open mass-scale conversations. Artists are thought to help bridge humanity into the future. Whether generative AI art enables this or not is still not clear.
Common Neophyte Academic Book Manuscript Reviewer MistakesShalin Hai-Jew
The work of academic book reviewing, as a volunteer (most often), is a common academic practice. The presenter served as a neophyte reviewer for some years before settling into this invited volunteer work for several decades. There have been lessons learned over time about avoidable mistakes…from both experience and observation.
Fashioning Text (and Image) Prompts for the CrAIyon Art-Making Generative AIShalin Hai-Jew
CrAIyon (formerly DALL-E mini, named after Salvador Dalí) is a web-facing art-making generative AI tool online (https://www.craiyon.com/) that enables the use of text (and image) prompts for the creation of watermarked, lightweight visuals. Counterintuitively, the rough visuals are much more usable for recombinations, remixes, and recreations into usable digital visuals for various digital learning objects. The textual prompts are not particularly intuitive because of how the generative AI program was trained on mass-scale visuals. There is an art and occasional indirection to working prompts after each try, with the resulting nine-image proof sheets that CrAIyon outputs. The tool can be used iteratively for different outputs.
The tool sometimes turns out serendipitous surprises, including an occasional work so refined that it can be used / shared almost unedited. One challenge in using CrAIyon comes from their request for credit (for all non-subscribers to their service). Another comes from the visual watermarking (orange crayon at the bottom right of the image). However, this tool is quite useful for practical applications if one is willing to engage deep digital image editing (Adobe Photoshop, Adobe Illustrator).
Augmented Reality in Multi-Dimensionality: Design for Space, Motion, Multiple...Shalin Hai-Jew
Augmented reality (AR)—the use of digital overlays over physical space—manifests in a wide range of spaces (indoor, outdoor; virtual) and ways (in real space (with unaided human vision); in head gear; in smart glasses; on mobile devices, and others). There are various authoring technologies that enable the making of AR experiences for various users. This work uses a particular tool (Adobe Aero®) to explore ways to build AR for multiple dimensions, including the fourth dimension (motion, changes over time).
Based on the respective purposes of the AR experience, some basic heuristics are captured for
space design (1),
motion design (2),
multiple perception design (sight, smell, taste, sound, touch) (3),
and virtual- and tangible- interactivity (4).
Some Ways to Conduct SoTL Research in Augmented Reality (AR) for Teaching and...Shalin Hai-Jew
One of the extant questions about augmented reality (AR) is how (in)effective it is for the teaching and learning in various formal, nonformal, and informal contexts. The research literature shows mixed findings, which are often highly context-based (and not generalizable). There are some non-trivial costs to the design/development/deployment of AR for teaching and learning. For the users, there is cognitive load on the working memory [(1) extraneous/poor design, (2) intrinsic/inherent difficulty in topic, and (3) germane/forming schemas]. For teachers, there are additional knowledge, skills, and abilities / attitudes (KSAs) that need to be brought to bear.
Exploring the Deep Dream Generator (an Art-Making Generative AI) Shalin Hai-Jew
The Deep Dream Generator was created by Google engineer Alexander Mordvintsev in 2014. It has a public facing instance at https://deepdreamgenerator.com/, which enables people to use text prompts and image prompts (individually or in combination) to inspire the art-generating generative AI to output images. This work highlights some process-based walk-throughs of the tool, some practical uses, some lightweight art learning, some aspects of the online social community on this platform, and other insights. Some works by the AI prompted by the presenter may be seen here: https://deepdreamgenerator.com/u/sjjalinn.
(This is the first draft of a slideshow that will be used in a conference later in the year.)
Augmented Reality for Learning and AccessibilityShalin Hai-Jew
Recently, the presenter conducted a systematic review of the academic literature and an environmental scan to learn how to set up an augmented reality (AR) shop at an institution of higher education. The ambition was to not only set up AR in an accessible and legal way but also be able to test for potential +/- effects of AR on teaching and learning. The research did not go past the review stage, because of a lack of funding, but some insights about accessibility in AR were acquired.
(The visuals are from Deep Dream Generator and CrAIyon.)
Engaging Pixabay as an open-source contributor to hone digital image editing,...Shalin Hai-Jew
This slideshow describes the author's early experiences with creating two accounts on Pixabay in order to advance digital editing skills in multimedia. The two accounts are located at https://pixabay.com/users/sjjalinn-28605710/ and https://pixabay.com/users/wavegenerics-29440244/ ...
This work explores four main spaces where researchers publish about educational technology: academic-commercial, open-access, open-source, and self-publishing.
Human-Machine Collaboration: Using art-making AI (CrAIyon) as cited work, o...Shalin Hai-Jew
It is early days for generative art AIs. What are some ways to use these to complement one's work while staying legal (legal-ish)?
Correction: .webp is a raster format
Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...Shalin Hai-Jew
University creative shops are exploring whether they can get into the game of producing AR-enhanced experiences: campus tours, interactive gaming, virtual laboratories, exploratory art spaces, simulations, design labs, online / offline / blended teaching and learning modules, and other AR applications.
This work offers a basic environmental scan of the AR space for online teaching and learning, and it includes pedagogical design leads from the current research, technological knowhow, hands-on design / development / deployment of learning objects, and online teaching and learning methods.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
1. Leveraging “Flat Files”
from the Canvas LMS
Data Portal (at K-State)
SIDLIT 2017 | LMS Preconference
Colleague 2 Colleague
August 2, 2017
2. Presentation
A lot of data are created in an LMS instance, and much of this can be
analyzed for insight. In 2016, Instructure, the makers of Canvas, made their
LMS data available to their customers through a data portal (updated
monthly). This portal enables access to a number of flat files related to that
particular instance. This presentation showcases how this big data was
analyzed on a regular laptop with basic office software, to summarize
Kansas State University’s use of the LMS. Methods for analysis include the
following: basic descriptive statistics, survival analysis, computational
linguistic analysis, and others.
2
3. Presentation (cont.)
The results are reported out with both numbers and data visualizations,
including classic pie charts, line graphs, bar charts, mixed-charts, word
clouds, and others. The findings provide some insights about how to
approach the data, how to use a data dictionary, and other methods for
extracting the data for awareness and practical decision-making. This
work also is suggestive of next steps for more advanced analysis (using the
flat files in a SQL database).
More information about this experience may be accessed on SlideShare
through an article download titled “Wrangling Big Data in a Small Tech
Ecosystem” at http://www.slideshare.net/ShalinHaiJew/wrangling-big-data-
in-a-small-tech-ecosystem (orig. from Oct. 2016). The original article
“Wrangling Big Data in a Small Tech Ecosystem” is from C2C Digital
Magazine.
3
4. Presentation Order
Canvas LMS at Kansas State University (K-State)
Canvas LMS Data Portal and Flat Files
The Summary Data
Some Practical Applications
Moving Forward with the Data
4
5. General Approach
Framework
Approaches
An instructional design approach
What can enhance teaching and
learning?
A researcher approach
What can enhance accurate
data collection, usage, researcher
awareness, and decision-making?
Using all data (every part!)
Using all basic software tools
available on a regular machine
Data Clients on a
Campus
Faculty
Staff
System Administrators
Leaders
Students
Analysts
5
7. LMS History at K-State
Homegrown Learning Management System (LMS) (Axio Learning)
Informed by faculty, admin, and staff needs (IT Help Desk tickets, focus groups
with faculty and staff)
Software updates rolled out annually with some patches in-between
Built mostly by K-State graduates and professional developers (often hired from
student ranks)
Instructure’s Canvas LMS at K-State (2013 – present)
Availability of the data portal in 2016
Monthly updates of select data from the particular instance
Accessed at K-State in October 2016
7
8. An Early Brainstorm
Brainstorm beneficial questions (data queries) before exploring the data, so
you’re not limited by the found data, and keep these in mind even after
the initial data exploration. It is important to conceptualize what may be
practically helpful through the informed imagination first.
It would be helpful to continue with the brainstorming as the data are
explored.
8
9. Initial Brainstormed Questions
What can be reported out at various levels: university, college,
department, course, and individual?
Is it possible to make observations about course design? Learner
engagement (Discussions? Conversations?)? Advising? Technology usage
(such as external tools)? Uses of the LMS site for non-course applications?
What sorts of manual-created courses exist, and how are these used?
What percentage of the courses are these manual types of courses?
9
10. Initial Brainstormed Questions (cont.)
How closely is it possible to map the data of a learner’s trajectory? A
group’s trajectory?
What are some attributes to use to identify various groups? Which attributes
would be helpful? What sorts of group-specific questions may be asked?
For example, is it possible to identify high-performing groups vs. low-performing
groups in order to run analytics to see what differences there may be between
the two?
What may be understood about the learning going on in a particular
course? A learning sequence?
Are there ways to understand effective support for learners and support for
learning from this data?
10
11. Required Preliminary Understandings
Need to understand the front-end view of the LMS and its general uses on
campus; otherwise, the back-end data view will be looking through a mirror
darkly
Need to understand what terms are applied to the various types of data
(because you want to be on the same page with the creators and users of the
LMS)
Need to have experience with the various analytical technologies applied to
the particular data because various queries require different data processing
and data structures
Will be applying the following: descriptive statistics, inferential statistics, direct
data queries, linguistic analysis, survival analysis, sentiment analysis, topic
modeling, and others
Will ultimately be applying more complex machine learning as well
11
12. Required Preliminary Understandings
(cont.)
Need understandings of “states” of being for various objects in an LMS
Need ability to identify anomalies and the skills to interpret what these
might mean
Need to know what data mean and where to dig deeper for more relevant
information
Need to know where noise might enter a particular dataset or an analytical
process…and to head off the introduction of or inclusion of noise
12
14. Canvas Data Portal
Data updated once a month (at the time; now updated daily)
Live dynamic data may be accessed via a higher level of service
Flat files (in compressed .gz format for download with 7Zip) downloaded
from SQL servers
Also known as table data (albeit without defined structural relationships between
records and therefore “flat”)
May contain labeled data like numbers
May contain unstructured or semi-structured data like texts, names, messages, and
others
Contain content data (messaging), trace data (interaction data), and some
metadata (data about data, often riding on imagery and multimedia)
Data described in a formal data dictionary
14
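A minimal sketch of reading one of these flat files on a regular laptop, using only the Python standard library. The flat files are gzip-compressed, tab-delimited tables without header rows, so the column names must be supplied from the data dictionary; the field names passed in here are placeholders, not the actual schema.

```python
import csv
import gzip

def load_flat_file(path, fieldnames):
    """Read a gzipped, tab-delimited flat file (no header row) into dicts.

    `fieldnames` supplies the column names in the order given by the
    data dictionary, since the export itself carries no header.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh, fieldnames=fieldnames, delimiter="\t")
        return list(reader)
```

From here, the rows can be summarized in Python itself or exported for basic office software.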
15. “Flat Files” Strengths and Weaknesses
Strengths
Manageable on a small-scale
laptop
Can ask questions across several
flat files
Weaknesses
Lack relational data between the
various flat files
Cannot query data effectively
across the various data tables
(because the relationships are not
defined)
Lack access to identifier column
Lack access to the foreign key
15
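One way around these weaknesses, as the deck suggests for next steps, is to re-animate the flat files in a SQL database where the relationships can be declared. A minimal sketch with SQLite; the table names, columns, and rows below are hypothetical miniatures, with only the assignment_id foreign key echoing the data dictionary example in this deck.

```python
import sqlite3

# Hypothetical miniature versions of two flat files, related by assignment_id.
assignments = [(1, "Essay 1"), (2, "Quiz 3")]
overrides = [(10, 1, "2017-03-01"), (11, 1, "2017-03-08")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignment (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE assignment_override ("
    " id INTEGER PRIMARY KEY,"
    " assignment_id INTEGER REFERENCES assignment(id),"
    " due_at TEXT)"
)
conn.executemany("INSERT INTO assignment VALUES (?, ?)", assignments)
conn.executemany("INSERT INTO assignment_override VALUES (?, ?, ?)", overrides)

# With the relationship declared, a cross-table question becomes one query.
rows = conn.execute(
    "SELECT a.title, COUNT(o.id) FROM assignment a"
    " LEFT JOIN assignment_override o ON o.assignment_id = a.id"
    " GROUP BY a.id ORDER BY a.id"
).fetchall()
```

The same join is effectively impossible to ask of the raw flat files alone, since the relationships between tables are not defined there.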
16. Data Dictionary
A reference resource that describes particular data
Documentation of data captured in the Canvas Data warehouse
Helpful for understanding naming protocols of the various data types
The following is a verbatim example:
16
Name: assignment_id
Type: bigint (big integer)
Description: Foreign key to the assignment the override is associated with. May be empty.
23. Order: First Data Visualizations and
Then Light Text Commentary
The data visualizations come first…so that the audience may analyze the
data to see what it says
The summary analyses come directly after the visualization, so there is a
kind of debriefing
23
25. Purposeful Blur and Block
Need to know how to protect against data leakage
Never share the underlying dataset
Never share unique identifiers
Always double check screen grabs against accidental inclusion of personally
identifiable data (PII); use effective redaction if PII is viewable
When redacting, make sure that the redaction cannot be reversed (backwards iterated
or some other strategy) and a person re-identified
Check that no metadata is riding with multimedia being released
Any personally identifiable information (PII) is obfuscated here
No granular level of data was captured in the article
25
27. Workflow
1. Conceptualizing questions and applications of the data
2. Review of the dataset information
3. Data download
4. Data extraction
5. Data processing (cleaning) and analytics
6. Validating / invalidating the findings
7. Additional data analytics
8. Write-up for presentation
9. Data and informational materials archival
27
37. Date Restriction Accesses for Course
Sections
Non-defined (default) as the majority
Restricted section access (by learner name) to defined dates
Non-restricted (all participants in the course welcome) section access to
defined dates
37
44. Time Features for Assignments
Half of assignments with no time allotment
Other half with time features
Due_at, no unlock_at, no lock_at
Due_at, lock_at, unlock_at (all three)
44
48. Some Linguistic Features of the
Assignment Titles and Descriptions
Analytic: 91.69
“Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-
now, and narrative thinking”
Clout: 73.25
“perspective of high expertise” and confidence vs. “more tentative, humble, even
anxious style”
Authentic: 11.83
“more honest, personal, and disclosing text” vs. “a more guarded, distanced form of
discourse”
Tone: 64.98
“a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional
tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22)
48
50. Delving into Topics of Interest
Identifying words (names, formulas, dates, symbols, etc.)-of-interest
Using NVivo 11 Plus to create word trees with the target term as the seeding
topic
Ability to double-click on the respective branches to link back to the
original source data files
50
56. Survival Function of Assignments to
Update
How long does it take before an assignment is updated?
At what point does an assignment seem to be “safe” against update?
What are some ways to understand assignments that are updated some
1,000 days after the date of creation?
Is it possible that some assignments were transferred over from a prior LMS
through an LTI-enabled process that might have captured the very first moment
of creation for that assignment? (“LTI” refers to the Learning Tools Interoperability
standard created by the IMS Global Learning Consortium.)
56
62. A Survey of Quiz Types
Assignment
Practice quiz
Graded survey
Survey
Affordances of the various quiz types change over time, so it is important to
update on the various functions and capabilities even as one is looking at
the data.
62
66. Quiz Question Workflow States
unpublished (default)
published
deleted
So a majority of quiz questions are created / drafted but held in reserve
and not published.
What are some possible inferences that can be made from the instance-
scale statistics and numbers?
66
68. An Inclusive Scatterplot of Quiz Point
Values
min-max range: 0 – 23,700 points per quiz
average quiz value: 33 points (w/o zeroes averaged in) and 28 points (with
zeroes averaged in)
The 23,700 occurred twice, which suggests that it might be purposeful. That
huge number, though, pulls the curve, and in a normal research context,
such an outlier would likely be omitted to erase its pull on the curve, which
would result in skew. A zoom-in would require going to the particular
instructor and course. That might require a different approach to the data
than described in this work…such as re-animating all the flat files in a SQL
database and using unique identifiers to connect related data.
68
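The with-zeroes vs. without-zeroes averages, and the pull of an extreme value like the 23,700-point quiz, can be illustrated with a small sketch. The point values below are invented for illustration, not the K-State data.

```python
from statistics import mean, stdev

quiz_points = [0, 0, 10, 20, 25, 30, 50, 100, 23700]  # hypothetical values

mean_with_zeroes = mean(quiz_points)
mean_without_zeroes = mean(p for p in quiz_points if p > 0)

# Flag extreme values. The single huge quiz inflates the spread, so a modest
# z-score threshold is used to isolate it.
mu, sigma = mean(quiz_points), stdev(quiz_points)
outliers = [p for p in quiz_points if abs(p - mu) > 2 * sigma]
```

Dropping the flagged value before charting avoids the skew noted above, though (as the slide cautions) a repeated extreme value may be purposeful rather than an error.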
70. Histogram of Quiz Point Values in LMS
Instance (with a normal curve)
Frequency of point values for quizzes
Tendencies
Most at the lower number values
70
72. Survival Curve of Deleted Quizzes in
LMS Instance
Based on timestamp data, how long does it take for a deleted quiz to
achieve “event” or be deleted (from its moment of creation)?
In this dataset, 22% of quizzes were deleted (14,769/66,366).
The min-max day range for the quiz deletions ranged from 0 - 813 days.
A survival analysis showed that the estimated survival time of quizzes that
were deleted was 23.6 days, with a lower bound of 22.7 and an upper
bound of 24.4 in the 95% confidence interval; the standard error was .419.
The median survival time--of the deleted quizzes--was a low 2 days, which
means if a quiz is to be deleted, it usually happens fairly early.
The drop-off in the curve below is steep but tapers off after several
months.
72
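Because every quiz in this subset reached the event (deletion), the median survival time reduces to a plain median of the creation-to-deletion durations, which can be sketched from timestamp pairs. The dates below are invented for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical (created_at, deleted_at) pairs for quizzes that were deleted.
events = [
    ("2016-01-10", "2016-01-10"),
    ("2016-01-10", "2016-01-11"),
    ("2016-02-01", "2016-02-03"),
    ("2016-02-01", "2016-02-04"),
    ("2016-03-05", "2016-09-15"),
]

def days_to_event(created, deleted):
    """Days from creation to deletion, parsed from ISO-style date strings."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(deleted, fmt) - datetime.strptime(created, fmt)).days

durations = sorted(days_to_event(c, d) for c, d in events)
median_days = median(durations)  # most deletions happen early; one straggler
```

A full Kaplan-Meier estimator would only be needed if some quizzes were censored (still alive at the end of observation), which is not the case for this set.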
74. One Minus Survival Function Curve for
Deleted Quizzes in the LMS Instance
Shows how long a quiz survives before it is deleted from a set of quizzes that
were ultimately deleted
74
76. Hazard Function for Deleted Quizzes in
the LMS Instance
All quizzes in the set were ultimately deleted
This line graph shows time-to-event of when quizzes were deleted from their
respective creation-dates in the LMS instance.
The hazard function curve sometimes shows particular time-patterns of
when a quiz is most at risk of deletion…but this curve only generally shows a
steep rise initially and then a gradual achievement of time-to-event.
76
90. Submission Comment Participation
Type
Admin
Submitter
Author
So all administrators comment on learner submissions, but not all authors or
submitters do. In other words, the creator of contents may submit the file
without comment.
90
93. Uploads and Revisions of Files to the
LMS Instance by Year
A sense of the university’s transition to the LMS, over multiple years (so
caution)
93
101. Wikis and Wiki Pages
A “wiki” in Canvas is a page with its history captured and able to be
reinstituted (enabled by wiki software)
Pages may be interconnected
A page may be set as the home page
A page may be embedded in a modular sequence
A page may contain a MediaSite video
A page may contain any number of contents: imagery, iframes, videos,
and other contents
101
103. Parent Types for Wiki Pages in the LMS
Instance
Course
Group
In other words, the administrators (instructors) of courses are the ones who
create a majority of the pages. The learners in groups create fewer of the
wiki pages.
Note that the sense of a “wiki” page is different here.
103
105. Wiki Page Workflow
Null (default)
Active
Unpublished
Deleted
This needs more insight, but the data dictionary does not explain the
different states and what they mean. For example, is a “null” wiki page
published? Is an “active” wiki page something that is included in a
sequence? Is a “deleted” wiki page recoverable or not?
105
109. About Enrollment Role Types
Role Name Basic Role Type
Librarian TAEnrollment
StudentEnrollment StudentEnrollment
TeacherEnrollment TeacherEnrollment
TAEnrollment TAEnrollment
DesignerEnrollment DesignerEnrollment
ObserverEnrollment ObserverEnrollment
Grader TAEnrollment
GradeObserver TAEnrollment
109
116. Request Types in the LMS Instance
GET (Read)
POST (Create)
PUT (Create / Replace)
HEAD (Retrieve headers only)
DELETE (Remove)
PATCH (Update, Modify)
116
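Tallying how often each request type appears in a requests flat file is a simple frequency count; the sample rows below are hypothetical, not the actual instance data.

```python
from collections import Counter

# Hypothetical HTTP methods pulled from a requests flat file, one per row.
methods = ["GET", "GET", "POST", "GET", "PUT", "DELETE", "GET", "PATCH"]

tallies = Counter(methods)
for method, count in tallies.most_common():
    print(method, count)  # GET leads in this sample, i.e., reads dominate
```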
126. User “Workflow” States in the LMS
Instance
registered
pre_registered
deleted
creation_pending
The “creation_pending” may well refer to a process of approval for people
to have access—for a level of security.
126
128. Years of Origination of User Accounts
Initial exploration in 2013
Big push in 2014
New accounts in 2015 and 2016 indicating not only students but also
employment churn and stragglers slow to change to a new LMS
128
130. Retired Accounts = Registered False
2013 – early May 2017
Word frequency count from unigrams (so no full names represented as
such)
First names more common and so better represented
One number removed in the “stopwords” list
130
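The unigram frequency count with a stopwords list (including one stopped-out number, as noted above) can be sketched as follows; the stopword set and sample text are placeholders, not the actual account data.

```python
import re
from collections import Counter

# Hypothetical stopword list; note one number is stopped out, as on the slide.
STOPWORDS = {"the", "and", "a", "of", "to", "2017"}

def unigram_frequencies(text):
    """Lowercase, tokenize into unigrams, and count, dropping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

freqs = unigram_frequencies("Anna joined the 2017 cohort and Anna later left")
```

Because only unigrams are counted, full names never appear as units, and common first names surface near the top of the tally, as the slide observes.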
132. Pseudonyms
Pseudonyms = “logins associated with users”
Seems to be the connection between the LMS and various university
information systems
Seems like partial data (extracted in May 2017)
132
140. Conversations with Media Objects
Included
False
True
So when people use the email system inside Canvas, they do not generally
attach media objects (like digital imagery, slideshows, audio, video, or
other digital files).
140
142. Conversations w/ or without
Attachments
A majority of conversations are without attachments
A minority of conversations are with attachments
142
146. Conversation Messages Word
Frequency Count
482,339 conversation messages
Texts with 60,509,894 words
2/3 analyzed for textual contents (because of data size)
146
148. Mass Conversation Message Contents
Analytic: 82.33
“Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-
now, and narrative thinking”
Clout: 80.21
“perspective of high expertise” and confidence vs. “more tentative, humble, even
anxious style”
Authentic: 26.41
“more honest, personal, and disclosing text” vs. “a more guarded, distanced form of
discourse”
Tone: 66.24
“a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional
tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22)
148
150. Messaging about “Human Drives” in
the Mass Conversation Messages
Affiliation (2.35)
Power (2.19)
Achievement (1.46)
Reward (1.3)
Risk (0.37)
“The focus on affiliation and social identity seems reasonable, given the
typical college age of learners. The "power" language may come from
faculty speaking from positions of authority. The low level of focus on risk is
intriguing here (maybe young learners are not thought to have developed
the efficacy and confidence to take on uncontrolled risks?). Clearly, there
is a role for theorizing and interpretation, even with computation-based
analytics.”
152. Sentiment Analysis of Sample of
Conversation Messaging
A smaller sample of the conversation messages was analyzed for
sentiment. This set consisted of 72,377 messages.
The automated observations of sentiment showed two tendencies: messages
were categorized as either very positive or moderately negative.
In this software tool, it is possible to explore which texts were assigned to
which sentiment categories (very negative, moderately negative,
moderately positive, or very positive) in the comparisons between the
target text and the built-in sentiment dictionary.
In other words, actual exploration of the content is possible through both
machine reading and human close reading.
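The sentiment work here was done with a commercial tool's built-in dictionary. Purely as an illustration of how dictionary-based binning into the four categories named above works, here is a minimal Python sketch; the lexicon and cutoffs are invented for the example.

```python
# Invented toy lexicon; a real sentiment dictionary has thousands of
# weighted entries.
LEXICON = {"great": 2, "thanks": 1, "helpful": 1,
           "late": -1, "confusing": -1, "terrible": -2}

def sentiment_bin(message):
    """Score a message by summing lexicon weights, then bin it into one of
    the four sentiment categories. Messages with no lexicon hits are left
    unclassified (None). Cutoffs here are arbitrary illustrations."""
    score = sum(LEXICON.get(w, 0) for w in message.lower().split())
    if score == 0:
        return None  # no lexicon hits; unclassified
    if score <= -2:
        return "very negative"
    if score < 0:
        return "moderately negative"
    if score >= 2:
        return "very positive"
    return "moderately positive"
```

Keeping the scored messages linked back to their source text is what makes the combined machine reading / human close reading described above possible.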
154. Auto-Extracted Theme Based Hierarchy
Chart of Conversation Messaging Sample
(as a Treemap)
Class
Assignment
Time
Paper
Questions
Exam
Online
Group, etc.
156. Auto-extracted Themes from
Conversation Messaging Sample
The themes are in alphabetical order
They are listed in a human-readable way, going clockwise around the pie
chart
158. Auto-Coded Theme-Based Hierarchy Chart of
Topics and Subtopics from Conversation
Messaging Sample (as a Sunburst Diagram)
This sunburst diagram is somewhat interactive in the software
Double-clicking a topic digs down into its subtopic contents
If a sliver is too thin, hovering the mouse reveals the subtopic along with
the statistics and quantitative data available for viewing
160. Contexts of “Help” in a Word Tree
It is possible to analyze the various contexts in which “help” was used in the
conversation messaging in the prior word tree
In the software (NVivo 11 Plus), the word tree is interactive and is linked to
the original sources where the word appears, so it is possible to achieve
close reading of every use of “help” from the underlying dataset
The challenge is engaging a full dataset of millions of words
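The word tree itself is an NVivo feature, but the underlying keyword-in-context (KWIC) idea is easy to sketch. This minimal Python version (function name invented) collects each use of a word with its left and right context, which is the raw material for close reading.

```python
import re

def kwic(messages, keyword, window=3):
    """Keyword-in-context: return (left, keyword, right) windows so each
    use of the keyword can be close-read against its context.
    Assumes the keyword is given in lowercase."""
    hits = []
    for msg in messages:
        tokens = re.findall(r"\w+", msg.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                hits.append((" ".join(tokens[max(0, i - window):i]),
                             tok,
                             " ".join(tokens[i + 1:i + 1 + window])))
    return hits

hits = kwic(["Can you help me with the quiz",
             "Thanks for the help yesterday"], "help")
```

Grouping the right-hand contexts is essentially how a word tree arranges its branches; the scaling challenge noted above comes from doing this over millions of words.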
169. External Tool Activations in 2014
There is an increase in both the variety and number of external tool activations
No deeper analysis was applied, but it could be, to explore the external tool
types and the changing sense of needs
171. External Tool Activations in 2015
There is an increase in both the variety and number of external tool activations
No deeper analysis was applied, but it could be, to explore the external tool
types and the changing sense of needs
173. External Tool Activations in 2016
There is an increase in both the variety and number of external tool activations
No deeper analysis was applied, but it could be, to explore the external tool
types and the changing sense of needs
176. Course User Interface Navigation Item
State
Visible
Hidden
This refers to users’ ability to keep the pre-set functions in a course shell’s
left navigation active or to place them in a “hidden” state.
There are “hidden” navigation element presets as well, which users may
choose to activate.
179. Delimiting the Analytics from the LMS
Data Portal Data
The concept behind delimiting is to make conclusions more accurate by
representing how confident one may be about the results.
As noted, challenges and noise may enter the data at any step of the
workflow, and there are also inherent limits to the various data analytics
types, as shown in the visualization on the prior slide.
181. Some Practical Applications
Self-awareness (holding up a mirror to the campus for its use of its LMS)
Analytics
To improve usage of the LMS
To know what functions and features are desirable
To support learner usage
To support teaching and learning
To support non-teaching and learning approaches to the data
Decision-making
Instructional design
Administrative awareness, decision-making, funding, and others
183. What are Ways to Go Beyond?
Other Analytical Methods
Reconnecting the flat files as relational tables in a SQL server
Designing specific cross-file queries for data analytics
Applying more and varied computational text analysis
Engaging machine learning for patterns (such as decision trees for
predicting classifications based on available information)
Bringing in More Data
Comparing macro-level data with other instances of the Canvas LMS (such
as at comparable institutions of higher education)
Using additional data to enable close-in reads (without compromising
people’s privacy)
Keeping confidential information confidential
185. Assessing the Initial Haul of Biggish Data
Formulating askable questions
Analyzing the columnar data (and variables)
Understanding where the data comes from and how it is processed by Instructure
Analyzing the date data
Analyzing the textual data
Understanding ways to mix data in various datasets for enriched querying
Conceptualizing mixes of questions and potential findings based on the
available data
186. Assessing the Initial Haul of Biggish Data
(cont.)
Understanding the types of software that may be used to engage the data
Software enables cross-sectional base rate counts from flat files
Software enables cross-tabulation analysis and assessments of statistical
significance (rarity of patterns)
Software enables finding patterns through machine learning (like applying
decision trees to see what variables help determine classifications)
Software enables the identification of text-based patterns
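The decision-tree idea mentioned above rests on choosing split variables by information gain. A minimal pure-Python sketch follows; the toy records and field names are invented, and a real learner (and the actual software used for the deck) would do far more.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """How much knowing `feature` reduces label entropy: the quantity a
    decision-tree learner maximizes when choosing which variable to
    split on."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Invented toy records: does a user's role predict account retirement?
rows = [{"role": "student"}, {"role": "student"},
        {"role": "staff"}, {"role": "staff"}]
labels = ["kept", "kept", "retired", "retired"]
gain = information_gain(rows, labels, "role")
```

Here "role" separates the labels perfectly, so its gain equals the full label entropy; ranking variables by this number is what "seeing which variables help determine classifications" amounts to.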
187. Some Early Lessons Learned
Data visualizations show only summary data, and it’s important to get to the
actual underlying data to understand some dynamics.
It helps to theorize or hypothesize broadly to understand what may be
going on with the observed empirical data.
It is always wise to “sanity check” data extractions and data processing to
see what is going on.
It is important to understand the LMS data portal’s default settings and the
rationales behind those defaults to make sure that they make sense for the
particular context.
188. Some Early Lessons Learned (cont.)
Avoid double-counting for complex data with similar lead-in terms.
Watch out for typing errors.
Do not ignore error messages; figure out why they’re happening and deal
with the issues.
Slow down the process, so you’re certain of what is happening at every
step. Be careful not to lose data.
Be careful about moving to Excel, which has a limit of 1,048,576 rows per
worksheet. Be careful also of OS clipboards, which have limits of about
65,000 records. Do not let such limits stall the work and result in lost data.
Go to MS Access first, or to SQL Server.
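One way to stay under Excel's per-sheet row limit without silently losing data is to split a large flat file before opening it. A stdlib Python sketch (the helper name is invented):

```python
import csv

EXCEL_ROW_LIMIT = 1_048_576  # Excel's hard per-sheet limit, header included

def split_csv(path, rows_per_file=EXCEL_ROW_LIMIT - 1):
    """Split a large flat file into Excel-safe pieces, repeating the header
    in each piece, so no rows are dropped on open. Returns the part paths."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        part, count, out, writer, paths = 0, 0, None, None, []
        for row in reader:
            if count % rows_per_file == 0:  # start a new part file
                if out:
                    out.close()
                part += 1
                p = f"{path}.part{part}.csv"
                paths.append(p)
                out = open(p, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return paths
```

Repeating the header in every part keeps each piece self-describing when it is later re-imported into Access or SQL Server.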
189. Some Early Lessons Learned (cont.)
Use the LMS data portal “data dictionary” for the LMS data, but realize that
it may be dated, incomplete, or inaccurate. A particular LMS instance has its
own specifics, so a general dictionary offers a general view, not a
specific one. Use the data dictionary in an attentive way.
Realize that there are nuances in the data that may not be apparent initially.
With computational text analysis, oftentimes, foreign languages will get
short shrift. There may be effective ways to address this.
With any sort of automation, there will be trade-offs. It is important to check
findings against the data and conduct data queries on multiple software
tools.
190. Some Early Lessons Learned (cont.)
Data is messy. It is totally possible (even probable) for a process to appear
to go smoothly when something has glitched in a data download.
Sometimes, no matter what, it is not possible to import the data for processing
into either Microsoft Access or SQL Server. In that case, there may need to be
a data “substitution” by extracting the “same-ish” set from the LMS data portal
(days after the first set was extracted).
The assumption is that new data is appended to the end of the existing data,
so if the file is the proper one, a “later” version should still be accurate.
Depending on the data handling, though, that assumption may not hold. It is
important to check.
191. Some Early Lessons Learned (cont.)
Don’t just go with how software is designed. For example, with a word
frequency count, don’t just go with the high counts, but analyze the “long
tail” of the low counts.
The “power law” does often apply to word counts in language. The long tail
shows something of outlier data in terms of single mentions (but you have to slog
through misspellings, strange alphanumeric strings, and other noise first).
There are certain data visualizations that work better for certain types of
data.
All data visualizations should be sufficiently labeled.
It helps to calculate not only raw numbers but percentages, where possible.
192. Some Early Lessons Learned (cont.)
Data portals contain personally identifiable information (PII), so extra care
has to be taken to ensure that people’s private information is not misused
or leaked.
What is knowable depends on what other datasets one has access to and
how one sets up the analyses…
It helps to know what is possible to know from the data (full universe)
It helps to know what is politically viable to ask and capture (subset) (people
may ask for the moon)
It helps to use resources wisely to pursue asks that create constructive awareness
and good decision-making (sub-subset)
Recording steps is important (in notes and in macros)…so everything can
be repeated as needed.
193. To a Relational Database
Flat files are downloaded as compressed .gz files and opened with 7-Zip as
.csv files.
Microsoft offers SQL Server Express as a free tool, but it is limited to one
CPU (up to four cores), 1 GB of RAM, and a 10 GB database size (“Limitations
of SQL Server Express”).
Set this up on a dedicated machine, so the setup does not disrupt other work.
In shifting to SQL Server Express, the flat files have to be properly processed
for the data to move without lossiness or other problems.
It may help to process the data first in MS Access (as long as the flat file data is
not too large to handle in Access). Treat text columns as “Long Text,” not “Short
Text.” Label Date fields not as text but “Date with Time.” The idea is to have the
proper settings for appropriate receipt in SQL.
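As an alternative to 7-Zip, the .gz-to-.csv decompression step can also be scripted with Python's standard library (the helper name is invented):

```python
import gzip
import shutil

def gunzip(src, dst):
    """Decompress a data-portal .gz flat file to a .csv on disk.
    Streams bytes, so it handles files too large to fit in memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Scripting this step makes the monthly refresh repeatable, in line with the earlier advice to record steps so everything can be re-run as needed.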
194. To a Relational Database (cont.)
Then, export the object from Access to Excel 2016 with the formatting and proper
data structure.
If a table has more than 65,000 records, MS Access is unable to export it.
195. To a Relational Database (cont.)
One option is to split the dataset in Access (highlight the table -> go to the
Database Tools tab -> click Access Database -> Split Database). The problem
with this is that a dataset would have to be split quite a few times to get
down to 65,000 records, and then, after ingestion into SQL, any repeated
data would have to be deleted. This path is too onerous to be helpful,
especially with LMS data portal data, which can easily run into millions of
rows.
A more direct option follows on the next slide.
196. To a Relational Database (cont.)
When files are too large (anything over the 65,000 records that will fit in a
clipboard), it makes better sense to clean the data on import into SQL. The
sequence goes like this: .gz -> .csv (using 7-Zip) -> open SQL Server Management
Studio -> import the data, changing “DT_STR” columns to “DT_TEXT” (a “text
stream”), so there is not a 50-character constraint on the columns, and the
data import generally goes well. (This solution takes up more computer memory
and is inelegant, but it heads off the many issues that would crop up with a
straight import without the data type adjustments.)
There is no import of column names in the first row.
In SQL Server Management Studio 17, go to Databases -> System
Databases -> “master” database (right-click) -> Tasks -> Import Data … and
specify that the original source is from Microsoft Excel. The flat files are now
database objects (dbos) in the master database. Do keep the original file
names, for ease-of-reference.
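The import path above runs through SQL Server Management Studio's wizard. As a rough stand-in illustration of the same idea (every column declared as unbounded text so nothing is truncated, with the header row kept as column names), here is a Python sketch using SQLite in place of SQL Server Express; the function and table names are invented.

```python
import csv
import sqlite3

def ingest_csv(db_path, table, csv_path):
    """Load a flat file into a database table, declaring every column as
    TEXT so long strings are not cut off (the analog of switching DT_STR
    columns to DT_TEXT in the import wizard). SQLite stands in for
    SQL Server Express here purely for illustration."""
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # keep the first row as column names
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return conn
```

Typing everything as text first and converting later mirrors the slide's advice: get the data in without lossiness, then worry about proper types.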
197. To a Relational Database (cont.)
Re-indexing needed?
If so, the foreign keys may have to be reconnected to the correct primary
keys for the relating in a relational database to make sense and for SQL
queries across the files to make sense.
Foreign keys point to primary keys in another table; they connect related
data between tables.
Primary keys are unique identifiers (“reserved” against reuse in that sense),
and they indicate unique records in data tables (and databases).
If not, it may be possible to run SQL queries by loading the tables with
primary keys first and those with referring foreign keys second…but I am not
there yet. Working on it.
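The primary-key/foreign-key relationship, and the parents-first load order mentioned above, can be shown concretely. The table and column names below are invented, and SQLite stands in for SQL Server purely for illustration.

```python
import sqlite3

# Toy schema: each pseudonym row points back at exactly one user row,
# mirroring the "logins associated with users" relationship.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE pseudonyms (
                    id INTEGER PRIMARY KEY,
                    user_id INTEGER NOT NULL REFERENCES users(id),
                    login TEXT)""")
# Load the parent table first, then the child table that refers to it.
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("INSERT INTO pseudonyms VALUES (10, 1, 'ada@example.edu')")
```

With enforcement on, inserting a pseudonym whose user_id matches no user fails, which is exactly why the referring tables have to be loaded after the tables holding their primary keys.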
198. To a Relational Database (cont.)
Proceed with a good basic text on SQL server. Give it a good read-through
before actually going too far into a project. (Experimentation is always
good, but time wastage—not so much.)
If local support with a database administrator (DBA) is available, that would
be optimal.
199. References
Pennebaker, J.W., Booth, R.J., Boyd, R.L., & Francis, M.E. (2015). Linguistic
Inquiry and Word Count: LIWC2015 Operator’s Manual. Retrieved from
https://s3-us-west-2.amazonaws.com/downloads.liwc.net/LIWC2015_OperatorManual.pdf
200. Contact and Conclusion
Dr. Shalin Hai-Jew
iTAC
Kansas State University
212 Hale / Farrell Library
shalin@k-state.edu
785-532-5262