Backup
Big Data in Learning Analytics –
Analytics for Everyday Learning
Stefan Dietze, L3S Research Center, Hannover
24.01.2017
LearnTec 2017, Karlsruhe
23/02/17 1Stefan Dietze
L3S Research Center
http://l3s.de/
http://stefandietze.net/
Research areas
• Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
• Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility
Some projects
Big Data in Learning Analytics? A simplistic perspective
Learning Analytics & Educational Data Mining
• Application of data mining techniques to understand learning activities and performance
• Traditionally confined to dedicated learning environments and platforms (e.g. Moodle)
• Examples: JLA special issue on LA datasets, with data ranging between a few MB and max. 15 GB
• Near-complete research corpus: LAK Dataset (http://lak.linkededucation.org)
Technology-enhanced Learning / Web-based Learning
Learning Analytics & Knowledge Dataset
• Cooperation of
• Near-complete Linked Data corpus of Learning Analytics research publications (~800, since 2009)
Dietze, S., Taibi, D., d'Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2017.
http://lak.linkededucation.org/
Big Data in Learning Analytics? A simplistic perspective
Technology-enhanced Learning / Web-based Learning
• Broader understanding: informal learning, micro-learning
• Research often focused on resources: sharing, reusing, recommendation
• Data examples:
o "LinkedUp Catalog": > 50 M resources, 300 M statements
o "LRMI/schema.org": > 45 M quads (Common Crawl 2015)
Big Data? – Depends, but mostly not! (Volume?)
LinkedUp Catalog of learning resources
Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/
• "LinkedUp" (FP7 project): L3S, OU, OKFN, Elsevier, Exact Learning Solutions
• Publishing and curation of educational/learning resources according to Linked Data principles
• Largest collection of Linked Data about learning resources (approx. 50 datasets, 50 M resources)
[Figure: number of entities and statements per PLD (ranked), count on a log scale]
Learning resource annotations on the Web?
• "Learning Resources Metadata Initiative (LRMI)": schema.org vocabulary for annotating learning resources in Web documents
• Approx. 5000 pay-level domains (PLDs) in "Common Crawl" (2 bn Web documents)
• LRMI adoption on the Web (WDC) [LILE16]:
o 2015: 44,108,511 quads, 6,243,721 resources
o 2014: 30,599,024 quads, 4,182,541 resources
o 2013: 10,636,873 quads, 1,461,093 resources
Power law distribution across providers (4805 providers / PLDs)
Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016
Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju
http://lrmi.itd.cnr.it/
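The per-provider distribution described above can be sketched as a simple count of quads per PLD followed by a ranking. This is a minimal illustration, not the analysis pipeline from the cited paper: the helper names (`naive_pld`, `rank_plds`, `top_share`) are hypothetical, the toy quads are invented, and the PLD is approximated by the last two host labels, which a real analysis would replace with a Public Suffix List lookup.

```python
from collections import Counter
from urllib.parse import urlparse


def naive_pld(url: str) -> str:
    """Approximate the pay-level domain as the last two host labels.

    Assumption: this ignores multi-part suffixes such as .co.uk; a real
    analysis would consult the Public Suffix List instead.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def rank_plds(quads):
    """Return (pld, quad_count) pairs sorted by descending count.

    Each quad is (subject, predicate, object, source_url); only the
    source URL (the fourth element) is used here.
    """
    counts = Counter(naive_pld(q[3]) for q in quads)
    return counts.most_common()


def top_share(ranked, k):
    """Fraction of all quads contributed by the k largest PLDs."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:k]) / total if total else 0.0


# Hypothetical toy data, not taken from the Web Data Commons corpus.
quads = (
    [("_:a", "schema:learningResourceType", "Lesson", "http://www.example.org/p1")] * 6
    + [("_:b", "schema:educationalUse", "Homework", "https://courses.example.com/x")] * 3
    + [("_:c", "schema:typicalAgeRange", "7-9", "http://example.net/y")] * 1
)
ranked = rank_plds(quads)
print(ranked)
print(top_share(ranked, 1))
```

Plotting the ranked counts on a logarithmic axis would give a curve of the kind shown on the slide; a heavily skewed `top_share` for small `k` is the signature of a few providers dominating the markup.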
Big Data in Learning Analytics? A simplistic perspective
Big Data? – Depends, but mostly not! (Volume?)
Big Data? – Depends, but mostly not! (Velocity?)
(Informal) Learning on the Web?
• Anything can be a learning resource
• The activity makes the difference (not the resource), i.e. how a resource is being used
• Learning Analytics in online/non-learning environments?
o Activity streams
o Social graphs (and their evolution)
o Behavioural traces (mouse movements, keystrokes)
o ...
• Research challenges:
o How to detect "learning"?
o How to detect learning-specific notions such as "competences", "learning performance" etc.?
"AFEL – Analytics for Everyday (Online) Learning"
• H2020 project (since 12/2015) aimed at understanding/supporting learning in social Web environments
Examples of AFEL data sources:
• Activity streams and behavioral traces
• L3S Twitter Crawl: 6 bn tweets
• Common Crawl (2015): 2 bn documents
• Web Data Commons (2015): 1 TB = 24 bn quads
• "German Academic Web": 6 TB Web crawl (recrawled quarterly)
• Wikipedia edit history: 3 M edits/month (English)
• ....
Big Data Challenges/Tasks in AFEL & beyond: some examples
I Efficient data capture
• Crawling & extracting activity data
• Crawling, extracting and indexing learning resources (e.g. Common Crawl)
II Efficient data analysis
• Understanding learning resources: entity extraction & clustering on large Web crawls of resources
• "Search as learning": detecting learning in heterogeneous search query logs & click streams
• Detecting learning activities: detection of learning patterns (e.g. competent behavior) in the absence of learning objectives & assessments (!)
o Obtaining performance indicators from behavioral traces?
o Quasi-experiments on crowdsourcing platforms to obtain training data
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Human beyond the Machine: Challenges and Opportunities of Microtask Crowdsourcing. In: IEEE Intelligent Systems, Volume 30, Issue 4, Jul/Aug 2015.
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI Conference on Human Factors in Computing Systems (CHI2015), April 18-23, Seoul, Korea.
Detecting competence in online users?
Capturing assessment data: microtasks in CrowdFlower
• "Content Creation (CC)": transcription of captchas
• "Information Finding (IF)": finding the middle name of famous persons
• 1800 assessments: 2 tasks * 3 durations * 3 difficulty levels * 100 users (per assessment)
Example IF task: "Find the middle name of:"
• Level 1: "Daniel Craig"
• Level 2: "George Lucas" (profession: Archbishop)
• Level 3: "Brian Smith" (profession: Ice Hockey, born: 1972)
Behavioral traces: keystrokes and mouse movements
• timeBeforeInput, timeBeforeClick
• tabSwitchFreq
• windowToggleFreq
• openNewTabFreq
• WindowFocusFrequency
• totalMouseMovements
• scrollUpFreq, scrollDownFreq
• ….
• Total number of events: 893,285 (CC tasks), 736,664 (IF tasks)
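As a rough illustration of how raw behavioral traces could be turned into the features named above, the sketch below aggregates one session's event log into a flat feature dictionary. It rests on assumptions that are not in the slides: the event-type strings, the log layout as `(timestamp, event_type)` tuples, and the reading of the "Freq" features as plain event counts are all hypothetical.

```python
from collections import Counter

# Study design from the slide: 2 tasks * 3 durations * 3 difficulty levels * 100 users.
N_ASSESSMENTS = 2 * 3 * 3 * 100  # 1800


def extract_features(events):
    """Aggregate one session's event log into a flat feature dict.

    `events` is a list of (timestamp_seconds, event_type) tuples sorted by
    time; the first timestamp is treated as the start of the task. The
    event-type strings ("keypress", "click", ...) are hypothetical labels.
    """
    if not events:
        return {}
    start = events[0][0]
    counts = Counter(kind for _, kind in events)

    def time_before(kind):
        # Seconds from task start to the first event of this kind, or None.
        return next((t - start for t, k in events if k == kind), None)

    return {
        "timeBeforeInput": time_before("keypress"),
        "timeBeforeClick": time_before("click"),
        "tabSwitchFreq": counts["tab_switch"],
        "totalMouseMovements": counts["mouse_move"],
        "scrollUpFreq": counts["scroll_up"],
        "scrollDownFreq": counts["scroll_down"],
    }


# Hypothetical session log, for illustration only.
session = [
    (0.0, "task_start"),
    (1.2, "mouse_move"),
    (1.9, "mouse_move"),
    (2.5, "click"),
    (4.0, "tab_switch"),
    (9.5, "keypress"),
    (10.1, "scroll_down"),
]
features = extract_features(session)
print(features)
```

One such dictionary per assessment would yield a feature table with 1800 rows, which is the shape a per-task classifier would consume.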
Predicting competence from behavioural traces?
Training data
• Manual annotation of 1800 assessments
• Performance types [CHI15]:
o "Competent Worker"
o "Diligent Worker"
o "Fast Deceiver"
o "Incompetent Worker"
o "Rule Breaker"
o "Smart Deceiver"
o "Sloppy Worker"
• Prediction of performance types from behavioral traces?
Predicting learner types from behavioral traces
• Random Forest classifier (per task)
• 10-fold cross-validation
• Prediction performance: accuracy, F-measure
Results
• Longer assessments → more signals
• Simpler assessments → more conclusive signals
• "Competent Workers" (CW, DW): accuracy of 91% and 87%, respectively
• Most significant features: "TotalTime", "TippingPoint", "MouseMovementFrequency", "WindowFocusFrequency"
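The evaluation protocol above (10-fold cross-validation, reported as accuracy and F-measure) can be sketched in plain Python. Note the substitution: to keep the example dependency-free, a one-feature threshold rule stands in for the Random Forest classifier used in the study, and the toy data is invented. All function names here are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_stump(xs, ys):
    """Stand-in model (NOT a Random Forest): threshold halfway between the
    two class means. Assumes class 1 has the larger feature values and that
    both classes occur in the training data."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0


def cross_validate(xs, ys, k=10):
    """Average accuracy and F-measure over k train/test splits."""
    folds = k_fold_indices(len(xs), k)
    accs, f1s = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        y_true = [ys[i] for i in test_idx]
        y_pred = [model(xs[i]) for i in test_idx]
        accs.append(accuracy(y_true, y_pred))
        f1s.append(f_measure(y_true, y_pred))
    return sum(accs) / k, sum(f1s) / k


# Hypothetical, cleanly separable toy data: one feature (a "TotalTime"-like
# value), label 1 = competent, 0 = not competent.
xs = [float(v) for v in range(100)]
ys = [1 if v >= 50 else 0 for v in range(100)]
mean_acc, mean_f1 = cross_validate(xs, ys, k=10)
print(mean_acc, mean_f1)
```

Each sample is used for testing exactly once, so the averaged scores estimate performance on unseen assessments rather than on the training data.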
Other features to predict competence in learning/assessments?
"Dunning-Kruger Effect"
• Incompetence in a task/domain reduces the capacity to recognise/assess one's own incompetence
Research question
• Self-assessment as an indicator of competence?
Results
• Self-assessment is a reliable indicator of competence (94% accuracy), superior to mere performance measurement
• Tendency to overestimate one's own competence increases with increasing difficulty level
David Dunning. 2011. The Dunning-Kruger Effect: On Being Ignorant of One's Own Ignorance. Advances in Experimental Social Psychology 44 (2011), 247.
[Figure: performance ("accuracy") of users classified as "competent"]
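One simple way to quantify overestimation is the gap between a user's self-assessed score and the measured score, averaged per difficulty level. The sketch below is an assumption about how such a gap could be computed, not the method used in the study. The function name, the record layout and the toy records are hypothetical, and the records are deliberately constructed so that the gap grows with difficulty.

```python
from collections import defaultdict
from statistics import mean


def overestimation_by_level(records):
    """Mean (self-assessed - actual) score per difficulty level.

    Each record is (level, self_assessed, actual), with both scores on the
    same 0..1 scale. Positive values indicate overestimation.
    """
    gaps = defaultdict(list)
    for level, self_assessed, actual in records:
        gaps[level].append(self_assessed - actual)
    return {level: mean(values) for level, values in sorted(gaps.items())}


# Hypothetical records, invented for illustration; not the study's data.
records = [
    (1, 0.75, 0.75),
    (1, 1.0, 0.75),
    (2, 0.75, 0.5),
    (2, 1.0, 0.5),
    (3, 0.75, 0.25),
    (3, 1.0, 0.25),
]
print(overestimation_by_level(records))
```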
Summary & outlook
• Learning analytics in online & Web-based settings
o Detection of learning & learning-related notions in the absence of assessment/performance indicators?
o Analysis of a range of data, including behavioral traces, activity streams, self-assessment etc.
o Actual big data
• Positive results from initial models and classifiers
• Application of developed models and classifiers in online (learning) environments (e.g. AFEL project)
o GNOSS/Didactalia (200,000 users)
o LearnWeb
o Deutsche Welle online
o …
Acknowledgements: Team
• Pavlos Fafalios (L3S)
• Besnik Fetahu (L3S)
• Ujwal Gadiraju (L3S)
• Eelco Herder (L3S)
• Ivana Marenzi (L3S)
• Ran Yu (L3S)
• Pracheta Sahoo (L3S, IIT India)
• Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
• Mathieu d'Aquin (The Open University, UK)
• Davide Taibi (CNR, Italy)
• ...
?http://stefandietze.net
