
LearningQ: A Large-scale Dataset for Educational Question Generation (ICWSM 2018)

Presentation given at the AAAI Conference on Web and Social Media (ICWSM) in June 2018, describing how LearningQ was collected. LearningQ is a large-scale dataset that can be used for educational question generation.

Published in: Science

  1. LearningQ: A Large-scale Dataset for Educational Question Generation. Guanliang Chen (1), Jie Yang (2), Claudia Hauff (1) and Geert-Jan Houben (1). (1) Delft University of Technology, (2) University of Fribourg. https://angusglchen.github.io/ (image: flickr@georgiasouthern)
  12. Motivation. Students need high-quality questions. These can come from instructors and teachers, or from a deep question generator trained on datasets such as SQuAD, RACE, TriviaQA, … These datasets are not designed for education or learning, so an educational dataset is wanted.
  13. Step 1: Data Collection. Step 2: Data Analysis. Step 3: Experiments.
  20. Step 1: Data Collection. From lecture videos: 7K instructor-designed questions. From lecture videos and reading material: 1,146K learner-generated posts, for example "What is the direction of current in a circuit?" and "Can someone please help me?". A CNN-based question classifier filters these posts down to 230K learner-generated questions.
  29. Step 2: Data Analysis.

      Table 1: Dataset statistics.

      | Features | SQuAD | RACE | TED-Ed | Khan Academy (Video) | Khan Academy (Article) |
      | --- | --- | --- | --- | --- | --- |
      | # Documents | 20K | 27K | 1K | 7K | 1K |
      | # Questions | 97K | 72K | 7K | 201K | 22K |
      | Avg. questions / document | 4.67 | 2.60 | 6.91 | 25.40 | 12.44 |
      | Avg. words / question | 11.31 | 11.51 | 20.07 | 16.72 | 17.11 |

      Takeaway: a larger dataset consisting of longer questions.

      Table 2: Distribution of Bloom's Revised Taxonomy labels (%).

      | Cognitive dimension | SQuAD | RACE | TED-Ed | Khan Academy |
      | --- | --- | --- | --- | --- |
      | Remembering | 100 | 82.19 | 61.86 | 18.24 |
      | Understanding | 0 | 18.26 | 38.66 | 55.97 |
      | Applying | 0 | 0.46 | 9.79 | 12.58 |
      | Analysing | 0 | 8.22 | 14.95 | 15.09 |
      | Evaluating | 0 | 1.37 | 4.12 | 1.89 |
      | Creating | 0 | 0 | 1.55 | 0.63 |

      Takeaway: more questions at higher cognitive levels.
  34. Step 3: Experiments.

      Table 3: Experimental results on LearningQ.

      | Dataset | Method | BLEU-4 | METEOR | ROUGE-L |
      | --- | --- | --- | --- | --- |
      | Khan Academy | H&S | 0.10 | 3.24 | 6.61 |
      | Khan Academy | Seq2Seq | 2.29 | 6.44 | 23.11 |
      | Khan Academy | Attention Seq2Seq | 3.63 | 8.73 | 27.36 |
      | TED-Ed | H&S | 0.15 | 3.00 | 6.52 |
      | TED-Ed | Seq2Seq | 0.73 | 4.34 | 16.09 |
      | TED-Ed | Attention Seq2Seq | 1.15 | 5.32 | 17.69 |

      Takeaways: scores are much lower than on existing datasets, which calls for more advanced methods.
  38. Conclusion. We presented LearningQ, which consists of 230K document-question pairs that can be used for educational question generation. Download the dataset via https://bit.ly/learningq and contact the authors via guanliang.chen@tudelft.nl. Thank you!
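
The two averages reported in Table 1 (questions per document and words per question) can be illustrated with a minimal sketch. It assumes the data is available as (document id, question text) pairs and that words are counted by whitespace splitting; the slides do not state how the authors tokenized, and the function name `dataset_statistics` and the toy pairs are hypothetical.

```python
from statistics import mean


def dataset_statistics(pairs):
    """Summarize (document_id, question_text) pairs.

    Assumption: words are counted by splitting on whitespace; the slides
    do not say which tokenizer was used for Table 1.
    Raises statistics.StatisticsError if `pairs` is empty.
    """
    questions_per_doc = {}
    word_counts = []
    for doc_id, question in pairs:
        questions_per_doc[doc_id] = questions_per_doc.get(doc_id, 0) + 1
        word_counts.append(len(question.split()))
    return {
        "documents": len(questions_per_doc),
        "questions": len(word_counts),
        "avg_questions_per_document": mean(list(questions_per_doc.values())),
        "avg_words_per_question": mean(word_counts),
    }


# Toy data for illustration only (not taken from LearningQ, apart from the
# first question, which appears on the data collection slide).
toy_pairs = [
    ("video-1", "What is the direction of current in a circuit?"),
    ("video-1", "Why does resistance increase with temperature?"),
    ("article-1", "How do plants store energy?"),
]
stats = dataset_statistics(toy_pairs)
print(stats)
```

Running the same function over each source separately would give one column of a table shaped like Table 1.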

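Table 3 reports ROUGE-L among its metrics. The sketch below shows how a sentence-level ROUGE-L score can be computed from the longest common subsequence (LCS) of a reference question and a generated question. It is an illustration under stated assumptions, not the evaluation script behind the slides: it uses lowercased whitespace tokens and a plain F1 combination of LCS precision and recall, and the function names are hypothetical. It returns a value between 0 and 1, whereas the scores in Table 3 appear to be on a 0-100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L, F1 variant.

    Assumptions: lowercased whitespace tokenization and equal weighting of
    precision and recall (other ROUGE-L variants weight recall more).
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "what is the direction of current in a circuit"
generated = "what is the current in a circuit"
score = rouge_l_f1(reference, generated)
print(f"ROUGE-L F1: {score:.3f}")
```

Because the LCS only rewards tokens that appear in the same order, long learner-written questions with many valid phrasings tend to score low, which is consistent with the low numbers reported in Table 3.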