The document summarizes notes from a meeting about a grant project to build an intelligent web service for collecting and classifying online educational resources. The project will use a combination of manual and automatic methods, including machine learning classifiers and rule-based classifiers, to identify different types of syllabus materials on websites. It will provide users with sample course sites, overviews of course topic relationships and schedules, and the ability to populate or search for materials based on topics or templates. The goal is to leverage both human and machine intelligence to maximize the usefulness and accuracy of the system.
1. 3/31/05 CSKD Funding Meeting
Notes from CSKD Grant Meeting of 3/31/05
By Kathryn Clodfelter
Attendees:
Elin Jacob
Kiduk Yang
Kathryn Clodfelter
Kiduk created a diagram of how he envisions the grant project. (See separate PowerPoint
document: http://elvis.slis.indiana.edu/index.shtml )
Collection builder:
• Intelligent web harvest – a customized crawler will identify:
o syllabus
o topics in schedule format
o reading assignments associated
o lecture itself, exercises/homework/problems/questions (application of lectures)
o labs/exercises
o Link to external resources – e.g., Physics on-line dictionary, encyclopedia
Question: how do we identify these?
• Heuristic: based on URL information:
o extension (e.g. pdf)
o name of file
o path
o home page
• Actual content of the page:
o Automatic classifier – identify a subset of the data (like all the lectures), then train the
classifier, which will identify what type of material it is; this is the machine learning
approach: get training data, i.e., identified and labeled documents (some set of
lectures, papers), and run it through a statistical process – will fail if content is exactly the
same
o Rule-based classifier: looking for lexicons (specific words), e.g., Linguistic: look
at training data manually and come up with a heuristic; whether visual format or
whatever it may be – use human intelligence to identify the rules for classifying
things
Computational Linguistics: comes from NLP, look at part of speech; still
statistical approach, but utilizes the linguistic structure that may be
necessary to identify these things
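A minimal sketch of the URL heuristic and the lexicon idea from the list above, assuming Python. The cue words, function names, and example URLs are hypothetical and chosen only for illustration; the notes do not specify any of them.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical cue words per resource type; real lexicons would be built by
# hand from training data, as the notes describe.
# Order matters: matching is by substring, and "lab" is a substring of
# "syllabus", so "syllabus" has to be checked before "lab".
LEXICON = {
    "syllabus": {"syllabus", "prerequisites", "grading"},
    "schedule": {"schedule", "week", "calendar"},
    "lecture": {"lecture", "slides", "notes"},
    "assignment": {"homework", "exercise", "problem", "assignment"},
    "lab": {"lab", "laboratory"},
}


def url_features(url):
    """Split a URL into the cues listed above: extension, file name, path."""
    path = urlparse(url).path
    filename = posixpath.basename(path)
    stem, ext = posixpath.splitext(filename)
    return {
        "extension": ext.lstrip(".").lower(),
        "filename": stem.lower(),
        "path": [part.lower() for part in path.split("/")[:-1] if part],
    }


def classify_by_url(url):
    """Return the first resource type whose cue word appears in the URL, else None."""
    feats = url_features(url)
    tokens = [feats["filename"]] + feats["path"]
    for label, cues in LEXICON.items():
        if any(cue in token for token in tokens for cue in cues):
            return label
    return None
```

Under these assumptions, a call such as classify_by_url("http://example.edu/physics101/lectures/lecture03.pdf") would be labeled as a lecture from the file name alone. Pages whose URLs carry no cue return None and would need the content-based classifiers.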
Significant question: Machine learning has never been applied to classify these types of
resources. That's why we're going to use all three. Has done combination of manual and
automatic for TREC – use as reference. We're going to leverage machine learning, which is a
stable area of research. In building rule-based classifier, we're going to discover a combined
method of classifying the resource type, as opposed to topical classification.
One of the significant contributions to research will be to combine the machine-learning
approach with heuristic (rule-based) to identify the different types/formats of documents. May
look at actual markup as a basis (some research on that from Lizzie and Howard).
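One way the combination of rule-based and machine-learning classification might look, as a sketch only. The notes do not name a learning algorithm; naive Bayes is used here purely as a stand-in, and the class name, rule phrases, and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def hybrid_classify(text, rules, model):
    """Apply hand-written rules first; fall back to the trained model."""
    lowered = text.lower()
    for label, cue_phrases in rules.items():
        if any(phrase in lowered for phrase in cue_phrases):
            return label, "rule"
    return model.predict(text), "model"
```

Returning the source of the decision ("rule" or "model") alongside the label makes it possible to compare how often each method fires and how accurate each one is.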
All href anchor text is separated from content for retrieval purposes more than straight text.
We're going beyond that. Why and what for? To get a high quality collection – items that aren't
outside
To avoid intellectual property issues, every piece of content clicked on will take the user to the original site.
We have data internally for mapping and crunching, but we deliver the original site. Collection
builder can run from any web connection – doesn't have to be IU.
Schedule – has list of topics, etc. presented in a particular order
Plan to give the user:
Intelligent sample sites: we give him 3 options – if you have a schedule in mind for specific
topic, use this, etc. – try the FAQ – come up with diff things – key thing is Introductory IR –
sample course – has entire website (of someone else's). then not an intelligent service where
combines everything. We've manually selected the best web site. The FAQ will be manual.
Elin wants an overview and synthesis of what everyone's doing out there. What's going on at
higher level.
FAQ #1: Sample Site – can select good source for top physics or most popular (link analysis and
page rank)
FAQ #2: Overview – instead of just pointing to site, decide on structure, schedule, questions,
lectures with visual representation, little bit of analysis by looking at all schedules, ranked by
popularity, use-based, links (if have identified vocab and topics, can map relationships bet topics
in various schedules) I want to know how the topics are related; schedule is based on concepts
and the relationships between them, analyze the linear relationships (a line of info is the
connections between the points); the syllabus schedule doesn't follow the concept relationships;
have to find a way to get to that structure thru mapping of concept relationships (concept maps
are part of KB search; facet search)
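A rough sketch of the "linear relationships" idea, assuming each harvested schedule has already been reduced to an ordered list of topic labels. The function name and the sample topics are hypothetical. Counting which topic directly follows which, across all schedules, gives a weighted graph that could feed a concept map.

```python
from collections import Counter


def topic_transitions(schedules):
    """Count how often topic A is immediately followed by topic B across schedules."""
    transitions = Counter()
    for topics in schedules:
        for current, following in zip(topics, topics[1:]):
            transitions[(current, following)] += 1
    return transitions


# Hypothetical schedules, each an ordered list of topics.
schedules = [
    ["kinematics", "forces", "energy"],
    ["kinematics", "forces", "momentum"],
    ["forces", "energy", "momentum"],
]
for (first, second), count in topic_transitions(schedules).most_common():
    print(f"{first} -> {second}: {count}")
```

This only captures ordering as taught. As the notes point out, the syllabus order does not necessarily follow the concept relationships, so these counts are a starting point for mapping rather than the concept structure itself.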
A faceted structure is just like a database
When do a topic search browse, it shows a concept mapping
Facet search has something based on our faceted prototype
What are the concepts –
Q2: what are the concepts?
#1 doesn't give me kind of help I need not knowing anything
Maybe combine 1&2,
We have to come up with the questions – from Gregor – get by talking to the users/faculty
That kind of FAQ doesn't come from what we're good at
It's there because it's part of Gregor stuff
2 things in intelligent service:
Do sample sites as one aspect of structure/template component
Our focus is not to develop the FAQ – we already have several, the sample sites
Example: what are the most heavily-used courses on intro physics? What are most highly linked
sites on physics? A lot of lecture content is repeated.
2nd question: show me what the topics are for nuclear fusion course? This was what Gregor
talked about with intelligent web service. KY has steered Gregor toward glorified FAQ – it's an
expert system. That's why it's called intelligent FAQ. It's a kb. Show me the syllabus overview,
then do some analysis on all syllabus materials we have. Then do things by frequency, some kind
of structure – these are common ones.
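The "do things by frequency" step could be sketched as follows, assuming each syllabus has been reduced to a list of topic labels. The function name, the threshold parameter, and the sample data are hypothetical.

```python
from collections import Counter


def common_topics(schedules, min_share=0.5):
    """Return (topic, count) pairs for topics found in at least min_share
    of the schedules, most frequent first."""
    doc_freq = Counter()
    for topics in schedules:
        doc_freq.update(set(topics))  # count each topic once per schedule
    threshold = min_share * len(schedules)
    return [(t, n) for t, n in doc_freq.most_common() if n >= threshold]


# Hypothetical harvested schedules for an intro physics course.
schedules = [
    ["kinematics", "forces", "energy"],
    ["kinematics", "forces"],
    ["forces", "optics"],
]
print(common_topics(schedules))
```

Topics that clear the threshold are the "common ones" for the overview; the rest could still be shown as optional or less common topics.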
Where's the data on how people build their syllabi?
What we have is the middle one – concept/topic search/browse
The FAQ whose answer requires analysis has to use the data we create
Show me concept map:
Show me the concept relationships on intro physics, linked to graphical concept map
Concept topic search are when you have a specific topic and you want to find material
Structured template populator is when instead of using a concept/topic as query, you use a
structure or template (e.g. schedule) – get structure that's hyperlinked
If everything has to be linked back to original, result page is organized and filtered to find what
you want
Needs to be diff from doing a Google search
this is a digital indexing service, you're pointing me somewhere else, not giving me the
resources; therefore not a DL
example: get frustrated with SLIS course page, having to go back to main page to go to diff session
don't want something where I have to keep going back
not giving them value-added
3rd component of structure/template: I came up with course structure, there's a form where I type
in and create a schedule, got result page, everything is linked,
That person comes in with their structure in hand
I put in the topic I want, and then this table comes back all linked
Date | Topic | Lecture & Lab | Readings | Assignments
Can add description
The link is not specific reading
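A sketch of how the structure/template populator might fill such a table, assuming harvested resources are already classified by topic and type. The record keys (topic, type, url, usage), the column names, and the URLs are hypothetical. Each cell holds links back to the original sites, not copies of the content.

```python
from collections import defaultdict

# Assumed column layout, mirroring the table header in the notes.
COLUMNS = ("lecture_lab", "readings", "assignments")


def build_index(resources):
    """Group harvested resources by (topic, type)."""
    index = defaultdict(list)
    for res in resources:
        index[(res["topic"], res["type"])].append(res)
    return index


def populate_schedule(schedule, index, per_cell=2):
    """For each (date, topic) row, fill every column with links to the
    original sites, most-used resources first."""
    table = []
    for date, topic in schedule:
        row = {"date": date, "topic": topic}
        for column in COLUMNS:
            matches = sorted(index.get((topic, column), []),
                             key=lambda r: r["usage"], reverse=True)
            row[column] = [r["url"] for r in matches[:per_cell]]
        table.append(row)
    return table


# Hypothetical data: the user arrives with a schedule in hand.
resources = [
    {"topic": "forces", "type": "readings", "url": "http://example.edu/a", "usage": 5},
    {"topic": "forces", "type": "readings", "url": "http://example.org/b", "usage": 9},
    {"topic": "forces", "type": "lecture_lab", "url": "http://example.edu/c", "usage": 1},
]
for row in populate_schedule([("Week 1", "forces"), ("Week 2", "energy")],
                             build_index(resources)):
    print(row)
```

Sorting by usage is only one option; the same key function could rank by source or trustworthiness so the user picks the ordering.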
Remember CSKD is all about fusion – give humans options
Have copyright issue – identify the available readings – can do by link analysis popularity, user
popularity; user can sort by source, trustworthiness, usage
Need to have something like this for someone who doesn't have the structure
FAQs are for the person who has no idea
It's the who, what, when, where, how and why
Minimum cognitive requirement from user – don't even have to come up with question
Some type of reasonable answer – samples, site, overview /analysis of schedule
FAQs about how to use this service
It's a FAQ – where get the questions
Then user contribution module will stabilize if not used; we keep track of the system log, the
usage of these services
The way we bill it, is we will build over time – it's not the focus – where gonna get FAQs
Schedule overview – click
Do by top 10 overviews by linkage popularity, usage popularity, source importance, reading
material, overview of readings
How come up with data? We have all harvested schedules, do link analysis for top 10
Has nothing to do with what we do, it's purely statistical
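The purely statistical link analysis could be sketched with a basic PageRank, which the notes mention for the sample-site FAQ. This is a simplified illustration, not a committed design; the function names and the link graph format (a dict from page to the pages it links to) are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def top_sites(links, k=10):
    """Return the k pages with the highest rank, best first."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

Usage popularity from the system log could be ranked the same way by swapping the score; only the link-based score needs the harvested link graph.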
Can use the intelligent classifier to identify resources by structure, add the content, and plug it in
Structure/template populator, the core module would be concept/topic search – each concept
employs that search to populate
Middle one – you're trying to find a single topic, Last one is a list of all topics grouped
Concept/Topic Search:
2 of that has concept relationships as part of result – one is facet search result with a list of
results organized using our classified data
Get list of resources & list of relationships (remember facet search interface)
Still want thing that gives me sense of how to structure my course – that's the concept/topic
browse
Concept is that
Develop FAQ as part of user contribution module, they will contribute data as well as
Some will have structure, some won't
Run thru classifier and concept mapper
On top of this is digital object management
Parser will try to automatically pull metadata info – e.g., author
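A minimal sketch of such a parser, using Python's standard html.parser module. It assumes the author is given in a meta tag, which many pages will not have; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.metadata["title"] = data.strip()


def extract_metadata(html):
    """Return whatever title and meta fields the page declares."""
    parser = MetadataParser()
    parser.feed(html)
    return parser.metadata
```

Pages without declared metadata return an empty dict, which is where manual entry or content-based guessing would have to take over.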
Rule based classifier construction will be done manually, then becomes automatic
Automatic classifier
Once rule identified, becomes automatic classifier
Manual & automatic & hybrid: whole thing about maximizing the utility/performance/task
completion by leveraging both machine capability and human intelligence
next time:
1) hybrid, etc.
2) flag where can make significant contribution for IR and classification