2. Introduction
● This tool was designed to summarize the HTML versions of the papers published
in the proceedings of CHI 96 (Conference on Human Factors in Computing
Systems, 1996): http://sigchi.org/chi96/proceedings/papers.htm
● Because we wrote our own parser for the HTML source, it is not very generic
and may not work on inputs other than the papers in these proceedings.
● Since this is just a proof-of-concept application, do not expect much error
handling. We do try to provide basic error messages when something
fails.
3. How does it work?
● First, we parse the HTML of the paper, so as to distinguish between HTML
tags and the actual text.
● Next, we divide the paper into sections and subsections based on the headings
in the paper. For instance, text under the first <h1> becomes section 1, text
under the first <h2> becomes section 1.1, and so on.
● Now, we extract the actual text for each subsection and ignore any other tags
like <div>, <span>, etc.
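
A minimal sketch of the steps above, assuming only Python's standard `html.parser` module. This is not our actual parser: the class name `SectionParser`, the helper `split_sections`, and the output keys (`number`, `title`, `text`) are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect plain text under each <h1>/<h2>, numbered 1, 1.1, 1.2, 2, ..."""

    def __init__(self):
        super().__init__()
        self.sections = []        # one dict per heading, in document order
        self._h1 = 0              # current section counter
        self._h2 = 0              # current subsection counter
        self._in_heading = False  # True while inside <h1>...</h1> or <h2>...</h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._h1 += 1
            self._h2 = 0
            number = str(self._h1)
        elif tag == "h2":
            self._h2 += 1
            number = f"{self._h1}.{self._h2}"
        else:
            return  # other tags such as <div> or <span> are ignored
        self.sections.append({"number": number, "title": "", "text": ""})
        self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is dropped in this sketch
        key = "title" if self._in_heading else "text"
        self.sections[-1][key] += data


def split_sections(html):
    """Return the list of sections with whitespace normalized."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    for section in parser.sections:
        section["title"] = " ".join(section["title"].split())
        section["text"] = " ".join(section["text"].split())
    return parser.sections
```

Calling `split_sections(page_source)` returns one dictionary per heading. The sketch does not handle deeper heading levels, images, tables, or hyperlinks.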
4. How does it work?
● We pass the extracted plain text for each subsection to the summarizer to
get a brief summary of that subsection. The summary is limited to about
4-5 sentences.
● Along with the text, we also extract relevant images and tables from the
paper and insert them into the presentation under relevant sections.
● Once the parser has produced a heading for each section and the content
under it, we pass the appropriate arguments to Latexslides (a Python tool),
which generates the presentation in LaTeX.
5. How does it work?
● Finally we obtain the presentation in ‘.pdf’ format from the LaTeX source
using ‘pdflatex’.
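
A rough sketch of this last stage, under stated assumptions. It does not use the Latexslides API. Instead it writes plain Beamer source by hand, so the function names (`escape_latex`, `build_beamer_source`, `compile_pdf`) and the slide structure are hypothetical. Only the `pdflatex` call mirrors what the tool does.

```python
import subprocess
from pathlib import Path

# Characters that must be escaped in LaTeX text mode.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape special characters one at a time so nothing is escaped twice."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


def build_beamer_source(title, slides):
    """slides is a list of (heading, [sentence, ...]) pairs (assumed shape)."""
    lines = [
        r"\documentclass{beamer}",
        rf"\title{{{escape_latex(title)}}}",
        r"\begin{document}",
        r"\frame{\titlepage}",
    ]
    for heading, sentences in slides:
        lines.append(rf"\begin{{frame}}{{{escape_latex(heading)}}}")
        lines.append(r"\begin{itemize}")
        lines.extend(rf"\item {escape_latex(s)}" for s in sentences)
        lines.append(r"\end{itemize}")
        lines.append(r"\end{frame}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


def compile_pdf(tex_path):
    """Run pdflatex next to the .tex file and return the expected PDF path."""
    tex_path = Path(tex_path)
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_path.name],
        cwd=tex_path.parent,
        check=True,  # raise if pdflatex reports an error
    )
    return tex_path.with_suffix(".pdf")
```

Writing the string returned by `build_beamer_source` to a `.tex` file and passing that path to `compile_pdf` would produce the PDF, provided `pdflatex` and the Beamer class are installed.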
6. Features of the parser
● Grabs all the tags and their relevant text in a well-formatted HTML page.
● Classifies the text into the proper sections.
● Intelligently detects the type of tag and assigns the proper attribute to it.
● Output is emitted as a well-formatted JSON array, so that it can be used
independently in other applications.
● It also takes care of hyperlinks and includes them in the appropriate
section/subsection.
7. Features of the summarizer
● The summarizer takes a blob of text as input and outputs an array of
sentences that summarizes the given text.
● The summarizer also takes the maximum number of sentences to return
as a parameter, so it is flexible in this regard.
● Calculates the importance of a sentence by comparing it with all other
sentences in the given text and assigning appropriate weights to all the
words.
● Since we’re using stop-words from NLTK (Natural Language Toolkit), it can be
extended to other natural languages for which NLTK provides a stop-word list.
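
One way the scoring described above could look, as a sketch rather than our exact algorithm. Each sentence is scored by its word overlap with every other sentence, ignoring stop words. To stay self-contained, the example uses a small hard-coded `STOP_WORDS` set as a stand-in for the NLTK stop-word list. The function names and the overlap formula are assumptions.

```python
import re

# Tiny stand-in for the NLTK English stop-word list (assumption for this sketch).
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "were", "be", "of", "to", "in",
    "into", "on", "from", "for", "and", "or", "it", "this", "that", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of the sentence with stop words removed."""
    return {w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS}


def summarize(text, max_sentences=5):
    """Return up to max_sentences sentences, kept in their original order."""
    sentences = split_sentences(text)
    words = [content_words(s) for s in sentences]
    scores = []
    for i, wi in enumerate(words):
        score = 0.0
        for j, wj in enumerate(words):
            if i != j and wi and wj:
                # Shared words, normalized by the average sentence length.
                score += len(wi & wj) / ((len(wi) + len(wj)) / 2)
        scores.append(score)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

A sentence that shares no content words with the rest of the text scores zero and is the first to be dropped when `max_sentences` is smaller than the number of sentences.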
8. Features of the PDF Generator
● It takes a JSON file as input, so it is independent of the other modules
and can work with any JSON input, provided the file follows our standard
format.
● If a particular section appears several times in the input with different
contents, it automatically combines their content into one section.
● It generates the `tex` file before converting it to PDF, so the user can
download the PDF and can also edit the `tex` file directly if desired.
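
The section-merging behaviour can be sketched as follows. The JSON layout used here (an array of objects with `heading` and `content` keys) is a hypothetical stand-in, since our standard format is not spelled out on these slides. `merge_sections` is likewise an illustrative name.

```python
import json


def merge_sections(records):
    """Combine records that share a heading, keeping first-seen order."""
    merged = {}  # heading -> accumulated content list (dicts keep insertion order)
    for record in records:
        merged.setdefault(record["heading"], []).extend(record["content"])
    return [{"heading": h, "content": c} for h, c in merged.items()]


# Example input in the assumed layout: "Results" appears twice.
raw = """
[
  {"heading": "Results", "content": ["Accuracy improved."]},
  {"heading": "Method",  "content": ["We parse the HTML."]},
  {"heading": "Results", "content": ["Latency dropped."]}
]
"""
sections = merge_sections(json.loads(raw))
```

After this call, `sections` holds a single "Results" entry containing both sentences, followed by the "Method" entry.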
9. Web Interface
● We’ve also developed a web interface to use this tool in an easy manner:
http://web.iiit.ac.in/~chandan.singh/html2presentation/
10. Possible Use Cases
● Can be used to automatically generate presentations of one’s paper.
● Can also be tweaked easily to summarize blogs.
● Since the code is written in a modular manner, more modules can easily be
added or removed to enhance the user interface.
11. Thank You!
● Any feedback/suggestions are welcome.
● You can contact us via our contact page:
http://web.iiit.ac.in/~chandan.singh/html2presentation/team/