Classifying Tech News with Sparkling Water

H2O.ai 
Machine Intelligence
BUILDING MACHINE LEARNING
APPLICATIONS WITH SPARKLING WATER
AVNI WADHWA & VINOD IYENGAR

Sparkling Water
Provides the following:
• Seamless integration of H2O with the Spark ecosystem
• Transparent use of H2O data structures and algorithms
with the Spark API
• A strong fit for existing Spark workflows that require advanced
machine learning algorithms

Sparkling Water Requirements
• Spark Version 1.4
• Sparkling Water 1.4.3 (download at h2o.ai/download)

Tech News Use Case
• The goal is to predict the tag based on the short summary of the
article

Tech News Use Case: Crawler
We used import.io to create a crawler that went through numerous pages of techcrunch.com
news and collected, for each article, the title, the author, a 2-3 sentence
opening from the beginning of the article, and the tags associated with the article.

Tech News Use Case
The first manipulation of the text eliminates words that
occur frequently but do not add value to the classification process.
Sample Scala code:

Tech News Use Case
We now eliminate words that do not add value to the
classification process
• i.e., punctuation, stopwords, and words that do not occur
frequently
Sample Scala code:
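The sample from the original slide is not reproduced here. As a rough sketch of the cleaning steps listed above, the following plain-Scala code (no Spark dependency) strips punctuation, drops stopwords, and removes words that are rare across the corpus. The object name `BlurbCleaning`, the tiny stopword list, and the `minCount` parameter are assumptions made for illustration, not the authors' code.

```scala
object BlurbCleaning {
  // Tiny illustrative stopword list (an assumption; real lists are much longer).
  val stopwords: Set[String] =
    Set("a", "an", "and", "the", "is", "of", "to", "with", "has", "been")

  // Lowercase the text, turn every character that is not a-z or whitespace
  // into a space (this removes punctuation and digits), then split on whitespace.
  def tokenize(text: String): List[String] =
    text.toLowerCase
      .replaceAll("[^a-z\\s]", " ")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toList

  // Drop stopwords and single-character tokens.
  def removeStopwords(tokens: Seq[String]): Seq[String] =
    tokens.filter(t => t.length > 1 && !stopwords.contains(t))

  // Keep only words that appear at least `minCount` times across all documents.
  def removeRareWords(docs: Seq[Seq[String]], minCount: Int): Seq[Seq[String]] = {
    val counts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (word, ws) => word -> ws.size }
    docs.map(_.filter(word => counts(word) >= minCount))
  }
}
```

In a Spark job the same functions could be applied per record inside a `map`, with the word counts computed over the whole data set first.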

Tech News Use Case: Word2Vec
A mathematical way to represent a word as a vector of numbers.
These vector ‘representations’ encode information about the
given word. In other words, the vector captures the meaning of
the word.
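To make the idea concrete, here is a minimal sketch with hand-made 3-dimensional vectors. It shows two things: words with related meanings have a high cosine similarity, and a whole blurb can be turned into one feature vector by averaging the vectors of its words. The averaging step is an assumption about how a blurb might be fed to a downstream model, and `toyModel` holds hypothetical numbers, not output from a trained Word2Vec model.

```scala
object WordVectors {
  type Vec = Array[Double]

  // Hypothetical embeddings; a trained Word2Vec model would supply real ones.
  val toyModel: Map[String, Vec] = Map(
    "iphone"  -> Array(0.9, 0.1, 0.0),
    "battery" -> Array(0.8, 0.2, 0.1),
    "funding" -> Array(0.0, 0.1, 0.9)
  )

  // Cosine similarity: close to 1.0 for vectors pointing the same way,
  // close to 0.0 for unrelated (orthogonal) vectors.
  def cosine(a: Vec, b: Vec): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Average the vectors of all known words in a blurb.
  // Unknown words are skipped; None is returned if no word is known.
  def blurbVector(tokens: Seq[String], model: Map[String, Vec]): Option[Vec] = {
    val known = tokens.flatMap(t => model.get(t))
    if (known.isEmpty) None
    else {
      val dim  = known.head.length
      val sums = known.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (x, y) => x + y }
      }
      Some(sums.map(_ / known.size))
    }
  }
}
```

With these toy numbers, "iphone" is much closer to "battery" than to "funding", which is the kind of relationship a trained model is expected to capture.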

Tech News Work Flow
• Train a model: Text blurb → Word2Vec → Word2Vec Model and GBM Model
• Categorize the text: Article Blurb → Word2Vec Model → GBM Model → category
• Example article blurb: “Apple has been tinkering with ways to make the iPhone better at managing battery life…”
• Example result: “This article is related to gadgets”

Category Information
The original data set yielded about 55
categories. In order to streamline the
classification process, we chose the 14
most frequently appearing tags in our
dataset and labeled the rest into a
catch-all category titled “Other.” The
figure to the right shows the
distribution of data in each category.
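The relabeling step described above can be sketched in a few lines of plain Scala. The function name `bucketTags` and the tie-breaking rule are assumptions for illustration; the deck's setting corresponds to `keep = 14`.

```scala
object TagBuckets {
  // Keep the `keep` most frequent tags and relabel every other tag as "Other".
  def bucketTags(tags: Seq[String], keep: Int): Seq[String] = {
    val topTags: Set[String] = tags
      .groupBy(identity)
      .map { case (tag, occurrences) => tag -> occurrences.size }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) } // most frequent first, ties alphabetical
      .take(keep)
      .map(_._1)
      .toSet
    tags.map(tag => if (topTags.contains(tag)) tag else "Other")
  }
}
```

Calling `TagBuckets.bucketTags(allTags, 14)` on the full tag column would leave the 14 most frequent tags untouched and collapse the remaining ones into the catch-all category.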
Category Information
The variable importance chart to the right shows that
the author is by far the most important variable. In other
words, the classification used very little
information from the text samples provided and
relied mostly on authors who frequently write
under the same article tag. Let’s see how this changes
when we try to classify the articles using only the text
samples.

Analysis
The validation confusion matrix below is for the model that used both the authors and text blurbs to
categorize articles. We know that this model placed heavy variable importance on
authors. In the confusion matrix below, we see how this affects the error rate of various tags. For tags
with smaller sets of data, it is common that a few authors write the majority of articles associated with
those tags. For the “Enterprise” tag, for example, the data set is relatively small, and the error rate is
relatively low (40%).
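For readers who want to see how a per-tag error rate is read off a confusion matrix, here is a minimal sketch. It assumes rows are actual tags and columns are predicted tags, so the error rate for a tag is the share of its row that falls off the diagonal. The object name `ConfusionErrors` is hypothetical, and any counts passed in are up to the caller; the deck's actual matrix is not reproduced here.

```scala
object ConfusionErrors {
  // matrix(i)(j) = number of articles whose actual tag is labels(i)
  // and whose predicted tag is labels(j).
  def errorRates(labels: Seq[String], matrix: Seq[Seq[Int]]): Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) =>
      val row     = matrix(i)
      val total   = row.sum
      val correct = row(i) // diagonal entry: correctly classified articles
      val rate    = if (total == 0) 0.0 else (total - correct).toDouble / total
      label -> rate
    }.toMap
}
```

As an illustration with made-up counts, a tag with 10 validation articles of which 6 are classified correctly has an error rate of 4 / 10, i.e. 40%.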

Analysis
The validation confusion matrix below is for the model that uses text blurbs exclusively to categorize
articles. For the “Enterprise” tag, the error rate is now 75%,
significantly higher than the 40% we saw when authors were included in the data. This
shows how heavily the first model relied on the author variable.

Example Classification
With the Scala code below, we provide the author of an article and a
snippet of the article, and try to classify what the article is
about.

Hit Ratios
Charts: With Authors | Without Authors
Hit ratios illustrate the chances of your model correctly categorizing a text blurb on the 1st,
2nd, 3rd, etc. try. The above charts show that both models, with and without
authors, have an approx. 70% chance of correctly categorizing a text blurb by the second try.
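A hit ratio at k can be computed by checking whether the true tag appears among the model's k most probable tags. The sketch below assumes each article comes with a list of tags already ranked from most to least probable; the names `HitRatios` and `hitRatio` are hypothetical and this is not H2O's own implementation.

```scala
object HitRatios {
  // rankedPredictions(i) lists the tags for article i, most probable first.
  // Returns the fraction of articles whose actual tag is among the top k.
  def hitRatio(actual: Seq[String], rankedPredictions: Seq[Seq[String]], k: Int): Double = {
    require(actual.size == rankedPredictions.size, "one ranked list per article")
    if (actual.isEmpty) 0.0
    else {
      val hits = actual.zip(rankedPredictions).count { case (tag, ranked) =>
        ranked.take(k).contains(tag)
      }
      hits.toDouble / actual.size
    }
  }
}
```

Because the top-k list always contains the top-(k-1) list, the hit ratio can only stay the same or grow as k increases, which is why the charts rise toward 100%.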

Possible Use
A possible use for such classification capabilities would be for blog
posting sites. The user would enter their text into the field, and the
classification model would automatically choose tags for the post.

Customers • Community •
November 9, 10, 11
Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity
