AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark

1. Text Analytics
2. AI with Apache Spark
3. AI Experiment with Microsoft ML Studio
By Adj Prof. Giuseppe Mascarella
giuseppe@valueamplify.com
WELCOME TO:
Text Analytics And Social Media Analytics

People are sharing more than ever before…
190 million tweets per day
100 hours of video uploaded to YouTube per minute
500M+ photos uploaded and shared each day
70 billion total pieces of content shared on Facebook each month

Crunching the Text Analytics Numbers
One negative social media comment can cost a company 30 customers.
*Convergys Corp.
A customer review on social media reaches an average audience of 45 people, 67% of whom would avoid or completely stop doing business with a company they heard bad things about.
*Convergys Corp.
44% of customers today complain about products and services on social networks.
*IBM, 2012
Customers are nearly 60% through the sales process before engaging a sales rep, regardless of price point.
*Google and CEB
Whose advice do you seek for guidance when evaluating software and service technology solutions for your business? Share answering "social media", by age:
<25: 56%
25-29: 61%
30-34: 53%
35-40: 52%
41-50: 40%
>50: 34%
*SMB Group, 2010 Small & Medium Business Routes to Market Study, July 2010
Percent of tweeters who don’t tweet, but watch other people tweet: 40%

News travels quickly…
People are more likely to share a bad experience online than a good one.
87% will share with others
33% will share with 5+ people

Social media terminology and tools

Social media terminology
Click through rate

Data Mining Text
1. Law, Ordinances, Court Sentences
2. Citizen data (e.g. https://www.covidimpactmeter.com/)
3. Product news
4. Interviews with industry experts
5. Opinion columns or blogs
6. Tutorials on new technologies or practices
7. Analysis from independent journalists or analysts
1. Vendor-provided information
2. Research on market size
3. Press releases
4. Analysis from peers
5. White papers
6. Vendor news
7. Product reviews or buying guides
8. Forward-looking trends
9. Case studies
10. https://www.statista.com/
11. etc.

REPU.SCORE
1. MICROSOFT BING ENGINE
2. ADD GOOGLE ENGINE
3. CUSTOM
- EXTRACT TEST
- READ
- VERIFY
- COLLECT QUOTES
3. JUNE
START SUPPORT MANUAL DUE DILIGENCE

What Is FOIL?
The Freedom of Information Law (FOIL) promotes the policy of open government. FOIL was
designed to make documents generated by and in the possession of government agencies available
to the public with certain specific exceptions.
• The Freedom of Information Law (“FOIL”), Article 6 (Sections 84-90) of the NYS Public Officers
Law, provides the public right to access records maintained by government agencies with certain
exceptions.
• “Record” means any information kept, held, filed, produced or reproduced by, with, or for this
agency, in any physical form whatsoever including, but not limited to, reports, statements,
examinations, memoranda, opinions, folders, files, books, manuals, pamphlets, forms, papers,
designs, drawings, maps, photos, letters, microfilms, computer tapes or disks, rules, regulations
or codes.
• Learn more about records access and Open FOIL NY.
Federal Level: https://www.foia.gov/

1. Text Analytics
2. AI Text Analytics with Apache Spark
3. AI Experiment with Microsoft ML Studio
WELCOME TO:

APACHE SPARK
• 1. What is Spark?
• 2. Spark MLlib
• 3. Scenario: Text Mining

What Challenge Are We Trying To Solve?
Data Grows Faster than Moore’s Law [IDC report, Kathy Yelick, LBNL]
Data, Machine Learning, Distributed Computing

How Can We Do So At Scale With Massive Data?
1. Start implementing distributed pipelines
2. Optimize which tables you keep in memory, and for how long
3. Understand ML libraries that are not limited to single-threaded computing
4. Scale out (distributed, e.g., cloud-based) to more hardware to store and process modern data
5. Leverage commodity hardware pricing, which scales to massive problems
6. Process streaming data in real time
While Increasing Speed

1. What is Spark?
Apache Spark is a fast, general-purpose cluster computing system with an in-memory data processing engine, suited for modern data-centric applications.
APIs: Scala, Python, R, Java
Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Engine: Spark Core
Cluster managers: Apache Mesos, Hadoop YARN, Spark Standalone
Storage: S3, HDFS, Azure, …

What Spark?
Free software: spark.apache.org
Apache Spark is 100% open source, hosted at the vendor-independent Apache Software Foundation.
“..our efforts focuses on both the Spark codebase and support. All of our work on Spark is open source and goes directly to Apache.”
Matei Zaharia, VP, Apache Spark, Founder & CTO, Databricks
Early Adopters
• The Washington Post uses Spark in its recommendation engine for content to its readers.
• Yelp, in connecting users with local businesses, uses Spark to increase the click-through rates of display advertisements.
• Hearst Corporation uses Spark Streaming to process clickstream data from over 200 web properties, for a real-time view of article performance and trending topics.
• Others include Uber, Pinterest, Conviva, and Yahoo.

Spark Infrastructure
Cluster deployment with built-in interoperability with Hadoop YARN, Apache Mesos, file systems, etc.
Spark Core defines the RDD data abstraction and provides in-memory computing capabilities.

How Does The Core Work?
Parallelizes Distributed Data Processing
Shuffle
DAGScheduler, TaskScheduler, SchedulerBackend
Resilient Distributed Dataset (and DataFrames)
In-Memory, Partitioned, Cacheable, Parallel, Typed, Lazily Evaluated

2. Spark MLlib
Machine Learning in a Distributed Computing Environment with Petabytes of Data
• org.apache.spark.mllib: the RDD-based API
• org.apache.spark.ml: the DataFrame-based Pipeline API for designing, evaluating, and tuning machine learning pipelines

ML Algorithms with RDD
• Classification: logistic regression, naive Bayes,...
• Regression: generalized linear regression, isotonic regression,...
• Decision trees, random forests, and gradient-boosted trees
• Recommendation: alternating least squares (ALS)
• Clustering: K-means, Gaussian mixtures (GMMs),...
• Topic modeling: latent Dirichlet allocation (LDA)
• Feature transformations: standardization, normalization, hashing,...
• Model evaluation and hyper-parameter tuning
• ML Pipeline construction
• ML persistence: saving and loading models and Pipelines
• Survival analysis: accelerated failure time model
• Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
• Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA),...
• Statistics: summary statistics, hypothesis testing,...
https://spark-packages.org/?q=tags%3A%22Machine Learning%22

3. Scenario: Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it as a scientific or non-scientific one.
Goal: Find the decision boundary, the "model".
Classes: Scientific, Non-Scientific
Schema for Categories
sealed trait Category
case object Scientific extends Category
case object NonScientific extends Category
Model
$\sum_{k=0}^{n} \binom{n}{k} x^{k}$
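As a rough illustration of what "finding the decision boundary" means for binary logistic regression, the sketch below scores a document from its feature values and compares the resulting probability to a threshold. The object name, the weights, and the 0.5 threshold are illustrative assumptions; in the Spark scenario the weights are learned by the library during fitting.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function: maps any real-valued score to a probability in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of the "Scientific" class: sigmoid of the weighted feature sum plus a bias.
  // The weights and bias are hypothetical here; a trained model would supply them.
  def probability(features: Seq[Double], weights: Seq[Double], bias: Double): Double =
    sigmoid(features.zip(weights).map { case (x, w) => x * w }.sum + bias)

  // The decision boundary is the set of points where the probability equals the threshold.
  def isScientific(features: Seq[Double], weights: Seq[Double], bias: Double,
                   threshold: Double = 0.5): Boolean =
    probability(features, weights, bias) >= threshold
}
```

A document whose weighted score is positive lands on the "Scientific" side of the boundary, and one with a negative score lands on the other side.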

MLlib Pipeline
Industry          | Spark             | Our Txt Scenario
Load data         | Dataset           | Create a Dataset
Engineer Features | Transformers      | Tokenizers, Hash Term
Train Model       | Fit               | Logistic Regression
Scoring           | Evaluate and Tune | Evaluate
Offer Service     | PMML/Livy         | PMML/Livy REST API
Pipeline(Stages=T,H,Lr)
Label: Obj S/N-Sc
Text: String
Words: Seq[String]
Features: Vector
Predictions: Obj S/N-Sc

Definition
• Tokenization, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value. The token is a reference (i.e. identifier) that maps back to the sensitive data through a tokenization system. In text analytics, by contrast, tokenization means splitting a text into smaller units (tokens) such as words, which is what the Tokenizer transformer in the pipeline does.
• Hashes are used to index data. Hash values can be used to map data to individual "buckets" within a hash table. Each bucket has a unique ID that serves as a pointer to the original data. This creates an index that is significantly smaller than the original data, allowing the values to be searched and accessed more efficiently.
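To make the two definitions concrete for text, here is a minimal sketch in plain Scala (no Spark dependency) that splits a sentence into word tokens and then maps each token to a bucket index. The object and function names, the regular expression, and the use of String.hashCode are assumptions made for illustration; Spark's own Tokenizer and HashingTF use their own implementations.

```scala
object TokenizeSketch {
  // Split text into lowercase word tokens on any run of non-word characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Map a token to one of numBuckets bucket indices via its hash code.
  // floorMod keeps the index non-negative even when hashCode is negative.
  def bucket(token: String, numBuckets: Int): Int =
    Math.floorMod(token.hashCode, numBuckets)

  // Count how many tokens fall into each bucket, giving a fixed-length vector
  // regardless of how long the input text is.
  def hashedCounts(text: String, numBuckets: Int): Vector[Int] = {
    val counts = Array.fill(numBuckets)(0)
    tokenize(text).foreach(t => counts(bucket(t, numBuckets)) += 1)
    counts.toVector
  }
}
```

For example, TokenizeSketch.tokenize("Spark is fast, Spark scales!") yields the five tokens spark, is, fast, spark, scales, and hashedCounts with 8 buckets always returns a vector of length 8.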

Transformer: Tokenizer
SCALA 101
Dataset, schema (text) → transform → Dataset, schema (words)
1. Import the library
   import org.apache.spark.ml.feature.RegexTokenizer
2. Define the object
   val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
3. Call the transform method
   val tokenized = tokenizer.transform(training)
Modeling
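Dataset → Model
1. Import the libraries
   import org.apache.spark.ml.Pipeline
   import org.apache.spark.ml.classification.LogisticRegression
   import org.apache.spark.ml.feature.HashingTF
2. Instantiate the algorithm objects
   // hashingTF is not defined on the slide; this definition and its column names are assumptions
   val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
   val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
3. Create the pipeline (70%)
   val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
4. Train the model
   val model = pipeline.fit(training)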
https://www.linkedin.com/pulse/apache-spark-20-just-released-join-our-workshop-your-mascarella?trk=prof-post

Data In the Social Mission Control Room
Social Engagement
1. Needs
2. Solution Example
3. How

Why Do You Need To Monitor Conversations Worldwide In A Timely Manner?
1. People issues and troubles can quickly go viral
2. Influencers are difficult to identify and prioritize
3. Trends and opportunity signals get lost in the noise

Sentiment Analysis Addresses:
1. People, topic & event sentiment
Gain insight and learn what people really think and feel about them.
2. Campaign monitoring
Measure the effectiveness of remediation campaigns in real time.
3. Top influencer tracking
Identify and monitor the top influencers in your industry, company, and customer base.

Social care
Global sentiment analysis
Track social care cases in Spanish, English and
Portuguese.
Real-time social case resolution
Create alerts to quickly identify customer issues,
and identify trends early on.
Integrated channel management
Across social and web channels (e.g., Twitter, Facebook, blogs, and boards).

Social Alert Management
Competitive intelligence
Gain important insights about your competitors’
weaknesses and strengths.
Target account tracking
Monitor key developments and decision makers
at your top accounts.
Unattended alerting
Create rule-based real-time alert agents and generate templated PR actions.

Data In the Social Mission Control Room
Social Engagement
1. Needs
2. Solution Example
3. How

What Is Media Monitoring
• Metrics of social audience
• Measure sentiment of the audience on characters
• Browse conversations

A. What Is Sentiment Analysis?
Finding information driven by opinion mining or emotion AI, to understand the voice of the crowd through reviews and survey responses, online and on social media.
B. Where Is It Used?
Marketing, branding, entertainment, customer service, clinical medicine, politics, finance, etc.
C. What Are The Mechanics of It?
• NLP, text/document-level classification methods and software tools
• Positive/Negative/Neutral (+/-1)
• Subjective -> Objective
• Tailored for short texts
• Handles Twitter jargon: RT, @, #, spelling errors, disfluencies
• Entity-level sentiment

• Real-time visualizations of Twitter conversations
• Overall sentiment of the audience on each player / character
Sentiment Analysis: From Volume Analysis To Quality Analysis
$45M/Year
Use Case: Company FY/Quarter Result Release
Produced by Value Amplify - Confidential

The Mechanics Of Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it as “Positive” or “Negative” in sentiment analysis.
Goal: Find the decision boundary, the "model".
Negative
Positive
Model
Entities/Channels
SentimentAnalysisIndex

Social Mission Control Room
Social Engagement
1. Needs
2. Solution Samples
3. How
By Prof. Giuseppe Mascarella
Cell: 425 269 5478

You can single out an influencer's point of view in one channel and reply with information from other channels.

The Business Problem (Microsoft Case Study)
Elena works for an Internet-based retailer selling DVDs, software, video games, toys, electronics, and furniture.
The company shows customer feedback at the product level. Her task is to build a pipeline that automatically analyzes customer feedback and Twitter messages to provide the overall sentiment for each product.
The aim (label, target) is to help consumers who, before purchasing a product, want to understand whether public opinion of it (previous reviews or best-seller status) has been positive (= 4).
How Do You Use AI For This?
- Training data
- Target (label): review (not best seller)
- Algorithm

1. Text Analytics
2. AI with Apache Spark
3. AI Sentiment Analysis Experiment with Microsoft ML Studio
WELCOME TO:

Data Set
• The open source data comprises approximately 1,600,000 automatically annotated tweets: http://help.sentiment140.com/
• Any tweet containing a positive emoticon such as :), :-), :D or =D was assumed to bear positive sentiment.
• Any tweet with a negative emoticon such as :<, :-( or :( was assumed to bear negative polarity.
• Tweets containing both positive and negative emoticons are a problem, because this rule gives them conflicting labels.
• Dataset to be used:
http://azuremlsamples.azureml.net/templatedata/Text - Input.csv
Variables (Label, Features, Text Column)
1. sentiment_label - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); this is the dependent variable (label)
2. tweet_id - the id of the tweet
3. time_stamp - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. target - the query (lyx). If there is no query, then this value is NO_QUERY.
5. user_id - the user who posted the tweet
6. tweet_text - the text of the tweet

What Algorithm Is Best?
• The bag-of-words vector representation model is commonly used for text classification. In this method, the frequency of occurrence of each word, or term frequency (TF), is multiplied by the inverse document frequency (IDF), and the TF-IDF scores are used as feature values for training a classifier.
• The n-gram model is another common vector representation model. Note that there is no conclusive answer as to which one works best.
• In steps 5A and 5B, the most accurate model is deployed as a published web service, using either RRS (Request Response Service) or BES (Batch Execution Service).
With RRS, only one text instance is classified at a time.
With BES, a batch of text instances can be sent for classification at the same time.
By using these web services, you can perform classification in parallel, using either an external worker or the Azure Data Factory, for greatly enhanced efficiency.

Text pre-processing
Text cleaning:
- replacing special characters and punctuation marks with spaces,
- normalizing case, removing duplicate characters, removing user-defined or built-in stop-words, and word stemming.
Highly custom steps are implemented using the R programming language.
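The cleaning steps listed above can be sketched as a small Scala function. The experiment itself implements them in ML Studio modules and R, so this is only an equivalent illustration: the object name, the tiny stop-word list, and the rule "collapse runs of three or more identical characters to two" are assumptions, and stemming is left out to keep the sketch short.

```scala
object CleanTextSketch {
  // A tiny illustrative stop-word list (an assumption, not the module's built-in list).
  val stopWords: Set[String] = Set("a", "an", "the", "is", "are", "and", "or", "to", "of")

  def clean(text: String): Seq[String] = {
    // Normalize case.
    val lowered = text.toLowerCase
    // Replace special characters and punctuation marks with spaces.
    val spaced = lowered.replaceAll("[^a-z0-9\\s]", " ")
    // Collapse runs of three or more identical characters down to two ("sooooo" -> "soo").
    val deduped = spaced.replaceAll("(.)\\1{2,}", "$1$1")
    // Split on whitespace and drop stop-words.
    deduped.split("\\s+").filter(_.nonEmpty).filterNot(stopWords).toSeq
  }
}
```

With these assumptions, clean("The movie is sooooo GOOD!!!") keeps the tokens movie, soo, and good.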

• Feature hashing converts variable-length text documents to equal-length numeric feature vectors, using the 32-bit murmurhash v3 hashing method provided by the Vowpal Wabbit library.
• The objective of feature hashing is dimensionality reduction. It also makes the lookup of feature weights faster at classification time, because it uses hash value comparison instead of string comparison.
• Hashing bitsize is used to specify the number of bits to use
when creating the hash table.
• The default bit size is 10. For many problems, this value is more than adequate, but whether it suffices for your data depends on the size of the n-gram vocabulary in the training text. With a large vocabulary, more space might be needed to avoid collisions.
• We recommend that you try using a different number of bits
for this parameter, and evaluate the performance of the
machine learning solution.
• For N-grams, type a number that defines the maximum length
of the n-grams to add to the training dictionary. An n-gram is a
sequence of n words, treated as a unique unit.
• N-grams = 1: Unigrams, or single words.
• N-grams = 2: Bigrams, or two-word sequences, plus unigrams.
• N-grams = 3: Trigrams, or three-word sequences, plus bigrams and
unigrams.
• In this experiment, set the hashing bits to 15 and the number of n-grams to 2. With these settings, the hash table can hold 2^15 or 32,768 entries, in which each hashing feature represents one or more n-gram features and its value represents the occurrence frequency of that n-gram in the text instance.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
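The effect of the two parameters (hashing bits and n-grams) can be sketched in plain Scala. This is an illustration of the hashing trick under stated assumptions, not the ML Studio module: it uses the MurmurHash3 implementation from the Scala standard library, which is not guaranteed to produce the same indices as Vowpal Wabbit, and the object and function names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // All n-grams of length 1..maxN, each joined with a single space.
  // With maxN = 2 this yields unigrams plus bigrams, as described on the slide.
  def ngrams(tokens: Seq[String], maxN: Int): Seq[String] =
    for {
      n <- 1 to maxN
      gram <- tokens.sliding(n).filter(_.size == n).toSeq
    } yield gram.mkString(" ")

  // Hash every n-gram into one of 2^bits slots and count occurrences per slot.
  def hashFeatures(tokens: Seq[String], bits: Int = 15, maxN: Int = 2): Array[Int] = {
    val size = 1 << bits
    val vector = new Array[Int](size)
    ngrams(tokens, maxN).foreach { g =>
      // Keep only the low `bits` bits of the 32-bit hash as the slot index.
      val index = MurmurHash3.stringHash(g) & (size - 1)
      vector(index) += 1
    }
    vector
  }
}
```

Every document, short or long, becomes a vector of exactly 2^15 = 32,768 counts. Two different n-grams can land in the same slot (a collision), which is why a larger vocabulary may call for more bits.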

Reduce Computational Complexity
The classification time and complexity of a trained model depend on the number of features (the dimensionality of the input space). For a linear model, such as a support vector machine, the complexity is linear with respect to the number of features.
For text classification tasks, the number of features resulting from feature extraction is high, because each word in the vocabulary and each n-gram is mapped to a feature.
To select a more compact feature subset from the exhaustive list of extracted hashing features, we used the Filter Based Feature Selection module. The aim is to avoid the effects of the curse of dimensionality and to reduce the computational complexity without harming classification accuracy.
To get the top 5,000 most relevant features with respect to the sentiment label out of the 2^15 extracted features, use the Chi-squared score function to rank the hashing features in descending order.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
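As a sketch of chi-squared ranking, the code below scores one feature from a 2x2 table of "feature present or absent" against "positive or negative label", then keeps the top-k features by score. Treating each feature as simply present or absent is a simplifying assumption made here for illustration; the Filter Based Feature Selection module's exact computation may differ, and all names are hypothetical.

```scala
object ChiSquaredSketch {
  // Chi-squared statistic for a 2x2 contingency table:
  //   a = feature present, label positive    b = feature present, label negative
  //   c = feature absent,  label positive    d = feature absent,  label negative
  def chiSquared(a: Long, b: Long, c: Long, d: Long): Double = {
    val n = (a + b + c + d).toDouble
    val denom = (a + b).toDouble * (c + d) * (a + c) * (b + d)
    val diff = (a * d - b * c).toDouble
    // A zero denominator means an empty row or column; score such features as 0.
    if (denom == 0.0) 0.0 else n * diff * diff / denom
  }

  // Rank features by score in descending order and keep the k best.
  def topK(scores: Map[String, Double], k: Int): Seq[String] =
    scores.toSeq.sortBy { case (_, score) => -score }.take(k).map(_._1)
}
```

A feature that occurs equally often in both classes scores 0, while a feature that occurs only in one class gets the highest score, so it survives the cut to the top 5,000.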

TF-IDF Calculation
• When the word frequency of occurrence (TF) in a document is used as a feature value, a higher weight tends to be assigned to words that appear frequently in a corpus (such as stop-words).
• The Inverse Document Frequency (IDF) is a better metric, because it assigns a lower weight to frequent words. You calculate IDF as the log of the ratio of the number of documents in the training corpus to the number of documents containing the given word.
• Combining these numbers in a metric (TF-IDF) places greater importance on words that are frequent in the document but rare in the corpus. This assumption is valid not only for unigrams but also for bigrams, trigrams, etc.
• This experiment converts unstructured text data into equal-length numeric feature vectors where each feature represents the TF-IDF of a unigram in a text instance.
• Feature selection (dimensionality reduction): We used the Chi-squared score function to rank the hashing features in descending order, and returned the top 5,000 most relevant features with respect to the sentiment label, out of all the extracted unigrams.
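The calculation described above can be written out directly. The sketch below follows the slide's definition (IDF as the natural log of total documents over documents containing the term, with no smoothing), which is an assumption about the exact variant; libraries often add smoothing terms. The object and function names are hypothetical.

```scala
object TfIdfSketch {
  // Term frequency: how many times each term occurs in one tokenized document.
  def tf(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // Inverse document frequency: log(N / number of documents containing the term).
  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size.toDouble
    corpus.flatMap(_.distinct).groupBy(identity).map {
      case (term, docsWithTerm) => term -> math.log(n / docsWithTerm.size)
    }
  }

  // TF-IDF for one document; terms unseen in the corpus get a score of 0.
  def tfIdf(doc: Seq[String], idfs: Map[String, Double]): Map[String, Double] =
    tf(doc).map { case (term, count) => term -> count * idfs.getOrElse(term, 0.0) }
}
```

A word such as "the" that appears in every document gets IDF log(1) = 0, so its TF-IDF is 0 no matter how often it occurs, while a word concentrated in a few documents keeps a positive weight.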

Use the first Split module to split the data into two subsets:
1. The first will be used to train the model.
2. The second will be split in the next step into a development/validation set and a test set. In the sample experiment, we split the data into 70% and 30% respectively.
Use the second Split module to split the data into two subsets:
1. The first is used later by the Sweep Parameters module.
2. The second is used as a test set to evaluate the performance of the trained model. In the sample experiment, we split the 30% data sample into two halves. That is, the development set and the test set each represent 15% of the input data.
Use the Sweep Parameters module to get the optimal values for the underlying learning algorithm parameters. There are two options to try:
1. Random sweep, where the module conducts a number of training runs (specified by the parameter 'Maximum number of runs on random sweep') with values drawn from the parameter ranges.
2. Entire grid, a parameter sweeping mode that explores all possible values for each parameter as specified in the learning algorithm module, such as the Two-Class Logistic Regression module. In the sample experiment, AUC is specified as the metric for measuring classification performance. Other performance evaluation criteria, such as precision, recall and F-score, can be used for model selection.
For binary-class classification tasks, you can either keep the Two-Class Logistic Regression module, or you can replace it with another binary-class classification trainer, such as Two-Class Support Vector Machine, Two-Class Boosted Decision Tree, etc.
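The two Split modules amount to a 70% / 15% / 15% partition of the data. A minimal sketch of that partition in plain Scala follows; the object name, the shuffle, and the fixed seed are assumptions added for reproducibility, and the ML Studio module offers its own options (for example stratified splits) that are not modeled here.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle the data, then cut it into train / development / test sets.
  // The default fractions mirror the sample experiment: 70%, 15%, and the remaining 15%.
  def split[A](data: Seq[A], trainFrac: Double = 0.70, devFrac: Double = 0.15,
               seed: Long = 42L): (Seq[A], Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    val trainEnd = math.round(data.size * trainFrac).toInt
    val devEnd = trainEnd + math.round(data.size * devFrac).toInt
    (shuffled.take(trainEnd), shuffled.slice(trainEnd, devEnd), shuffled.drop(devEnd))
  }
}
```

The development set is what a parameter sweep would be scored on, and the test set is held back until the end so that the final performance estimate is not influenced by tuning.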

Microsoft ML Studio Experiment
ML Experiment Documentation
ML Experiment:
https://studio.azureml.net/community/unpack?packageUri=https%3a%2f%2fstorage.azureml.net%2fdirectories%2fdc2f564b4c6c4c9d825836eae21f8751%2fitems&communityUri=https%3a%2f%2fgallery.azure.ai%2fDetails%2ftext-classification-step-1-of-5-data-preparation-3&entityId=Text-Classification-Step-1-of-5-data-preparation-3
