1. Text Analytics
2. AI with Apache Spark
3. AI Experiment with
Microsoft ML Studio
By Adj Prof. Giuseppe Mascarella
giuseppe@valueamplify.com
WELCOME TO:
Text Analytics And Social Media Analytics
People are sharing more than ever before…
190 million tweets per day
100 hours of video uploaded to
YouTube per minute
500M+ photos uploaded and
shared each day
Total pieces of content shared on
Facebook each month: 70 billion
Crunching the Text Analytics Numbers
One negative social media
comment can cost a company
30 customers
*Convergys corp. A customer review on
social media reaches an
average audience of 45
people…
67% of whom
would avoid or
completely stop doing
business with a
company they
heard bad things about.
*Convergys corp.
44%
Of customers today complain about
products and services on social
networks
*IBM, 2012
Customers are nearly 60%
through the sales process before
engaging a sales rep, regardless of
price point.
*Google and CEB
Whose advice do you seek for
guidance when evaluating software
and service technology solutions for
your business? Social media?
<25 56%
25-29 61%
30-34 53%
35-40 52%
41-50 40%
>50 34%
*SMB group 2010 small & medium business
routes to market study, July 2010
Percent of Twitter users
who don't tweet
but watch other people
tweet: 40%
News travels quickly…
People are more likely to share a bad experience online vs. a good
one.
87% will share
with others
33% will share
with 5+ people
Social media
terminology
and tools
Social media terminology
Click through rate
Data Mining Text
1. Law, Ordinances, Court Sentences
2. Citizen data (i.e.
https://www.covidimpactmeter.com/)
3. Product news
4. Interviews with industry experts
5. Opinion columns or blogs
6. Tutorials on new technologies or
practices
7. Analysis from independent
journalists or analysts
1. Vendor-provided information
2. Research on market size
3. Press Release
4. Analysis from peers
5. White papers
6. Vendor news
7. Product reviews or buying guides
8. Forward-looking trends
9. Case studies
10. https://www.statista.com/
11. etc
REPU.SCORE
1.MICROSOFT BING
ENGINE
2. ADD GOOGLE
ENGINE
3. CUSTOM
-EXTRACT TEST,
-READ
-VERIFY
-COLLECT QUOTES
3. JUNE
START SUPPORT
MANUAL DUE
DILIGENCE
What Is FOIL?
The Freedom of Information Law (FOIL) promotes the policy of open government. FOIL was
designed to make documents generated by and in the possession of government agencies available
to the public with certain specific exceptions.
• The Freedom of Information Law (“FOIL”), Article 6 (Sections 84-90) of the NYS Public Officers
Law, provides the public right to access records maintained by government agencies with certain
exceptions.
• “Record” means any information kept, held, filed, produced or reproduced by, with, or for this
agency, in any physical form whatsoever including, but not limited to, reports, statements,
examinations, memoranda, opinions, folders, files, books, manuals, pamphlets, forms, papers,
designs, drawings, maps, photos, letters, microfilms, computer tapes or disks, rules, regulations
or codes.
• Learn more about records access and Open FOIL NY.
Federal Level: https://www.foia.gov/
1. Text Analytics
2. AI Text Analytics with Apache Spark
3. AI Experiment with Microsoft ML Studio
APACHE SPARK
• 1. What is Spark?
• 2. Spark MLlib
• 3. Scenario: Text Mining
What Challenge Are We Trying To Solve?
Data Grows Faster
than Moore’s Law
[IDC report, Kathy
Yelick, LBNL]
Data
Machine
Learning
Distributed
Computing
How Can We Do So At Scale With Massive
Data?
1. Start implementing distributed pipelines
2. Optimize which tables you keep in memory, and for how long
3. Understand ML libraries that are not single-threaded
4. Use more hardware to store and process modern data: scale out
(distributed, e.g., cloud-based)
5. Leverage commodity hardware pricing, which scales to massive problems
6. Real-time streaming data processing
While
Increasing
Speed
1. What is Spark?
Apache Spark is a fast, general-purpose cluster
computing system
with an in-memory data processing engine,
suited for modern data-centric applications.
APIs
Spark Core
Spark
SQL
Spark
Streaming
MLlib GraphX
Apache
Mesos
Hadoop
YARN
Spark
Standalone
SCALA
Python
R, Java
S3
HDFS
Azure
…
What Is Spark?
Free software: spark.apache.org
Apache Spark is 100% open source, hosted at the vendor-independent Apache
Software Foundation
“..our efforts focuses on both the Spark
codebase and support.
All of our work on Spark is open source
and goes directly to Apache.”
Matei Zaharia, VP, Apache Spark,
Founder & CTO, Databricks
Early Adopters
• The Washington Post: recommendation engine for serving content to its readers.
• Yelp: in connecting users with local businesses, uses Spark to increase the
click-through rates of display advertisements.
• Hearst Corporation: uses Spark Streaming to process clickstream data from over
200 web properties, for a real-time view of article performance and trending topics.
• Uber, Pinterest, Conviva, and Yahoo, among others
Spark Infrastructure
Cluster deployment with built-in interoperability with Hadoop YARN, Apache
Mesos, file systems, etc.
Spark Core defines RDD data abstraction and provides in-memory
computing capabilities.
Appendix
How Does The Core Work?
Parallelizes Distributed Data Processing
Shuffle
DAGScheduler, TaskScheduler, SchedulerBackend
Resilient Distributed Dataset (and DataFrames)
In-Memory, Partitioned, Cacheable, Parallel, Typed, Lazy Evaluated
2. Spark MLlib
Machine Learning in a Distributed Computing Environment with Petabytes of Data
• org.apache.spark.mllib: the RDD-based API
• org.apache.spark.ml: the DataFrame-based Pipeline API for designing, evaluating,
and tuning machine learning pipelines
ML Algorithms with RDD
• Classification: logistic regression, naive Bayes,...
• Regression: generalized linear regression, isotonic
regression,...
• Decision trees, random forests, and gradient-
boosted trees
• Recommendation: alternating least squares (ALS)
• Clustering: K-means, Gaussian mixtures
(GMMs),...
• Topic modeling: latent Dirichlet allocation (LDA)
• Feature transformations: standardization,
normalization, hashing,...
• Model evaluation and hyper-parameter tuning
• ML Pipeline construction
• ML persistence: saving and loading models and
Pipelines
• Survival analysis: accelerated failure time model
• Frequent itemset and sequential pattern mining:
FP-growth, association rules, PrefixSpan
• Distributed linear algebra: singular value
decomposition (SVD), principal component
analysis (PCA),...
• Statistics: summary statistics, hypothesis
testing,...
https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22
3. Scenario: Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it,
as a scientific or non-scientific one.
Goal: Find the decision boundary, the "model".
Scientific
Non
Scientific
Model
Schema for Categories
sealed trait Category
case object Scientific extends Category
case object NonScientific extends Category
$\sum_{k=0}^{n} \binom{n}{k} x^{k}$
MLlib Pipeline
Industry          | Spark             | Our Txt Scenario
Load data         | Dataset           | Create a Dataset
Engineer Features | Transformers      | Tokenizers, Hash Term
Train Model       | Fit               | Logistic Regression
Scoring           | Evaluate and Tune | Evaluate
Offer Service     | PMML/Livy         | PMML/Livy REST API
Pipeline(Stages=T,H,Lr)
Label: Obj S/N-Sc
Text: String
Words: Seq[String]
Features: Vector
Predictions: Obj S/NSc
Definition
• Tokenization, when applied to data security, is the process of
substituting a sensitive data element with a non-sensitive
equivalent, referred to as a token, that has no extrinsic or
exploitable meaning or value. The token is a reference (i.e.,
identifier) that maps back to the sensitive data through a
tokenization system. In text analytics, by contrast, tokenization
means splitting text into words (tokens), which is what the
Tokenizer stage of the pipeline does.
• Hashes are used to index data. Hash values can be used to map
data to individual "buckets" within a hash table. Each bucket has a
unique ID that serves as a pointer to the original data. This creates
an index that is significantly smaller than the original data,
allowing the values to be searched and accessed more efficiently.
In the pipeline, the hashing stage maps each word to an index in a
fixed-length feature vector.
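Transformer: Tokenizer
SCALA 101
1. Import Library
2. Define Object
3. Transform (method)
Dataset with schema (text) -> transform -> Dataset with schema (words)
// 1. Import the library
import org.apache.spark.ml.feature.RegexTokenizer
// 2. Define the tokenizer; column names follow the pipeline schema (text -> words)
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
// 3. Transform the training Dataset, which adds the "words" column
val tokenized = tokenizer.transform(training)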
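Modeling
1. Import Library
2. Instantiate Algorithm Object
3. Create Pipeline
(70%)
4. Train Model
Dataset -> Model
// 1. Import the libraries
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
// 2. Instantiate the algorithm objects (parameters are set through setters)
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
// 3. Create the pipeline: tokenizer, then hashing, then logistic regression
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// 4. Train the model on the training Dataset
val model = pipeline.fit(training)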
https://www.linkedin.com/pulse/apache-spark-20-just-released-join-our-workshop-your-mascarella?trk=prof-post
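The Tokenizer, HashingTF, and Logistic Regression stages above can be imitated in a few lines of plain Python to see what each stage contributes. This is only a minimal sketch under stated assumptions, not Spark's implementation: the function names, the toy training sentences, the 1,024-bucket vector size, the CRC32 hash, and the gradient-descent settings are all illustrative choices that do not come from the slides.

```python
import math
import re
import zlib

NUM_FEATURES = 1024  # assumed vector size for this toy example


def tokenize(text):
    """Stage T: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashing_tf(words, num_features=NUM_FEATURES):
    """Stage H: hash each token to a bucket and count term frequencies."""
    vector = [0.0] * num_features
    for word in words:
        # CRC32 is a stand-in hash chosen for this sketch.
        vector[zlib.crc32(word.encode("utf-8")) % num_features] += 1.0
    return vector


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic_regression(rows, labels, max_iter=100, learning_rate=0.5):
    """Stage Lr: batch gradient descent on the logistic loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(max_iter):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = score(weights, bias, x) - y
            grad_b += error
            for i, xi in enumerate(x):
                if xi:
                    grad_w[i] += error * xi
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(model, text):
    """Run a raw text through all three stages; 1 = scientific, 0 = non-scientific."""
    weights, bias = model
    return 1 if score(weights, bias, hashing_tf(tokenize(text))) >= 0.5 else 0


# Hypothetical toy training set: (text, label).
training = [
    ("the experiment measured particle energy", 1),
    ("quantum physics lab results", 1),
    ("celebrity gossip and fashion news", 0),
    ("football match highlights tonight", 0),
]
features = [hashing_tf(tokenize(text)) for text, _ in training]
model = fit_logistic_regression(features, [label for _, label in training])
```

The learned weights and bias play the role of the decision boundary: a document is labeled scientific when its hashed word counts land on the positive side of it.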
Data In the Social Mission Control Room
Social
Engagement
1.Needs
2.
Solution
Example
3 How
Why Do You Need To Monitor Conversations Worldwide In A Timely Manner?
1. People and troubling issues can quickly go viral
2. Influencers are difficult to identify and prioritize
3. Trends and opportunity signals get lost in the noise
Sentiment Analysis Addresses:
1. People, Topic & Event sentiment
Gain insight and learn what people really think
and feel about it.
2. Campaign monitoring
Measure the effectiveness of remediation
campaigns in real time.
3. Top influencer tracking
Identify and monitor the top influencers in your
industry, company, and customer base.
Social care
Global sentiment analysis
Track social care cases in Spanish, English and
Portuguese.
Real-time social case resolution
Create alerts to quickly identify customer issues,
and identify trends early on.
Integrated channel management
Across the social and web (e.g., Twitter,
Facebook, Blogs, and boards).
Social Alert Management
Competitive intelligence
Gain important insights about your competitors’
weaknesses and strengths.
Target account tracking
Monitor key developments and decision makers
at your top accounts.
Unattended alerting
Create rule-based real-time alert agents and
generate templated PR actions
Data In the Social Mission Control Room
Social
Engagement
1.Needs
2. Solution
Example
3 How
What Is Media Monitoring
• Metrics of social audience
• Measure sentiment of the audience on characters
• Browse conversations
• NLP, text/document-level classification methods and software tools
• Positive/Negative/Neutral (+/-1)
• Subjective -> Objective
• Tailored for short texts
• Handles Twitter jargon: RT, @, #, spelling errors, disfluencies
• Entity-level sentiment
A. What Is Sentiment Analysis?
Finding information through opinion mining or emotion AI
to understand the voice of the crowd through reviews and
survey responses, online and on social media.
B. Where Is It Used?
Marketing, Branding, Entertainment, Customer service, Clinical
medicine, Politics, Finance, etc.
C. What Are The Mechanics of It?
• Real-time visualizations of Twitter conversations
• Overall sentiment of the audience on each player / character
Sentiment Analysis:
From Volume Analysis To Quality Analysis
$45M/Year
Use Case: Company FY/Quarter Result Release
Produced by Value Amplify - Confidential
The Mechanics Of Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it, as a “Positive” in Sentiment Analysis
Goal: Find the decision boundary, the "model".
Negative
Positive
Model
Entities/Channels
SentimentAnalysisIndex
Social Mission Control Room
Social
Engagement
1.Needs
2.Solution
Samples
3 HOW
By Prof. Giuseppe Mascarella
Cell: 425 269 5478
Willa Preston
You can single out an influencer's point of view in one channel and
reply with info from other channels
The Business Problem (Microsoft Case Study)
Elena works for an Internet-based retail company selling DVDs, software,
video games, toys, electronics, and furniture.
The company shows customer feedback at the product level. Her task is to
build a pipeline that automatically analyzes customer feedback and Twitter
messages, to provide the overall sentiment for each product.
The aim (Label, Target) is to help consumers who, before purchasing a
product, want to understand whether public opinion (previous reviews or
best-seller status) has given it a positive (= 4) review.
How Do You Use AI For This?
-Training Data
-Target (label) : Review (not best seller)
-Algorithm
1. Text Analytics
2. AI with Apache Spark
3. AI Sentiment Analysis Experiment with Microsoft ML Studio
Data Set
• The open source data comprises approximately 1,600,000
automatically annotated tweets: http://help.sentiment140.com/
• Any tweet containing a positive emoticon such as :), :-), :D or =D was
assumed to bear positive sentiment.
• Any tweet containing a negative emoticon such as :<, :-( or :( was
assumed to bear negative polarity.
• Tweets containing both positive and negative emoticons are a
problem, because the two rules contradict each other.
• Dataset to be used:
http://azuremlsamples.azureml.net/templatedata/Text - Input.csv
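The emoticon-based labeling rule described above can be sketched as follows. This is a minimal illustration, not the code used to build the Sentiment140 data: the function name `label_tweet` and the choice to return `None` for ambiguous or emoticon-free tweets are assumptions, while the label values 0 and 4 follow the sentiment_label convention of the dataset.

```python
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=D")
NEGATIVE_EMOTICONS = (":<", ":-(", ":(")

POSITIVE_LABEL = 4  # sentiment_label value for positive tweets
NEGATIVE_LABEL = 0  # sentiment_label value for negative tweets


def label_tweet(text):
    """Return 4, 0, or None (no emoticon, or both kinds present)."""
    has_positive = any(emoticon in text for emoticon in POSITIVE_EMOTICONS)
    has_negative = any(emoticon in text for emoticon in NEGATIVE_EMOTICONS)
    if has_positive and has_negative:
        return None  # ambiguous: the two rules contradict each other
    if has_positive:
        return POSITIVE_LABEL
    if has_negative:
        return NEGATIVE_LABEL
    return None  # no emoticon, so the rule cannot label this tweet


tweets = [
    "great day at the beach :)",
    "missed the bus again :(",
    "happy :) then sad :(",
    "no emoticon here",
]
labels = [label_tweet(tweet) for tweet in tweets]
```

Returning `None` makes the problematic tweets explicit, so a later step can decide whether to drop them or handle them separately.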
Dataset Columns (Label and Features)
1. sentiment_label - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. tweet_id - the id of the tweet
3. time_stamp - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. target - the query (lyx). If there is no query, then this value is NO_QUERY.
5. user_id - the user who posted the tweet
6. tweet_text - the text of the tweet
AI Workflow
What Algorithm Is Best?
• The bag-of-words vector representation model is commonly used for text classification. In
this method, the frequency of occurrence of each word, or term frequency (TF), is
multiplied by the inverse document frequency (IDF), and the resulting TF-IDF scores are
used as feature values for training a classifier.
• The n-gram model is another common vector representation model. There is no
conclusive answer as to which one works best.
• In steps 5A and 5B, the most accurate model is deployed as a published web service,
using either RRS (Request Response Service) or BES (Batch Execution Service).
When using RRS, only one text instance is classified at a time.
When using BES, a batch of text instances can be sent for classification at the same time.
By using these web services, you can perform classification in parallel, using either an
external worker or the Azure Data Factory, for greatly enhanced efficiency.
Text pre-processing
• Text cleaning: replacing special characters and punctuation marks with
spaces, normalizing case, removing duplicate characters, removing
user-defined or built-in stop-words, and word stemming.
• Highly custom steps are implemented using the R
programming language.
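The cleaning steps listed above can be sketched in plain Python, applied in the same order. Everything specific here is an assumption made for illustration: the tiny stop-word list, the rule that runs of three or more repeated characters collapse to two, and the crude suffix-stripping function, which only stands in for a real stemmer.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}  # tiny illustrative list


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    # 1. Replace special characters and punctuation marks with spaces.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 2. Normalize case.
    text = text.lower()
    # 3. Remove duplicate characters: collapse runs of 3+ to two (assumed rule).
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # 4. Remove stop-words.
    words = [word for word in text.split() if word not in STOP_WORDS]
    # 5. Stem the remaining words.
    return [naive_stem(word) for word in words]


cleaned = clean_text("The movie is AMAZING!!! Loved it sooooo much")
# cleaned == ["movie", "amaz", "lov", "it", "soo", "much"]
```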
• Feature hashing converts variable-length text documents to equal-length
numeric feature vectors, using the 32-bit murmurhash v3
hashing method provided by the Vowpal Wabbit library.
• The objective of using feature hashing is dimensionality
reduction. Feature hashing also makes the lookup of
feature weights faster at classification time, because it uses
hash value comparison instead of string comparison.
• Hashing bitsize specifies the number of bits to use
when creating the hash table.
• The default bit size is 10. For many problems, this value is more
than adequate, but whether it suffices for your data depends on
the size of the n-gram vocabulary in the training text. With a
large vocabulary, more space might be needed to avoid
collisions.
• We recommend that you try different numbers of bits
for this parameter, and evaluate the performance of the
machine learning solution.
• For N-grams, type a number that defines the maximum length
of the n-grams to add to the training dictionary. An n-gram is a
sequence of n words, treated as a unique unit.
• N-grams = 1: Unigrams, or single words.
• N-grams = 2: Bigrams, or two-word sequences, plus unigrams.
• N-grams = 3: Trigrams, or three-word sequences, plus bigrams and
unigrams.
• Set the hashing bitsize to 15 and the number of n-grams to 2. With these settings, the hash
table can hold 2^15 or 32,768 entries, in which each hashing feature represents one or
more n-gram features and its value represents the occurrence frequency of that n-gram
in the text instance.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
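The feature hashing settings described above (15 hashing bits, n-grams up to 2) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Feature Hashing module itself: `zlib.crc32` stands in for the MurmurHash v3 function of Vowpal Wabbit, the function names are hypothetical, and the vector is stored sparsely as a dictionary from bucket index to count.

```python
import zlib


def ngrams(tokens, max_n):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def hash_features(tokens, bits=15, max_n=2):
    """Map n-grams to one of 2**bits buckets and count occurrences per bucket."""
    table_size = 1 << bits  # 2**15 = 32,768 buckets with the settings above
    vector = {}
    for gram in ngrams(tokens, max_n):
        # CRC32 is an assumed stand-in for MurmurHash v3.
        index = zlib.crc32(gram.encode("utf-8")) % table_size
        vector[index] = vector.get(index, 0) + 1
    return vector


tokens = ["not", "very", "good"]
grams = ngrams(tokens, 2)
# grams == ["not", "very", "good", "not very", "very good"]
features = hash_features(tokens)
```

Two different n-grams can land in the same bucket, which is the collision risk mentioned above; a larger bit size makes that less likely at the cost of a longer vector.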
Reduce Computational Complexity
The classification time and complexity of a trained model depend on the number of
features (the dimensionality of the input space). For a linear model, such as a support
vector machine, the complexity is linear with respect to the number of features.
For text classification tasks, the number of features resulting from feature extraction is high
because each word in the vocabulary and each n-gram is mapped to a feature.
To select a more compact feature subset from the exhaustive list of extracted hashing
features, we used the Filter Based Feature Selection module. The aim is to avoid the
effects of the curse of dimensionality and to reduce the computational complexity without
harming classification accuracy. To get the top 5,000 most relevant features with respect to
the sentiment label out of the 2^15 extracted features, use the Chi-squared score function
to rank the hashing features in descending order.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
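The ranking step described above can be sketched with a chi-squared score computed from a 2x2 contingency table (feature present or absent versus positive or negative label). This is a simplified stand-in for the Filter Based Feature Selection module, which may compute its score differently; the function names, the binary presence features, and the toy columns are assumptions for illustration.

```python
def chi_squared(feature_present, labels):
    """Chi-squared statistic of the 2x2 table: feature present/absent vs. label 1/0."""
    pairs = list(zip(feature_present, labels))
    a = sum(1 for f, y in pairs if f and y)          # present, positive
    b = sum(1 for f, y in pairs if f and not y)      # present, negative
    c = sum(1 for f, y in pairs if not f and y)      # absent, positive
    d = sum(1 for f, y in pairs if not f and not y)  # absent, negative
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # feature (or label) is constant, so it carries no signal
    return n * (a * d - b * c) ** 2 / denominator


def select_top_k(columns, labels, k):
    """Rank features by chi-squared score in descending order and keep the top k."""
    scores = {name: chi_squared(values, labels) for name, values in columns.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


labels = [1, 1, 0, 0]
columns = {
    "great": [1, 1, 0, 0],  # appears only in positive texts
    "the": [1, 1, 1, 1],    # appears everywhere
    "bad": [0, 0, 1, 0],    # appears in one negative text
}
top_two = select_top_k(columns, labels, k=2)
# top_two == ["great", "bad"]
```

In the experiment, k would be 5,000 and the columns would be the 2^15 hashing features rather than named words.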
TF-IDF Calculation
• When the term frequency (TF), the frequency of
occurrence of a word in a document, is used as a
feature value, a higher weight tends to be assigned
to words that appear frequently in a corpus (such
as stop-words).
• The Inverse Document Frequency (IDF) is a better
metric, because it assigns a lower weight to
frequent words. You calculate IDF as the log of the
ratio of the number of documents in the training
corpus to the number of documents containing the
given word.
• Combining these numbers in a metric (TF-IDF, the
product of TF and IDF) places greater importance on
words that are frequent in the document but rare in
the corpus. This assumption is valid not only for
unigrams but also for bigrams, trigrams, etc.
• This experiment converts unstructured text data
into equal-length numeric feature vectors where
each feature represents the TF-IDF of a unigram in
a text instance.
• Feature selection (dimensionality reduction)
We used the Chi-squared score function to rank the
hashing features in descending order, and returned
the top 5,000 most relevant features with respect to
the sentiment label, out of all the extracted unigrams.
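The TF-IDF calculation described above can be sketched as follows, using the definition from the slide: IDF is the log of the number of documents in the corpus divided by the number of documents containing the word, and TF-IDF is TF multiplied by IDF. The function names, the three-document toy corpus, and the choice to score words unseen in the corpus as 0 are assumptions for illustration.

```python
import math
from collections import Counter


def idf(corpus):
    """IDF per word: log(number of documents / number of documents containing the word)."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def tf_idf(doc, idf_scores):
    """TF-IDF per word of one document: raw term count multiplied by IDF."""
    term_freq = Counter(doc)
    return {word: count * idf_scores.get(word, 0.0) for word, count in term_freq.items()}


corpus = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "great", "great", "cast"],
]
idf_scores = idf(corpus)
scores = tf_idf(corpus[2], idf_scores)
# "the" appears in every document, so its IDF and TF-IDF are 0.
# "cast" appears in one document only, so it outranks "great" even though
# "great" occurs twice in this document.
```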
R Scripts
Use the first Split module to split the data into two subsets:
1. The first will be used to train the model.
2. The second will be split in the next step into a development/validation set
and a test set. In the sample experiment, we split the data into 70% and 30% respectively.
Use the second Split module to split the data into two subsets:
1. The first is used later by the Sweep Parameters module.
2. The second is used as the test set to evaluate the performance of the trained
model. In the sample experiment, we split the 30% data sample into two halves. That is,
each of the development set and the test set represents 15% of the input
data.
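The two consecutive splits described above (70% / 30%, then the 30% in halves) can be sketched as follows. This is a minimal illustration, not the Split module: the function name `split_rows`, the fixed seeds, and the 1,000 integer rows standing in for tweets are assumptions.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Shuffle the rows and return (first part, remainder) at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


data = list(range(1000))                               # stand-in for 1,000 tweets
train, rest = split_rows(data, 0.70)                   # first split: 70% / 30%
development, test = split_rows(rest, 0.50, seed=1)     # second split: 15% / 15%
```

Shuffling before cutting keeps each subset representative when the input is ordered, for example by timestamp or by label.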
Use the Sweep Parameters module to get the optimal values for the underlying
learning algorithm parameters. There are two options to try:
1. Random sweep, where the module conducts a number of training runs
(specified by the parameter 'Maximum number of runs on random sweep')
with values sampled from the parameter ranges.
2. Entire grid, a parameter sweeping mode that explores all possible values
for each parameter as specified in the learning algorithm module, such as the
Two-Class Logistic Regression module. In the sample experiment, AUC is
specified as the metric for measuring classification performance. Other
performance evaluation criteria, such as precision, recall and F-score, can be
used for model selection.
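The two sweeping modes and the AUC metric mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the Sweep Parameters module: the parameter names and ranges are hypothetical, `toy_score` is a placeholder for "train on the training set and measure AUC on the development set", and AUC is computed by direct pair counting.

```python
import itertools
import random

# Hypothetical parameter ranges for a logistic regression learner.
param_ranges = {"reg_weight": [0.01, 0.1, 1.0], "max_iter": [10, 20, 50]}


def entire_grid(ranges):
    """Entire grid: every combination of the listed values."""
    names = sorted(ranges)
    combos = itertools.product(*(ranges[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def random_sweep(ranges, max_runs, seed=0):
    """Random sweep: max_runs settings sampled from the parameter ranges."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in ranges.items()}
            for _ in range(max_runs)]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def best_setting(candidates, score):
    """Pick the candidate setting with the highest score."""
    return max(candidates, key=score)


def toy_score(params):
    """Placeholder for 'train with params, then return AUC on the development set'."""
    return -abs(params["reg_weight"] - 0.1) - abs(params["max_iter"] - 20)


grid = entire_grid(param_ranges)              # 3 x 3 = 9 settings
sampled = random_sweep(param_ranges, max_runs=4)
best = best_setting(grid, toy_score)
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 3 of 4 pairs ranked correctly
```

The entire grid is exhaustive but grows multiplicatively with each parameter, while the random sweep caps the number of training runs.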
For binary-class classification tasks, you can either keep the Two-Class Logistic
Regression module, or you can replace it with another binary-class classification
trainer, such as Two-Class Support Vector Machine, Two-Class Boosted Decision Tree,
etc.
Microsoft ML Studio Experiment
ML Experiment Documentation
ML Experiment:
https://studio.azureml.net/community/unpack?packageUri=https%3a%2f%2fstorage.azureml.net%2fdirectories%2fdc2f564b4c6c4c9d825836eae21f8751%2fitems&communityUri=https%3a%2f%2fgallery.azure.ai%2fDetails%2ftext-classification-step-1-of-5-data-preparation-3&entityId=Text-Classification-Step-1-of-5-data-preparation-3
More Related Content

Similar to AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark

IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big DataInfochimps, a CSC Big Data Business
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesNick Pentreath
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewLucidworks
 
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BIAugmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BIDenodo
 
Ανδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking ForumΑνδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking ForumStarttech Ventures
 
Web Analytics Wednesday Melbourne Meet Up
Web Analytics Wednesday Melbourne Meet UpWeb Analytics Wednesday Melbourne Meet Up
Web Analytics Wednesday Melbourne Meet UpNarbeh Yousefian
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Sri Ambati
 

Similar to AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark (20)

DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Maruti gollapudi cv
Maruti gollapudi cvMaruti gollapudi cv
Maruti gollapudi cv
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BIAugmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
 
Ανδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking ForumΑνδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking Forum
 
Web Analytics Wednesday Melbourne Meet Up
Web Analytics Wednesday Melbourne Meet UpWeb Analytics Wednesday Melbourne Meet Up
Web Analytics Wednesday Melbourne Meet Up
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session
 

More from Value Amplify Consulting

AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesValue Amplify Consulting
 
AI Class Topic 3: Building Machine Learning Predictive Systems (Predictive Ma...
AI Class Topic 3: Building Machine Learning Predictive Systems (Predictive Ma...AI Class Topic 3: Building Machine Learning Predictive Systems (Predictive Ma...
AI Class Topic 3: Building Machine Learning Predictive Systems (Predictive Ma...Value Amplify Consulting
 
AI Class Topic 2: Step-by-step Process for AI development
AI Class Topic 2: Step-by-step Process for AI developmentAI Class Topic 2: Step-by-step Process for AI development
AI Class Topic 2: Step-by-step Process for AI developmentValue Amplify Consulting
 
What Is Artificial Intelligence? Part 1/10
What Is Artificial Intelligence? Part 1/10What Is Artificial Intelligence? Part 1/10
What Is Artificial Intelligence? Part 1/10Value Amplify Consulting
 
Fractional Chief AI Officer Services For Hire
Fractional Chief AI Officer Services For HireFractional Chief AI Officer Services For Hire
Fractional Chief AI Officer Services For HireValue Amplify Consulting
 
Chief AI Officer and AI Digital Transformation
Chief AI Officer and AI Digital TransformationChief AI Officer and AI Digital Transformation
Chief AI Officer and AI Digital TransformationValue Amplify Consulting
 
EKATRA IoT Digital Twin Presentation at FOG World Congress
EKATRA IoT Digital Twin Presentation at FOG World Congress EKATRA IoT Digital Twin Presentation at FOG World Congress
EKATRA IoT Digital Twin Presentation at FOG World Congress Value Amplify Consulting
 
EKATRA IoT Digital Twin Presentation at FOG World Congress
EKATRA IoT Digital Twin Presentation at FOG World CongressEKATRA IoT Digital Twin Presentation at FOG World Congress
EKATRA IoT Digital Twin Presentation at FOG World CongressValue Amplify Consulting
 
Bitcoin, Altcoins and Trading Robots jan2018
Bitcoin, Altcoins  and Trading Robots jan2018Bitcoin, Altcoins  and Trading Robots jan2018
Bitcoin, Altcoins and Trading Robots jan2018Value Amplify Consulting
 
Bitcoin: Busienss and Technology Robot Overview
Bitcoin: Busienss and Technology Robot OverviewBitcoin: Busienss and Technology Robot Overview
Bitcoin: Busienss and Technology Robot OverviewValue Amplify Consulting
 
Tutorial on BlockChain and ICO in Commodity Trading
Tutorial on BlockChain and ICO in Commodity TradingTutorial on BlockChain and ICO in Commodity Trading
Tutorial on BlockChain and ICO in Commodity TradingValue Amplify Consulting
 
Introduction to Blockchain and BitCoin New Business Opportunties
Introduction to Blockchain and BitCoin New Business OpportuntiesIntroduction to Blockchain and BitCoin New Business Opportunties
Introduction to Blockchain and BitCoin New Business OpportuntiesValue Amplify Consulting
 
Rapid Economic Justifcation for Machine Learning in IoT
Rapid Economic Justifcation for Machine Learning in IoTRapid Economic Justifcation for Machine Learning in IoT
Rapid Economic Justifcation for Machine Learning in IoTValue Amplify Consulting
 

AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark

  • 1. 1. Text Analytics 2. AI with Apache Spark 3. AI Experiment with Microsoft ML Studio By Adj Prof. Giuseppe Mascarella giuseppe@valueamplify.com WELCOME TO: Text Analytics And Social Media Analytics
  • 2. People are sharing more than ever before… 190 million tweets per day; 100 hours of video uploaded to YouTube per minute; 500M+ photos uploaded and shared each day; 70 billion pieces of content shared on Facebook each month.
  • 3. Crunching the Text Analytics Numbers. One negative social media comment can cost a company 30 customers (*Convergys Corp.). A customer review on social media reaches an average audience of 45 people, 67% of whom would avoid or completely stop doing business with a company they heard bad things about (*Convergys Corp.). 44% of customers today complain about products and services on social networks (*IBM, 2012). Customers are nearly 60% through the sales process before engaging a sales rep, regardless of price point (*Google and CEB). Whose advice do you seek for guidance when evaluating software and service technology solutions for your business? Share answering "social media", by age: under 25: 56%; 25-29: 61%; 30-34: 53%; 35-40: 52%; 41-50: 40%; over 50: 34% (*SMB Group, 2010 Small & Medium Business Routes to Market Study, July 2010). Percent of tweeters who don’t tweet, but watch other people tweet: 40%.
  • 4. News travels quickly… People are more likely to share a bad experience online vs. a good one. 87% will share with others 33% will share with 5+ people
  • 7. Data Mining Text 1. Law, Ordinances, Court Sentences 2. Citizen data (e.g. https://www.covidimpactmeter.com/) 3. Product news 4. Interviews with industry experts 5. Opinion columns or blogs 6. Tutorials on new technologies or practices 7. Analysis from independent journalists or analysts 1. Vendor-provided information 2. Research on market size 3. Press releases 4. Analysis from peers 5. White papers 6. Vendor news 7. Product reviews or buying guides 8. Forward-looking trends 9. Case studies 10. https://www.statista.com/ 11. etc.
  • 8. REPU.SCORE 1. MICROSOFT BING ENGINE 2. ADD GOOGLE ENGINE 3. CUSTOM: EXTRACT TEST, READ, VERIFY, COLLECT QUOTES 3. JUNE START SUPPORT MANUAL DUE DILIGENCE
  • 9. What Is FOIL? The Freedom of Information Law (FOIL) promotes the policy of open government. FOIL was designed to make documents generated by and in the possession of government agencies available to the public with certain specific exceptions. • The Freedom of Information Law (“FOIL”), Article 6 (Sections 84-90) of the NYS Public Officers Law, provides the public right to access records maintained by government agencies with certain exceptions. • “Record” means any information kept, held, filed, produced or reproduced by, with, or for this agency, in any physical form whatsoever including, but not limited to, reports, statements, examinations, memoranda, opinions, folders, files, books, manuals, pamphlets, forms, papers, designs, drawings, maps, photos, letters, microfilms, computer tapes or disks, rules, regulations or codes. • Learn more about records access and Open FOIL NY. Federal Level: https://www.foia.gov/
  • 11. 1. Text Analytics 2. AI Text Analytics with Apache Spark 3. AI Experiment with Microsoft ML Studio By Adj Prof. Giuseppe Mascarella giuseppe@valueamplify.com WELCOME TO: Text Analytics And Social Media Analytics
  • 12. APACHE SPARK • 1. What is Spark? • 2. Spark MLlib • 3. Scenario: Text Mining
  • 13. What Challenge Are We Trying To Solve? Data grows faster than Moore’s Law [IDC report, Kathy Yelick, LBNL]. Data, Machine Learning, Distributed Computing
  • 14. How Can We Do So At Scale With Massive Data, While Increasing Speed? 1. Start implementing distributed pipelines 2. Optimize which tables you keep in memory and for how long 3. Understand ML libraries that are not limited to single-threaded computing 4. Scale out (distributed, e.g., cloud-based) with more hardware to store and process modern data 5. Leverage commodity hardware pricing, which scales to massive problems 6. Process real-time streaming data
  • 15. 1. What is Spark? Apache Spark is a fast and general-purpose cluster computing system with an in-memory data processing engine, suited to modern data-centric applications. Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX. Cluster managers: Apache Mesos, Hadoop YARN, Spark Standalone. API languages: Scala, Python, R, Java. Storage: S3, HDFS, Azure, …
  • 16. What Spark? Free software: spark.apache.org Apache Spark is 100% open source, hosted at the vendor-independent Apache Software Foundation “..our efforts focuses on both the Spark codebase and support. All of our work on Spark is open source and goes directly to Apache.” Matei Zaharia, VP, Apache Spark, Founder & CTO, Databricks Early Adopters • The Washington Post recommendation engine for content to their readers. • Yelp, in connecting users with local businesses, Spark to increase the click-through rates of display advertisements. • Hearst Corporation uses Streaming to process clickstream data from over 200 web. For real-time view of article performance and trending topics. • ..Uber, Pinterest, Conviva, and Yahoo
  • 17. Spark Infrastructure Cluster Deployment with built-in interoperability with Hadoop YARN, Apache Mesos, file systems, etc. Spark Core defines the RDD data abstraction and provides in-memory computing capabilities.
  • 19. How Does The Core Work? Parallelizes Distributed Data Processing Shuffle DAGScheduler, TaskScheduler, SchedulerBackend Resilient Distributed Dataset (and DataFrames) In-Memory, Partitioned, Cacheable, Parallel, Typed, Lazy Evaluated
  • 20. 2. Spark MLlib Machine Learning in a Distributed Computing Environment with Petabytes of Data • org.apache.spark.mllib: the RDD-based API • org.apache.spark.ml: the DataFrame-based Pipeline API for designing, evaluating, and tuning machine learning pipelines
  • 21. ML Algorithms with RDD • Classification: logistic regression, naive Bayes,... • Regression: generalized linear regression, isotonic regression,... • Decision trees, random forests, and gradient-boosted trees • Recommendation: alternating least squares (ALS) • Clustering: K-means, Gaussian mixtures (GMMs),... • Topic modeling: latent Dirichlet allocation (LDA) • Feature transformations: standardization, normalization, hashing,... • Model evaluation and hyper-parameter tuning • ML Pipeline construction • ML persistence: saving and loading models and Pipelines • Survival analysis: accelerated failure time model • Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan • Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA),... • Statistics: summary statistics, hypothesis testing,... https://spark-packages.org/?q=tags%3A%22Machine Learning%22
  • 22. 3. Scenario: Text Classification Algorithm: Logistic Regression (Binary Classification) Problem: Given a text document, classify it as scientific or non-scientific. Goal: Find the decision boundary, the "model". Scientific Non Scientific Model Schema for Categories: sealed trait Category; case object Scientific extends Category; case object NonScientific extends Category $\sum_{k=0}^{n} \binom{n}{k} x^{k}$
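To make the idea of a decision boundary concrete, here is a minimal sketch of how a trained logistic regression model could score one document. It reuses the Category schema from the slide; the object name `LogisticSketch`, the function names, and the 0.5 threshold are illustrative assumptions, and the weights would normally come from training rather than being supplied by hand.

```scala
// Categories as declared on the slide.
sealed trait Category
case object Scientific extends Category
case object NonScientific extends Category

// Hypothetical helper, not part of Spark: scores one feature vector.
object LogisticSketch {
  // Logistic (sigmoid) function: maps any real score to a probability in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability that a document with feature vector x is Scientific.
  def probability(weights: Seq[Double], bias: Double, x: Seq[Double]): Double = {
    require(weights.length == x.length, "weights and features must have the same length")
    sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)
  }

  // The decision boundary is where the probability equals 0.5 (assumed threshold).
  def classify(weights: Seq[Double], bias: Double, x: Seq[Double]): Category =
    if (probability(weights, bias, x) >= 0.5) Scientific else NonScientific
}
```

With weights `Seq(2.0, -1.0)` and zero bias, a document whose features are `Seq(1.0, 0.0)` falls on the Scientific side of the boundary, while `Seq(0.0, 3.0)` falls on the NonScientific side.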
  • 23. MLlib Pipeline. Industry stages: Load data → Engineer features → Train model → Scoring → Offer service. Spark stages: Dataset → Transformers → Fit → Evaluate and tune → PMML/Livy. Our text scenario: Create a Dataset → Tokenizer → Hashing term frequency → Logistic Regression → Evaluate → PMML/Livy REST API. Pipeline(Stages=T,H,Lr). Schema: Label: Obj S/N-Sc; Text: String; Words: Seq[String]; Features: Vector; Predictions: Obj S/N-Sc
  • 24. Definition • Tokenization, in text analytics, is the process of splitting text into smaller units (tokens), typically words, that later pipeline stages can count and transform. In data security the same term has a different meaning: substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value and maps back to the sensitive data through a tokenization system. • Hashes are used to index data. Hash values can be used to map data to individual "buckets" within a hash table. Each bucket has a unique ID that serves as a pointer to the original data. This creates an index that is significantly smaller than the original data, allowing the values to be searched and accessed more efficiently. In the text pipeline, each word is hashed to an index in a fixed-length feature vector, and the value at that index counts the word's occurrences.
  • 25. Transformer: Tokenizer SCALA 101 1. Import Library 2. Def Obj 3. Transform (method) Schema (text) Schema (words) 1. import org.apache.spark.ml.feature.RegexTokenizer 2. val tokenizer = new RegexTokenizer().setInputCol("Bible").setOutputCol("words scientific") 3. tokenizer.transform(training) transform Dataset Dataset
  • 26. Modeling 1. import org.apache.spark.ml.Pipeline; import org.apache.spark.ml.classification.LogisticRegression 2. val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01) 3. val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)) 4. val model = pipeline.fit(training) 1. Import Library 2. Instr. Obj Algorithm 3. Create Pipeline (70%) 4. Train Model Dataset Model
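The tokenizer and modeling slides leave the `hashingTF` stage and the training data undefined, so the following is a hedged end-to-end sketch of how the three stages might be wired together. It assumes Spark's `spark-sql` and `spark-mllib` artifacts are on the classpath; the object name `TextPipelineSketch`, the column names `text`, `words`, `features`, and `label`, the `\W+` split pattern, the 2^15 feature count, and the four toy sentences are all illustrative choices, not taken from the deck.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
import org.apache.spark.sql.SparkSession

object TextPipelineSketch {
  // Tokenizer -> hashing term frequency -> logistic regression, as on the pipeline slide.
  def buildPipeline(): Pipeline = {
    val tokenizer = new RegexTokenizer()
      .setInputCol("text")      // assumed column name
      .setOutputCol("words")
      .setPattern("\\W+")       // split on runs of non-word characters
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1 << 15)  // assumed vector size
    val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("text-classification-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training set: label 1.0 = scientific, 0.0 = non-scientific (made-up rows).
    val training = Seq(
      ("the enzyme catalyzes the reaction in the cell", 1.0),
      ("we measured the particle energy in the detector", 1.0),
      ("the team won the match last night", 0.0),
      ("new cafe opens downtown this weekend", 0.0)
    ).toDF("text", "label")

    val model = buildPipeline().fit(training)
    val unseen = Seq("the detector measured the reaction").toDF("text")
    model.transform(unseen).select("text", "probability", "prediction").show(truncate = false)
    spark.stop()
  }
}
```

Keeping the stage construction in `buildPipeline()` separates the pipeline definition from the Spark session, so the same definition could be fitted on a real training split later.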
  • 28. Data In the Social Mission Control Room Social Engagement 1. Needs 2. Solution Example 3. How
  • 29. Why Do You Need To Monitor Conversations Worldwide In a Timely Manner? 1. People and issues can quickly go viral 2. Influencers are difficult to identify and prioritize 3. Trends and opportunity signals get lost in the noise
  • 30. Sentiment Analysis Addresses: 1. People, topic and event sentiment: gain insight and learn what people really think and feel about it. 2. Campaign monitoring: measure the effectiveness of remediation campaigns in real time. 3. Top influencer tracking: identify and monitor the top influencers in your industry, company, and customer base.
  • 31. Social care. Global sentiment analysis: track social care cases in Spanish, English and Portuguese. Real-time social case resolution: create alerts to quickly identify customer issues, and identify trends early on. Integrated channel management: across social and web channels (e.g., Twitter, Facebook, blogs, and boards).
  • 32. Social Alert Management. Competitive intelligence: gain important insights about your competitors’ weaknesses and strengths. Target account tracking: monitor key developments and decision makers at your top accounts. Unattended alerting: create rule-based real-time alert agents and generate templated PR actions.
  • 33. Data In the Social Mission Control Room Social Engagement 1. Needs 2. Solution Example 3. How
  • 34. What Is Media Monitoring? • Metrics of social audience • Measure sentiment of the audience on characters • Browse conversations
  • 35. A. What Is Sentiment Analysis? Finding information driven by opinion mining or emotions: AI to understand the voice of the crowd through reviews and survey responses, online and on social media. B. Where Is It Used? Marketing, Branding, Entertainment, Customer service, Clinical medicine, Politics, Finance, etc. C. What Are The Mechanics of It? • NLP, text/document-level classification methods and software tools • Positive/Negative/Neutral (+/-1) • Subjective -> Objective • Tailored for short texts • Handles Twitter jargon (RT, @, #), spelling errors, disfluency • Entity-level sentiment
  • 36. Sentiment Analysis: From Volume Analysis To Quality Analysis • Real-time visualizations of Twitter conversations • Overall sentiment of the audience on each player / character $45M/Year
  • 37. Use Case: Company FY/Quarter Result Release Produced by Value Amplify - Confidential
  • 38. The Mechanics Of Text Classification Algorithm: Logistic Regression (Binary Classification) Problem: Given a text document, classify its sentiment as “Positive” or “Negative”. Goal: Find the decision boundary, the “model”. Negative Positive Model Entities/Channels SentimentAnalysisIndex
  • 39. Social Mission Control Room Social Engagement 1. Needs 2. Solution Samples 3. How By Prof. Giuseppe Mascarella Cell: 425 269 5478
  • 41. You can single out an influencer's point of view in one channel and reply with information from other channels
  • 44. The Business Problem (Microsoft Case Study) Elena works for an Internet-based retailer selling DVDs, software, video games, toys, electronics, and furniture. The company shows customer feedback at the product level. Her task is to build a pipeline that automatically analyzes customer feedback and Twitter messages to provide the overall sentiment for each product. The aim (label, target) is to help consumers who want to know, before purchasing a product, whether public opinion (previous reviews or best-seller status) of it has been positive (= 4). How Do You Use AI For This? - Training Data - Target (label): Review (not best seller) - Algorithm
  • 45. 1. Text Analytics 2. AI with Apache Spark 3. AI Sentiment Analysis Experiment with Microsoft ML Studio By Adj Prof. Giuseppe Mascarella giuseppe@valueamplify.com WELCOME TO: Text Analytics And Social Media Analytics
  • 46. Data Set • The open source data comprises approximately 1,600,000 automatically annotated tweets: http://help.sentiment140.com/ • Any tweet containing a positive emoticon such as :), :-), :D or =D was assumed to bear positive sentiment. Any tweet with a negative emoticon such as :<, :-( or :( was assumed to bear negative polarity. • Tweets containing both positive and negative emoticons are a problem • Dataset to be used: http://azuremlsamples.azureml.net/templatedata/Text - Input.csv
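The emoticon rule above can be sketched in a few lines. The emoticon lists and the 0/4 label values come from the slides; the object name `EmoticonLabeler` and the choice to return `None` for tweets with both kinds of emoticon (or with none) are assumptions made for illustration, since the slide only says such tweets are a problem.

```scala
// Hypothetical helper illustrating emoticon-based automatic annotation.
object EmoticonLabeler {
  // Emoticon lists as given on the slide.
  val positive: Seq[String] = Seq(":)", ":-)", ":D", "=D")
  val negative: Seq[String] = Seq(":<", ":-(", ":(")

  // Some(4) = positive, Some(0) = negative.
  // None = no emoticon, or both kinds present (assumed handling of the ambiguous case).
  def label(tweet: String): Option[Int] = {
    val hasPositive = positive.exists(e => tweet.contains(e))
    val hasNegative = negative.exists(e => tweet.contains(e))
    (hasPositive, hasNegative) match {
      case (true, false) => Some(4)
      case (false, true) => Some(0)
      case _             => None
    }
  }
}
```

Returning an `Option` keeps the ambiguous tweets visible to the caller, who can then decide whether to drop them or label them by hand.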
  • 47. Dataset Variables (Label, Features, Text Column) 1. sentiment_label - the polarity of the tweet, which is the dependent variable (0 = negative, 2 = neutral, 4 = positive) 2. tweet_id - the id of the tweet 3. time_stamp - the date of the tweet (Sat May 16 23:58:44 UTC 2009) 4. target - the query (lyx). If there is no query, then this value is NO_QUERY. 5. user_id - the user who posted the tweet 6. tweet_text - the text of the tweet
  • 49. What Algorithm Is Best? • The bag-of-words vector representation model is commonly used for text classification. In this method, the frequency of occurrence of each word, or term frequency (TF), is multiplied by the inverse document frequency (IDF), and the TF-IDF scores are used as feature values for training a classifier. • The N-gram model is another common vector representation model, and there is no conclusive answer as to which one works best. • In steps 5A and 5B, the most accurate model is deployed as a published web service, using either RRS (Request Response Service) or BES (Batch Execution Service). When using RRS, only one text instance is classified at a time. When using BES, a batch of text instances can be sent for classification at the same time. By using these web services, you can perform classification in parallel, using either an external worker or the Azure Data Factory, for greatly enhanced efficiency.
  • 50. Text pre-processing. Text cleaning: - replacing special characters and punctuation marks with spaces, - normalizing case, removing duplicate characters, removing user-defined or built-in stop-words, and word stemming. Highly custom steps are implemented using the R programming language.
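The cleaning steps listed above could look roughly like this in plain Scala. This is only a sketch: the stop-word list is a tiny made-up sample, "removing duplicate characters" is interpreted here as collapsing runs of three or more identical letters into one, and stemming is left out entirely. The object name `TextCleaner` is hypothetical.

```scala
// Hypothetical helper illustrating the text-cleaning steps; not the experiment's R code.
object TextCleaner {
  // Tiny illustrative stop-word list (assumption; real lists are much longer).
  val stopWords: Set[String] = Set("a", "an", "and", "the", "is", "to", "of", "with")

  def clean(text: String): List[String] = {
    // 1. Normalize case.
    val lowered = text.toLowerCase
    // 2. Replace special characters and punctuation marks with spaces.
    val noPunctuation = lowered.replaceAll("[^a-z0-9\\s]", " ")
    // 3. Collapse runs of 3+ identical letters into one ("sooooo" becomes "so").
    val collapsed = noPunctuation.replaceAll("([a-z])\\1{2,}", "$1")
    // 4. Split on whitespace and drop stop-words (stemming is omitted in this sketch).
    collapsed.split("\\s+").toList.filter(t => t.nonEmpty && !stopWords.contains(t))
  }
}
```

For example, `TextCleaner.clean("Sooooo HAPPY!!! with the new phone :)")` yields the tokens `so`, `happy`, `new`, `phone`.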
  • 52. • Feature hashing converts variable-length text documents to equal-length numeric feature vectors, using the 32-bit murmurhash v3 hashing method provided by the Vowpal Wabbit library. • The objective of using feature hashing is dimensionality reduction; feature hashing also makes the lookup of feature weights faster at classification time, because it uses hash value comparison instead of string comparison. • Hashing bitsize specifies the number of bits to use when creating the hash table. • The default bit size is 10. For many problems, this value is more than adequate, but whether it suffices for your data depends on the size of the n-gram vocabulary in the training text. With a large vocabulary, more space might be needed to avoid collisions. • We recommend that you try different numbers of bits for this parameter and evaluate the performance of the machine learning solution. • For N-grams, type a number that defines the maximum length of the n-grams to add to the training dictionary. An n-gram is a sequence of n words, treated as a unique unit. • N-grams = 1: Unigrams, or single words. • N-grams = 2: Bigrams, or two-word sequences, plus unigrams. • N-grams = 3: Trigrams, or three-word sequences, plus bigrams and unigrams. • Set the hashing bitsize to 15 and the number of n-grams to 2. With these settings, the hash table can hold 2^15 or 32,768 entries, in which each hashing feature represents one or more n-gram features and its value represents the occurrence frequency of that n-gram in the text instance. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
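As a rough illustration of the settings described above (15 hashing bits, n-grams up to 2), the sketch below hashes unigrams and bigrams into a sparse count vector. It uses `scala.util.hashing.MurmurHash3` from the Scala standard library as a stand-in, so the bucket indices should not be expected to match those produced by Vowpal Wabbit or the Azure module. The object and function names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper illustrating the hashing trick; not the Azure ML module itself.
object FeatureHashingSketch {
  // All n-grams of length 1..maxN, e.g. maxN = 2 gives unigrams plus bigrams.
  def ngrams(tokens: Seq[String], maxN: Int): Seq[String] =
    (1 to maxN).flatMap { n =>
      tokens.sliding(n).filter(_.length == n).map(_.mkString(" "))
    }

  // Sparse feature vector: bucket index -> number of n-grams hashed into that bucket.
  // With bits = 15 there are 2^15 = 32,768 buckets; colliding n-grams share a bucket.
  def hashFeatures(tokens: Seq[String], maxN: Int = 2, bits: Int = 15): Map[Int, Int] = {
    val mask = (1 << bits) - 1 // keeps the low `bits` bits, so indices are non-negative
    ngrams(tokens, maxN)
      .groupBy(gram => MurmurHash3.stringHash(gram) & mask)
      .map { case (index, grams) => index -> grams.size }
  }
}
```

Every document maps to the same 32,768-slot index space regardless of its length, which is what makes the resulting vectors equal-length; raising `bits` trades memory for fewer collisions.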
  • 53. Reduce Computational Complexity The classification time and complexity of a trained model depend on the number of features (the dimensionality of the input space). For a linear model, such as a support vector machine, the complexity is linear with respect to the number of features. For text classification tasks, the number of features resulting from feature extraction is high because each word in the vocabulary and each n-gram is mapped to a feature. To select a more compact feature subset from the exhaustive list of extracted hashing features, we used the Filter Based Feature Selection module. The aim is to avoid the effects of the curse of dimensionality and to reduce the computational complexity without harming classification accuracy. To get the top 5,000 most relevant features with respect to the sentiment label out of the 2^15 extracted features, use the Chi-squared score function to rank the hashing features in descending order. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
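The ranking step can be illustrated with the textbook chi-squared statistic for a 2x2 table (feature present or absent versus positive or negative label). This is a sketch under that assumption; the Filter Based Feature Selection module may compute its score differently, and the names `ChiSquaredSelection`, `Counts`, and `topK` are hypothetical.

```scala
// Hypothetical helper illustrating chi-squared feature ranking.
object ChiSquaredSelection {
  // Document counts for one feature, split by feature presence and sentiment label.
  final case class Counts(presentPos: Long, presentNeg: Long, absentPos: Long, absentNeg: Long)

  // Chi-squared for a 2x2 table: N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
  def chiSquared(c: Counts): Double = {
    val n = (c.presentPos + c.presentNeg + c.absentPos + c.absentNeg).toDouble
    val diff = c.presentPos.toDouble * c.absentNeg - c.presentNeg.toDouble * c.absentPos
    val denominator =
      (c.presentPos + c.presentNeg).toDouble * (c.absentPos + c.absentNeg) *
        (c.presentPos + c.absentPos) * (c.presentNeg + c.absentNeg)
    if (denominator == 0.0) 0.0 else n * diff * diff / denominator
  }

  // Feature indices ranked by descending score, truncated to the k most relevant.
  def topK(features: Map[Int, Counts], k: Int): Seq[Int] =
    features.toSeq.sortBy { case (_, counts) => -chiSquared(counts) }.take(k).map(_._1)
}
```

A feature that appears only in positive tweets scores high, while one spread evenly across both classes scores zero and is dropped first; in the experiment `k` would be 5,000.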
  • 54. TF-IDF Calculation • When the term frequency (TF), a word's frequency of occurrence in a document, is used as a feature value, a higher weight tends to be assigned to words that appear frequently in a corpus (such as stop-words). • The Inverse Document Frequency (IDF) corrects for this, because it assigns a lower weight to frequent words. You calculate IDF as the log of the ratio of the number of documents in the training corpus to the number of documents containing the given word. Combining the two in one metric (TF-IDF, the product of TF and IDF) places greater importance on words that are frequent in the document but rare in the corpus. This assumption is valid not only for unigrams but also for bigrams, trigrams, etc. • This experiment converts unstructured text data into equal-length numeric feature vectors where each feature represents the TF-IDF of a unigram in a text instance. • Feature selection (dimensionality reduction): we used the Chi-squared score function to rank the hashing features in descending order, and returned the top 5,000 most relevant features with respect to the sentiment label, out of all the extracted unigrams.
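A small worked sketch of the calculation described above, using the IDF definition from the slide (natural log of total documents over documents containing the term) and raw counts as TF. The object name `TfIdfSketch` and the choice to give unseen terms a weight of zero are assumptions for illustration.

```scala
// Hypothetical helper illustrating TF-IDF on tokenized documents.
object TfIdfSketch {
  // IDF per term: log(number of documents / number of documents containing the term).
  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val documentCount = corpus.size.toDouble
    corpus
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, occurrences) => term -> math.log(documentCount / occurrences.size) }
  }

  // TF-IDF per term of one document: raw term count times IDF.
  // Terms missing from the IDF table get weight 0.0 (assumed handling).
  def tfIdf(document: Seq[String], idfByTerm: Map[String, Double]): Map[String, Double] =
    document.groupBy(identity).map { case (term, occurrences) =>
      term -> occurrences.size * idfByTerm.getOrElse(term, 0.0)
    }
}
```

For the two-document corpus `Seq(Seq("good", "movie"), Seq("bad", "movie"))`, the word "movie" appears in every document, so its IDF is log(2/2) = 0 and it contributes nothing, while "good" gets log(2/1), about 0.693.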
  • 56. Use the first Split module to split the data into two subsets: 1. the first will be used to train the model; 2. the second will be split in the next step into a development/validation set and a test set. In the sample experiment, we split the data into 70% and 30% respectively. Use the second Split module to split the data into two subsets: 1. the first is used later by the Sweep Parameters module; 2. the second is used as the test set to evaluate the performance of the trained model. In the sample experiment, we split the 30% data sample into two halves. That is, the development set and the test set each represent 15% of the input data. Use the Sweep Parameters module to get the optimal values for the underlying learning algorithm parameters. There are two options to try: 1. Random sweep, where the module conducts a number of training runs (specified by the parameter 'Maximum number of runs on random sweep') with values drawn from the parameter ranges. 2. Entire grid, a parameter sweeping mode that explores all possible values for each parameter as specified in the learning algorithm module, such as the Two-Class Logistic Regression module. In the sample experiment, AUC is specified as the metric for measuring classification performance. Other performance evaluation criteria can be used for model selection, such as precision, recall and F-score. For binary-class classification tasks, you can either keep the Two-Class Logistic Regression module, or you can replace it with another binary-class classification trainer, such as Two-Class Support Vector Machine, Two-Class Boosted Decision Tree, etc.
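The two Split modules amount to a 70% / 15% / 15% partition of the shuffled data. A minimal sketch in plain Scala follows; the object name `DataSplitSketch`, the fixed seed, and the use of integer arithmetic for the cut points are illustrative assumptions rather than the behaviour of the Studio modules.

```scala
import scala.util.Random

// Hypothetical helper illustrating the train / development / test split.
object DataSplitSketch {
  // Returns (train, development, test) with roughly 70%, 15% and 15% of the rows.
  def split[A](rows: Seq[A], seed: Long = 42L): (Seq[A], Seq[A], Seq[A]) = {
    // Shuffle with a fixed seed so the split is reproducible.
    val shuffled = new Random(seed).shuffle(rows)
    // Integer arithmetic avoids floating-point rounding at the cut points.
    val trainEnd = rows.size * 70 / 100
    val devEnd = trainEnd + rows.size * 15 / 100
    (shuffled.take(trainEnd), shuffled.slice(trainEnd, devEnd), shuffled.drop(devEnd))
  }
}
```

The development set is the part a parameter sweep would score candidates on, and the test set stays untouched until the final evaluation.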
  • 57. Microsoft ML Studio Experiment ML Experiment Documentation ML Experiment: https://studio.azureml.net/community/unpack?packageUri=https%3a%2f%2fstorage.azureml.net%2fdirectories%2fdc2f564b4c6c4c9d825836eae21f8751%2fitems&communityUri=https%3a%2f%2fgallery.azure.ai%2fDetails%2ftext-classification-step-1-of-5-data-preparation-3&entityId=Text-Classification-Step-1-of-5-data-preparation-3

Editor's Notes

  1. As you know, the customer experience is your brand. And every touch point customers have influences that brand. Those touch points used to be very limited – a TV advertisement, the product packaging, the product itself, maybe a brochure or flyer. If customers wanted information, it came from the company, industry publications or a select group of peers. With the rise of social and mobile, those touch points are multiplying. And multiplying exponentially. Today’s customers have huge networks to tap for information and are no longer buying in the ways they used to. While this shift creates new challenges, it also creates a huge opportunity for businesses to reach customers in ways never before possible. As we have just mentioned, the world has gone social. And we all experience it every day in our personal lives. We are sharing on Twitter, Facebook, and blogs. When we want to buy a product, we go online for product reviews, or ask questions on forums. We engage our peers for opinions and Tweet our likes and loves. We post photos and videos for our friends and the world. And this isn’t a trend or a fad – this is a fundamental shift in the way people communicate. But this isn’t just a change in the way consumers behave – these social conversations and behaviors are increasingly moving to the B2B space as well. There is a conversation going on out there and if you’re not listening to it, you’ll be in the dark. You can see the facts on this slide, which are meant to illustrate what’s happening in the market – and it’s amazing to quantify some of the numbers for how social the world is becoming. Let me highlight a few other facts about what this rise in social means. First: 75% of B2B buyers are likely to use social media to influence their purchase decisions. What that means is that customers aren’t just using social – they’re using it to buy. So if you’re not engaged on social, you’re missing the majority of your target market. These customers are out there talking about your products, brands and competitors, and if you aren’t hearing the conversations and using them to inform your strategy, you’ll be at a severe disadvantage. Source: http://www.insideview.com/social-selling?utm_source=infographic&utm_medium=howsocialisb2b&utm_campaign=social-selling Second: Customers are 70% of the way through the purchase process before they contact sales. When you think about this and the way customers buy, it means they’ve already done their research and are close to a decision before they even contact you. Given this, if you want to be able to influence purchasing decisions, you’ll need to engage long before they ever contact your sales department. Source: SiriusDecisions “When you couple that with the fact that over 500,000,000 photos are uploaded and shared each day on Facebook, 100 hours of video are uploaded to YouTube per minute, there are 190,000,000 tweets per day, and the total pieces of content shared on Facebook each month exceed 70,000,000,000, we can see that our customers are part of this social reality and are using social media as a mechanism to get the information they are looking for as they look to purchase products from us. In fact (as we just stated), they are well down the path of the buying cycle because of it before they even reach out to us to engage.”
  2. Instructor: Walk through the slide animations and use them as points of discussion in your class.
  3. “Let’s talk a little bit about customer experience.” Walk through the animation of the slide to guide your class discussion.
  4. “As you can see, leveraging social media as well as monitoring what is happening relative to your social footprint is a vital aspect of many businesses’ day-to-day marketing, sales, and even customer care efforts. Before we spend too much more time on specifics, let’s review some of the common terms and tools used online that drive this customer behavior.”
  5. Hashtag: The hashtag is a word or phrase preceded by the “#” sign. #Hashtags are a simple way to mark the topic (or topics) of social media messages and make them discoverable to people with shared interests. On most social networks, clicking a hashtag will reveal all the public and recently published messages that also contain that hashtag. Hashtags first emerged on Twitter as a user-created phenomenon and are now used on almost every other social media platform, including Facebook, Google+, Instagram, Vine and Pinterest.
  Bitly / URL shortening: Converting a long URL name into a short one. Also called "URL redirecting," there are free URL shortening services on the Web that take a long URL and convert it to a short one for publication on a Web page or other venue. When the short URL is clicked, the URL shortening service receives the request and redirects it to the long URL. Short URLs are widely used to slim down long URLs printed in magazines and newspapers to make them easier to type into a browser. They are also used in e-mail, because URLs in the message often wind up as two lines of text at the recipient's end that cannot be clicked as one link. Although the recipient can copy the two lines back together in the browser's address bar, the short URL eliminates the potential problem.
  Permalink: The URL address of an individual piece of content. Permalinks are useful because they allow you to reference a specific Tweet, update, or blog post instead of the feed or timeline in which you found it. You can quickly find an item’s permalink by clicking on its timestamp.
  SEO: Search engine optimization is the practice of increasing the “organic” visibility of a web page in a search engine, such as Google. Although businesses can pay to promote their websites on search engine results pages (Search Engine Marketing, or SEM), SEO refers to “free” tactics that enhance the search ranking of a page.
  Sentiment analysis: An attempt to understand how an audience feels about a brand, company, or product based on data collected from social media. It typically involves the use of natural language processing or another computational method to identify the attitude contained in a social media message. Different analytics platforms classify sentiment in a variety of ways; for example, some use “polar” classification (positive or negative sentiment), while others sort messages by emotion or tone (Contentment/Gratitude, Fear/Uneasiness, etc.).
  Meme: An idea, fashion, or behavior that is transmitted from person to person through media, speech, gestures, and other forms of communication. The term was conceived by evolutionary biologist Richard Dawkins in the 1970s, but it has exploded into greater relevance in the past decade with the rise of online culture. In Dawkins’ theory, memes are ideas (or fragments of ideas) that are copied and combined as they move from person to person, much like genes are passed down from generation to generation. Dawkins surmised that we could use the concept of evolution by natural selection to understand how ideas spread and change over time. Some memes spread far and wide, some die out, and others mutate. Social media has made it possible to visualize and measure this phenomenon like never before. For example, we can see hashtags rise and fall in popularity and track how quickly they spread throughout a network. The word meme is itself a meme. The theory isn’t perfect, and it has its share of critics, but it’s an alluringly simple way to think about the spread of ideas. Therefore, people use the word and pass it on. Its meaning has also evolved over time as it has become increasingly used to describe viral social media content.
  Social Capital: The central theme of Social Capital is that social networks have value. It refers to the collective value of all social networks (who people know) and the tendencies or benefits that come from these networks to do things for each other. The interactions and flow of information provide benefit and value for the people who are connected. Social capital works through multiple channels such as information flows, norms of reciprocity (mutual aid), collective action, and social identity.
  User-generated content (UGC): Media that has been created and published online by the users of a social or collaboration platform, typically for non-commercial purposes. User-generated content is one of the defining characteristics of social media. It is often produced collaboratively and in real-time by multiple users. Many companies have enthusiastically embraced and encouraged user-generated content as a means of increasing brand awareness and customer loyalty. Instagram contests, Vine video contests, and other UGC-based social campaigns allow businesses to tap into the creative energies of their customers and use their contributions to fuel marketing strategies.
  Brandjacking: The hijacking of a brand to promote an agenda or damage a reputation. Brandjackers don’t hack the social media accounts of target individuals and organizations. Instead, they assume a target’s online identity through indirect means such as fake accounts, promoted hashtags, and satirical marketing campaigns.
  Clickthrough rate (CTR): This is a common metric for reporting on the number of people who viewed a message or piece of content and then actually performed the action required, such as clicking on the ad or link in an email marketing campaign. The actual metric is calculated by comparing the number of clicks to impressions. For example, if 100 people saw your ad in Google and one person clicked on the ad, you would have a click-through rate of 1.0%. Clickthrough rate (CTR) is most commonly used for search engine marketing and other performance-driven channels, as the general philosophy is that the higher your CTR, the more effective your marketing is.
  Creative Commons: A Creative Commons license is a public copyright license that gives you the ability to use and share otherwise copyrighted material. For social media users, Creative Commons often comes into play when we are looking for images and photos to accompany a social media message or blog post. In both of these cases, unless you are using your own images or have express permission, you can only share Creative Commons images. Sites like Google Image Search and Flickr have filters so you can easily search for Creative Commons photos. Just be careful, as there are different levels of Creative Commons license, which could restrict whether an image can be used commercially, whether it can be modified, and what kind of attribution is required.
  Crowdsourcing: Crowdsourcing refers to the process of leveraging your online community to assist in services, content and ideas for your business. Business examples include getting your audience to volunteer in helping translate your product or asking your community to contribute content for your blog.
  Geotargeting: A feature on many social media platforms that allows users to share their content with geographically defined audiences. Instead of sending a generic message for the whole world to see, you can refine the messaging and language of your content to better connect with people in specific cities, countries, and regions. You can also filter your audience by language.
  Astroturfing: Astroturfing is a fake grassroots campaign that seeks to create the impression of legitimate buzz or interest in a product, service or idea. Often this movement is motivated by a payment or gift to the writer of a post or comment, or the post may be written under a pseudonym.
  Return on relationship (ROR): A measurement of the value gained by a person or business from developing a relationship. Measuring ROR isn’t easy; it involves not only analyzing connection growth, but also understanding the impact your customers’ voices have on your brand and reputation. This includes sentiment analysis, as well as engagement metrics for your content, like organic sharing rates. ROR is an alternative (or complementary) metric to social media ROI.
  Troll: In Internet slang, a troll is someone who posts controversial, inflammatory, irrelevant or off-topic messages in an online community, such as an online discussion forum or chat room, with the primary intent of provoking other users into an emotional response or of generally disrupting normal on-topic discussion.
  SMO: Social Media Optimization (SMO) is a set of practices for generating publicity through social media, online communities and social networks. The focus is on driving traffic from sources other than search engines, though improved search ranking is also a benefit of successful SMO.
  Reach: Reach is a data metric that determines the potential size of the audience any given message could reach. It does not mean that the entire audience will see your social media post, but rather tells you the maximum number of people your post could potentially reach. Reach is determined by a fairly complex calculation that includes number of followers, shares and impressions as well as net follower increase over time. Reach should not be confused with Impressions or Engagement.
  Share of Voice: Share of voice is a metric for understanding how many social media mentions a particular brand is receiving in relation to its competition. It is usually measured as a percentage of total mentions within an industry or among a defined group of competitors.
  Word cloud: Word clouds, also known as tag clouds or weighted lists, are a visual representation of text, where the frequency of a word determines its size in the word cloud. This is a great tool for identifying words that are repeated or most common.
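Three of the metrics defined in this glossary reduce to simple arithmetic: clickthrough rate, share of voice, and the word frequencies behind a word cloud. The sketch below is a minimal illustration in plain Python; the function names and the sample brand counts are hypothetical.

```python
from collections import Counter


def click_through_rate(clicks, impressions):
    """CTR as a percentage: clicks compared to impressions."""
    return 100.0 * clicks / impressions


def share_of_voice(mentions):
    """Each brand's mentions as a percentage of all mentions in the group."""
    total = sum(mentions.values())
    return {brand: 100.0 * count / total for brand, count in mentions.items()}


def word_frequencies(text):
    """Word counts that a word cloud would map to font sizes."""
    return Counter(text.lower().split())


ctr = click_through_rate(clicks=1, impressions=100)
# Hypothetical mention counts for three competing brands.
voice = share_of_voice({"BrandA": 30, "BrandB": 50, "BrandC": 20})
freqs = word_frequencies("great service great price")
```

The first call reproduces the glossary example: one click out of 100 impressions is a CTR of 1.0%.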
  6. A) Data. Moore's law: the number of transistors in a dense integrated circuit doubles approximately every two years. To make things even more challenging, you need distributed processing of heterogeneous data sets, and more hardware to store and process modern data. Distributed clusters are pervasive (AWS, VM, Azure, Rackspace). B) ML. Mature methods exist for common problems, e.g., classification, regression, collaborative filtering, clustering. Traditional tools (Matlab, R, Excel, etc.) run on a single machine. Who is going to manage all this? IT architects? MIT: When there is chaos, there are opportunities.
  7. Libraries: - Spark SQL (DataFrame manipulation) for processing structured and semi-structured datasets, with Hive support and SQL windowing (ANSI 2003). - Spark Streaming (continuous DStream computation) for scalable, high-throughput, fault-tolerant stream processing of live data streams (Kafka, Flume, Twitter, etc.). - Spark graph processing (GraphX) for graphs and graph-parallel computation. Graph algorithms: PageRank, Connected Components, Triangle Counting. Languages, 3 levels: a) command-line R; b) shells are available for Python, Scala and R; c) IntelliJ IDEA (Java/Scala).
  8. Spark is a general-purpose engine. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development. Spark is built in Scala around the concept of Resilient Distributed Datasets (RDDs) and provides actions and transformations on top of RDDs.
  9. RDDs are an 'immutable resilient distributed collection of records' which can be stored in volatile memory or in persistent storage (HDFS, HBase, etc.) and can be converted into another RDD through transformations. An action like count can also be applied to an RDD. http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html Partitions are logical buckets for data. Partitions correspond to Hadoop's splits (if the data lives on HDFS) or to partitioning schemes in the source storage. An RDD (and hence the data inside) is partitioned. Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors. Data in partitions can be skewed, i.e. unevenly distributed across partitions. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. The repartition and coalesce operators can repartition a dataset.
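To make the idea of partitions and skew concrete, here is a toy sketch in plain Python. It is not Spark code and does not use Spark's partitioner; the function `hash_partition`, the CRC32 hash, and the sample records are assumptions made for illustration. It shows how assigning records to partitions by key keeps equal keys together and how one frequent key leaves the partitions unevenly filled.

```python
import zlib
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    buckets = defaultdict(list)
    for key, value in records:
        # CRC32 is used here only because it is stable across Python runs.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[index].append((key, value))
    return [buckets[i] for i in range(num_partitions)]


def partition_sizes(partitions):
    """Number of records held by each partition."""
    return [len(partition) for partition in partitions]


# Hypothetical records: one key is far more frequent than the others.
records = [("user_a", i) for i in range(8)] + [("user_b", 0), ("user_c", 0)]
partitions = hash_partition(records, num_partitions=4)
sizes = partition_sizes(partitions)
```

Whichever partition receives `user_a` holds at least 8 of the 10 records, which is the kind of skew that a different grouping of the data across partitions is meant to correct.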
  10. Provides distributed machine learning algorithms on top of Spark’s RDD abstraction.
  11. Orientation
  12. The model is saved as a set of files, including an object containing the model, exportable to PMML. Output: a jar file that can be executed. The Pipeline is a wrapper for the green part of the workflow. The goal of the Pipeline API (aka Spark ML, or spark.ml given the package the API lives in) is to let users quickly and easily assemble and configure practical distributed machine learning pipelines (aka workflows) by standardizing the APIs for different machine learning concepts. The ML Pipeline API is a new DataFrame-based API developed under the spark.ml package.
  13. Campaign Monitoring: Measure overall tone and sentiment of external social conversations related to a campaign. Brand Monitoring: Measure overall tone and sentiment of external social conversations related to the business. Identify Influencers: Who should I care about? What’s the impact of their influence? How do I harness them?
  14. Extending Service Reach: Identify specific issues and engage social customers to resolve their issues. Early Warning System: Identify trends in subjects and sentiment to anticipate and respond to customer issues. Knowledge Management: Harness the conversations across the business as a living, breathing knowledge base. Knowledge Harvesting: Extract, package and re-use best practices and solutions from customers helping customers.
  15. Lead Harvesting: Listen for, identify and react to potential sales triggers.
  16. Sentiment analysis is one of the Natural Language Processing fields, dedicated to the exploration of subjective opinions or feelings collected from various sources about a particular subject.
  17. Ronaldo is the highest paid footballer in the world. Every year Cristiano Ronaldo earns a base salary of $45 million.
  18. Orientation