AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
1. 1. Text Analytics
2. AI with Apache Spark
3. AI Experiment with Microsoft ML Studio
By Adj Prof. Giuseppe Mascarella
giuseppe@valueamplify.com
WELCOME TO:
Text Analytics And Social Media Analytics
2. People are sharing more than ever before…
190 million tweets per day
100 hours of video uploaded to
YouTube per minute
500M+ photos uploaded and
shared each day
Total pieces of content shared on
Facebook each month: 70 billion
3. Crunching the Text Analytics Numbers
One negative social media comment can cost a company 30 customers.
*Convergys corp.
A customer review on social media reaches an average audience of 45 people…
67% of whom
would avoid or
completely stop doing
business with a
company they
heard bad things about.
*Convergys corp.
44%
Of customers today complain about
products and services on social
networks
*IBM, 2012
Customers are nearly 60%
through the sales process before
engaging a sales rep, regardless of
price point.
*Google and CEB
Whose advice do you seek for
guidance when evaluating software
and service technology solutions for
your business? Social media?
<25 56%
25-29 61%
30-34 53%
35-40 52%
41-50 40%
>50 34%
*SMB group 2010 small & medium business
routes to market study, July 2010
Percent of tweeters
who don’t tweet,
but watch other people
tweet 40%
4. News travels quickly…
People are more likely to share a bad experience online vs. a good
one.
87% will share
with others
33% will share
with 5+ people
7. Data Mining Text
1. Law, Ordinances, Court Sentences
2. Citizen data (i.e.
https://www.covidimpactmeter.com/)
3. Product news
4. Interviews with industry experts
5. Opinion columns or blogs
6. Tutorials on new technologies or
practices
7. Analysis from independent
journalists or analysts
1. Vendor-provided information
2. Research on market size
3. Press Release
4. Analysis from peers
5. White papers
6. Vendor news
7. Product reviews or buying guides
8. Forward-looking trends
9. Case studies
10. https://www.statista.com/
11. etc
9. What Is FOIL?
The Freedom of Information Law (FOIL) promotes the policy of open government. FOIL was
designed to make documents generated by and in the possession of government agencies available
to the public with certain specific exceptions.
• The Freedom of Information Law (“FOIL”), Article 6 (Sections 84-90) of the NYS Public Officers
Law, provides the public right to access records maintained by government agencies with certain
exceptions.
• “Record” means any information kept, held, filed, produced or reproduced by, with, or for this
agency, in any physical form whatsoever including, but not limited to, reports, statements,
examinations, memoranda, opinions, folders, files, books, manuals, pamphlets, forms, papers,
designs, drawings, maps, photos, letters, microfilms, computer tapes or disks, rules, regulations
or codes.
• Learn more about records access and Open FOIL NY.
Federal Level: https://www.foia.gov/
11. 1. Text Analytics
2. AI Text Analytics with Apache Spark
3. AI Experiment with Microsoft ML Studio
12. APACHE SPARK
• 1. What is Spark?
• 2. Spark MLlib
• 3. Scenario: Text Mining
13. What Challenge We Are Trying To Solve?
Data Grows Faster
than Moore’s Law
[IDC report, Kathy
Yelick, LBNL]
Data
Machine
Learning
Distributed
Computing
14. How Can We Do So At Scale With Massive
Data?
1. Start implementing distributed pipelines
2. Optimize which tables you keep in memory, and for how long
3. Understand ML libraries that go beyond single-threaded computing
4. Scale out (distributed, e.g., cloud-based): add more hardware to store and process modern data
5. Leverage commodity hardware pricing, which scales to massive problems
6. Process streaming data in real time
While Increasing Speed
15. 1. What is Spark?
Apache Spark is a fast and general-purpose cluster
computing system,
with in-memory data processing engine,
suited for modern data centric applications.
[Diagram: the Spark stack. APIs: Scala, Python, R, Java. Libraries on top of Spark Core: Spark SQL, Spark Streaming, MLlib, GraphX. Cluster managers: Apache Mesos, Hadoop YARN, Spark Standalone. Storage: S3, HDFS, Azure, …]
16. What Spark?
Free software: spark.apache.org
Apache Spark is 100% open source, hosted at the vendor-independent Apache
Software Foundation
“..our efforts focuses on both the Spark
codebase and support.
All of our work on Spark is open source
and goes directly to Apache.”
Matei Zaharia, VP, Apache Spark,
Founder & CTO, Databricks
Early Adopters
The Washington Post: recommendation engine for content to their readers.
Yelp: connecting users with local businesses, uses Spark to increase the click-through rates of display advertisements.
Hearst Corporation: uses Spark Streaming to process clickstream data from over 200 web properties, for a real-time view of article performance and trending topics.
..Uber, Pinterest, Conviva, and Yahoo
17. Spark Infrastructure
Cluster deployment with built-in interoperability with Hadoop YARN, Apache
Mesos, file systems, etc.
Spark Core defines RDD data abstraction and provides in-memory
computing capabilities.
19. How Does The Core Work?
Parallelizes Distributed Data Processing
Shuffle
DAGScheduler, TaskScheduler, SchedulerBackend
Resilient Distributed Dataset (and DataFrames)
In-Memory, Partitioned, Cacheable, Parallel, Typed, Lazy Evaluated
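The RDD properties listed on this slide (partitioned, lazily evaluated, parallel) can be illustrated with a toy sketch in plain Python. This is not the Spark API: the class ToyRDD and the helper parallelize are hypothetical names that only mimic the idea that transformations are recorded in a plan and nothing runs until an action is called.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus a lazy plan."""

    def __init__(self, partitions, plan=()):
        self.partitions = partitions  # list of lists, one per "worker"
        self.plan = plan              # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only extends the plan, computes nothing.
        return ToyRDD(self.partitions, self.plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: only extends the plan, computes nothing.
        return ToyRDD(self.partitions, self.plan + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step to one partition, in order.
        for kind, fn in self.plan:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # Action: runs the plan on every partition and gathers the results.
        return [x for p in self.partitions for x in self._run_partition(p)]

    def reduce(self, fn):
        # Action: runs the plan, then folds all results into one value.
        return reduce(fn, self.collect())


def parallelize(data, num_partitions):
    """Split a list into roughly equal partitions (round-robin)."""
    return ToyRDD([data[i::num_partitions] for i in range(num_partitions)])


rdd = parallelize(list(range(10)), 3)
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = squares_of_evens.reduce(lambda a, b: a + b)
```

In real Spark the per-partition work would be scheduled on different executors; here it runs sequentially, which is enough to show the lazy plan.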
20. 2. Spark MLlib
Machine learning in a distributed computing environment with petabytes of data
org.apache.spark.mllib: the RDD-based API
org.apache.spark.ml: the DataFrame-based Pipeline API for designing, evaluating,
and tuning machine learning pipelines
21. ML Algorithms with RDD
• Classification: logistic regression, naive Bayes,...
• Regression: generalized linear regression, isotonic
regression,...
• Decision trees, random forests, and gradient-
boosted trees
• Recommendation: alternating least squares (ALS)
• Clustering: K-means, Gaussian mixtures
(GMMs),...
• Topic modeling: latent Dirichlet allocation (LDA)
• Feature transformations: standardization,
normalization, hashing,...
• Model evaluation and hyper-parameter tuning
• ML Pipeline construction
• ML persistence: saving and loading models and
Pipelines
• Survival analysis: accelerated failure time model
• Frequent itemset and sequential pattern mining:
FP-growth, association rules, PrefixSpan
• Distributed linear algebra: singular value
decomposition (SVD), principal component
analysis (PCA),...
• Statistics: summary statistics, hypothesis
testing,...
https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22
22. 3. Scenario: Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it,
as a scientific or non-scientific one.
Goal: Find the decision boundary, the “model”.
Scientific
Non
Scientific
Model
Schema for Categories
sealed trait Category
case object Scientific extends Category
case object NonScientific extends Category
$\sum_{k=0}^{n} \binom{n}{k} x^{k}$
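As a rough illustration of the scenario on this slide, the sketch below trains a binary logistic regression on word counts with plain stochastic gradient descent. It is not Spark MLlib code; the four training sentences, the learning rate, and the function names are all made up for the example, with label 1 standing for Scientific and 0 for NonScientific.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def featurize(text, vocab):
    # Word-count vector over a fixed vocabulary; unknown words are ignored.
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def train(docs, labels, vocab, lr=0.5, epochs=300):
    # Stochastic gradient descent on the logistic (log) loss.
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = featurize(text, vocab)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(text, vocab, weights, bias):
    # Probability that the text belongs to class 1 (Scientific).
    x = featurize(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Illustrative, made-up training data.
docs = [
    "the experiment measured protein folding",
    "quantum experiment confirms theory",
    "the team won the football match",
    "celebrity wedding photos leaked",
]
labels = [1, 1, 0, 0]
vocab = sorted({w for d in docs for w in d.split()})
weights, bias = train(docs, labels, vocab)

p_scientific = predict("new quantum theory experiment", vocab, weights, bias)
p_other = predict("football wedding photos", vocab, weights, bias)
```

The decision boundary is the set of inputs where the predicted probability equals 0.5; texts scoring above it are labeled Scientific.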
24. Definition
• Tokenization, when applied to data security, is the process of
substituting a sensitive data element with a non-sensitive
equivalent, referred to as a token, that has no extrinsic or
exploitable meaning or value. The token is a reference (i.e.
identifier) that maps back to the sensitive data through a
tokenization system.
• Hashes are used to index data. Hash values can be used to map
data to individual "buckets" within a hash table. Each bucket has a
unique ID that serves as a pointer to the original data. This creates
an index that is significantly smaller than the original data,
allowing the values to be searched and accessed more efficiently.
28. Data In the Social Mission Control Room
Social
Engagement
1. Needs
2. Solution Example
3. How
29. Why Do You Need To Monitor Conversations Worldwide in a Timely Manner?
1. Issues and troubles involving people can quickly go viral
2. Influencers are difficult to identify and prioritize
3. Trends and opportunity signals get lost in the noise
30. Sentiment Analysis Addresses:
1. People, topic and event sentiment
Gain insight and learn what people really think and feel about it.
2. Campaign monitoring
Measure the effectiveness of remediation campaigns in real time.
3. Top influencer tracking
Identify and monitor the top influencers in your industry, company, and customer base.
31. Social care
Global sentiment analysis
Track social care cases in Spanish, English and
Portuguese.
Real-time social case resolution
Create alerts to quickly identify customer issues,
and identify trends early on.
Integrated channel management
Across the social and web (e.g., Twitter,
Facebook, Blogs, and boards).
32. Social Alert Management
Competitive intelligence
Gain important insights about your competitors’
weaknesses and strengths.
Target account tracking
Monitor key developments and decision makers
at your top accounts.
Unattended alerting
Create rule-based real-time alert agents and
generate templatized PR actions
33. Data In the Social Mission Control Room
Social
Engagement
1. Needs
2. Solution Example
3. How
34. What Is Media Monitoring
Metrics of social audience
Measure sentiment of the audience on characters
Browse conversations
35. NLP, Text/Document level classification methods and software tools
Positive/Negative/Neutral (+/-1)
Subjective -> Objective
Tailored for short texts
Handles Twitter jargon (RT, @, #), spelling errors, disfluencies
Entity level sentiment
A. What Is Sentiment Analysis?
Finding information through opinion mining, or emotion AI,
to understand the voice of the crowd through reviews and
survey responses, online and on social media.
B. Where Is It Used?
Marketing, Branding, Entertainment, Customer service, Clinical
medicine, Politics, Finance, etc
C. What Are The Mechanics of It?
36. Real time visualizations on twitter conversations
Overall sentiment of the audience on each player / character
Sentiment Analysis:
From Volume Analysis To Quality Analysis
$45M/Year
37. Use Case: Company FY/Quarter Result Release
Produced by Value Amplify - Confidential
38. The Mechanics Of Text Classification
Algorithm: Logistic Regression (Binary Classification)
Problem: Given a text document, classify it as “Positive” or “Negative” in sentiment analysis.
Goal: Find the decision boundary, the “model”.
Negative
Positive
Model
Entities/Channels
SentimentAnalysisIndex
39. Social Mission Control Room
Social
Engagement
1. Needs
2. Solution Samples
3. How
By Prof. Giuseppe Mascarella
Cell: 425 269 5478
41. You can single out an influencer's point of view in one channel and
reply with info from other channels
44. The Business Problem (Microsoft Case Study)
Elena works for an Internet-based retailer company selling DVDs, software,
video games, toys, electronics, and furniture.
The company shows customer feedback at the product level. Her task is to
build a pipeline that automatically analyzes customer feedback and Twitter
messages, to provide the overall sentiment for each product.
The aim (label, target) is to help consumers understand, before purchasing a
product, whether public opinion (previous reviews or best-seller status) has
been positive (= 4).
How Do You Use AI For This?
-Training Data
-Target (label) : Review (not best seller)
-Algorithm
45. 1. Text Analytics
2. AI with Apache Spark
3. AI Sentiment Analysis Experiment with Microsoft ML Studio
46. Data Set
• The open source data comprises approximately 1,600,000
automatically annotated tweets: http://help.sentiment140.com/
• Any tweet containing a positive emoticon such as :), :-), :D or =D was
assumed to bear positive sentiment.
• Any tweet containing a negative emoticon such as :<, :-( or :( was
assumed to bear negative polarity.
• Tweets containing both positive and negative emoticons are a
problem
• Dataset to be used:
http://azuremlsamples.azureml.net/templatedata/Text - Input.csv
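The emoticon rule described on this slide can be sketched as a small labeling function. The emoticon lists are the ones named on the slide; the function name emoticon_label is hypothetical, the label values 4 and 0 follow the dataset's polarity coding, and returning None for tweets with both kinds of emoticon (or none) is an assumption about how such tweets would be set aside.

```python
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=D")
NEGATIVE_EMOTICONS = (":<", ":-(", ":(")


def emoticon_label(tweet):
    """Return 4 (positive), 0 (negative), or None when no clear label exists."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos and has_neg:
        return None  # mixed emoticons: the problematic case from the slide
    if has_pos:
        return 4
    if has_neg:
        return 0
    return None      # no emoticon, so no automatic label


# Made-up example tweets.
examples = [
    "great day at the beach :)",
    "missed the bus again :(",
    "love the phone :D but the battery died :-(",
    "just landed in Rome",
]
labels = [emoticon_label(t) for t in examples]
```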
47. Variables (Label, Features, Text Column)
1. sentiment_label - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. tweet_id - the id of the tweet
3. time_stamp - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. target - the query (lyx). If there is no query, then this value is NO_QUERY.
5. user_id - the user who posted the tweet
6. tweet_text - the text of the tweet
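A minimal sketch of reading records with these six columns, using only the Python standard library. It assumes a headerless, quoted CSV with the columns in the order listed on this slide; the actual sample file may differ. The sample row and the user name are invented for illustration.

```python
import csv
import io

# Column order as listed on the slide (assumed to match the file layout).
COLUMNS = ["sentiment_label", "tweet_id", "time_stamp",
           "target", "user_id", "tweet_text"]
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}

# Made-up sample record standing in for a line of the real file.
sample = ('"4","1","Sat May 16 23:58:44 UTC 2009","NO_QUERY",'
          '"example_user","loving the new phone :)"\n')

rows = []
for record in csv.reader(io.StringIO(sample)):
    row = dict(zip(COLUMNS, record))
    row["sentiment_label"] = int(row["sentiment_label"])  # label is numeric
    rows.append(row)

first_polarity = POLARITY[rows[0]["sentiment_label"]]
```

For a real file, io.StringIO(sample) would be replaced by an open file handle.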
49. What Algorithm Is Best?
• The bag-of-words vector representation model is commonly used for text classification. In
this method, the frequency of occurrence of each word, or term frequency (TF), is
multiplied by the inverse document frequency (IDF), and the TF-IDF scores are used as
feature values for training a classifier.
• The n-gram model is another common vector representation model. Note that there is
no conclusive answer as to which one works best.
• In steps 5A and 5B, the most accurate model is deployed as a published web service,
using either RRS (Request Response Service) or BES (Batch Execution Service).
When using RRS, only one text instance is classified at a time.
When using BES, a batch of text instances can be sent for classification at the same time.
By using these web services, you can perform classification in parallel, using either an
external worker or the Azure Data Factory, for greatly enhanced efficiency.
50. Text pre-processing
Text cleaning:
- replacing special characters and punctuation marks with spaces,
- normalizing case, removing duplicate characters, removing user-defined or built-in stop-words, and word stemming.
Highly custom steps are implemented using the R programming language.
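The cleaning steps named on this slide can be sketched as follows. The slide says the experiment implements custom steps in R; this Python version is only an illustration. The stop-word list is a tiny made-up sample, "removing duplicate characters" is interpreted here as collapsing runs of three or more repeated characters, and naive_stem is a crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the module's built-in list).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "it"}


def naive_stem(word):
    # Crude suffix stripping; a real pipeline would use a proper stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    text = text.lower()                          # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # special characters and punctuation -> spaces
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # collapse runs of 3+ repeated characters to 2
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = clean("The movie is sooooo AMAZING!!! Loved it.")
```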
52. Feature hashing
• Feature hashing converts variable-length text documents to equal-length
numeric feature vectors, using the 32-bit murmurhash v3
hashing method provided by the Vowpal Wabbit library.
• The objective of using feature hashing is dimensionality
reduction; also feature hashing makes the lookup of
feature weights faster at classification time, because it uses
hash value comparison instead of string comparison.
• Hashing bitsize is used to specify the number of bits to use
when creating the hash table.
• The default bit size is 10. For many problems, this value is more
than adequate, but whether it suffices for your data depends on
the size of the n-gram vocabulary in the training text. With a
large vocabulary, more space might be needed to avoid
collisions.
• We recommend that you try using a different number of bits
for this parameter, and evaluate the performance of the
machine learning solution.
• For N-grams, type a number that defines the maximum length
of the n-grams to add to the training dictionary. An n-gram is a
sequence of n words, treated as a unique unit.
• N-grams = 1: Unigrams, or single words.
• N-grams = 2: Bigrams, or two-word sequences, plus unigrams.
• N-grams = 3: Trigrams, or three-word sequences, plus bigrams and
unigrams.
• In this experiment, set the hashing bitsize to 15 and the number of n-grams to 2. With
these settings, the hash table can hold 2^15 or 32,768 entries, in which each hashing
feature represents one or more n-gram features and its value represents the occurrence
frequency of that n-gram in the text instance.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
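A minimal sketch of feature hashing with unigrams and bigrams, under stated assumptions: zlib.crc32 from the Python standard library stands in for the 32-bit MurmurHash v3 that the slide mentions, the sparse vector is stored as a dict of bucket index to count, and the function names are hypothetical.

```python
import zlib


def ngrams(tokens, max_n):
    # All n-grams of length 1..max_n, shorter ones first.
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def hash_features(text, bits=15, max_n=2):
    """Map a text to a sparse count vector with 2**bits possible buckets."""
    size = 1 << bits
    vector = {}
    for gram in ngrams(text.lower().split(), max_n):
        # crc32 is a stand-in hash; different n-grams may collide in one bucket.
        index = zlib.crc32(gram.encode("utf-8")) % size
        vector[index] = vector.get(index, 0) + 1
    return size, vector


size, vector = hash_features("good movie good", bits=15, max_n=2)
```

Every document maps to the same fixed dimensionality (32,768 with 15 bits), whatever its length. Fewer bits means a smaller table but more collisions.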
53. Reduce Computational Complexity
The classification time and complexity of a trained model depends on the number of
features (the dimensionality of the input space). For a linear model, such as a support
vector machine, the complexity is linear with respect to the number of features.
For text classification tasks, the number of features resulting from feature extraction is high
because each word in the vocabulary and each n-gram is mapped to a feature.
To select a more compact feature subset from the exhaustive list of extracted hashing
features, we used the Filter Based Feature Selection module. The aim is to avoid the
effects of the curse of dimensionality and to reduce the computational complexity without
harming classification accuracy. To get the top 5,000 most relevant features with respect to
the sentiment label out of the 2^15 extracted features, use the Chi-squared score function to
rank the hashing features in descending order.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
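The ranking idea can be sketched with a chi-squared score computed from a 2x2 contingency table (feature present or absent versus positive or negative label). This is an illustrative stand-in, not the Filter Based Feature Selection module itself; the function names and the four toy documents are made up, and features are treated as binary presence flags.

```python
def chi2_score(present, labels):
    """Chi-squared statistic for one binary feature against a binary label."""
    a = sum(1 for f, y in zip(present, labels) if f and y)          # present, positive
    b = sum(1 for f, y in zip(present, labels) if f and not y)      # present, negative
    c = sum(1 for f, y in zip(present, labels) if not f and y)      # absent, positive
    d = sum(1 for f, y in zip(present, labels) if not f and not y)  # absent, negative
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_k_features(docs, labels, k):
    # Rank features by score in descending order; ties are broken alphabetically.
    vocab = sorted({w for doc in docs for w in doc})
    scores = {w: chi2_score([w in doc for doc in docs], labels) for w in vocab}
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked[:k], scores


# Toy documents as sets of tokens; 1 = positive sentiment, 0 = negative.
docs = [{"good", "movie"}, {"good", "plot"}, {"bad", "movie"}, {"bad", "plot"}]
labels = [1, 1, 0, 0]
selected, scores = top_k_features(docs, labels, k=2)
```

Words that occur equally often in both classes ("movie", "plot") score zero and are dropped first.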
54. TF-IDF Calculation
• When the metric word frequency of occurrence
(TF) in a document is used as a feature value, a
higher weight tends to be assigned to words that
appear frequently in a corpus (such as stop-words).
• The Inverse Document Frequency (IDF) is a better
metric, because it assigns a lower weight to
frequent words. You calculate IDF as the log of the
ratio of the number of documents in the training
corpus to the number of documents containing the
given word.
Combining these numbers in a metric (TF-IDF, the product
of TF and IDF) places greater importance on words that are
frequent in the document but rare in the corpus.
This assumption is valid not only for unigrams but
also for bigrams, trigrams, etc.
• This experiment converts unstructured text data
into equal-length numeric feature vectors where
each feature represents the TF-IDF of a unigram in
a text instance.
• Feature selection (dimensionality reduction)
We used the Chi-squared score function to rank the
hashing features in descending order, and returned
the top 5,000 most relevant features with respect to
the sentiment label, out of all the extracted unigrams.
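A short worked example of the calculation described on this slide: TF is the raw count of a word in a document, IDF is the log of the number of documents divided by the number of documents containing the word, and the feature value is their product. The three toy documents and the function name tf_idf are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF * IDF} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in docs for w in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return result


docs = [
    ["the", "movie", "was", "great", "great"],
    ["the", "movie", "was", "boring"],
    ["the", "plot", "was", "great"],
]
scores = tf_idf(docs)
```

A word that appears in every document ("the", "was") gets IDF = log(1) = 0, so it contributes nothing, while a word unique to one document ("boring") gets the highest IDF, log(3).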
56. Use the first Split module to split the data into two subsets:
1. The first subset will be used to train the model.
2. The second subset will be split in the next step into a development/validation set and a test set.
In the sample experiment, we split the data into 70% and 30% respectively.
Use the second Split module to split the second subset into two subsets:
1. The development set, used later by the Sweep Parameters module.
2. The test set, used to evaluate the performance of the trained model.
In the sample experiment, we split the 30% data sample into two halves. That is,
each of the development set and the test set represents 15% of the input
data.
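The two-stage split can be sketched in a few lines. The helper split is a hypothetical stand-in for the Split module; the shuffle and the fixed seed are assumptions made so the example is reproducible, and the 1,000 integers stand in for rows of the dataset.

```python
import random


def split(rows, fraction, seed=0):
    """Shuffle and cut rows into (first fraction, remainder)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * fraction))
    return rows[:cut], rows[cut:]


data = list(range(1000))                    # stand-in for the input rows
train_set, rest = split(data, 0.70)         # first split: 70% / 30%
dev_set, test_set = split(rest, 0.50)       # second split: 15% / 15% of the input
```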
Use the Sweep Parameters module to get the optimal values for the underlying
learning algorithm parameters. There are two options to try:
1. Random sweep where the module will conduct a number of training runs
(specified by the parameter 'Maximum number of runs on random sweep')
from the parameter ranges.
2. Entire grid option as a parameter sweeping mode to explore all possible values
for each parameter as specified in the learning algorithm module such as the
Two-Class Logistic Regression module. In the sample experiment, the AUC is
specified as Metric for measuring performance for classification. Other
performance evaluation criteria can be used for model selection such as
precision, recall and F-score.
For binary-class classification tasks, you can either keep the Two-Class Logistic
Regression module, or you can replace it with another binary-class classification
trainer, such as Two-Class Support Vector Machine, Two-Class Boosted Decision Tree,
etc.
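The two sweeping modes can be contrasted with a small sketch. Everything here is illustrative: the parameter names and ranges are hypothetical, toy_auc is a made-up scoring function standing in for "train on the training set, measure AUC on the development set", and the random sweep is simplified to sampling from the same grid rather than from continuous ranges.

```python
import itertools
import random

# Hypothetical parameter ranges for a logistic regression learner.
PARAM_RANGES = {"l2_weight": [0.01, 0.1, 1.0], "tolerance": [1e-4, 1e-6]}


def entire_grid(ranges):
    # Every combination of every parameter value.
    keys = sorted(ranges)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(ranges[k] for k in keys))]


def random_sweep(ranges, max_runs, seed=0):
    # At most max_runs combinations, chosen at random.
    grid = entire_grid(ranges)
    return random.Random(seed).sample(grid, min(max_runs, len(grid)))


def toy_auc(params):
    # Made-up score; a real sweep would train and evaluate a model here.
    return 1.0 - abs(params["l2_weight"] - 0.1) - params["tolerance"]


def sweep(candidates, score_fn):
    # Keep the candidate with the best score on the chosen metric.
    return max(candidates, key=score_fn)


grid = entire_grid(PARAM_RANGES)
best_grid = sweep(grid, toy_auc)
sampled = random_sweep(PARAM_RANGES, max_runs=3)
best_random = sweep(sampled, toy_auc)
```

The entire grid always finds the best combination among the listed values but costs one training run per combination; the random sweep caps the number of runs and may miss the optimum.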
57. Microsoft ML Studio Experiment
ML Experiment Documentation
ML Experiment:
https://studio.azureml.net/community/unpack?packageUri=https%3a%2f%2fstorage.azureml.net%2fdirectories%2fdc2f564b4c6c4c9d825836eae21f8751%2fitems&communityUri=https%3a%2f%2fgallery.azure.ai%2fDetails%2ftext-classification-step-1-of-5-data-preparation-3&entityId=Text-Classification-Step-1-of-5-data-preparation-3
Editor's Notes
As you know, the customer experience is your brand. And every touch point customers have influences that brand. Those touch points used to be very limited – a TV advertisement, the product packaging, the product itself, maybe a brochure or flyer. If customers wanted information it came from the company, industry publications or a select group of peers.
With the rise of social and mobile, those touch points are multiplying. And multiplying exponentially. Today’s customers have huge networks to tap for information and are no longer buying in the ways they used to. While this shift creates new challenges, it also creates a huge opportunity for businesses to reach customers in ways never before possible.
As we have just mentioned, the world has gone social. And we all experience it every day in our personal lives. We are sharing on Twitter, Facebook, and blogs. When we want to buy a product, we go online for product reviews, or ask questions on forums. We engage our peers for opinions and Tweet our likes and loves. We post photos and videos for our friends and the world.
And this isn’t a trend or a fad – this is a fundamental shift in the way people communicate. But this isn’t just a change in the way consumers behave – these social conversations and behaviors are increasingly moving to the B2B space as well. There is a conversation going on out there and if you’re not listening to it, you’ll be in the dark.
You can see the facts on this slide which are meant to illustrate what’s happening in the market – and it’s amazing to quantify some of the numbers for how social the world is becoming. Here are a few other facts that highlight what this rise in social means.
First:
75% of B2B buyers are likely to use social media to influence their purchase decisions. What that means is that customers aren’t just using social – they’re using it to buy. So if you’re not engaged on social, you’re missing the majority of your target market. These customers are out there talking about your products, brands and competitors and if you aren’t hearing the conversations and using them to inform your strategy, you’ll be at a severe disadvantage.
Source: http://www.insideview.com/social-selling?utm_source=infographic&utm_medium=howsocialisb2b&utm_campaign=social-selling
Second:
Customers are 70% of the way through the purchase process before they contact sales. When you think about this – and the way customers buy, that means they’ve already done their research and are close to a decision before they even contact you. Given this, if you want to be able to influence purchasing decisions, you’ll need to engage long before they ever contact your sales department.
Source: SiriusDecisions
“When you couple that with the fact that over 500,000,000 photos are uploaded and shared each day on Facebook, 100 hours of video are uploaded to YouTube per minute, there are 190,000,000 tweets per day, and the total pieces of content shared on Facebook each month exceed 70,000,000,000… we can see that our customers are part of this social reality and are using social media as a mechanism to get the information they are looking for as they are looking to purchase products from us. In fact (as we just stated), they are well down the path of the buying cycle because of it before they even reach out to us to engage.”
Instructor
Walk through the slide animations and use this as points of discussion in your class.
“Let’s talk a little bit about customer experience.”
Walk through the animation of the slide to guide your class discussion.
“As you can see, leveraging social media as well as monitoring what is happening relative to your social footprint is a vital aspect of many businesses’ day to day marketing, sales, and even customer care efforts. Before we spend too much more time on specifics, let’s review some of the common terms and tools used online that drive this customer behavior.”
Hashtag
The hashtag is a word or phrase preceded by the “#” sign. #Hashtags are a simple way to mark the topic (or topics) of social media messages and make them discoverable to people with shared interests. On most social networks, clicking a hashtag will reveal all the public and recently published messages that also contain that hashtag. Hashtags first emerged on Twitter as a user-created phenomenon and are now used on almost every other social media platform, including Facebook, Google+, Instagram, Vine and Pinterest.
Bitly/ URL shortening
Converting a long URL name into a short one. Also called "URL redirecting," there are free URL shortening services on the Web that take a long URL and convert it to a short one for publication on a Web page or other venue. When the short URL is clicked, the URL shortening service receives the request and redirects it to the long URL. Short URLs are widely used to slim down long URLs printed in magazines and newspapers to make it easier to type into a browser. They are also used in e-mail, because URLs in the message often wind up as two lines of text at the recipient's end that cannot be clicked as one link. Although the recipient can copy the two lines back together in the browser's address bar, the short URL eliminates the potential problem.
Permalink
The URL address of an individual piece of content. Permalinks are useful because they allow you to reference a specific Tweet, update, or blog post instead of the feed or timeline in which you found it. You can quickly find an item’s permalink by clicking on its timestamp
SEO
Search engine optimization is the practice of increasing the “organic” visibility of a web page in a search engine, such as Google. Although businesses can pay to promote their websites on search engine results pages (Search Engine Marketing, or SEM), SEO refers to “free” tactics that enhance the search ranking of a page.
Sentiment analysis
An attempt to understand how an audience feels about a brand, company, or product based on data collected from social media. It typically involves the use of natural language processing or another computational method to identify the attitude contained in a social media message. Different analytics platforms classify sentiment in a variety of ways; for example, some use “polar” classification (positive or negative sentiment), while others sort messages by emotion or tone (Contentment/Gratitude, Fear/Uneasiness, etc.).
Meme
An idea, fashion, or behavior that is transmitted from person to person through media, speech, gestures, and other forms of communication. The term was conceived by evolutionary biologist Richard Dawkins in the 1970s, but it has exploded into greater relevance in the past decade with the rise of online culture.
In Dawkins’ theory, memes are ideas (or fragments of ideas) that are copied and combined as they move from person to person, much like genes are passed down from generation to generation. Dawkins surmised that we could use the concept of evolution by natural selection to understand how ideas spread and change over time. Some memes spread far and wide, some die out, and others mutate. Social media has made it possible to visualize and measure this phenomenon like never before. For example, we can see hashtags rise and fall in popularity and track how quickly they spread throughout a network.
The word meme is itself a meme. The theory isn’t perfect, and it has its share of critics, but it’s an alluringly simple way to think about the spread of ideas. Therefore, people use the word and pass it on. Its meaning has also evolved over time as it has become increasingly used to describe viral social media content.
Social Capital
The central theme of Social Capital is that social networks have value. It refers to the collective value of all social networks (who people know) and the tendencies or benefits that come from these networks to do things for each other. The interactions and flow of information provides benefit/ value for the people who are connected.
Social capital works through multiple channels such as information flows, norms of reciprocity (mutual aid), collective action, and social identity.
User-generated content (UGC)
Media that has been created and published online by the users of a social or collaboration platform, typically for non-commercial purposes. User-generated content is one of the defining characteristics of social media. It is often produced collaboratively and in real-time by multiple users. Many companies have enthusiastically embraced and encouraged user-generated content as a means of increasing brand awareness and customer loyalty. Instagram contests, Vine video contests, and other UGC-based social campaigns allow businesses to tap into the creative energies of their customers and use their contributions to fuel marketing strategies.
Brandjacking
The hijacking of a brand to promote an agenda or damage a reputation. Brandjackers don’t hack the social media accounts of target individuals and organizations. Instead, they assume a target’s online identity through indirect means such as fake accounts, promoted hashtags, and satirical marketing campaigns.
Clickthrough rate (CTR)
This is a common metric for reporting on the number of people who viewed a message or piece of content and then actually performed the action required such as clicking on the ad or link in an email marketing campaign. The actual metric is calculated by comparing the number of clicks to impressions. For example, if 100 people saw your ad in Google and one person clicked on the ad, you would have a click-through-rate of 1.0%. Clickthrough rate (CTR) is most commonly used for search engine marketing and other performance-driven channels as the general philosophy is that the higher your CTR, the more effective your marketing is.
Creative Commons
Creative Commons is a public copyright license that gives you the ability to use and share otherwise copyrighted material. For social media users, Creative Commons often comes into play when we are looking for images and photos to accompany a social media message or blog post. In both of these cases, unless you are using your own images or have express permission, you can only share Creative Commons images. Sites like Google Image Search and Flickr have filters so you can easily search for Creative Commons photos. Just be careful, as there are different levels of Creative Commons which could restrict whether an image could be used commercially, whether it can be modified, and what kind of attribution is required.
Crowdsourcing
Crowdsourcing refers to the process of leveraging your online community to assist in services, content and ideas for your business. Business examples include getting your audience to volunteer in helping translate your product or by asking your community to contribute content for your blog.
Geotargeting
A feature on many social media platforms that allows users to share their content with geographically defined audiences. Instead of sending a generic message for the whole world to see, you can refine the messaging and language of your content to better connect with people in specific cities, countries, and regions. You can also filter your audience by language.
Astroturfing
Astroturfing is a fake grassroots campaign that seeks to create the impression of legitimate buzz or interest in a product, service or idea. Often this movement is motivated by a payment or gift to the writer of a post or comment or may be written under a pseudonym.
Return on relationship (ROR)
A measurement of the value gained by a person or business from developing a relationship. Measuring ROR isn’t easy; it involves not only analyzing connection growth, but also understanding the impact your customers’ voices have on your brand and reputation. This includes sentiment analysis, as well as engagement metrics for your content, like organic sharing rates. ROR is an alternative (or complementary) metric to social media ROI.
Troll
In Internet slang, a troll is someone who posts controversial, inflammatory, irrelevant or off-topic messages in an online community, such as an online discussion forum or chat room, with the primary intent of provoking other users into an emotional response or to generally disrupt normal on-topic discussion.
SMO
Social Media Optimization (SMO) is a set of practices for generating publicity through social media, online communities and social networks. The focus is on driving traffic from sources other than search engines, though improved search ranking is also a benefit of successful SMO.
Reach
Reach is a data metric that determines the potential size of audience any given message could reach. It does not mean that that entire audience will see your social media post, but rather tells you the maximum number of people your post could potentially reach. Reach is determined by a fairly complex calculation that includes # of followers, shares and impressions as well as net follower increase over time. Reach should not be confused with Impressions or Engagement.
Share of Voice
Share of voice is a metric for understanding how many social media mentions a particular brand is receiving in relation to its competition. It is usually measured as a percentage of total mentions within an industry or among a defined group of competitors.
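As a minimal sketch of the percentage calculation described above, the following Python snippet divides each brand's mentions by the total for the group. The brand names and mention counts are hypothetical, and `share_of_voice` is an illustrative helper, not part of any analytics tool.

```python
def share_of_voice(mentions):
    """Return each brand's share of total mentions as a percentage."""
    total = sum(mentions.values())
    if total == 0:
        # No mentions at all: every brand has a 0% share.
        return {brand: 0.0 for brand in mentions}
    return {brand: 100.0 * count / total for brand, count in mentions.items()}

# Hypothetical mention counts for a defined group of three competitors.
mentions = {"BrandA": 500, "BrandB": 300, "BrandC": 200}
sov = share_of_voice(mentions)
print(sov)
```

The shares always add up to 100% across the chosen group, so the result depends on which competitors are included in the comparison.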
Word cloud
Word clouds, also known as tag clouds or weighted lists, are a visual representation of text, where the frequency of a word determines its size in the word cloud. This is a great tool for identifying words that are repeated or most common.
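The weights behind a word cloud are simply word frequencies. A rough sketch in Python, assuming a tiny hand-picked stop-word list and a made-up sample of posts (both are illustrative assumptions):

```python
import re
from collections import Counter

# Small illustrative stop-word list; real tools use much larger ones.
STOPWORDS = {"the", "a", "is", "and", "to", "of", "was"}

def word_weights(text, top_n=5):
    """Count how often each non-stop-word appears; a higher count means a bigger word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

# Hypothetical social media posts joined into one string.
posts = "The service was slow. Slow service and slow delivery. Great price, great staff."
weights = word_weights(posts)
print(weights)
```

A rendering library would then map each count to a font size; only the counting step is shown here.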
A) Data and Moore's law: the number of transistors in a dense integrated circuit doubles approximately every two years.
A) To make things even more challenging, you need distributed processing of heterogeneous data sets. Modern data needs more hardware to store and process. Distributed clusters are pervasive (AWS, VM, Azure, Rackspace).
B) ML: mature methods for common problems, e.g., classification, regression, collaborative filtering, clustering
Traditional tools (Matlab, R, Excel, etc.) run on a single machine
Who is going to manage all this? IT architects? MIT: When there is chaos, there are opportunities.
Libraries
- Spark SQL (DataFrame manipulation): processing of structured and semi-structured datasets, Hive support, SQL windowing (ANSI 2003)
Languages: 3 levels: a) command-line R, b) shells available for Python, Scala and R, c) IntelliJ IDEA (Java, Scala)
- Spark Streaming (continuous DStream computation): scalable, high-throughput, fault-tolerant stream processing of live data streams (Kafka, Flume, Twitter, etc.)
- GraphX (Spark graphs): graphs and graph-parallel computation. Graph algorithms: PageRank, Connected Components, Triangle Counting
General-purpose engine. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
Spark is built using Scala around the concept of Resilient Distributed Datasets (RDD) and provides actions / transformations on top of RDD.
RDDs are an 'immutable resilient distributed collection of records' that can be stored in volatile memory or in persistent storage (HDFS, HBase, etc.). An RDD can be converted into another RDD through transformations, and an action such as count can be applied to it.
http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
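The difference between transformations and actions can be illustrated with a toy, single-machine stand-in written in plain Python. `ToyRDD` is a hypothetical class for illustration only, not the Spark API; it is not distributed and, unlike Spark, it evaluates transformations eagerly.

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable collection of records."""

    def __init__(self, records):
        # A tuple cannot be modified in place, which mimics immutability.
        self._records = tuple(records)

    # Transformations: return a new ToyRDD and leave this one untouched.
    def map(self, fn):
        return ToyRDD(fn(r) for r in self._records)

    def filter(self, pred):
        return ToyRDD(r for r in self._records if pred(r))

    # Actions: return a plain value to the caller.
    def count(self):
        return len(self._records)

    def collect(self):
        return list(self._records)

numbers = ToyRDD(range(1, 11))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(evens_squared.collect())  # [4, 16, 36, 64, 100]
print(numbers.count())          # 10, the original collection is unchanged
```

In Spark itself, transformations only record what should be computed; the work runs when an action such as count is called.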
Partitions are logical buckets for data.
Partitions correspond to Hadoop's splits (if the data lives on HDFS) or partitioning schemes in the source storage
RDD (and hence the data inside) is partitioned.
Spark manages data using partitions, which help parallelize distributed data processing with minimal network traffic for sending data between executors.
Data in partitions can be skewed, i.e. unevenly distributed across partitions.
Shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions.
Data is often distributed unevenly across partitions.
repartition and coalesce operators can repartition a dataset.
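Skew can be sketched without a cluster. The plain-Python example below assigns keys to partitions by hashing, then compares the result with an even, round-robin style reassignment. The key names, the CRC32 hash, and the helper functions are illustrative assumptions; Spark's own partitioners and its repartition and coalesce operators work on distributed data and differ in detail.

```python
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    """Pick a partition for a key using a deterministic hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def hash_partition_sizes(keys, num_partitions):
    """Number of records per partition when records are placed by key hash."""
    sizes = Counter(partition_of(k, num_partitions) for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

def round_robin_sizes(keys, num_partitions):
    """Number of records per partition when records are dealt out evenly."""
    return [len(keys[p::num_partitions]) for p in range(num_partitions)]

# Hypothetical data: one "hot" key dominates, so hashing by key is skewed.
keys = ["brand_x"] * 90 + [f"user_{i}" for i in range(10)]
skewed = hash_partition_sizes(keys, 4)
balanced = round_robin_sizes(keys, 4)
print(skewed)
print(balanced)
```

All 90 records with the hot key land in the same partition, so one executor would do most of the work, while the even reassignment gives each partition the same number of records.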
MLlib provides distributed machine learning algorithms on top of Spark’s RDD abstraction.
Orientation
Model saved as set of files including an object containing the model exportable to PMML
Output jar file that can be executed
Pipeline is a wrapper for this green part of workflow
The goal of the Pipeline API (aka Spark ML or spark.ml given the package the API lives in) is to let users quickly and easily assemble and configure practical distributed machine learning pipelines (aka workflows) by standardizing the APIs for different Machine Learning concepts.
The ML Pipeline API is a new DataFrame-based API developed under the spark.ml package
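The idea of chaining standardized stages can be sketched in plain Python. The classes below are hypothetical toys, not the spark.ml API: each stage exposes the same transform method, and the pipeline applies the stages in order. Stages that must be fitted on data first are left out to keep the sketch short.

```python
class Tokenizer:
    """Stage 1: add a 'words' column by lower-casing and splitting 'text'."""
    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]

class WordCount:
    """Stage 2: add an 'n_words' column computed from 'words'."""
    def transform(self, rows):
        return [{**row, "n_words": len(row["words"])} for row in rows]

class Pipeline:
    """Run every stage in order, feeding each stage's output to the next."""
    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# Hypothetical input rows, each with a single 'text' column.
rows = [{"text": "Spark makes pipelines easy"}, {"text": "Hello world"}]
result = Pipeline([Tokenizer(), WordCount()]).transform(rows)
print(result)
```

Because every stage shares one interface, stages can be reordered or swapped without changing the code that runs the workflow.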
Campaign Monitoring
Measure overall tone and sentiment of external social conversations related to a campaign
Brand Monitoring
Measure overall tone and sentiment of external social conversations related to the business
Identify Influencers
Who should I care about? What’s the impact of their influence? How do I harness them?
Extending Service Reach
Identify specific issues and engage social customers to resolve their issues
Early Warning System
Identify trends in subjects and sentiment to anticipate and respond to customer issues
Knowledge Management
Harness the conversations across the business as a living, breathing knowledge base
Knowledge Harvesting
Extract, package and re-use best practices and solutions from customers helping customers
Lead Harvesting
Listen for, identify and react to potential sales triggers
Sentiment analysis is a field of Natural Language Processing dedicated to exploring subjective opinions or feelings, collected from various sources, about a particular subject.
Ronaldo is the highest paid footballer in the world
Every year Cristiano Ronaldo earns a base salary of $45 million.
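One very simple way to approximate sentiment is to count words from a positive list and a negative list. The Python sketch below assumes two tiny hand-made word lists (illustrative only); real systems use large lexicons or trained models and handle negation, sarcasm and context.

```python
import re

# Tiny illustrative lexicons; these word lists are assumptions, not a standard resource.
POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "slow", "hate", "poor"}

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Every year Cristiano Ronaldo earns a base salary of $45 million."))
print(sentiment("I love this phone, the camera is excellent"))
print(sentiment("Terrible support and slow delivery"))
```

A factual statement such as the salary sentence contains no words from either list, so this sketch labels it neutral, while the opinionated sentences are labeled positive and negative.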