This presentation explains the research I did while working at the Social Computing Lab at KAIST.
The main goal was to expand the LIWC vocabulary and adapt it for Twitter sentiment analysis.
Download it to see the animations :)
2. Motivation
• Dictionary-based classifiers have high precision
• But usually low recall
• Natural language is very dynamic
• New words appear
• Words change their meaning and sentiment
• Heaps' Law
• Hard to update the dictionary at the same speed
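For reference, Heaps' law is usually stated as below, where V(n) is the number of distinct words seen in a text of n tokens. K and β are empirical constants fitted per corpus; the slides give no values for them.

```latex
V(n) = K \, n^{\beta}, \qquad 0 < \beta < 1
```

Because V(n) keeps growing with n, a fixed dictionary covers a shrinking share of the vocabulary as more tweets are collected.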
3. LIWC Dictionary
• Fairly large dictionary
• Almost 4,500 words and stems
• 406 positive
• 499 negative
• Developing and updating it is a long process
• Almost exclusively done manually
• Requires a lot of human resources
• Last update was in 2007
• Twitter was launched in July, 2006
4. System overview
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
<t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t>
SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
<t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>
.
.
.
Positive:
.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
album via luv photo ;- john pic different kno wearing
la ).
Negative:
!! :( ?? getting twitter omg ?! ppl :/ dude idk da
weather bout wtf iphone smh wat internet =( heat dnt
=/ facebook :| gosh kate :[ fml ima jon swear punch
text =[ cringe ): nd ** imma
6. System overview/Parser
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
<t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t>
SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
<t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>
.
.
.
haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade. rushing home for lauren's final episode. my life makes me sad.
7. Parser
Structured Tweets → Extract tweet text (RegEx) → Filter → Clean → Clean Tweets
Clean: Remove user name (RegEx), Remove URL (RegEx), Remove hash tag (RegEx)
8. Parser
• Regular Expressions
• Very powerful tool for text processing…
• ..but very complex
• Ex.:
Input: <d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>
Pattern: <t>(.*?)</t>
Output: I just reached level 2. #spymaster http://bit.ly/playspy
9. Parser
• Regular Expressions
• Very powerful tool for text processing…
• ..but very complex
• Ex.:
Input: I just reached level 2. #spymaster http://bit.ly/playspy
Pattern: #[0-9a-zA-Z+_]*
Output: I just reached level 2. http://bit.ly/playspy
10. Parser
• Regular Expressions
• Very powerful tool for text processing…
• ..but very complex
• Ex.:
Input: I just reached level 2. #spymaster http://bit.ly/playspy
Pattern: ((http://|www.)([a-zA-Z0-9/.~])*)
Output: I just reached level 2. #spymaster
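The parser steps can be sketched in Python with the three patterns shown on the slides. This is a minimal sketch, not the original parser: the user-name pattern `USER_RE` and the function names are assumptions, since the slides do not show them.

```python
import re

# Patterns copied from the slides.
TEXT_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")
# Assumption: the slides do not show the user-name pattern.
USER_RE = re.compile(r"@\w+")


def extract_text(record):
    """Return the first <t>...</t> field of a raw record, or None."""
    match = TEXT_RE.search(record)
    return match.group(1) if match else None


def clean(text):
    """Remove URLs, hash tags and user names, then normalize whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>"
)
print(clean(extract_text(record)))
```

The record holds two `<t>` fields (tweet text and time zone), so the sketch takes only the first match.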
11. System overview/Master
haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade. rushing home for lauren's final episode. my life makes me sad.
Index Frequency Chunks Co-frequency
12. Master
[Diagram: Tweets → Splitter → Chunks → Mapper (M, M, M) → Reducer (R, R, R) → Frequency and Unsorted Co-frequency → Sort → Co-frequency; Tweets → Indexer → Index]
13. Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words in the LIWC dictionary
• Split the input file into smaller chunks
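A minimal sketch of the splitter, assuming the tweets are already cleaned and the dictionary is a plain set of lowercase words. The function names are hypothetical, and LIWC stems with wildcards are ignored here for simplicity.

```python
from itertools import islice


def select_tweets(tweets, dictionary):
    """Keep only tweets that contain at least one dictionary word."""
    for tweet in tweets:
        if any(word in dictionary for word in tweet.lower().split()):
            yield tweet


def split_into_chunks(tweets, chunk_size):
    """Yield successive lists of at most chunk_size tweets."""
    iterator = iter(tweets)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


tweets = [
    "that makes me happy",
    "wonder how Roubini feels about this",
    "my life makes me sad",
]
selected = list(select_tweets(tweets, {"happy", "sad"}))
chunks = list(split_into_chunks(selected, chunk_size=1))
print(len(tweets), len(selected), len(chunks))
```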
15. Master/Mapper
• Spawn processes in parallel and divide the chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
Chunk → Worker → Frequency.tmp:
someone 6
down 8
ever 10
kinda 2
crazy 14
…
16. Master/Mapper
• Spawn processes in parallel and divide the
chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
• Second: save the co-words for each word
17. Master/Mapper
Worker: Split Words → Remove Duplicates → Generate files → Save co-words
Input tweet: haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
Split words: haha nooo ! i just wanna kill mee !!!! i didn`t do my homework ... and i feel sick =(
18. Master/Mapper/Issues
• Splitting is not trivial
• Splitting on whitespace
• homework… ≠ homework
• Remove punctuation
• :) ☐
• Solution: RegEx again
• ([\w-'`]*)(\W*)
• File names:
• Unique, easy to find and respect OS rules
• Hash
• This is why the index file is important
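The word-splitting issue can be illustrated with a small tokenizer. This is a sketch under assumptions: the pattern is a variant of the one on the slide (a class such as `[\w-']` can be rejected by Python 3's `re` module as a bad character range, so the hyphen is moved to the end), and the function names are hypothetical.

```python
import re

# Assumption: a token is either a run of word characters (plus ' ` -)
# or a run of symbols, so emoticons and "..." survive as tokens.
TOKEN_RE = re.compile(r"[\w'`-]+|[^\w\s]+")


def split_words(tweet):
    """Split a tweet into word tokens and symbol tokens."""
    return TOKEN_RE.findall(tweet)


def remove_duplicates(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    return list(dict.fromkeys(tokens))


def co_words(tokens):
    """Map each distinct token to the other distinct tokens of the tweet."""
    unique = remove_duplicates(tokens)
    return {word: [other for other in unique if other != word] for word in unique}


tweet = "haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =("
tokens = split_words(tweet)
print(tokens)
```

With this pattern, "homework...and" yields three tokens and "=(" stays intact, which matches the split shown on the mapper slide.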
19. Master/Mapper/Issues
• Parallel programming in Python
• The original interpreter (CPython) doesn't support true multi-threading…
• Alternatives, such as Jython and IronPython, do
• …but it is still possible to work in parallel
• Multi-thread vs. multi-process
• Multi-process in Python
• multiprocessing module
• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
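A minimal sketch of the multi-process approach with `multiprocessing.Pool`. The worker function and the sample chunks are assumptions for illustration; the real mapper also saves co-words.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Worker: count word frequencies for one chunk of tweets."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.lower().split())
    return counts


def map_chunks(chunks, processes=2):
    """Divide the chunks among worker processes and collect partial counts."""
    with Pool(processes=processes) as pool:
        return pool.map(count_words, chunks)


if __name__ == "__main__":
    sample = [["i feel sick", "i feel bored"], ["that makes me happy"]]
    for partial in map_chunks(sample):
        print(partial.most_common(3))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes re-import the main module.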
20. Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and saves them
frequency.tmp → Reducer → frequency.txt
frequency.tmp: car 4, house 2, ball 5, car 1, house 1
frequency.txt: car 5, house 3, ball 5
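The first reducer job can be sketched as a plain sum over (word, frequency) pairs, using the numbers from the slide. The function name is hypothetical and file handling is left out.

```python
from collections import defaultdict


def reduce_frequencies(pairs):
    """Sum the frequencies of repeated words across mapper outputs."""
    totals = defaultdict(int)
    for word, frequency in pairs:
        totals[word] += frequency
    return dict(totals)


pairs = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
print(reduce_frequencies(pairs))  # {'car': 5, 'house': 3, 'ball': 5}
```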
21. Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and saves them
• Second: sums the co-occurrence frequency
Co-words of "trip" → Worker → summed co-words of "trip"
Input: car 1, ball 3, car 2, house 1
Output: car 3, ball 3, house 1
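The second reducer job can be sketched as a merge of per-word co-word counts. The nested-dictionary layout is an assumption; the original stores one file per word.

```python
from collections import Counter, defaultdict


def reduce_co_frequencies(partials):
    """Merge mapper outputs of the form {word: {co_word: count}}."""
    merged = defaultdict(Counter)
    for partial in partials:
        for word, co_counts in partial.items():
            merged[word].update(co_counts)
    return merged


partials = [
    {"trip": {"car": 1, "ball": 3}},
    {"trip": {"car": 2, "house": 1}},
]
merged = reduce_co_frequencies(partials)
print(dict(merged["trip"]))
```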
22. Master/Reducer/Issues
• Index file
• Useful to access the files
• Each word has a file with a list of co-words
• But file name is hashed
• Non-invertible function
• Look-up on index, hash the word and get the file
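A sketch of the index look-up, assuming the index is simply the list of known words and that file names are SHA-256 hex digests. The slides only say "Hash", so the hash function and all names here are hypothetical.

```python
import hashlib


def hashed_file_name(word):
    """Deterministic, OS-safe file name for a word (SHA-256 is an assumption)."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest() + ".txt"


def build_index(words):
    """Keep the readable words, since the hash cannot be inverted."""
    return sorted(set(words))


def co_word_file(word, index):
    """Look the word up in the index, then hash it to get its file name."""
    if word not in index:
        raise KeyError(word)
    return hashed_file_name(word)


index = build_index(["=(", "sick", "sick"])
print(co_word_file("=(", index))
```

Hashing lets tokens such as "=(" or "?!" map to valid file names, while the index keeps the list of words that have a file.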
24. Classifier
[Diagram: Frequency and Co-frequency → Scores → New words; parameters: α, β, γ, δ, Max results]
25. Classifier/Sentiment words
Frequency (Top α%):
Car 232
Ball 143
Street 125
House 121
Boat 114
Pencil 105
Pen 98
Computer 81
26. Classifier/Co-words
Co-words (Top β%):
Car: engine, tire, door
Ball: court, game, play
Street: name, size
27. Classifier/Score
Co-word lists:
engine tire door
court game play
door size
size room type home
price size door
Scores:
engine 1 0
tire 1 0
door 2 1
size 1 2
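One reading of this slide that reproduces its numbers: each co-word gets one point in the first column for every positive word it co-occurs with, and one point in the second column for every negative word. Which lists belong to positive or negative words is an assumption here, as are the function and variable names.

```python
from collections import defaultdict


def score_co_words(positive_lists, negative_lists):
    """Count, per co-word, how many positive and negative lists contain it."""
    scores = defaultdict(lambda: [0, 0])
    for co_word_list in positive_lists:
        for word in co_word_list:
            scores[word][0] += 1
    for co_word_list in negative_lists:
        for word in co_word_list:
            scores[word][1] += 1
    return {word: tuple(pair) for word, pair in scores.items()}


# Assumed grouping of the co-word lists shown on the slide.
positive_lists = [["engine", "tire", "door"], ["door", "size"]]
negative_lists = [["size", "room", "type", "home"], ["price", "size", "door"]]
scores = score_co_words(positive_lists, negative_lists)
for word in ("engine", "tire", "door", "size"):
    print(word, *scores[word])
```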
28. Classifier/Collapse
• Created to deal with problems like:
• :) :)) :), :).
• They should all be treated as the same token
• Harder for words
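A hypothetical collapse rule for symbol tokens: strip trailing periods and commas, then squeeze repeated characters. The slides do not show the actual rule, so this is only a sketch; tokens containing letters or digits are left untouched.

```python
import re


def collapse(token):
    """Map emoticon variants such as ':))' or ':).' to one canonical token."""
    if any(ch.isalnum() for ch in token):
        return token  # words are harder, leave them as they are
    stripped = token.rstrip(".,") or token
    return re.sub(r"(.)\1+", r"\1", stripped)


variants = [":)", ":))", ":),", ":)."]
print({collapse(token) for token in variants})
```

Note that this rule would also merge "!!" and "!!!!!!", which may not be wanted if their counts carry different sentiment.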
29. Classifier/New words
• Rules to compare the scores
• So far the rules are
• If the positive score is greater than the negative score plus delta, tag the word as positive
• Same idea for negative
• Returns the new words up to a maximum value
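The rules above can be sketched as follows. Ranking candidates by score margin before applying the maximum is an assumption, and the scores for "luv" and "wtf" are made-up numbers for illustration.

```python
def tag_new_words(scores, delta, max_results):
    """Tag words whose scores differ by more than delta, up to max_results."""
    tagged = []
    for word, (positive, negative) in scores.items():
        if positive > negative + delta:
            tagged.append((word, "positive", positive - negative))
        elif negative > positive + delta:
            tagged.append((word, "negative", negative - positive))
    # Assumption: keep the words with the largest margin first.
    tagged.sort(key=lambda item: item[2], reverse=True)
    return [(word, label) for word, label, _ in tagged[:max_results]]


example_scores = {
    "engine": (1, 0),
    "door": (2, 1),
    "size": (1, 2),
    "luv": (9, 1),   # illustrative numbers, not from the slides
    "wtf": (0, 7),   # illustrative numbers, not from the slides
}
print(tag_new_words(example_scores, delta=1, max_results=2))
```

With delta set to 1, words whose scores differ by only one point stay untagged.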
31. Evaluation
• Two evaluation methods:
• First method
• Find tweets that could not be categorized before but can be now
• Manually check the precision of the result
• Second method
• Manually select positive and negative tweets
• Compare the precision of the old dictionary with that of the new dictionary
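The second method can be sketched as a precision comparison on hand-labeled tweets. The classifier below is a simple word-count vote and is an assumption, as are the tiny word lists and labels.

```python
def classify(tweet, positive_words, negative_words):
    """Return 'positive', 'negative', or None when the tweet cannot be categorized."""
    tokens = tweet.lower().split()
    positive = sum(token in positive_words for token in tokens)
    negative = sum(token in negative_words for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return None


def precision(labeled_tweets, positive_words, negative_words):
    """Return (precision, number of categorized tweets)."""
    decided = []
    for tweet, gold in labeled_tweets:
        predicted = classify(tweet, positive_words, negative_words)
        if predicted is not None:
            decided.append(predicted == gold)
    if not decided:
        return 0.0, 0
    return sum(decided) / len(decided), len(decided)


labeled = [
    ("that makes me happy", "positive"),
    ("my life makes me sad", "negative"),
    ("luv this album", "positive"),
    ("wtf iphone", "negative"),
]
old = precision(labeled, {"happy"}, {"sad"})
new = precision(labeled, {"happy", "luv"}, {"sad", "wtf"})
print(old, new)
```

The second value in each result also covers the first method, since it counts how many tweets can now be categorized.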
32. By-product
• LIWC Dictionary Library for Python
• Provides easy access to the dictionary information
• Easy search
• Reverse index
• Match wildcard
• Ex.: