Mining and analyzing social media hicss 45 tutorial – part 1
HICSS 45 Tutorial on Mining and Analyzing Social Media Part 1. David King. Jan 4, 2012

Published in: Technology, Business
Transcript

  • 1. Mining and Analyzing Social Media HICSS 45 Tutorial – Part 1 Dave King January 4, 2012
  • 2. Agenda: This is how the slides are organized
      • Part 1
          – Introduction – Bio, Resources, Social Media
          – Data Mining – Processes and Example
          – Text Mining – General Processes and Example
          – Predicting the Future – The Portmanteaus
      • Part 2
          – Sentiment Analysis
          – Social Network Analysis – Introduction
      Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
  • 3. Biography: Dave King
      • Currently, EVP of Product Development and Management at JDA Software
      • 30 years in enterprise package software business
      • 15 years as university professor
      • 14 years as Co-Chair of the Internet & Digital Economy Track (HICSS)
      • Long-time interest in various aspects of E-Commerce & Business Intelligence
      • Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.
  • 4. Personal Experiences with Analytics
      • Taught applied statistics, math modeling & mathematical sociology
      • In software R&D for 30 years
          – Optimization in the 80s
          – Natural Language Frontends
              • NLI Query & CMU Robotics Lab
          – EIS Competitive Analysis
              • Dow Jones and Reuters
              • Verity Topics
              • NewsAlert
          – InXight’s Hyperbolic Tree
          – Supply Chain Analytics
      • In the case of text analysis and its practical application, audiences have often been small, bewildered, and fleeting
  • 5. Mining and Analytics Resources
  • 6. Mining and Analytics Resources
  • 7. Mining and Analytics Resources
  • 8. Mining and Analytics Resources
  • 9. Mining and Analytics Resources: Web Sites, Online Books & Tutorials
      • DM/Blog -- abbottanalytics.blogspot.com
      • DM/Blog -- blog.data-miners.com
      • DM/Blog -- bx.businessweek.com/data-mining/blogs
      • DM/Blog -- bytemining.com
      • DM/Blog -- data-mining.alltop.com
      • DM/Blog -- dataminingblog.com
      • DM/Blog -- dataminingdownunder.com
      • DM/Blog -- datamining.typepad.com
      • DM/Blog -- datawrangling.com
      • DM/Blog -- timmanns.blogspot.com
      • DM/General -- kdnuggets.com
      • DM/General -- mydatamine.com
      • DM/General -- the-data-mine.com
      • DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm
      • DM/Tutorial -- autonlab.org/tutorials/
  • 10. Mining and Analytics Resources: Web Sites, Online Books & Tutorials
      • TA/General -- social.textanalyticsnews.com
      • TA/General -- textanalysis.info
      • TM/Blog -- blogs.sas.com/text-mining
      • TM/Blog -- lingpipe-blog.com
      • TM/Blog -- texttechnologies.com
      • TM & TA/Blog -- informationweek.com/authors/showAuthor.jhtml?authorID=1331
      • TA Tutorial -- slideshare.net/SethGrimes/text-analytics-overview-2011
      • TM & DM/Online Book -- statsoft.com/textbook/text-mining/
      • TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html
      • TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial
      • TM/Wiki -- textanalytics.wikidot.com
      • SNA/Blog -- iq.harvard.edu/blog/netgov/2011/10/
      • SNA/Blog -- thenetworkthinkers.com
      • SNA/Blog -- blog.echen.me/tag/social-network-analysis/
      • SNA/Blog -- lithosphere.lithium.com/t5/user/viewprofilepage/user-id/151
      • SNA/Tutorial -- cs.stanford.edu/people/jure/icml09networks/
  • 11. Mining and Analytics Resources: Web Sites, Online Books & Tutorials
      • DA/Blog -- dataists.com
      • DA/Blog -- drewconway.com
      • Visualization/Blog -- abeautifulwww.com/
      • Visualization/Blog -- benfry.com/writing/
      • Visualization/Blog -- blog.blprnt.com
      • Visualization/Blog -- chrisharrison.net/index.php/visualization.com
      • Visualization/Blog -- datavisualization.ch/
      • Visualization/Blog -- eagereyes.com
      • Visualization/Blog -- informationandvisualization.de/
      • Visualization/Blog -- infosthetics.com
      • Visualization/Blog -- junkcharts.typepad.com/junk_charts/
      • Visualization/Blog -- neoformix.com
      • Visualization/Blog -- perpetualedge.com/blog
      • Visualization/Blog -- processing.org
      • Visualization/Blog -- visualcomplexity.com
      • Visualization/Blog -- well-formed-data.net/
  • 12. Social Media Defined (Marta Kagan)
  • 13. Social Media Defined: …Sort of …
  • 14. Social Media Defined: Actually, it’s 33 Definitions
      1. Media for social interaction, using highly accessible and scalable communication techniques.
      2. Various user-driven (inbound marketing) channels (e.g., Facebook, Twitter, blogs, YouTube).
      3. Most transparent, engaging and interactive form of public relations.
      4. What we do and say together, worldwide, to communicate in all directions at any time, by any possible (digital) means.
      5. New marketing tool that allows you to get to know your customers and prospects in ways that were previously not possible.
      6. Platforms that enable the interactive web by engaging users to participate in, comment on and create content as a means of communicating.
      7. Consists of any online platform or channel for user generated content.
      8. Digital content and interaction that is created by and between people.
      9. Shift in how we get our information. Social media allows us to network, to find people with like interests, and to meet people who can become friends or customers.
      10. Platforms for interaction and relationships, not content and ads.
      11. Online platforms and locations that provide a way for people to participate in these conversations.
      12. People’s conversations and actions online that can be mined by advertisers for insights but not coerced to pass along marketing messages.
      13. Tools, services, and communication facilitating connection between peers with common interests.
      14. Online technologies and practices that people use to share content, opinions, insights, experiences, perspectives, and media themselves.
      15. Ever-growing and evolving collection of online tools and toys, platforms and applications that enable all of us to interact with and share information. Increasingly, it’s both the connective tissue and neural net of the Web.
      16. Reflection of conversations happening every day, whether at the supermarket, a bar, the train, the watercooler or the playground.
      17. Online text, pictures, videos and links, shared amongst people and organizations.
      18. Not one thing. It’s five distinct things:
      19. Digital, content-based communications based on the interactions enabled by a plethora of web technologies.
      20. Collection of online platforms and tools that people use to share content, profiles, opinions, insights, experiences, perspectives and media itself, facilitating conversations and interactions online between groups of people.
      21. Platform/tools.
      22. Act of connecting on social media platforms.
      23. How businesses join the conversation in an authentic and transparent way to build relationships.
      24. The notion that social media is about the technology that facilitates individuals and groups of people to connect and interact, create and share.
      25. Any of a number of individual web-based applications aggregating users who are able to conduct one-to-one and one-to-many two-way conversations.
      26. Media channel that relies on listening and conversation, as opposed to a monologue, to get your point across, make a connection and build a relationship.
      27. Social media is all about leveraging online tools that promote sharing and conversations, which ultimately lead to engagement with current and future customers and influencers in your target market.
      28. Social media: Evolution, Revolution and Contribution - by the ability of everybody to share and contribute as a publisher.
      29. Social media is communication channels or tools used to store, aggregate, share, discuss or deliver information within online communities.
      30. Social Media is simply another arrow to be shot in a company’s marketing quiver.
      31. Social media platforms make it easier to share information, usually online.
      32. Any object or tool that connects people in dialogue or interaction, whether in person, in print, or online.
      33. Wild, Wild West of Marketing, with brands, businesses, and organizations jostling with individuals to make news, friends, connections and build communities in the virtual space.
  • 15. Social Media Defined: If a Picture isn’t worth a 1000 words, then …
  • 16. Social Media Defined: Online technologies and practices for social interaction, enabling the sharing of opinions, insights, experiences, perspectives and media itself
  • 17. Social Media Defined: Categories
  • 18. Social Media Defined: Unanimous Agreement (Marta Kagan)
  • 19. Social Media is Huge: Users (Marta Kagan)
      • 750 Million: Facebook
      • 200 Million: Twitter
      • 100 Million: LinkedIn
  • 20. Social Media is Huge! (Marta Kagan) If Facebook were a country, it would be the 3rd largest in the world
  • 21. Social Media Data: Research Opportunity. “Every day, Twitter generates more social network data than the entire field of SNA possessed 10 years ago.”
  • 22. Social Media is Huge: Usage and Content
      Name (Symbol)    10^N   Name (Symbol)    Value
      kilobyte (kB)    3      kibibyte (KiB)   2^10 = 1.024 × 10^3
      megabyte (MB)    6      mebibyte (MiB)   2^20 ≈ 1.049 × 10^6
      gigabyte (GB)    9      gibibyte (GiB)   2^30 ≈ 1.074 × 10^9
      terabyte (TB)    12     tebibyte (TiB)   2^40 ≈ 1.100 × 10^12
      petabyte (PB)    15     pebibyte (PiB)   2^50 ≈ 1.126 × 10^15
      exabyte (EB)     18     exbibyte (EiB)   2^60 ≈ 1.153 × 10^18
      zettabyte (ZB)   21     zebibyte (ZiB)   2^70 ≈ 1.181 × 10^21
      yottabyte (YB)   24     yobibyte (YiB)   2^80 ≈ 1.209 × 10^24
  • 23. Social Media Data: Part of a Bigger Picture
  • 24. Social Media Data: Ways in which big data is creating value
      • Makes information transparent and usable at much higher frequency.
      • Provides more transactional data in digital form that can be used to improve performance across the board.
      • Allows ever-narrower segmentation of customers to tailor products or services.
      • Improves decision-making through sophisticated analytics.
      • Improves the development of the next generation of products and services.
  • 25. Data Mining: Defined. Discovering meaningful patterns from large data sets using pattern recognition technologies.
  • 26. Data Mining: CRISP-DM (Cross-Industry Standard Process for Data Mining)
      • Process phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
      • Data preparation flow: Real-World Data → Data Consolidation → Data Cleaning → Data Transformation → Data Reduction → Well-Formed Data
  • 27. Data Mining: General Data Assumptions: Structured, Transformed, Well-Formed
  • 28. Data Mining: Example: Affinity Analysis
  • 29. Data Mining: Example
      1. Market Basket Analysis: Items for Sale: Apples, Bananas, Cherries
      2. Possible Transactions: With one item or a collection of items selected as the Driver or Independent Variable
          No  X    Y        No  X    Y
          1   A    B        7   C    A
          2   A    C        8   C    B
          3   A    B C      9   C    A B
          4   B    A        10  A B  C
          5   B    C        11  A C  B
          6   B    A C      12  B C  A
      3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.
  • 30. Data Mining: Example
      Transaction ID   Items                    A  B  C
      1                Apple, Banana, Cherry    1  1  1
      2                Apple                    1  0  0
      3                Banana, Cherry           0  1  1
      4                Banana, Cherry           0  1  1
      5                Apple, Banana            1  1  0
      6                Apple, Banana            1  1  0
      7                Apple, Cherry            1  0  1
      8                Apple, Banana            1  1  0
      9                Apple, Banana, Cherry    1  1  1
      10               Apple, Banana            1  1  0
      Sum                                       8  8  5
      Standard Market Basket Measures:
      • Support: Rule’s coverage (% match antecedents): N(X & Y)/N(T). Example: N(A & B)/N(T) = 2/7 = 29%
      • Confidence: Rule’s predictive ability (% consequent | antecedent): N(X & Y)/N(X). Example: N(A & B)/N(A) = 2/4 = 50%
      • Lift: Predictive improvement (ratio of observed support for X & Y to the support expected if X & Y were independent): S(XuY)/(S(X)S(Y)). Example: (2/7)/((4/7)(5/7)) = .7 or 70%
  • 31. Data Mining: Example
      Rule selection usually based on minimum support & confidence
      Parameters: Min. Support 40%, Min. Confidence 75%
      No  X    Y    N(XuY)  N(T)  S(XuY)  N(X)  Conf  N(Y)  S(X)  S(Y)  Lift  Rule
      1   A    B    6       10    60%     8     75%   8     80%   80%   94%   Ok
      2   A    C    3       10    30%     8     38%   5     80%   50%   94%
      3   A    B C  2       10    20%     8     25%   4     80%   40%   78%
      4   B    A    6       10    60%     8     75%   8     80%   80%   117%  Ok
      5   B    C    4       10    40%     8     50%   5     80%   50%   125%
      6   B    A C  2       10    20%     8     25%   3     80%   30%   104%
      7   C    A    3       10    30%     5     60%   8     50%   80%   150%
      8   C    B    4       10    40%     5     80%   8     50%   80%   200%  Ok
      9   C    A B  2       10    20%     5     40%   6     50%   60%   133%
      10  A B  C    2       10    20%     6     33%   5     60%   50%   111%
      11  A C  B    2       10    20%     3     67%   8     30%   80%   278%
      12  B C  A    2       10    20%     4     50%   8     40%   80%   156%
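The three measures from slide 30 can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function names (`count_containing`, `rule_measures`, `select_rules`) and the `CANDIDATES` list are made up for the example, the transactions are the ten baskets from slide 30, and lift is computed directly from the slide 30 definition S(XuY)/(S(X)S(Y)).

```python
# Transactions from slide 30, abbreviated A = Apple, B = Banana, C = Cherry.
TRANSACTIONS = [
    {"A", "B", "C"},  # 1
    {"A"},            # 2
    {"B", "C"},       # 3
    {"B", "C"},       # 4
    {"A", "B"},       # 5
    {"A", "B"},       # 6
    {"A", "C"},       # 7
    {"A", "B"},       # 8
    {"A", "B", "C"},  # 9
    {"A", "B"},       # 10
]


def count_containing(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def rule_measures(x, y, transactions):
    """Return (support, confidence, lift) for the rule X -> Y."""
    n_t = len(transactions)
    n_xy = count_containing(x | y, transactions)
    n_x = count_containing(x, transactions)
    n_y = count_containing(y, transactions)
    support = n_xy / n_t                          # N(X & Y) / N(T)
    confidence = n_xy / n_x                       # N(X & Y) / N(X)
    lift = support / ((n_x / n_t) * (n_y / n_t))  # S(XuY) / (S(X) S(Y))
    return support, confidence, lift


def select_rules(candidates, transactions, min_support=0.40, min_confidence=0.75):
    """Keep the rules that meet both thresholds (defaults from slide 31)."""
    selected = []
    for x, y in candidates:
        support, confidence, _ = rule_measures(x, y, transactions)
        if support >= min_support and confidence >= min_confidence:
            selected.append((sorted(x), sorted(y)))
    return selected


# A few single-item candidate rules (hypothetical selection for illustration).
CANDIDATES = [
    ({"A"}, {"B"}),
    ({"B"}, {"A"}),
    ({"C"}, {"B"}),
    ({"A"}, {"C"}),
    ({"B"}, {"C"}),
]
```

For the rule A -> B this gives support 0.6, confidence 0.75 and lift of about 0.94, in line with row 1 of the table, and `select_rules(CANDIDATES, TRANSACTIONS)` keeps A -> B, B -> A and C -> B, the three rules marked "Ok".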
  • 32. Data Mining: Simple Example
      But, what if the baskets were described in the following manner:
          – Jane bought a handful of maraschinos and a couple of granny smiths.
          – Harold purchased a bag of appls and 2 bananas.
          – Bill paid for a pound of cherries but decided not to buy the three durians because of their odor.
      How could we automate the analysis?
  • 33. Social Media Data
  • 34. Social Media Data: Commonality?
  • 35. Text Mining: Defined. Using data mining to discover patterns in a collection of documents
  • 36. Text Mining: CRISP-Like Processes
      • Process phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment
      • Document preparation flow: Real-World Text Data → Document Consolidation → Establish the Corpus → Corpus Refinement (Token, Stem, Stop…) → Feature Selection & Weighting → Term-Doc-Matrix*
  • 37. Text Mining Process: Sample Corpora
      • Brown Corpus – first million-word corpus, compiled in the 60s at Brown U.; 500 samples across 15 genres, each ~2000 words with POS tags (Lancaster-Oslo-Bergen Corpus – British equivalent)
      • Linguistic Consortium Treebanks – collections of manually tagged and parsed (tree structures) sentences from a variety of sources (includes the well-known Penn Treebank collection)
      • Reuters 21578, RCV1 & V2, TRC2 -- collections (1000s) of Reuters English & multi-lingual news stories classified into topics and grouped into training & test sets
      • Pang & Lee’s Sentiment Analysis – 1000 positive and 1000 negative movie reviews
      • MEDLINE – An extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications
      • WordNet® -- large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations.
      • 20 Newsgroups -- collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each representing a different topic.
  • 38. Text Mining Process: Corpus Refinement
      Common representation of tokens within and between documents
      Tokenization → Normalize → Eliminate Stop Words → Stemming
      • Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
      • Normalize: Convert terms to lowercase.
      • Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).
      • Stemming: Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved → achiev) [note: word about synonyms – WordNet Synset]
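The four refinement steps can be sketched with the Python standard library alone. Everything here is illustrative: the stop list is a tiny hypothetical one, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, so it only reproduces simple cases like the achieve/achieves/achieved example.

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "a", "of", "to", "is", "in", "that"}


def tokenize(text):
    """Split raw text into word-like terms."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very frequent terms that carry little content."""
    return [t for t in tokens if t not in stop_words]


def crude_stem(token):
    """Toy suffix stripper; NOT the Porter algorithm."""
    for suffix in ("es", "ed", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [crude_stem(t) for t in tokens]
```

With these definitions, `refine("The team achieves and achieved")` yields `['team', 'achiev', 'achiev']`, so both word forms collapse onto the same stem.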
  • 39. Text Mining: Feature Extraction & Weighting
      Feature Extraction: “Bag of Words, Terms or Tokens” Vector Representation -> Word, Term, Token or Pairs-Triplets x Doc Matrix
              Token1  Token2  Token3  Token4  …
      Doc1    1       2       2       4
      Doc2    4       2       3       0
      Doc3    1       1       1       0
      Doc4    1       1       1       2
      …
      Words or Tokens are attributes and documents are examples
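A bag-of-words matrix of this shape can be built with `collections.Counter`. The sketch below assumes each document has already been reduced to a list of tokens; the function name `build_doc_term_matrix` and the sample documents are hypothetical.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one
    column per distinct token; cells hold raw term counts."""
    counts = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical tokenized documents for illustration.
DOCS = [
    ["feel", "better", "feel"],
    ["feel", "good"],
]
VOCABULARY, MATRIX = build_doc_term_matrix(DOCS)
```

Here `VOCABULARY` is `['better', 'feel', 'good']` and `MATRIX` is `[[1, 2, 0], [0, 1, 1]]`: tokens are the attributes (columns) and documents are the examples (rows).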
  • 40. Text Mining: Transforming Frequencies
      • Binary Frequencies: tf = 1 for tf > 0; otherwise 0
      • Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K
      • Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0
      • Normalized Frequencies: Divide each frequency by SQRT of Sum of Squares of the frequencies within the vector (column)
      • Term Frequency–Inverse Document Frequency: TF * IDF
          – Inverse Document Frequency: log(N/(1+D)) where N is total number of docs and D is number with term
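These transforms are short enough to write out directly. The sketch below is one possible reading of the slide: it assumes natural logarithms, applies the normalization to whichever vector is passed in, and uses the four-document count matrix from slide 39 as sample data. All function names are made up for the example.

```python
import math


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_frequency(tf):
    """1 + log(tf) for tf > 0, otherwise 0 (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def l2_normalize(vector):
    """Divide each frequency by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def idf(n_docs, docs_with_term):
    """Inverse document frequency in the slide's form: log(N / (1 + D))."""
    return math.log(n_docs / (1 + docs_with_term))


def tf_idf(matrix):
    """Weight a document-by-term count matrix by TF * IDF."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * idf(n_docs, doc_freq[j]) for j in range(n_terms)]
            for row in matrix]


# Count matrix from slide 39 (rows Doc1..Doc4, columns Token1..Token4).
COUNTS = [
    [1, 2, 2, 4],
    [4, 2, 3, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 2],
]
WEIGHTED = tf_idf(COUNTS)
```

One consequence of the `1 + D` term is that a token present in every document gets a negative IDF (here log(4/5) for Token1 to Token3), while Token4, present in two of four documents, gets log(4/3).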
  • 41. Text Mining: Simple Example. Listening Post is an art installation by Mark Hansen and Ben Rubin that culls text fragments in real time from thousands of unrestricted Internet chat rooms, bulletin boards and other public forums.
  • 42. Text Mining: Simple Example
  • 43. Text Mining: Simple Example
      • Blogs are scanned every 10 mins for sentences containing “I feel” or “I’m feeling”
      • Each sentence contains 1 of 5000 pre-determined feelings
      • 15-20K feelings per day
      • Fields: sentence, imageid, feeling, posttime, postdate, posturl, gender, born, country, state, city, lat, lon, conditions
  • 44. Text Mining: Simple Example
      Query:
          http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=Sentence&postdate=2010-11-25&limit=500
      API Result:
          <?xml version="1.0" ?>
          <feelings>
          <feeling imageid="-mZmybPrOGTZ+xukpcU7jg" feeling="better" sentence="i feel almost 100 better aside from that weird sandy feeling in my throat" posttime="1321633467" postdate=2010-11-25="0" posturl="http://jenngreenleaf.blogspot.com/2011/11/im-coming-down-with-cold-or-am-i.html" gender="0" country="united states" state="maine" city="richmond" lat="44.091522" lon="-69.801787" conditions="4" />
          …
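A query like this can be assembled, and a response of this shape parsed, with the standard library. The sketch below makes no network call, since the service may not be reachable; `build_query`, `parse_feelings` and `SAMPLE_RESPONSE` are illustrative names, and the sample XML reuses only the attribute values that are legible on the slide.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as shown on the slide.
BASE_URL = "http://api.wefeelfine.org:8080/ShowFeelings"


def build_query(**params):
    """Assemble a ShowFeelings URL from keyword parameters."""
    return BASE_URL + "?" + urlencode(params)


# Hand-written sample in the shape of the slide's result (not a live response).
SAMPLE_RESPONSE = """<feelings>
  <feeling feeling="better"
           sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
           posttime="1321633467" gender="0" country="united states"
           state="maine" city="richmond" lat="44.091522" lon="-69.801787"
           conditions="4" />
</feelings>"""


def parse_feelings(xml_text):
    """Return one attribute dictionary per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(element.attrib) for element in root.iter("feeling")]


URL = build_query(display="xml", returnfields="Sentence", limit=500)
RECORDS = parse_feelings(SAMPLE_RESPONSE)
```

Each record is a plain dictionary, so the sentence text for the later tokenization steps is simply `RECORDS[0]["sentence"]`.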
  • 45. Text Mining: Simple Example
      • im done believing you dont know what im feeling
      • i feel so out of place
      • im feeling healthy
      • i never feel down when im with her
      • i love the feeling
      • i feel like ive been run over by a truck
      • i feel so positive today
      • i feel like a poor mans pin up girl
  • 46. Text Mining: Simple Example
      • Input String (128925 chars; 24282 spaces)
          – "i have found to be helpful especially during those times when i am feeling discouraged\ni have a 50km commute and just the lack of the sense of freedom that driving brings just leaves me feeling scared\ni seem to be feeling better mostly…"
      • Tokenize (26465 tokens)
          – [i, , have, found, to, be, helpful, especially, during, those, times, when, i, am, feeling, discouraged, i, have, a, 50km, commute, and, just, the, lack, of, the, sense, of, freedom, that, driving, brings, just, leaves, me, feeling, scared, i, feel, noone, know, if, you, were, me, you, will, feel, the, same, way, …]
      • Set of Tokens (3045 distinct tokens)
          – ["", "believe", "d", "en", "encoding", "feedlinks", "forever", "gets", "http", "ismobile", "isprivate", "item", "languagedirection", "ll", "locale", "ltr", "m", "mefaked", "mobileclass", "mr", "no", "okay", "on", "pagetitle", "pagetype", "re", "s", "t", "toned", "url", "us", "utf", "ve", "yes", 0, 034, 039, 0aeverytime, 0d, 10, 100, 101,…]
  • 47. Text Mining: Simple Example
      Corpus                     Word Length   Sentence Length   Lexical Diversity
      We Feel Fine               4             17                8
      Gutenberg Corpus
      Austen-persuasion.txt      4             23                16
      Bible-kjv.txt              4             33                79
      Blake-poems.txt            4             18                5
      Carroll-alice.txt          4             16                12
      Melville-moby.txt          4             24                15
      Milton-paradise.txt        4             52                15
      Shakespeare-caesar.txt     4             12                8
      Shakespeare-hamlet.txt     4             13                7
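Statistics of this kind can be approximated as follows. The definitions are assumptions, since the slide does not spell them out: word length is average characters per word, sentence length is average words per sentence, and lexical diversity is total tokens divided by distinct tokens. The last reading is at least consistent with slide 46, where 26465 tokens over 3045 distinct tokens gives about 8.7, shown as 8 in the table if values are truncated.

```python
import re


def corpus_statistics(text):
    """Average word length, average sentence length and lexical diversity.

    Sentences are split on ., !, ? and newlines; words are lowercase
    runs of letters, digits and apostrophes (both are simplifications).
    """
    sentences = [s for s in re.split(r"[.!?\n]+", text) if s.strip()]
    words = re.findall(r"[a-z0-9']+", text.lower())
    vocabulary = set(words)
    return {
        "word_length": sum(len(w) for w in words) / len(words),
        "sentence_length": len(words) / len(sentences),
        "lexical_diversity": len(words) / len(vocabulary),
    }


# Two hypothetical one-line "posts" for illustration.
STATS = corpus_statistics("i feel fine\ni feel good")
```

For the two sample posts there are six words, four of them distinct, so the lexical diversity is 1.5: on average each distinct word is used one and a half times.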
  • 48. Text Mining: Simple Example
      • Eliminate Stopwords (175 words - a, about, above, after, …)
          – Set of tokens (12827) with stopwords eliminated: [ab, abit, able, abs, absolute, absolutely, absorb, abuse, accomplished, accomplishment, achieve, achieved, across, acted, action, activities, activity, actually, acura, add, …]
          – Content (11896 or 45% of tokens not stopwords – 4053 with tokens starting with apostrophes and #s eliminated)
      • Stemming
          – Stemmed tokens (11896): [abdomen, abdul, abil, abl, abrupt, absolut, abstract, academ, accept, accid, accomplish, accur, accus, accustom, achi, achiev, acknowledg, across, action, activ, …]
          – Set of tokens in stemmed content (2283): [abdomen, abdul, abil, abl, abrupt, absolut, abstract, academ, accept, accid, accomplish, accur, accus, accustom, achi, achiev, …]
  • 49. Text Mining: Simple Example
  • 50. Text Mining: Simple Example: Document-Term Matrix (We Feel)
      Columns: Sum | like know time go think better way get good love | … | hear didn place almost comfort everyon sinc babi actual
      Sum              416 94 90 89 83 80 80 76 76 75 … 16 16 16 16 16 16 16 16 16
      comment1      3  0 0 1 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment2      0  0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment3      2  0 1 0 0 0 0 1 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment4      1  0 0 0 0 0 1 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment5      1  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment6      2  0 1 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment7      7  1 1 0 0 0 0 0 1 0 0 … 0 0 0 0 0 0 0 0 0
      comment8      7  2 1 0 1 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment9      2  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment10     6  0 0 2 0 0 0 0 1 0 0 … 0 0 0 0 0 0 0 0 0
      …
      comment1490   2  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1491   2  0 0 0 0 0 0 1 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1492   6  0 1 0 0 0 0 1 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1493   3  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1494   4  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1495   4  2 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1496   1  0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1497   2  0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1498   2  1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      comment1499   3  1 0 0 1 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
  • 51. Text Mining: Simple Example: Madness, Murmerings, Montage, Mobs, Metrics, Mounds
  • 52. Prediction: Collective, macroscopic trends which can be scientifically inferred by harnessing publicly accessible data from the Internet.
  • 53. Prediction: Characteristics: Public, Practical, Big
  • 54. Prediction: Sources. Easily accessible digital traces:
      • What we surf
      • Whom we “friend”
      • What we say
      • Where we go
      • What we buy
      • How we play
  • 55. Prediction: Sample Studies
  • 56. Prediction: Sample Studies: Infodemiology, Nowcasting, Culturomics
  • 57. Prediction: Infodemiology. Information + Epidemiology: Science of the distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy. Coined by Gunther Eysenbach, Univ. of Toronto.
  • 58. Prediction: Infodemiology: A Major Application - Practical
  • 59. Prediction: Infodemiology: A Major Application - Practical. Regional, Weekly Syndromic Surveillance
  • 60. Prediction: Infodemiology: An Alternative Approach. Text Mining of Worldwide Newswires, Web Sites and Various Offline Reports
  • 61. Prediction: Infodemiology: Utilizing Aggregate Search Data. Monitoring and analyzing queries from Internet search engines or people’s status updates on microblogs for syndromic surveillance to predict disease outbreaks
  • 62. Prediction: Infodemiology: Utilizing Aggregate Search Data
  • 63. Prediction: Infodemiology: Utilizing Aggregate Search Data
  • 64. Prediction: Infodemiology: Utilizing Aggregate Search Data
      Standard Linear Prediction Model:
          Dependent Variable at Time t = b0 + b1 × (Dependent Variable at Time t - n) + b2 × (Traditional, Publicly Available Explanatory Variable) + b3 × (Aggregate Search Index or Social Media Freq. Count) + e
      The dependent variable at both times is a standard publicly available measure.
  • 65. Prediction: Infodemiology - Utilizing Aggregate Search Data. “Detecting Influenza Epidemics Using Search Engine Query Data” (Ginsberg et al.), 2/19/09
      • Aggregating historical logs of search queries from 2003-2008, computing weekly time series
      • logit(P) = b0 + b1 * logit(Q) + e
          – P – percentage of ILI (influenza-like illness) physician visits
          – Q – query fraction for the 45 highest-scoring influenza queries
      • r is between .80 and .96 for 9 regions
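The model logit(P) = b0 + b1 * logit(Q) + e is a simple linear regression in log-odds space. The sketch below fits it by least squares and maps a new query fraction back to a predicted ILI visit percentage. It is an illustration only, not the authors' code: the function names and the input lists `q_fracs` and `p_fracs` (both as fractions strictly between 0 and 1) are assumptions.

```python
import math


def logit(p):
    """Log-odds of a fraction p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: maps a log-odds value back to a fraction."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_line(q_fracs, p_fracs):
    """Least squares fit of logit(P) = b0 + b1 * logit(Q); returns (b0, b1)."""
    xs = [logit(q) for q in q_fracs]
    ys = [logit(p) for p in p_fracs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict_ili(q_frac, b0, b1):
    """Predicted ILI visit fraction for a given query fraction."""
    return inv_logit(b0 + b1 * logit(q_frac))
```

Working in logit space keeps every prediction inside (0, 1), which a plain linear fit on the raw fractions would not guarantee.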
  • 66. Prediction: Infodemiology - Utilizing Aggregate Search Data. http://www.google.org/flutrends/about/how.html
  • 67. Prediction: Infodemiology - Utilizing Aggregate Search Data
  • 68. Prediction: Infodemiology - A Similar Application. http://www.google.org/denguetrends/
  • 69. Prediction: Infodemiology - Utilizing Tweets?
  • 70. Prediction: Infodemiology - Utilizing Tweets
  • 71. Prediction: Infodemiology - Utilizing Tweets. “Nowcasting Events from the Social Web with Statistical Learning,” Lampos and Cristianini, ACM TIST, 9/11
      • Text analysis of 50M tweets for 3 regions of the UK from 6/09-4/10 (303 days)
      • HPA weekly reports of GP consultations with ILI diagnosis correlated with number of “hybrid grams”
      • Average “r” of .911
  • 72. Prediction: Infodemiology - A Major Application – Text Analysis
      • Corpus: 50M Tweets, 3 Region UK, 6/09-4/10
      • Corpus Refinement: Lower Case, Stop Words, Tokens, Stems
      • Feature Selection: 1-Grams, 2-Grams, Hybrid Grams
      • N-Gram Freqs
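The text-analysis steps on this slide (lower-casing, stop-word removal, tokenizing, stemming, then counting 1-grams and 2-grams) can be sketched roughly as follows. The stop-word list and the suffix-stripping stemmer are deliberately crude stand-ins chosen for the example, not the resources the study used, and all names here are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "i", "have", "and", "of", "to", "is", "in", "my"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Lower-case, tokenize, drop stop words, and stem one tweet."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def ngram_counts(tweets, max_n=2):
    """Count 1-grams up to max_n-grams of stems over a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        stems = preprocess(tweet)
        for n in range(1, max_n + 1):
            for i in range(len(stems) - n + 1):
                counts[" ".join(stems[i:i + n])] += 1
    return counts
```

The resulting counts, aggregated per day or week, are the kind of candidate features that a selection step would then prune.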
  • 73. Prediction: Infodemiology - Utilizing Tweets. Discarded when n<50. BoLasso - Bootstrap LASSO (least absolute shrinkage and selection operator)
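A bootstrap LASSO can be sketched as: resample the data with replacement, fit a LASSO on each resample, and keep only the features that receive a non-zero weight in a large enough share of the fits. The code below is a small pure-Python illustration under assumptions, not the study's implementation: it omits the intercept, uses plain coordinate descent, and the names and parameters (`alpha`, `n_bootstraps`, `consensus`, `seed`) are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 without an intercept."""
    n, k = len(rows), len(rows[0])
    weights = [0.0] * k
    residual = list(targets)  # equals y - Xw while w is all zeros
    col_sq = [sum(r[j] ** 2 for r in rows) / n for j in range(k)]
    for _ in range(n_iter):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes it.
            rho = sum(r[j] * (res + r[j] * weights[j])
                      for r, res in zip(rows, residual)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for i, r in enumerate(rows):
                    residual[i] -= r[j] * delta
                weights[j] = new_w
    return weights


def bolasso_support(rows, targets, alpha, n_bootstraps=20, consensus=1.0, seed=0):
    """Indices of features selected in at least consensus * n_bootstraps fits."""
    rng = random.Random(seed)
    n, k = len(rows), len(rows[0])
    hits = [0] * k
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]
        w = lasso_coordinate_descent([rows[i] for i in idx],
                                     [targets[i] for i in idx], alpha)
        for j in range(k):
            if w[j] != 0.0:
                hits[j] += 1
    return [j for j in range(k) if hits[j] >= consensus * n_bootstraps]
```

With `consensus=1.0` a feature must survive every resample; lowering it gives a softer selection rule. Requiring agreement across resamples is what makes the selected set more stable than a single LASSO fit.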
  • 74. Prediction: Infodemiology - Utilizing Tweets
  • 75. Prediction: Infodemiology - Utilizing Tweets
  • 76. Prediction: Nowcasting. Now + Forecasting: Predicting the present by analyzing large volumes of data that can be used to "forecast" current events for which official analysis has not been released
  • 77. Prediction: Nowcasting - Weather Envy: Within the next 6 hours …
  • 78. Prediction: Sample Studies with Search
      Columns: Authors | Date (Mnth-Year) | Dependent Variables | Explanatory Variables | Model | Results
      • Song, Pan, Ng | Apr-10 | Weekly Hotel Bookings in Charleston, SC | Indexed Search Volumes from Google Trends/Insights, Jan 2008-Aug 2009 | Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism | Tested various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.
      • Kholodilin, Podstawski, Siliverstovs | Apr-10 | Year-on-Year Growth Rate of Monthly US Real Private Consumption, ALFRED db of Fed Rsrv of St. Louis | 220 Google Trend/Insights Search terms related to Priv Consumption reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009 | Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms) | Query term principal components outperform standard Sentiment and Financial Indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.
      • Choi, Varian | Apr-09 | US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (Visitor arrival in Hong Kong) | Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel | Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel. | Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases small gains, in others substantial.
      • McLaren, Shanbhogue | Q2-11 | Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011 | Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing | For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth with query term, Home Builders and Royal Instit. of Chartered Surveyors price growth balances as exp vars. | For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
  • 79. Prediction: Sample Studies with Social Media
      Columns: Authors | Date (Mnth-Year) | Dependent Variables | Explanatory Variables | Model | Results
      • Asur, Huberman | Mar-10 | Box-office revenues for (24) movies | Promotion tweets-retweets for a particular movie, tweet rates for particular movie per hour, ratio of positive to negative sentiments for the movie | Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange prices, and 2nd weekend revenues by tweet rates and the sentiment ratio. | Promotional tweets are weakly correlated with 1st weekend revs. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues and sentiments improve the forecasts slightly.
      • Gruhl, Guha, Kumar, Novak, Tomkins | Aug-05 | Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks | Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day) | Cross correlation of time series for sales rank and mentions. | While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts quite well a future spike in sales rank.
      • Sadikov, Parameswaran, Venetis | Aug-09 | Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5) | Basic features that count movie references in blogs, count movie references taking into account ranking and indegree of the blogs where they appear, consider only references made within a time window before or after a movie release date, features that consider positive sentiment; and combinations of these. References based on spinn3r.com blog data set 11/07-11/08 | Linear regression for weekly rankings and sales data by blog references and sentiment. | Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales but weak with sentiment. Strongest relationships with timing of references in weeks after release.
  • 80. Prediction: Any Guesses?
  • 81. Prediction: Idiom, a Sculpture of 10s of 1000s of Books
  • 82. Prediction: It comes in many Shapes but not Sizes. Omphalos; Book Cell; Matej Krén; Gravity Mixer
  • 83. Prediction: Culturomics. Culture + Genomics: Application of high-throughput data collection and analysis to the study of human culture.
  • 84. Prediction: Culturomics. “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science, 12/16/10.
  • 85. Prediction: Culturomics 2.0. http://www.youtube.com/watch?v=61qn7S9NCOs
  • 86. Prediction: Culturomics 2.0. “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space,” Kalev Leetaru, 9/11
      • The tone of real-time consciousness reflected in the media can be used to forecast broad social behavior.
      • Combined three massive news archives totaling more than 100 million articles worldwide to explore the global consciousness of the news media.
      • Employs a large shared-memory supercomputer (University of Tennessee SGI Altix supercomputer Nautilus with 1024 processors and 4 TB of memory)
      • Using the tone and location of the reports, (claims to have) predicted the outcome of the Arab Spring and the location of Bin Laden within a radius of 125 miles
  • 87. Prediction: Culturomics 2.0 - Based on Carbon Capture Report
  • 88. Prediction: Culturomics 2.0 - Based on Carbon Capture Report
  • 89. Prediction: Culturomics 2.0 - Features of Stories or Tweets
      • Tone/Positivity/Negativity. Ratio of + to - tone (-100 to 100)
      • Polarity. Emotional charge (0 to 100)
      • Activity. Intensity of "active language" (0 to 100)
      • Personalization. Degree to which the writer attempts to bring the reader into the fold (0 to 100)
      • Questions/Exclamations. Tweet tone indicators of non-word items
      • Geocoding. Location of story content
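A few of these features can be approximated with a word-list approach. The sketch below is only a guess at the general idea: the tiny lexicons, the exact formulas (tone as the positive minus negative share of words, polarity as the share of all charged words, both scaled by 100), and every name are assumptions, not the definitions used by the Carbon Capture Report or the Culturomics 2.0 study.

```python
import re

# Hypothetical toy lexicons; real systems use large tone dictionaries.
POSITIVE = {"good", "great", "hope", "peace"}
NEGATIVE = {"bad", "fear", "riot", "crisis"}


def tone_features(text):
    """Return assumed tone (-100..100), polarity (0..100) and punctuation counts."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if total == 0:
        tone = polarity = 0.0
    else:
        tone = 100.0 * (pos - neg) / total      # net positive share of words
        polarity = 100.0 * (pos + neg) / total  # share of charged words
    return {
        "tone": tone,
        "polarity": polarity,
        "questions": text.count("?"),
        "exclamations": text.count("!"),
    }
```

Under these assumed formulas a text can be neutral in tone yet high in polarity when positive and negative words balance out, which is why the two scores are reported separately.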
  • 90. Prediction: Culturomics 2.0 - Features of Stories or Tweets
      • 100M Articles from the: New York Times (1945-05), Sum. of Wrld Brdcasts (1979-10), Google News articles (2006-11)
      • Sentiment Mining, Geocoding, Entity Extraction
      • Nautilus Supercomputer
      • Feature Scores; 2.4 Petabyte Network with over 10M entities
  • 91. Prediction: Culturomics 2.0 - Predicting Unrest
  • 92. Prediction: Culturomics 2.0 - NY Times View of Tone. http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-1000x1000.gif
  • 93. Prediction: Culturomics 2.0 - SWB View of Tone. http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-1000x1000.gif
