The document provides an overview of a tutorial on text data mining and analytics. It discusses the growing interest in analytics and defines text mining and analysis. It also outlines some common text mining processes such as establishing a corpus, preprocessing texts through tokenization and stemming, feature extraction and weighting, and some example application areas.
3. Difference between a Symposium & a Tutorial at HICSS Symposium Audience M:M Tutorial 1:M
4. Difference between a Symposium & a Tutorial at HICSS $W_v(t+1) = W_v(t) + \Theta(v,t)\,\alpha(t)\,(D(t) - W_v(t))$
5. Agenda Part 1: Growing Interest in Analytics Overview of Text Mining and Analysis General Text Mining and Analysis Processes Part 2: Classification and Categorization Clustering Information Extraction Overview of Tools & Packages
6. This is the only note you’ll need to take Presentation can be found at: www.slideshare.net
7. Biography: Dave King Currently, EVP of Product Development and Management at JDA Software 28 years in enterprise package software business 15 years as university professor 12 years as Co-Chair of the Internet & Digital Economy Track (HICSS) Long time interest in various aspects of E-Commerce & Business Intelligence Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.
8. Personal Experiences with Analytics Taught applied statistics and math modeling In software R&D Optimization in the 80s Natural Language Frontends NLI Query & CMU Robotics Lab EIS Competitive Analysis Dow Jones and Reuters Verity Topics NewsAlert InXight’s Hyperbolic Tree Often the audience has been small, sometimes bewildered, and often fleeting
9. If I have seen further it is only by … plagiarizing the works of others.
15. Interest in Analytics: Growing Awareness Source: Google Trends Analytics – “Extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact-based management to drive decisions and actions…a subset of what has come to be called BI.” (Davenport and Harris, Competing on Analytics, HBS, 2007)
16. Interest in Analytics: Theory and Practice Data Mining Optimization In theory, there is no difference between theory and practice. But, in practice, there is.
18. Interest in Analytics: Potential Reasons for the Interest Next generation DSS: Progression of DSS->EIS->BI->PM->Analytics Increasing volumes of data requiring new approaches or modifications in existing approaches Focus on CRM and Supply Chains … General belief that more sophisticated analysis is required to compete in today’s environments …
19. Interest in Text Mining & Analytics: An old adage George Mallory, asked in a 1923 interview, “Why did you want to climb Mount Everest?” His reply: “Because it’s there.”
20. Interest in Text Mining & Analytics: The 80% Rule Unstructured (Textual) 80% Structured (Databases) 20% “It's a truism that 80 percent of business-relevant information originates in unstructured form, primarily text… The 80 percent unstructured figure comes from, well, everywhere.” Source: Seth Grimes, Unstructured Data and the 80 Percent Rule
21. Text Mining and Analytics: Definitions General: All types of text processing that deal with finding, organizing and analyzing textual (unstructured) information. Formal: Utilizing data mining techniques to create new information that is not obvious in a collection of documents (implies that Text Analytics ~ Text Mining ~ Text Data Mining)
22. Text Mining and Analytics: Types of Processing and Techniques Clustering. Grouping similar documents without having a predefined set of categories. Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes. Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching. Named-entity recognition. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons). Concept linking and topic tracking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods. Summarization. Summarizing a document to save time on the part of the reader.
23. Text Mining and Analytics: Sample Application Areas Seth Grimes Papers
24. Text Mining: A Common Issue “A great dowry is a bed full of brambles.” (George Herbert, Welsh Poet & Priest, Outlandish Proverbs, 1640) Structured data mining is a bed of roses when compared to unstructured, textual mining, which is a bed of brambles
25. Data Mining: Simple Example (Affinity Analysis) Study of attributes or characteristics that “go together.” Seek to uncover “association rules” that quantify the relationship between two or more attributes. Rules take the form of “If antecedent, then consequent” Examples: Market basket analysis to determine which items are purchased together (in single transaction) Web analysis to determine which sequences of pages users visit Major issue is number of potential combinations as the number of attributes increases
26. Data Mining: Simple Example (Affinity Analysis) 1. Market Basket Analysis: Items for Sale: Apples Bananas Cherries Durians 2. Possible Transactions: With one item or a collection of items selected as the Driver or Independent Variable 3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.
27. Data Mining: Simple Example (Affinity Analysis) Standard Market Basket Measures: Support = N(X & Y) / N(T). Example: N(A & B) / N(T) = 2/7 = 29%. Confidence = N(X & Y) / N(X). Example: N(A & B) / N(A) = 2/4 = 50%. Here N(T) is the total number of transactions, N(X) is the number of transactions containing X, and N(X & Y) is the number of transactions containing both X and Y.
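The two measures above can be sketched in a few lines of Python. The transaction list below is hypothetical (the transcript does not include the slide's actual baskets); it is only chosen so that A appears in 4 of 7 transactions and A and B appear together in 2, matching the slide's 2/7 and 2/4 figures.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """N(X & Y) / N(X) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_x = sum(antecedent <= t for t in transactions)
    n_xy = sum((antecedent | consequent) <= t for t in transactions)
    return n_xy / n_x if n_x else 0.0


# Hypothetical baskets: A=Apples, B=Bananas, C=Cherries, D=Durians.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"A"},
    {"B", "D"}, {"C"}, {"C", "D"},
]

print(f"support(A & B)   = {support(transactions, 'AB'):.0%}")
print(f"confidence(A->B) = {confidence(transactions, 'A', 'B'):.0%}")
```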
29. Data Mining: General Data Assumptions Requires structured data (numbers and categories well-defined) Transformed by data preparation or collected with a prior design in mind Typically housed and organized in a relational database, data mart or data warehouse
30. Data Mining: Simple Example But, what if the baskets were described in the following manner: Jane bought a handful of maraschinos and a couple of granny smiths. Harold purchased a bag of appls and 2 bananas. Bill paid for a pound of cherries but decided not to buy the three durians because of their odor. How could we automate the analysis?
31. Data Mining: CRISP-DM (Cross-Industry Standard Process for Data Mining) [Diagram labels: Real-World Data; Data Consolidation; Data Cleaning; Data Transformation; Data Reduction; Well-Formed Data. Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
32. Text Mining: CRISP-Like Processes [Diagram labels: Real-World Text Data; Document Consolidation; Establish the Corpus; Corpus Refinement (Token, Stem, Stop…); Feature Selection & Weighting; Documents; Term-Doc-Matrix*. Phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment] * - Entity-Relationships
33. Text Mining Process: Establish the Corpus First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied. Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls … Convert into an organized set of texts, called a corpus, standardized and prepared for the purpose of knowledge discovery.
34. Text Mining Process: Establish the Corpus Brown Corpus: first million-word corpus, compiled in the 60s at Brown U.; 500 samples across 15 genres, each ~2000 words, with POS tags. Linguistic Data Consortium Treebanks: collections of manually tagged and parsed (tree-structured) sentences from a variety of sources (includes the well-known Penn Treebank collection). Reuters 21578, RCV1 & V2: collections (1000s) of Reuters English & multilingual news stories classified into topics and grouped into training & test sets. Pang & Lee’s Sentiment Analysis: 1000 positive and 1000 negative movie reviews. MEDLINE: an extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications. WordNet®: large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations. Google Ngram: 500 billion words from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Russian, and Chinese.
36. Text Mining Process: Establishing the Corpus (Penn Treebank) Raw: .START Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Tagged: [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./. Parsed: ( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) (VP will (VP join (NP the board) (PP-CLR as (NP a nonexecutive director)) (NP-TMP Nov. 29))) .))
37. Text Mining Process:Establishing the Corpus (Reuters) 14826 ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products. But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain. The U.S. Has said it will impose 300 mlndlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost. Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes.
40. Text Mining Process: Establish the Corpus (Google NGrams) 8,500 new words a year; 70% growth from 1950-2000; 50%+ of the English lexicon is "dark matter." We’re forgetting our past faster with each passing year (tracking the references to the numerical years). Innovations spread faster than ever. Modern celebrities are younger and more famous than their predecessors, but their fame is shorter-lived. Culturomics is a powerful tool for automatically identifying censorship and propaganda (e.g. Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold). "Freud" is more deeply engrained in our collective subconscious than "Galileo," "Darwin," or "Einstein." Source: “Quantitative Analysis of Culture Using Millions of Digitized Books”, Science Magazine, Dec. 18, 2010
41. Text Mining Process: Corpus Refinement Goal: a common representation of tokens within and between documents. Steps: Tokenization, Normalize, Eliminate Stop Words, Stemming. Tokenization: parse the text to generate terms; sophisticated analyzers can also extract phrases from the text. Normalize: convert the terms to lowercase. Eliminate stop words: eliminate terms that appear very often (e.g. the, and, …). Stemming: convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved -> achiev) [note: word about synonyms – WordNet Synset]
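The four refinement steps can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the tutorial's own code: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a toy suffix stripper (not the Porter stemmer or any library stemmer) that happens to reproduce the slide's achieve/achieves/achieved example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Split text into word-like terms."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop very frequent function words."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Toy stemmer: strip one common suffix, then a trailing 'e'."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token


def refine(text):
    """Tokenize -> normalize -> remove stop words -> stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [naive_stem(t) for t in tokens]


print(refine("The team achieves and achieved the goals"))
```

In practice a tested stemmer or lemmatizer would replace `naive_stem`; the point here is only the order of the steps.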
42. Text Mining: Feature Extraction & Weighting Feature Extraction “Bag of Words, Terms or Tokens” Vector Representation: Word, Term or Token/Doc Matrix Words or Tokens are attributes and documents are examples
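A bag-of-words term/document matrix can be sketched as follows. The whitespace tokenization and the two example documents are assumptions for illustration; rows are terms and columns are documents, so transposing the matrix gives the "documents as examples, words as attributes" view.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term, one column per doc."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# Hypothetical documents, loosely echoing the market-basket example.
docs = ["apples and bananas", "bananas and cherries and bananas"]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:10s} {row}")
```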
43. Text Mining: Transforming Frequencies Binary frequencies: 1 if tf > 0; otherwise 0. Term frequencies: tf(i,j) divided by the sum of tf(i,j) over all terms i in document j. Log frequencies: 1 + log(tf) if tf > 0; otherwise 0. Normalized frequencies: divide each frequency by the square root of the sum of squares of the frequencies within the vector (column). Term Frequency–Inverse Document Frequency: TF * IDF. Inverse document frequency: log(N/(1+D)), where N is the total number of docs and D is the number of docs containing the term.
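The weightings listed on this slide can be written out directly. This is a minimal sketch assuming a term-by-document matrix stored as a list of rows (one row per term, one column per document); the function names are illustrative, and the natural logarithm is assumed since the slide does not state a base.

```python
import math


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_freq(tf):
    """1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def relative_tf(doc_vector):
    """Each count divided by the total count in the document."""
    total = sum(doc_vector)
    return [tf / total if total else 0.0 for tf in doc_vector]


def l2_normalize(doc_vector):
    """Each count divided by the square root of the sum of squares."""
    norm = math.sqrt(sum(tf * tf for tf in doc_vector))
    return [tf / norm if norm else 0.0 for tf in doc_vector]


def idf(n_docs, doc_freq):
    """Smoothed inverse document frequency as on the slide: log(N / (1 + D))."""
    return math.log(n_docs / (1 + doc_freq))


def tf_idf(matrix):
    """Multiply every row (term) of a term-by-document matrix by its IDF."""
    n_docs = len(matrix[0]) if matrix else 0
    weighted = []
    for row in matrix:
        doc_freq = sum(1 for tf in row if tf > 0)
        weight = idf(n_docs, doc_freq)
        weighted.append([tf * weight for tf in row])
    return weighted
```

One consequence of the 1 + D smoothing is that a term occurring in every document gets a slightly negative IDF, since log(N/(1+N)) < 0.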
44. Text Mining Processes: Simple Overview Example Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day). Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way. The URL format of many blog posts can be used to extract the username of the post's author, which in turn is used to extract the age, gender, country, state, and city of the blog's owner. Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
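The sentence-scanning step described above can be sketched as follows. The feelings set is a small hypothetical stand-in for the roughly 5,000 pre-identified feelings, and the sentence splitter is a simple punctuation-based assumption rather than the harvesting system's actual logic.

```python
import re

# Hypothetical stand-in for the ~5,000 pre-identified feelings.
FEELINGS = {"happy", "sad", "depressed", "young", "asleep"}

# Matches the two trigger phrases named on the slide.
TRIGGER = re.compile(r"\bi (?:feel|am feeling)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def harvest(text):
    """Return (feeling, sentence) pairs for sentences with a known feeling."""
    results = []
    for sentence in split_sentences(text):
        if not TRIGGER.search(sentence):
            continue
        words = re.findall(r"[a-z']+", sentence.lower())
        found = next((w for w in words if w in FEELINGS), None)
        if found:
            results.append((found, sentence))
    return results


post = "I went home. I feel happy today! I am feeling a bit sad. I feel nothing."
print(harvest(post))
```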
45. Text Mining Processes: Simple Overview Example API Query from wefeelfine.org: http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=imageid,feeling,sentence,posttime,postdate,posturl,gender,born,country,state,city,lat,lon,conditions&limit=500 Result from Query: <?xml version="1.0" ?> <feelings> <feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" /> Source: www.wefeelfine.org/api.html
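A response of this shape can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses a local copy of the slide's sample record rather than calling the API (whose current availability is not something this transcript confirms). The closing `</feelings>` tag is added here as an assumption, since the slide shows only the start of the response, and the XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the slide; the closing tag is assumed.
SAMPLE = (
    '<feelings>'
    '<feeling feeling="super" '
    'sentence="i\'ve been feeling super depressed missing my ex" '
    'posttime="1292298985" postdate="2010-12-13" '
    'posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" '
    'gender="0" country="united states" state="south carolina" />'
    '</feelings>'
)


def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("feeling")]


records = parse_feelings(SAMPLE)
for rec in records:
    # Fields not returned for a post (e.g. city) are simply absent.
    print(rec["feeling"], "|", rec["sentence"], "|", rec.get("city", "n/a"))
```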
46. Text Mining Processes: Simple Overview Example i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one / i went to mcd with an idiot which is having the same feeling as me now / i feel asleep / i feel about little red shoes and mittens / i feel the sands of time moving so quickly in my life it seems / i feel too young to have her this beauty across from me / i feel like im waiting for something profound or inspirational to hit me …
47. Text Mining Processes:Simple Overview Example Input String (43743 chars; 8245 spaces) "i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better onei went to mcd with an idiot which is having the same feeling as me nowi'll feel bad bout it and soi feel asleep…” Tokenize (9019 tokens) ['i', "'m", 'blinded', 'to', 'other', 'santas', 'because', 'this', 'was', 'my', 'first', 'but', 'i', 'ca', "n't", 'help', 'feeling', 'that', 'there', 'ca', "n't", 'be', 'a', 'better', 'one', 'i', 'went', 'to', 'mcd', 'with', 'an', 'idiot', 'which', 'is', 'having', 'the', 'same', 'feeling', 'as', 'me', 'now', 'i', "'ll", 'feel', 'bad', 'bout', 'it', 'and', 'so', 'i', 'feel', 'asleep', …] Set of Tokens (1816 distinct tokens) ["'", "'bout", "'cleaner", "'d", "'http", "'i", "'ll", "'m", "'re", "'s", "'ve", '000', '039', '097', '1', '100', '101', '102', '104', '105', '108', '111', '114', '115', '116', '118', '11am', '12', '121', '15', '16', '180', '1998', '1st', '2', '2013', '23', '2nd', '3', '30', '78', '9', ':', 'a', 'ab', 'abit', 'able', 'about', 'above', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', …]
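The token list above splits contractions into pieces such as 'ca' + "n't" and 'i' + "'m", which resembles Penn-Treebank-style tokenization; the transcript does not say which tokenizer was used. The regex-based function below is a rough approximation written for illustration, not a reimplementation of any library tokenizer.

```python
import re


def tokenize(text):
    """Approximate Treebank-style tokenization of contractions."""
    # can't -> ca n't, don't -> do n't
    text = re.sub(r"(?i)n't\b", " n't", text)
    # i'm -> i 'm, i'll -> i 'll, it's -> it 's, ...
    text = re.sub(r"(?i)'(m|ll|re|ve|s|d)\b", r" '\1", text)
    # Clitics first, then plain words, then any single punctuation mark.
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)


print(tokenize("i'm blinded to other santas"))
print(tokenize("i can't help feeling"))
```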
54. Text Mining Process: Overview Example 2 Twitter Statistics: ~106M registered users. New users: 300K per day. 180 million unique visitors per month. 75% of traffic from 3rd-party apps. Average of 55 million tweets a day. 600 million search queries per day. 37% use their phone to tweet. 60% of tweets from 3rd-party apps. Based on 1+B tweets generated by over 20 million Twitter users in 2010 (bio, web site, loc info). Source: huffingtonpost.com/2010/04/14/twitter-user-statistics-r_n_537992.html
55. Text Mining Process: Overview Example 2 Each tweet <= 140 characters (avg. 10-15 words/message). Heavy presence of non-alpha symbols, abbrevs, misspellings and slang. Tweets often include retweets (original tweet repeated). In spite of this, tweets have proven to be an interesting text mining resource (e.g. see lifeanalytics.blogspot.com & mashable.com/author/dan-zarrella/)
56. Text Mining Process: Overview Example 2 Twitter gets a total of 3 billion requests a day via its API. API Calls for Public Tweets: http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1 http://api.twitter.com/1/trends/current.json?exclude=hashtags
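To see what the search URL on this slide asks for, its query string can be decoded with the standard library. The snippet only parses the URL text and makes no network call; these endpoints date from 2010 and should not be assumed to still respond. It also assumes the space inside the query in the transcript is a line-wrap artifact.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1"

# parse_qs decodes percent-escapes and turns '+' into spaces.
params = parse_qs(urlsplit(url).query)
print(params)
```

The decoded `q` value shows the search combines the `:)` emoticon with the words "feel" and "feeling"; `rpp` and `page` look like paging parameters (results per page and page number).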
57. Text Mining Process: Overview Example 2 u'iso_language_code': u'en', u'to_user_id_str': None, u'text': u"RT @EverSoSassy56 <--- I'm sportin' my glasses... I feel all sophisticated and stuff. :-) -- And the operative word is feeling...LOL", u'from_user_id_str': u'168852471', u'profile_image_url': u'http://a0.twimg.com/profile_images/1166685224/Jonise_normal.jpg', u'id': 16300313380130816L, u'source': u'<a href="http://twidroid.com" rel="nofollow">twidroid</a>', u'id_str': u'16300313380130816', u'from_user': u'XXXXXXXXXX', u'from_user_id': 168852471, u'to_user_id': None, u'geo': None, u'created_at': u'Sun, 19 Dec 2010 01:14:32 +0000', u'metadata': {u'result_type': u'recent'}
58. Text Mining Process:Establish the Corpus (2nd Example) Happy Face Sad Face Tokens = 14670 Set of Tokens= 2289 avg./Sent = 24 lex. div. = 6.4 Non-Stop words = 10406 Set Non-Stop = 2117 Stems = 5003 Set of Stems = 1052 w/o Feel = 3921 Set w/o Feel = 1051
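The "lex. div." figure on this slide is consistent with tokens divided by distinct tokens (14670 / 2289 ≈ 6.4); that reading is an inference from the numbers, not something the transcript states. A minimal sketch of these corpus statistics:

```python
def corpus_stats(tokens):
    """Token count, distinct-token count, and tokens per distinct token."""
    distinct = set(tokens)
    diversity = len(tokens) / len(distinct) if distinct else 0.0
    return {
        "tokens": len(tokens),
        "distinct": len(distinct),
        "lexical_diversity": diversity,
    }


# Tiny hypothetical token list for illustration.
stats = corpus_stats("i feel happy i feel sad".split())
print(stats)

# Check against the slide's own totals.
print(round(14670 / 2289, 1))
```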
59. Text Mining Process: Overview Example 2 “Twitter Sentiment Classification using Distant Supervision”. Uses the presence of the emoticons “:)” and “:(” as surrogate labels for positive and negative sentiment statements. Construction of the term-document matrix relies on a list of positive and negative key words from Twittratr, counting the number of key words that appear in each tweet. 180K tweets collected for training purposes between April and June 2009. 80%+ accuracy in classification.
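The emoticon-as-label idea can be sketched as below. Two choices here are assumptions made for the illustration and are not stated on the slide: tweets containing both emoticons are treated as ambiguous and skipped, and the emoticon is stripped from the text so that a downstream classifier cannot simply learn the label from it.

```python
POSITIVE = ":)"
NEGATIVE = ":("


def label_tweet(text):
    """Return (label, cleaned_text), or None if the tweet cannot be labeled."""
    has_pos = POSITIVE in text
    has_neg = NEGATIVE in text
    if has_pos == has_neg:
        # Neither emoticon, or both: no usable surrogate label.
        return None
    label = "positive" if has_pos else "negative"
    cleaned = text.replace(POSITIVE, "").replace(NEGATIVE, "")
    return label, " ".join(cleaned.split())


# Hypothetical tweets for illustration.
for tweet in ["I feel great today :)", "missing my ex :(", "just a tweet"]:
    print(tweet, "->", label_tweet(tweet))
```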
60. Text Mining Processes: Overview Example 2 What is this? An area cartogram is a map in which some thematic mapping variable, such as travel time or GNP, is substituted for land area. The geometry or space of the map is distorted in order to convey the information of this alternate variable.
61. Text Mining Process: Overview Example 2 Pulse of the Nation: U.S. Mood throughout the Day Inferred from Twitter. Analyzed 300M public tweets produced in the US from 9/2006-8/2009 and containing words from a psychological word-rating system (“Affective Norms for English Words”). Through a natural language processing algorithm called Sentiment Analysis, each tweet was assigned a mood score based on the number of positive or negative words it contained. Calculated the average mood score of all the users living in a state hour by hour, which formed the basis of a series of time-varying mood maps.
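The scoring and aggregation described above can be sketched as follows. The word lists are tiny hypothetical stand-ins for the rated vocabulary, and scoring a tweet as positive-word count minus negative-word count is an assumption; the slide says only that the score is based on the number of positive or negative words.

```python
from collections import defaultdict

# Hypothetical stand-ins for a rated word list.
POSITIVE_WORDS = {"happy", "love", "great"}
NEGATIVE_WORDS = {"sad", "hate", "awful"}


def mood_score(text):
    """Positive-word count minus negative-word count (assumed scoring rule)."""
    words = text.lower().split()
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


def average_mood(tweets):
    """Average mood score per (state, hour) from (state, hour, text) tuples."""
    totals = defaultdict(lambda: [0, 0])  # key -> [score sum, tweet count]
    for state, hour, text in tweets:
        bucket = totals[(state, hour)]
        bucket[0] += mood_score(text)
        bucket[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Hypothetical tweets for illustration.
tweets = [
    ("NY", 9, "love this great day"),
    ("NY", 9, "sad and awful"),
    ("CA", 9, "happy"),
]
print(average_mood(tweets))
```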
5.2 million digitized books - about 4% of all books ever printed - published during the past 200 years. All told, about 129 million books have been published since the invention of the printing press. In 2004, Google software engineers began making electronic copies of them, and have about 15 million so far, comprising more than two trillion words in 400 languages. They currently include Chinese, English, French, German, Russian and Spanish books dating back to the year 1500, about 4% of all books published. The database doesn't include periodicals, which might reflect popular culture from a different vantage. The resulting corpus contains over 500 billion words, in English (361 billion), French (45B), Spanish (45B), German (37B), Chinese (13B), Russian (35B), and Hebrew (2B). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 60 million words per year; by 1900, 1.4 billion; and by 2000, 8 billion.