Modelling Text
Luhn’s analysis of “Messengers of the Nervous System,” a Scientific American article
http://wordle.net, applied to the NY Times article
“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, “The Automatic Creation of Literature Abstracts,” IBM Journal, 1958.
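The idea in Luhn's quote can be sketched in a few lines of Python. This is a simplified illustration, not Luhn's actual procedure: it scores each sentence by the average frequency of its non-stop words, whereas the 1958 paper works with clusters of significant words. The stop-word list, the naive sentence splitter, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "it", "that", "for", "on", "with", "as", "by"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lower-case the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def auto_abstract(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Word significance: frequency of each non-stop word in the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s)
                   if w not in STOP_WORDS)

    def score(sentence):
        # Sentence significance: summed word significance per word.
        words = tokenize(sentence)
        if not words:
            return 0.0
        return sum(freq[w] for w in words if w not in STOP_WORDS) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

sample = ("Nerve cells send signals. "
          "Signals travel along nerve fibres between nerve cells. "
          "The weather was pleasant.")
abstract = auto_abstract(sample, n_sentences=1)
```

On the made-up sample text, the sentence about the weather scores lowest because none of its words recur elsewhere.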
Luhn’s Example
New York Times, September 8, 1957
Can Software Make the Connection?
Mark Lombardi, George W. Bush, Harken Energy and Jackson Stephens, c. 1979-90, Detail
There and Back Again: Modelling Text, 2
The text content of a document can be considered an unordered “bag of words.”
Particular documents are points in a high-dimensional vector space.
Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
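A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and lower-casing (both simplifications; the function name is hypothetical). Two texts with the same words in a different order produce the same bag, which is what "unordered" means here.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count terms; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("I hate hate databases")
b = bag_of_words("databases hate I hate")
same_bag = (a == b)  # True: only the term counts matter, not the order
```

Each distinct term becomes one dimension of the vector space, and the count is the document's coordinate along that dimension.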
Modelling Text, 3
We might construct a document-term matrix...
 • D1 = “I like databases”
 • D2 = “I hate hate databases”

        I    like    hate    databases
  D1    1    1       0       1
  D2    1    0       2       1

http://en.wikipedia.org/wiki/Term-document_matrix
and use a weighting such as TF-IDF (term frequency–inverse document frequency)…
in computing the cosine of the angle between weighted doc-vectors to determine similarity.
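The matrix, weighting, and cosine steps above can be sketched as follows. This uses only the standard library; the IDF formula log(N / df) is one common variant among several and is an assumption here, as are the variable names.

```python
import math
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}
vocab = ["I", "like", "hate", "databases"]  # column order used on the slide

# Document-term matrix: one row of raw term counts per document.
matrix = {}
for name, text in docs.items():
    counts = Counter(text.split())
    matrix[name] = [counts[term] for term in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# TF-IDF with idf = log(N / df); terms found in every document get weight 0.
n_docs = len(docs)
doc_freq = [sum(1 for row in matrix.values() if row[j] > 0)
            for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf = {name: [tf * w for tf, w in zip(row, idf)]
         for name, row in matrix.items()}

raw_similarity = cosine(matrix["D1"], matrix["D2"])      # about 0.471
weighted_similarity = cosine(tfidf["D1"], tfidf["D2"])   # 0.0
```

With raw counts the two documents look fairly similar because they share “I” and “databases”. Under this IDF variant those shared terms carry no weight in a two-document collection, so the weighted vectors are orthogonal: D1 is left with only “like” and D2 with only “hate”.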
Modelling Text, 4
In the form of query-document similarity, this is Information Retrieval 101.
 • See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988.
 • A useful basic tech paper: Russ Albright, SAS, “Taming Text with the SVD,” 2004.
Given the complexity of human language, statistical models may fall short.
“Reading from text in general is a hard problem, because it involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum
From Text to Data: Features
Analytical methods make text tractable.
Latent semantic indexing uses singular value decomposition for term reduction / feature selection.
Classification technologies / methods:
 • Naive Bayes.
 • Support Vector Machine.
 • K-nearest neighbor.
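As an illustration of the first classifier listed, here is a minimal multinomial Naive Bayes sketch over bag-of-words features with add-one (Laplace) smoothing. The training sentences, labels, and function names are made up for this example, and a real system would more likely use an established library than code like this.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (text, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # term counts per class
    vocab = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs non-zero.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, for illustration only.
training = [
    ("I like databases", "pos"),
    ("I love fast search", "pos"),
    ("I hate hate databases", "neg"),
    ("I hate slow search", "neg"),
]
model = train_naive_bayes(training)
label_a = classify("like fast databases", model)  # "pos"
label_b = classify("hate slow", model)            # "neg"
```

The features are the same term counts that fill a document-term matrix; the classifier simply asks which class makes the observed terms most probable.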
“Reading from Text is a Hard Problem”
Eugène Delacroix, St. Michael Defeats the Devil
Thus the Orb he roam’d
With narrow search; and with inspection deep
Consider’d every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost
Data, Search, Analysis, and Discovery
Eugène Delacroix, St. Michael Defeats the Devil
[Diagram labels: Data Space; For features; Analysis; Intent, Goals]
Thus the Orb he roam’d
With narrow search; and with inspection deep
Consider’d every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost
The User Interface
“Search is the UI for data today.”
-- Grant Ingersoll, Chief Scientist, LucidWorks
Quoted by Gil Press in Forbes, “LucidWorks: Bringing Search to Big Data”
http://www.forbes.com/sites/gilpress/2012/09/24/lucidworks-bringing-search-to-big-data/
What’s beyond?
Search and Sensemaking
“It is convenient to divide the entire information access process into two main components: information retrieval through searching and browsing, and analysis and synthesis of results. This broader process is often referred to in the literature as sensemaking. Sensemaking refers to an iterative process of formulating a conceptual representation from a large volume of information. Search plays only one part in this process.”
-- Marti Hearst, 2009
http://searchuserinterfaces.com/
Toward Semantic Search Sensemaking

               Old Search                        Sensemaking
  Search on:   keywords                          + identity, history & context
  Sources:     content/type silos                Unified
  Indexed:     terms                             + metadata (properties)
  Returned:    hit lists                         Categories / clusters / answers first
  Relevance:   PageRank                          (Inferred) intent
  Prevalence:  plenty of new platforms           Plenty of established with new(ish)
               with old(ish) search              search capabilities, also wannabes.
The Back End
Platforms and ecosystems.
APIs and services.
Text and content analytics --
 • Discerns and extracts features, including relationships, from source materials.
 • Features = entities, key-value pairs, concepts, topics, events, sentiment, etc.
 • Provide (for) BI on content-sourced data.
Data integration, record linkage, data fusion.
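Record linkage, the last item above, means deciding whether two differently written records refer to the same entity. A minimal sketch using the standard library's difflib follows; the normalization rule (lower-case, strip punctuation, sort tokens), the 0.8 threshold, and the name variants are all assumptions chosen for illustration, not a recommended production approach.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def similarity(a, b):
    """String similarity in [0, 1] between the normalized forms of a and b."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_entity(a, b, threshold=0.8):
    """Hypothetical linkage rule: treat the records as one entity above the threshold."""
    return similarity(a, b) >= threshold

# Made-up variants of names, as they might appear in two different sources.
link_1 = same_entity("George W. Bush", "Bush, George W.")    # True
link_2 = same_entity("Harken Energy", "Harken Energy Corp")  # True
link_3 = same_entity("Jackson Stephens", "Harken Energy")    # False
```

Real linkage pipelines usually combine several fields (name, address, dates) and calibrate thresholds on labelled pairs; a single string ratio is only the simplest building block.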
Text+ Technology Mashups
Text/content analytics generates semantics to bridge search, BI, and applications, enabling next-generation information systems.
[Diagram: three overlapping circles (Search, BI, Applications) with these labels]
 • Semantic search (search + text)
 • Information access (search + text + BI)
 • Search based applications (search + text + apps)
 • Integrated analytics (text + BI)
 • Text analytics (inner circle)
 • NextGen CRM, EFM, MR, marketing, …
Analytical Assets (Open Source)
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
http://nltk.org/
tm: Text Mining Package
A framework for text mining applications within R.
A Big Data Analytics Architecture
http://hpccsystems.com/ (GNU Affero GPL)
http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html
Drivers and Trends
Social media! … and personal-social-enterprise integration.
Via-API cloud services.
Big Data (even if you don’t like the term).
 • Volume and velocity mean new analytical approaches.
 • Variety: new types and a new fusion imperative.
Sentiment: Mood, opinions, emotions, intent.
Question answering.
Text Tech Initiatives
Now and near future.
 • Broader & deeper international language support.
 • Sentiment analysis, beyond polarity. Emotions, intent signals, etc.
 • Identity resolution & profile extraction. Online-social-enterprise data integration.
 • Semantic data integration, Complex Data.
 • Speech analytics.
 • Discourse analysis. Because isolated messages are not conversations.
 • Rich-media content analytics.
 • Augmented reality; new human-computer interfaces.