Big Data & Text Mining

14,087 views

Big Data & Text Mining: Finding Nuggets in Mountains of Textual Data

A large amount of information is available in textual form in databases or online sources, and for many enterprise functions (marketing, maintenance, finance, etc.) it represents a huge opportunity to improve business knowledge. For example, text mining is starting to be used in marketing, more specifically in analytical customer relationship management, to achieve the holy grail of a 360° view of the customer (integrating elements from inbound mail, web comments, surveys, internal notes, etc.).

Facing this new domain, I did some personal research and put together a synthesis, which helped me clarify some ideas. The presentation below is not intended to be exhaustive on the subject, but it may bring you some useful insights.

Published in: Business, Technology, Education

Big Data & Text Mining

  1. Text mining. michel.bruley@teradata.com. Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data … www.decideo.fr/bruley
  2. Information context: A large amount of information is available in textual form in databases and online sources. In this context, manual analysis and effective extraction of useful information are not possible, so it is relevant to provide automatic tools for analyzing large textual collections.
  3. Text mining definition: The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc. The results can be important both for the analysis of the collection and for providing intelligent navigation and browsing methods.
  4. Text mining pipeline: Unstructured text (implicit knowledge) → Information retrieval → Information extraction → Knowledge discovery → Structured content (explicit knowledge). Other diagram labels: Semantic search / Data mining; Semantic metadata.
  5. Text mining process (iterative and interactive): Text preprocessing (syntactic/semantic text analysis); Features generation (bag of words); Features selection (simple counting, statistics); Text/data mining (classification: supervised learning; clustering: unsupervised learning); Analyzing results (mapping/visualization, result interpretation).
  6. Text mining actors: Publishers (enriched content; annotation tools; tools for authors; new applications based on annotation layers; richer cross-linking based on content…). Analysts (empowers them: annotating research output; hypothesis generation; summarisation of findings; focused semantic search…). Libraries (linking between institutional repositories; access to richer metadata; aggregation; aids to subject analysis/classification…).
  7. Challenges in text mining: The data collected is “free text” and is not well organized (semi-structured or unstructured). There is no uniform access over all sources; each source has separate storage and algebra (examples: email, databases, applications, web). There is a quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information. Learning techniques for processing text typically need annotated training data. XML as the common model allows: manipulating data with standards; mining that becomes more like data mining; RDF is emerging as a complementary model. The more structure you can explore, the better you can do mining.
  8. Data source administration: Sources: Intranet (file system, databases, EDMS) and Internet (web crawling, on-line databanks). Processing: information provider, format filter, XML normalisation (subject, author, text corpora, keywords).
  9. Text mining tasks: name extraction; term extraction; feature extraction; categorization; text analysis tools; abbreviation extraction; relationship extraction; summarization; clustering (hierarchical clustering, binary relational clustering); TM; text search engine; web searching tools; NetQuestion Solution; web crawler.
  10. Information extraction: keyword ranking; link analysis; query log analysis; metadata extraction; intelligent match; duplicate elimination. The goal is to extract domain-specific information from natural language text. This needs a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”), either constructed by hand or automatically learned from hand-annotated training data, and a semantic lexicon (a dictionary of words with semantic category labels), typically constructed by hand.
  11. Document collections treatment: Categorization; Clustering.
  12. Text Mining example: Obama vs. McCain.
  13. Aster Data position for Text Analysis: Data acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …). Pre-processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …). Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …). Analytic applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...). Diagram labels: Aster Data Fit; Third-Party Tools Fit. Aster Data value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries.
  14. Aster Data Value for Text Analytics: • Ability to store and process massive volumes of text data: massively parallel data stores and massively parallel analytics engine; SQL-MapReduce framework enables in-database processing for specialized text analytics tools. • Tools and extensibility for processing diverse text data: SQL-MapReduce framework enables loading and transforming diverse sources and types of text data; pre-built functions for text processing. • Flexible platform for building and processing diverse analytics: SQL-MapReduce framework enables creation of flexible, reusable analytics; embedded MapReduce processing engine for high-performance analytics.
  15. Aster Data Capabilities for Text Data: Pre-built SQL-MapReduce functions for text processing. Data transformation utilities: Pack (compress multi-column data into a single column); Unpack (extract nested data for further analysis). Web log analysis: Sessionization (identify unique browsing sessions in clickstream data). Text analysis: Text parser (general tool for tokenizing, stemming, and counting text data); nGram (split text into component parts: words and phrases); Levenshtein distance (compute “distance” between words). Architecture diagram labels: Custom and Packaged Analytics; Aster Data nCluster; Apps; Aster Data Analytic Foundation; SQL-MapReduce; SQL; Data.
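
The text-analysis functions named on slide 15 (tokenizing and counting, nGram, Levenshtein distance) can be illustrated with a minimal sketch in plain Python. This is not Aster Data's SQL-MapReduce implementation: the function names `tokenize`, `ngrams`, and `levenshtein` are hypothetical, stemming is left out, and the tokenizer assumes simple lowercase alphanumeric words.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a, b):
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    tokens = tokenize("Text mining finds nuggets in text data")
    print(Counter(tokens).most_common(1))    # bag-of-words counts
    print(ngrams(tokens, 2)[:2])             # first two bigrams
    print(levenshtein("kitten", "sitting"))  # distance between two words
```

The word counts produced by `Counter` are also the simplest form of the bag-of-words features and "simple counting" selection listed on slide 5.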

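
The pattern-based information extraction described on slide 10 can be sketched as follows, assuming the patterns are written by hand as regular expressions. The two patterns mirror the slide's examples (“traveled to <x>” and “presidents of <x>”); the labels, the tiny `LEXICON`, and the `extract` function are hypothetical and only illustrate the idea of a pattern dictionary combined with a semantic lexicon.

```python
import re

# Hypothetical slot filler: one or more capitalized words, e.g. "New York".
NAME = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

# Hand-built dictionary of extraction patterns (labels are illustrative).
PATTERNS = {
    "destination": re.compile(r"\btraveled to " + NAME),
    "presided_entity": re.compile(r"\bpresidents? of " + NAME),
}

# Hand-built semantic lexicon: words mapped to semantic category labels.
LEXICON = {
    "New York": "city",
    "France": "country",
}


def extract(text):
    """Return (pattern label, matched value, semantic category) triples."""
    results = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(1)
            results.append((label, value, LEXICON.get(value, "unknown")))
    return results


if __name__ == "__main__":
    sentence = "Obama traveled to New York. He met the president of France."
    for triple in extract(sentence):
        print(triple)
```

Hand-written patterns like these are easy to read but brittle; the slide's alternative is to learn them automatically from hand-annotated training data.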