Lexalytics Text Analytics Workshop: Perfect Text Analytics
Presentation by Seth Redmore, VP Product Management at the Text Analytics Summit 2010

Presentation Transcript

  • Perfect Text Analytics
    Seth Redmore
    VP, Product Management
  • Perfect
    per·fect
        [adj., n. pur-fikt; v. per-fekt]
    1. conforming absolutely to the description or definition of an ideal type: a perfect sphere; a perfect gentleman.
    2. excellent or complete beyond practical or theoretical improvement: There is no perfect legal code. The proportions of this temple are almost perfect.
    All rights reserved © 2010 Lexalytics Inc.
  • Text Analytics
    The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)
    In other words, enhancing the value of text content by extracting entities, features, context, relationships and emotion.
  • Perfect is Fast
    Average Human Reading Speed: 250wpm
    Conservative computer reading speed: 6000 wpm/core (our speed on a moderate single core)
    Each core is equivalent to the reading bandwidth of 12 people.
    Modern machines have 8 cores.
    That’s just about 100 people in a box.
    Nice.
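The arithmetic on this slide can be checked with a short sketch. The function names are illustrative, and the inputs are the slide's own figures. Note that dividing the stated speeds (6000 wpm by 250 wpm) gives 24 readers per core, not the 12 quoted, while the "about 100 people" figure matches 12 readers per core times 8 cores.

```python
# Figures taken from the slide; the function names are illustrative only.
HUMAN_WPM = 250               # average human reading speed
MACHINE_WPM_PER_CORE = 6000   # conservative per-core machine reading speed
CORES = 8                     # cores in a "modern machine" (2010)


def readers_per_core(machine_wpm: float, human_wpm: float) -> float:
    """How many human readers one core replaces, by raw word throughput."""
    return machine_wpm / human_wpm


def readers_per_machine(per_core: float, cores: int) -> float:
    """Scale the per-core equivalence to a whole machine."""
    return per_core * cores


# Ratio implied by the two stated speeds.
print(readers_per_core(MACHINE_WPM_PER_CORE, HUMAN_WPM))  # 24.0

# The slide's own per-core figure (12) reproduces "about 100 people in a box".
print(readers_per_machine(12, CORES))  # 96
```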
  • Perfect is Useable
    “I don’t like the results” is not the same as “the results are incorrect”
    Understanding the behavior is key to usefulness
    Can you make better decisions?
    Can you make more money or save money?
    What is the most controversial area of text analytics?
    Thomson Reuters: trading with sentiment analysis increased alpha (profit over market) by 80 basis points
  • Useable: How much can you differ?
    “In my shop, that up until now has relied exclusively on human coding, we consider anything below 90% to be unacceptably inaccurate…. There is no doubt that automated sentiment is getting much much better, but to suggest that people should be okay with 20% of their data being wrong is just absurd.” (Katie Delahaye Paine)
    Why is 10% “wrong” so much less absurd than 20% “wrong”?
    20% Error
    10% Error
  • Perfect is Consistent
    Same results for same content, every time
    University of Pittsburgh “Multi-Perspective Question Answering” Corpus: 535 documents, 11k+ sentences.
    40 hours of training for each rater
    ~80% inter-rater agreement
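To make "inter-rater agreement" concrete, here is a minimal sketch of two common ways to measure it: raw percent agreement and Cohen's kappa, which corrects for chance agreement. The slide does not say which measure produced the ~80% figure, so both are shown. The labels below are made-up toy data, not drawn from the MPQA corpus.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters chose the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): two raters scoring ten sentences.
rater_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
rater_b = ["pos", "neg", "neu", "pos", "neu", "pos", "neu", "neg", "neg", "neu"]

print(percent_agreement(rater_a, rater_b))  # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Even with 80% raw agreement, kappa comes out lower because some of that agreement would occur by chance, which is one reason a single "accuracy" number can be misleading when comparing humans and machines.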
  • Perfect is (new) Knowledge
    Discover the stuff you don’t know
    Text Analytics is really, really great at telling you the who, the what, and the where. Sometimes the “how”
    You have to supply the “why” – but that question is way easier to answer when you know the other “w’s and the h”
  • Perfect Includes Everything
    Running our top-of-the-line software flat out for one year will cost you about $0.002 per document analyzed (news-article-sized content, assuming 3 docs/core-second on an 8-core machine)
    The more data you analyze, the better, and the more your text analytics is worth
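The throughput assumption behind the per-document figure can be worked out as follows. This is a sketch under the slide's stated assumptions (3 docs/core-second, 8 cores, running flat out for a 365-day year). The `annual_cost` value in the usage line is a placeholder for illustration, not a Lexalytics price.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years


def documents_per_year(docs_per_core_second: float, cores: int) -> float:
    """Documents processed in a year of continuous, fully loaded operation."""
    return docs_per_core_second * cores * SECONDS_PER_YEAR


def cost_per_document(annual_cost: float, docs_per_core_second: float, cores: int) -> float:
    """Spread a yearly cost over the yearly document volume."""
    return annual_cost / documents_per_year(docs_per_core_second, cores)


# Slide assumptions: 3 docs/core-second on an 8-core machine.
print(documents_per_year(3, 8))  # 756864000

# Placeholder annual cost (hypothetical), only to show how the function is used.
print(cost_per_document(annual_cost=100_000.0, docs_per_core_second=3, cores=8))
```

The per-document cost falls as utilization rises, which is the point of the slide: the same fixed cost spread over more documents makes analyzing everything cheap.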
  • Perfect is Trainable
    Can you solve YOUR business problem with it?
    Can you optimize to suit different kinds of content and roll those results up into a single reporting system?
  • Perfect Text Analytics
    Fast
    Useable
    Consistent
    Knowledge
    (that is)
    Inclusive
    Trainable
  • Customer Snapshots
    (or, “rubber, meet road”)
  • Reputation Management
  • Politics
  • Market Intelligence
    [Architecture diagram. Labels: Client Employee, User Authentication, Single Sign-on, External Content Providers, SinglePoint, Client Company, Web 2.0 Collaboration, Search Results, Secondary Research Suppliers, MI Analyst, Text Analytics, Integrated Index, News & Journals, NL Search Engine, Firewall, Internal Document Repository, Optional Document Repository, Financial analyst reports, Internal research, Content Processing, Custom Web Crawls & Gov. Databases, Trash can crawl, FTP or CD]
  • Hospitality
  • Financial Services
    Turns news into numbers for automatic trading systems
    • Company stocks + commodities
    • Resilient server product
    • Ultimate customers are financial institutions
    • QED (Quantitative and Event-Driven Trading): banks, hedge funds
    • JPMorgan, SocGen, Alpha Equities… and others
    [Diagram labels: Financial data, RNSE Server, Indicators, Algorithmic Trading (QED firm), Buy/Sell]
  • ROI – Retrieving Organized Information
    RTI CONSULTING SERVICES
    REPEATABLE
    EVOLVING
    DESIGNS
    BALANCED METHODOLOGY
    Business Assessment
    User Interviews
    Taxonomy Design and Recommendation
    Content Governance / Analysis
    DEPLOYMENT / SUPPORT
    Solution Alternatives
    Integration & Deployment
    Testing, Tuning, and Evaluation
    THOUGHT LEADERSHIP
    Strategy Consultation
    Roadmaps – Evolution and Growth
    PROF. TED SULLIVAN
  • Pharma
  • The Next Year…
  • Opinion Mining
    Who said what about whom?
  • Sarcasm, Twitter
    Model trained to detect sarcasm
    Once detected, you can decide what to do with it – because actually determining the sentiment is going to be unreliable
    New model trained on Twitter content
    Moving towards a concept of text analytics driven by business logic
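The idea of detecting sarcasm first and then letting business logic decide what to do can be sketched as below. Everything here is hypothetical: the `Analysis` fields, the 0.7 threshold, and the route names are illustrative assumptions, not part of any Lexalytics API.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Hypothetical per-document output of a text analytics engine."""
    text: str
    sentiment: float            # assumed range: -1.0 (negative) to 1.0 (positive)
    sarcasm_probability: float  # assumed range: 0.0 to 1.0


def route(analysis: Analysis, sarcasm_threshold: float = 0.7) -> str:
    """Decide how to treat a document once sarcasm has been scored.

    If sarcasm is likely, the sentiment score is not trusted and the
    document is sent for human review instead of being counted.
    """
    if analysis.sarcasm_probability >= sarcasm_threshold:
        return "review"
    if analysis.sentiment > 0:
        return "positive"
    if analysis.sentiment < 0:
        return "negative"
    return "neutral"


tweet = Analysis("Oh great, another delay. Love it.", sentiment=0.6, sarcasm_probability=0.9)
print(route(tweet))  # review
```

The threshold is a business decision rather than a linguistic one: a brand-monitoring team might review everything flagged, while a trading system might simply discard it.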
  • Thesaurus-based Theme Rollup
    Machine generated conceptual taxonomy
    Gas/Electric Hybrid and EV might roll up to EV
    Fewer themes, but very useful to detect patterns across content
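A theme rollup can be illustrated with a small lookup table. This is a simplified sketch: the slide describes a machine-generated taxonomy, whereas the `THESAURUS` mapping below is hand-written and hypothetical, built around the slide's own hybrid/EV example.

```python
from collections import Counter

# Hypothetical hand-written mapping from a specific theme to its parent concept.
THESAURUS = {
    "gas/electric hybrid": "ev",
    "plug-in hybrid": "ev",
    "ev": "ev",
}


def roll_up(themes, thesaurus):
    """Count themes after mapping each one to its parent concept.

    Themes missing from the thesaurus are kept as they are (lowercased).
    """
    rolled = Counter()
    for theme in themes:
        key = theme.lower()
        rolled[thesaurus.get(key, key)] += 1
    return rolled


extracted = ["Gas/Electric Hybrid", "EV", "battery life", "EV"]
print(roll_up(extracted, THESAURUS))  # Counter({'ev': 3, 'battery life': 1})
```

Three distinct surface themes collapse into one concept, which yields fewer themes but makes patterns across documents easier to see.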
  • Foreign Language Support
    French is first, followed by other Romance languages
    New stemmer
    New summarization algorithm
    New part-of-speech tagger
    Automatic language detection
    New sentiment/entity extraction algorithms
    Also applicable to vertical specific content
    Confidence scoring by algorithm
    Use business logic to meld the results
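One possible way to "meld" results from several algorithms is a confidence-weighted average, sketched below. The slide does not specify a melding strategy, so this is an assumption; picking the single highest-confidence result would be an equally valid alternative. The scores and confidences are hypothetical.

```python
def meld(results):
    """Combine (score, confidence) pairs into one confidence-weighted score.

    Returns None when there is nothing to combine or all confidences are zero.
    """
    total_confidence = sum(confidence for _, confidence in results)
    if total_confidence == 0:
        return None
    weighted = sum(score * confidence for score, confidence in results)
    return weighted / total_confidence


# Hypothetical outputs from two sentiment algorithms for the same sentence:
# one confident and positive, one unsure and mildly negative.
print(meld([(0.8, 0.9), (-0.2, 0.1)]))
```

In this example the confident algorithm dominates, so the melded score stays close to 0.8 while still being pulled slightly toward the dissenting result.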
  • Trainable Entity Sentiment
    New technique for entity sentiment
    Initial results from testing in English extremely promising
    Average human scoring overlap of >> 90% for scored sentences
    Initially used only for French
  • Tool Enhancements
    Eventually use on English content: Twitter, Customer Satisfaction, Others…
    Entity Management Toolkit (EMT)
    Part-of-speech tagset training
    Using it to train Salience on French
    Sentiment Toolkit
    Build your own entity sentiment models: French (first)
    New Sentiment Toolkit + Maximum Entropy model builder allows new Entity and Sentiment modules
    New EMT helps us build a new French PoS tagger
    [Diagram labels: Doc, POS Tagger, Entity Extraction & Sentiment Models, Themes & Summaries, Fully Tagged Document]
  • Business Logic + TA Algorithms
    [Diagram labels: Content, Source, Search, Business Logic, Other TA System, Sarcasm, Route On, Sports, Finance, Unknown, A, B, C, D, Probability Scores, Entity: Cisco, Cisco : Positive]
  • Summary
    Lots of people making money with text analytics
    In lots of different verticals
    The next 12 months bring a whole host of features online to make our software even more flexible
    Check out tas.lexalytics.com
    Check out www.lexalytics.com/lexascope