I Can Do Text Analytics!
Designing Development Tools for Novice Developers
Huahai Yang* Daina Pupons-Wickham** Laura Chiticariu*
Yunyao Li* Benjamin Nguyen** Arnaldo Carreno-fuentes*
*IBM Research - Almaden **IBM Software - Silicon Valley
OUTLINE
• Problem motivation
– Text analytics
– User population and needs
• Formative design iterations
– Expert interviews
– User studies in lab and field
• Current design and evaluation
– Workflow Guide and Extraction Plan
– Evaluation by competition
TEXT ANALYTICS
Inputs: Public Text, Web Text, Private Text
Sources: Social media, News, SEC, USPTO, Internal Data, Subscription Data
Text Analytics
Applications: Marketing, Financial investment, Drug discovery, Law enforcement, …
HIDDEN VALUES IN TEXT
DREAM
REALITY
TEXT ANALYTICS IS HARD
ML DOES NOT SAVE THE DAY
Wagstaff, K. Machine Learning that Matters. In ICML (2012)
ANNOTATION QUERY LANGUAGE (AQL)
• A declarative language for
developing text analytics
extractors [Chiticariu et al., 2010]
• Very expressive
• Runs very fast
SIMPLE EXAMPLE: OPINION ON A MOVIE
Input
Mission Impossible has an entertaining plot, but terrible acting.
Annotations: Movie = "Mission Impossible"; Opinion = "entertaining", "terrible"; Aspect = "plot", "acting"
Desired Output
(Movie Name, Aspect, Opinion)
(Mission Impossible, plot, positive)
(Mission Impossible, acting, negative)
SAMPLE AQL FOR OPINION ON A MOVIE
Pattern: <Movie> (0-15 tokens) <Opinion> (0 tokens) <Aspect>
create view Opinion as
  extract dict 'opinion.dict' on D.text
  from Document D;
create view Aspect as
  extract dict 'aspect.dict' on D.text
  from Document D;
create view MovieReviewSnippet as
  select M.name as name, O.value as value, A.aspect as aspect,
         CombineSpans(M.name, A.aspect) as review
  from Movie M, Opinion O, Aspect A
  where FollowsTok(M.name, O.value, 0, 15)
    and FollowsTok(O.value, A.aspect, 0, 0);
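For readers without an AQL runtime, the same idea (dictionary matches joined by token-gap constraints) can be sketched in Python. This is only an analogue under stated assumptions, not how AQL is implemented: the word lists, the tokenizer, and the helper names `find_spans`, `follows_tok` and `extract_opinions` are all hypothetical.

```python
import re
from itertools import product

# Hypothetical word lists standing in for the movie, opinion and aspect dictionaries.
MOVIES = ["Mission Impossible"]
OPINIONS = {"entertaining": "positive", "terrible": "negative"}
ASPECTS = ["plot", "acting"]


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_spans(tokens, phrase):
    """Return (begin, end) token offsets of every occurrence of phrase."""
    words = [w.lower() for w in phrase.split()]
    lowered = [t.lower() for t in tokens]
    n = len(words)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if lowered[i:i + n] == words]


def follows_tok(first, second, min_gap, max_gap):
    """True if `second` begins min_gap..max_gap tokens after `first` ends."""
    return min_gap <= second[0] - first[1] <= max_gap


def extract_opinions(text):
    """Return (movie, aspect, polarity) tuples matching the slide's pattern."""
    tokens = tokenize(text)
    movies = [(s, m) for m in MOVIES for s in find_spans(tokens, m)]
    opinions = [(s, p) for w, p in OPINIONS.items() for s in find_spans(tokens, w)]
    aspects = [(s, a) for a in ASPECTS for s in find_spans(tokens, a)]
    results = []
    for (m_span, name), (o_span, polarity), (a_span, aspect) in product(
            movies, opinions, aspects):
        # Movie, then opinion within 15 tokens, then aspect immediately after.
        if follows_tok(m_span, o_span, 0, 15) and follows_tok(o_span, a_span, 0, 0):
            results.append((name, aspect, polarity))
    return results


print(extract_opinions(
    "Mission Impossible has an entertaining plot, but terrible acting."))
```

The join is a brute-force cross product, which is fine for one sentence but says nothing about how an optimized engine would evaluate the same rule.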
SKILLED PROGRAMMER, BUT NOVICE
DEVELOPER IN TEXT ANALYTICS
• Named Entities
• Sentiment
• Purchase Intent
• Consumer Profile
• Root Cause
• Risk Analysis
• Protein Interaction
CAN NOVICE DEVELOPERS BE PRODUCTIVE?
WHAT IS MISSING HERE?
BRING TEXT BACK TO TEXT ANALYTICS
WHAT DO EXPERT DEVELOPERS KNOW?
We designed tools to embody the best practice
FORMATIVE LAB STUDY
• 14 novice developers
• First given a tutorial on AQL
• Task: extract revenue by division from
a company annual report
• Without the tool, none completed the task
• With the tool, all completed it within 90 minutes
FORMATIVE FIELD STUDY
• 12 weeks, 10 project members, 4 doing text
analytics (4 or 5 hours per week)
• Built profiles for pharmaceutical companies
• Interviews
– Participants reported that the tool was easy to use
– Participants made many suggestions for UI
enhancement
MAIN FEATURE: WORKFLOW GUIDE
MAIN FEATURE: EXTRACTION PLAN
CODE TEMPLATE FROM EXTRACTION PLAN
EVALUATION BY COMPETITION
• Task: buzz identification -
identifying tweets mentioning
the top 10 Billboard songs in the
week of May 5, 2012
• Participants: summer interns, 6
registered, 4 submitted answers
• Prize: $500 for the winner
• Setup:
– Participants were given labeled training data (159 tweets)
– Participants wrote extractors independently with our tool
– Extractor quality measured on unseen test data (100 tweets)
Pre-competition Briefing
TASK HARDER THAN IT LOOKS
• RT @ardanradio #NowPlaying FUN feat Janelle Monae - We Are Young | #RIAUW
• RT @arieladriane: @1DirectionIndo what makes you beautiful - one direction cover by glee.
http://t.co/t4BmvZbM
• @Cimorelliband @LisaCim @LaurenCimorelli payphone was amazing can you guys please do we are
young by fun!! Thanks.
• RT @Jadore1Dx: Dear Mothers & fathers of 1D - as The Wanted would say, im glad you came.
• RT @Melisaaa11: My boyfriend knows hes jealous of my relationship with Justin Bieber
• Now u just somebody that I used to know!
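To see why the task is harder than it looks, consider a naive baseline that flags any tweet containing a song title after normalizing case and punctuation. The sketch below is an assumption for illustration only: the title list is taken from the sample tweets rather than from the actual chart, and `normalize` and `mentioned_titles` are hypothetical helpers, not part of the tool.

```python
import re

# Illustrative titles drawn from the sample tweets; NOT the actual chart list.
SONG_TITLES = [
    "We Are Young",
    "What Makes You Beautiful",
    "Payphone",
    "Glad You Came",
    "Somebody That I Used to Know",
]


def normalize(text):
    """Lowercase and reduce text to letter/digit runs joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def mentioned_titles(tweet, titles=SONG_TITLES):
    """Return the titles whose normalized words appear contiguously in the tweet."""
    padded = f" {normalize(tweet)} "
    return [t for t in titles if f" {normalize(t)} " in padded]


print(mentioned_titles("Now u just somebody that I used to know!"))
print(mentioned_titles(
    "RT @Melisaaa11: My boyfriend knows hes jealous of my relationship "
    "with Justin Bieber"))
```

A matcher like this cannot tell a tweet that discusses a song from one that merely reuses its words, which is the kind of ambiguity the sample tweets illustrate; whether each tweet counts as real buzz is a labeling decision the slides do not state.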
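PERFORMANCE MEASURE
• Precision
– Proportion of identified buzz that are real: P = |identified ∩ real| / |identified|
• Recall
– Proportion of real buzz identified: R = |identified ∩ real| / |real|
• F1
– Combining precision and recall: F1 = 2PR / (P + R)
[Figure: diagram relating all test tweets, tweets identified as buzz, and real buzz]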
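The three measures can be computed from two sets of tweet IDs: those an extractor identified as buzz and those labeled as real buzz. The following is a minimal sketch using the standard definitions; the function name and the example IDs are made up for illustration and are not the competition data.

```python
def precision_recall_f1(identified, real):
    """Compute precision, recall and F1 from two collections of tweet IDs."""
    identified, real = set(identified), set(real)
    true_positives = len(identified & real)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 tweets flagged, 5 truly buzz, 3 in common.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so an extractor cannot score well by maximizing one at the expense of the other.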
EVALUATION RESULTS
• State of the art F1 is around 80% for similar
tasks [Ritter et al. EMNLP’11; Liu et al. ACL’12]
INTERVIEW
• Interviewed before announcing winners
• All worked only the day before deadline
• The winner worked only 5 hours
“Because the process is very clear, the wizard is very easy to follow”
“is quite helpful to analyze the sample data and define basic concepts.
I used it extensively to create my dictionaries”
“I did not face any problems using the tool”
LOWER BARRIER TO COMPLEX DOMAIN
CONTRIBUTIONS
• Summarized the best practice of text analytics
via expert interviews
• Built UI features to support the text analytics
best practice
• Lowered barrier and raised ceiling for text
analytics
FUTURE WORK
• Enable non-programmers to build text
extractors with power similar to AQL
• Collaborative text analytics
Q & A
More Info
Huahai Yang
IBM Research - Almaden
hyang@us.ibm.com
• IBM InfoSphere BigInsights Text Analytics YouTube videos:
http://bit.ly/10pfDgY
• Online classes: http://BigDataUniversity.com


Editor's Notes

  1. Good afternoon. My name is Huahai Yang, I am from IBM Almaden Research Center. I am happy to present our work on developing tools to support novice developers of text analytics.
  2. I will begin by introducing the problem of text analytics and explain the reason why it is a hard problem. I will then introduce the kind of programming task and the user population we are targeting. After briefly going through our iterative design and evaluation process, I will explain the user interface features that we designed to support novice developers in text analytics. Finally, I will report the formal evaluation we conducted and the very encouraging results obtained.
  3. On the Web today, there is abundant text available on social media sites such as Twitter, Facebook and Yelp. Government bodies, for example the Securities and Exchange Commission and the US Patent and Trademark Office, also make a large amount of text publicly accessible. In private organizations, there is an enormous amount of textual data, such as call center transcripts. Enterprises can also purchase text data such as reports from Gartner and Forrester. Text analytics, as a technical field, takes all this text data as input and tries to extract valuable information from it, in order to build practical applications in a wide range of domains, such as marketing, financial services, pharmaceuticals, law enforcement, and so on.
  4. The hidden values in text can be very diverse. Using text analytics, we may want to find named entities such as persons, organizations and addresses in email archives, to discover consumers’ perceptions of a brand in online reviews, to build consumer profiles using social media, to establish relationships among companies using their financial filings, or to discover protein interactions from medical literature.
  5. Our dream is to be able to surf the vast ocean of textual information and find the value with ease.
  6. The reality is that we are often overwhelmed by mountains of unstructured text and cannot find the information we need.
  7. Adding to the difficulty, human language is full of subtlety and nuance. Automatically processing natural language text is simply a very hard problem. Correspondingly, the applications built to do text analytics are often very hard to use.
  8. With the rise of machine learning, some may say that text analytics could be done automagically by machine learning. But machine learning based solutions often require PhD-level expertise in machine learning. For the domain of text analytics, we seek to lower the barrier to entry and to enable ordinary programmers to do text analytics.
  9. At IBM, we have developed a declarative language for doing text analytics, called AQL. The language uses the familiar syntax of the database query language SQL, includes many very powerful semantic operators dedicated to text analytics, and runs orders of magnitude faster than other text analytics technologies, thanks to its use of mature database-like optimizations.
  10. To see how AQL works, here is a typical example task of aspect-based opinion extraction. For a text input, we attempt to generate an output of tuples with fields for movie name, aspect and opinion polarity.
  11. A simple AQL solution for the extraction problem could look like this: we first build components for opinions, movie names and aspects using corresponding word lists, then combine these components into a text pattern, taking into consideration the possible gaps among the components. Real business solutions could be much more complex and involve hundreds of AQL rules like these.
  12. The goal of this project is to enable an ordinary, competent programmer, who is well versed in general-purpose programming
  13. but is a novice in developing business solutions for text analytics, to become productive very quickly in AQL.
  14. As we found out in a user study a few years ago, novice AQL developers have enormous conceptual difficulty in understanding how to approach a text analytics problem. Even with a great deal of help, it is still very hard for some of the developers.
  15. Stepping back, we looked at what an AQL developer would see after starting a project. For a developer asked to do text analytics, what is missing in this user interface?
  16. Exactly, the text. The first design decision we made after engaging with the project was to bring the text back and put it at the center of the user interface.
  17. We also noticed that there are a few very skilled AQL developers in the company, and we were curious about what they know about the text analytics development process that novice developers lack. We interviewed two expert AQL developers on a weekly basis for a month, encouraging them to articulate the best practices of text extractor development and come up with a workflow for text analytics. The first step of the workflow is to develop a plan, or specification, for text extraction. This is achieved by examining input text documents and identifying parts of the text that provide clues for solving the text analytics problem at hand.
  18. Once the extraction specification is defined, developers go through an iterative process of rule development: writing some AQL code, testing the code, refining the code, and testing it again. They repeat this cycle of refinement until the extractor is deemed to be of sufficient quality.
  19. After the semantics of the extractor are implemented, it is often necessary to profile the extractor to remove performance bottlenecks. Finally, the extractor is packaged for delivery to applications. Based on this workflow, we developed two new tools, Workflow Guide and Extraction Plan, as part of the AQL development environment. Their details will be discussed shortly.
  20. After the initial implementation of the tools, we conducted a formative lab user study and found that the tools were generally helpful for novice developers.
  21. We then put the tool into a three-month real-world deployment on a real development project. The interviews with the developers were very helpful in improving the tool design.
  22. One of the main features of our tool is the Workflow Guide, modeled on the best practices elicited from the expert AQL developers. The step-by-step guide gives users detailed instructions and provides shortcuts to perform the needed actions. The guide can be hidden once the user becomes proficient. Only the first step, “selecting of text documents to work with”, is mandatory once the project is started; all the other steps are optional.
  23. The most important feature of the UI is the Extraction Plan. When users highlight parts of the text documents, they are given the option to turn the text snippets into examples of semantic labels and add them to the extraction plan. Users can also edit the extraction plan directly to specify a semantic structure for the text extractor.
  24. For each element of the extraction plan, the user has the option to choose its concrete implementation with AQL features. The system then generates the appropriate code template for the user to fill in the details. This way, users are exposed to the language and gradually learn it.
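To make the idea of a generated code template concrete, here is a small Python sketch of how a label from an extraction plan could be turned into a fill-in-the-blank rule. It is an illustration only: the template text mirrors the simplified dictionary view shown on the sample AQL slide, and the function name and placeholder wording are hypothetical. It does not reproduce the templates the actual tool generates.

```python
from string import Template

# Hypothetical template, modeled on the simplified dictionary view from the
# sample AQL slide; not the tool's real generated output.
DICT_VIEW = Template(
    "create view $label as\n"
    "  extract dict '$dict_file' on D.text\n"
    "  from Document D;"
)


def dictionary_template(label, dict_file="<TODO: dictionary file>"):
    """Return a rule skeleton for one extraction-plan label.

    The default dict_file is a placeholder the developer is expected to
    replace, which is how the skeleton exposes the language step by step.
    """
    return DICT_VIEW.substitute(label=label, dict_file=dict_file)


if __name__ == "__main__":
    print(dictionary_template("Aspect"))
```

The design point is that the user picks a label and an implementation style, and the system supplies the surrounding syntax, leaving only the details to fill in.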
  25. The formative user studies gave us some confidence in our system’s ability to lower the barrier of entry to text analytics. In the formal evaluation, we wanted to know how high the quality of extractors written by novice developers using our tool could be. To this end, we organized a competition among the summer interns in our lab. None of the interns had any prior experience with text analytics. The winner of the competition would receive $500. About 10 interns showed up at the pre-competition meeting. Seeing how difficult the task was, only 6 of them registered for the competition. At the end of the two-week competition period, 4 students submitted answers. The task was a so-called buzz identification task: finding tweets that mentioned the top 10 songs in May last year…
  26. The 10 songs are listed at the top of the screen. As can be seen, many of the song titles contain very common words. Here are some example tweets. The correctly identified songs are in green. These tweets are correctly identified buzz about songs. However, the bottom two tweets are not buzz.
  27. We used the standard precision, recall and F1 measures to assess the quality of the extractors. Precision measures the accuracy of the extractor (the fraction of its results that are correct), recall measures its coverage (the fraction of correct results that it finds), and F1, the harmonic mean of the two, gives a single index of quality.
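For reference, the three measures can be computed as in the short Python sketch below. The function name and the use of tweet IDs as set elements are assumptions for illustration; the talk does not describe the evaluation scripts that were actually used.

```python
def precision_recall_f1(extracted, gold):
    """Compare extractor output with a gold standard.

    extracted: items the extractor labeled as buzz (e.g. tweet IDs)
    gold:      items that really are buzz
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)

    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of correct items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, an extractor that returns four tweets, two of which are among three true buzz tweets, has a precision of 2/4, a recall of 2/3, and an F1 of 4/7.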
  28. We were pleasantly surprised to find that the quality of the extractors written by the students was very high. As can be seen, three of the four participants wrote extractors that beat, by a large margin, the state-of-the-art F1 results reported in the literature for a similar task.
  29. Interviews with the participants revealed that all of them worked on the competition only on the day of the deadline. The winner started working on the problem only after leaving work at 6pm and finished before the 11:59 deadline. All participants reported satisfaction with the development tool and felt it was adequate for developing the extractors.
  30. Although we have only tested our approach in the text analytics domain, we feel a general lesson can be learned here about teaching a domain-specific language. A workable approach seems to involve starting by visualizing the work object and enabling direct manipulation of it to create a conceptual plan of operation. The system can then self-disclose the underlying language, which enables users to learn the domain-specific language gradually. Finally, immediate feedback between changes to the language and changes to the work object is important for obvious reasons. The key idea here is the explicit representation of the conceptual plan in the user interface.
  31. In summary, through expert interviews, we abstracted out the best practices of text analytics and designed user interface features to support them in an iterative design and evaluation process. In the end, our solution not only lowered the barrier of entry to text analytics but also raised the ceiling for the task, enabling novice developers to consistently build high-quality extractors that beat the state of the art with minimal training and a few hours of uptake time.
  32. Currently, we are working on the more challenging problem of developing a user interface that enables non-programmers to build text extractors with the same power as AQL without seeing the code. In addition, supporting collaborative text analytics is a valuable future direction.
  33. Thank you. I am now happy to take questions.