Extending "Variety" of Data to "Variety" of Users
Tina Groves
Big Data and Analytics, IBM
© 2014 IBM Corporation  #strataconf #hadoopworld  Ibm.com/hadoop  @tinagroves

Tina Groves
• IBM Big Data & Analytics Product Strategy team
• Product manager, 15+ years
• Focus on new product introduction and innovation areas
• Results tied to 1,000s of customers, 1,000,000s of users, and 100s of millions in revenue
• Personal: hockey mom, skier, closet Scrabble nerd, and oftentimes analytics geek

What Makes Text Analytics Challenging?
Company: IBM
Annual Revenue: 99,751
Annual Revenue Units: Billion
Number of Employees: 432,212
Tone: Conservative
Easier: One source; derive attributes
Harder: Many sources; infer perception & behaviour

What Makes Text Analytics Even MORE CHALLENGING?
Culture, Slang, Sarcasm
• Same word, different meanings, e.g., “Sick”
• Same meaning, different words, e.g., “daks”, “trousers”, “pants”
Multiple languages
• “Juan” = “John” or “Jean”
• “The lazy brown dog lazed in the sunshine” = 怠惰な茶色の犬は太陽の下でlazed
Infrastructure
Increasing
• Volume
• Sources
• Users
• Analytic complexity
Setting aside the obvious linguistic challenges…
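The "same meaning, different words" and name-variant cases above can be sketched as a simple dictionary lookup in base R. This is a minimal illustration, not a method from the talk: the `canonical` table and the `normalize_tokens` helper are hypothetical names, and the mappings are only the examples shown on the slide.

```r
# Hypothetical lookup table (illustrative, not from the talk):
# surface form -> canonical form, all lower case.
canonical <- c(daks = "trousers", pants = "trousers", trousers = "trousers",
               juan = "john", jean = "john", john = "john")

# Split text into lower-case word tokens and replace any token found in the
# table with its canonical form; unknown tokens pass through unchanged.
normalize_tokens <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  mapped <- unname(canonical[tokens])   # NA where a token is not in the table
  ifelse(is.na(mapped), tokens, mapped)
}

normalize_tokens("Juan bought new daks")
# "john" "bought" "new" "trousers"
```

A lookup like this cannot handle the "same word, different meanings" case (e.g., “Sick”), because resolving that depends on context rather than on the token alone.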

The Dilemma with Text Analytics: Skills vs. Need Disconnect
Roles: Business Analyst, Application Developer, Data Scientist
Skills: Domain Knowledge, Advanced Analytics & NLP, Programming

NLP Engines & Tools: Developer and Data Scientist Tools
• NLP market is about 10-15 years old
• Highly fragmented, no clear leader
• Many open source or free alternatives
Figure labels: tm (Text Mining); Free / Open Source; NLP Pure Plays
Sources:
• A Review of Text Analytics Suppliers, Butler Analytics, 2014-01
• Text Analytics 2014: User Perspectives on Solutions and Providers, Seth Grimes, 2014-06
• Who's Who in Text Analytics, Gartner, 2012-09

Let’s look at an example
“100 Best Jobs” Copyrighted 2014. U.S. News & World Report. 112878:914JM.
Occupation: Software Developer
Rating: 8.4
Upward Mobility: good (Average)
Stress Level: Fair (Average)
Flexibility: Fair (Average)

RStudio using stringr & Hmisc libraries against BigInsights
# Text Analytics with Open Source R
# Strata Conference - October 2014, NYC
#################################
# Clean-up environment
rm(list=ls())
# Load required open source R packages
library(stringr)
library(Hmisc)
# Loop through all files in a directory
files <- list.files(path="/home/biadmin/Desktop/Best-jobs", pattern="\\.txt$",
                    all.files=TRUE, full.names=TRUE)
outDF <- data.frame(InputDocument=character(),
                    JobTitle=character(),
                    OverallScore=numeric(),
                    Stress=character(),
                    UpWardMobility=character(),
                    Flexibility=character())
for (file in files) {
  # Read in the text file of interest
  f <- readLines(file)
  # Text to extract: Occupation (first line: one or two words)
  cline1 <- f[1]
  val1 <- as.character(str_extract(cline1, "[a-zA-Z]+\\s*[a-zA-Z]*"))
  val1 <- ifelse(is.null(val1) == TRUE, NA, val1)
  # Text to extract: Rating (Overall Score), a decimal number
  cline2 <- grep("Overall Score", f, value=TRUE)
  val2 <- as.numeric(str_extract(cline2, "[0-9]+\\.[0-9]+"))
  val2 <- ifelse(is.null(val2) == TRUE, NA, val2)
  # Text to extract: Stress Level (first word after the label)
  cline3 <- grep("Stress Level", f, value=TRUE)
  val3 <- as.character(substring(cline3, 14))
  val3 <- first.word(val3)
  val3 <- ifelse(is.null(val3) == TRUE, NA, val3)
  # From here on the slide crops the right-hand side of each line; "…" marks the cut.
  # Text to extract: Upward…
  cline4 <- grep("Upward M… value=TRUE)
  val4 <- as.character(subs…
  val4 <- first.word(val4)
  val4 <- ifelse(is.null(val4)… val4)
  # Text to extract: Flexibili…
  cline5 <- grep("Flexibility…
  val5 <- as.character(subs…
  val5 <- first.word(val5)
  val5 <- ifelse(is.null(val5)… val5)
  fileName <- basename(fi…
  newRow <- data.frame(InputDocume… JobTitle=val1, OverallSco… Stress=val3, UpWardMob…

# Text to extract: Occupation
cline1 <- f[1]
val1 <- as.character(str_extract(cline1, "[a-zA-Z]+\\s*[a-zA-Z]*"))
val1 <- ifelse(is.null(val1) == TRUE, NA, val1)
# Text to extract: Stress Level
cline3 <- grep("Stress Level", f, value=TRUE)
val3 <- as.character(substring(cline3, 14))
val3 <- first.word(val3)
val3 <- ifelse(is.null(val3) == TRUE, NA, val3)
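
Because parts of the slide code are cropped, the following is a self-contained sketch of the same kind of extraction, written in base R rather than with stringr and Hmisc. It is not the presenter's code. The sample files, their line layout ("label, then value"), the second job title, and the `value_after` helper are all assumptions made for illustration; only the column names and the Software Developer values come from the slides.

```r
# Sketch only: the file layout below is assumed, not the real "100 Best Jobs" data.
sample_dir <- tempfile("best-jobs-sketch-")
dir.create(sample_dir)
writeLines(c("Software Developer",
             "Overall Score 8.4",
             "Upward Mobility Good (Average)",
             "Stress Level Fair (Average)",
             "Flexibility Fair (Average)"),
           file.path(sample_dir, "software-developer.txt"))
# Hypothetical second file with no "Stress Level" line
writeLines(c("Hypothetical Job",
             "Overall Score 7.0",
             "Upward Mobility Fair (Average)",
             "Flexibility Good (Average)"),
           file.path(sample_dir, "hypothetical-job.txt"))

# Hypothetical helper: first word after a label, or NA if the label is absent
value_after <- function(lines, label) {
  hit <- grep(label, lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0) return(NA_character_)
  rest <- trimws(sub(label, "", hit[1], fixed = TRUE))
  strsplit(rest, "\\s+")[[1]][1]
}

# Build one row per file, then bind the rows into a single data frame
files <- list.files(sample_dir, pattern = "\\.txt$", full.names = TRUE)
rows <- lapply(files, function(path) {
  f <- readLines(path)
  data.frame(InputDocument  = basename(path),
             JobTitle       = trimws(f[1]),
             OverallScore   = as.numeric(value_after(f, "Overall Score")),
             Stress         = value_after(f, "Stress Level"),
             UpWardMobility = value_after(f, "Upward Mobility"),
             Flexibility    = value_after(f, "Flexibility"),
             stringsAsFactors = FALSE)
})
outDF <- do.call(rbind, rows)
print(outDF)
```

Matching on the label text instead of a fixed character offset is a design choice of this sketch: it avoids hard-coding a separate `substring` position for each field.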

Finished Results
Time: a few hours
Considerations
• Programming
  • Text parsing
  • Multiple files
  • Missing values resulting in missing rows
• Infrastructure
  • Dataset size
  • Single machine vs cluster
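
One way a missing value can cost a whole row in R is sketched below. This illustrates the general pitfall and is not a diagnosis of the presenter's run. When `grep` finds no match it returns a zero-length vector rather than `NULL`, so an `is.null` check does not detect it, and a zero-length column cannot be combined with length-one columns in `data.frame`. The input lines and the `or_na` helper are hypothetical.

```r
# Hypothetical input with no "Stress Level" line
lines <- c("Hypothetical Job", "Overall Score 7.0")
stress <- grep("Stress Level", lines, value = TRUE)

length(stress)    # 0: no match yields character(0)
is.null(stress)   # FALSE: character(0) is not NULL

# Building a row from a zero-length value fails, so the row would be lost
attempt <- tryCatch(
  data.frame(JobTitle = lines[1], Stress = stress),
  error = function(e) NULL
)
is.null(attempt)  # TRUE: the row could not be built

# Guard: turn "no match" into an explicit NA so the row is kept
or_na <- function(x) if (length(x) == 0) NA_character_ else x[1]
row <- data.frame(JobTitle = lines[1], Stress = or_na(stress),
                  stringsAsFactors = FALSE)
```

Checking `length(x) == 0` rather than `is.null(x)` is what keeps the row here, at the cost of an explicit `NA` in the output.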

NLP Tool: BigInsights Big Text Example
Note: In Beta

But… how to reach the Business Analyst?
Roles: Business Analyst, Application Developer, Data Scientist
Skills: Domain Knowledge, Programming, Advanced Analytics & NLP

Reaching the Business Analyst with Tools
Figure labels: A. Within Enterprise Offerings; B. Niche Tool; NLP Engines
• Key drivers: ease of use and time-to-results
• Differences from NLP tools
  • GUI-driven
  • Built-in algorithms
  • Multi-language support
• Related technologies
  • Search or Information Discovery
  • Machine Learning

The Difference
“100 Best Jobs” Copyrighted 2014. U.S. News & World Report. 112878:914JM.
[Chart: Projected Job Growth 2020, in thousands (axis 1,000 to 7,000), by education level: High School or less; 2 years post secondary or less; Bachelor's degree or higher]

Business Analyst: IBM Social Media Analytics example

Where’s the Growth?
Tools & Engines incorporate Domain Knowledge
• Key influence: SaaS
• NLP Engines → Solution Platforms
• LOB Tools → Point Solutions
• Data integration services
Areas trending
• Marketing, fraud, healthcare
Figure labels: NLP Engines; LOB Tools; Marketplace; Platforms; Point Solutions

Business Analyst: IBM Social Media Analytics example

Conclusion
1. A variety of tools is needed to reach a variety of users.
2. With a highly fragmented market, look for integration.
3. This market is changing. Don’t be afraid to re-assess.

ibmhadoop.challengepost.com
Stop by IBM Booth #321 to learn more!

Legal Disclaimer
• © IBM Corporation 2014. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is
provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not
be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any
warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this
presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing
contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

A Text Analytics Marketscape (from Strata NY 2014)

  • 1. Extending "Variety" of Data to "Variety" of Users Tina Groves Big Data and Analytics, IBM
  • 2. © 2014 IBM Corporation2 #strataconf #hadoopworldIbm.com/hadoop @tinagroves Tina Groves  IBM Big Data & Analytics Product Strategy team  Product manager, 15+ years  Focus on new product introduction and innovation areas  Results tied to 1,000s of customers; 1,000,000s of users and 100s of millions in revenue  Personal  hockey mom, skier, closet Scrabble nerd and oftentimes analytics geek
  • 3. © 2014 IBM Corporation3 #strataconf #hadoopworldIbm.com/hadoop @tinagroves What Makes Text Analytics Challenging? Company IBM Annual Revenue 99,751 Annual Revenue Units Billion Number of Employees 432,212 Tone Conservative Easier: One source; derive attributes Harder: Many sources; infer perception & behaviour
  • 4. © 2014 IBM Corporation4 #strataconf #hadoopworldIbm.com/hadoop @tinagroves What Makes Text Analytics Even MORE CHALLENGING? Culture, Slang, Sarcasm • Same word, different meanings e.g., “Sick” • Same meaning, different words e.g., “daks”, “trousers”, “pants” Infrastructure Increasing • Volume • Sources • Users • Analytic complexity “Juan” = “John” or “Jean” “The lazy brown dog lazed in the sunshine” = Multiple languages 怠惰な茶色の犬は太陽 の下でlazed Setting aside the obvious linguistic challenges…
  • 5. © 2014 IBM Corporation5 #strataconf #hadoopworldIbm.com/hadoop @tinagroves The Dilemma with Text Analytics: Skills vs. Need Disconnect Business Analyst Application Developer Data Scientist Domain Knowledge Advanced Analytics & NLP Programming
  • 6. © 2014 IBM Corporation6 #strataconf #hadoopworldIbm.com/hadoop @tinagroves NLP Engines & tools: Developer and Data Scientist Tools • NLP market is about 10-15 yrs old • Highly fragmented, no clear leader • Many open source or free alternatives tm Text Mining Free / Open Source NLP Pure Plays Sources: • A Review of Text Analytics Suppliers, Butler Analytics, 2014-01 • Text Analytics 2014: User Perspectives on Solutions and Providers, Seth Grimes, 2014-06 • Who's Who in Text Analytics, Gartner, 2012-09
  • 7. © 2014 IBM Corporation7 #strataconf #hadoopworldIbm.com/hadoop @tinagroves “100 Best Jobs” Copyrighted 2014. U.S. News & World Report. 112878:914JM.    Occupation Software Developer Rating 8.4 Upward Mobility good (Average) Stress Level Fair (Average) Flexibility Fair (Average) Let’s look at an example
  • 8. © 2014 IBM Corporation8 #strataconf #hadoopworldIbm.com/hadoop @tinagroves Rstudio using stringr & hmisc libraries against BigInsights
  • 9. © 2014 IBM Corporation9 #strataconf #hadoopworldIbm.com/hadoop @tinagroves # Text Analytics with Open Source R # Strata Conference - October 2014, NYC ################################# # Clean-up environment rm(list=ls()) # Load required open source R packages library(stringr) library(Hmisc) # Loop through all files in a directory files <- list.files(path="/home/biadmin/Desktop/Be st-jobs", pattern=".txt", all.files=T, full.names=T) outDF <- data.frame(InputDocument=character(), JobTitle=character(), OverallScore=numeric(), Stress=character(), UpWardMobility=character(), Flexibility=character()) for (file in files) {  # Read in the text file of interest f <- readLines(file) # Text to extract: Occupation cline1 <- f[1] val1 <- as.character(str_extract(cline1,"[a- zA-Z]+s*[a-zA-Z]*")) val1 <- ifelse(is.null(val1) == TRUE, NA, val1) # Text to extract: Rating (Overall Score) cline2 <- grep("Overall Score", f, value=TRUE) val2 <- as.numeric(str_extract(cline2,"[0- 9]+.[0-9]+")) val2 <- ifelse(is.null(val2) == TRUE, NA, val2) # Text to extract: Stress Level cline3 <- grep("Stress Level",f, value=TRUE) val3 <- as.character(substring(cline3, 14)) val3 <- first.word(val3)  val3 <- ifelse(is.null(val3) val3) # Text to extract: Upward cline4 <- grep("Upward M value=TRUE) val4 <- as.character(subs val4 <- first.word(val4) val4 <- ifelse(is.null(val4) val4) # Text to extract: Flexibili cline5 <- grep("Flexibility val5 <- as.character(subs val5 <- first.word(val5) val5 <- ifelse(is.null(val5) val5) fileName <- basename(fi newRow <- data.frame(InputDocume JobTitle=val1, OverallSco Stress=val3, UpWardMob 
  • 10.
    # Text to extract: Occupation
    cline1 <- f[1]
    val1 <- as.character(str_extract(cline1,"[a-zA-Z]+\\s*[a-zA-Z]*"))
    val1 <- ifelse(is.null(val1) == TRUE, NA, val1)
    # Text to extract: Stress Level
    cline3 <- grep("Stress Level",f, value=TRUE)
    val3 <- as.character(substring(cline3, 14))
    val3 <- first.word(val3)
    val3 <- ifelse(is.null(val3) == TRUE, NA, val3)
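The two extractions on slide 10 can be tried without any files on disk. What follows is a minimal sketch, not the presenter's code: the sample document is hypothetical (the talk does not show the input files), and base R's `regmatches()`/`regexpr()` and `sub()` stand in for `stringr::str_extract()` and `Hmisc::first.word()` so that the snippet has no package dependencies.

```r
# Hypothetical sample document laid out like the job card on slide 7.
f <- c("Software Developer",
       "Overall Score 8.4",
       "Stress Level Fair (Average)",
       "Upward Mobility Good (Average)",
       "Flexibility Fair (Average)")

# Occupation: one or two words on the first line.
job_title <- regmatches(f[1], regexpr("[a-zA-Z]+\\s*[a-zA-Z]*", f[1]))

# Stress Level: skip the 13-character label "Stress Level ",
# then keep only the first word of what remains.
stress_line <- grep("Stress Level", f, value = TRUE)
stress <- sub(" .*$", "", substring(stress_line, 14))

# Overall Score: first decimal number on the matching line.
score_line <- grep("Overall Score", f, value = TRUE)
score <- as.numeric(regmatches(score_line, regexpr("[0-9]+\\.[0-9]+", score_line)))

print(data.frame(JobTitle = job_title, OverallScore = score, Stress = stress))
```

The fixed offset of 14 only works while the label text stays exactly "Stress Level "; any change in the source layout means editing the parsing code, which is part of why this approach took a few hours.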
  • 11. Finished Results. Time: a few hours. Considerations: Programming (text parsing; multiple files; missing values resulting in missing rows); Infrastructure (dataset size; single machine vs. cluster).
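The "missing values resulting in missing rows" consideration is worth a concrete illustration. In R, `grep()` returns `character(0)` when a label is absent and a failed pattern match yields `NA`; neither is `NULL`, so an `is.null()` check does not catch them. The sketch below is one hedged way to guard against that so every document still contributes a row. The helper names `extract_after_label` and `parse_job` and the in-memory sample documents are assumptions for illustration, not part of the talk.

```r
# Return the first word after a label, or NA when the label is absent.
# (Hypothetical helper; not from the presentation.)
extract_after_label <- function(lines, label) {
  hit <- grep(label, lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0) return(NA_character_)
  rest <- trimws(substring(hit[1], nchar(label) + 1))
  sub(" .*$", "", rest)
}

# Build exactly one row per document, with NA for any missing field.
parse_job <- function(lines, doc_name) {
  data.frame(InputDocument  = doc_name,
             JobTitle       = if (length(lines) > 0) trimws(lines[1]) else NA_character_,
             Stress         = extract_after_label(lines, "Stress Level"),
             UpWardMobility = extract_after_label(lines, "Upward Mobility"),
             Flexibility    = extract_after_label(lines, "Flexibility"),
             stringsAsFactors = FALSE)
}

# Hypothetical documents; the second lacks two of the labelled lines.
docs <- list(
  "a.txt" = c("Software Developer", "Stress Level Fair (Average)",
              "Upward Mobility Good (Average)", "Flexibility Fair (Average)"),
  "b.txt" = c("Dentist", "Flexibility Good")
)

rows  <- Map(parse_job, docs, names(docs))
outDF <- do.call(rbind, rows)
print(outDF)
```

Collecting the rows in a list and binding them once at the end also avoids growing a data frame inside a loop, which matters as the number of files increases.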
  • 12. NLP Tool: BigInsights Big Text Example. Note: In Beta
  • 13. But… how to reach the Business Analyst? [Diagram: Business Analyst, Application Developer, Data Scientist; Domain Knowledge, Programming, Advanced Analytics & NLP]
  • 14. Reaching the Business Analyst with Tools: A. Within Enterprise Offerings; B. Niche Tool. • Key drivers: ease of use and time-to-results • Differences from NLP tools: GUI-driven; built-in algorithms; multi-language support • Related technologies: Search or Information Discovery; Machine Learning. [Diagram label: NLP Engines]
  • 15. The Difference. [Chart: Projected Job Growth 2020, in thousands (axis from 1,000 to 7,000), by education level: High School or less; 2 years post secondary or less; Bachelor's degree or higher.] Source: “100 Best Jobs”, copyrighted 2014, U.S. News & World Report. 112878:914JM.
  • 16. Business Analyst: IBM Social Media Analytics example
  • 17. Where’s the Growth? • Key influence: SaaS • NLP Engines, Solution Platforms • LOB Tools, Point Solutions • Data integration services. Areas trending: marketing, fraud, healthcare. [Diagram labels: Tools & Engines incorporate Domain Knowledge; Marketplace; Platforms; Point Solutions; NLP Engines; LOB Tools]
  • 18. Business Analyst: IBM Social Media Analytics example
  • 19. Conclusion: 1. A variety of tools is needed to reach a variety of users. 2. With a highly fragmented market, look for integration. 3. This market is changing. Don’t be afraid to re-assess.
  • 21. ibmhadoop.challengepost.com Stop by IBM Booth #321 to learn more!
  • 22. Legal Disclaimer • © IBM Corporation 2014. All Rights Reserved. • The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. • References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. • If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. 
The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. • If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete: All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. • Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both. 
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete: Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. • If you reference Java™ in the text, please mark the first use and include the following; otherwise delete: Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. • If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete: Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both. • If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete: Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. • If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete: UNIX is a registered trademark of The Open Group in the United States and other countries. • If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete: Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. 
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.