Slide 1 Copyright © 2012 MarkLogic® Corporation. All rights reserved.
Evolving MyBuzzMetrics with Text Analytics
September 2012
Eric Austvold – Insights Executive
Fernando Mesa – WW Director of Enterprise Solution
Pete Aven – Systems Engineer
Slide 2
Agenda
§ Introductions
§ NM Incite goals for text analytics
§ MarkLogic evolving MyBuzzMetrics with Text Analytics
§ Entity Extraction
§ Topic Discovery / Theme extraction
§ Data Faceting
§ Trend spotting
§ Visualization
§ Use Cases and Demos
§ Next Steps
Slide 3
Goals for Text Analytics
§ What’s your goal?
§ What are your clients asking you for?
§ How do you want to service your internal clients? Analysts,
researchers, account managers?
§ How do you want to service your external clients? Self service
reporting? Ad-hoc analysis? Integration with their data?
§ How do you envision your new solution complementing other
Nielsen services?
Slide 4
Text Analytics - Evolution
Traditional Methods
§ Reliant on relational data structures that are challenging to manage; silos of data
§ Not indexed immediately; not possible to query in real time
§ New parses = re-ingestion
§ Re-ingestion = new schema design, which creates delays
§ Not real time: difficult to determine buzz
§ Impossible at 30+ billion docs
§ Pre-processing required to handle batches of data
§ Extraction methods lose context and full perspective
MarkLogic Enabled Methods
§ Flexible: built on an infrastructure that can integrate text mining output
§ Context aware: without schema redesigns, the context of the original document persists as text miners enrich that content, preserving relationships to the original data
§ Scales: can accommodate real-time ad hoc queries and reports across a corpus of 30+ billion documents
§ Enrichment: a better method of leveraging text mining work
Slide 5
The Parse
§ The Parse
§ Actor, Action, Object
§ Fact
§ Entity
§ Qualifier
§ Etc.
§ BASIS Entity Enrichment
§ Open Enrichment Framework
§ Calais
§ Temis
§ Data Harmony
§ NetOwl
What it means…
We can integrate with
all enrichment engines.
Slide 6
The Platform
§ The Platform
§ Flexibility
§ Speed
§ Scale
§ Delivery of Insight
What it means…
Clients can rapidly deliver insight in real time,
helping users make new discoveries.
Slide 7 Copyright © 2009 Mark Logic Corporation. All rights reserved.
MarkLogic and Text Analytics
[Architecture diagram. Labels: Web Services; ETL Connector (*); Social Media Connector (*); RDBMS connector; Search; Unified Index (for all data structures); Transactional Database; Data Retrieval; Repository; Classification; Concept Extraction; Entity Enrichment; Taxonomies; Third-party Partners; App Server; APIs/Services; Web Applications; Decision Support; Analytics]
Leverage value generated from text mining
Generate Opinions (in the form of data)
Slide 8
Traditional Enrichment = Extraction
Source record:
First Name | Last Name | Other | Comments
Chris | Smith | Data | Chris bought an upgrade package for his black, 2011, Honda Pilot on 9/16. Car returned for service on 9/21. The bolt on the undercarriage cracked due to heat. He doesn’t think it’s the transmission however as …..

Extracted tables:
Actor | Action | Object
Chris | buy | package

Fact
package-buy
car-return
bolt-cracked

Entity | Type
Chris | person
Honda | organization
9/16 | date

Qualifiers
upgrade
black

More Parsing = More Tables/Rows = More Joins = Does Not Scale!
And What About Context?
Slide 9
Enrichment with MarkLogic
<actor><person>Chris</person></actor> <action>bought</action> an
<qualifier>upgrade</qualifier> <object>package</object> for his
<qualifier>black</qualifier>, <qualifier>2011</qualifier>,
<organization>Honda</organization> Pilot on <date>9/16</date>. Car
returned for service on <date>9/21</date>. The bolt on the undercarriage
cracked due to heat. <person name="Chris">He</person> doesn’t think it’s
the transmission however as …..
Pepsi<name> </name><brand> </brand><drink> </drink>
Markup Inline!
Every Tag Becomes a Candidate For an Index!
What it means…
Enrichment persists context and scales without a schema redesign,
saving time and resources as client needs evolve.
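As a minimal sketch of why inline markup keeps context, the Python below parses a shortened version of the enriched sentence from this slide. The `<comment>` wrapper element and the `index` dictionary are assumptions added for illustration; they are not MarkLogic features. The point is only that the original sentence and the extracted values both survive in one document.

```python
import xml.etree.ElementTree as ET

# Hypothetical <comment> wrapper so the fragment is a well-formed document.
ENRICHED = (
    "<comment><actor><person>Chris</person></actor> "
    "<action>bought</action> an <qualifier>upgrade</qualifier> "
    "<object>package</object> for his <qualifier>black</qualifier>, "
    "<qualifier>2011</qualifier>, <organization>Honda</organization> "
    "Pilot on <date>9/16</date>.</comment>"
)

root = ET.fromstring(ENRICHED)

# The original sentence is still recoverable: join every text node in order.
original_text = "".join(root.itertext())

# Each tag name acts like a candidate index: collect the values found under it.
index = {}
for element in root.iter():
    if element is root:
        continue
    index.setdefault(element.tag, []).append("".join(element.itertext()))

print(original_text)
print(index["qualifier"])
```

No join across separate tables is needed to get from an entity back to the sentence it came from, which is the contrast the slide draws with extraction.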
Slide 10
Text Mining is part of the big picture
Words and phrases
... Semantic Web is a collaborative
movement led by the World Wide Web
Consortium (W3C) ...
Structure
[Element tree diagram. Node labels: Label, Author, Ing, Comp, ID, Para, Org]
Data/Metadata
name:sorbitol
date:2012-06-04
company:Roche
Entities in Context
... diabetes, since the risk of
blindness is very high in such
patients...
Geospatial
<location>
<lat>46.946584</lat>
<lng>93.076172</lng>
</location>
Universal Index
Slide 11
Demo
Slide 12
Agenda
§ Introductions
§ Nielsen’s goals and challenges related to unstructured data
§ MarkLogic Beyond Big Data Search
§ Entity Extraction
§ Topic Discovery / Theme extraction
§ Data Faceting
§ Trend spotting
§ Visualization
§ Use Cases and Demos
§ Next Steps
Slide 15
MarkLogic Analytics
§ Why Use MarkLogic Analytics?
§ Term list analytics
§ Range index analytics
§ Combining term lists and range indexes
§ Range index best practices & references
Slide 16
MarkLogic Analytics – why use it?
Applications increasingly combine structured and unstructured
information (e.g., electronic healthcare records)
Show me male patients that are under the age of 45 with an
ADMITTING DIAGNOSIS that included Chest Pain, or with a
HISTORY OF PRESENT ILLNESS including symptoms for Chest
Pain, Shortness of Breath, or Dizziness. Additionally, identify patients
within this population with regular alcohol consumption in the SOCIAL
HISTORY, alcoholism in the FAMILY HISTORY, and one of the
following 17 synonyms for stress diagnoses in the ASSESSMENT
AND TREATMENT PLAN.
Legend: Structured | Unstructured/Contextual
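A rough sketch of the first half of that request, assuming a hypothetical `Record` type with two structured fields and a dictionary of named free-text sections. It is plain Python, not MarkLogic code; it only illustrates combining a structured filter (sex, age) with a search scoped to a specific section of the record.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    sex: str
    age: int
    sections: dict = field(default_factory=dict)  # section name -> free text


def mentions(record, section, terms):
    """True if any term appears in the named section (case-insensitive)."""
    text = record.sections.get(section, "").lower()
    return any(term.lower() in text for term in terms)


def matches(record):
    structured = record.sex == "male" and record.age < 45
    contextual = (
        mentions(record, "ADMITTING DIAGNOSIS", ["chest pain"])
        or mentions(
            record,
            "HISTORY OF PRESENT ILLNESS",
            ["chest pain", "shortness of breath", "dizziness"],
        )
    )
    return structured and contextual


# Made-up sample records for illustration only.
records = [
    Record("male", 39, {"ADMITTING DIAGNOSIS": "Chest pain on exertion"}),
    Record("male", 52, {"ADMITTING DIAGNOSIS": "Chest pain"}),
    Record("female", 30, {"HISTORY OF PRESENT ILLNESS": "Dizziness"}),
    Record("male", 41, {"SOCIAL HISTORY": "Chest pain reported by a relative"}),
]
hits = [r for r in records if matches(r)]
print(len(hits))
```

The last sample record contains the phrase "chest pain" but in the wrong section, so it is excluded; that is the contextual part of the query.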
Slide 32
Agenda
§ Text Enrichment
Slide 33
Text Enrichment – with Entities
§ Load, manipulate, query content as-is
§ … then enrich the content over time
§ Entity extraction
§ Specialized technology
§ Identifies people, places, things in free text
§ Entity extraction -> Entity enrichment
§ Entities are marked up inline
§ Gives you
§ More focused search (includes proximity, structure)
§ Analytics
§ Alerting
Slide 34
Enrich Your Content … With
Entities: Example
<Article xmlns:e="http://marklogic.com/entity">
<title><e:person>John Louis</e:person></title>
<acknowledgement><e:gpe>Wikipedia</e:gpe>, the free encyclopedia</acknowledgement>
<section>
<para>"Tiger" <e:person>John Louis</e:person> (born <e:date>14 June
1941</e:date>)[<refto ID="1">1</refto>] was an <e:gpe>England</e:gpe> international speedway
rider who rode for <e:organization>Ipswich Witches</e:organization>. He is the father of
<e:gpe>Great Britain</e:gpe> international <e:person>Chris Louis</e:person>.
<e:person>John</e:person> rode a weslake for most of his career.</para>
</section>
<section>
<title>Career history</title>
<para><e:person>John</e:person> finished third in the 1975 Speedway World
Championship and was part of the <e:organization>England Speedway World
Cup</e:organization> winning teams of 1972, 1974 and 1975. He was also World Pairs Champion
in 1976 with <e:person>Malcolm Simmons</e:person>. He also captained
<e:gpe>Ipswich</e:gpe> when they were <e:nationality>British</e:nationality> Champions in
1976. <e:person>John</e:person> won the <e:nationality>British</e:nationality> Speedway
Championship in 1975. He was also <e:organization>National League Riders</e:organization>
champion in 1971 and <e:organization>British League Riders</e:organization> champion in
1979.</para>
<para>He retired in 1984 and is now the promoter of <e:organization>Ipswich
Witches</e:organization>.</para>
</section>
</Article>
Slide 35
Entity Enrichment With
MarkLogic Server
1. Rule Based using Built-in function
§ Can leverage a taxonomy to drive entity definition
§ Uses the Content Processing Framework to automate the process
2. Statistical Analysis using built-in Entity Enricher
§ Licensed BASIS for enrichment
§ For automated entity enrichment
3. External Using Partner Network
§ Seamless integration using Open Enrichment Framework
§ Can use a combination of tools (best of breed)
§ Can leverage both internal and external solutions
Three Approaches to Entity Enrichment
Slide 36
Entity Enrichment: Built-in
§ Take an XML node and mark up the entities in that node
§ Substitute $expr for each entity in $node
§ Use any style of markup using $expr plus these variables:
§ $cts:node
§ $cts:text
§ $cts:entity-type
§ Advantage: the most flexible approach
§ Choose your style of markup
§ Choose which parts you want to mark up
§ Choose which entities you want to use/ignore
cts:entity-highlight(
$node as node(),
$expr as item()*
) as node()
Slide 37
Entity Enrichment: Built-in:
Example(2)
cts:entity-highlight(
<a>John went to England</a>,
<entity>{
element {$cts:entity-type} {$cts:text}
}
</entity>
)

Result:
<a>
<entity><PERSON>John</PERSON></entity>
went to
<entity><GPE>England</GPE></entity>
</a>
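The substitution idea behind this example can be sketched outside MarkLogic. The Python below is an illustration only: `entity_highlight`, `wrap`, and the two-entry `ENTITIES` dictionary are hypothetical stand-ins, and a real enricher recognizes entities with a statistical model or taxonomy rather than a lookup table. The caller supplies a function that plays the role of `$expr` and receives the matched text and entity type.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical toy dictionary standing in for a real entity recognizer.
ENTITIES = {"John": "PERSON", "England": "GPE"}


def entity_highlight(text, expr, entities=ENTITIES):
    """Replace each known entity in text with expr(entity_text, entity_type)."""
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    pieces, last = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        pieces.append(expr(match.group(0), entities[match.group(0)]))
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


def wrap(entity_text, entity_type):
    # The caller chooses the markup style, as with $expr on the slide.
    return f"<entity><{entity_type}>{escape(entity_text)}</{entity_type}></entity>"


result = "<a>" + entity_highlight("John went to England", wrap) + "</a>"
print(result)
```

Swapping in a different `wrap` function changes the markup style without touching the matching logic, which is the flexibility the previous slide describes.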
Slide 38
BASIS Enrichment: What Gets
Tagged?
With the built-in entity enrichment, you can tag:
person
organization
location
GPE (geopolitical entity)
facility
religion
nationality
credit card number
email
latitude/longitude
money
percent
ID (personal ID number)
phone number
URL
UTM
date
time
Slide 39
Entity Enrichment Framework
§ You have a choice …
§ There are several Entity Extraction engines available
§ No engine is best-of-breed for all knowledge domains, all
languages
§ The Open Enrichment Framework lets you choose an engine that
suits your needs to extract more domain-specific entities and/or
support additional languages
§ Pipelines available
§ Temis Luxid
§ Open Calais
§ Data Harmony
§ NetOwl
§ Add other pipelines yourself
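One way to picture a pluggable enrichment layer is a registry of interchangeable engines. The sketch below is an assumption-laden illustration in Python: `PIPELINES`, `register`, `enrich`, and the two toy engines are invented names and do not reflect the actual Open Enrichment Framework API or any vendor's engine.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]              # (entity text, entity type)
Engine = Callable[[str], List[Entity]]

# Hypothetical registry: pipeline name -> engine function.
PIPELINES: Dict[str, Engine] = {}


def register(name: str):
    """Decorator that adds an engine to the registry under the given name."""
    def decorator(fn: Engine) -> Engine:
        PIPELINES[name] = fn
        return fn
    return decorator


@register("toy-people")
def toy_people(text: str) -> List[Entity]:
    return [(w, "person") for w in text.split() if w in {"John", "Chris"}]


@register("toy-places")
def toy_places(text: str) -> List[Entity]:
    return [(w, "gpe") for w in text.split() if w in {"England", "Ipswich"}]


def enrich(text: str, engines: List[str]) -> List[Entity]:
    """Run the chosen engines in order and combine their results."""
    found: List[Entity] = []
    for name in engines:
        found.extend(PIPELINES[name](text))
    return found


entities = enrich("John rode for Ipswich", ["toy-people", "toy-places"])
print(entities)
```

Adding another engine is a matter of registering one more function, which mirrors the "add other pipelines yourself" point above.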
Slide 40
Agenda
§ Classification
Slide 41
Classification With MarkLogic
Server
1. Rule Based using Reverse Queries
§ Match documents against a pre-defined rule and automatically tag
content
§ Can use both Forward and Reverse queries for sophisticated
scenarios. We call it Match-making
2. Statistical Classification using built-in SVM Classifier
3. External Using Partner Network
§ Seamless integration using Open Enrichment Framework
§ Can use a combination of tools (best of breed)
§ Can leverage both internal and external solutions
Three Approaches to Classification
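The rule-based approach can be sketched as follows: instead of running one query over many documents, stored rules are matched against each incoming document. This Python is a simplified illustration, not MarkLogic's reverse-query implementation; the `RULES` table and its labels are made up, and each rule here simply requires all of its terms to be present.

```python
import re

# Hypothetical stored rules: label -> terms that must all appear.
RULES = {
    "automotive": {"honda", "transmission"},
    "beverage": {"pepsi"},
    "complaint": {"cracked", "service"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text, rules=RULES):
    """Return the labels of every stored rule the document satisfies."""
    tokens = tokenize(text)
    return sorted(label for label, required in rules.items() if required <= tokens)


tags = classify(
    "Car returned for service. The bolt cracked; "
    "he doesn't think it's the Honda transmission."
)
print(tags)
```

The matching labels can then be written back onto the document as tags, which is the "automatically tag content" step named above.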
Slide 42
Agenda
§ Trend Spotting
Slide 43
Trend Spotting With MarkLogic
Server
1. Co-Occurrences with Frequency Rules
§ Spot trends in Business Entities and their relationship to other
concepts as they bubble up and surface above the noise
§ Use Co-Occurrence Analytical Indexes paired with Alerting to signal
trends and anomalies in real-time
Analytics + Alerting
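A minimal sketch of co-occurrence counting paired with a frequency rule, in plain Python. The function names, the sample documents, and the threshold of 3 are assumptions for illustration; in MarkLogic the counts would come from indexes rather than a loop over documents.

```python
from collections import Counter
from itertools import combinations


def co_occurrences(documents):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in documents:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def alerts(counts, threshold):
    """Frequency rule: report pairs seen together at least threshold times."""
    return sorted(pair for pair, n in counts.items() if n >= threshold)


# Made-up entity lists, one per document.
docs = [
    ["Honda", "transmission"],
    ["Honda", "transmission", "recall"],
    ["Honda", "upgrade"],
    ["Pepsi", "recall"],
    ["Honda", "transmission"],
]
counts = co_occurrences(docs)
trending = alerts(counts, threshold=3)
print(trending)
```

Running the rule whenever new documents arrive is what turns the analytic into an alert.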
Slide 44
Agenda
§ Other Text Analytics
Slide 45
Additional Text Analytics
1. Linking of unstructured information
§ cts:similar-query to find related pieces of information in unstructured
documents
§ External tools for fingerprinting (find loose associations)
2. Query Expansion using Synonyms and Taxonomies
§ Narrow / Broaden Analytics
§ Parent / Child
§ Associative & Equivalent
3. Type-Ahead using Lexicons
§ Support for high-speed distinct values in entire database or in a
segment
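Items 2 and 3 can be illustrated with a small Python sketch. The `THESAURUS` and `LEXICON` contents are invented sample data, and `expand` and `type_ahead` are hypothetical helpers; the sketch shows synonym expansion of a query term and prefix lookup over a sorted list of distinct values, which is the idea behind a lexicon-backed type-ahead.

```python
import bisect

# Hypothetical thesaurus: term -> equivalent terms.
THESAURUS = {"car": ["automobile", "vehicle"], "soda": ["pop", "soft drink"]}


def expand(term, thesaurus=THESAURUS):
    """Broaden a query term with its synonyms (the term itself comes first)."""
    return [term] + thesaurus.get(term, [])


# A lexicon here is just a sorted list of distinct values.
LEXICON = sorted(["honda", "honeywell", "hyundai", "pepsi", "pepsico"])


def type_ahead(prefix, lexicon=LEXICON, limit=10):
    """Return up to limit lexicon values starting with prefix, in sorted order."""
    start = bisect.bisect_left(lexicon, prefix)
    results = []
    for value in lexicon[start:]:
        if not value.startswith(prefix) or len(results) == limit:
            break
        results.append(value)
    return results


print(expand("car"))
print(type_ahead("ho"))
```

Because the values are kept sorted, the lookup jumps straight to the first candidate with a binary search instead of scanning every value.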
Slide 46
Math and Statistical analytical
functions
1. math:variance-p
2. cts:variance-p
3. math:variance
4. cts:variance
5. math:stddev-p
6. cts:stddev-p
7. math:stddev
8. cts:stddev
9. math:covariance-p
10. cts:covariance-p
11. math:covariance
12. cts:covariance
13. math:correlation
14. cts:correlation
15. math:linear-model
16. cts:linear-model
17. math:median
18. cts:median
19. math:percentile
20. cts:percentile
21. math:mode
22. math:rank
23. cts:rank
24. math:percent-rank
25. cts:percent-rank
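Several of these functions come in pairs with and without a -p suffix, which in MarkLogic distinguishes the population statistic from the sample statistic. The Python sketch below shows that distinction using the standard `statistics` module on made-up values; it is an analogy, not MarkLogic code, and it does not reproduce the difference between the math: functions (computed over a sequence) and the cts: functions (understood here to be computed over indexed values).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data

# Population statistics divide by n (compare the -p variants).
population_variance = statistics.pvariance(values)
population_stddev = statistics.pstdev(values)

# Sample statistics divide by n - 1 (compare the variants without -p).
sample_variance = statistics.variance(values)
sample_stddev = statistics.stdev(values)

median = statistics.median(values)
mode = statistics.mode(values)

print(population_variance, sample_variance)
print(median, mode)
```

For small data sets the two variants differ noticeably, so it matters which one a report uses.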
Slide 47
Next Steps
Slide 48
The Only Operational Database Technology for
Mission-Critical Big Data Applications

Mark logic text analytics

  • 1. Slide 1 Copyright © 2012 MarkLogic® Corporation. All rights reserved.Slide 1 Evolving MyBuzzMetrics with Text Analytics September 2012 Eric Austvold – Insights Executive Fernando Mesa – WW Director of Enterprise Solution Pete Aven – Systems Engineer
  • 2. Slide 2 Copyright © 2012 MarkLogic® Corporation. All rights reserved.Slide 2 Agenda § Introductions § NM Incite goals for text analytics § MarkLogic evolving MyBuzzMetrics with Text Analytics § Entity Extraction § Topic Discovery / Theme extraction § Data Faceting § Trend spotting § Visualization § Use Cases and Demos § Next Steps
  • 3. Slide 3 Copyright © 2012 MarkLogic® Corporation. All rights reserved.Slide 3 Goals for Text Analytics § What’s your goal? § What are your clients asking you for? § How do you want to service your internal clients? Analysts, researchers, account managers? § How do you want to service your external clients? Self service reporting? Ad-hoc analysis? Integration with their data? § How do you envision your new solution to complement other Nielsen services?
  • 4. Slide 4 Text Analytics - Evolution. Traditional Methods: § Reliant on relational data structures that are challenging to manage, silos of data § Not indexed immediately, not possible to query in real time § New Parses = Re-ingestion § Re-ingestion = new schema design – creates delays § Not real time – difficult to determine buzz § Impossible at 30+ billion docs § Pre-processing required to handle batches of data § Extraction methods lose context and full perspective. MarkLogic Enabled Methods: § Flexible – Built on an infrastructure that can integrate text mining output § Context Aware – Without schema redesigns, context of original document persists as text miners enrich that content, preserving relationships to the original data § Scales – Can accommodate real time ad-hoc queries and reports across a corpus of 30+ billion documents § Enrichment – a better method of leveraging text mining work
  • 5. Slide 5 The Parse § The Parse § Actor, Action, Object § Fact § Entity § Qualifier § Etc. § Basis Entity Enrichment § Open Enrichment Framework § Calais § Temis § Data Harmony § NetOwl What it means… We can integrate with all enrichment engines.
  • 6. Slide 6 The Platform § The Platform § Flexibility § Speed § Scale § Delivery of Insight What it means… Clients can rapidly deliver insight in real time to help users discover new insights.
  • 7. Slide 7 Copyright © 2009 Mark Logic Corporation. All rights reserved. MarkLogic and Text Analytics Web Services ETL Connector (*) Social Media Connector (*) RDBMS connector Search Unified Index For all data structures Transactional Database Data Retrieval Repository Classification Concept Extraction Entity Enrichment Web Applications Decision Support APIs/Services Taxonomies App Server Third-party Partners Analytics Leverage value generated from text mining Generate Opinions (in the form of data)
  • 8. Slide 8 Traditional Enrichment = Extraction. Source record: First Name: Chris; Last Name: Smith; Other Comments (Data): "Chris bought an upgrade package for his black, 2011, Honda Pilot on 9/16. Car returned for service on 9/21. The bolt on the undercarriage cracked due to heat. He doesn’t think it’s the transmission however as …..". Extracted tables: Actor / Action / Object: Chris / buy / package. Fact: package-buy, car-return, bolt-cracked. Entity / Type: Chris / person, Honda / organization, 9/16 / date. Qualifiers: upgrade, black. More Parsing = More Tables/Rows = More Joins = Does Not Scale! And What About Context?
  • 9. Slide 9 Enrichment with MarkLogic <actor><person>Chris </person></actor><action>bought </action>an <qualifier>upgrade </qualifier><object>package</object> for his <qualifier>black </qualifier>, <qualifier>2011 </qualifier>, <organization>Honda </organization> Pilot on <date>9/16</date>. Car returned for service on <date> 9/21 </date>. The bolt on the undercarriage cracked due to heat. <person @name=“Chris”>He </person> doesn’t think it’s the transmission however as ….. Pepsi<name> </name><brand> </brand><drink> </drink> Markup Inline! Every Tag Becomes a Candidate For an Index! What it means… Enrichment persists context and scales without a schema redesign, saves time and resources as client needs evolve.
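To make the "every tag becomes a candidate for an index" point concrete, here is a minimal Python sketch. It is not MarkLogic code: the `<note>` wrapper, the cleaned-up markup, and the `build_tag_index` helper are assumptions made for illustration. It shows that inline markup lets you collect values per tag while the original sentence remains recoverable in full.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical enriched note, modeled on the slide's example but made
# well-formed (the <note> wrapper is an assumption for this sketch).
ENRICHED = (
    "<note><actor><person>Chris</person></actor> <action>bought</action> an "
    "<qualifier>upgrade</qualifier> <object>package</object> for his "
    "<qualifier>black</qualifier>, <qualifier>2011</qualifier>, "
    "<organization>Honda</organization> Pilot on <date>9/16</date>. "
    "Car returned for service on <date>9/21</date>.</note>"
)

def build_tag_index(xml_text):
    """Return (tag -> list of wrapped values, original plain text)."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for elem in root.iter():
        if elem is root:
            continue  # skip the wrapper element itself
        index[elem.tag].append("".join(elem.itertext()).strip())
    # Joining all text nodes recovers the sentence, so context is not lost.
    return dict(index), "".join(root.itertext())

index, original_text = build_tag_index(ENRICHED)
print(index["qualifier"])
print(original_text)
```

Unlike the extraction tables on the previous slide, nothing here is copied out into separate rows that later need joining: the tags and the surrounding text live in the same document.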
  • 10. Slide 10 Text Mining is part of the big picture Words and phrases ... Semantic Web is a collaborative movement led by the World Wide Web Consortium (W3C) ... Structure Label Author Ing Comp ID Para Org Data/Metadata name:sorbitol date:2012-06-04 company:Roche Entities in Context ... diabetes, since the risk of blindness is very high in such patients... Geospatial <location> <lat>46.946584</lat> <lng>93.076172</lng> </location> Universal Index
  • 11. Slide 11 Demo
  • 12. Slide 12 Agenda § Introductions § Nielsen’s goals and challenges related to unstructured data § MarkLogic Beyond Big Data Search § Entity Extraction § Topic Discovery / Theme extraction § Data Faceting § Trend spotting § Visualization § Use Cases and Demos § Next Steps
  • 13. Slide 15 MarkLogic Analytics § Why Use MarkLogic Analytics? § Term list analytics § Range index analytics § Combining term lists and range indexes § Range index best practices & references
  • 14. Slide 16 MarkLogic Analytics – why use it? Applications increasingly combine structured and unstructured information (e.g., electronic healthcare records) Show me male patients that are under the age of 45 with an ADMITTING DIAGNOSIS that included Chest Pain, or with a HISTORY OF PRESENT ILLNESS including symptoms for Chest Pain, Shortness of Breath, or Dizziness. Additionally, identify patients within this population with regular alcohol consumption in the SOCIAL HISTORY, alcoholism in the FAMILY HISTORY, and one of the following 17 synonyms for stress diagnoses in the ASSESSMENT AND TREATMENT PLAN. Structured Unstructured/Contextual
  • 15. Slide 32 Agenda § Text Enrichment
  • 16. Slide 33 Text Enrichment – with Entities § Load, manipulate, query content as-is § … then enrich the content over time § Entity extraction § Specialized technology § Identifies people, places, things in free text § Entity extraction -> Entity enrichment § Entities are marked-up in-line § Gives you § More focused search (includes proximity, structure) § Analytics § Alerting
  • 17. Slide 34 Enrich Your Content … With Entities: Example <Article xmlns:e="http://marklogic.com/entity"> <title><e:person>John Louis</e:person></title> <acknowledgement><e:gpe>Wikipedia</e:gpe>, the free encyclopedia</acknowledgement> <section> <para>"Tiger" <e:person>John Louis</e:person> (born <e:date>14 June 1941</e:date>)[<refto ID="1">1</refto>] was an <e:gpe>England</e:gpe> international speedway rider who rode for <e:organization>Ipswich Witches</e:organization>. He is the father of <e:gpe>Great Britain</e:gpe> international <e:person>Chris Louis</e:person>. <e:person>John</e:person> rode a weslake for most of his career.</para> </section> <section> <title>Career history</title> <para><e:person>John</e:person> finished third in the 1975 Speedway World Championship and was part of the <e:organization>England Speedway World Cup</e:organization> winning teams of 1972, 1974 and 1975. He was also World Pairs Champion in 1976 with <e:person>Malcolm Simmons</e:person>. He also captained <e:gpe>Ipswich</e:gpe> when they were <e:nationality>British</e:nationality> Champions in 1976. <e:person>John</e:person> won the <e:nationality>British</e:nationality> Speedway Championship in 1975. He was also <e:organization>National League Riders</e:organization> champion in 1971 and <e:organization>British League Riders</e:organization> champion in 1979.</para> <para>He retired in 1984 and is now the promoter of <e:organization>Ipswich Witches</e:organization>.</para> </section>
  • 18. Slide 35 Entity Enrichment With MarkLogic Server: Three Approaches to Entity Enrichment 1. Rule Based using Built-in function § Can leverage a taxonomy to drive entity definition § Uses Content Processing Framework to automate the process 2. Statistical Analysis using built-in Entity Enricher § Licensed BASIS for enrichment § For automated entity enrichment 3. External Using Partner Network § Seamless integration using Open Enrichment Framework § Can use a combination of tools (Best of Breed) § Can leverage both internal and external solutions
  • 19. Slide 36 Entity Enrichment: Built-in. Signature: cts:entity-highlight( $node as node(), $expr as item()* ) as node() § Take an XML node, and markup entities in that node § Substitute $expr for each entity in $node § Use any style of markup using $expr plus these variables: § $cts:node § $cts:text § $cts:entity-type § Advantage: the most flexible § Choose your style of markup § Choose which parts you want to markup § Choose which entities you want to use/ignore
  • 20. Slide 37 Entity Enrichment: Built-in: Example(2). Query: cts:entity-highlight( <a>John went to England</a>, <entity>{ element {$cts:entity-type} {$cts:text} } </entity> ) Result: <a> <entity><PERSON>John</PERSON></entity> went to <entity><GPE>England</GPE></entity> </a>
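The substitution that the slide's example performs can be imitated outside MarkLogic to see what is going on. The Python sketch below is only an emulation under stated assumptions: `GAZETTEER` and `highlight_entities` are hypothetical names, and a fixed lookup table stands in for the linguistic analysis a real entity enricher performs. Each recognized entity is wrapped as `<entity><TYPE>text</TYPE></entity>` and the surrounding text is left where it was.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical gazetteer: surface form -> entity type. A real enricher
# would not rely on a fixed lookup table like this.
GAZETTEER = {"John": "PERSON", "England": "GPE"}

def highlight_entities(tag, text, gazetteer):
    """Build <tag> with every known entity wrapped in <entity><TYPE>."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, gazetteer)) + r")\b")
    root = ET.Element(tag)
    last = None  # most recently appended <entity> element
    pos = 0
    for match in pattern.finditer(text):
        between = text[pos:match.start()]
        if last is None:
            root.text = (root.text or "") + between  # text before first entity
        else:
            last.tail = (last.tail or "") + between  # text between entities
        last = ET.SubElement(root, "entity")
        typed = ET.SubElement(last, gazetteer[match.group(1)])
        typed.text = match.group(1)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        root.text = (root.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return root

result = ET.tostring(
    highlight_entities("a", "John went to England", GAZETTEER),
    encoding="unicode",
)
print(result)
```

Changing what gets built inside the loop changes the markup style, which is the role `$expr` plays in the slide's function signature.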
  • 21. Slide 38 BASIS Enrichment: What Gets Tagged? With the built-in entity enrichment, you can tag: person organization location GPE (geopolitical entity) facility religion nationality credit card number email latitude/longitude money percent ID (personal ID number) phone number URL UTM date time
  • 22. Slide 39 Entity Enrichment Framework § You have a choice … § There are several Entity Extraction engines available § No engine is best-of-breed for all knowledge domains, all languages § The Open Enrichment Framework lets you choose an engine that suits your needs to extract more domain-specific entities and/or support additional languages § Pipelines available § Temis Luxid § Open Calais § Data Harmony § NetOWL § Add other pipelines yourself
  • 23. Slide 40 Agenda § Classification
  • 24. Slide 41 Classification With MarkLogic Server: Three Approaches to Classification 1. Rule Based using Reverse Queries § Match documents against a pre-defined rule and automatically tag content § Can use both Forward and Reverse queries for sophisticated scenarios. We call it Match-making 2. Statistical Classification using built-in SVM Classifier 3. External Using Partner Network § Seamless integration using Open Enrichment Framework § Can use a combination of tools (Best of Breed) § Can leverage both internal and external solutions
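The rule-based approach inverts the usual search direction: the rules are stored, and each incoming document is tested against them. The following Python sketch illustrates that idea only. It does not use MarkLogic's reverse-query machinery, and the `RULES` table, its labels, and the `classify` helper are hypothetical.

```python
import re

# Hypothetical stored rules: label -> terms that must all appear in a document.
RULES = {
    "automotive-service": {"car", "service"},
    "beverage": {"pepsi"},
    "parts-failure": {"bolt", "cracked"},
}

def classify(text, rules):
    """Return the sorted labels of every stored rule the document satisfies."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    # A rule matches when its required terms are a subset of the tokens.
    return sorted(label for label, terms in rules.items() if terms <= tokens)

labels = classify(
    "Car returned for service on 9/21. "
    "The bolt on the undercarriage cracked due to heat.",
    RULES,
)
print(labels)
```

In a production system the matched labels would be written back onto the document as tags, which is what makes the classification automatic as content arrives.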
  • 25. Slide 42 Agenda § Trend Spotting
  • 26. Slide 43 Trend Spotting With MarkLogic Server 1. Co-Occurrences with Frequency Rules § Spot trends in Business Entities and their relationship to other concepts as they bubble up and surface above the noise § Use Co-Occurrence Analytical Indexes paired with Alerting to signal trends and anomalies in real-time Analytics + Alerting
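A small worked example may help show what "co-occurrences with frequency rules" amounts to. This Python sketch assumes each document has already been reduced to a set of extracted entities; the `DOCS` data, the helper names, and the threshold of 3 are invented for illustration, and the real feature works against indexes rather than in-memory sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document entity sets, as entity enrichment might produce.
DOCS = [
    {"Honda", "transmission", "recall"},
    {"Honda", "transmission"},
    {"Honda", "bolt"},
    {"Pepsi", "recall"},
    {"Honda", "transmission", "bolt"},
]

def co_occurrences(docs):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in docs:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

def alerts(counts, min_frequency):
    """Frequency rule: report pairs seen in at least min_frequency documents."""
    return sorted(pair for pair, n in counts.items() if n >= min_frequency)

counts = co_occurrences(DOCS)
trending = alerts(counts, min_frequency=3)
print(trending)
```

Re-running the frequency rule as new documents arrive is the alerting half of the slide: a pair that crosses the threshold is a candidate trend.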
  • 27. Slide 44 Agenda § Other Text Analytics
  • 28. Slide 45 Additional Text Analytics 1. Linking of unstructured Information § CTS:Similar to find related pieces of information in unstructured documents § External Tools for finger-printing (find loose associations) 2. Query Expansion using Synonyms and Taxonomies § Narrow / Broaden Analytics § Parent / Child § Associative & Equivalent 3. Type-Ahead using Lexicons § Support for high-speed distinct values in entire database or in a segment
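Type-ahead over a lexicon works because a lexicon is an ordered list of distinct values, so all values sharing a prefix sit next to each other. The Python sketch below shows that principle with a binary search; the `LEXICON` contents and the `suggest` helper are assumptions for illustration and say nothing about how MarkLogic stores lexicons internally.

```python
from bisect import bisect_left

# Hypothetical lexicon: the sorted distinct values of some field.
LEXICON = sorted(
    {"honda", "honda pilot", "hyundai", "pepsi", "pilot", "transmission"}
)

def suggest(prefix, lexicon, limit=5):
    """Return up to `limit` lexicon values that start with `prefix`."""
    start = bisect_left(lexicon, prefix)  # first value >= prefix
    matches = []
    for value in lexicon[start:]:
        if not value.startswith(prefix) or len(matches) == limit:
            break  # sorted order: no later value can match either
        matches.append(value)
    return matches

print(suggest("ho", LEXICON))
print(suggest("p", LEXICON))
```

The lookup cost depends on the logarithm of the lexicon size plus the number of suggestions returned, which is why distinct-value lists suit interactive suggestion boxes.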
  • 29. Slide 46 Math and Statistical analytical functions 1. math:variance-p 2. cts:variance-p 3. math:variance 4. cts:variance 5. math:stddev-p 6. cts:stddev-p 7. math:stddev 8. cts:stddev 9. math:covariance-p 10. cts:covariance-p 11. math:covariance 12. cts:covariance 13. math:correlation 14. cts:correlation 15. math:linear-model 16. cts:linear-model 17. math:median 18. cts:median 19. math:percentile 20. cts:percentile 21. math:mode 22. math:rank 23. cts:rank 24. math:percent-rank 25. cts:percent-rank
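Several of the functions listed come in pairs with and without a "-p" suffix, where "-p" denotes the population statistic and the unsuffixed name the sample statistic. The Python sketch below uses the standard `statistics` module, not the MarkLogic functions, to show the difference on an invented data set: the population form divides by n and the sample form by n - 1.

```python
import statistics

# Invented sample data: mean is 5, squared deviations sum to 32.
VALUES = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(VALUES)  # 32 / 8
sample_variance = statistics.variance(VALUES)       # 32 / 7
population_stddev = statistics.pstdev(VALUES)       # sqrt(32 / 8)
sample_stddev = statistics.stdev(VALUES)            # sqrt(32 / 7)
median = statistics.median(VALUES)                  # middle of the sorted values

print(population_variance, sample_variance)
print(population_stddev, sample_stddev)
print(median)
```

The sample forms are always slightly larger for the same data, and the gap shrinks as the number of values grows.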
  • 30. Slide 47 Next Steps
  • 31. Slide 48 The Only Operational Database Technology for Mission-Critical Big Data Applications