Alyona Medelyan (Pingar), Anna Divoli (Pingar)
presented at Strata O'Reilly Making Data Work Conference on March 1, 2012
The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Text mining and analytics tools have recently become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were used to solve key business challenges.
Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and sharing them intelligently is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store the detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents, categorizes them and generates meaningful metadata on the fly.
In the area of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with recent legislation.
In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to using the valuable information they contain to further medical research. Several approaches to assigning unique encrypted identifiers to patients’ IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
Read a full interview with Alyona and Anna at http://radar.oreilly.com/2012/02/unstructured-data-analysis-tools.html
2. Problem 1
New York, London
How do lawyers scan, file, store & share clients' case documents efficiently?
Images: Ambro / FreeDigitalPhotos.net
3. EHR, EMR, PHR
How do doctors, patients & researchers distribute & share medical records efficiently?
Images: slambo_42@flickr, Anoto AB@flickr
4. Problem 3: The FATCA Legislation
Takes effect 1 January 2013
[Diagram labels: Foreign Financial Institution; custodian bank; U.S. account holders; U.S. ownership entities; with / without IRS agreement; with / without waiver; annual report; 30% withholding tax]
How can a financial institution find U.S. citizens in masses of paperwork efficiently?
5. How much time do we actually spend on …
(average hours / week)
Searching, gathering info: 17
Writing emails: 14
Creating docs: 13
Analyzing info: 10
Reviewing docs: 9
Organizing docs: 7
Creating presentations: 7
Editing images: 6
Entering data: 6
Approving docs: 4
Publishing docs: 4
Translating docs: 1
Translates to annual costs: Search: 17 h / week = $37,000 / year
IDC: Hidden cost of information
6. real life problems
introduction
unstructured data
unstructured data & text analytics
metadata issues in legal domain
healthcare records
compliance in finance
conclusions
7. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
8. unstructured data
Linguistics, Statistics, Text Processing, Machine Learning, Natural Language Processing
Text Mining
Search, Data Extraction, Document Organization, Business Intelligence, Opinion Mining
9. What can one mine from unstructured data?
keywords, tags, sentiment, genre, categories, taxonomy terms, entities (names, patterns, biochemical entities, …)
10. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
11. Structured & unstructured data interplay
People, U.S. politicians, News: News about U.S. politicians
Unique identifiers, structured biological data, literature references, experts’ annotation (free text)
13. Legal document processing pipeline
scan, save, ocr, metadata, dms (New York, London)
Images: Ambro / FreeDigitalPhotos.net
14. Assigning metadata (approximation)
15 docs per day
3 min per doc
0.75 h per day
240 working days per year
$200 hourly charge
$36,000 per year per lawyer
Keyword extraction
0.0027 min per doc
10 min for a year's worth of docs
Image: jacockshaw@flickr
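The approximation above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the slide; the constant and function names are illustrative, not part of the original talk.

```python
# Figures copied from the slide; names are illustrative.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027  # keyword extraction time per document


def manual_hours_per_day():
    """Hours a lawyer spends per day assigning metadata by hand."""
    return DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60


def manual_cost_per_year():
    """Yearly cost of manual metadata assignment at the hourly charge."""
    return manual_hours_per_day() * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def auto_minutes_per_year():
    """Minutes of automated keyword extraction for a year's worth of docs."""
    return DOCS_PER_DAY * WORKING_DAYS_PER_YEAR * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(f"manual: {manual_hours_per_day():.2f} h/day, "
          f"${manual_cost_per_year():,.0f} per year")
    print(f"automated: {auto_minutes_per_year():.2f} min per year")
```

A year's worth of documents is 15 × 240 = 3,600, so automated extraction takes about 9.7 minutes, which the slide rounds to 10.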
15. Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
19. National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR?
Discontinued!
1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …
http://www.nlm.nih.gov/medlineplus/magazine/
PHI
de-identification process
20. Medical researchers use patient records for discoveries…
… records with removed PHI: information from structured fields, but mostly from free text!
AMIA 2012
21. “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”
“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
siliconangle.com/blog/, www.hcpro.com, www.information-age.com
22. PHI: 18 identifiers!
Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #
Medical records #
Health plan beneficiary #
Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
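Some of the identifiers listed above have a regular surface form and can be masked with simple patterns. The sketch below is a minimal illustration under stated assumptions: U.S.-style phone and social security number formats, hypothetical placeholder labels, and no attempt at names, addresses or dates, which typically need entity extraction rather than regular expressions. It is not the method used in the talk.

```python
import re

# Illustrative patterns for a few pattern-like identifiers only.
# Formats are assumptions (U.S.-style numbers); this is not a complete
# de-identification tool.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")),
]


def sanitize(text):
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = ("Contact jane.doe@example.org or call (555) 123-4567; "
            "SSN 123-45-6789, host 192.168.0.1.")
    print(sanitize(note))
```

Replacing identifiers with typed placeholders rather than deleting them keeps the sentence readable for researchers, which matters when most of the useful information sits in free text.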
23. Thanks for discussions:
Nigam Shah, Stanford
Eneida Mendonca, UWisconsin, Madison
Irena Spasic, Cardiff University
keywords, tags
Images: slambo_42@flickr, Anoto AB@flickr
25. The FATCA Legislation
Takes effect 1 January 2013
[Diagram labels: Foreign Financial Institution; custodian bank; U.S. account holders; U.S. ownership entities; with / without IRS agreement; with / without waiver; annual report; 30% withholding tax]
27. Recommended Solution from FATCA Legislation:
• “Query an electronic database using standard queries in programming languages”
• “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”
• “Note that information, data, or files are not electronically searchable if they are stored as images”
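As a rough illustration of the first recommendation, the sketch below runs a standard SQL query against an in-memory SQLite database. The `clients` table, its columns and the three indicators (country of birth, address country, a +1 phone prefix) are assumptions made for this example, not a statement of what the legislation requires; note that a +1 prefix also covers non-U.S. numbers.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory database with hypothetical client records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clients ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "country_of_birth TEXT, address_country TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO clients VALUES (?, ?, ?, ?, ?)",
        [
            (1, "A. Example", "US", "NZ", "+64 9 555 0100"),
            (2, "B. Example", "NZ", "NZ", "+64 9 555 0101"),
            (3, "C. Example", "GB", "GB", "+1 212 555 0102"),
        ],
    )
    return conn


def find_possible_us_clients(conn):
    """Return (id, name) for clients matching any assumed U.S. indicator."""
    query = """
        SELECT id, name FROM clients
        WHERE country_of_birth = 'US'
           OR address_country = 'US'
           OR phone LIKE '+1%'
        ORDER BY id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(find_possible_us_clients(demo_connection()))
```

A query like this only reaches structured fields. Records stored as scanned images stay invisible to it until they are converted to text, which is the limitation the third quoted point describes.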
28. FATCA COMPLIANCE – STEP 2
Contact client for additional info or a waiver
Images: walmink, thomwatson@flickr
29. Actual Solution for the FATCA Legislation:
link analysis: gather the client’s data trail
ocr: convert all images to text
entity extraction: detect locations, bank numbers
analysis: auto-categorize
check: resolve inconsistencies
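The entity extraction step can be sketched on text that has already been through OCR. Everything below is an assumption made for illustration: the regular expressions (a U.S. state abbreviation followed by a ZIP code, a country mention, a compact IBAN-like account number), the `Finding` class and the review rule. A production system would use a text analytics API or trained extractors rather than these patterns.

```python
import re
from dataclasses import dataclass

# Hypothetical indicators; the slide only names "locations, bank numbers".
US_ZIP_ADDRESS = re.compile(r"\b[A-Z]{2} \d{5}(?:-\d{4})?\b")  # e.g. "NY 10118"
US_COUNTRY = re.compile(r"\b(?:United States|USA)\b")
IBAN_LIKE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")  # compact form only


@dataclass
class Finding:
    doc_id: str
    locations: list
    account_numbers: list

    @property
    def needs_review(self):
        # Assumed rule: any U.S. location mention sends the document to review.
        return bool(self.locations)


def extract_entities(doc_id, text):
    """Detect U.S. location mentions and account numbers in OCR'd text."""
    locations = US_ZIP_ADDRESS.findall(text) + US_COUNTRY.findall(text)
    accounts = IBAN_LIKE.findall(text)
    return Finding(doc_id, locations, accounts)


def review_queue(docs):
    """docs maps a document id to its OCR'd text; return flagged findings."""
    findings = [extract_entities(doc_id, text) for doc_id, text in docs.items()]
    return [f for f in findings if f.needs_review]


if __name__ == "__main__":
    docs = {
        "a": "Client resides at 350 Fifth Avenue, New York, NY 10118. "
             "Account GB82WEST12345698765432.",
        "b": "Client resides in Auckland, New Zealand.",
    }
    for finding in review_queue(docs):
        print(finding)
```

Keeping the extracted account numbers alongside the locations gives the later check step something to compare when resolving inconsistencies across a client's documents.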
32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning
Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery
Try out text analytics provided by the Pingar API!
Online demo: apidemo.pingar.com
Free Sandbox account: pingar.com/get-the-api
Editor's Notes
To summarize: In this talk we gave a brief overview of what text analytics is and how powerful it is when dealing with unstructured data. We presented 3 real-world examples where text analytics eliminates boring, error-prone manual labor. In the legal domain, keyword and taxonomy term extraction facilitates automated metadata assignment. Healthcare benefits from automated entity extraction for de-identification (sanitization) and mining useful associations. In the area of compliance & forensics, text analytics helps scan massive amounts of data.
No matter how much further our technology develops, we will always continue to communicate in human language. The amount of unstructured data will only increase. Already there are areas where manual analytics is not sustainable. And there will be even more need for efficient text analytics in the future.