Scaling Training Data
for AI Applications
Ron Schmelzer and Kathleen Walch
Principal Analysts, Cognilytica
Kristin Simonini
VP of Product, Applause
Today’s Speakers
Ron Schmelzer
Principal Analyst, Cognilytica
Kathleen Walch
Principal Analyst, Cognilytica
Kristin Simonini
VP of Product, Applause
Today’s Agenda
• Making AI a Reality
• The Seven Patterns of AI, and What Requires Training Data
• Leveraging a Global Community to Source Training Data
• Real Example of Overcoming the Challenges of a Training Data Sourcing
Project
• Cognilytica is an AI & Cognitive Technology-focused research and
advisory firm.
• Produces market research, advisory, and guidance on AI, ML, and
Cognitive Technology
• Produces the popular AI Today podcast, in addition to infographic
series, whitepapers, webinars, newsletters, and other popular
content.
• Focused on enterprise and public sector adoption of AI, ML, and Cognitive
Technology
• Kathleen Walch and Ron Schmelzer are Principal Analysts and Managing
Partners of Cognilytica
• Contributing writers to Forbes, TechTarget (SearchEnterpriseAI), Cognitive
World, and CTOVision
About Cognilytica
• Data is the heart, soul, juju, of AI
• The specific data you need depends on the business problem you’re
solving and the kinds of predictive or goal outcomes you’re looking for
• Activities for data collection:
• Identifying the required data on which to train
• Identifying all the dimensions required for that data for predictive
value of significance
• Identifying the features that are required
• Identifying the sources of data
• Identifying the means to aggregate that data
• There is no exact answer to the question “How much data is needed?”
Identifying Data Sets for ML: Data Collection
Making AI a Reality
The Seven Patterns of AI
• Machines and humans interacting with each other using natural language
and conversational forms of interaction across a variety of forms of
communication, including voice, text, written, and image forms.
• The objective of this pattern is machines interacting with humans the
way humans interact with each other.
The Conversation & Human Interaction Pattern
• Using ML to identify and understand images, sound, items,
handwriting, faces, and gestures.
• The objective of this pattern is to have machines identify and
understand the real world and unstructured data.
The Recognition Pattern
• Using machine learning and other cognitive
approaches to understand how to take past / existing
behavior and predict future outcomes or help
humans make decisions about future outcomes
using insight learned from past behavior /
interactions / data.
• The objective of this pattern is helping humans make
better decisions
Predictive Analytics & Decision Support
• Machine learning (esp. Deep Learning) is good at
recognizing patterns
• If you can train it, you can detect it
• If you can train it, you can detect patterns… or
things that don’t fit patterns
Pattern & Anomaly Detection
• Physical and virtual (software) systems that are able to
accomplish a task, achieve a goal, interact with their
surroundings, and perform their objective with minimal
or no human involvement.
• The objective of this pattern is minimizing human labor
Autonomous Systems
• In order for Supervised Learning approaches to work, they must
be fed clean, well-labeled data that the system can use to
learn from example.
• But how do you get Labeled Data?
• Do it yourself
• Find a source of already labeled data
• Get your Users to Do it
• Hire a Contractor Workforce
• Contract with Third-Party Data Labeling Firms
Data Labeling: The Achilles Heel of AI
The Data Preparation & Engineering Pipeline
Data Acquisition / Ingest / Capture
• ETL
• Cloud-based data
Merging
• Combining data sources
Cleaning
• Deduping, removing extraneous, bad data
Labeling
• Adding machine learning labels and annotations for training
purposes
Enhancing
• Adding necessary additional data for models
Filtering
• Eliminating bias
Feature Engineering
• Assisting with enhancement (see future on multiplying
data sets)
Retraining Pipelines
• Creation of pipelines to deal with model iteration
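The first few stages above (merging, cleaning, labeling) can be sketched as small composable functions. This is a minimal illustration only: the `Record` type, the stage function names, and the keyword-based labeler are assumptions made for the example, not part of any tool named in this deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    source: str                  # where the record was acquired (hypothetical field)
    text: str                    # raw content
    label: Optional[str] = None  # filled in by the labeling stage


def merge(*sources):
    """Merging: combine several data sources into one list."""
    return [rec for src in sources for rec in src]


def clean(records):
    """Cleaning: drop empty records and duplicates (compared on normalized text)."""
    seen, kept = set(), []
    for rec in records:
        key = rec.text.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def label(records, labeler):
    """Labeling: attach a label produced by a caller-supplied function."""
    return [Record(r.source, r.text, labeler(r.text)) for r in records]


def run_pipeline(sources, labeler):
    """Run merge -> clean -> label in order."""
    return label(clean(merge(*sources)), labeler)


if __name__ == "__main__":
    crm = [Record("crm", "Invoice 42"), Record("crm", "  ")]
    cloud = [Record("cloud", "invoice 42"), Record("cloud", "Hello team")]
    keyword_labeler = lambda t: "invoice" if "invoice" in t.lower() else "other"
    for rec in run_pipeline([crm, cloud], keyword_labeler):
        print(rec)
```

In practice the labeler would be a human annotation step or a labeling service rather than a keyword rule; the point of the sketch is only that each stage takes records in and hands records on, which makes the later retraining pipelines easier to rerun.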
World’s Largest Community Of Vetted Digital Professionals
Available in real-time and selected to represent your customers.
Custom, Vetted Testing
& Feedback Teams
Any demographic, device, and region
to achieve your specific needs
Applause for AI: An End-to-End Solution
[Diagram: a machine-learning algorithm with a “Training Data” side and a
“Testing Output” side.
Data types shown: speech, video, text, questions, handwriting, images.
Testing questions shown: Did it understand me? Did I see or hear what I
expected? Did it respond accurately? Were the recommendations relevant?
Was the information captured correctly? Was it easy to use?]
The Challenge: Sourcing Data for AI
• 81% of executives said training AI with data is more difficult than expected
• Main challenges included biased or erroneous data, not enough data, or
inability to label data.
• 60% of decision makers at firms adopting AI cite data quality as either
“challenging” or “very challenging.” (IDC)
• “Regardless of your beginner or expert AI status, data is the Debbie Downer of
any AI project.” (Forrester)
What we see in the Enterprise:
• You need LOTS of training data: Thousands to tens of
thousands of artifacts: Images, Videos, Documents,
Voice/Dialects
• You need QUALITY data, not just volume: Poor data
results in costly delays to the Product Development
Lifecycle
• You need a DIVERSE, global community of testers:
Gender, Age, Race, and Language are must-haves for today’s
AI applications. You can’t have one individual provide
100s of artifacts; you need 100s of testers who each provide
single artifacts
• You need to be able to rapidly EVOLVE: As product
teams train the algorithm, they often need to change
their sourcing requirements if they are not getting the
expected output.
The Challenge: Sourcing Data for AI
Quantity
Diversity
Quality
How Applause Solves…
Sourcing Quality Data at Scale
A vetted community of over 400,000 testers in 200+ countries
enables Applause to deliver a seamless sourcing solution that includes:
• Quality vs. Volume: We build agreements focused on usable data vs.
simple data collection
• Managed Service: End-to-end program that includes recruitment, quality
control, delivery, and tester training
• Privacy and Security: Seamlessly manages the complex privacy
landscape, including PII, HIPAA, GDPR, and any unique company
confidentiality requirements
• Elastic and Scalable: Unique business model enabling companies to
rapidly support evolving product and business requirements
HOW DOES THIS LOOK IN
PRACTICE?
How to
Source
Training Data
Use Case
Requirement:
Source thousands of real-world handwritten documents
• Blind collection with no PII data
• No one individual could submit more than a single document
• Minimum density required: Words per page
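The three requirements above lend themselves to an automated first-pass check before human QA. The sketch below is illustrative only: the `Submission` fields, the rejection messages, and the 40-words-per-page threshold are assumptions for the example, not figures from the project.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    contributor_id: str   # anonymized ID, so the collection can stay blind
    words_per_page: int   # measured density of the scanned page
    contains_pii: bool    # outcome of a separate redaction check


MIN_WORDS_PER_PAGE = 40   # assumed threshold, chosen only for illustration


def accept_submissions(submissions, min_words=MIN_WORDS_PER_PAGE):
    """Split submissions into accepted ones and (submission, reason) rejections."""
    seen_contributors = set()
    accepted, rejected = [], []
    for sub in submissions:
        if sub.contributor_id in seen_contributors:
            rejected.append((sub, "contributor already submitted a document"))
        elif sub.contains_pii:
            rejected.append((sub, "PII present"))
        elif sub.words_per_page < min_words:
            rejected.append((sub, "below minimum word density"))
        else:
            # Only an accepted document uses up the contributor's single slot.
            seen_contributors.add(sub.contributor_id)
            accepted.append(sub)
    return accepted, rejected
```

A contributor whose first document is rejected can still resubmit here, because only accepted documents count toward the one-per-person limit; a stricter reading of the requirement would record every attempt.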
Challenge: Recruit a High Number of Diverse
Participants
• Training Data required thousands of pages of real handwriting across a variety of
documents and personal artifacts, including (but not limited to):
• Prescriptions/doctors’ notes
• Purchase orders
• Credit applications
• Personal essays and letters
• Driver’s licenses and birth certificates from all 50 states
• Tax forms
• Each handwriting sample had to be unique and could not be replicated across
types or groups
• The Applause service and platform is built to recruit and incentivize thousands of
testers to deliver documents with specific requirements, such as word density and
redaction of all personal information
Challenge: Extremely specific requirements
• On top of unique testers, there was a requirement for unique forms with specific
requirements
• Tax forms required a diversity of types: W-2, pay stubs, IRS
1098-T, IRS 1099-R, IRS 1099-DIV, and others
• Each document had specifications
• No more than one folded margin in the middle
• No deformations on the page
• Minimum number of words per page
• Each document needed to be authentic, but with minimal redactions
• Automation only gets you so far. You need a proven QA and validation process
that is staffed by an experienced team to check multiple requirements and
dependencies
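One piece of this that automation can handle is tracking how many documents of each form type are still needed. The sketch below is a minimal example under stated assumptions: the `QUOTAS` table and its value of 50 per type are placeholders for illustration, and `remaining_quota` is a hypothetical helper, not part of any Applause tooling.

```python
from collections import Counter

# Hypothetical per-type quotas; the numbers are placeholders for illustration.
QUOTAS = {
    "W-2": 50,
    "Pay stub": 50,
    "IRS 1098-T": 50,
    "IRS 1099-R": 50,
    "IRS 1099-DIV": 50,
}


def remaining_quota(collected_types, quotas=QUOTAS):
    """Return how many more documents of each form type are still needed."""
    counts = Counter(collected_types)
    # Over-collection never goes negative; it simply reports zero remaining.
    return {form: max(0, need - counts[form]) for form, need in quotas.items()}
```

Calling `remaining_quota` on the list of form types collected so far shows recruiters where to focus next; the page-level checks (folds, deformations, word counts) still need the human QA step described above.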
Challenge: Meeting Privacy and Confidentiality
requirements
• AI applications that need sourced training data are typically still in
“development,” so the collection process needs to meet stringent confidentiality
requirements
• Privacy laws and policies need to be accounted for across different states, countries,
and regulatory regimes
• The Applause process and service ensures that sourcing can be blind to the testers
to protect confidentiality while also ensuring documents are redacted to account
for all relevant requirements, such as GDPR, HIPAA, and PII protection. This includes
replacing sensitive data with “dummy” data as needed.
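Replacing sensitive values with dummy data can be illustrated on transcribed text. The sketch below is only a rough illustration: it assumes the document text is already available as a string, and the three patterns (US-style SSN, phone number, email address) and their dummy values are assumptions for the example that fall far short of what GDPR or HIPAA compliance requires.

```python
import re

# Assumed patterns for a few US-style identifiers, each paired with a dummy value.
# Real projects need far broader coverage plus human review.
PATTERNS = {
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "000-00-0000"),
    "phone": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "555-555-0100"),
    "email": (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "user@example.com"),
}


def replace_with_dummy(text):
    """Swap each matched identifier for its dummy value; return text and match counts."""
    counts = {}
    for name, (pattern, dummy) in PATTERNS.items():
        text, n = pattern.subn(dummy, text)
        counts[name] = n
    return text, counts
```

The per-pattern counts give a reviewer a quick signal about which documents contained sensitive values in the first place. In the handwriting project described here the testers entered dummy data themselves, so a check like this would serve as a safety net rather than the main control.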
Things to
Consider
• Diversity of testers
• Privacy concerns
• Recruit and train participants
• Ensure quality data
• Execute this at scale
• Evolve as your needs change
Q&A

Editor's Notes

  • #4 Sarah runs through the agenda
  • #13 If possible, mention that Applause can source data
  • #14 Click for animation. Kristin to start from the end of this slide.
  • #15 The size, breadth, and quality of our community is what enables us to deliver immense value to our clients. Our community has several hundred thousand testers. Each member of the community is carefully vetted (profile, NDA, assessments, courses) to make sure feedback is provided in a detailed and concise manner. The community is diverse, with QA professionals, usability experts, and people with no technical background (the average Joe off the street), so you get the right type of feedback. If you need access to someone in England with a certain type of credit card, we can do that. In the past year, the community submitted a million pieces of feedback (bug reports, test cases, completed usability surveys, etc.); that’s over 2,700 a day. We are doing this at scale for the world’s largest brands.
  • #16 On the right-hand side of this slide, you can see the ‘Testing Output’ portion. This is something that Applause has been doing for years and years. In the last couple years, we’ve identified another area that really only Applause, and a globally managed vetted community, can help with, and that’s providing quality training data at scale. So you can see the different types of data that Applause can source from the global community, from handwriting and text to speech and video. We’ll talk a lot more in this presentation about some of the work we’ve done with sourcing handwriting to train an AI algorithm for how to read handwriting.
  • #17 When we’re talking about sourcing data, there are some major challenges out in the market. A lot of organizations start to go down this path, and then realize it’s actually much more challenging than they might’ve realized. Sourcing training data on your own is, to be frank, extremely challenging at best, and possibly outright impossible. You might not have access to the # of people you need. Even if you do have access to that # of people, you need to ensure you’re getting quality data. If you do get quality data, you need a team to annotate and label the data. And even if you do all that, you need to think about diversity and being able to evolve over time. It’s a massive challenge of logistics and overhead.
  • #18 The “Wheel” of challenges includes data quality (bias, errors), lack of quantity, and diversity. You need thousands of artifacts to properly train an AI algorithm. For example, we recently did some work with BBC to train their voice assistant, Beeb, and the algorithm required over 105,000 voice utterances, which Applause provided for BBC. But of course, having a lot of data is pretty much worthless if it’s poor quality. If your data isn’t labeled correctly, or if it’s not in the right format to begin with, it can delay your project, and sometimes it is completely useless, depending on the data type. Diversity is the third element. If you’re building an AI algorithm, you don’t want to rely on a single person to provide the artifacts. That’s not going to lead to a strong AI output. So getting data from not just a lot of sources, but a wide variety of sources, is impactful. And of course, these projects evolve, and you need to be able to evolve with them.
  • #19 So we talk about the challenges; how can Applause solve for them? Our system can produce usable data and follow some pretty strict requirements. Limited overhead: we’re a white-glove service, and our internal teams manage the recruitment of data providers, thoroughly evaluate the data artifacts we get from a source, and train our testers to follow your requirements strictly. And we haven’t mentioned it yet, but privacy and security is a major element to consider. There are compliance laws to consider, such as GDPR. Applause works within those confines to ensure confidentiality while also providing useful data for a customer. And then there is elasticity and scalability: our model can shift as a customer’s requirements change, which can happen quite a bit with AI projects.
  • #20 Let’s look at an example of a real customer, and how Applause sourced training data for them. Here are some of the challenges that come up with this kind of project.
  • #21 Want to share an example. We worked with an organization to build an algorithm that can read handwritten documents. So the idea is that you could scan a handwritten document, and the AI algorithm could read and understand the document. The software scans the form and identifies the keys and values. It detects the form field name. The content is the value; even if it was filled in with a typewriter field, it might not be in the same place on every form. The software needs to understand the difference between the key and the value. So how do you acquire the data that you need to train that algorithm? And what are some of the challenges that come up there? For one, we needed to source documents globally to acquire different handwriting styles, languages, and other critical factors. Example 1 is Amazon.
  • #22 This was a project where the customer was looking for handwriting samples to teach an algorithm to read handwriting. It needed thousands of handwriting samples to work, so again, there’s the quantity aspect coming up. But this couldn’t just be one person submitting 500 or 1,000 samples; that would’ve made this a lot easier to execute. For this project to work, each person could only submit one handwriting sample. In other words, we needed thousands of unique handwriting samples. Since each document had to come from a unique person, Applause had to recruit thousands of people; this is the kind of project that really only an organization like Applause can satisfy. We sourced well over 1,000 folks in our community who were willing to provide handwritten documents. Why is an essay or letter valuable? It’s about handwriting recognition, so Applause asked our community for handwritten essays and letters. We had folks digging through their closets for things someone had written 10 years ago. We were looking for unique handwriting samples. We even had someone once ask for SAT and ACT essays, but obviously this wasn’t something we could provide.
  • #23 In addition to getting a lot of people, Applause was being asked to produce a lot of different types of documents. So you can see, with the tax forms, we needed to provide a lot of different types. We had a team that could manage this and ensure we were bringing in the diversity of documents we needed to help the algorithm. We needed at least 50 W-2 forms, 50 IRS 1098-T forms, etc. And then there are the requirements of the documents themselves, which gets into the “quality” of the data: no deformations on the page, the page can have no more than one folded margin in the middle, and there were several specifications for that. And you’re scanning these documents, so they need to be captured in good lighting conditions, or with the flash in dark settings. Redactions: the testers had to put in their own dummy data to protect PII. There’s a lot of overhead that comes with this project, especially at scale. So having a team that can not only manage this project, but knows what to look for, is really crucial to success.
  • #24 And finally, privacy is a major concern here. We’ve got GDPR and HIPAA that you need to consider. So by giving a company a completed tax document or healthcare form, you could be violating some laws or opening yourself up to a lawsuit. So if you’re trying to do this on your own, there’s a lot of overhead: you can imagine a team of 10+ people having to work around the clock for weeks if not months to remove PII and ensure confidentiality. Here at Applause, we have processes in place where we can protect confidentiality. And in this case, we instructed testers to fill out the forms with dummy data. That way, the organization is still getting the handwriting sample, but there’s no sacrifice of PII.
  • #25 Quantity: Need hundreds, if not thousands, of individuals to make this work. Diversity: Requires many different types of data (geographic, document type, etc.) to properly train the algorithm. Privacy and confidentiality: Need process and dedicated resources to ensure privacy and avoid violating GDPR or exposing PII. Sourcing often needs to be blind and account for the nature of the product. Process and sustainable model: Sourcing training data for AI at scale is a major undertaking; you need a team that is wholly dedicated to delivering on this project.