Supporting AI: Best Practices for
Content Delivery Platforms
Jake Zarnegar
Chief Business Development Officer, Silverchair
Orientation to Today’s Talks
Our Work in Context
• The Big Loop: Research and Education. Understanding and improving our world.
• The Little Loop: Publishing. Understanding and improving research communication and education effectiveness.
We play a key role in multiple learning and improvement loops: Hypothesize → Do something → Generate data → Analyze/Interpret data → Incorporate learning.
Embracing the New is Not Optional
• It is our responsibility to identify and apply promising new technology advancements to the “loops” we serve.
• Examples:
• Internet content distribution (!)
• Streaming video
• Dataset/computational code sharing
• Adaptive learning algorithms
• Emerging TDM and AI content analysis techniques
But… Caution is Warranted
• New technologies take a while to be “fit to purpose” in particular problem areas
• Hat tip to Flavius Chis, my Silverchair colleague (who is a Romanian living in America), who used this chart to help me understand our fundamental difference in orientation to new things.
Much like “the cloud,” “big data,” and “machine learning” before it, the term
“artificial intelligence” has been hijacked by marketers and advertising copywriters.
If the hype leaves you asking “What is A.I., really?,” don’t worry, you’re not alone. I
asked various experts to define the term and got different answers. The only thing
they all seem to agree on is that artificial intelligence is a set of technologies that
try to imitate or augment human intelligence.
To me, the emphasis is on augmentation, in which intelligent software helps us
interact and deal with the increasingly digital world we live in.
Om Malik, The New Yorker
http://www.newyorker.com/business/currency/the-hype-and-hope-of-artificial-intelligence
How Do AI Tools Augment Human Intelligence?
Bloom’s taxonomy of cognitive levels (Bloom et al., 1956)
• Software is far superior to human experts at this level of cognition
• Does not require AI methods
TDM/AI Methods Are New Tools in an Expert’s Toolbox
Knowledge Discovery in Databases:
• Traditional knowledge discovery: “Insights are surfaced by highly-educated human specialists finding, reading, and synthesizing the right few pieces of information”
• TDM/AI-assisted knowledge discovery: “Insights are surfaced by specialist-directed algorithms that perform higher-order cognitive analyses on thousands, millions, or billions of pieces of information”
So We Can Just Sit Back and Watch the Magic Happen, Right?
http://news.mit.edu/2020/artificial-intelligence-identifies-new-antibiotic-0220
Not Quite…
• Online systems were initially designed to help human experts with traditional knowledge discovery:
• Filtering/finding relevant content (bring it directly to them or the online venues where they work)
• Assessing it quickly (visual display of titles, abstracts, tables/charts, conclusions)
• Understanding it deeply (full narrative “story” of the research, visuals, audio/video supplementation, citations to related research, PDFs to take away and read later, etc.)
New AI Tools = New Requirements for Content Platforms
To enable researchers (and publishers) to get the most value out of emerging TDM/AI tools, we need to add and consistently evolve specifically-designed content and data interfaces to our online content platforms.
Two Major Datasets
• Content: research papers, datasets, news, whitepapers, education materials, courseware, etc. (the “Big Loop”)
• Interactions / usage activity: data generated from activity on the platform, via myriad basic and complex measures (the “Little Loop”); see the sketch below
A modern online platform is a streaming content and analytics server.
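As a rough illustration of the second dataset, a usage-activity event might look like the sketch below. Every field name is hypothetical, for illustration only, not a Silverchair (or any platform's) schema.

```python
# Illustrative only: a minimal usage-activity event as a platform might
# stream it alongside content. All field names are hypothetical.
usage_event = {
    "timestamp": "2020-02-20T14:05:00Z",
    "session_id": "a1b2c3d4",
    "content_id": "10.1234/example.2020.001",  # e.g., the DOI that was viewed
    "action": "fulltext_view",                 # abstract_view, pdf_download, search, ...
    "referrer": "site_search",
}
```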
But how do TDM/AI projects actually work?
Knowledge Discovery in Databases (KDD) [1] is divided in four main phases: domain
exploration, data preparation, data mining, and interpretation of results.
1. The first phase is responsible for understanding the problem and what data will be
used in the knowledge discovery process.
2. The next phase selects, cleans, and transforms the data to a format that is
suitable for a specific data mining algorithm.
3. In the third phase, the chosen data mining algorithm performs some intelligent
techniques to discover patterns that can be of potential use.
4. The last phase is responsible for manipulating the extracted patterns to generate
interpretable knowledge for humans…
…Most of the research carried out in this area focus on the data mining phase, which
uses artificial intelligence algorithms like decision trees, artificial neural networks,
evolutionary computation, among others [2] to discover knowledge. On the other
hand, the data preparation phase, responsible for integration, cleaning, and
transformation of data, has not been the subject of much research. In fact, Pyle [3]
argues that “data preparation consumes 60 to 90% of the time needed to mine data
– and contributes 75 to 90% to the mining project’s success.”
From: Paulo M. Goncalves Jr. and Roberto S. M. Barros, "Automating Data Preprocessing with DMPML and
KDDML," 10th IEEE/ACIS International Conference on Computer and Information Science, 2011, DOI:
10.1109/ICIS.2011.23.
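To make the four phases concrete, here is a minimal, self-contained toy sketch: a handful of stand-in documents play the role of the domain data, TF-IDF vectorization stands in for data preparation, k-means clustering is the data mining step, and printing the top terms per cluster is a crude interpretation step. This is only an illustration of the KDD shape, not the method from the paper cited above.

```python
# Toy walk-through of the four KDD phases using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Phase 1 (domain exploration): a stand-in corpus; in practice the problem
# definition tells you which content to pull.
docs = [
    "antibiotic resistance in gram-negative bacteria",
    "novel antibiotic discovered by a neural network",
    "classroom outcomes improve with adaptive learning",
    "adaptive learning algorithms personalize coursework",
]

# Phase 2 (data preparation): clean and transform text into numeric features.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)

# Phase 3 (data mining): cluster the documents.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Phase 4 (interpretation): surface the top terms per cluster for a human.
terms = vectorizer.get_feature_names_out()
for cluster in range(2):
    top = km.cluster_centers_[cluster].argsort()[::-1][:3]
    print(f"Cluster {cluster}: {[terms[i] for i in top]}")
```

In a real project, the point of the quoted passage holds: phase 2 (gathering, cleaning, and normalizing the corpus) dwarfs the few lines shown here for phases 3 and 4.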
“I downloaded 2TB of Arxiv content last week but I can’t bring myself
to open it and start working on analyzing it because I know I have at
least 6 months of painstaking content cleanup & preparation ahead of
me before I can begin.”
--Mike M., Data Scientist, Fast Forward Labs
5 Platform Best Practices to Support AI Projects
1. Accept That You Don’t Have Everything They Need
• Most projects select content from multiple sources and add/mix it with internal proprietary data
• Even Facebook ingests third-party data to supplement their massive consumer behavior database. (You’re not Facebook.)
• If your data is too hard to extract/work with, you may be bypassed for easier alternatives
• Adopting this humble orientation will help you think less about your preferences for format and delivery, and more about shared standards
2. Provide Consistent Content/Data Tagging
• Publishing (especially research publishing) has a huge head start here, with shared XML standards and many universal data structures
• Note that JSON is preferred by many AI content analysis tools, and you may want to offer that format as well (see the sketch below)
• It is best if all of your content (including historical) is delivered in one consistent format with consistent use of tagging structures: the title, authors, abstract, where the conclusions are in the paper, etc.
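As a sketch of what consistent tagging buys you, the example below normalizes a JATS-style XML fragment into JSON using only the Python standard library. The element names follow common JATS conventions, but the exact paths would need to match your own DTD, and the output fields are illustrative rather than any standard schema.

```python
import json
import xml.etree.ElementTree as ET

# A small JATS-style fragment; real articles are much richer, but the
# tagging pattern (article-id, title-group, contrib-group, abstract) is the same.
jats = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/example.2020.001</article-id>
      <title-group><article-title>An Example Article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>J.</given-names></name>
        </contrib>
      </contrib-group>
      <abstract><p>Short abstract text.</p></abstract>
    </article-meta>
  </front>
</article>
"""

meta = ET.fromstring(jats).find("./front/article-meta")

# Normalize into a flat JSON record that downstream tools can ingest directly.
record = {
    "doi": meta.findtext("article-id[@pub-id-type='doi']"),
    "title": meta.findtext("title-group/article-title"),
    "authors": [
        " ".join(filter(None, [c.findtext("name/given-names"),
                               c.findtext("name/surname")]))
        for c in meta.findall("contrib-group/contrib[@contrib-type='author']")
    ],
    "abstract": " ".join(p.text or "" for p in meta.findall("abstract/p")),
}

print(json.dumps(record, indent=2))
```

If every article on the platform, back catalog included, follows the same tagging, this kind of conversion is a one-time script rather than a per-title cleanup project.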
3. Provide a Programmatic Interface
• Just letting bots crawl your site is not a programmatic interface. Web page coding (primarily HTML, JS, CSS) is heavyweight, built for visual layout and interactions (the traditional use case), although it can be enhanced with embedded metadata.
• Best practice is to produce a separate programmatic interface that is more lightweight and designed for concurrent, high-volume transactions (see the sketch below).
Of the Silverchair Platform’s last 500 million sessions, 380 million were programmatic.
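A minimal sketch of what such an interface could look like, written with Flask and a hypothetical route and payload; this is not Silverchair's or any specific platform's API.

```python
# Minimal sketch of a lightweight, machine-facing content endpoint.
# Route, fields, and store are hypothetical, for illustration only.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stand-in for the platform's content store.
ARTICLES = {
    "10.1234/example.2020.001": {
        "doi": "10.1234/example.2020.001",
        "title": "An Example Article",
        "abstract": "Short abstract text.",
    }
}

@app.route("/api/v1/articles/<path:doi>")
def get_article(doi):
    """Return clean JSON metadata for one article, with no layout markup."""
    article = ARTICLES.get(doi)
    if article is None:
        abort(404)
    return jsonify(article)

if __name__ == "__main__":
    app.run()
```

A client gets structured JSON with one GET request instead of parsing rendered HTML, and the endpoint can be cached, metered, and rate-limited independently of the human-facing reading site.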
API Examples to Examine
• Commercial Publisher Platforms: https://www.elsevier.com/solutions/sciencedirect/support/api
• Aggregators: https://www.dimensions.ai/dimensions-apis/
• Independent Platforms: https://www.silverchair.com/news/silverchair-releases-analysis-and-content-service/
• Rights/Licensing Organizations: https://www.copyright.com/business/xmlformining/
4. Embrace Takeout & Delivery (vs. Eat-In)
• Per our research and interviews, data scientists overwhelmingly want to take large amounts of content away and ingest it into their own systems, rather than use your platform as a transactional dependency (see the export sketch below)
• Some APIs limit the amount that can be downloaded/taken away at one time, applying old-school limits to the new techniques
• However, protection/vetting is necessary; this is a new type of relationship
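A sketch of the "takeout" pattern from the consumer's side: page through a hypothetical bulk-export endpoint with an authorization token and write everything to a local JSONL file for offline work. The endpoint, parameter names, and response fields are all assumptions for illustration.

```python
import json
import requests

# Hypothetical bulk-export endpoint; illustration only.
EXPORT_URL = "https://platform.example.org/api/v1/export"

def takeout(out_path, api_key):
    """Page through the export endpoint and write records to a JSONL file."""
    cursor = None
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            params = {"cursor": cursor} if cursor else {}
            resp = requests.get(
                EXPORT_URL,
                params=params,
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=60,
            )
            resp.raise_for_status()
            page = resp.json()
            for record in page["records"]:
                out.write(json.dumps(record) + "\n")
            cursor = page.get("next_cursor")
            if not cursor:  # no further pages
                break
```

Cursor-based paging lets the platform meter and log the export (the vetting side of the relationship) without forcing the data scientist to screen-scrape or stay tethered to the site.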
5. Enable Complex Selection & Filtering
• In a seeming paradox, “some” of your data can be more valuable than “all” of your data
• If you have data hooks and API tools for complex selection and filtering, the AI or data scientist won’t need to develop their own tools for the project (see the query sketch below)
• Recently a licensing agent told me they signed a deal worth more for a specifically narrowed portion of a dataset than they had quoted for the whole dataset (not just more per object; more revenue overall!)
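A companion sketch of server-side selection: the platform exposes filters (subject, date range, article type, field selection) so the requester pulls exactly the slice their project needs. Again, the endpoint and parameter names are hypothetical.

```python
import requests

# Hypothetical search endpoint; the filter parameters are the point here.
resp = requests.get(
    "https://platform.example.org/api/v1/search",
    params={
        "subject": "antimicrobial resistance",
        "article_type": "research-article",
        "published_from": "2015-01-01",
        "published_to": "2019-12-31",
        "fields": "doi,title,abstract",  # return only what the project needs
        "page_size": 100,
    },
    timeout=60,
)
resp.raise_for_status()
subset = resp.json()["results"]
print(f"{len(subset)} records in this page of the filtered slice")
```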
New Business Possibilities
Remember the Business Need
“…data preparation consumes 60 to 90% of the time needed to mine data – and contributes 75 to 90% to the mining project’s success”
From: Paulo M. Goncalves Jr. and Roberto S. M. Barros, "Automating Data Preprocessing with DMPML and KDDML," 10th IEEE/ACIS International Conference on Computer and Information Science, 2011, DOI: 10.1109/ICIS.2011.23.
Consider Feeding AI Tools as a New Product
• Full-text normalized JSON (or XML); see the record sketch below
• Separate product/subscription for sale
• Enrich it further for complex filtering/selection, or sell it “as is”
• Separate delivery mechanism (no human interface), but it can piggyback on existing content workflows
• Accesses a new class of customer with deep pockets (AI creators or implementers)
• Requires new vetting/legal agreements
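One possible shape for such a product record: normalized full text plus enrichment fields that support the complex filtering described above. Every field name here is illustrative, not a published schema.

```python
# Illustrative record for a "full-text normalized JSON" product.
# Field names are hypothetical; enrichment fields (subjects, entities)
# are what enable complex selection and filtering downstream.
article_record = {
    "doi": "10.1234/example.2020.001",
    "type": "research-article",
    "title": "An Example Article",
    "authors": ["J. Doe", "A. Smith"],
    "published": "2020-02-20",
    "abstract": "Short abstract text.",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        {"heading": "Results", "text": "..."},
        {"heading": "Conclusions", "text": "..."},
    ],
    "references": ["10.5678/cited.2018.042"],
    "enrichment": {
        "subjects": ["microbiology", "machine learning"],
        "entities": ["Escherichia coli", "deep learning"],
    },
}
```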
Or Create Your Own New Services Using AI
• Deliver new on-site content analysis tools for end users using a combination of semantics and new text mining/AI tools, many of which are free to use from Google/Amazon/Microsoft/others (see the sketch below)
• Experiments, experiments, experiments (recommendation, auto-summarization, analogues, image analysis, sentiment, prediction, etc.)
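As one hedged example of wiring in a hosted NLP service, the sketch below sends an abstract to AWS Comprehend via boto3 for key phrases and sentiment; Google Cloud Natural Language or Azure Text Analytics clients follow the same call-and-parse pattern. It assumes AWS credentials are already configured in the environment, and the abstract text is made up.

```python
import boto3

# Hosted NLP enrichment sketch; results could feed facets, tags, or
# recommendation experiments on the platform.
comprehend = boto3.client("comprehend", region_name="us-east-1")

abstract = "A deep learning model identified a promising new antibiotic candidate."

key_phrases = comprehend.detect_key_phrases(Text=abstract, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=abstract, LanguageCode="en")

print([p["Text"] for p in key_phrases["KeyPhrases"]])  # candidate facets/tags
print(sentiment["Sentiment"])  # POSITIVE / NEGATIVE / NEUTRAL / MIXED
```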
Thank You
Jake Zarnegar
Chief Business Development Officer, Silverchair


Editor's Notes

• #14: Weaknesses: over-filtering, interdisciplinary connections, scalability