
Content + Signals: The value of the entire data estate for machine learning


Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty lies not in providing the content itself but in producing the annotations needed to use that content for ML. Transforming content into training data often requires manual human annotation, which is expensive, particularly when the nature of the content requires subject matter experts.

In this talk, I highlight emerging approaches to tackling this challenge through what is known as weak supervision: using other signals to help annotate data. I discuss how content companies often overlook in-house resources that could provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.




  1. Content + Signals: The value of the entire data estate for machine learning. Prof. Paul Groth | @pgroth | pgroth.com | indelab.org. Thanks to Corey Harper, Çağatay Demiralp, Marieke van Erp. ConTech Live 2021
  2. Outline • Where I’m coming from • The Success of Machine Learning • The Need for Data • Reducing (Training) Data Acquisition Costs • Implications + Actions
  3. • A national federation of AI research labs • One ICAI head office • Science Park Amsterdam • Five ICAI locations • Currently: • Amsterdam (2) • Delft • Nijmegen • Utrecht ING AI for Fintech Partnering with Industry
  4. MACHINES CAN READ https://demo.allennlp.org/reading-comprehension
  5. MACHINES CAN READ
  6. gluebenchmark.com
  7. DEEP NEURAL NETWORKS Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, Quoc V. Le: QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. ICLR (Poster) 2018
  8. Source: Sharir, Or, Barak Peleg, and Yoav Shoham. "The Cost of Training NLP Models: A Concise Overview." arXiv preprint arXiv:2004.08900 (2020).
  9. THE NEED FOR DATA Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.
  10. THE NEED FOR ANNOTATED DATA Zhang, Yuhao, et al. "Position-aware attention and supervised data improve slot filling." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
  11. Annotation is Expensive
  12. Reduce the Cost of Data?
  13. Reduce the Cost of Data? Use what you have
  14. Reduce the Cost of Annotated Data http://ai.stanford.edu/blog/weak-supervision/
  15. Transfer Learning Source: Symeonidou, Anthi, Viachaslau Sazonau, and Paul Groth. "Transfer Learning for Biomedical Named Entity Recognition with BioBERT." SEMANTICS Posters&Demos. 2019.
  16. Transfer Learning https://transformer.huggingface.co/doc/arxiv-nlp
  17. Active Learning prodi.gy
  18. Weak Supervision Source: Stephen H. Bach et al. 2019. Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 362-375. DOI: https://doi.org/10.1145/3299869.3314036 https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
  19. The really long tail - smell extraction Ryan Brate, Paul Groth and Marieke van Erp (2020) Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels. LaTeCH-CLfL 2020
  20. Weak Supervision as Data Programming http://ai.stanford.edu/blog/weak-supervision/
  21. Supervision Sources / Signals • Heuristics and rules: e.g. existing human-authored rules about the target domain. • Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain. • Aggregate statistics: e.g. tracked metrics about the target domain. • Knowledge or entity graphs: e.g. databases of facts about the target domain. https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
  22. Multi-modal Data Source: Dunnmon, J. A., Ratner, A. J., Saab, K., Khandwala, N., Markert, M., Sagreiya, H., ... & Ré, C. (2020). Cross-modal data programming enables rapid medical machine learning. Patterns, 100019.
  23. End user data programming Source: Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions. S. Evensen, C. Ge, D. Choi, Ç. Demiralp Findings of EMNLP (Ruler), 2020.
  24. Supervision with Observation Source: Wang, Xin, Nicolas Thome, and Matthieu Cord. "Gaze latent support vector machine for image classification improved by weakly supervised region selection." Pattern Recognition 72 (2017): 59-71.
  25. Implications (Premise → Consequence)
      • Improving ability to use expertise → Expertise is a critical resource
      • Improving ability to use more and different signals → Signal capture becomes imperative
      • Multiple content sources buttress each other → Understand and use the entire data estate
      • Machine learning SOTA is accessible → Problem formulation is fundamental
  26. knowledgescientist.org
  27. Action 1: Make a map Source: Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  28. Action 2: Problems + Expertise https://a16z.com/2019/02/22/humanity-ai-better-together/
  29. Conclusion • Powerful ML models are available today • Data is the essential driver • Don’t overlook your resources: • your content, your expertise, your customer insight Paul Groth | p.groth@uva.nl | @pgroth | pgroth.com | indelab.org
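
The slides on weak supervision as data programming and on supervision sources describe turning in-house signals into labels. The following is a minimal sketch of that idea, not the method of any cited system: each signal is wrapped in a small labeling function that votes or abstains, and the votes are combined by simple majority. The task (flagging sentences that describe a smell), the word lists, and all function names are hypothetical and chosen only for illustration. Production systems such as Snorkel estimate how reliable each source is and weight the votes accordingly, rather than counting them equally.

```python
import re
from collections import Counter

# Label values; ABSTAIN means "this signal has no opinion".
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical stand-ins for in-house resources (assumptions, not real data).
SMELL_LEXICON = {"stench", "aroma", "fragrance", "odour", "scent"}  # e.g. a vocabulary or knowledge graph
VISUAL_VERBS = {"saw", "looked", "watched"}  # e.g. an editorial rule of thumb


def tokens(text):
    """Lowercase alphabetic tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lf_rule(text):
    """Heuristic rule: any form of 'smell' suggests a smell experience."""
    return POSITIVE if "smell" in text.lower() else ABSTAIN


def lf_lexicon(text):
    """Knowledge source: a term from the lexicon suggests a smell experience."""
    return POSITIVE if tokens(text) & SMELL_LEXICON else ABSTAIN


def lf_visual(text):
    """Heuristic rule: a visual verb suggests the sentence is not about smell."""
    return NEGATIVE if tokens(text) & VISUAL_VERBS else ABSTAIN


LABELING_FUNCTIONS = [lf_rule, lf_lexicon, lf_visual]


def majority_vote(votes):
    """Combine votes; abstain when no signal fires or the top labels tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(text):
    """Apply every labeling function to one sentence and combine the votes."""
    return majority_vote([lf(text) for lf in LABELING_FUNCTIONS])
```

Sentences that receive a label can be used as noisy training data, while sentences on which the signals abstain or disagree are natural candidates for expert annotation.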

Editor's Notes

  • 330K images (>200K labeled)
    1.5 million object instances
