AI and investigative journalism
Josh Nicholas
Data journalist
The Guardian
Agenda
● Introduction
● What is AI
○ Different forms
○ More than a black box
● Case studies
○ Extracting useful info from text
○ Fuzzy matching between datasets
○ Finding a needle in a haystack
● Homework
● Q + A
Code for all examples is on my GitHub
More resources in HANDOUT
After the session:
● Recording
● Handout
● Homework in our LinkedIn Group
● LINK to join
What is AI?
1
Artificial intelligence is a catch-all
● Many AI terms are used interchangeably
● We are going to focus on machine learning models
● These are algorithms that can learn their own rules from data
This graphic was adapted from Build a Large Language Model by Sebastian Raschka
What are ‘rules’?
Learning from the data
● Machines are great at identifying patterns that aren’t obvious to humans
● Given some examples to learn from, an algorithm can find more
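To make "learning rules from data" concrete, here is a minimal sketch of a word-count (naive Bayes) classifier in plain Python. It is not from the talk: the example messages, the labels and the function names are all hypothetical. Nobody writes a rule such as "messages containing 'free' are spam"; the word counts collected from the examples act as the rules.

```python
import math
from collections import Counter

# Hypothetical labelled examples; a real project would need far more.
EXAMPLES = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to friday", "ham"),
    ("draft of the budget story", "ham"),
    ("can you check this quote", "ham"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(EXAMPLES)
print(predict("free prize inside", word_counts, label_counts))
print(predict("budget meeting on friday", word_counts, label_counts))
```

Changing the examples changes the behaviour without touching the code, which is the practical difference from hand-written rules.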
AI and newsgathering
● Machine-learning algorithms are trained on large datasets
○ They can be fine-tuned on smaller datasets
● They are useful for “fuzzy” problems, when it’s hard to write explicit
rules/instructions
● You can access many pre-trained algorithms for free e.g.
○ Huggingface.co
○ Google, OpenAI, Mistral, Facebook etc.
and…
● If we can’t find an algorithm that fits our purpose, we can fine-tune an existing one
Examples we can borrow from
• Email spam filters
• Recommendation systems (Netflix, Spotify etc.)
• Language translation
• Audio transcription
• Facial recognition
• Object detection
• Predictive text
• Search engines
■ Google BERT etc.
Case studies
2
1) Extraction
The problem:
● Extracting names, locations and
dollar amounts from thousands of
text documents:
○ 34k+ Facebook posts
○ 2.4k media releases
● What if we don’t know the names
they’ll use?
● What if they say something vague like “a million for x”?
● We scraped thousands of Facebook
posts and media releases from official
websites
● We used a pre-trained model from spaCy, a common Python library
● The model identified names, locations
and references to money in the texts
● Since 2022 these tools have become even easier to use
● You can also achieve similar results with GenAI tools like ChatGPT
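A rough sketch of what this kind of extraction can look like with spaCy. The slides do not say which model was used, so `en_core_web_sm`, the set of labels and both function names are assumptions made for illustration. The spaCy import sits inside the function so the grouping helper can be tried without installing anything.

```python
from collections import defaultdict

# Entity labels of interest. These names come from spaCy's English models;
# choosing exactly these four is an assumption for this sketch.
WANTED = {"PERSON", "GPE", "LOC", "MONEY"}

def extract_entities(texts, model_name="en_core_web_sm"):
    """Yield (entity text, label) pairs for each wanted entity spaCy finds.

    Assumes spaCy and the named model are installed.
    """
    import spacy
    nlp = spacy.load(model_name)
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ in WANTED:
                yield ent.text, ent.label_

def group_by_label(pairs):
    """Collect entity strings under their label, dropping exact repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# Hand-written pairs standing in for model output, so this runs without spaCy.
sample = [
    ("Josh Nicholas", "PERSON"),
    ("$1 million", "MONEY"),
    ("Josh Nicholas", "PERSON"),
]
print(group_by_label(sample))
```

With spaCy installed, `group_by_label(extract_entities(posts))` would run the same grouping over real text, where `posts` is a list of strings.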
2) Fuzzy matching
The problem:
● We need to connect datasets that are
slightly different
○ Josh Nicholas vs Joshua Nicholas
● Previously we used a method called
Levenshtein Distance
○ Matching every name against every
other name
○ It took ages!!
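For reference, a minimal sketch of Levenshtein distance and the all-pairs matching described above. The function names are illustrative, not the code used in the project. The second function shows why the approach is slow: every name on one side is compared with every name on the other.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def best_matches(left, right):
    """Match each name in `left` to its closest name in `right`.

    This makes len(left) * len(right) distance calculations.
    """
    return {
        name: min(right, key=lambda other: levenshtein(name, other))
        for name in left
    }

print(levenshtein("Josh Nicholas", "Joshua Nicholas"))
print(best_matches(["Josh Nicholas"], ["Jane Nicholson", "Joshua Nicholas"]))
```

Two lists of 10,000 names each would need 100 million comparisons, which is consistent with the "it took ages" experience.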
Making use of the AI ecosystem
● When you input text into a chatbot it
turns the text into a series of numbers
● We can use this same technique to
match names
• Find the numbers that are most
similar
● This same technique can be scaled to
full sentences or even entire documents
● It can also be run in reverse: which things are least similar?
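The idea of "finding the numbers that are most similar" can be sketched with cosine similarity. In practice the vectors would come from an embedding model. Here, as a loudly labelled stand-in, each name is turned into counts of three-letter chunks so the example runs with only the standard library. All function names are hypothetical.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Stand-in embedding: counts of character n-grams (NOT a learned model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(query, candidates):
    """Return the candidate whose vector is closest to the query's."""
    q = embed(query)
    return max(candidates, key=lambda c: cosine(q, embed(c)))

def least_similar(query, candidates):
    """The same search run in reverse: the candidate furthest from the query."""
    q = embed(query)
    return min(candidates, key=lambda c: cosine(q, embed(c)))

names = ["Joshua Nicholas", "Jane Nicholson", "Peter Smith"]
print(most_similar("Josh Nicholas", names))
print(least_similar("Josh Nicholas", names))
```

Swapping `embed` for a call to a real embedding model would leave the rest unchanged. Each text is converted to numbers once, and those vectors can be stored and reused for sentences or whole documents.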
3) Finding a needle in a haystack
The problem:
● Who poses most with dogs, babies,
hi vis etc.?
● We need to search through
thousands of images, many of them
not captioned
Training a detection model
● There are loads of models that are immediately useful
• E.g. ones for workplace safety that can identify hard hats etc.
• Also lots of free datasets online
● We manually created a training dataset with novelty cheques and hi vis vests
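A hedged sketch of how an off-the-shelf detector could be pointed at a folder of images and the results tallied. The model name `facebook/detr-resnet-50` is an illustrative choice, not the model from the talk, and a general-purpose model like this would not recognise novelty cheques or hi vis vests without the extra training described above. Function names and the score threshold are assumptions.

```python
from collections import Counter

def detect_objects(image_paths, model_name="facebook/detr-resnet-50"):
    """Run a pre-trained object detector over a list of image files.

    Assumes the `transformers` package and the model's own dependencies
    are installed; the import is kept here so the tally helper works alone.
    """
    from transformers import pipeline
    detector = pipeline("object-detection", model=model_name)
    # Each result is a list of dicts with "label", "score" and "box" keys.
    return {path: detector(path) for path in image_paths}

def tally_labels(detections_by_image, min_score=0.8):
    """Count, per label, the images with at least one confident detection."""
    counts = Counter()
    for detections in detections_by_image.values():
        labels = {d["label"] for d in detections if d["score"] >= min_score}
        counts.update(labels)
    return counts

# Hand-written detections standing in for model output.
fake_results = {
    "photo_1.jpg": [{"label": "dog", "score": 0.97}, {"label": "person", "score": 0.99}],
    "photo_2.jpg": [{"label": "dog", "score": 0.41}, {"label": "person", "score": 0.95}],
}
print(tally_labels(fake_results))
```

Grouping the tallies by whose account each image came from would answer the "who poses most with dogs" style of question.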
Quick summary
● Machine learning models can learn their own rules from the patterns in data
● This helps us when we need to work with fuzzier/unlabelled data
○ Images, entire documents etc.
● There are thousands of models available for free online
● We can fine-tune them for specific tasks if necessary
● They can be run directly or built into interfaces for common problems
● GenAI tools can often do the same tasks, but they are harder to scale
Homework
● Homework 1 (if you can code):
○ Open the Hugging Face MODELS tab and choose a model that would solve an editorial problem for you
○ Try out the tool and share your results in the LinkedIn Group
■ Why/what did you choose?
● Homework 2 (if you can't code yet):
○ Open the Hugging Face SPACES tab and choose one of the tools
○ Give it a prompt and share your results in the LinkedIn Group
■ Why/what did you choose?
● How would this help in a journalism context?
How homework works
1. Join the Closed LinkedIn Group
2. Post your work for trainer feedback within 4 weeks
3. Leave constructive feedback on at least one other person’s post within 2 weeks
4. Follow the Group Rules!
Any questions?
Josh Nicholas
Data journalist
The Guardian
josh.nicholas@theguardian.com
Thank you!

Webinar 3 - AI & Investigative Journalism - Training Slidedeck
