In this talk, we will build “Choir”, an OSINT (open-source intelligence) project, written in Python, that gathers context-based connections between social profiles using topic-modeling techniques such as LDA. It aims to explain what the world is discussing in a specific domain, what high-ranking influencers in that domain are saying, and what is going on at the margins.
2. Introducing Myself
Sveta Gimpelson
● CDO and Co-Founder at Memphis.dev
● Software engineer
● Excited about networks and graphs
● Playing football on mom’s team
● Harry Potter fan
3. Analyzing Public Conversations of Influencers
I want to know the highlights of what is happening in the football domain.
1. Collect data from social networks
2. Analyze collected data
3. Visualize the output
4. Why Social Media
Source: Global social media statistics research summary 2023
5. Why Social Media
Source: Global social media statistics research summary 2023
6. Social Network Analysis (SNA)
A network is a number of points (or ‘nodes’) that are
connected by links.
Generally in social network analysis, the nodes are people
and the links are any social connection between them –
for example, friendship, marital/family ties, or financial ties.
● follows / retweeted / liked a tweet
● replied to a question / have more than one group in common
7. Social Network Analysis (SNA)
Social network analysis aims to understand a community by
mapping the relationships that connect its members as a network,
and then trying to draw out key individuals, groups within the
network (‘components’), and/or associations between the
individuals.
Source: https://digi.uga.edu/network-graphs/
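The idea of nodes, links, and components can be illustrated with a minimal sketch. It uses only the Python standard library; the account names and the `interactions` list are made up for illustration and are not data from the talk.

```python
from collections import defaultdict

# Hypothetical interaction records: (source account, relation, target account).
interactions = [
    ("alice", "follows", "bob"),
    ("bob", "retweeted", "carol"),
    ("dave", "replied_to", "erin"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def connected_components(graph):
    """Return the components of the graph as a list of sorted node lists."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


graph = build_graph(interactions)
components = connected_components(graph)
print(components)
```

Here every kind of social connection is treated as the same undirected link; a fuller analysis would probably keep the relation type and direction on each edge.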
10. Collect data - Find the influencers
Begin with 10 accounts
1. GOAL – @goal
2. ESPN FC – @ESPNFC
3. FourFourTwo – @FourFourTwo
4. BBC Sport – @BBCSport
5. WhoScored.com – @WhoScored
6. Squawka – @Squawka
7. OptaJoe – @OptaJoe
8. Bleacher Report Football – @brfootball
9. Sky Sports News – @SkySportsNews
10. Transfermarkt – @Transfermarkt
11. Collect data - Find the influencers
Ranking mechanism:
Based on the following parameters
1. Accounts that these accounts follow
2. Use keywords related to football
3. Have many followers
4. Accounts that have been retweeted by our network
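One way the four ranking parameters could be combined is a weighted score, sketched below. The weights, the keyword set, the `Candidate` fields, and the example handles are all assumptions made for illustration; the talk does not specify how the parameters are weighted.

```python
import math
from dataclasses import dataclass

# Hypothetical keyword list; a real one would be curated for the domain.
FOOTBALL_KEYWORDS = {"football", "goal", "transfer", "league"}


@dataclass
class Candidate:
    handle: str
    followed_by_seeds: int      # how many seed accounts follow this account
    bio: str                    # text checked for football keywords
    followers: int              # total follower count
    retweets_from_network: int  # retweets by accounts already in the network


def score(candidate, weights=(3.0, 2.0, 1.0, 2.0)):
    """Combine the four parameters into one number (weights are assumed)."""
    w_follow, w_keyword, w_followers, w_retweet = weights
    words = (word.strip(".,!#") for word in candidate.bio.lower().split())
    keyword_hits = sum(1 for word in words if word in FOOTBALL_KEYWORDS)
    # Log scale so very large accounts do not drown out the other signals.
    followers_term = math.log10(candidate.followers + 1)
    return (
        w_follow * candidate.followed_by_seeds
        + w_keyword * keyword_hits
        + w_followers * followers_term
        + w_retweet * candidate.retweets_from_network
    )


def rank(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("@example_foodie", 1, "Recipes and travel", 999_999, 0),
    Candidate("@example_scout", 4, "Football transfer news", 99_999, 3),
]
ranked = rank(candidates)
print([c.handle for c in ranked])
```

In this sketch the follower count alone is not enough to rank highly: an account also needs ties to the seed accounts and football-related wording.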
28. Enhancements
Collector
● Produce to different partitions based on source/account
● Collect from different sources: Facebook, Reddit, Telegram
● Enforce a schema on the stations
● Use storage tiering to enable batch processing (Apache Iceberg, Spark, ...)
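Producing to a partition based on source or account could look like the sketch below. It is not the Memphis.dev client API; `partition_for` and the partition count are hypothetical, and the sketch only shows how a stable hash keeps all messages for one account in the same partition.

```python
import hashlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a source/account key to a partition index in a stable way.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and so would not be stable across runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical collected messages: (account, text).
messages = [
    ("@goal", "first post"),
    ("@OptaJoe", "a stat"),
    ("@goal", "second post"),
]

NUM_PARTITIONS = 3  # assumed value for the example
by_partition = defaultdict(list)
for account, text in messages:
    by_partition[partition_for(account, NUM_PARTITIONS)].append((account, text))

for partition, batch in sorted(by_partition.items()):
    print(partition, batch)
```

Keeping one account's messages together would let an analyzer consume a single partition and still see that account's posts in order.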
29. Enhancements
Analyzer
● Consume from a specific partition, the entire station, or communities
● Fine-tune the LDA model
● Use other models or combinations of models
● Use GPT to produce more “readable” output