Insight presentation
1. Ranking User Influence in the Reddit Social Network
Adam James Costarino | Insight Data Science
2. Influence and Similarity Search
Motivations
Performing analysis on a social network composed of over 100 million distinct users and 1.2 billion edges.
Product
Challenges
Data Size: 2.5 terabytes
Technologies: Hadoop, Spark, Parquet, Cassandra
3. User Graph
Username : ‘adam’
Subreddits : [‘AskReddit’,‘history’…
Id : 30n42a
Username : ‘james’
Subreddits : [‘AskReddit’, ‘politics’ …
Link_Id : t3_30n42a
Edge Orientation : Commenter → Poster
2017 Data Size:
40 million interactions per month
Over 4 million distinct users in monthly graph
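A minimal sketch of how the commenter → poster edges could be derived, assuming each post carries an Id and each comment carries a Link_Id equal to the post Id with a t3_ prefix, as in the example above. The names UserGraphSketch, Edge, and buildEdges are hypothetical, and plain Java collections stand in for the Spark job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class UserGraphSketch {
    /** Directed edge: the commenter points at the author of the post. */
    static final class Edge {
        final String commenter;
        final String poster;

        Edge(String commenter, String poster) {
            this.commenter = commenter;
            this.poster = poster;
        }

        @Override
        public String toString() {
            return commenter + " -> " + poster;
        }
    }

    /**
     * posterByPostId maps a post Id to the poster's username.
     * Each comment is a pair {commenter username, Link_Id}.
     */
    static List<Edge> buildEdges(Map<String, String> posterByPostId, List<String[]> comments) {
        List<Edge> edges = new ArrayList<>();
        for (String[] comment : comments) {
            String linkId = comment[1];
            // Assumption: Link_Id is the post Id with a "t3_" prefix.
            String postId = linkId.startsWith("t3_") ? linkId.substring(3) : linkId;
            String poster = posterByPostId.get(postId);
            if (poster != null) { // skip comments whose post is not in the data
                edges.add(new Edge(comment[0], poster));
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<String, String> posts = Map.of("30n42a", "adam");
        List<String[]> comments = new ArrayList<>();
        comments.add(new String[] {"james", "t3_30n42a"});
        System.out.println(buildEdges(posts, comments)); // prints [james -> adam]
    }
}
```

Comments whose post is missing from the input are dropped here rather than producing an edge; whether the original pipeline did the same is not stated on the slide.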
4. Subreddit Graph
Edge Weight : [Some Long]
There is no directionality between subreddit connections.
Edge weights are the number of intersecting users.
The Subreddit Graph is represented with an adjacency matrix.
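A minimal sketch of such an adjacency matrix, assuming the set of users active in each subreddit is already known. The names SubredditGraphSketch and adjacency are hypothetical, and the diagonal is left at zero as an arbitrary choice.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SubredditGraphSketch {
    /**
     * Returns a symmetric matrix w where w[i][j] is the number of users
     * that subreddits i and j have in common.
     */
    static long[][] adjacency(List<String> subreddits, Map<String, Set<String>> usersBySubreddit) {
        int n = subreddits.size();
        long[][] w = new long[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                Set<String> shared = new HashSet<>(
                    usersBySubreddit.getOrDefault(subreddits.get(i), Collections.<String>emptySet()));
                shared.retainAll(
                    usersBySubreddit.getOrDefault(subreddits.get(j), Collections.<String>emptySet()));
                // No directionality: store the same weight in both cells.
                w[i][j] = shared.size();
                w[j][i] = shared.size();
            }
        }
        return w;
    }

    public static void main(String[] args) {
        List<String> subs = List.of("AskReddit", "history", "politics");
        Map<String, Set<String>> users = Map.of(
            "AskReddit", Set.of("adam", "james"),
            "history", Set.of("adam"),
            "politics", Set.of("james"));
        long[][] w = adjacency(subs, users);
        System.out.println(w[0][1] + " " + w[0][2] + " " + w[1][2]); // prints 1 1 0
    }
}
```

The pairwise loop is quadratic in the number of subreddits, which is fine for illustration but would need a distributed formulation at the data sizes quoted in this deck.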
5. Data Pipeline
2.5 terabytes of raw JSON data in S3
Extract features, then compress to HDFS (50:1)
All complex graph creation and processing is performed in Spark
Lots of data → Quick writes to Cassandra
6. Challenge 1: Storing Data in Cassandra Requiring Object Parameterization
Solution: Use Scala mutable Map
Module initialization
$ is used for third party language tools. It is generally a reserved identifier.
7. Challenge 1: Storing Data in Cassandra Requiring Object Parameterization
Solution: Write an interface that extends Map
Now simply use: MapStringLong.class
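Java has no class literal for a parameterized type (Map<String, Long>.class does not compile), so a named interface that fixes the type parameters gives an ordinary class literal to pass around. A minimal sketch under that reading: only the name MapStringLong comes from the slide, while TypeTokenSketch and typeArguments are hypothetical helpers. How the Cassandra connector consumes the class is not shown.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Arrays;
import java.util.Map;

/** Named interface that pins Map's type parameters to String and Long. */
interface MapStringLong extends Map<String, Long> {}

class TypeTokenSketch {
    /** Reads back the type arguments recorded on the interface's parent. */
    static Type[] typeArguments(Class<?> mapInterface) {
        ParameterizedType parent = (ParameterizedType) mapInterface.getGenericInterfaces()[0];
        return parent.getActualTypeArguments();
    }

    public static void main(String[] args) {
        Class<MapStringLong> token = MapStringLong.class; // a plain class literal
        System.out.println(Arrays.toString(typeArguments(token)));
        // prints [class java.lang.String, class java.lang.Long]
    }
}
```

The key and value types survive erasure because they are stored in the interface's generic signature, which is what lets reflection-based libraries recover them from the class object.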
8. Challenge 1: Storing Data in Cassandra Requiring Object Parameterization
Solution: Use .mapToPair to map the Scala mutable Map to a Java Map
9. Challenge 2: Creating PageRank Algorithm
Solution: Create my own PageRank
Initialize ranks to 1
Number of total users (N)
Sum of the incoming PageRanks divided by the number of outgoing links
Damping factor is set to 0.15 as recommended by Stanford
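A minimal single-machine sketch of an iterative PageRank along these lines, not the author's Spark code. It assumes that 0.15 is the random-jump term (so the incoming sum is scaled by 0.85), that ranks start at 1, and that ranks are left unnormalized rather than divided by N. The names SimplePageRank and rank are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimplePageRank {
    /**
     * outLinks maps each user to the users they point at (commenter -> poster).
     * jump is the 0.15 term from the slide (an assumption about its role).
     */
    static Map<String, Double> rank(Map<String, List<String>> outLinks, int iterations, double jump) {
        // Initialize every user that appears as a source or a target to rank 1.
        Map<String, Double> ranks = new HashMap<>();
        for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
            ranks.put(e.getKey(), 1.0);
            for (String target : e.getValue()) {
                ranks.put(target, 1.0);
            }
        }
        for (int i = 0; i < iterations; i++) {
            // Each user sends rank / (number of outgoing links) along every edge.
            Map<String, Double> incoming = new HashMap<>();
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;
                }
                double share = ranks.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    incoming.merge(target, share, Double::sum);
                }
            }
            // New rank: jump term plus the scaled sum of incoming shares.
            Map<String, Double> next = new HashMap<>();
            for (String user : ranks.keySet()) {
                next.put(user, jump + (1.0 - jump) * incoming.getOrDefault(user, 0.0));
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("c"),
            "b", List.of("c"),
            "c", List.of("a"));
        System.out.println(rank(graph, 50, 0.15));
    }
}
```

With ranks starting at 1 and every user having at least one outgoing link, the ranks sum to N at every iteration. Dividing each rank by N gives the normalized form.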
10.
• Physics
• Java Enthusiast
• Collegiate Skier in USCSA
• Other Interesting Projects
  • Virtual Machine for LC-4 in C
  • Compiler for J Programming Language in C → LC-4