Yun yuan week4.0_demo

Reddit Story
What happens as the comments
on a post keep coming
Yun Yuan

Motivation for Project
• Interested in social news: all topics under the sun
• Turn the tree structure of comments into a timeline
structure of comments
• See how opinions evolve over time

Input and Output
Data Input
• Reddit Comments from S3 Data Dump (JSON files)
• Reddit Posts Info from Reddit API (JSON files)
Data Output
• For each post, organize comments in timestamp order
with some significant attributes, and show the hottest
comments for that post
• Web App Presentation (Link): Graph and Short Texts
• Demo: Under Construction
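
The output described above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual code: it assumes each comment record carries the `link_id`, `created_utc`, and `score` fields found in Reddit comment JSON, and it assumes "hottest" means highest score, which the slides do not define. The function name `build_timelines` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_timelines(comments, top_n=3):
    """Group comments by post, order them by time, and pick the top-scored ones.

    Assumption: "hottest" is taken to mean highest score.
    """
    by_post = defaultdict(list)
    for comment in comments:
        by_post[comment["link_id"]].append(comment)

    result = {}
    for post_id, items in by_post.items():
        # created_utc may arrive as a string or a number, so normalize to int.
        timeline = sorted(items, key=lambda c: int(c["created_utc"]))
        hottest = sorted(items, key=lambda c: c["score"], reverse=True)[:top_n]
        result[post_id] = {"timeline": timeline, "hottest": hottest}
    return result


# Hypothetical sample records for one post.
sample = [
    {"id": "c2", "link_id": "t3_a", "created_utc": "1430438500", "score": 10},
    {"id": "c1", "link_id": "t3_a", "created_utc": "1430438400", "score": 2},
    {"id": "c3", "link_id": "t3_a", "created_utc": "1430438600", "score": 7},
]
timelines = build_timelines(sample, top_n=2)
```

In a distributed setting the same grouping and sorting would be expressed as a group-by-key step over the comment records rather than an in-memory dictionary.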

Tentative Pipeline and Data Flows
Comments JSON + Posts JSON
Post -> Trends -> Hot Comments

Challenges Encountered:
• Null values in JSON fields during ingestion
• Comment trends vary
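
One possible way to handle the null-value challenge at ingestion time is sketched below. It is an assumption-laden illustration rather than the pipeline's actual logic: the choice of required fields (`id`, `link_id`, `created_utc`), the default values, and the function name `clean_record` are all hypothetical.

```python
import json

# Hypothetical policy: records missing a required field are dropped,
# while optional fields that are null get a default value.
REQUIRED = ("id", "link_id", "created_utc")
DEFAULTS = {"score": 0, "body": "", "author": "[unknown]"}


def clean_record(line):
    """Parse one JSON line; return a cleaned dict, or None if unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # A field that is absent or explicitly null counts as missing.
    if any(record.get(key) is None for key in REQUIRED):
        return None
    for key, default in DEFAULTS.items():
        if record.get(key) is None:
            record[key] = default
    return record


good = clean_record('{"id": "c1", "link_id": "t3_a", "created_utc": 1, "score": null}')
bad = clean_record('{"id": "c2", "link_id": null, "created_utc": 2}')
broken = clean_record('not json')
```

Dropping versus defaulting is a design choice: dropping keeps the timeline trustworthy, while defaulting keeps comment counts complete.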

Distributed Clusters



1 Cluster: 4 Nodes of m4.large
• Hadoop/HDFS
• Kafka/Zookeeper
• Cassandra
1 Node of t2.micro
• Flask
1 Cluster: 4 Nodes of m4.large
• Spark
~$400 per month
