1. Insight Data Engineering
2019 New York Cohort B
Bubble Breaker
Break filter bubbles by gaining diverse perspectives on global news
Ming-Yuan Lu
3. Popping Bubbles & Business Values
1 Recommending content with similar views -- NO
  Presenting the full “spectrum” of sentiments (“tone”) from news reports -- YES
2 Dashboard to visualize the aggregated tone distribution and its evolution over time
3 Business use case: political campaigns/studies identifying public opinion trends
4. Dataset: GDELT
● GDELT: Global Database of Events, Language and Tone
● Monitors the world's news - every country, >100 languages
● Identifies the people, locations, themes, tones, etc. for each report
● Dataset updated daily until Apr 2019 on AWS S3
● Size: 1.3 TB/yr
7. Text Cleaning - Themes
● Themes are encoded in a specific taxonomy:
WB_567_CLIMATE_CHANGE ➜ CLIMATE_CHANGE
WB_2836_MIGRATION_POLICIES_AND_JOBS ➜ MIGRATION_POLICIES_AND_JOBS
TAX_WORLDLANGUAGES_KOREAN ➜ KOREAN
…
8. Text Cleaning - Taxonomy Building
● Take the first two words of each theme, e.g. the first two items of "WB_567_CLIMATE_CHANGE".split('_')
● Count the occurrences of the non-numeric words
● If a word occurs >= 10 times, add it to the taxonomy “dictionary”
● Clean themes according to this dictionary
[Figure legend: taxonomy words vs. regular words]
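The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_taxonomy` and `clean_theme` are hypothetical, and it assumes that "cleaning" means stripping leading tokens that are numeric or in the taxonomy dictionary.

```python
from collections import Counter


def build_taxonomy(themes, min_count=10):
    """Collect prefix words that occur at least `min_count` times.

    Only the first two underscore-separated words of each theme are
    counted, and purely numeric words (e.g. "567") are skipped.
    """
    counts = Counter()
    for theme in themes:
        for word in theme.split("_")[:2]:
            if not word.isdigit():
                counts[word] += 1
    return {word for word, n in counts.items() if n >= min_count}


def clean_theme(theme, taxonomy):
    """Strip leading taxonomy/numeric tokens (assumed cleaning rule).

    At least one token is always kept so a theme never becomes empty.
    """
    tokens = theme.split("_")
    start = 0
    while start < len(tokens) - 1 and (
        tokens[start].isdigit() or tokens[start] in taxonomy
    ):
        start += 1
    return "_".join(tokens[start:])
```

For example, once `WB`, `TAX` and `WORLDLANGUAGES` have each been seen at least 10 times, `clean_theme("WB_567_CLIMATE_CHANGE", taxonomy)` yields `CLIMATE_CHANGE` under these assumptions.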
9. About Me
● Physics Ph.D. candidate at UW-Madison
● Processed and analyzed astronomical data for rare high-energy neutrino detection
● Deployed detector at the South Pole
12. Tone
For each article, the sentiment is measured with a sentiment mining algorithm (Hu and Liu 2004)
that identifies negative and positive words in the text using a positive-negative lexicon
dictionary. The sentiment value is obtained by subtracting the number of negative words from the
number of positive words.
Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’04, 168–177. New York, NY, USA: ACM.
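A minimal sketch of the lexicon-based score described on this slide. The two tiny word sets below are placeholders invented for illustration; they are not the Hu and Liu lexicon, and the tokenization rule is an assumption.

```python
import re

# Hypothetical mini-lexicons for illustration only; a real run would
# load a full positive/negative word list instead.
POSITIVE = {"good", "great", "peace", "progress"}
NEGATIVE = {"bad", "crisis", "war", "fear"}


def tone(text):
    """Return (# positive words) - (# negative words) for `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives
```

A text with more positive than negative lexicon hits scores above zero, a text with more negative hits scores below zero, and a text with no hits scores exactly zero.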
13. TimescaleDB
● TimescaleDB is an extension of PostgreSQL - it keeps all the benefits of an RDBMS
● It offers 3 main advantages:
- Higher data ingestion rate
- Better or equal query performance
- Time-oriented features
[Figures: abstraction of a single continuous table over chunks; time-space partitioning]
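To make the idea of time-space partitioning concrete, here is a toy in-memory model of a single table backed by chunks. It is a conceptual sketch only: the `Hypertable` class, its parameters, and the CRC32 hashing are assumptions made for illustration and do not reflect TimescaleDB's internals or API.

```python
import zlib
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class Hypertable:
    """Toy model: one logical table, rows routed to (time, space) chunks."""

    def __init__(self, chunk_interval=timedelta(days=7), space_partitions=2):
        self.chunk_interval = chunk_interval
        self.space_partitions = space_partitions
        self.chunks = defaultdict(list)  # (time_slot, space_slot) -> rows

    def _time_slot(self, ts):
        # Timestamps must be timezone-aware to subtract EPOCH.
        return (ts - EPOCH) // self.chunk_interval

    def insert(self, ts, space_key, value):
        # Deterministic hash of the space key picks the space partition.
        space_slot = zlib.crc32(space_key.encode()) % self.space_partitions
        self.chunks[(self._time_slot(ts), space_slot)].append((ts, space_key, value))

    def query(self, start, end):
        """Rows with start <= ts < end, scanning only overlapping chunks."""
        first, last = self._time_slot(start), self._time_slot(end)
        rows = []
        for (time_slot, _), chunk in self.chunks.items():
            if first <= time_slot <= last:  # skip chunks outside the range
                rows.extend(r for r in chunk if start <= r[0] < end)
        return sorted(rows, key=lambda r: r[0])
```

The caller inserts into and queries one object, while a time-range query only has to look at the chunks whose time slot overlaps the requested range.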
16. TimescaleDB - Cost vs Postgres
The one additional cost of TimescaleDB compared to
PostgreSQL is more complex planning (given that a single
hypertable can comprise many chunks). This can
translate to a few extra milliseconds of planning time, which
can have a disproportionate influence on very low-latency
queries (< 10 ms).
17. TimescaleDB - VS NoSQL
● Main advantage over NoSQL:
SQL support - simple, effective for complex queries, no learning curve
● When NOT to use TimescaleDB?
- fast & simple reads
- sparse or unstructured data
- heavy compression
18. Benchmarking
Data size (GiB)    Duration
0.04               4m6.073s
0.16               3m57.620s
3.8                6m8.407s
38                 20m22s
114                84m3.025s
1331               2173m27.114s
1+3 cluster of m4.large EC2 instances, each with 100 GiB EBS attached
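The durations are easier to compare when converted to throughput. The snippet below is a small sketch of that arithmetic using only the sizes and durations from the table; the helper names are hypothetical, and it makes no assumption about which pipeline stage was timed.

```python
import re

# (data size in GiB, duration) pairs copied from the benchmark table.
BENCHMARKS = [
    (0.04, "4m6.073s"),
    (0.16, "3m57.620s"),
    (3.8, "6m8.407s"),
    (38, "20m22s"),
    (114, "84m3.025s"),
    (1331, "2173m27.114s"),
]


def parse_duration(text):
    """Convert a string such as '84m3.025s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def throughput_gib_per_hour(size_gib, duration):
    """Average GiB processed per hour for one benchmark row."""
    return size_gib / parse_duration(duration) * 3600


for size, duration in BENCHMARKS:
    rate = throughput_gib_per_hour(size, duration)
    print(f"{size:>8} GiB  {duration:>14}  {rate:8.2f} GiB/h")
```

The smallest runs take about four minutes regardless of size, which suggests fixed overhead dominates there, so the per-hour rate is only meaningful for the larger rows.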
Editor's Notes
Studies have shown that recommendation algorithms filter the content shown to their users. But even exposure to all aspects of an issue is often important when people are forming their opinions. Without it, we sit in our own bubbles, and social tribalism becomes hard to avoid.
Time-series data is often immutable. Reads and writes mostly touch the most recent records rather than updating past rows.