bdSquared

#bdSquared, final project for BIG DIVE 2
A project by: Erika Vicaretti, Piero Molino, Francesco Corazza, David Martín-Borregón

The RAI dataset contains a list of links to news items. The project uses it to show topic trends over time and the thematic relations between words and topics.
The result is an interactive visualization of a graph in which each node is a word and the arcs show the relationships between words.

Link to Project: http://bdsquared.bigdive.eu/
Link to Video Presentation: http://www.youtube.com/watch?v=ZGX5x88HWb0
Link to BigDive page: http://www.bigdive.eu/final-projects

Published in: Education, Technology

  1. BDSQUARE team: Piero Molino, Francesco Corazza, David Martín-Borregón, Erika Vicaretti. #bdsquare. HACKING DEVELOPMENT, VISUALIZATION & SCIENCE. BIG DATA BIAS DETECTION (BD)2: discover how news sources are talking about popular topics and the bias they have.
  2. /TEAM. Who: Piero, Francesco, David, Erika. Skills: #datascience #dev #dev #viz #dev #viz #viz
  3. /PROJECT. Goal: show topic trends over time and how words relate to data sources in each topic. Expected results: show source bias, peaks in topics, and the general amount of discussion.
  4. /RAI.DATASET. What we use: id, time stamp, url, news title. How "big"? Data: 2 million. Time: 26 months, from 11.2010 to 12.2012. Line sample: 1606616|2012-05-19 17:40:00+02|http://www.repubblica.it/cronaca/2012/05/19/news/brindisi_rete_notte_musei-35484726/?rss|19|Annullata Notte dei musei, proteste in Rete "Perché si ferma la cultura e non lo sport?"|f|t
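
A minimal Python sketch of how the line sample on slide 4 could be split into the four fields the team says it uses. The `NewsRecord` type and `parse_line` helper are hypothetical names, the meaning of the `19` and `f|t` columns is not stated on the slide and is ignored here, and the `+02` offset is padded to `+0200` so the standard library can parse it.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Line sample copied from slide 4 (URL hyphenation from the slide layout removed).
SAMPLE = (
    "1606616|2012-05-19 17:40:00+02|"
    "http://www.repubblica.it/cronaca/2012/05/19/news/"
    "brindisi_rete_notte_musei-35484726/?rss|19|"
    'Annullata Notte dei musei, proteste in Rete '
    '"Perché si ferma la cultura e non lo sport?"|f|t'
)


class NewsRecord(NamedTuple):
    """Hypothetical container for the four fields named on the slide."""
    record_id: int
    timestamp: datetime
    url: str
    title: str


def parse_line(line: str) -> NewsRecord:
    # First four columns: id, time stamp, url, and one undocumented number.
    record_id, raw_ts, url, _unknown, rest = line.rstrip("\n").split("|", 4)
    # The last two columns are undocumented flags; a title may contain "|".
    title, _flag_a, _flag_b = rest.rsplit("|", 2)
    # "%z" expects "+HHMM", so pad a bare "+HH" offset.
    if re.search(r"[+-]\d{2}$", raw_ts):
        raw_ts += "00"
    timestamp = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S%z")
    return NewsRecord(int(record_id), timestamp, url, title.strip())


if __name__ == "__main__":
    print(parse_line(SAMPLE))
```
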
  5. /PRE_PROCESSING: grouping by source (second-level domain); merging duplicate sources; filtering from Jan 2011 to Dec 2012.
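
The pre-processing steps on slide 5 might look roughly like the following Python sketch. It assumes a naive "last two host labels" rule for the second-level domain (which would be wrong for suffixes such as `co.uk`), and the `ALIASES` table is a hypothetical stand-in for however the team merged duplicate sources, which the slide does not describe.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical alias table: maps a duplicate source name to its canonical one.
ALIASES = {"example-mirror.org": "example.org"}

# Date window stated on the slide: Jan 2011 to Dec 2012 inclusive.
WINDOW_START = date(2011, 1, 1)
WINDOW_END = date(2012, 12, 31)


def second_level_domain(url: str) -> str:
    """Naive source name: the last two labels of the host name."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    domain = ".".join(parts[-2:]) if len(parts) >= 2 else host
    return ALIASES.get(domain, domain)


def in_window(day: date) -> bool:
    """True when the item falls inside the filtering window."""
    return WINDOW_START <= day <= WINDOW_END


def group_by_source(items):
    """Group (url, day, title) tuples by source, dropping out-of-window items."""
    groups = {}
    for url, day, title in items:
        if in_window(day):
            groups.setdefault(second_level_domain(url), []).append(title)
    return groups
```
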
  6. /TITLE_FILTERING: stop word removal; tokenization; removing non-alphanumeric characters. Tool: NLTK. Original: "La storia d'amore fra Homer e il suo divano". Cleaned: "storia amore fra homer divano".
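
The title filtering on slide 6 can be approximated without NLTK, as in this Python sketch. The stop-word set below is a small hand-picked assumption, not NLTK's Italian list, chosen so that the slide's own example comes out the same; the `clean_title` name is hypothetical.

```python
import re

# Hand-picked Italian stop words for illustration only (an assumption,
# not the list the team used). "d" covers elided forms such as "d'amore".
STOP_WORDS = {"la", "il", "lo", "le", "i", "gli", "e", "d", "di", "suo", "un", "una"}


def clean_title(title: str) -> list:
    """Lowercase, keep only alphanumeric runs as tokens, drop stop words."""
    # [^\W_]+ matches runs of letters and digits (Unicode aware), so
    # apostrophes and punctuation act as token separators.
    tokens = re.findall(r"[^\W_]+", title.lower())
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(" ".join(clean_title("La storia d'amore fra Homer e il suo divano")))
```
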
  7. /TOPIC_DISCOVERING: TF-IDF vectorization; soft clustering induction with sparse non-negative matrix factorization. Tool: sklearn. Factor matrices (diagram labels): words-topics and topics-documents.
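
To make the topic-discovery step on slide 7 concrete, here is a dependency-free Python sketch. The slide names sklearn; this example swaps in a hand-written TF-IDF and a plain (non-sparse) NMF with multiplicative updates, so it illustrates the idea rather than the team's actual code. The toy titles, function names and parameters are all assumptions.

```python
import math
import random
from collections import Counter

# Hypothetical tokenized titles (not taken from the RAI dataset).
DOCS = [
    ["governo", "crisi", "voto"],
    ["governo", "voto", "elezioni"],
    ["calcio", "partita", "gol"],
    ["calcio", "gol", "campionato"],
]


def tfidf_matrix(docs):
    """Return (vocab, V); V[i][j] is the TF-IDF weight of word i in document j."""
    vocab = sorted({word for doc in docs for word in doc})
    doc_freq = Counter(word for doc in docs for word in set(doc))
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    # idf = log(N / df) + 1, one common smoothed variant.
    matrix = [
        [counts[j][word] * (math.log(n_docs / doc_freq[word]) + 1.0)
         for j in range(n_docs)]
        for word in vocab
    ]
    return vocab, matrix


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (words x docs) into W (words x topics) and H (topics x docs)."""
    rng = random.Random(seed)
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + eps for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
             for rx, rn, rd in zip(h, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
             for rx, rn, rd in zip(w, num, den)]
    return w, h
```

Each column of `H` gives a document's non-negative weight on every topic, which is what makes the clustering "soft": a title can belong to several topics at once.
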
  8. /MAP_REDUCE. Emit (w = weight in NMF matrix): [0 - source] 1; [1 - source word] 1; [2 - source word topic] w; [3 - source topic] w; [4 - word topic] w; [5 - word] w; [6 - year-month topic] w. Processing: 10 m1.xlarge machines in 30 min. Tool: Amazon AWS.
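
The emit scheme on slide 8 could be sketched in plain Python as below. This is an in-memory illustration, not code for a real MapReduce cluster. The slide does not say which NMF weight `w` refers to for each key, so the sketch assumes it is the document's weight for a topic and that key `[5 - word]` accumulates that weight over all of the document's topics; `map_record` and `reduce_pairs` are hypothetical names.

```python
from collections import defaultdict


def map_record(source, year_month, words, topic_weights):
    """Yield (key, value) pairs for one news title, following the slide's 7 key types.

    topic_weights maps topic id -> w (assumed: the document's NMF weight).
    """
    yield (0, source), 1
    for word in words:
        yield (1, source, word), 1
    for topic, w in topic_weights.items():
        yield (3, source, topic), w
        yield (6, year_month, topic), w
        for word in words:
            yield (2, source, word, topic), w
            yield (4, word, topic), w
            yield (5, word), w


def reduce_pairs(pairs):
    """Sum the values of all pairs that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical records: (source, year-month, cleaned words, topic weights).
    records = [
        ("repubblica.it", "2012-05", ["notte", "musei"], {3: 0.5}),
        ("repubblica.it", "2012-05", ["notte", "cultura"], {3: 0.25, 7: 0.5}),
    ]
    pairs = [pair for record in records for pair in map_record(*record)]
    for key, total in sorted(reduce_pairs(pairs).items(), key=str):
        print(key, total)
```
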
  9. /DATA_GENERATION: force graph (edges, nodes). Tool: json.
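
One way the force-graph JSON from slide 9 might be produced is sketched below in Python. The slide only says "edges" and "nodes"; the `nodes`/`links` layout with index-based `source` and `target` follows a convention often used with d3 force layouts and is an assumption, as are the field names and the choice of source-word counts as edge values.

```python
import json


def build_force_graph(source_word_counts):
    """Turn {(source, word): count} into a nodes/links structure.

    Node and link field names are hypothetical.
    """
    index = {}
    nodes = []

    def node_id(name, kind):
        # Assign each distinct (kind, name) the next free position in nodes.
        key = (kind, name)
        if key not in index:
            index[key] = len(nodes)
            nodes.append({"name": name, "kind": kind})
        return index[key]

    links = []
    for (source, word), count in sorted(source_word_counts.items()):
        links.append({
            "source": node_id(source, "source"),
            "target": node_id(word, "word"),
            "value": count,
        })
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    counts = {("repubblica.it", "musei"): 3, ("repubblica.it", "notte"): 1}
    print(json.dumps(build_force_graph(counts), indent=2))
```
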
  10. /DATA_GENERATION: timeline graph (13 clusters, intensity by month). Tool: json.
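
For the timeline data on slide 10, a Python sketch along these lines could turn per-month topic totals into one intensity series per cluster. The input shape `{(year_month, cluster): weight}` and the output field names are assumptions; the slides do not show the actual JSON structure.

```python
import json


def build_timeline(month_cluster_weights):
    """Build {"months": [...], "clusters": [{"cluster", "intensity"}, ...]}.

    Missing (month, cluster) combinations get an intensity of 0.0.
    """
    months = sorted({month for month, _ in month_cluster_weights})
    clusters = sorted({cluster for _, cluster in month_cluster_weights})
    return {
        "months": months,
        "clusters": [
            {
                "cluster": cluster,
                "intensity": [
                    month_cluster_weights.get((month, cluster), 0.0)
                    for month in months
                ],
            }
            for cluster in clusters
        ],
    }


if __name__ == "__main__":
    # Hypothetical totals for two clusters over two months.
    weights = {("2011-01", 0): 1.5, ("2011-02", 0): 0.5, ("2011-02", 1): 2.0}
    print(json.dumps(build_timeline(weights), indent=2))
```
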
  11. /DATA_GENERATION: cluster graph (13 clusters). Tool: json.
  12. /DATA_VIZ: display divided by category of data source; dynamic generation of graphs from JSON. Tool: d3.js. Line sample: d3.json('data/timeline.json', function(source_data){ create_timeline_diagram(timeline_svg, source_data) })
  13. /USER_INTERFACE. Header: title. Control panel: clusters. Graphs: force graph, timeline graph. Footer: team+bd.
  14. /USER_INTERFACE: basic palette, extended palette.
  15. /USER_INTERFACE: titles and text. Typefaces: Helvetica, Bebas Neue Regular. (Type specimen with the sample title "THIS IS MY TITLE", lorem ipsum body text and sample figures.)
  16. /PROJECT_LAYOUT
  17. /OUR_FAILS: too many sources/words; discarded topics; visual clusters don't always emerge.
  18. /PROJECT_FUTURE. Deeper semantic analysis: add connections among words using pointwise mutual information to help visual clustering.
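
Pointwise mutual information, proposed on slide 18 for linking words, is PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The Python sketch below estimates the probabilities from how many titles contain each word and each word pair; the toy titles and the `pmi_scores` name are assumptions, and this is future work on the slide, not something the team reports having built.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs):
    """Return {(word_a, word_b): PMI} for every pair that co-occurs in a title.

    Probabilities are document frequencies divided by the number of titles.
    """
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))  # count each word once per title
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        pair: math.log2(
            (count / n_docs)
            / ((word_df[pair[0]] / n_docs) * (word_df[pair[1]] / n_docs))
        )
        for pair, count in pair_df.items()
    }


if __name__ == "__main__":
    # Hypothetical cleaned titles.
    titles = [
        ["notte", "musei"],
        ["notte", "musei"],
        ["notte", "sport"],
        ["calcio", "sport"],
    ]
    for pair, score in sorted(pmi_scores(titles).items()):
        print(pair, round(score, 3))
```

A positive score means two words appear together more often than independence would predict, so such pairs are natural candidates for extra edges in the force graph.
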
  19. /PROJECT_FUTURE. Deeper visualization: more statistics visualizations. (Mock-up: placeholder words arranged around the source REPUBBLICA.IT.)
  20. /THANK_YOU_FOR_YOUR_ATTENTION. Contacts: Piero Molino, piero.molino@gmail.com; Francesco Corazza, francesco@upendu.com; David Martín-Borregón, david.mabodo@gmail.com; Erika Vicaretti, erika.vicaretti@gmail.com
  21. BIG DATA BIAS DETECTION (BD)2: discover how news sources are talking about popular topics and the bias they have.
