
Data mining news articles by Amir Othman for PyCon APAC 2017





  1. Data Mining News Articles, by Amir Othman
  2. About myself. * Software engineer @ Instance * Educated at Bauhaus-Universität Weimar and Hochschule Ulm * I love my wife, building cool software, and making music * http://www.instance.com.sg * http://www.amirmeludah.com
  3. About this project. * Initially intended to be part of a thesis project * Grew into a fun side project * Fulfils a weird obsession with web scraping
  4. What is data mining news articles? "Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems." source: https://en.wikipedia.org/wiki/Data_mining
  5. What is data mining news articles? "Data mining, the science of extracting useful knowledge from such huge data repositories, has emerged as a young and interdisciplinary field in computer science." source: http://www.kdd.org/curriculum/index.html
  6. What is data mining news articles? "Collecting as much relevant data as possible in the hope of gaining insights." - me
  7. Collecting what? * News articles: * German news articles - regional and national * Malaysian news articles - ALL OF THEM!
  8. Why collect this data? * Building a corpus as raw material to test out NLP findings * A piece of digital history * News organizations go missing - the Wayback Machine is not practical * Cross-validating news sources
  9. How to collect links to news articles? * As a starting point before expanding: * News aggregators :) * Search engines * Curated news from news portals * Result: links pointing to news websites
  10. How to get even more links? * Related articles - news aggregators * Tweets from journalists and news organizations
  11. What about upcoming news? * We have collected a bunch of static links * News needs to be fresh and young
  12. What about upcoming news? * Information retrieval criteria: * age * freshness * Effective Web Crawling, PhD thesis by Carlos Castillo
  13. Simple is better than complex
  14. What about upcoming news? * News sites will (almost) always have RSS feeds * Slowly being replaced by Twitter feeds * Advantages: - Subscription instead of crawl frontiers - Convenient way to get recent news articles - Structured
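Because RSS is structured XML, pulling fresh article links out of a feed needs very little code. A minimal stdlib sketch (the talk does not name a library; in practice something like feedparser would be used, and the XML would be fetched from a real feed URL rather than the inline sample below):

```python
import xml.etree.ElementTree as ET

# Inline RSS 2.0 sample standing in for a feed fetched from a news site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Headline one</title>
      <link>https://news.example.com/article-1</link>
      <pubDate>Mon, 14 Aug 2017 09:00:00 +0800</pubDate>
    </item>
    <item>
      <title>Headline two</title>
      <link>https://news.example.com/article-2</link>
      <pubDate>Mon, 14 Aug 2017 10:30:00 +0800</pubDate>
    </item>
  </channel>
</rss>"""

def extract_links(feed_xml):
    """Return (title, link, pubDate) for every <item> in an RSS feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"),
             item.findtext("link"),
             item.findtext("pubDate"))
            for item in root.iter("item")]

links = extract_links(SAMPLE_FEED)
print(links[0][1])  # https://news.example.com/article-1
```

Polling a handful of such feeds on a schedule is the "subscription instead of crawl frontiers" advantage: the site tells you what is new, instead of you re-crawling to find out.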
  15. What about old news? * Identify the next/other/more links * Machine learning approach * Text classification task: - Is this link with this text a next/more/other link? - Train with labeled data - 400 sites from different news websites
  16. What about old news? * Text classification task: - Is this link with this text a next/more/other link? - Train with labeled data - 400 sites from different news websites - FastText - In one iteration: from 5,443 articles to 349,111 articles
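FastText's supervised mode reads training data as one line per example, with a `__label__` prefix on the class. A sketch of preparing such a training file from labeled link texts (the anchor texts and label names below are made-up illustrations, not the talk's actual data):

```python
# Labeled examples: (anchor text of a link, is it a next/more/other link?).
labeled_links = [
    ("Next page", True),
    ("More stories", True),
    ("Older posts", True),
    ("Contact us", False),
    ("Subscribe to our newsletter", False),
]

def to_fasttext_line(text, is_next):
    """FastText supervised format: '__label__<name> <text>' per line."""
    label = "__label__next" if is_next else "__label__other"
    return "{} {}".format(label, text.lower())

train_lines = [to_fasttext_line(t, y) for t, y in labeled_links]
print(train_lines[0])  # __label__next next page
```

These lines would then be written to a file and handed to the fasttext package's supervised trainer; the trained model classifies the anchor text of each unseen link, which is how one iteration could expand the crawl from 5,443 to 349,111 articles.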
  17. How to verify? * Similarity - above a similarity threshold * Put through the information extraction pipeline * Second layer of sanity checks: - randomly pick a link and inspect it
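The slide does not say which similarity measure sits behind the threshold check, so here is one plausible, deliberately simple choice: Jaccard overlap of word sets, with a made-up threshold value that would in practice be tuned against the manually inspected samples:

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two texts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa or wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)

SIMILARITY_THRESHOLD = 0.3  # assumed value; calibrate on inspected links

def passes_check(text_a, text_b):
    """Flag a crawled page as plausible if it is similar enough to a
    known-good article from the same site."""
    return jaccard(text_a, text_b) >= SIMILARITY_THRESHOLD

print(jaccard("the cat sat", "the cat ran"))  # 0.5
```

Pages that fall below the threshold still go through the second sanity layer from the slide: random manual inspection.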
  18. What do we have so far? * Links pointing to old articles * RSS and Twitter feeds providing links to new articles
  19. How to retrieve and store the data? * Politeness when hitting servers - schedule a delay between requests to the same domain * Queueing with Redis - one process pushes each newly found link onto a queue - a different process pops the queue and fetches the content
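The Redis producer/consumer split itself is a few `lpush`/`brpop` calls; the subtler part is the politeness rule. A self-contained sketch of per-domain delay scheduling (class name and the 2-second delay are illustrative, not from the talk):

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Remember the last request time per domain and report how long a
    worker should sleep before hitting the same domain again."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_hit = {}  # domain -> timestamp of the last request

    def wait_time(self, url, now=None):
        now = time.monotonic() if now is None else now
        domain = urlparse(url).netloc
        last = self.last_hit.get(domain)
        self.last_hit[domain] = now
        if last is None:
            return 0.0  # first hit on this domain: no delay needed
        return max(0.0, self.delay - (now - last))

sched = PolitenessScheduler(delay_seconds=2.0)
print(sched.wait_time("https://news.example.com/a", now=100.0))   # 0.0
print(sched.wait_time("https://news.example.com/b", now=101.0))   # 1.0
print(sched.wait_time("https://other.example.org/x", now=101.0))  # 0.0
```

In the full pipeline a consumer would pop a URL from the Redis queue, ask the scheduler how long to sleep, then fetch; requests to different domains interleave freely.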
  20. How to retrieve and store the data? * Scaling with Redis - Redis Cluster - multiple servers fetch the content
  21. How to retrieve and store the data? * Store with MongoDB - a document database for documents - we need the flexibility of a document database - save all the extracted information in MongoDB - sharded cluster
  22. How to clean the data? * HTML ==> structured information { "title": <news title>, "content": <content of news>, "date": <published date> }
  23. How to clean the data? * Alternative 1: - BeautifulSoup - disadvantage: manual - advantage: precise * Alternative 2: - readability-lxml - date and title extraction for free! - disadvantage: error-prone - advantage: fully automated
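To make the HTML-to-structured-information step concrete, here is a stdlib-only sketch that produces the `{"title", "content", "date"}` shape from slide 22. It stands in for either alternative above; a real pipeline would use BeautifulSoup or readability-lxml, and would actually parse a publication date rather than leaving it `None`:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect the <title> text and all <p> text from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_p = False
        self.title = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_title:
            self.title.append(data)
        elif self.in_p:
            self.paragraphs[-1] += data

def clean(html):
    p = ArticleExtractor()
    p.feed(html)
    return {"title": "".join(p.title).strip(),
            "content": " ".join(s.strip() for s in p.paragraphs),
            "date": None}  # a real pipeline would parse <time>/meta tags

doc = clean("<html><head><title>Big story</title></head>"
            "<body><p>First para.</p><p>Second para.</p></body></html>")
print(doc["title"])    # Big story
print(doc["content"])  # First para. Second para.
```

Taking every `<p>` is the "error-prone but fully automated" trade-off from the slide: navigation text leaks in, which is exactly what the RSS/Twitter metadata cross-validation on the next slide helps catch.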
  25. How to clean the data? * For data coming from RSS and Twitter feeds: - cross-validate with the feed metadata
  26. What can be extracted from the data? * Language detection: - pycld2 * Named entity recognition: - spaCy - Polyglot * Topic modelling: - Gensim
  27. Computing what is trending. * Extract named entities and rank them by their tf-idf score * Named entity recognition: - extract names, places, etc. * tf-idf: - a fancier way of counting word frequency that discounts terms appearing in many documents
  28. Querying and similarity * Querying: - Elasticsearch for full-text search * Similarity lookup: - run word2vec on the entire corpus - filter the dictionary to contain only named entities - get nearest neighbours
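Once word2vec (e.g. via gensim) has produced a vector per named entity, the nearest-neighbour lookup is just cosine similarity over the filtered dictionary. A sketch with tiny hypothetical 3-d vectors; real embeddings would have hundreds of dimensions and come from the trained model:

```python
import math

# Hypothetical embeddings for a few named entities (illustrative values).
vectors = {
    "Kuala Lumpur": [0.9, 0.1, 0.0],
    "Putrajaya":    [0.8, 0.2, 0.1],
    "Berlin":       [0.1, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_neighbours(query, k=2):
    """Entities most similar to `query`, best first."""
    others = [(name, cosine(vectors[query], v))
              for name, v in vectors.items() if name != query]
    return sorted(others, key=lambda t: t[1], reverse=True)[:k]

print(nearest_neighbours("Kuala Lumpur", k=1)[0][0])  # Putrajaya
```

Gensim's own `most_similar` does exactly this lookup (with an optimized matrix product); filtering the model vocabulary down to named entities first keeps the neighbours interpretable as people and places rather than stopwords.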
  29. Use case: automated timeline creation * A web application that consumes the data through a REST API * www.kronologimalaysia.com * www.diezeitachse.de
  30. Questions? othman.amir@gmail.com
