Centipede: Analyzing Webcrawl

  1. Centipede: Analyzing Web Crawl Data for Context of a Location. Vikas Bansal, Primal Pappachan, Abhishek Sethi
  2. Introduction
  3. Introduction
  4. Description A web service that presents the context associated with a location
  5. Context of a location 1. Weather 2. Healthcare 3. Crime 4. Employment 5. ……
  6. Customers 1. People moving or travelling to a new place 2. Policy makers 3. Journalists 4. Researchers
  7. Scenario
  8. Related Services ● Yelp ● Google News ● http://bestplaces.net/ ● http://www.nycgo.com/events/ ● http://www.stubhub.com/
  9. Technical Description of Service ● Analyze the web crawl data ● Create a list of locations ● Extract the top 100 words from the files that mention a location from the list ● Build an index that maps each location to the list of words found for it
  10. System Architecture
  11. Data Sources ● Common Crawl data from Amazon S3 ○ Contains information on billions of web pages ○ Search through the contents ○ Use ARC and text files
  12. Technologies and Resources ● Hadoop cluster on the Bluegrit system ● Apache Pig ○ Python for UDFs ● Java/PHP for front-end development ○ Use a JBoss container for Java, XAMPP for PHP ● Elasticsearch ● MapReduce ● SQL/NoSQL database ● REST ● WSDL 2.0 ● AWS: RDS, Route 53, EC2
  13. MapReduce Job Splitter ● Sentence ● Paragraph ● Article
  14. Elasticsearch ● Distributed RESTful search and analytics ● Near real-time search ● Resilient clusters that detect and remove failed nodes
  15. Challenges and Limitations ● Amount of HDD space available ● Learning new technologies such as Apache Pig and WSDL ● Creating special UDFs in Python
  16. Timeline
  17. References ● Data set ● Common Crawl web data ● Elasticsearch ● Apache Pig ● Elasticsearch term filter lookup ● Hadoop tutorial ● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113. ● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
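
The pipeline on slide 9, combined with the splitter granularities on slide 13, can be sketched in plain Python. This is a minimal illustration and not the deck's actual Pig or Hadoop implementation. The function names (`split_units`, `map_unit`, `build_index`), the tiny stopword list, and the restriction to single-word location names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "to",
             "for", "on", "has", "with", "was", "this"}


def split_units(text, granularity="sentence"):
    """Split one article into units: 'sentence', 'paragraph', or 'article'."""
    if granularity == "article":
        units = [text]
    elif granularity == "paragraph":
        units = re.split(r"\n\s*\n", text)          # blank line ends a paragraph
    elif granularity == "sentence":
        units = re.split(r"(?<=[.!?])\s+", text)    # naive sentence boundary
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [u.strip() for u in units if u.strip()]


def map_unit(unit, locations):
    """Map step: emit (location, word) pairs for each location the unit mentions.

    Assumption: locations are single words, matched case-insensitively.
    """
    words = re.findall(r"[a-z]+", unit.lower())
    for location in locations:
        key = location.lower()
        if key in words:
            for word in words:
                if word != key and word not in STOPWORDS:
                    yield location, word


def build_index(articles, locations, granularity="sentence", top_n=100):
    """Reduce step: count words per location and keep the top_n most frequent."""
    counts = defaultdict(Counter)
    for article in articles:
        for unit in split_units(article, granularity):
            for location, word in map_unit(unit, locations):
                counts[location][word] += 1
    return {loc: [w for w, _ in c.most_common(top_n)]
            for loc, c in counts.items()}


# Example usage with made-up text.
articles = ["Baltimore has a busy harbor. Crime in Baltimore fell.\n\n"
            "Seattle weather is rainy."]
index = build_index(articles, ["Baltimore", "Seattle"], granularity="sentence")
```

The granularity controls how much surrounding text counts as context for a location. With `"sentence"`, only words in the same sentence are attributed to a location. With `"article"`, every word in the page is attributed to every location the page mentions, which gives broader but noisier word lists.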