Related Services
● Yelp
● Google News
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/
Technical Description of Service
● Analyze the web crawl data
● Create a list of locations
● Extract the top 100 words from the files that mention a location from the list
● Build an index that maps each location to the list of words associated with it
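The steps above can be sketched in plain Python. This is only a minimal illustration, not the project's actual Pig/Hadoop job: the in-memory `pages` list stands in for the crawl files, `LOCATIONS` and `STOPWORDS` are hypothetical, and ranking the "top" words by raw frequency is an assumption.

```python
import re
from collections import Counter, defaultdict

# Hypothetical inputs: the real service would derive these from the crawl data.
LOCATIONS = ["baltimore", "new york"]
STOPWORDS = {"the", "a", "an", "in", "of", "and", "is", "to", "at", "on"}
TOP_N = 100  # the slides call for the top 100 words per location


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_location_index(pages, locations=LOCATIONS, top_n=TOP_N):
    """Map each location to the most frequent words in pages that mention it."""
    counts = defaultdict(Counter)
    for text in pages:
        lowered = text.lower()
        words = [w for w in tokenize(lowered) if w not in STOPWORDS]
        for loc in locations:
            # Whole-word match so "york" does not match inside "yorkshire".
            if re.search(r"\b" + re.escape(loc) + r"\b", lowered):
                loc_tokens = set(loc.split())
                # Count every word on the page except the location name itself.
                counts[loc].update(w for w in words if w not in loc_tokens)
    return {
        loc: [word for word, _ in counter.most_common(top_n)]
        for loc, counter in counts.items()
    }
```

Calling `build_location_index(pages)` on a list of page texts returns a dictionary keyed by location. In a MapReduce setting, the inner loop would correspond roughly to the map step (emit location and word pairs) and the `Counter` aggregation to the reduce step.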
Data Sources
● Common Crawl data from Amazon S3
○ Contains information on billions of web pages
○ Search through the page contents
○ Use ARC and text files
Technologies and Resources
● Hadoop Cluster on Bluegrit System
● Apache Pig
○ Python for UDFs
● Java/PHP for front-end development
○ Use a JBoss container for Java, XAMPP for PHP
● Elasticsearch
● MapReduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS: RDS, Route 53, EC2
References
● Data set
● Common Crawl Web data
● Elasticsearch
● Apache Pig
● Elasticsearch term filter lookup
● Hadoop Tutorial
● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.