8.
Related Services
● Yelp
● Google News
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/
9.
Technical Description of Service
● Analyze the web crawl data
● Create a list of locations
● Extract the 100 most frequent words from the files
that mention a location from the list
● Build an index that maps each location to the list
of words corresponding to that location
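The steps above can be sketched in a few lines of Python. This is only a single-machine illustration, not the Pig/Hadoop job the slides describe: the function name `build_location_index`, the plain substring match for locations, and the simple regex tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def build_location_index(documents, locations, top_n=100):
    """Map each location to the top_n most frequent words found in
    the documents that mention it (hypothetical helper)."""
    counts = defaultdict(Counter)
    for text in documents:
        lowered = text.lower()
        # Naive tokenizer: runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", lowered)
        for location in locations:
            # Assumption: a case-insensitive substring match counts as a mention.
            if location.lower() in lowered:
                counts[location].update(words)
    return {
        location: [word for word, _ in counter.most_common(top_n)]
        for location, counter in counts.items()
    }


if __name__ == "__main__":
    docs = [
        "Jazz concert in Boston tonight, jazz fans welcome",
        "Boston jazz festival",
        "Pizza in Chicago",
    ]
    print(build_location_index(docs, ["Boston", "Chicago"], top_n=3))
```

A real pipeline would likely also drop stop words and the location name itself before ranking; that is left out here to keep the sketch short.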
11.
Data Sources
•Common Crawl Data from Amazon S3
–Contains information on billions of web pages
–Search through the contents
–Use ARC and Text files
12.
Technologies and Resources
● Hadoop Cluster on Bluegrit System
● Apache Pig
○ Python for UDFs
● Java/PHP for front end development
○ Use a JBoss container for Java, XAMPP for PHP
● Elasticsearch
● MapReduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS: RDS, Route 53, EC2
14.
Elasticsearch
● Distributed RESTful search and analytics engine.
● Offers near real-time search.
● Resilient clusters: failed nodes are detected and
removed automatically.
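As a rough illustration of how a location-to-words index could be loaded through Elasticsearch's REST interface, the sketch below builds a request body in the newline-delimited JSON format accepted by the bulk endpoint. The index name `locations`, the field names `location` and `words`, and the helper `to_bulk_payload` are assumptions for this example and are not taken from the slides. The code only builds the payload and does not contact a cluster.

```python
import json


def to_bulk_payload(location_index, index_name="locations"):
    """Serialize {location: [words]} as a bulk-API body (hypothetical helper).

    Each document takes two lines: an action line naming the target
    index and document id, then the document source itself.
    """
    lines = []
    for location, words in location_index.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": location}}))
        lines.append(json.dumps({"location": location, "words": words}))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_bulk_payload({"Boston": ["jazz", "festival"]}), end="")
```

The resulting string would be sent as the body of a POST request to the cluster's `_bulk` endpoint with the content type `application/x-ndjson`.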
15.
Challenges and Limitations
•Limited amount of HDD space available.
•Learning new technologies such as Apache
Pig and WSDL.
•Creating special UDFs in Python.
17.
References
● Data set
● Common Crawl Web data
● Elasticsearch
● Apache Pig
● Elasticsearch for Term Filter lookup
● Hadoop Tutorial
● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data
Processing on Large Clusters." Communications of the ACM 51.1 (2008):
107-113.
● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet
Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.