CASE STUDY – SEARCH ENGINES AS IMAGE BUILDERS
HEENA JAISINGHANI
DPGD/JL13/1836
SPECIALIZATION: GENERAL MANAGEMENT
PRIN. L.N. WELINGKAR INSTITUTE OF
MANAGEMENT DEVELOPMENT & RESEARCH
YEAR OF SUBMISSION: MARCH 2015
ANNEXURE 1
FLOW CHART INDICATING THE BASIC ELEMENTS OF THE PROJECT
To reach "Search Engine as Image Builder" we first need to know Search Engine
Optimization (SEO), whose basic elements are:
- Code techniques that minimize the use of Flash and frames
- Keywords or keyword phrases that fit the target market
- Linking strategy
ANNEXURE 3
UNDERTAKING BY CANDIDATE
I declare that the project entitled "Case Study: Search Engines as Image Builders" is my own
work conducted as part of my syllabus.
I further declare that the project work presented has been prepared personally by me and is
not sourced from any outside agency. I understand that any such malpractice will have very
serious consequences and my admission to the program will be cancelled without any refund
of fees.
I am also aware that I may face legal action if I engage in such malpractice.
Heena Jaisinghani
(Signature of Candidate)
Table of contents
- Introduction
- Background
- Methodology
- Conclusions & Recommendations
- Limitations
- Bibliography
Introduction
Search Engines as Image Builders can be viewed from two points of view, which are as follows:
1) How does a company build its image using search engines?
&
2) How does an image get built through a search engine spider simulator together with Search
Engine Optimization?
Let us start with the second topic first, which will lead us to the first.
Search Engine Optimization (SEO) is a process that not only involves making web pages
easy to find, easy to crawl and easy to categorise, but also making those pages rank high for
certain keywords or search terms.
The technique behind image building is the search engine spider simulator.
Basically, all search engine spiders function on the same principle – they crawl the Web and
index pages, which are stored in a database, and later use various algorithms to determine the
page ranking, relevancy, etc. of the collected pages. While the algorithms for calculating
ranking and relevancy differ widely among search engines, the way they index sites is more
or less uniform, and it is very important that you know what spiders are interested in and what
they neglect.
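To make the crawl-and-store idea above concrete, here is a minimal sketch of a crawler's fetch-and-follow loop using only Python's standard library. The seed URL, the page limit and the plain dictionary standing in for a "database" are illustrative assumptions, not part of any real search engine.
Python
# Minimal crawler sketch: fetch a page, collect its links, follow them breadth-first.
# The seed URL and page limit below are illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # remember the target of every <a href="..."> we see
        if tag == "a":
            for (name, value) in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue, store = set(), deque([seed]), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        store[url] = html  # "stored in a database" -- here simply a dict
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))
    return store

# pages = crawl("https://example.com/")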
Businesses are growing more aware of the need to understand and implement at least the
basics of search engine optimization (SEO). But if you read a variety of blogs and websites,
you’ll quickly see that there’s a lot of uncertainty over what makes up “the basics.” Without
access to high-level consulting and without a lot of experience knowing what SEO resources
can be trusted, there’s also a lot of misinformation about SEO strategies and tactics.
Below are techniques for the usage of SEO:
1. Commit yourself to the process: SEO isn’t a one-time event. Search engine algorithms
change regularly, so the tactics that worked last year may not work this year. SEO requires a
long-term outlook and commitment.
2. Be patient: SEO isn’t about instant gratification. Results often take months to see, and this
is especially true the smaller you are, and the newer you are to doing business online.
3. Ask a lot of questions when hiring an SEO company: It’s your job to know what kind of
tactics the company uses. Ask for specifics. Ask if there are any risks involved. Then get
online yourself and do your own research—about the company, about the tactics they
discussed, and so forth.
4. Become a student of SEO: If you’re taking the do-it-yourself route, you’ll have to become
a student of SEO and learn as much as you can.
5. Have web analytics in place at the start: You should have clearly defined goals for your
SEO efforts, and you’ll need web analytics software in place so you can track what’s working
and what’s not.
6. Build a great web site: Ask yourself, “Is my site really one of the 10 best sites in the
world on this topic?” Be honest. If it’s not, make it better.
7. Include a site map page: Spiders can’t index pages that can’t be crawled. A site map will
help spiders find all the important pages on your site, and help the spider understand your
site’s hierarchy. This is especially helpful if your site has a hard-to-crawl navigation menu. If
your site is large, make several site map pages. Keep each one to fewer than 100 links; it is
advisable to keep 75 as the maximum to be safe.
8. Make SEO-friendly URLs: Use keywords in your URLs and file names, such as
yourdomain.com/red-widgets.html. Don’t overdo it, though. A file with 3+ hyphens tends to
look spammy and users may be hesitant to click on it. Use hyphens in URLs and file names,
not underscores. Hyphens are treated as a “space,” while underscores are not.
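As an illustration of the hyphenated, keyword-rich URL advice above, here is a small, hypothetical slug helper in Python; the function name, the word limit and the example call are assumptions for illustration only.
Python
import re

def slugify(title, max_words=4):
    # hypothetical helper: lowercase, keep letters/digits, join words with hyphens (not underscores)
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words[:max_words])

# slugify("Red Widgets on Sale!")  ->  "red-widgets-on-sale"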
9. Do keyword research at the start of the project: If you’re on a tight budget, use the free
versions of Keyword Discovery or WordTracker, both of which also have more powerful
paid versions. Ignore the numbers these tools show; what’s important is the relative volume
of one keyword to another. Another good free tool is Google’s AdWords Keyword Tool,
which doesn’t show exact numbers.
10. Open up a PPC account: Whether it’s Google’s AdWords, Microsoft adCenter or
something else, this is a great way to get actual search volume for your keywords. Yes, it
costs money, but if you have the budget it’s worth the investment. It’s also the solution if you
didn’t like the “Be patient” suggestion above and are looking for instant visibility.
11. Use a unique and relevant title and Meta description on every page: The page title is
the single most important on-page SEO factor. It’s rare to rank highly for a primary term (2-3
words) without that term being part of the page title. The meta description tag won’t help you
rank, but it will often appear as the text snippet below your listing, so it should include the
relevant keyword(s) and be written so as to encourage searchers to click on your listing.
12. Write for users first: Google, Yahoo, etc., have pretty powerful bots crawling the web,
but to my knowledge these bots have never bought anything online, signed up for a
newsletter, or picked up the phone to call about your services. Humans do those things, so
write your page copy with humans in mind. Yes, you need keywords in the text, but don’t
stuff each page like a Thanksgiving turkey. Keep it readable.
13. Create great, unique content: This is important for everyone, but it’s a particular
challenge for online retailers. If you’re selling the same widget that 50 other retailers are
selling, and everyone is using the boilerplate descriptions from the manufacturer, this is a
great opportunity. Write your own product descriptions, using the keyword research you did
earlier (see #9 above) to target actual words searchers use, and make product pages that blow
the competition away. Plus, retailer or not, great content is a great way to get inbound links.
14. Use your keywords as anchor text when linking internally: Anchor text helps tell
spiders what the linked-to page is about. Links that say “click here” do nothing for your
search engine visibility.
15. Build links intelligently: Begin with foundational links like trusted directories. (Yahoo
and DMOZ are often cited as examples, but don’t waste time worrying about DMOZ
submission. Submit it and forget it.) Seek links from authority sites in your industry. If local
search matters to you (more on that coming up), seek links from trusted sites in your
geographic area — the Chamber of Commerce, local business directories, etc. Analyze the
inbound links to your competitors to find links you can acquire, too. Create great content on a
consistent basis and use social media to build awareness and links.
16. Use press releases wisely: Developing a relationship with media covering your industry
or your local region can be a great source of exposure, including getting links from trusted
media web sites. Distributing releases online can be an effective link building tactic, and
opens the door for exposure in news search sites. Only issue a release when you have
something newsworthy to report. Don’t waste journalists’ time.
17. Start a blog and participate with other related blogs: Search engines, Google
especially, love blogs for the fresh content and highly-structured data. Beyond that, there’s no
better way to join the conversations that are already taking place about your industry and/or
company. Reading and commenting on other blogs can also increase your exposure and help
you acquire new links. Put your blog at yourdomain.com/blog so your main domain gets the
benefit of any links to your blog posts. If that’s not possible, use blog.yourdomain.com.
18. Use social media marketing wisely. If your business has a visual element, join the
appropriate communities on Flickr and post high-quality photos there. If you’re a service-
oriented business, use Quora and/or Yahoo Answers to position yourself as an expert in your
industry. Any business should also be looking to make use of Twitter and Facebook, as social
information and signals from these are being used as part of search engine rankings for
Google and Bing. With any social media site you use, the first rule is don’t spam! Be an
active, contributing member of the site. The idea is to interact with potential customers, not
annoy them.
19. Take advantage of local search opportunities. Online research for offline buying is a
growing trend. Optimize your site to catch local traffic by showing your address and local
phone number prominently. Write a detailed Directions/Location page using neighbourhoods
and landmarks in the page text. Submit your site to the free local listings services that the
major search engines offer. Make sure your site is listed in local/social directories such as
CitySearch, Yelp, Local.com, etc., and encourage customers to leave reviews of your
business on these sites, too.
20. Take advantage of the tools the search engines give you. Sign up for
Google Webmaster Central, Bing Webmaster Tools and Yahoo Site Explorer to learn more
about how the search engines see your site, including how many inbound links they’re aware
of.
21. Diversify your traffic sources. Google may bring you 70% of your traffic today, but
what if the next big algorithm update hits you hard? What if your Google visibility goes away
tomorrow? Newsletters and other subscriber-based content can help you hold on to
traffic/customers no matter what the search engines do. In fact, many of the DOs on this
list—creating great content, starting a blog, using social media and local search, etc.—will
help you grow an audience of loyal prospects and customers that may help you survive the
whims of search engines
Background
This section shows the behind-the-scenes concepts of the search engine spider simulator
and the techniques mentioned above.
Are Your Hyperlinks Spiderable?
The search engine spider simulator can be of great help when trying to figure out whether
hyperlinks lead to the right place. For instance, link exchange websites often put fake links to
your site with JavaScript (using mouseover events and the like to make the link look genuine),
but this is not actually a link that search engines will see and follow. Since the spider
simulator would not display such links, you'll know that something about the link is wrong.
It is highly recommended to use the <noscript> tag, as opposed to JavaScript-based menus.
The reason is that JavaScript-based menus are not spiderable and all the links in them will
be ignored or treated merely as page text. The solution to this problem is to put all menu item links in the
<noscript> tag. The <noscript> tag can hold a lot, but please avoid using it for link stuffing or
any other kind of SEO manipulation.
If you happen to have tons of hyperlinks on your pages (although it is highly recommended to
have fewer than 100 hyperlinks on a page), then you might have a hard time checking whether they are
OK. For instance, if you have pages that display “403 Forbidden”, “404 Page Not Found” or
similar errors that prevent the spider from accessing the page, then it is certain that such a page
will not be indexed. It is necessary to mention that a spider simulator does not deal with 403
and 404 errors, because it checks where links lead to, not whether the target of the link is in
place, so you need to use other tools for checking whether the targets of hyperlinks are the intended
ones.
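Since the simulator does not report these errors itself, a separate link checker is needed. Below is a minimal, hedged sketch of one using Python's standard library; the example URLs are assumptions for illustration.
Python
# Minimal link-status checker sketch: report the HTTP status for a list of link targets.
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_links(urls):
    statuses = {}
    for url in urls:
        try:
            statuses[url] = urlopen(url).status  # e.g. 200 if reachable
        except HTTPError as e:
            statuses[url] = e.code               # e.g. 403 or 404
        except URLError:
            statuses[url] = None                 # unreachable / DNS failure
    return statuses

# check_links(["https://example.com/", "https://example.com/missing-page"])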
Looking for Your Keywords
While there are specific tools, like the Keyword Playground or the Website Keyword
Suggestions, which deal with keywords in more detail, search engine spider simulators also
help to see with the eyes of a spider where keywords are located among the text of the page.
Why is this important? Because keywords in the first paragraphs of a page weigh more than
keywords in the middle or at the end. And if keywords visually appear to us to be on the top,
this may not be the way spiders see them. Consider a standard Web page built with tables. In this
case the code that describes the page layout (like navigation links or separate
cells with text that are the same sitewide) might come first and, what is worse, can be so long
that the actual page-specific content will be screens away from the top of the page.
Are Dynamic Pages Too Dynamic to Be Seen At All?
Dynamic pages (especially ones with question marks in the URL) are also an extra that
spiders do not love, although many search engines do index dynamic pages as well. Running
the spider simulator will give you an idea how well your dynamic pages are accepted by
search engines.
Meta Keywords and Meta Description
Meta keywords and meta description, as the name implies, are to be found in the <META>
tags of an HTML page. Meta keywords and meta descriptions were once the single most
important criteria for determining the relevance of a page, but search engines now employ
alternative mechanisms for determining relevancy, so you can safely skip listing keywords
and a description in meta tags (unless you want to add instructions there for the spider about what to
index and what not to; apart from that, meta tags are not very useful anymore).
Meta tags are a great way for webmasters to provide search engines with information about
their sites. Meta tags can be used to provide information to all sorts of clients, and each
system processes only the meta tags they understand and ignores the rest. Meta tags are added
to the <head> section of your HTML page and generally look like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Author: A.N. Author, Illustrator: P. Picture,
Category: Books, Price: £9.24, Length: 784 pages">
Methodology
Here we come to know how the whole process started
Finding information on the World Wide Web had been a difficult and frustrating task, but
became much more usable with breakthroughs in search engine technology in the late 1990s.
A web search engine is a software system that is designed to search for information on the
World Wide Web. The search results are generally presented in a line of results often referred
to as search engine results pages (SERPs). The information may be a mix of web pages,
images, and other types of files. Some search engines also mine data available in databases or
open directories. Unlike web directories, which are maintained only by human editors, search
engines also maintain real-time information by running an algorithm on a web crawler (A
Web crawler is an Internet bot that systematically browses the World Wide Web, typically
for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant,
an automatic indexer)
History
Further information: Timeline of web search engines
Timeline (full list)
Year Engine Current status
1993 W3Catalog Inactive
Aliweb Inactive
JumpStation Inactive
WWW Worm Inactive
1994 WebCrawler Active, Aggregator
Go.com Active, Yahoo Search
Lycos Active
Infoseek Inactive
1995 AltaVista Inactive, redirected to Yahoo!
Daum Active
Magellan Inactive
Excite Active
SAPO Active
Yahoo! Active, Launched as a directory
1996 Dogpile Active, Aggregator
Inktomi Inactive, acquired by Yahoo!
HotBot Active (lycos.com)
Ask Jeeves Active (rebranded ask.com)
1997 Northern Light Inactive
Yandex Active
1998 Google Active
Ixquick Active also as Start page
MSN Search Active as Bing
empas Inactive (merged with NATE)
1999 AlltheWeb Inactive (URL redirected to Yahoo!)
GenieKnows Active, rebranded Yellowee.com
Naver Active
Teoma Inactive, redirects to Ask.com
Vivisimo Inactive
2000 Baidu Active
Exalead Active
Gigablast Active
2003 Info.com Active
Scroogle Inactive
2004 Yahoo! Search Active, Launched own web search (see Yahoo! Directory, 1995)
A9.com Inactive
Sogou Active
2005 AOL Search Active
GoodSearch Active
SearchMe Inactive
2006 Soso (search engine) Active
Quaero Inactive
Ask.com Active
Live Search Active as Bing, Launched as rebranded MSN Search
ChaCha Active
Guruji.com Inactive
2007 wikiseek Inactive
Sproose Inactive
Wikia Search Inactive
Blackle.com Active, Google Search
2008 Powerset Inactive (redirects to Bing)
Picollator Inactive
Viewzi Inactive
Boogami Inactive
LeapFish Inactive
Forestle Inactive (redirects to Ecosia)
DuckDuckGo Active
2009 Bing Active, Launched as rebranded Live Search
Yebol Inactive
Mugurdy Inactive due to a lack of funding
Scout (Goby) Active
NATE Active
2010 Blekko Active
Cuil Inactive
Yandex Active, Launched global (English) search
2011 YaCy Active, P2P web search engine
2012 Volunia Inactive
2013 Halalgoogling Active, Islamic / Halal filter Search
During early development of the web, there was a list of webservers edited by Tim Berners-
Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992 remains,
but as more and more webservers went online the central list could no longer keep up. On the
NCSA(National Center for Supercomputing Applications) site, new servers were announced
under the title "What's New!"
The first tool used for searching on the Internet was Archie. The name stands for "archive"
without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch,
computer science students at McGill University in Montreal. The program downloaded the
directory listings of all the files located on public anonymous FTP (File Transfer Protocol)
sites, creating a searchable database of file names; however, Archie did not index the contents
of these sites since the amount of data was so limited it could be readily searched manually.
In June 1993, Matthew Gray, then at MIT (Massachusetts Institute of Technology), produced
what was probably the first web robot (a software application that runs automated tasks
over the Internet), the Perl-based World Wide Web Wanderer (Perl is a family of high-level,
general-purpose, interpreted, dynamic programming languages that includes Perl 5 and
Perl 6), and used it to generate an index called 'Wandex'.
The purpose of the Wanderer was to measure the size of the World Wide Web, which it did
until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb
did not use a web robot, but instead depended on being notified by website administrators of
the existence at each site of an index file in a particular format.
JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web
pages and to build its index, and used a web form (it allows a user to enter data that
is sent to a server for processing) as the interface to its query program. It was thus the first
WWW resource-discovery tool to combine the three essential features of a web search engine
(crawling, indexing, and searching) as described below. Because of the limited resources
available on the platform it ran on, its indexing and hence searching were limited to the titles
and headings found in the web pages the crawler encountered.
One of the first "all text" crawler-based search engines was WebCrawler, which came out in
1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which
has become the standard for all major search engines since. It was also the first one widely
known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was
launched and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity. These included Magellan,
Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most
popular ways for people to find web pages of interest, but its search function operated on its
web directory (it specializes in linking to other web sites and categorizing those
links.) rather than its full-text copies of web pages. Information seekers could also
browse the directory instead of doing a keyword-based search.
Google adopted the idea of selling search terms in 1998 from a small search engine company
named goto.com (an internet advertising company). This move had a significant effect on the
search engine business, which went from struggling to being one of the most profitable
businesses on the internet.
In 1996, Netscape (an American computer services company, best known for its web browser,
Netscape Navigator) was looking to give a single search engine an exclusive deal as
the featured search engine on Netscape's web browser. There was so much interest that
instead Netscape struck deals with five of the major search engines: for $5 million a year,
each search engine would be in rotation on the Netscape search engine page. The five engines
were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Search engines were also known as some of the brightest stars in the Internet investing frenzy
that occurred in the late 1990s. Several companies entered the market spectacularly,
receiving record gains during their initial public offerings. Some have taken down their
public search engine, and are marketing enterprise-only editions, such as Northern Light.
Many search engine companies were caught up in the dot-com bubble, a speculation-driven
market boom that peaked in 1999 and ended in 2001.
Around 2000, Google's search engine rose to prominence. The company achieved better
results for many searches with an innovation called PageRank (it is a way of measuring the
importance of website pages), as was explained in the paper Anatomy of a Search Engine
written by Sergey Brin and Larry Page, the later founders of Google. This iterative algorithm
ranks web pages based on the number and PageRank of other web sites and pages that link
there, on the premise that good or desirable pages are linked to more than others. Google also
maintained a minimalist interface to its search engine. In contrast, many of its competitors
embedded a search engine in a web portal. In fact, Google search engine became so popular
that spoof engines emerged such as Mystery Seeker.
By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo!
acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003.
Yahoo! used Google's search engine until 2004, when it launched its own search
engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In
early 1999 the site began to display listings from Looksmart (it is an American, publicly
traded, online advertising company founded in 1995), blended with results from
Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In
2004, Microsoft began a transition to its own search technology, powered by its own web
crawler (called msnbot). Microsoft's rebranded search engine, Bing, was launched on June 1,
2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search
would be powered by Microsoft Bing technology.
Before going ahead with further details, I would like to highlight
Aggregators
“Aggregators” are the buzz word of choice for the various online companies that gather
information from fragmented marketplaces into a single portal to make life easier for
everyone. A classic example is online airline and hotel reservations.
How web search engines work
A search engine operates in the following order:
1. Web crawling (an Internet bot systematically browses the World Wide Web,
typically for the purpose of Web indexing)
2. Indexing (the engine collects, parses, and stores data to facilitate fast and accurate
information retrieval; index design incorporates interdisciplinary concepts from
linguistics, cognitive psychology, mathematics, informatics, and computer science)
3. Searching (the user enters a query into the web search engine to satisfy his or her
information needs)
Explanation of each is mentioned below
Web search engines work by storing information about many web pages, which they retrieve
from the HTML markup of the pages. These pages are retrieved by a Web crawler
(sometimes also known as a spider), an automated crawler which follows every link
on the site. The site owner can exclude specific pages by using robots.txt.
The search engine then analyzes the contents of each page to determine how it should be
indexed (for example, words can be extracted from the titles, page content, headings, or
special fields called meta tags, which are part of a web page's head section). Data about web
pages are stored in an index database for use in later queries. A query from a user can be a
single word. The index helps find information relating to the query as quickly as possible.
Some search engines, such as Google, store all or part of the source page (referred to as a
cache) as well as information about the web pages, whereas others, such as AltaVista, store
every word of every page they find. This cached page always holds the actual search text
since it is the one that was actually indexed, so it can be very useful when the content of the
current page has been updated and the search terms are no longer in it. This problem might be
considered a mild form of linkrot, and Google's handling of it increases usability by
satisfying user expectations that the search terms will be on the returned webpage. This
satisfies the principle of least astonishment, since the user normally expects that the search
terms will be on the returned pages. Increased search relevance makes these cached pages
very useful as they may contain data that may no longer be available elsewhere.
[Figure: High-level architecture of a standard Web crawler]
When a user enters a query into a search engine (typically by using keywords), the engine
examines its inverted index and provides a listing of best-matching web pages according to
its criteria, usually with a short summary containing the document's title and sometimes parts
of the text. The index is built from the information stored with the data and the method by
which the information is indexed. From 2007 the Google.com search engine has allowed one
to search by date by clicking "Show search tools" in the leftmost column of the initial search
results page, and then selecting the desired date range. Most search engines support
the use of the Boolean operators AND, OR and NOT to further specify the Web search query.
Boolean operators are for literal searches that allow the user to refine and extend the terms of
the search. The engine looks for the words or phrases exactly as entered. Some search
engines provide an advanced feature called proximity search, which allows users to define the
distance between keywords. There is also concept-based searching where the research
involves using statistical analysis on pages containing the words or phrases you search for.
As well, natural language queries allow the user to type a question in the same form one
would ask it to a human. A site like this would be ask.com.
The usefulness of a search engine depends on the relevance of the result set it gives back.
While there may be millions of web pages that include a particular word or phrase, some
pages may be more relevant, popular, or authoritative than others. Most search engines
employ methods to rank the results to provide the "best" results first. How a search engine
decides which pages are the best matches, and what order the results should be shown in,
varies widely from one engine to another. The methods also change over time as Internet
usage changes and new techniques evolve. There are two main types of search engine that
have evolved: one is a system of predefined and hierarchically ordered keywords that humans
have programmed extensively. The other is a system that generates an "inverted index" (In
computer science, an inverted index (also referred to as postings file or inverted file) is an
index data structure storing a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents. The purpose of an
inverted index is to allow fast full text searches, at a cost of increased processing when a
document is added to the database) by analyzing texts it locates. This second form relies much
more heavily on the computer itself to do the bulk of the work.
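To make the inverted-index idea concrete, here is a minimal sketch in Python. The three toy documents and the AND-query helper are illustrative assumptions, not part of any particular engine.
Python
# Minimal inverted-index sketch: map each word to the set of documents containing it.
from collections import defaultdict

docs = {
    "page1.html": "red widgets for sale",
    "page2.html": "blue widgets and red gadgets",
    "page3.html": "cooking with gadgets",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)

def search_and(*terms):
    # Boolean AND query: return the documents that contain every term
    sets = [inverted.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# search_and("red", "widgets")  ->  {"page1.html", "page2.html"}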
Most Web search engines are commercial ventures supported by advertising revenue and thus
some of them allow advertisers to have their listings ranked higher in search results for a fee.
Search engines that do not accept money for their search results make money by running
search related ads alongside the regular search engine results. The search engines make
money every time someone clicks on one of these ads.
Market share
Here we will come to know which site has the highest market share
Google is the world's most popular search engine, with a market share of 68.69 per cent.
Baidu comes in a distant second, answering 17.17 per cent of online queries. (Reported shares
vary with the measurement source and period, as the table below shows.)
The world's most popular search engines are:
Search engine Market share in October 2014
Google 58.01%
Baidu 29.06%
Bing 8.01%
Yahoo! 4.01%
AOL 0.21%
Ask 0.10%
Excite 0.00%
East Asia and Russia
East Asian countries and Russia constitute a few places where Google is not the most popular
search engine. Soso is more popular than Google in China.
Yandex commands a marketshare of 61.9 per cent in Russia, compared to Google's 28.3 per
cent. In China, Baidu is the most popular search engine. South Korea's homegrown
search portal, Naver, is used for 70 per cent of online searches in the country. Yahoo! Japan
and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan,
respectively.
Search engine bias
Although search engines are programmed to rank websites based on some combination of
their popularity and relevancy, empirical studies indicate various political, economic, and
social biases in the information they provide. These biases can be a direct result of
economic and commercial processes (e.g., companies that advertise with a search engine can
also become more popular in its organic search results; organic search results are listings
on search engine results pages that appear because of their relevance to the search terms, as
opposed to being advertisements, whereas non-organic search results may include
pay-per-click advertising) and of political processes (e.g., the removal of search results to
comply with local laws). For example, Google will not surface certain Neo-Nazi websites in
France and Germany, where Holocaust denial is illegal.
Biases can also be a result of social processes, as search engine algorithms are frequently
designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing
algorithms of major search engines skew towards coverage of U.S.-based sites, rather than
websites from non-U.S. countries.
Google Bombing (The terms Google bomb and Googlewashing refer to the practice of
causing a web page to rank highly in search engine results for unrelated or off-topic search
terms by linking heavily.) is one example of an attempt to manipulate search results for
political, social or commercial reasons.
Customized results and filter bubbles
Many search engines such as Google and Bing provide customized results based on the user's
activity history. This leads to an effect that has been called a filter bubble. The term describes
a phenomenon in which websites use algorithms to selectively guess what information a user
would like to see, based on information about the user (such as location, past click behavior
and search history). As a result, websites tend to show only information that agrees with the
user's past viewpoint, effectively isolating the user in a bubble that tends to exclude contrary
information. Prime examples are Google's personalized search results and Facebook's
personalized news stream. According to Eli Pariser, who coined the term, users get less
exposure to conflicting viewpoints and are isolated intellectually in their own informational
bubble. Pariser related an example in which one user searched Google for "BP" and got
investment news about British Petroleum while another searcher got information about the
Deepwater Horizon oil spill, and the two search results pages were "strikingly
different". The bubble effect may have negative implications for civic discourse,
according to Pariser.
Since this problem has been identified, competing search engines have emerged that seek to
avoid this problem by not tracking or "bubbling" users.
Faith-based search engines
The global growth of the Internet and the popularity of electronic content in the Arab and
Muslim world during the last decade have encouraged faith adherents, notably in the Middle
East and the Asian sub-continent, to "dream" of their own faith-based, i.e. "Islamic", search
engines or filtered search portals that would enable users to avoid accessing forbidden
websites such as pornography and would only allow them to access sites that are compatible
with the Islamic faith. Shortly before the Muslim holy month of Ramadan, Halalgoogling, which
collects results from other search engines like Google and Bing, was introduced to the world in
July 2013 to present halal results to its users, nearly two years after I'mHalal, another
search engine initially launched (in September 2011) to serve the Middle East Internet, had to
close its search service due to what its owner blamed on lack of funding.
While the lack of investment and the slow pace of technology in the Muslim world, as the main
consumers or targeted end users, have hindered progress and thwarted the success of a serious
Islamic search engine, the spectacular failure of heavily invested Muslim lifestyle web
projects like Muxlim, which received millions of dollars from investors like Rite Internet
Ventures, has - according to the I'mHalal shutdown notice - made almost laughable the idea that
the next Facebook or Google can only come from the Middle East if you support your bright
youth. Yet Muslim internet experts have been determining for years what is or is not
allowed according to the "Law of Islam" and have been categorizing websites and the like into
being either "halal" or "haram". All the existing and past Islamic search engines are merely
custom search indexes, or are monetized by major web search giants like Google, Yahoo and
Bing, with only certain filtering systems applied to ensure that their users can't access haram
sites, which include such content as nudity, gay content, gambling or anything that is deemed
to be anti-Islamic.
Another religiously-oriented search engine is Jewogle, which is the Jewish version of Google
and yet another is SeekFind.org, which is a Christian website that includes filters preventing
users from seeing anything on the internet that attacks or degrades their faith.
Till now we have studied how search engines are built and their
contribution to today's high-tech atmosphere.
Now, with the help of this technology, we will look at how an image is built.
How do I increase my site's visibility to search engines?
These days you don’t have to limit your search to just websites. Many other forms of content
are easy to find, including images. No matter what you’re looking for, an image is (for better
or worse) just one image search away.
You may wonder, however, how image search works. How are images sorted and classified,
making it possible to find tens or hundreds of relevant results? Perhaps you’re just curious, or
perhaps you run a site and want to know so you can improve your own ranking. In either
case, taking a deeper look could be helpful.
Some people assume that image search is conducted via fancy algorithms that determine what
an image is about and then index it. I know that’s where I started. As it turns out, however,
old fashioned text is one of the most important factors in an image’s ranking.
More specifically, the file name matters. Go ahead – do an image search. What do the top
results have in common? Almost invariably, it’s a portion of their file name. Most of the top
results for “pizza” have the word pizza in the file name.
That might seem obvious. But actually, it’s not. Most digital photographs, for example, will
start life with a file name like “1020302.jpg.” It’s only later that they’re re-named. For
webmasters, ensuring that a relevant file name is given to an image is just as basic and
important as making sure that a webpage’s keyword appears in that page’s metadata title
and/or description. But it’s not automatic. It takes constant effort.
We are now at the final step of building an image search engine — accepting a query image
and performing an actual search.
Let’s take a second to review how we got here:
- Step 1: Defining Your Image Descriptor. Before we even consider building an
image search engine, we need to consider how we are going to represent and quantify
our image using only a list of numbers (i.e. a feature vector). We explored three
aspects of an image that can easily be described: color, texture, and shape. We can use
one of these aspects, or many of them.
- Step 2: Indexing Your Dataset. Now that we have selected a descriptor, we can
apply the descriptor to extract features from each and every image in our dataset. The
process of extracting features from an image dataset is called “indexing”. These
features are then written to disk for later use. Indexing is also a task that is easily
made parallel by utilizing multiple cores/processors on our machine.
- Step 3: Defining Your Similarity Metric. In Step 1, we defined a method to extract
features from an image. Now, we need to define a method to compare our feature
vectors. A distance function should accept two feature vectors and then return a value
indicating how “similar” they are. Common choices for similarity functions include
(but are certainly not limited to) the Euclidean, Manhattan, Cosine, and Chi-Squared
distances.
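As a concrete illustration of Step 3, here is a minimal sketch of three of these distance functions in Python with NumPy; the two toy feature vectors are assumptions for illustration.
Python
# Minimal similarity-metric sketch: compare two feature vectors with three common distances.
import numpy as np

a = np.array([0.2, 0.5, 0.3])   # toy feature vectors (illustrative values)
b = np.array([0.1, 0.6, 0.3])

euclidean = np.sqrt(np.sum((a - b) ** 2))
manhattan = np.sum(np.abs(a - b))
chi_squared = 0.5 * np.sum(((a - b) ** 2) / (a + b + 1e-10))

print(euclidean, manhattan, chi_squared)
# identical vectors give a distance of 0 (the "Coincidence Axiom" mentioned later)
print(np.sqrt(np.sum((a - a) ** 2)))   # 0.0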
Finally, we are now ready to perform our last step in building an image search engine:
Searching and Ranking
The Query
Before we can perform a search, we need a query.
The last time you went to Google, you typed in some keywords into the search box, right?
The text you entered into the input form was your “query”.
Google then took your query, analyzed it, and compared it to their gigantic index of
webpages, ranked them, and returned the most relevant webpages back to you.
Similarly, when we are building an image search engine, we need a query image.
Query images come in two flavors: an internal query image and an external query image.
As the name suggests, an internal query image already belongs in our index. We have already
analyzed it, extracted features from it, and stored its feature vector.
The second type of query image is an external query image. This is the equivalent to typing
our text keywords into Google. We have never seen this query image before and we can’t
make any assumptions about it. We simply apply our image descriptor, extract features, rank
the images in our index based on similarity to the query, and return the most relevant results.
Let’s think back to our similarity metrics for a second and assume that we are using the
Euclidean distance. The Euclidean distance has a nice property called the Coincidence
Axiom, implying that the function returns a value of 0 (indicating perfect similarity) if and
only if the two feature vectors are identical.
Example: If I were to search for an image already in my index, then the Euclidean distance
between the two feature vectors would be zero, implying perfect similarity. This image would
then be placed at the top of my search results since it is the most relevant. This makes sense
and is the intended behavior.
How strange it would be if I searched for an image already in my index and did not find it in
the #1 result position. That would likely imply that there was a bug in my code somewhere or
I’ve made some very poor choices in image descriptors and similarity metrics.
Overall, using an internal query image serves as a sanity check. It allows you to make sure
that your image search engine is functioning as expected.
Once you can confirm that your image search engine is working properly, you can then
accept external query images that are not already part of your index.
The Search
So what’s the process of actually performing a search? Checkout the outline below:
1. Accept a query image from the user
A user could be uploading an image from their desktop or from their mobile device. As
image search engines become more prevalent, I suspect that most queries will come from
devices such as iPhones and Droids. It’s simple and intuitive to snap a photo of a place,
object, or something that interests you using your cellphone, and then have it automatically
analyzed and relevant results returned.
2. Describe the query image
Now that you have a query image, you need to describe it using the exact same image
descriptor(s) as you did in the indexing phase. For example, if I used a RGB color histogram
with 32 bins per channel when I indexed the images in my dataset, I am going to use the same
32 bin per channel histogram when describing my query image. This ensures that I have a
consistent representation of my images. After applying my image descriptor, I now have a
feature vector for the query image.
3. Perform the Search
To perform the most basic method of searching, you need to loop over all the feature vectors
in your index. Then, you use your similarity metric to compare the feature vectors in your
index to the feature vectors from your query. Your similarity metric will tell you how
“similar” the two feature vectors are. Finally, sort your results by similarity.
Looping over your entire index may be feasible for small datasets. But if you have a large
image dataset, like Google or TinEye, this simply isn’t possible. You can’t compute the
distance between your query features and the billions of feature vectors already present in
your dataset.
4. Display Your Results to the User
Now that we have a ranked list of relevant images we need to display them to the user. This
can be done using a simple web interface if the user is on a desktop, or we can display the
images using some sort of app if they are on a mobile device. This step is pretty trivial in the
overall context of building an image search engine, but you should still give thought to the
user interface and how the user will interact with your image search engine.
Summary
So there you have it, the four steps of building an image search engine, from front to back:
1. Define your image descriptor.
2. Index your dataset.
3. Define your similarity metric.
4. Perform a search, rank the images in your index in terms of relevancy to the user, and
display the results to the user.
Here is the best example of an image search engine built according to the above
explanation
Think about it this way. When you go to Google and type “Lord of the Rings” into the search
box, you expect Google to return pages to you that are relevant to Tolkien’s books and the
movie franchise. Similarly, if we present an image search engine with a query image, we
expect it to return images that are relevant to the content of the image — hence, we sometimes
call image search engines by what they are more commonly known in academic circles
as Content Based Image Retrieval (CBIR) systems.
So what’s the overall goal of our Lord of the Rings image search engine?
The goal, given a query image from one of our five different categories, is to return the
category’s corresponding images in the top 10 results.
That was a mouthful. Let’s use an example to make it more clear.
If I submitted a query image of The Shire to our system, I would expect it to give me all 5
Shire images in our dataset back in the first 10 results. And again, if I submitted a query
image of Rivendell, I would expect our system to give me all 5 Rivendell images in the first
10 results.
Make sense? Good. Let’s talk about the four steps to building our image search engine.
The 4 Steps to Building an Image Search Engine
On the most basic level, there are four steps to building an image search engine:
Define your descriptor: What type of descriptor are you going to use? Are you describing
color? Texture? Shape?
Index your dataset: Apply your descriptor to each image in your dataset, extracting a set of
features.
Define your similarity metric: How are you going to define how “similar” two images are?
You’ll likely be using some sort of distance metric. (a metric or distance function is a
function that defines a distance between elements of a set. A set with a metric is called a
metric space. A metric induces a topology on a set but not all topologies can be generated by
a metric.) Common choices include Euclidean, Cityblock (Manhattan), Cosine, and chi-
squared to name a few.
Searching: To perform a search, apply your descriptor to your query image, and then ask
your distance metric to rank how similar your images are in your index to your query images.
Sort your results via similarity and then examine them.
Step #1: The Descriptor – A 3D RGB Color Histogram
Our image descriptor is a 3D color histogram in the RGB color space with 8 bins per red,
green, and blue channel.
The best way to explain a 3D histogram is to use the conjunctive AND. This image descriptor
will ask a given image how many pixels have a Red value that falls into bin #1 AND a Green
value that falls into bin #2 AND how many Blue pixels fall into bin #1. This process will be
repeated for each combination of bins; however, it will be done in a computationally efficient
manner.
When computing a 3D histogram with 8 bins, OpenCV will store the feature vector as an (8,
8, 8) array. We’ll simply flatten it and reshape it to (512,). Once it’s flattened, we can easily
compare feature vectors together for similarity.
Ready to see some code? Okay, here we go:
3D RGB Histogram in OpenCV and Python
Python
# import the necessary packages
import numpy as np
import cv2

class RGBHistogram:
    def __init__(self, bins):
        # store the number of bins the histogram will use
        self.bins = bins

    def describe(self, image):
        # compute a 3D histogram in the RGB colorspace,
        # then normalize the histogram so that images
        # with the same content, but either scaled larger
        # or smaller will have (roughly) the same histogram
        hist = cv2.calcHist([image], [0, 1, 2],
            None, self.bins, [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist)  # note: OpenCV 3+ expects cv2.normalize(hist, hist)

        # return our 3D histogram as a flattened array
        return hist.flatten()
As you can see, an RGBHistogram class has been defined. The reason for this is that you
rarely ever extract features from a single image alone. You instead extract features from an
entire dataset of images. Furthermore, you expect that the features extracted from all images
utilize the same parameters — in this case, the number of bins for the histogram. It wouldn’t
make much sense to extract a histogram using 32 bins from one image and then 128 bins for
another image if you intend on comparing them for similarity.
Let’s take the code apart and understand what’s going on:
Lines 6-8: Here I am defining the constructor for the RGBHistogram. The only parameter we
need is the number of bins for each channel in the histogram. Again, this is why I prefer using
classes instead of functions for image descriptors — by putting the relevant parameters in the
constructor, you ensure that the same parameters are utilized for each image.
Line 10: You guessed it. The describe method is used to “describe” the image and return a
feature vector.
Line 15: Here we extract the actual 3D RGB Histogram (or actually, BGR since OpenCV
stores the image as a NumPy array, but with the channels in reverse order). We assume
self.bins is a list of three integers, designating the number of bins for each channel.
Line 16: It’s important that we normalize the histogram in terms of pixel counts. If we used
the raw (integer) pixel counts of an image, then shrunk it by 50% and described it again, we
would have two different feature vectors for identical images. In most cases, you want to
avoid this scenario. We obtain scale invariance by converting the raw integer pixel counts
into real-valued percentages. For example, instead of saying bin #1 has 120 pixels in it, we
would say bin #1 has 20% of all pixels in it. Again, by using the percentages of pixel counts
rather than raw, integer pixel counts, we can assure that two identical images, differing only
in size, will have (roughly) identical feature vectors.
Line 20: When computing a 3D histogram, the histogram will be represented as a NumPy
array with (N, N, N) bins. In order to more easily compute the distance between histograms,
we simply flatten this histogram to have a shape of (N ** 3,). Example: When we instantiate
our RGBHistogram, we will use 8 bins per channel. Without flattening our histogram, the
shape would be (8, 8, 8). But by flattening it, the shape becomes (512,).
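To see the effect of the normalization and flattening described above in isolation, here is a small, hypothetical NumPy demonstration; the toy array values are assumptions and stand in for a (2, 2, 2) histogram rather than the real (8, 8, 8) one.
Python
import numpy as np

# toy raw-count "histogram" (illustrative values), shape (2, 2, 2)
hist = np.array([[[120.0, 40.0], [20.0, 20.0]],
                 [[60.0, 60.0], [40.0, 40.0]]])

normalized = hist / hist.sum()     # raw counts -> fractions, giving scale invariance
flattened = normalized.flatten()   # shape (2, 2, 2) -> (8,)

print(flattened.shape)   # (8,)
print(flattened.sum())   # 1.0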
Now that we have defined our image descriptor, we can move on to the process of
indexing our dataset.
Step #2: Indexing our Dataset
Okay, so we’ve decided that our image descriptor is a 3D RGB histogram. The next step is to
apply our image descriptor to each image in the dataset.
This simply means that we are going to loop over our 25 image dataset, extract a 3D RGB
histogram from each image, store the features in a dictionary, and write the dictionary to file.
Yep, that’s it.
In reality, you can make indexing as simple or complex as you want. Indexing is a task that is
easily made parallel. If we had a four core machine, we could divide the work up between the
four cores and speedup the indexing process. But since we only have 25 images, that’s pretty
silly, especially given how fast it is to compute a histogram.
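As a rough illustration of how the indexing step could be split across cores, here is a hedged sketch using Python's multiprocessing module. The describe_image helper, the dataset path and the worker count are assumptions for illustration, not the author's actual code.
Python
# Hypothetical sketch of parallel indexing with a process pool.
import glob
from multiprocessing import Pool

import cv2

def describe_image(imagePath):
    # compute a 3D RGB histogram (8 bins per channel) for a single image
    image = cv2.imread(imagePath)
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return (imagePath, hist.flatten())

if __name__ == "__main__":
    paths = glob.glob("dataset/*.png")
    with Pool(processes=4) as pool:   # divide the work across four cores
        index = dict(pool.map(describe_image, paths))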
Let’s dive into some code: Indexing an Image Dataset using Python
Python
# import the necessary packages
from pyimagesearch.rgbhistogram import RGBHistogram
import argparse
import cPickle
import glob
import cv2
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images to be indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where the computed index will be stored")
args = vars(ap.parse_args())

# initialize the index dictionary to store our quantified
# images, with the 'key' of the dictionary being the image
# filename and the 'value' our computed features
index = {}
Alright, the first thing we are going to do is import the packages we need.
The --dataset argument is the path to where our images are stored on disk and the --index
option is the path to where we will store our index once it has been computed.
Finally, we’ll initialize our index — a builtin Python dictionary type. The key for the
dictionary will be the image filename. We’ve made the assumption that all filenames are
unique, and in fact, for this dataset, they are. The value for the dictionary will be the
computed histogram for the image.
Using a dictionary for this example makes the most sense, especially for explanation
purposes. Given a key, the dictionary points to some other object. When we use an image
filename as a key and the histogram as the value, we are implying that a given histogram H is
used to quantify and represent the image with filename K.
Again, you can make this process as simple or as complicated as you want. More complex
image descriptors make use of term frequency-inverse document frequency weighting (tf-idf)
and an inverted index, but for the time being, let’s keep it simple.
Indexing an Image Dataset using Python
Python
# initialize our image descriptor -- a 3D RGB histogram with
# 8 bins per channel
desc = RGBHistogram([8, 8, 8])
Here we instantiate our RGBHistogram. Again, we will be using 8 bins for each of the red,
green, and blue channels, respectively.
Indexing an Image Dataset using Python
Python
# use glob to grab the image paths and loop over them
for imagePath in glob.glob(args["dataset"] + "/*.png"):
    # extract our unique image ID (i.e. the filename)
    k = imagePath[imagePath.rfind("/") + 1:]

    # load the image, describe it using our RGB histogram
    # descriptor, and update the index
    image = cv2.imread(imagePath)
    features = desc.describe(image)
    index[k] = features
Here is where the actual indexing takes place. Let’s break it down:
Line 2: We use glob to grab the image paths and start to loop over our dataset.
Line 4: We extract the “key” for our dictionary. All filenames are unique in this sample
dataset, so the filename itself will be enough to serve as the key.
Lines 8-10: The image is loaded off disk and we then use our RGBHistogram to extract a
histogram from the image. The histogram is then stored in the index.
Indexing an Image Dataset using Python
Python
# we are now done indexing our image -- now we can write our
# index to disk
f = open(args["index"], "w")
f.write(cPickle.dumps(index))
f.close()
Now that our index has been computed, let’s write it to disk so we can use it for searching
later on.
Step #3: The Search
We now have our index sitting on disk, ready to be searched.
The problem is, we need some code to perform the actual search. How are we going to
compare two feature vectors and how are we going to determine how similar they are?
This question is better addressed first with some code.
Building an Image Search Engine in Python and OpenCV
Python
# import the necessary packages
import numpy as np

class Searcher:
    def __init__(self, index):
        # store our index of images
        self.index = index

    def search(self, queryFeatures):
        # initialize our dictionary of results
        results = {}

        # loop over the index
        for (k, features) in self.index.items():
            # compute the chi-squared distance between the features
            # in our index and our query features -- using the
            # chi-squared distance which is normally used in the
            # computer vision field to compare histograms
            d = self.chi2_distance(features, queryFeatures)

            # now that we have the distance between the two feature
            # vectors, we can update the results dictionary -- the
            # key is the current image ID in the index and the
            # value is the distance we just computed, representing
            # how 'similar' the image in the index is to our query
            results[k] = d

        # sort our results, so that the smaller distances (i.e. the
        # more relevant images) are at the front of the list
        results = sorted([(v, k) for (k, v) in results.items()])

        # return our results
        return results

    def chi2_distance(self, histA, histB, eps = 1e-10):
        # compute the chi-squared distance
        d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
            for (a, b) in zip(histA, histB)])

        # return the chi-squared distance
        return d
First off, most of this code is just comments, so don’t be scared by its length. Let’s
investigate what’s going on:
Lines 4-7: The first thing I do is define a Searcher class and a constructor with a single
parameter — the index. This index is assumed to be the index dictionary that we wrote to file
during the indexing step.
Line 11: We define a dictionary to store our results. The key is the image filename (from the
index) and the value is how similar the given image is to the query image.
Lines 14-26: Here is the part where the actual searching takes place. We loop over the image
filenames and corresponding features in our index. We then use the chi-squared distance to
compare our color histograms. The computed distance is then stored in the results dictionary,
indicating how similar the two images are to each other.
Lines 30-33: The results are sorted in terms of relevancy (the smaller the chi-squared
distance, the more relevant/similar the image) and returned.
Lines 35-41: Here we define the chi-squared distance function used to compare the two
histograms. In general, the difference between large bins vs. small bins is less important and
should be weighted as such. This is exactly what the chi-squared distance does. We provide
an epsilon dummy value to avoid those pesky “divide by zero” errors. Images will be
considered identical if their feature vectors have a chi-squared distance of zero. The larger the
distance gets, the less similar they are.
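To make this concrete, here is a small worked example (toy histograms, not part of the original tutorial) applying the same formula to two tiny 3-bin histograms:
Python
# a minimal sketch of the chi-squared distance on toy 3-bin histograms
import numpy as np

def chi2_distance(histA, histB, eps = 1e-10):
    # same formula as in the Searcher class above
    return 0.5 * np.sum(((histA - histB) ** 2) / (histA + histB + eps))

a = np.array([0.5, 0.3, 0.2])   # query histogram
b = np.array([0.5, 0.3, 0.2])   # identical histogram
c = np.array([0.1, 0.1, 0.8])   # very different histogram

print(chi2_distance(a, b))   # 0.0 -- identical feature vectors
print(chi2_distance(a, c))   # roughly 0.36 -- larger distance, less similar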
So there you have it, a Python class that can take an index and perform a search.
Now it’s time to put this searcher to work.
Step #4: Performing a Search
Finally. We are closing in on a functioning image search engine.
But we’re not quite there yet. We need a little extra code to handle loading the images off
disk and performing the search:
Building an Image Search Engine in Python and OpenCV
Python
# import the necessary packages
from pyimagesearch.searcher import Searcher
import numpy as np
import argparse
import cPickle
import cv2
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
help = "Path to the directory that contains the images we just indexed")
ap.add_argument("-i", "--index", required = True,
help = "Path to where we stored our index")
args = vars(ap.parse_args())
# load the index and initialize our searcher
index = cPickle.loads(open(args["index"]).read())
searcher = Searcher(index)
First things first. Import the packages that we will need. We then define our arguments in the
same manner that we did during the indexing step. Finally, we use cPickle to load our index
off disk and initialize our Searcher.
Python
# loop over images in the index -- we will use each one as
# a query image
for (query, queryFeatures) in index.items():
    # perform the search using the current query
    results = searcher.search(queryFeatures)
    # load the query image and display it
    path = args["dataset"] + "/%s" % (query)
    queryImage = cv2.imread(path)
    cv2.imshow("Query", queryImage)
    print "query: %s" % (query)
    # initialize the two montages to display our results --
    # we have a total of 25 images in the index, but let's only
    # display the top 10 results; 5 images per montage, with
    # images that are 400x166 pixels
    montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
    montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
    # loop over the top ten results
    for j in xrange(0, 10):
        # grab the result (we are using row-major order) and
        # load the result image
        (score, imageName) = results[j]
        path = args["dataset"] + "/%s" % (imageName)
        result = cv2.imread(path)
        print "\t%d. %s : %.3f" % (j + 1, imageName, score)
        # check to see if the first montage should be used
        if j < 5:
            montageA[j * 166:(j + 1) * 166, :] = result
        # otherwise, the second montage should be used
        else:
            montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
    # show the results
    cv2.imshow("Results 1-5", montageA)
    cv2.imshow("Results 6-10", montageB)
    cv2.waitKey(0)
Most of this code handles displaying the results. The actual “search” is done in a single line
(#31). Regardless, let’s examine what’s going on:
Line 3: We are going to treat each image in our index as a query and see what results we get
back. Normally, queries are external and not part of the dataset, but before we get to that,
let’s just perform some example searches.
Line 5: Here is where the actual search takes place. We treat the current image as our query
and perform the search.
Lines 8-11: Load and display our query image.
Lines 17-35: In order to display the top 10 results, I have decided to use two montage images.
The first montage shows results 1-5 and the second montage results 6-10. The name of the
image and distance is provided on Line 27.
Lines 38-40: Finally, we display our search results to the user.
So there you have it. An entire image search engine in Python.
Figure: Search Results using Mordor-002.png as a query. Our image search engine is able to
return images from Mordor and the Black Gate.
Let’s start at the ending of The Return of the King using Frodo and Sam’s ascent into the
volcano as our query image. As you can see, our top 5 results are from the “Mordor”
category.
Perhaps you are wondering why the query image of Frodo and Sam is also the image in the
#1 result position? Well, let’s think back to our chi-squared distance. We said that an image
would be considered “identical” if the distance between the two feature vectors is zero. Since
we are using images we have already indexed as queries, they are in fact identical and will
have a distance of zero. Since a value of zero indicates perfect similarity, the query image appears
in the #1 result position.
Now, let’s try another image, this time using The Goblin King in Goblin Town:
Figure: Search Results using Goblin-004.png as a query. The top 5 images returned are from
Goblin Town.
The Goblin King doesn’t look very happy. But we sure are happy that all five images from
Goblin Town are in the top 10 results.
Finally, here are three more example searches for Dol-Guldur, Rivendell, and The Shire.
Again, we can clearly see that all five images from their respective categories are in the top
10 results.
Figure: Using images from Dol-Guldur (Dol-Guldur-004.png), Rivendell (Rivendell-
003.png), and The Shire (Shire-002.png) as queries.
But clearly, this is not how all image search engines work. Google allows you
to upload an image of your own. TinEye allows you to upload an image of your own. Why
can’t we? Let’s see how we can perform a search using an image that we haven’t already
indexed:
Building an Image Search Engine using Python and OpenCV
Python
# import the necessary packages
from pyimagesearch.rgbhistogram import RGBHistogram
from pyimagesearch.searcher import Searcher
import numpy as np
import argparse
import cPickle
import cv2
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images we just indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where we stored our index")
ap.add_argument("-q", "--query", required = True,
    help = "Path to query image")
args = vars(ap.parse_args())
# load the query image and show it
queryImage = cv2.imread(args["query"])
cv2.imshow("Query", queryImage)
print "query: %s" % (args["query"])
# describe the query in the same way that we did in
# index.py -- a 3D RGB histogram with 8 bins per
# channel
desc = RGBHistogram([8, 8, 8])
queryFeatures = desc.describe(queryImage)
# load the index and perform the search
index = cPickle.loads(open(args["index"]).read())
searcher = Searcher(index)
results = searcher.search(queryFeatures)
# initialize the two montages to display our results --
# we have a total of 25 images in the index, but let's only
# display the top 10 results; 5 images per montage, with
# images that are 400x166 pixels
montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
# loop over the top ten results
for j in xrange(0, 10):
    # grab the result (we are using row-major order) and
    # load the result image
    (score, imageName) = results[j]
    path = args["dataset"] + "/%s" % (imageName)
    result = cv2.imread(path)
    print "\t%d. %s : %.3f" % (j + 1, imageName, score)
    # check to see if the first montage should be used
    if j < 5:
        montageA[j * 166:(j + 1) * 166, :] = result
    # otherwise, the second montage should be used
    else:
        montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
# show the results
cv2.imshow("Results 1-5", montageA)
cv2.imshow("Results 6-10", montageB)
cv2.waitKey(0)
Lines 2-17: This should feel like pretty standard stuff by now. We are importing our packages
and setting up our argument parser, although you should note the new argument --query. This
is the path to our query image.
Lines 20-21: We’re going to load your query image and show it to you, just in case you
forgot what your query image is.
Lines 27-28: Instantiate our RGBHistogram with the exact same number of bins as during our
indexing step. We then extract features from our query image.
Lines 31-33: Load our index off disk using cPickle and perform the search.
Lines 39-62: Just as in the code above, this block simply displays our search results.
This time we will use two images that are not in our index: one of Rivendell and one of The Shire. These two images will be our queries.
Check out the results below:
Figure: Using external Rivendell (Left) and the Shire (Right) query images. For both cases,
we find the top 5 search results are from the same category.
In this case, we searched using two images that we haven’t seen previously. The one on the
left is of Rivendell. We can see from our results that the other 5 Rivendell images in our
index were returned, demonstrating that our image search engine is working properly.
On the right, we have a query image from The Shire. Again, this image is not present in our
index. But when we look at the search results, we can see that the other 5 Shire images were
returned from the image search engine, once again demonstrating that our image search
engine is returning semantically similar images.
Summary
Here we’ve explored how to create an image search engine from start to finish.
The first step was to choose an image descriptor — we used a 3D RGB histogram to
characterize the color of our images. We then indexed each image in our dataset using our
descriptor by extracting feature vectors (i.e. the histograms). From there, we used the chi-
squared distance to define “similarity” between two images. Finally, we glued all the pieces
together and created a Lord of the Rings image search engine.
Having worked through the above example, we will now see that, although text search and
image search differ, they are closely related to each other.
Most webmasters don’t see any difference between image alt text and title, mostly keeping
them the same. A great discussion over at Google Webmaster Groups provides exhaustive
information on the differences between an image alt attribute and an image title, and standard
recommendations on how to use them.
Alt text is meant to be an alternative information source for those people who have chosen
to disable images in their browsers and those user agents that are simply unable to “see” the
images. It should describe what the image is about and get those visitors interested to see it.
Without alt text, an image will be displayed as an empty icon. In Internet Explorer, alt text also
pops up when you hover over an image.
Plus, Google officially confirmed it mainly focuses on alt text when trying to understand
what an image is about. Image title (and the element name speaks for itself) should provide
additional information and follow the rules of the regular title: it should be relevant, short,
catchy, and concise (a title “offers advisory information about the element for which it is
set”). In Firefox and Opera it pops up when you hover over an image:
So based on the above, we can discuss how to properly handle them:
 Both tags are primarily meant for visitors (though alt text seems more important for
crawlers) – so provide explicit information on an image to encourage views.
 Include your main keywords in both, but change them up. Keyword stuffing in alt
text and title is still keyword stuffing, so keep them relevant and meaningful.
Another good point to take into consideration:
 According to Aaron Wall, alt text is crucially important when used for a site-
wide header banner.
One of the reasons Aaron Wall was so motivated to change the tagline of his site recently
was that the new site design contained the site's logo as a background image. The logo
link was a regular static link, but it had no anchor text, only a link title to describe the link. If
you do not look at the source code, the link title attribute can seem like an image alt tag when
you scroll over it, but to a search engine they do not look the same. A link title is not
weighted anywhere near as aggressively as an image alt tag is.
The old link title on the header link for his site was “search engine optimization book”. While
his site ranks #6 and #8 for that query in Google, neither of the ranking pages is the
homepage (the tools page and sales letter rank). That shows that Google currently places
negligible, if any, weight on link titles.
If the only link to your homepage is a logo, check the source code to verify you are using
descriptive image alt text.
Conclusion & Recommendations
Conclusion:
We conclude that, with the help of the above-mentioned technologies, any company can build
its image through these search engines in a way that is also helpful to society.
While nobody can guarantee top level positioning in search engine organic results, proper
search engine optimization can help. Because the search engines, such as Google, Yahoo!,
and Bing, are so important today it is necessary to make each page in a Web site conform to
the principles of good SEO as much as possible.
To do this it is necessary to:
 Understand the basics of how search engines rate sites
 Use proper keywords and phrases throughout the Web site
 Avoid giving the appearance of spamming the search engines
 Write all text for real people, not just for search engines
 Use well-formed alternate attributes on images
 Make sure that the necessary meta tags (and title tag) are installed in the head of
each Web page
 Have good incoming links to establish popularity
 Make sure the Web site is regularly updated so that the content is fresh
Recommendations:
The following recommendations are made on the basis of the overall usage of recommender systems:
Overview
Recommender systems typically produce a list of recommendations in one of two ways –
through collaborative or content-based filtering. Collaborative filtering approaches build
a model from a user's past behaviour (items previously purchased or selected and/or
numerical ratings given to those items) as well as similar decisions made by other users; then
use that model to predict items (or ratings for items) that the user may have an interest in.
Content-based filtering approaches utilize a series of discrete characteristics of an item in
order to recommend additional items with similar properties. These approaches are often
combined.
Main Article
Collaborative filtering:
One approach to the design of recommender systems that has seen wide use is collaborative
filtering. Collaborative filtering methods are based on collecting and analyzing a large
amount of information on users’ behaviors, activities or preferences and predicting what
users will like based on their similarity to other users. A key advantage of the collaborative
filtering approach is that it does not rely on machine analyzable content and therefore it is
capable of accurately recommending complex items such as movies without requiring an
"understanding" of the item itself. Many algorithms have been used in measuring user
similarity or item similarity in recommender systems. For example, the k-nearest neighbour
(k-NN) approach and the Pearson Correlation.
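As an illustration only (toy ratings invented here, not part of the case study), a minimal user-based sketch might compute the Pearson correlation between two users' rating vectors over the items both have rated:
Python
# a minimal sketch of user-user similarity via Pearson correlation
# (toy ratings, items rated 1-5, 0 = not rated)
import numpy as np

userA = np.array([5, 3, 0, 1, 4])
userB = np.array([4, 2, 0, 1, 3])

# only compare items both users have actually rated
mask = (userA > 0) & (userB > 0)
a, b = userA[mask], userB[mask]

# Pearson correlation of the co-rated items
similarity = np.corrcoef(a, b)[0, 1]
print(similarity)   # about 0.98 here -- the two users' tastes are highly correlated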
Collaborative Filtering is based on the assumption that people who agreed in the past will
agree in the future, and that they will like similar kinds of items as they liked in the past.
When building a model from a user's profile, a distinction is often made between explicit and
implicit forms of data collection.
Examples of explicit data collection include the following:
 Asking a user to rate an item on a sliding scale.
 Asking a user to search.
 Asking a user to rank a collection of items from favourite to least favourite.
 Presenting two items to a user and asking him/her to choose the better one of them.
 Asking a user to create a list of items that he/she likes.
Examples of implicit data collection include the following:
 Observing the items that a user views in an online store.
 Analysing item/user viewing times
 Keeping a record of the items that a user purchases online.
 Obtaining a list of items that a user has listened to or watched on his/her computer.
 Analyzing the user's social network and discovering similar likes and dislikes
The recommender system compares the collected data to similar and dissimilar data collected
from others and calculates a list of recommended items for the user. Several commercial and
non-commercial examples are listed in the article on collaborative filtering systems.
One of the most famous examples of collaborative filtering is item-to-item collaborative
filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's
recommender system.
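As a rough illustration of the "people who buy x also buy y" idea (a toy sketch with invented baskets, not Amazon's actual algorithm), item-to-item recommendation can be approximated by counting how often pairs of items appear in the same purchase history:
Python
# a minimal sketch of item-to-item co-occurrence counting (toy data)
from collections import defaultdict
from itertools import combinations

# toy purchase histories, one set of items per customer
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "tripod"},
    {"memory card", "case"},
]

co_counts = defaultdict(int)
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

# items most often bought together with "camera"
related = sorted(((c, y) for (x, y), c in co_counts.items() if x == "camera"),
                 reverse = True)
print(related)   # memory card and tripod come out on top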
Facebook, MySpace, LinkedIn, and other social networks use collaborative filtering to
recommend new friends, groups, and other social connections (by examining the network of
connections between a user and their friends). Twitter uses many signals and in-memory
computations for recommending who to follow to its users.
Collaborative filtering approaches often suffer from three problems: Cold Start, Scalability,
and Sparsity.
Cold Start: These systems often require a large amount of existing data on a user in order to
make accurate recommendations.
Scalability: In many of the environments that these systems make recommendations in, there
are millions of users and products. Thus, a large amount of computation power is often
necessary to calculate recommendations.
Sparsity: The number of items sold on major e-commerce sites is extremely large. The most
active users will only have rated a small subset of the overall database. Thus, even the most
popular items have very few ratings. A particular type of collaborative filtering algorithm uses
matrix factorization, a low-rank matrix approximation technique.
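As an illustrative sketch of the low-rank idea (toy numbers, not a production recommender), a small ratings matrix can be approximated with a truncated SVD using numpy, and the reconstruction used to score the entries that were never rated:
Python
# a minimal sketch of low-rank matrix approximation on a toy ratings matrix
import numpy as np

# rows = users, columns = items, 0 = unrated (toy data)
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

# truncated SVD keeping k = 2 latent factors
U, s, Vt = np.linalg.svd(R, full_matrices = False)
k = 2
R_hat = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

# the reconstruction gives predicted scores for the unrated entries
print(np.round(R_hat, 2))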
Collaborative filtering methods are classified as memory-based and model-based collaborative
filtering. A well-known example of a memory-based approach is the user-based algorithm,
and of a model-based approach the Kernel-Mapping Recommender.
Content-based filtering
Another common approach when designing recommender systems is content-based filtering.
Content-based filtering methods are based on a description of the item and a profile of the
user’s preference. In a content-based recommender system, keywords are used to describe
the items; in addition, a user profile is built to indicate the type of item this user likes. In other
words, these algorithms try to recommend items that are similar to those that a user liked in
the past (or is examining in the present). In particular, various candidate items are compared
with items previously rated by the user and the best-matching items are recommended. This
approach has its roots in information retrieval and information filtering research.
To abstract the features of the items in the system, an item presentation algorithm is applied.
A widely used algorithm is the tf–idf representation (short for term frequency–inverse document
frequency, a numerical statistic that is intended to reflect how important a word is to a
document in a collection), also called the vector space representation.
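As a small hedged example (it uses scikit-learn, which is an added assumption and not part of this project's own code), item descriptions can be turned into tf-idf vectors and compared with cosine similarity:
Python
# a minimal sketch of tf-idf item representation (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "action movie with car chases and explosions",
    "romantic comedy set in a small town",
    "explosive action thriller with car chases",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)

# similarity of the first item to all items
print(cosine_similarity(X[0], X).round(2))   # the third description scores highest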
To create user profile, the system mostly focuses on two types of information:
1) A model of the user's preference.
2) A history of the user's interaction with the recommender system.
Basically, these methods use an item profile (i.e. a set of discrete attributes and features)
characterizing the item within the system. The system creates a content-based profile of users
based on a weighted vector of item features. The weights denote the importance of each
feature to the user and can be computed from individually rated content vectors using a
variety of techniques.
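A minimal sketch of such a weighted profile (toy feature vectors and ratings, not taken from the case study) could average the vectors of rated items, weighted by the ratings, and then score an unseen item by cosine similarity:
Python
# a minimal sketch of a content-based user profile (toy feature vectors)
import numpy as np

# item feature vectors, e.g. [action, romance, comedy]
items = {
    "film A": np.array([0.9, 0.1, 0.0]),
    "film B": np.array([0.8, 0.0, 0.2]),
    "film C": np.array([0.1, 0.9, 0.3]),
}
ratings = {"film A": 5.0, "film B": 4.0}   # items the user has rated

# user profile = rating-weighted average of rated item vectors
profile = sum(ratings[name] * items[name] for name in ratings) / sum(ratings.values())

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# score the unseen item against the profile
print(cosine(profile, items["film C"]))   # low score -- unlikely to be recommended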
Simple approaches use the average values of the rated item vector, while more sophisticated
methods use machine learning techniques such as the following (a brief sketch of the first of
these appears after the list):
 Bayesian Classifiers (in machine learning, naive Bayes classifiers are a family of
simple probabilistic classifiers based on applying Bayes' theorem with strong (naive)
independence assumptions between the features).
 Cluster Analysis (the task of grouping a set of objects in such a way that objects in
the same group, called a cluster, are more similar in some sense to each other than to
those in other groups).
 Decision Trees (a decision tree is a decision support tool that uses a tree-like graph
or model of decisions and their possible consequences, including chance event
outcomes, resource costs, and utility).
 Artificial Neural Networks (in machine learning, artificial neural networks (ANNs)
are a family of statistical learning algorithms inspired by biological neural networks,
the central nervous systems of animals, in particular the brain, and are used to
estimate or approximate functions that can depend on a large number of inputs and
are generally unknown), in order to estimate the probability that the user is going to like
the item.
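For instance, a tiny like/dislike classifier over item features might look like the following sketch (scikit-learn and the toy keyword counts are assumptions made purely for illustration):
Python
# a minimal sketch of a naive Bayes like/dislike classifier (assumes scikit-learn)
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy item features: counts of [action, romance, comedy] keywords per description
X_train = np.array([[5, 0, 1],
                    [4, 1, 0],
                    [0, 5, 2],
                    [1, 4, 3]])
y_train = np.array([1, 1, 0, 0])   # 1 = user liked the item, 0 = disliked

clf = MultinomialNB()
clf.fit(X_train, y_train)

# probability that the user will like a new, unseen item
new_item = np.array([[3, 1, 0]])
print(clf.predict_proba(new_item)[0][1])   # high probability -- mostly "action" features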
Direct feedback from a user, usually in the form of a like or dislike button, can be used to
assign higher or lower weights on the importance of certain attributes.
A key issue with content-based filtering is whether the system is able to learn user
preferences from user's actions regarding one content source and use them across other
content types. When the system is limited to recommending content of the same type as the
user is already using, the value from the recommendation system is significantly less than
when other content types from other services can be recommended. For example,
recommending news articles based on browsing of news is useful, but it's much more useful
when music, videos, products, discussions etc. from different services can be recommended
based on news browsing.
Hybrid Recommender Systems
Recent research has demonstrated that a hybrid approach, combining collaborative filtering
and content-based filtering could be more effective in some cases. Hybrid approaches can be
implemented in several ways: by making content-based and collaborative-based predictions
separately and then combining them; by adding content-based capabilities to a collaborative-
based approach (and vice versa); or by unifying the approaches into one model.
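The simplest of these, combining separately computed predictions, can be sketched as a weighted blend; the 0.7/0.3 weights and the scores below are arbitrary illustrations, not values from any real system:
Python
# a minimal sketch of a weighted hybrid: blend two prediction sources per item
collaborative_scores = {"item1": 4.2, "item2": 3.1, "item3": 4.8}
content_scores       = {"item1": 3.9, "item2": 4.5, "item3": 4.0}

w_collab, w_content = 0.7, 0.3   # illustrative weights only

hybrid = {item: w_collab * collaborative_scores[item] + w_content * content_scores[item]
          for item in collaborative_scores}

# recommend the highest blended score first
for item, score in sorted(hybrid.items(), key = lambda kv: kv[1], reverse = True):
    print(item, round(score, 2))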
Several studies empirically compare the performance of the hybrid with the pure
collaborative and content-based methods and demonstrate that the hybrid methods can provide
more accurate recommendations than pure approaches. These methods can also be used to
overcome some of the common problems in recommender systems such as cold start and the
sparsity problem.
Netflix is a good example of hybrid systems. They make recommendations by comparing the
watching and searching habits of similar users (i.e. collaborative filtering) as well as by
offering movies that share characteristics with films that a user has rated highly (content-
based filtering).
A variety of techniques have been proposed as the basis for recommender systems:
collaborative, content-based, knowledge-based, and demographic techniques. Each of these
techniques has known shortcomings, such as the well-known cold-start problem for
collaborative and content-based systems (what to do with new users with few ratings) and the
knowledge engineering bottleneck in knowledge-based approaches (a knowledge base being a
technology used to store complex structured and unstructured information used by a computer
system).
A hybrid recommender system is one that combines multiple techniques together to achieve
some synergy between them.
Collaborative: The system generates recommendations using only information about rating
profiles for different users. Collaborative systems locate peer users with a rating history
similar to the current user and generate recommendations using this neighbourhood.
Content-based: The system generates recommendations from two sources: the features
associated with products and the ratings that a user has given them. Content-based
recommenders treat recommendation as a user-specific classification problem and learn a
classifier for the user's likes and dislikes based on product features.
Demographic: A demographic recommender provides recommendations based on a
demographic profile of the user. Recommended products can be produced for different
demographic niches, by combining the ratings of users in those niches.
Knowledge-based: A knowledge-based recommender suggests products based on inferences
about a user’s needs and preferences. This knowledge will sometimes contain explicit
functional knowledge about how certain product features meet user needs.
The term hybrid recommender system is used here to describe any recommender system that
combines multiple recommendation techniques together to produce its output. There is no
reason why several different techniques of the same type could not be hybridized, for
example, two different content-based recommenders could work together, and a number of
projects have investigated this type of hybrid: NewsDude, which uses both naive Bayes and
kNN classifiers in its news recommendations, is just one example.
Seven hybridization techniques:
1) Weighted: The score of different recommendation components are combined
numerically.
2) Switching: The system chooses among recommendation components and applies the
selected one.
3) Mixed: Recommendations from different recommenders are presented together.
4) Feature Combination: Features derived from different knowledge sources are
combined together and given to a single recommendation algorithm.
5) Feature Augmentation: One recommendation technique is used to compute a feature
or set of features, which is then part of the input to the next technique.
6) Cascade: Recommenders are given strict priority, with the lower priority ones
breaking ties in the scoring of the higher ones.
7) Meta-level: One recommendation technique is applied and produces some sort of
model, which is then the input used by the next technique.
Beyond Accuracy
Typically, research on recommender systems is concerned with finding the most accurate
recommendation algorithms. However, there are a number of other factors that are also important.
Diversity - Users tend to be more satisfied with recommendations when there is a higher
intra-list diversity, i.e. items from e.g. different artists.
Recommender Persistence - In some situations it is more effective to re-show
recommendations, or let users re-rate items, than showing new items. There are several
reasons for this. Users may ignore items when they are shown for the first time, for instance,
because they had no time to inspect the recommendations carefully.
Privacy - Recommender systems usually have to deal with privacy concerns because
users have to reveal sensitive information. Building user profiles using collaborative filtering
can be problematic from a privacy point of view. Many European countries have a strong
culture of data privacy and every attempt to introduce any level of user profiling can result in
a negative customer response. A number of privacy issues arose around the dataset offered by
Netflix for the Netflix Prize competition. Although the data sets were anonymised in order to
preserve customer privacy, in 2007, two researchers from the University of Texas were able
to identify individual users by matching the data sets with film ratings on the Internet Movie
Database. As a result, in December 2009, an anonymous Netflix user sued Netflix in Doe v.
Netflix, alleging that Netflix had violated U.S. fair trade laws and the Video Privacy
Protection Act by releasing the datasets. This led in part to the cancellation of a second
Netflix Prize competition in 2010. Much research has been conducted on ongoing privacy
issues in this space. Ramakrishnan et al. have conducted an extensive overview of the trade-
offs between personalization and privacy and found that the combination of weak ties
and other data sources can be used to uncover identities of users in an anonymised dataset.
User Demographics - Beel et al. found that user demographics may influence how satisfied
users are with recommendations. In their paper they show that elderly users tend to be
more interested in recommendations than younger users.
Robustness - When users can participate in the recommender system, the issue of fraud must
be addressed.
Serendipity - Serendipity is a measure of "how surprising the recommendations are". For
instance, a recommender system that recommends milk to a customer in a grocery store,
might be perfectly accurate but still it is not a good recommendation because it is an obvious
item for the customer to buy.
Trust - A recommender system is of little value for a user if the user does not trust the
system. Trust can be built by a recommender system by explaining how it generates
recommendations, and why it recommends an item.
Labelling - User satisfaction with recommendations may be influenced by the labeling of the
recommendations. Click-through rate (CTR) is a way of measuring the success of an online
advertising campaign for a particular website, as well as the effectiveness of an email campaign,
by the number of users that clicked on a specific link. In the cited study, the CTR for
recommendations labelled as "Sponsored" was lower (CTR = 5.93%) than the CTR for
identical recommendations labelled as "Organic" (CTR = 8.86%). Interestingly,
recommendations with no label performed best (CTR = 9.87%) in that study.
Mobile Recommender Systems
One growing area of research in the area of recommender systems is mobile recommender
systems. With the increasing ubiquity of internet-accessing smart phones, it is now possible to
offer personalized, context-sensitive recommendations. This is a particularly difficult area of
research, as mobile data is more complex than the data that recommender systems often have to deal with
(it is heterogeneous, noisy, requires spatial and temporal auto-correlation, and has validation
and generality problems). Additionally, mobile recommender systems suffer from a
transplantation problem - recommendations may not apply in all regions (for instance, it
would be unwise to recommend a recipe in an area where all of the ingredients may not be
available).
One example of a mobile recommender system is one that offers potentially profitable
driving routes for taxi drivers in a city. This system takes as input data in the form of GPS
traces of the routes that taxi drivers took while working, which include location (latitude and
longitude), time stamps, and operational status (with or without passengers). It then
recommends a list of pickup points along a route that will lead to optimal occupancy times
and profits. This type of system is obviously location-dependent, and as it must operate on a
handheld or embedded device, the computation and energy requirements must remain low.
Another example of a mobile recommender system is the one developed by Bouneffouf et al. (2012) for
professional users. This system takes as input the GPS traces of the user and his or her agenda,
and suggests suitable information depending on the user's situation and interests. The system uses
machine learning techniques and a reasoning process in order to dynamically adapt the mobile
system to the evolution of the user's interests. The authors called their algorithm hybrid-ε-greedy.
Mobile recommendation systems have also been successfully built using the Web of Data as
a source for structured information. A good example of such a system is
SMARTMUSEUM. The system uses semantic modelling, information retrieval and
machine learning techniques in order to recommend contents matching user’s interest, even
when the evidence of user's interests is initially vague and based on heterogeneous
information.
Risk-Aware Recommender Systems
The majority of existing approaches to RS focus on recommending the most relevant
documents to the users using the contextual information and do not take into account the risk
of disturbing the user in specific situations. However, in many applications, such as
recommending personalized content, it is also important to incorporate the risk of upsetting
the user into the recommendation process, so that documents are not recommended in
certain circumstances, for instance during a professional meeting, early in the morning, or late at night.
Therefore, the performance of the RS depends on the degree to which it has incorporated the
risk into the recommendation process.
Risk Definition: "The risk in recommender systems is the possibility to disturb or to upset
the user which leads to a bad answer of the user".
In response to these problems, the authors have developed a dynamic risk-sensitive
recommendation system called DRARS (Dynamic Risk-Aware Recommender System),
which models context-aware recommendation as a bandit problem. This system combines
a content-based technique and a contextual bandit algorithm. They have shown that DRARS
improves on the Upper Confidence Bound (UCB) policy, the best algorithm currently available,
by calculating the optimal exploration value needed to maintain a trade-off between exploration
and exploitation based on the risk level of the current user's situation. The authors conducted
experiments in an industrial context with real data and real users and have shown that taking
into account the risk level of users' situations significantly increases the performance of the
recommender system.
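As background only (a generic sketch, not the DRARS implementation itself), the UCB idea of balancing exploration and exploitation can be illustrated with the standard UCB1 rule for choosing which item to recommend next; the click rates below are invented toy values:
Python
# a minimal sketch of the UCB1 rule for picking the next item to recommend
import math
import random

n_items = 3
counts  = [0] * n_items               # how many times each item was recommended
rewards = [0.0] * n_items             # accumulated reward (e.g. clicks) per item
true_click_rates = [0.2, 0.5, 0.8]    # hidden toy probabilities, unknown to the algorithm

for t in range(1, 1001):
    # recommend each item once first, then use the UCB1 score
    if 0 in counts:
        item = counts.index(0)
    else:
        ucb = [rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(n_items)]
        item = ucb.index(max(ucb))
    # simulate the user's response and update the statistics
    counts[item] += 1
    rewards[item] += 1.0 if random.random() < true_click_rates[item] else 0.0

print(counts)   # the best item (index 2) ends up recommended most often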
The Netflix Prize
One of the key events that energized research in recommender systems was the Netflix prize.
From 2006 to 2009, Netflix sponsored a competition, offering a grand prize of $1,000,000 to
the team that could take an offered dataset of over 100 million movie ratings and return
recommendations that were 10% more accurate than those offered by the company's existing
recommender system. This competition energized the search for new and more accurate
algorithms. On 21 September 2009, the grand prize of US$1,000,000 was given to the
BellKor's Pragmatic Chaos team using tiebreaking rules.
The most accurate algorithm in 2007 used an ensemble method of 107 different algorithmic
approaches, blended into a single prediction.
As the winning team reported, predictive accuracy is substantially improved when blending
multiple predictors; their experience was that most effort should be concentrated on deriving
substantially different approaches, rather than refining a single technique, and consequently
their solution was an ensemble of many methods. Many benefits accrued to the web due to the Netflix project.
A second contest was planned, but was ultimately cancelled in response to an ongoing lawsuit
and concerns from the Federal Trade Commission. (The Federal Trade Commission (FTC) is
an independent agency of the United States government, established in 1914 by the Federal
Trade Commission Act. Its principal mission is the promotion of consumer protection and the
elimination and prevention of anticompetitive business practices, such as coercive monopoly.)
Multi-criteria Recommender Systems
Multi-Criteria Recommender Systems (MCRS) can be defined as Recommender Systems that
incorporate preference information upon multiple criteria. Instead of developing
recommendation techniques based on a single criterion value (the overall preference of user
u for item i), these systems try to predict a rating for unexplored items of u by exploiting
preference information on multiple criteria that affect this overall preference value. Several
researchers approach MCRS as a Multi-criteria Decision Making (MCDM) problem, and
apply MCDM methods and techniques to implement MCRS systems.
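A very small sketch of this idea (toy numbers, and a plain least-squares fit rather than any specific MCDM method) is to learn how a user's overall rating depends on the per-criterion ratings and then predict the overall rating of an unexplored item from its criteria:
Python
# a minimal sketch: predict an overall rating from multi-criteria ratings (toy data)
import numpy as np

# per-criterion ratings for items the user has already rated:
# columns = [story, acting, visuals]
criteria = np.array([[5., 4., 3.],
                     [2., 3., 5.],
                     [4., 4., 4.],
                     [1., 2., 2.]])
overall = np.array([4.5, 3.0, 4.0, 1.5])   # the user's overall ratings

# least-squares weights for how each criterion drives the overall rating
w = np.linalg.lstsq(criteria, overall, rcond = None)[0]

# predict the overall preference for an unexplored item from its criteria
new_item = np.array([5., 3., 4.])
print(float(new_item.dot(w)))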
The limitations of SEO
One important characteristic of an expert and professional consultant is that he or she
understands that the theories and techniques in their field are subject to quite concrete and
specific limitations. In this page, I review the important ones that apply to search engine
optimization and website marketing and their practical significance.
SEO is hand made to order. If you have planned a product launch, a wedding or been a party
to a lawsuit, you know that the best laid plans are seldom executed without some major
changes. Life is simply too complex. SEO is another one of those human activities because,
to be effective, it is hand made for a specific site and business.
There are other important limitations which you need to understand and take into
consideration.
Searching is an evolving process from the point of view of providers (the search engines),
users and website owners. What worked yesterday may not work today, and may even be counter-
productive or harmful tomorrow. As a result, monitoring or regular checks of the key search
engines and directories are required to maintain a high ranking once it is achieved.
Quality is everything. Since virtually everything that we do to improve a site's ranking will be
known to anyone who knows how to get it, innovations tend to be short-lived. Moreover,
search engines are always on the lookout for exploits that manipulate their ranking
algorithms. The only thing that cannot be copied or exploited is high-quality, valuable
content, especially when others link to it for those reasons. Only even higher-quality and more
valuable content trumps it.
Page 72 of 73 WE SCHOOL
The cost of SEO is rising. More expertise is required than before, and this trend will continue. The
techniques employed are more sophisticated, complex and time consuming. There are fewer
worthwhile search engines and directories that offer free listings. Paid placement costs are
rising and the best keywords are expensive.
The search lottery. Search engines collect only a fraction of the billions of sites' pages for
various technological reasons which change over time but nonetheless will mean for the
foreseeable future that searching is akin to a lottery. SEO improves the odds but cannot
remove the uncertainty altogether.
SEO is a marketing exercise and, accordingly, the same old business rules apply. You can sell
almost anything to someone once but businesses are built and prosper through repeat
customers to whom the reputation, brand or goodwill is important. Content quality and value
is the key and that remains elusive, expensive and difficult to source but, in websites, is the
only basis of effective marketing using SEO techniques.
Suffice to say, if your site is included in a search engine and you achieve a high enough
ranking for your requirements, then these limitations are costs of doing business in
cyberspace.
Bibliography
The subject of the project itself, search engines, shaped the bibliography.
The entire project was done with the help of the following sites and with guidance from
Mr C.P. Venkatesh in a Digital Marketing Workshop:
www.google.com
www.searchcounsel.com
www.searchenginejournal.com
www.pyimagesearch.com
www.seobook.com

More Related Content

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

Project 1 Search Engine

  • 1. Page 1 of 73 WE SCHOOL CASE STUDY – SEARCH ENGINES AS IMAGE BUILDERS HEENA JAISINGHANI DPGD/JL13/1836 SPECIALIZATION: GENERAL MANAGEMENT PRIN. L.N. WELINGKAR INSTITUE OF MANAGEMENT DEVELOPMENT & RESEARCH YEAR OF SUBMISSION: MARCH 2015
  • 2. Page 2 of 73 WE SCHOOL ANNEXURE 1 FLOW CHART INDICATING THE BASIC ELEMENTS OF THE PROJECT To reach Search Engine as Image Builder we need to know Search Engine Optimization (SEO) Code techniques that minimize the use of flash and frames Keywords or Keyword Phrase that fits the tatrget market Linking Strategy
  • 3. Page 3 of 73 WE SCHOOL
  • 4. Page 4 of 73 WE SCHOOL ANNEXURE 3 UNDERTAKING BY CANDIDATE I declare that project entitled Case Study: Search Engine as Image Builders is my own work conducted as part of my syllabus. I further declare that the project work presented has been prepared personally by me and it is not sourced from any outside agency. I understand that, any such malpractice will have very serious consequence and my admission to the program will be cancelled without any refund of fees. I am also aware that, I may face legal action, if I follow such malpractice. Heena Jaisinghani (Signature of Candidate)
  • 5. Page 5 of 73 WE SCHOOL Table of contents  Introduction  Background  Methodology  Conclusions & Recommendations  Limitations  Bibliography
  • 6. Page 6 of 73 WE SCHOOL Introduction Search Engine as Image Builder has two point of view which as follows 1) Using Search Engine how does the company Builds their Image & 2) How does Image gets Build through Search Engine Spider Simulator with Search Engine Optimization Let us start with 2 topic first which will conclude us to 1 topic Search Engine Optimization (SEO) is the process not only involves of making web pages easy to find, easy to crawl and easy to categorise also make those pages rank high for certain keywords or search terms. The technique behind Image building is Search Engine Spider Simulator Basically all search engine spiders function on the same principle – they crawl the Web and pages, which are stored in a database and later use various algorithms to determine page ranking, relevancy, etc. of the collected pages. While the algorithms of calculating ranking and relevancy widely differ among search engines, the way they index sites is more or less uniform and it is very important that you know what spiders are interested in and what they neglect. Businesses are growing more aware of the need to understand and implement at least the basics of search engine optimization (SEO). But if you read a variety of blogs and websites, you’ll quickly see that there’s a lot of uncertainty over what makes up “the basics.” Without access to high-level consulting and without a lot of experience knowing what SEO resources can be trusted, there’s also a lot of misinformation about SEO strategies and tactics. Below are Techniquesfor usageof SEO 1. Commit yourself to the process: SEO isn’t a one-time event. Search engine algorithms change regularly, so the tactics that worked last year may not work this year. SEO requires a long-term outlook and commitment.
  • 7. Page 7 of 73 WE SCHOOL 2. Be patient: SEO isn’t about instant gratification. Results often take months to see, and this is especially true the smaller you are, and the newer you are to doing business online. 3. Ask a lot of questions when hiring an SEO company: It’s your job to know what kind of tactics the company uses. Ask for specifics. Ask if there are any risks involved. Then get online yourself and do your own research—about the company, about the tactics they discussed, and so forth. 4. Become a student of SEO: If you’re taking the do-it-yourself route, you’ll have to become a student of SEO and learn as much as you can. 5. Have web analytics in place at the start: You should have clearly defined goals for your SEO efforts, and you’ll need web analytics software in place so you can track what’s working and what’s not. 6. Build a great web site: Ask yourself, “Is my site really one of the 10 best sites in the world on this topic?” Be honest. If it’s not, make it better. 7. Include a site map page: Spiders can’t index pages that can’t be crawled. A site map will help spiders find all the important pages on your site, and help the spider understand your site’s hierarchy. This is especially helpful if your site has a hard-to-crawl navigation menu. If your site is large, make several site map pages. Keep each one to less than 100 links. It is advisable 75 to the max to be safe. 8. Make SEO-friendly URLs: Use keywords in your URLs and file names, such as yourdomain.com/red-widgets.html. Don’t overdo it, though. A file with 3+ hyphens tends to look spammy and users may be hesitant to click on it. Use hyphens in URLs and file names, not underscores. Hyphens are treated as a “space,” while underscores are not.
  • 8. Page 8 of 73 WE SCHOOL 9. Do keyword research at the start of the project: If you’re on a tight budget, use the free versions of Keyword Discovery or WordTracker, both of which also have more powerful paid versions. Ignore the numbers these tools show; what’s important is the relative volume of one keyword to another. Another good free tool is Google’s AdWords Keyword Tool, which doesn’t show exact numbers. 10. Open up a PPC account: Whether it’s Google’s AdWords, Microsoft adCenter or something else, this is a great way to get actual search volume for your keywords. Yes, it costs money, but if you have the budget it’s worth the investment. It’s also the solution if you didn’t like the “Be patient” suggestion above and are looking for instant visibility. 11. Use a unique and relevant title and Meta description on every page: The page title is the single most important on-page SEO factor. It’s rare to rank highly for a primary term (2-3 words) without that term being part of the page title. The meta description tag won’t help you rank, but it will often appear as the text snippet below your listing, so it should include the relevant keyword(s) and be written so as to encourage searchers to click on your listing. 12. Write for users first: Google, Yahoo, etc., have pretty powerful bots crawling the web, but to my knowledge these bots have never bought anything online, signed up for a newsletter, or picked up the phone to call about your services. Humans do those things, so write your page copy with humans in mind. Yes, you need keywords in the text, but don’t stuff each page like a Thanksgiving turkey. Keep it readable. 13. Create great, unique content: This is important for everyone, but it’s a particular challenge for online retailers. If you’re selling the same widget that 50 other retailers are selling, and everyone is using the boilerplate descriptions from the manufacturer, this is a great opportunity. Write your own product descriptions, using the keyword research you did earlier (see #9 above) to target actual words searchers use, and make product pages that blow the competition away. Plus, retailer or not, great content is a great way to get inbound links.
  • 9. Page 9 of 73 WE SCHOOL 14. Use your keywords as anchor text when linking internally: Anchor text helps tells spiders what the linked-to page is about. Links that say “click here” do nothing for your search engine visibility. 15. Build links intelligently: Begin with foundational links like trusted directories. (Yahoo and DMOZ are often cited as examples, but don’t waste time worrying about DMOZ submission. Submit it and forget it.) Seek links from authority sites in your industry. If local search matters to you (more on that coming up), seek links from trusted sites in your geographic area — the Chamber of Commerce, local business directories, etc. Analyze the inbound links to your competitors to find links you can acquire, too. Create great content on a consistent basis and use social media to build awareness and links. 16. Use press releases wisely: Developing a relationship with media covering your industry or your local region can be a great source of exposure, including getting links from trusted media web sites. Distributing releases online can be an effective link building tactic, and opens the door for exposure in news search sites. Only issue a release when you have something newsworthy to report. Don’t waste journalists’ time. 17. Start a blog and participate with other related blogs: Search engines, Google especially, love blogs for the fresh content and highly-structured data. Beyond that, there’s no better way to join the conversations that are already taking place about your industry and/or company. Reading and commenting on other blogs can also increase your exposure and help you acquire new links. Put your blog at yourdomain.com/blog so your main domain gets the benefit of any links to your blog posts. If that’s not possible, use blog.yourdomain.com. 18. Use social media marketing wisely. If your business has a visual element, join the appropriate communities on Flickr and post high-quality photos there. If you’re a service- oriented business, use Quora and/or Yahoo Answers to position yourself as an expert in your industry. Any business should also be looking to make use of Twitter and Facebook, as social information and signals from these are being used as part of search engine rankings for Google and Bing. With any social media site you use, the first rule is don’t spam! Be an
  • 10. Page 10 of 73 WE SCHOOL active, contributing member of the site. The idea is to interact with potential customers, not annoy them. 19. Take advantage of local search opportunities. Online research for offline buying is a growing trend. Optimize your site to catch local traffic by showing your address and local phone number prominently. Write a detailed Directions/Location page using neighbourhoods and landmarks in the page text. Submit your site to the free local listings services that the major search engines offer. Make sure your site is listed in local/social directories such as CitySearch, Yelp, Local.com, etc., and encourage customers to leave reviews of your business on these sites, too. 20. Take advantage of the tools the search engines give you. Sign up for Google Webmaster Central, Bing Webmaster Tools and Yahoo Site Explorer to learn more about how the search engines see your site, including how many inbound links they’re aware of. 21. Diversify your traffic sources. Google may bring you 70% of your traffic today, but what if the next big algorithm update hits you hard? What if your Google visibility goes away tomorrow? Newsletters and other subscriber-based content can help you hold on to traffic/customers no matter what the search engines do. In fact, many of the DOs on this list—creating great content, starting a blog, using social media and local search, etc.—will help you grow an audience of loyal prospects and customers that may help you survive the whims of search engines
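To make tips 11 and 14 concrete (as referenced in the list above), here is a minimal HTML sketch; the page name, keyword and URLs are invented purely for illustration:

<head>
  <!-- primary keyword appears in the page title (tip 11) -->
  <title>Handmade Leather Wallets | Example Store</title>
  <!-- the description does not affect ranking, but often becomes the snippet shown to searchers -->
  <meta name="description" content="Browse our handmade leather wallets, crafted in small batches and shipped worldwide.">
</head>
<body>
  <!-- descriptive anchor text (tip 14) instead of "click here" -->
  <p>See our full range of <a href="/leather-wallets/">handmade leather wallets</a>.</p>
</body>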
Page 11 of 73 WE SCHOOL
Background
This section looks behind the scenes at the search engine spider simulator and the techniques mentioned above.
Are Your Hyperlinks Spiderable?
The search engine spider simulator can be of great help when trying to figure out whether your hyperlinks lead to the right place. For instance, link exchange websites often put fake links to your site with JavaScript (using mouseover events and similar tricks to make the link look genuine), but this is not a link that search engines will see and follow. Since the spider simulator will not display such links, you will know that something about the link is wrong.
It is highly recommended to provide a <noscript> alternative rather than relying on JavaScript-based menus alone. The reason is that JavaScript-based menus are not spiderable, so the links inside them will simply be ignored. One solution is to repeat the menu links inside a <noscript> tag. The <noscript> tag can hold a lot, but avoid using it for link stuffing or any other kind of SEO manipulation.
If you have a great many hyperlinks on your pages (it is recommended to keep a page under 100 hyperlinks), you may have a hard time checking that they are all OK. For instance, if a link leads to a page that returns "403 Forbidden", "404 Page Not Found" or a similar error that prevents the spider from accessing it, that page will certainly not be indexed. Note that a spider simulator does not deal with 403 and 404 errors, because it checks where links lead rather than whether the link targets actually exist, so you need other tools to verify that the targets of your hyperlinks are the intended ones.
Looking for Your Keywords
While there are specific tools, like the Keyword Playground or the Website Keyword Suggestions tool, which deal with keywords in more detail, a search engine spider simulator also lets you see, with the eyes of a spider, where your keywords are located within the text of the page.
Page 12 of 73 WE SCHOOL
Why is this important? Because keywords in the first paragraphs of a page weigh more than keywords in the middle or at the end. And if keywords visually appear to us to be at the top, this may not be the way spiders see them. Consider a standard Web page laid out with tables: in the source code, the markup that describes the page layout (such as navigation links or cells with text that is the same site-wide) can come first and, worse, can be so long that the actual page-specific content is screens away from the top of the page.
Are Dynamic Pages Too Dynamic to be Seen at All?
Dynamic pages (especially ones with question marks in the URL) are another thing spiders do not love, although many search engines do index dynamic pages as well. Running the spider simulator will give you an idea of how well your dynamic pages are accepted by search engines.
Meta Keywords and Meta Description
Meta keywords and the meta description, as the names imply, are found in the <meta> tags of an HTML page. Meta keywords and descriptions were once the single most important criterion for determining the relevance of a page, but search engines now employ other mechanisms for determining relevancy, so you can safely skip listing keywords in meta tags (unless you want to add instructions there telling the spider what to index and what not to; apart from that, the keywords tag is not very useful anymore).
Meta tags are a great way for webmasters to provide search engines with information about their sites. Meta tags can be used to provide information to all sorts of clients, and each system processes only the meta tags it understands and ignores the rest. Meta tags are added to the <head> section of your HTML page and generally look like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Author: A.N. Author, Illustrator: P. Picture, Category: Books, Price: £9.24, Length: 784 pages">
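The example above is cut off by the page break. As a purely illustrative completion (the title, script and URLs are invented), the head could be closed with a robots meta tag, which is the standard way to give spiders the index/no-index instructions mentioned above, and a JavaScript menu can be paired with the <noscript> fallback recommended earlier so its links stay visible to spiders:

<meta name="robots" content="noindex, follow">
<title>Example Books by A.N. Author</title>
</head>
<body>
<!-- JavaScript-driven menu with a plain-link fallback for spiders -->
<script src="menu.js"></script>
<noscript>
<a href="/books/">Books</a>
<a href="/authors/">Authors</a>
</noscript>
</body>
</html>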
  • 13. Page 13 of 73 WE SCHOOL Methodology Here we come to know how the whole process started Finding information on the World Wide Web had been a difficult and frustrating task, but became much more usable with breakthroughs in search engine technology in the late 1990s. A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler (A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer)
Page 14 of 73 WE SCHOOL
History
Further information: Timeline of web search engines
Timeline (full list) – Year: Engine (current status)
1993: W3Catalog (inactive); Aliweb (inactive); JumpStation (inactive); WWW Worm (inactive)
1994: WebCrawler (active, aggregator); Go.com (active, Yahoo Search); Lycos (active); Infoseek (inactive)
1995: AltaVista (inactive, redirected to Yahoo!); Daum (active); Magellan (inactive); Excite (active); SAPO (active); Yahoo! (active, launched as a directory)
1996: Dogpile (active, aggregator); Inktomi (inactive, acquired by Yahoo!); HotBot (active, lycos.com); Ask Jeeves (active, rebranded ask.com)
1997: Northern Light (inactive); Yandex (active)
1998: Google (active); Ixquick (active, also as Startpage); MSN Search (active as Bing); empas (inactive, merged with NATE)
1999: AlltheWeb (inactive, URL redirected to Yahoo!); GenieKnows (active, rebranded Yellowee.com); Naver (active); Teoma (inactive, redirects to Ask.com); Vivisimo (inactive)
2000: Baidu (active); Exalead (active); Gigablast (active)
2003: Info.com (active); Scroogle (inactive)
Page 15 of 73 WE SCHOOL
2004: Yahoo! Search (active, launched own web search; see Yahoo! Directory, 1995); A9.com (inactive); Sogou (active)
2005: AOL Search (active); GoodSearch (active); SearchMe (inactive)
2006: Soso (active); Quaero (inactive); Ask.com (active); Live Search (active as Bing, launched as rebranded MSN Search); ChaCha (active); Guruji.com (inactive)
2007: wikiseek (inactive); Sproose (inactive); Wikia Search (inactive); Blackle.com (active, Google Search)
2008: Powerset (inactive, redirects to Bing); Picollator (inactive); Viewzi (inactive); Boogami (inactive); LeapFish (inactive); Forestle (inactive, redirects to Ecosia); DuckDuckGo (active)
2009: Bing (active, launched as rebranded Live Search); Yebol (inactive); Mugurdy (inactive due to a lack of funding); Scout (Goby) (active); NATE (active)
2010: Blekko (active); Cuil (inactive); Yandex (active, launched global English search)
2011: YaCy (active, P2P web search engine)
2012: Volunia (inactive)
2013: Halalgoogling (active, Islamic/halal filter search)
Page 16 of 73 WE SCHOOL
During the early development of the web, there was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One historical snapshot of the list from 1992 remains, but as more and more webservers went online the central list could no longer keep up. On the NCSA (National Center for Supercomputing Applications) site, new servers were announced under the title "What's New!"
The first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites, since the amount of data was so limited it could be readily searched manually.
In June 1993, Matthew Gray, then at MIT (Massachusetts Institute of Technology), produced what was probably the first web robot (a software application that runs automated tasks over the Internet), the World Wide Web Wanderer, written in Perl (a family of high-level, general-purpose, interpreted, dynamic programming languages that includes Perl 5 and Perl 6), and used it to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format. JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web pages and to build its index, and used a web form (which allows a user to enter data that
  • 17. Page 17 of 73 WE SCHOOL is sent to a server for processing.) as the interface to its query program. It was thus the first WWW -discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below, Because of the limited resources available on the platform it ran on, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered. One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor. Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory (it specializes in linking to other web sites and categorizing those links.) rather than its full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search. Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com( it relates to internet advertising). This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses in the internet. In 1996, Netscape (it’s an American computer services company, best known for Netscape Navigator, its web browser) was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that
  • 18. Page 18 of 73 WE SCHOOL instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite. Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001. Around 2000, Google's search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank (it is a way of measuring the importance of website pages), as was explained in the paper Anatomy of a Search Engine written by Sergey Brin and Larry Page, the later founders of Google. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal. In fact, Google search engine became so popular that spoof engines emerged such as Mystery Seeker. By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.
Page 19 of 73 WE SCHOOL
Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart (an American, publicly traded, online advertising company founded in 1995), blended with results from Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot). Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.
Before going ahead with further details, I would like to highlight aggregators. "Aggregators" is the buzzword of choice for the various online companies that gather information from fragmented marketplaces into a single portal to make life easier for everyone. A classic example is online airline and hotel reservations.
How web search engines work
A search engine operates in the following order:
1. Web crawling (carried out by an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing)
2. Indexing (collecting, parsing, and storing data to facilitate fast and accurate information retrieval; index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science)
3. Searching (answering the query that a user enters into a web search engine to satisfy his or her information needs)
  • 20. Page 20 of 73 WE SCHOOL Explanation of each is mentioned below Web search engines work by storing information about many web pages, which they retrieve from the HTML markup of the pages. These pages are retrieved by a Web crawler (sometimes also known as a spider) — An automated Web crawler which follows every link on the site. The site owner can exclude specific pages by using robots.txt. The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags [They are part of a web page's head section ]. Data about web pages are stored in an index database for use in later queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment, since the user normally expects that the search terms will be on the returned pages. Increased search relevance makes these cached pages very useful as they may contain data that may no longer be available elsewhere.
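To illustrate the robots.txt exclusion mentioned above, here is a minimal example (the paths and domain are invented): crawlers that honour the convention skip the disallowed paths while indexing the rest of the site.

User-agent: *
Disallow: /admin/
Disallow: /checkout/

Sitemap: http://www.example.com/sitemap.xml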
Page 21 of 73 WE SCHOOL
Figure: High-level architecture of a standard Web crawler
When a user enters a query into a search engine (typically by using keywords), the engine examines its inverted index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Since 2007 the Google.com search engine has allowed one to search by date by clicking "Show search tools" in the leftmost column of the initial search results page and then selecting the desired date range.
Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search: the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the research involves using statistical analysis on pages containing the words or phrases you search for. As well, natural language queries allow the user to type a question in the same form one
  • 22. Page 22 of 73 WE SCHOOL would ask it to a human. A site like this would be ask.com. The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" (In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database) by analyzing texts it locates. This first form relies much more heavily on the computer itself to do the bulk of the work. Most Web search engines are commercial ventures supported by advertising revenue and thus some of them allow advertisers to have their listings ranked higher in search results for a fee. Search engines that do not accept money for their search results make money by running search related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.
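To make the inverted-index idea above concrete, here is a minimal, illustrative Python sketch (the documents and the whitespace tokenisation are simplified assumptions, not how any production engine works): each word maps to the set of documents containing it, so an AND query is just a set intersection.

# build a tiny inverted index: word -> set of document ids
documents = {
    1: "search engines crawl the web",
    2: "spiders crawl pages and build an index",
    3: "users query the index with keywords",
}

inverted_index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index.setdefault(word, set()).add(doc_id)

def search(*terms):
    # AND semantics: intersect the posting sets of every term
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("crawl"))            # documents 1 and 2
print(search("crawl", "index"))   # document 2 only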
Page 23 of 73 WE SCHOOL
Market share
Here we see which search engines hold the highest market share. Google is the world's most popular search engine, with a market share of 68.69 per cent; Baidu comes in a distant second, answering 17.17 per cent of online queries. The world's most popular search engines, with market share as of October 2014, are:
Google: 58.01%
Baidu: 29.06%
Bing: 8.01%
Yahoo!: 4.01%
AOL: 0.21%
Ask: 0.10%
Excite: 0.00%
East Asia and Russia
East Asian countries and Russia are among the few places where Google is not the most popular search engine. In China, Baidu is the most popular search engine, and Soso is also more popular there than Google. Yandex commands a market share of 61.9 per cent in Russia, compared to Google's 28.3 per cent. South Korea's homegrown search portal, Naver, is used for 70 per cent of online searches in the country. Yahoo! Japan and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan, respectively.
Page 24 of 73 WE SCHOOL
Search engine bias
Although search engines are programmed to rank websites based on some combination of their popularity and relevancy, empirical studies indicate various political, economic, and social biases in the information they provide. These biases can be a direct result of economic and commercial processes (e.g., companies that advertise with a search engine can also become more popular in its organic search results) and of political processes (e.g., the removal of search results to comply with local laws). Organic search results are listings on search engine results pages that appear because of their relevance to the search terms, as opposed to being advertisements; non-organic search results, in contrast, may include pay-per-click advertising. For example, Google will not surface certain neo-Nazi websites in France and Germany, where Holocaust denial is illegal.
Biases can also be a result of social processes, as search engine algorithms are frequently designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing algorithms of major search engines also skew towards coverage of U.S.-based sites rather than websites from non-U.S. countries. Google bombing (the terms Google bomb and Googlewashing refer to the practice of causing a web page to rank highly in search engine results for unrelated or off-topic search terms by linking to it heavily) is one example of an attempt to manipulate search results for political, social or commercial reasons.
Page 25 of 73 WE SCHOOL
Customized results and filter bubbles
Many search engines, such as Google and Bing, provide customized results based on the user's activity history. This leads to an effect that has been called a filter bubble. The term describes a phenomenon in which websites use algorithms to selectively guess what information a user would like to see, based on information about the user (such as location, past click behavior and search history). As a result, websites tend to show only information that agrees with the user's past viewpoint, effectively isolating the user in a bubble that tends to exclude contrary information. Prime examples are Google's personalized search results and Facebook's personalized news stream.
According to Eli Pariser, who coined the term, users get less exposure to conflicting viewpoints and are isolated intellectually in their own informational bubble. Pariser related an example in which one user searched Google for "BP" and got investment news about British Petroleum, while another searcher got information about the Deepwater Horizon oil spill, and noted that the two search results pages were "strikingly different". The bubble effect may have negative implications for civic discourse, according to Pariser. Since this problem was identified, competing search engines have emerged that seek to avoid it by not tracking or "bubbling" users.
Faith-based search engines
The global growth of the Internet and the popularity of electronic content in the Arab and Muslim world during the last decade have encouraged faith adherents, notably in the Middle East and on the Asian sub-continent, to "dream" of their own faith-based, i.e. "Islamic", search engines or filtered search portals that would enable users to avoid accessing forbidden
Page 26 of 73 WE SCHOOL
websites such as pornography and would only allow them to access sites that are compatible with the Islamic faith. Shortly before the Muslim holy month of Ramadan, Halalgoogling, which collects results from other search engines like Google and Bing, was introduced to the world in July 2013 to present halal results to its users, nearly two years after I'mHalal, another search engine launched (in September 2011) to serve the Middle East internet, had to close its search service due to what its owner blamed on a lack of funding.
While a lack of investment and the slow pace of technology adoption in the Muslim world, the main consumers or targeted end users, have hindered progress and thwarted the success of a serious Islamic search engine, the spectacular failure of heavily invested Muslim lifestyle web projects like Muxlim, which received millions of dollars from investors such as Rite Internet Ventures, has, according to the I'mHalal shutdown notice, made almost laughable the idea that the next Facebook or Google can only come from the Middle East if you support your bright youth. Yet Muslim internet experts have been determining for years what is or is not allowed according to the "Law of Islam", and have been categorizing websites as either "halal" or "haram". All the existing and past Islamic search engines are merely custom searches indexed or monetized by major web search giants like Google, Yahoo and Bing, with certain filtering systems applied to ensure that their users cannot access haram sites, which include sites featuring nudity, gay content, gambling or anything else deemed to be anti-Islamic.
Another religiously oriented search engine is Jewogle, a Jewish version of Google, and yet another is SeekFind.org, a Christian website that includes filters preventing users from seeing anything on the internet that attacks or degrades their faith.
Page 27 of 73 WE SCHOOL
So far we have studied how search engines are built and their contribution to today's high-tech environment. Now we look at how an image is built up in search with the help of these technologies.
How do I increase my site's visibility to search engines?
These days you don't have to limit your search to just websites. Many other forms of content are easy to find, including images. No matter what you're looking for, an image is (for better or worse) just one image search away. You may wonder, however, how image search works. How are images sorted and classified, making it possible to find tens or hundreds of relevant results? Perhaps you're just curious, or perhaps you run a site and want to know so you can improve your own ranking. In either case, taking a deeper look could be helpful.
Some people assume that image search is conducted via fancy algorithms that determine what an image is about and then index it. I know that's where I started. As it turns out, however, old-fashioned text is one of the most important factors in an image's ranking. More specifically, the file name matters. Go ahead – do an image search. What do the top results have in common? Almost invariably, it's a portion of their file name. Most of the top results for "pizza" have the word pizza in the file name. That might seem obvious. But actually, it's not. Most digital photographs, for example, will start life with a file name like "1020302.jpg". It's only later that they're renamed. For webmasters, ensuring that a relevant file name is given to an image is just as basic and important as making sure that a webpage's keyword appears in that page's metadata title
  • 28. Page 28 of 73 WE SCHOOL and/or description. But it’s not automatic. It takes constant effort. We are now at the final step of building an image search engine — accepting a query image and performing an actual search. Let’s take a second to review how we got here:  Step 1: Defining Your Image Descriptor. Before we even consider building an image search engine, we need to consider how we are going to represent and quantify our image using only a list of numbers (i.e. a feature vector). We explored three aspects of an image that can easily be described: color, texture, and shape. We can use one of these aspects, or many of them.  Step 2: Indexing Your Dataset. Now that we have selected a descriptor, we can apply the descriptor to extract features from each and every image in our dataset. The process of extracting features from an image dataset is called “indexing”. These features are then written to disk for later use. Indexing is also a task that is easily made parallel by utilizing multiple cores/processors on our machine.  Step 3: Defining Your Similarity Metric. In Step 1, we defined a method to extract features from an image. Now, we need to define a method to compare our feature vectors. A distance function should accept two feature vectors and then return a value indicating how “similar” they are. Common choices for similarity functions include (but are certainly not limited to) the Euclidean, Manhattan, Cosine, and Chi-Squared distances. Finally, we are now ready to perform our last step in building an image search engine:
  • 29. Page 29 of 73 WE SCHOOL Searching and Ranking The Query Before we can perform a search, we need a query. The last time you went to Google, you typed in some keywords into the search box, right? The text you entered into the input form was your “query”. Google then took your query, analyzed it, and compared it to their gigantic index of webpages, ranked them, and returned the most relevant webpages back to you. Similarly, when we are building an image search engine, we need a query image. Query images come in two flavors: an internal query image and an external query image. As the name suggests, an internal query image already belongs in our index. We have already analyzed it, extracted features from it, and stored its feature vector. The second type of query image is an external query image. This is the equivalent to typing our text keywords into Google. We have never seen this query image before and we can’t make any assumptions about it. We simply apply our image descriptor, extract features, rank the images in our index based on similarity to the query, and return the most relevant results. Let’s think back to our similarity metrics for a second and assume that we are using the Euclidean distance. The Euclidean distance has a nice property called the Coincidence Axiom, implying that the function returns a value of 0 (indicating perfect similarity) if and only if the two feature vectors are identical. Example: If I were to search for an image already in my index, then the Euclidean distance between the two feature vectors would be zero, implying perfect similarity. This image would then be placed at the top of my search results since it is the most relevant. This makes sense
  • 30. Page 30 of 73 WE SCHOOL and is the intended behavior. How strange it would be if I searched for an image already in my index and did not find it in the #1 result position. That would likely imply that there was a bug in my code somewhere or I’ve made some very poor choices in image descriptors and similarity metrics. Overall, using an internal query image serves as a sanity check. It allows you to make sure that your image search engine is functioning as expected. Once you can confirm that your image search engine is working properly, you can then accept external query images that are not already part of your index. The Search So what’s the process of actually performing a search? Checkout the outline below: 1. Accept a query image from the user A user could be uploading an image from their desktop or from their mobile device. As image search engines become more prevalent, I suspect that most queries will come from devices such as iPhones and Droids. It’s simple and intuitive to snap a photo of a place, object, or something that interests you using your cellphone, and then have it automatically analyzed and relevant results returned. 2. Describe the query image Now that you have a query image, you need to describe it using the exact same image descriptor(s) as you did in the indexing phase. For example, if I used a RGB color histogram with 32 bins per channel when I indexed the images in my dataset, I am going to use the same 32 bin per channel histogram when describing my query image. This ensures that I have a consistent representation of my images. After applying my image descriptor, I now have a feature vector for the query image.
  • 31. Page 31 of 73 WE SCHOOL 3. Perform the Search To perform the most basic method of searching, you need to loop over all the feature vectors in your index. Then, you use your similarity metric to compare the feature vectors in your index to the feature vectors from your query. Your similarity metric will tell you how “similar” the two feature vectors are. Finally, sort your results by similarity. Looping over your entire index may be feasible for small datasets. But if you have a large image dataset, like Google or TinEye, this simply isn’t possible. You can’t compute the distance between your query features and the billions of feature vectors already present in your dataset. 4. Display Your Results to the User Now that we have a ranked list of relevant images we need to display them to the user. This can be done using a simple web interface if the user is on a desktop, or we can display the images using some sort of app if they are on a mobile device. This step is pretty trivial in the overall context of building an image search engine, but you should still give thought to the user interface and how the user will interact with your image search engine. Summary So there you have it, the four steps of building an image search engine, from front to back: 1. Define your image descriptor. 2. Index your dataset. 3. Define your similarity metric. 4. Perform a search, rank the images in your index in terms of relevancy to the user, and display the results to the user.
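The Euclidean distance and the Coincidence Axiom discussed above are easy to express in a few lines. The following is a small illustrative Python sketch (separate from the project code that follows, which uses the chi-squared distance); the example feature vectors are made up:

# illustrative distance functions between two feature vectors
import numpy as np

def euclidean_distance(a, b):
    # returns 0 only when the two vectors are identical (Coincidence Axiom)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

hist_a = np.array([0.2, 0.5, 0.3])
hist_b = np.array([0.1, 0.6, 0.3])

print(euclidean_distance(hist_a, hist_a))   # 0.0 -- identical vectors
print(euclidean_distance(hist_a, hist_b))   # small positive value
print(manhattan_distance(hist_a, hist_b))   # 0.2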
Page 32 of 73 WE SCHOOL
Here is a worked example of an image search engine that illustrates the explanation above.
Think about it this way. When you go to Google and type "Lord of the Rings" into the search box, you expect Google to return pages to you that are relevant to Tolkien's books and the movie franchise. Similarly, if we present an image search engine with a query image, we expect it to return images that are relevant to the content of that image; hence, image search engines are more commonly known in academic circles as Content Based Image Retrieval (CBIR) systems.
So what's the overall goal of our Lord of the Rings image search engine? The goal, given a query image from one of our five different categories, is to return the category's corresponding images in the top 10 results. That was a mouthful. Let's use an example to make it more clear. If I submitted a query image of The Shire to our system, I would expect it to give me all 5 Shire images in our dataset back in the first 10 results. And again, if I submitted a query image of Rivendell, I would expect our system to give me all 5 Rivendell images in the first 10 results. Make sense? Good. Let's talk about the four steps to building our image search engine.
The 4 Steps to Building an Image Search Engine
On the most basic level, there are four steps to building an image search engine:
Define your descriptor: What type of descriptor are you going to use? Are you describing color? Texture? Shape?
Index your dataset: Apply your descriptor to each image in your dataset, extracting a set of features.
  • 33. Page 33 of 73 WE SCHOOL Define your similarity metric: How are you going to define how “similar” two images are? You’ll likely be using some sort of distance metric. (a metric or distance function is a function that defines a distance between elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set but not all topologies can be generated by a metric.) Common choices include Euclidean, Cityblock (Manhattan), Cosine, and chi- squared to name a few. Searching: To perform a search, apply your descriptor to your query image, and then ask your distance metric to rank how similar your images are in your index to your query images. Sort your results via similarity and then examine them. Step #1: The Descriptor – A 3D RGB Color Histogram Our image descriptor is a 3D color histogram in the RGB color space with 8 bins per red, green, and blue channel. The best way to explain a 3D histogram is to use the conjunctive AND. This image descriptor will ask a given image how many pixels have a Red value that falls into bin #1 AND a Green value that falls into bin #2 AND how many Blue pixels fall into bin #1. This process will be repeated for each combination of bins; however, it will be done in a computationally efficient manner. When computing a 3D histogram with 8 bins, OpenCV will store the feature vector as an (8, 8, 8) array. We’ll simply flatten it and reshape it to (512,). Once it’s flattened, we can easily compare feature vectors together for similarity. Ready to see some code? Okay, here we go: 3D RGB Histogram in OpenCV and Python Python
Page 34 of 73 WE SCHOOL
# import the necessary packages
import numpy as np
import cv2

class RGBHistogram:
    def __init__(self, bins):
        # store the number of bins the histogram will use
        self.bins = bins

    def describe(self, image):
        # compute a 3D histogram in the RGB colorspace,
        # then normalize the histogram so that images
        # with the same content, but either scaled larger
        # or smaller will have (roughly) the same histogram
        hist = cv2.calcHist([image], [0, 1, 2], None, self.bins, [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist)

        # return our 3D histogram as a flattened array
        return hist.flatten()
  • 35. Page 35 of 73 WE SCHOOL As you can see, RGBHistogram class has been defined. The reason for this is because you rarely ever extract features from a single image alone. You instead extract features from an entire dataset of images. Furthermore, you expect that the features extracted from all images utilize the same parameters — in this case, the number of bins for the histogram. It wouldn’t make much sense to extract a histogram using 32 bins from one image and then 128 bins for another image if you intend on comparing them for similarity. Let’s take the code apart and understand what’s going on: Lines 6-8: Here I am defining the constructor for the RGBHistogram. The only parameter we need is the number of bins for each channel in the histogram. Again, this is why I prefer using classes instead of functions for image descriptors — by putting the relevant parameters in the constructor, you ensure that the same parameters are utilized for each image. Line 10: You guessed it. The describe method is used to “describe” the image and return a feature vector. Line 15: Here we extract the actual 3D RGB Histogram (or actually, BGR since OpenCV stores the image as a NumPy array, but with the channels in reverse order). We assume self.bins is a list of three integers, designating the number of bins for each channel. Line 16: It’s important that we normalize the histogram in terms of pixel counts. If we used the raw (integer) pixel counts of an image, then shrunk it by 50% and described it again, we would have two different feature vectors for identical images. In most cases, you want to avoid this scenario. We obtain scale invariance by converting the raw integer pixel counts into real-valued percentages. For example, instead of saying bin #1 has 120 pixels in it, we would say bin #1 has 20% of all pixels in it. Again, by using the percentages of pixel counts rather than raw, integer pixel counts, we can assure that two identical images, differing only
Page 36 of 73 WE SCHOOL
in size, will have (roughly) identical feature vectors.
Line 20: When computing a 3D histogram, the histogram will be represented as a NumPy array with (N, N, N) bins. In order to more easily compute the distance between histograms, we simply flatten this histogram to have a shape of (N ** 3,). Example: When we instantiate our RGBHistogram, we will use 8 bins per channel. Without flattening our histogram, the shape would be (8, 8, 8). But by flattening it, the shape becomes (512,).
Now that we have defined our image descriptor, we can move on to the process of indexing our dataset.
Step #2: Indexing our Dataset
Okay, so we've decided that our image descriptor is a 3D RGB histogram. The next step is to apply our image descriptor to each image in the dataset. This simply means that we are going to loop over our 25 image dataset, extract a 3D RGB histogram from each image, store the features in a dictionary, and write the dictionary to file. Yep, that's it. In reality, you can make indexing as simple or as complex as you want. Indexing is a task that is easily made parallel. If we had a four core machine, we could divide the work up between the four cores and speed up the indexing process. But since we only have 25 images, that's pretty silly, especially given how fast it is to compute a histogram. Let's dive into some code:
Indexing an Image Dataset using Python
Page 37 of 73 WE SCHOOL
# import the necessary packages
from pyimagesearch.rgbhistogram import RGBHistogram
import argparse
import cPickle
import glob
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images to be indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where the computed index will be stored")
args = vars(ap.parse_args())

# initialize the index dictionary to store our quantified
# images, with the 'key' of the dictionary being the image
# filename and the 'value' our computed features
index = {}

Alright, the first thing we are going to do is import the packages we need. The --dataset argument is the path to where our images are stored on disk and the --index option is the path to where we will store our index once it has been computed. Finally, we'll initialize our index, a built-in Python dictionary type. The key for the
Page 38 of 73 WE SCHOOL
dictionary will be the image filename. We've made the assumption that all filenames are unique, and in fact, for this dataset, they are. The value for the dictionary will be the computed histogram for the image.
Using a dictionary for this example makes the most sense, especially for explanation purposes. Given a key, the dictionary points to some other object. When we use an image filename as a key and the histogram as the value, we are implying that a given histogram H is used to quantify and represent the image with filename K. Again, you can make this process as simple or as complicated as you want. More complex image descriptors make use of term frequency-inverse document frequency weighting (tf-idf) and an inverted index, but for the time being, let's keep it simple.
Indexing an Image Dataset using Python
# initialize our image descriptor -- a 3D RGB histogram with
# 8 bins per channel
desc = RGBHistogram([8, 8, 8])

Here we instantiate our RGBHistogram. Again, we will be using 8 bins for each of the red, green, and blue channels, respectively.
Indexing an Image Dataset using Python
Page 39 of 73 WE SCHOOL
# use glob to grab the image paths and loop over them
for imagePath in glob.glob(args["dataset"] + "/*.png"):
    # extract our unique image ID (i.e. the filename)
    k = imagePath[imagePath.rfind("/") + 1:]

    # load the image, describe it using our RGB histogram
    # descriptor, and update the index
    image = cv2.imread(imagePath)
    features = desc.describe(image)
    index[k] = features

Here is where the actual indexing takes place. Let's break it down:
Line 2: We use glob to grab the image paths and start to loop over our dataset.
Line 4: We extract the "key" for our dictionary. All filenames are unique in this sample dataset, so the filename itself will be enough to serve as the key.
Lines 8-10: The image is loaded off disk and we then use our RGBHistogram to extract a histogram from the image. The histogram is then stored in the index.
Indexing an Image Dataset using Python
Page 40 of 73 WE SCHOOL
# we are now done indexing our images -- now we can write our
# index to disk
f = open(args["index"], "w")
f.write(cPickle.dumps(index))
f.close()

Now that our index has been computed, let's write it to disk so we can use it for searching later on.
Step #3: The Search
We now have our index sitting on disk, ready to be searched. The problem is, we need some code to perform the actual search. How are we going to compare two feature vectors and how are we going to determine how similar they are? This question is better addressed first with some code.
Building an Image Search Engine in Python and OpenCV
Page 41 of 73 WE SCHOOL
# import the necessary packages
import numpy as np

class Searcher:
    def __init__(self, index):
        # store our index of images
        self.index = index

    def search(self, queryFeatures):
        # initialize our dictionary of results
        results = {}

        # loop over the index
        for (k, features) in self.index.items():
            # compute the chi-squared distance between the features
            # in our index and our query features -- using the
            # chi-squared distance which is normally used in the
            # computer vision field to compare histograms
            d = self.chi2_distance(features, queryFeatures)

            # now that we have the distance between the two feature
            # vectors, we can update the results dictionary -- the
            # key is the current image ID in the index and the
            # value is the distance we just computed, representing
            # how 'similar' the image in the index is to our query
            results[k] = d

        # sort our results, so that the smaller distances (i.e. the
        # more relevant images) are at the front of the list
        results = sorted([(v, k) for (k, v) in results.items()])

        # return our results
        return results

    def chi2_distance(self, histA, histB, eps = 1e-10):
        # compute the chi-squared distance
        d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
            for (a, b) in zip(histA, histB)])

        # return the chi-squared distance
        return d

First off, most of this code is just comments, so don't be scared that it's 41 lines. Let's investigate what's going on:
Lines 4-7: The first thing I do is define a Searcher class and a constructor with a single parameter, the index. This index is assumed to be the index dictionary that we wrote to file
  • 43. Page 43 of 73 WE SCHOOL during the indexing step. Line 11: We define a dictionary to store our results. The key is the image filename (from the index) and the value is how similar the given image is to the query image. Lines 14-26: Here is the part where the actual searching takes place. We loop over the image filenames and corresponding features in our index. We then use the chi-squared distance to compare our color histograms. The computed distance is then stored in the results dictionary, indicating how similar the two images are to each other. Lines 30-33: The results are sorted in terms of relevancy (the smaller the chi-squared distance, the relevant/similar) and returned. Lines 35-41: Here we define the chi-squared distance function used to compare the two histograms. In general, the difference between large bins vs. small bins is less important and should be weighted as such. This is exactly what the chi-squared distance does. We provide an epsilon dummy value to avoid those pesky “divide by zero” errors. Images will be considered identical if their feature vectors have a chi-squared distance of zero. The larger the distance gets, the less similar they are. So there you have it, a Python class that can take an index and perform a search. Now it’s time to put this searcher to work. Step #4: Performing a Search Finally. We are closing in on a functioning image search engine. But we’re not quite there yet. We need a little extra code to handle loading the images off disk and performing the search: Building an Image Search Engine in Python and OpenCV
Page 44 of 73 WE SCHOOL
# import the necessary packages
from pyimagesearch.searcher import Searcher
import numpy as np
import argparse
import cPickle
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images we just indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where we stored our index")
args = vars(ap.parse_args())

# load the index and initialize our searcher
index = cPickle.loads(open(args["index"]).read())
searcher = Searcher(index)

Page 45 of 73 WE SCHOOL
First things first. Import the packages that we will need. We then define our arguments in the same manner that we did during the indexing step. Finally, we use cPickle to load our index off disk and initialize our Searcher.

# loop over images in the index -- we will use each one as
# a query image
for (query, queryFeatures) in index.items():
    # perform the search using the current query
    results = searcher.search(queryFeatures)

    # load the query image and display it
    path = args["dataset"] + "/%s" % (query)
    queryImage = cv2.imread(path)
    cv2.imshow("Query", queryImage)
    print "query: %s" % (query)

    # initialize the two montages to display our results --
    # we have a total of 25 images in the index, but let's only
    # display the top 10 results; 5 images per montage, with
    # images that are 400x166 pixels
    montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
    montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")

    # loop over the top ten results
    for j in xrange(0, 10):
        # grab the result (we are using row-major order) and
        # load the result image
        (score, imageName) = results[j]
        path = args["dataset"] + "/%s" % (imageName)
        result = cv2.imread(path)
        print "\t%d. %s : %.3f" % (j + 1, imageName, score)

        # check to see if the first montage should be used
        if j < 5:
            montageA[j * 166:(j + 1) * 166, :] = result

        # otherwise, the second montage should be used
        else:
            montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result

    # show the results
    cv2.imshow("Results 1-5", montageA)
    cv2.imshow("Results 6-10", montageB)
    cv2.waitKey(0)

Page 47 of 73 WE SCHOOL
Most of this code handles displaying the results. The actual "search" is done in a single line (see Line 5). Regardless, let's examine what's going on:
Line 3: We are going to treat each image in our index as a query and see what results we get back. Normally, queries are external and not part of the dataset, but before we get to that, let's just perform some example searches.
Line 5: Here is where the actual search takes place. We treat the current image as our query and perform the search.
Lines 8-11: Load and display our query image.
Lines 17-35: In order to display the top 10 results, I have decided to use two montage images. The first montage shows results 1-5 and the second montage results 6-10. The name of the image and distance is provided on Line 27.
Lines 38-40: Finally, we display our search results to the user.
So there you have it. An entire image search engine in Python.
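Assuming the indexing and searching code above is saved as index.py and search.py (the file names are not given in the text, so they are an assumption here, as are the example directory and index file names), a typical run from the command line might look like this, using the --dataset and --index arguments defined in the code:

$ python index.py --dataset images --index index.cpickle
$ python search.py --dataset images --index index.cpickle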
Page 48 of 73 WE SCHOOL
Figure: Search Results using Mordor-002.png as a query. Our image search engine is able to return images from Mordor and the Black Gate.
Let's start at the ending of The Return of the King, using Frodo and Sam's ascent into the volcano as our query image. As you can see, our top 5 results are from the "Mordor" category. Perhaps you are wondering why the query image of Frodo and Sam is also the image in the #1 result position? Well, let's think back to our chi-squared distance. We said that an image would be considered "identical" if the distance between the two feature vectors is zero. Since we are using images we have already indexed as queries, they are in fact identical and will have a distance of zero. Since a value of zero indicates perfect similarity, the query image appears
  • 49. Page 49 of 73 WE SCHOOL in the #1 result position. Now, let’s try another image, this time using The Goblin King in Goblin Town: Figure: Search Results using Goblin-004.png as a query. The top 5 images returned are from Goblin Town. The Goblin King doesn’t look very happy. But we sure are happy that all five images from Goblin Town are in the top 10 results. Finally, here are three more example searches for Dol-Guldur, Rivendell, and The Shire. Again, we can clearly see that all five images from their respective categories are in the top 10 results.
Page 50 of 73 WE SCHOOL
Figure: Using images from Dol-Guldur (Dol-Guldur-004.png), Rivendell (Rivendell-003.png), and The Shire (Shire-002.png) as queries.
But clearly, this is not how all image search engines work. Google allows you to upload an image of your own. TinEye allows you to upload an image of your own. Why can't we? Let's see how we can perform a search using an image that we haven't already indexed:
Building an Image Search Engine using Python and OpenCV
Page 51 of 73 WE SCHOOL
# import the necessary packages
from pyimagesearch.rgbhistogram import RGBHistogram
from pyimagesearch.searcher import Searcher
import numpy as np
import argparse
import cPickle
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images we just indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where we stored our index")
ap.add_argument("-q", "--query", required = True,
    help = "Path to query image")
args = vars(ap.parse_args())

# load the query image and show it
queryImage = cv2.imread(args["query"])
cv2.imshow("Query", queryImage)
print "query: %s" % (args["query"])

# describe the query in the same way that we did in
# index.py -- a 3D RGB histogram with 8 bins per
# channel
desc = RGBHistogram([8, 8, 8])
queryFeatures = desc.describe(queryImage)

# load the index and perform the search
index = cPickle.loads(open(args["index"]).read())
searcher = Searcher(index)
results = searcher.search(queryFeatures)

# initialize the two montages to display our results --
# we have a total of 25 images in the index, but let's only
# display the top 10 results; 5 images per montage, with
# images that are 400x166 pixels
montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")

# loop over the top ten results
for j in xrange(0, 10):
    # grab the result (we are using row-major order) and
    # load the result image
    (score, imageName) = results[j]
    path = args["dataset"] + "/%s" % (imageName)
    result = cv2.imread(path)
    print "\t%d. %s : %.3f" % (j + 1, imageName, score)

    # check to see if the first montage should be used
    if j < 5:
        montageA[j * 166:(j + 1) * 166, :] = result

    # otherwise, the second montage should be used
    else:
        montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result

# show the results
cv2.imshow("Results 1-5", montageA)
cv2.imshow("Results 6-10", montageB)
cv2.waitKey(0)

Lines 2-17: This should feel like pretty standard stuff by now. We are importing our packages and setting up our argument parser, although you should note the new argument --query. This is the path to our query image.
Lines 20-21: We're going to load your query image and show it to you, just in case you forgot what your query image is.
Lines 27-28: Instantiate our RGBHistogram with the exact same number of bins as during our indexing step. We then extract features from our query image.
Page 54 of 73 WE SCHOOL
Lines 31-33: Load our index off disk using cPickle and perform the search.
Lines 39-62: Just as in the search code we used earlier, this code simply displays our results.
To test the engine on images it has not already seen, we use two new query images: one of Rivendell and one of The Shire. Check out the results below: Figure: Using external Rivendell (Left) and the Shire (Right) query images. For both cases, we find that the top 5 search results are from the same category. In this case, we searched using two images that we haven't seen previously. The one on the left is of Rivendell. We can see from our results that the five Rivendell images in our index were returned, demonstrating that our image search engine is working properly. On the right, we have a query image from The Shire. Again, this image is not present in our index. But when we look at the search results, we can see that the five Shire images were returned from the image search engine, once again demonstrating that our image search engine is returning semantically similar images.
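All of these rankings come from the chi-squared comparison of colour histograms described earlier. As a minimal illustration of why an already-indexed query comes back with a distance of exactly zero, the short sketch below (our own example; the helper name chi2_distance and the toy histograms are invented for illustration and are not part of the project code) compares two small histograms:

# a minimal sketch of the chi-squared distance used to compare
# two colour histograms; identical histograms yield a distance of zero
import numpy as np

def chi2_distance(histA, histB, eps=1e-10):
    # half the sum of squared bin differences, normalised by the bin totals;
    # eps avoids division by zero for empty bins
    return 0.5 * np.sum(((histA - histB) ** 2) / (histA + histB + eps))

a = np.array([0.25, 0.25, 0.25, 0.25])  # toy 4-bin histograms
b = np.array([0.40, 0.10, 0.30, 0.20])

print(chi2_distance(a, a))  # 0.0 -- the "query is already indexed" case
print(chi2_distance(a, b))  # greater than 0 -- larger values mean less similar

Smaller distances sort to the top of the results, which is exactly why an indexed image always beats every other image when used as its own query.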
Page 55 of 73 WE SCHOOL Summary
Here we've explored how to create an image search engine from start to finish. The first step was to choose an image descriptor: we used a 3D RGB histogram to characterize the color of our images. We then indexed each image in our dataset using our descriptor by extracting feature vectors (i.e. the histograms). From there, we used the chi-squared distance to define "similarity" between two images. Finally, we glued all the pieces together and created a Lord of the Rings image search engine.
After the above example, we will now see that although text search and image search differ, they are correlated to each other. Most webmasters don't see any difference between image alt text and the image title, and mostly keep them the same. A great discussion over at Google Webmaster Groups provides exhaustive information on the differences between an image alt attribute and an image title, and standard recommendations on how to use them. Alt text is meant to be an alternative information source for those people who have chosen to disable images in their browsers and those user agents that are simply unable to "see" the images. It should describe what the image is about and get those visitors interested to see it. Without alt text, an image will be displayed as an empty icon. In Internet Explorer, alt text also pops up when you hover over an image. Plus, Google officially confirmed it mainly focuses on alt text when trying to understand what an image is about. The image title (and the element name speaks for itself) should provide additional information and follow the rules of the regular title: it should be relevant, short,
Page 56 of 73 WE SCHOOL catchy, and concise (a title "offers advisory information about the element for which it is set"). In Firefox and Opera the title pops up when you hover over an image. So based on the above, we can discuss how to properly handle them:  Both tags are primarily meant for visitors (though alt text seems more important for crawlers), so provide explicit information on an image to encourage views.  Include your main keywords in both, but change them up. Keyword stuffing in alt text and title is still keyword stuffing, so keep them relevant and meaningful. Another good point to take into consideration:  According to Aaron Wall, alt text is crucially important when used for a site-wide header banner. One of the reasons Aaron Wall was so motivated to change the tagline of his site was that the new site design contained the site's logo as a background image. The logo link was a regular static link, but it had no anchor text, only a link title to describe the link. If you do not look at the source code, the link title attribute can seem like an image alt tag when you scroll over it, but to a search engine they do not look the same. A link title is not weighted anywhere near as aggressively as an image alt tag is. The old link title on the header link for his site was "search engine optimization book". While the site ranks #6 and #8 for that query in Google, neither of the ranking pages is the homepage (the tools page and the sales letter rank). That shows that Google currently places negligible, if any, weight on link titles. If the only link to your homepage is a logo, check the source code to verify you are using descriptive image alt text.
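As a practical follow-up to that last recommendation, the short sketch below (our own illustration, not taken from the case material; it assumes the requests and beautifulsoup4 packages and uses a placeholder URL) inspects a page's source code and flags every image that is missing descriptive alt text:

# a rough sketch: flag images on a page that have no alt text
# (assumes the requests and beautifulsoup4 packages are installed)
import requests
from bs4 import BeautifulSoup

url = "http://www.example.com"  # placeholder URL -- replace with your own site
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    alt = (img.get("alt") or "").strip()
    if not alt:
        # an empty or missing alt attribute gives crawlers nothing to index
        print("Missing alt text: %s" % img.get("src"))

Running a check like this across a site is a quick way to catch logo links and header banners that are silently wasting the alt-text opportunity described above.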
Page 57 of 73 WE SCHOOL Conclusion & Recommendations
Conclusion: We conclude that, with the help of the technologies mentioned above, any company can use search engines to build an image that also benefits society. While nobody can guarantee top-level positioning in search engine organic results, proper search engine optimization can help. Because search engines such as Google, Yahoo!, and Bing are so important today, it is necessary to make each page in a Web site conform to the principles of good SEO as much as possible. To do this it is necessary to:  Understand the basics of how search engines rate sites  Use proper keywords and phrases throughout the Web site  Avoid giving the appearance of spamming the search engines  Write all text for real people, not just for search engines  Use well-formed alternate attributes on images  Make sure that the necessary meta tags (and title tag) are installed in the head of each Web page  Have good incoming links to establish popularity  Make sure the Web site is regularly updated so that the content is fresh
Page 58 of 73 WE SCHOOL Recommendations: The following recommendations are made on the basis of the overall usage of recommender systems.
Overview: Recommender systems typically produce a list of recommendations in one of two ways, through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users, and then use that model to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties. These approaches are often combined.
Collaborative filtering: One approach to the design of recommender systems that has seen wide use is collaborative filtering. Collaborative filtering methods are based on collecting and analyzing a large amount of information on users' behaviours, activities or preferences and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content and is therefore capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in recommender systems, for example the k-nearest neighbour (k-NN) approach and the Pearson Correlation.
Page 59 of 73 WE SCHOOL Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. When building a model from a user's profile, a distinction is often made between explicit and implicit forms of data collection. Examples of explicit data collection include the following:  Asking a user to rate an item on a sliding scale.  Asking a user to search.  Asking a user to rank a collection of items from favourite to least favourite.  Presenting two items to a user and asking him/her to choose the better one of them.  Asking a user to create a list of items that he/she likes. Examples of implicit data collection include the following:  Observing the items that a user views in an online store.  Analyzing item/user viewing times.  Keeping a record of the items that a user purchases online.  Obtaining a list of items that a user has listened to or watched on his/her computer.  Analyzing the user's social network and discovering similar likes and dislikes. The recommender system compares the collected data to similar and dissimilar data collected from others and calculates a list of recommended items for the user. Several commercial and non-commercial examples are described in the literature on collaborative filtering systems. One of the most famous examples of collaborative filtering is item-to-item collaborative
Page 60 of 73 WE SCHOOL filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's recommender system. Facebook, MySpace, LinkedIn, and other social networks use collaborative filtering to recommend new friends, groups, and other social connections (by examining the network of connections between a user and their friends). Twitter uses many signals and in-memory computations to recommend to its users who to follow. Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.
Cold Start: These systems often require a large amount of existing data on a user in order to make accurate recommendations.
Scalability: In many of the environments in which these systems make recommendations, there are millions of users and products. Thus, a large amount of computation power is often necessary to calculate recommendations.
Sparsity: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings.
A particular type of collaborative filtering algorithm uses matrix factorization, a low-rank matrix approximation technique. Collaborative filtering approaches are also classified as memory-based or model-based. A well-known example of a memory-based approach is the user-based algorithm, while a well-known example of a model-based approach is the Kernel-Mapping Recommender.
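As a rough illustration of the memory-based, user-based approach (a toy example of ours; the ratings matrix is invented and far smaller and denser than anything a real system would handle), Pearson correlation can be used to weight neighbours when predicting a missing rating:

# toy user-based collaborative filtering with Pearson correlation
# (illustrative only -- real systems work on huge, sparse rating matrices)
import numpy as np

# rows = users, columns = items; 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],   # user 0 (we will predict item 2 for this user)
    [4, 5, 4, 1],   # user 1
    [1, 1, 0, 5],   # user 2
], dtype=float)

def pearson(u, v):
    # correlate only over the items both users have rated
    mask = (u > 0) & (v > 0)
    if mask.sum() < 2:
        return 0.0
    return np.corrcoef(u[mask], v[mask])[0, 1]

# predict user 0's rating for item 2 from the neighbours who rated it
target_user, target_item = 0, 2
sims, weighted = [], []
for other in range(ratings.shape[0]):
    if other == target_user or ratings[other, target_item] == 0:
        continue
    s = pearson(ratings[target_user], ratings[other])
    sims.append(abs(s))
    weighted.append(s * ratings[other, target_item])

prediction = sum(weighted) / sum(sims) if sims else 0.0
print("Predicted rating: %.2f" % prediction)

The same similarity-then-weighted-average idea scales up to millions of users, which is precisely where the scalability and sparsity problems described above come from.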
Page 61 of 73 WE SCHOOL Content-based filtering: Another common approach when designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items; besides, a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research. To abstract the features of the items in the system, an item presentation algorithm is applied. A widely used algorithm is the tf-idf representation (short for term frequency-inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection), also called the vector space representation. To create a user profile, the system mostly focuses on two types of information: 1) A model of the user's preferences. 2) A history of the user's interaction with the recommender system. Basically, these methods use an item profile (i.e. a set of discrete attributes and features) characterizing the item within the system. The system creates a content-based profile of users based on a weighted vector of item features. The weights denote the importance of each feature to the user and can be computed from individually rated content vectors using a variety of techniques, as the sketch below illustrates.
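Here is a compact sketch of that weighted-vector idea (our own example; it assumes scikit-learn for the tf-idf and cosine-similarity computations, and the item descriptions and ratings are invented): the user profile is a rating-weighted average of the tf-idf vectors of the items the user has rated, and unseen items are then ranked by cosine similarity to that profile.

# toy content-based filtering: tf-idf item profiles plus a weighted user profile
# (assumes scikit-learn is installed; all data is invented for illustration)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "item1": "epic fantasy adventure with elves and wizards",
    "item2": "romantic comedy set in a small town",
    "item3": "dark fantasy quest through ancient ruins",
}
user_ratings = {"item1": 5, "item2": 1}   # items the user has already rated

names = list(items.keys())
# tf-idf vector for every item description
vectors = TfidfVectorizer().fit_transform(items.values()).toarray()

# user profile = rating-weighted average of the rated items' vectors
rated = [names.index(n) for n in user_ratings]
weights = np.array([user_ratings[names[i]] for i in rated], dtype=float)
profile = (vectors[rated] * weights[:, None]).sum(axis=0) / weights.sum()

# score the unseen items by cosine similarity to the user profile
for i, name in enumerate(names):
    if name not in user_ratings:
        score = cosine_similarity([profile], [vectors[i]])[0, 0]
        print("%s: %.3f" % (name, score))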
Page 62 of 73 WE SCHOOL Simple approaches use the average values of the rated item vectors, while other, more sophisticated methods use machine learning techniques such as:  Bayesian Classifiers (in machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features);  Cluster Analysis (the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar in some sense or another to each other than to those in other groups);  Decision Trees (a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility); and  Artificial Neural Networks (in machine learning, artificial neural networks (ANNs) are a family of statistical learning algorithms inspired by biological neural networks, the central nervous systems of animals and in particular the brain, and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown), in order to estimate the probability that the user is going to like the item. Direct feedback from a user, usually in the form of a like or dislike button, can be used to assign higher or lower weights to the importance of certain attributes. A key issue with content-based filtering is whether the system is able to learn user preferences from a user's actions regarding one content source and use them across other content types. When the system is limited to recommending content of the same type as the
Page 63 of 73 WE SCHOOL user is already using, the value from the recommendation system is significantly less than when other content types from other services can be recommended. For example, recommending news articles based on browsing of news is useful, but it's much more useful when music, videos, products, discussions etc. from different services can be recommended based on news browsing.
Hybrid Recommender Systems: Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, could be more effective in some cases. Hybrid approaches can be implemented in several ways: by making content-based and collaborative-based predictions separately and then combining them; by adding content-based capabilities to a collaborative-based approach (and vice versa); or by unifying the approaches into one model. Several studies have empirically compared the performance of the hybrid with the pure collaborative and content-based methods and demonstrated that the hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and the sparsity problem. Netflix is a good example of a hybrid system: it makes recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering). A variety of techniques have been proposed as the basis for recommender systems: collaborative, content-based, knowledge-based, and demographic techniques. Each of these techniques has known shortcomings, such as the well-known cold-start problem for
Page 64 of 73 WE SCHOOL collaborative and content-based systems (what to do with new users with few ratings) and the knowledge engineering bottleneck in knowledge-based approaches (a knowledge base is a technology used to store complex structured and unstructured information used by a computer system). A hybrid recommender system is one that combines multiple techniques together to achieve some synergy between them.
Collaborative: The system generates recommendations using only information about rating profiles for different users. Collaborative systems locate peer users with a rating history similar to the current user and generate recommendations using this neighbourhood.
Content-based: The system generates recommendations from two sources: the features associated with products and the ratings that a user has given them. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on product features.
Demographic: A demographic recommender provides recommendations based on a demographic profile of the user. Recommended products can be produced for different demographic niches, by combining the ratings of users in those niches.
Knowledge-based: A knowledge-based recommender suggests products based on inferences about a user's needs and preferences. This knowledge will sometimes contain explicit functional knowledge about how certain product features meet user needs.
The term hybrid recommender system is used here to describe any recommender system that combines multiple recommendation techniques together to produce its output. There is no reason why several different techniques of the same type could not be hybridized; for example, two different content-based recommenders could work together, and a number of projects have investigated this type of hybrid: NewsDude, which uses both naive Bayes and
Page 65 of 73 WE SCHOOL kNN classifiers in its news recommendations, is just one example. Seven hybridization techniques have been identified:
1) Weighted: The scores of different recommendation components are combined numerically.
2) Switching: The system chooses among recommendation components and applies the selected one.
3) Mixed: Recommendations from different recommenders are presented together.
4) Feature Combination: Features derived from different knowledge sources are combined together and given to a single recommendation algorithm.
5) Feature Augmentation: One recommendation technique is used to compute a feature or set of features, which is then part of the input to the next technique.
6) Cascade: Recommenders are given strict priority, with the lower-priority ones breaking ties in the scoring of the higher-priority ones.
7) Meta-level: One recommendation technique is applied and produces some sort of model, which is then the input used by the next technique.
Beyond Accuracy: Typically, research on recommender systems is concerned with finding the most accurate recommendation algorithms. However, there are a number of other factors that are also important.
Diversity - Users tend to be more satisfied with recommendations when there is a higher intra-list diversity, i.e. items from, for example, different artists.
Recommender Persistence - In some situations it is more effective to re-show recommendations, or let users re-rate items, than to show new items. There are several
Page 66 of 73 WE SCHOOL reasons for this. Users may ignore items when they are shown for the first time, for instance, because they had no time to inspect the recommendations carefully.
Privacy - Recommender systems usually have to deal with privacy concerns because users have to reveal sensitive information. Building user profiles using collaborative filtering can be problematic from a privacy point of view. Many European countries have a strong culture of data privacy, and every attempt to introduce any level of user profiling can result in a negative customer response. A number of privacy issues arose around the dataset offered by Netflix for the Netflix Prize competition. Although the data sets were anonymised in order to preserve customer privacy, in 2007 two researchers from the University of Texas were able to identify individual users by matching the data sets with film ratings on the Internet Movie Database. As a result, in December 2009 an anonymous Netflix user sued Netflix in Doe v. Netflix, alleging that Netflix had violated U.S. fair trade laws and the Video Privacy Protection Act by releasing the datasets. This led in part to the cancellation of a second Netflix Prize competition in 2010. Much research has been conducted on ongoing privacy issues in this space. Ramakrishnan and colleagues have conducted an extensive overview of the trade-offs between personalization and privacy and found that the combination of weak ties and other data sources can be used to uncover identities of users in an anonymised dataset.
User Demographics - Beel et al. found that user demographics may influence how satisfied users are with recommendations. In their paper they show that elderly users tend to be more interested in recommendations than younger users.
Robustness - When users can participate in the recommender system, the issue of fraud must be addressed.
Serendipity - Serendipity is a measure of how surprising the recommendations are. For
Page 67 of 73 WE SCHOOL instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is still not a good recommendation because it is an obvious item for the customer to buy.
Trust - A recommender system is of little value for a user if the user does not trust the system. Trust can be built by a recommender system by explaining how it generates recommendations, and why it recommends an item.
Labelling - User satisfaction with recommendations may be influenced by the labelling of the recommendations. For instance, in the cited study the click-through rate (CTR, the proportion of users who clicked on a recommendation) for recommendations labelled as "Sponsored" was lower (CTR = 5.93%) than the CTR for identical recommendations labelled as "Organic" (CTR = 8.86%). Interestingly, recommendations with no label performed best (CTR = 9.87%) in that study.
Mobile Recommender Systems: One growing area of research in the area of recommender systems is mobile recommender systems. With the increasing ubiquity of internet-accessing smartphones, it is now possible to offer personalized, context-sensitive recommendations. This is a particularly difficult area of research, as mobile data is more complex than the data recommender systems often have to deal with (it is heterogeneous and noisy, requires spatial and temporal auto-correlation, and has validation and generality problems). Additionally, mobile recommender systems suffer from a transplantation problem: recommendations may not apply in all regions (for instance, it would be unwise to recommend a recipe in an area where all of the ingredients may not be available).
Page 68 of 73 WE SCHOOL One example of a mobile recommender system is one that offers potentially profitable driving routes for taxi drivers in a city. This system takes as input data in the form of GPS traces of the routes that taxi drivers took while working, which include location (latitude and longitude), time stamps, and operational status (with or without passengers). It then recommends a list of pickup points along a route that will lead to optimal occupancy times and profits. This type of system is obviously location-dependent, and as it must operate on a handheld or embedded device, the computation and energy requirements must remain low. Another example of mobile recommendation is the system that Bouneffouf et al. (2012) developed for professional users. This system takes as input the GPS traces of the user and his agenda in order to suggest suitable information depending on the user's situation and interests. The system uses machine learning techniques and a reasoning process in order to dynamically adapt the mobile system to the evolution of the user's interests. The authors called their algorithm hybrid-ε-greedy. Mobile recommendation systems have also been successfully built using the Web of Data as a source for structured information. A good example of such a system is SMARTMUSEUM. The system uses semantic modelling, information retrieval and machine learning techniques in order to recommend content matching the user's interests, even when the evidence of the user's interests is initially vague and based on heterogeneous information.
Risk-Aware Recommender Systems: The majority of existing approaches to recommender systems focus on recommending the most relevant documents to the users using the contextual information, and do not take into account the risk
Page 69 of 73 WE SCHOOL of disturbing the user in specific situations. However, in many applications, such as recommending personalized content, it is also important to incorporate the risk of upsetting the user into the recommendation process, in order not to recommend documents to users in certain circumstances, for instance during a professional meeting, early in the morning, or late at night. Therefore, the performance of the recommender system depends on the degree to which it has incorporated this risk into the recommendation process.
Risk Definition: "The risk in recommender systems is the possibility to disturb or to upset the user which leads to a bad answer of the user." In response to these problems, the authors have developed a dynamic, risk-sensitive recommendation system called DRARS (Dynamic Risk-Aware Recommender System), which models context-aware recommendation as a bandit problem. This system combines a content-based technique and a contextual bandit algorithm. They have shown that DRARS improves on the Upper Confidence Bound (UCB) policy, the best algorithm currently available, by calculating the optimal exploration value needed to maintain a trade-off between exploration and exploitation based on the risk level of the current user's situation. The authors conducted experiments in an industrial context with real data and real users, and have shown that taking into account the risk level of users' situations significantly increased the performance of the recommender system.
The Netflix Prize: One of the key events that energized research in recommender systems was the Netflix Prize. From 2006 to 2009, Netflix sponsored a competition, offering a grand prize of $1,000,000 to the team that could take an offered dataset of over 100 million movie ratings and return recommendations that were 10% more accurate than those offered by the company's existing
Page 70 of 73 WE SCHOOL recommender system. This competition energized the search for new and more accurate algorithms. On 21 September 2009, the grand prize of US$1,000,000 was awarded to the BellKor's Pragmatic Chaos team under tiebreaking rules. The most accurate algorithm in 2007 used an ensemble method of 107 different algorithmic approaches, blended into a single prediction. As the winners put it, predictive accuracy is substantially improved when blending multiple predictors; their experience was that most effort should be concentrated on deriving substantially different approaches rather than refining a single technique, and consequently their solution was an ensemble of many methods. Many benefits accrued to the web due to the Netflix project. A second contest was planned, but was ultimately cancelled in response to an ongoing lawsuit and concerns from the Federal Trade Commission. (The Federal Trade Commission (FTC) is an independent agency of the United States government, established in 1914 by the Federal Trade Commission Act. Its principal mission is the promotion of consumer protection and the elimination and prevention of anticompetitive business practices, such as coercive monopoly.)
Multi-criteria Recommender Systems: Multi-Criteria Recommender Systems (MCRS) can be defined as recommender systems that incorporate preference information on multiple criteria. Instead of developing recommendation techniques based on a single criterion value (the overall preference of user u for item i), these systems try to predict a rating for unexplored items of u by exploiting preference information on multiple criteria that affect this overall preference value. Several researchers approach MCRS as a multi-criteria decision making (MCDM) problem, and apply MCDM methods and techniques to implement MCRS systems.
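Returning to the blending idea behind the winning Netflix Prize entries, the sketch below (our own toy example; the held-out ratings and the two base predictions are invented, and real ensembles blend a hundred or more models rather than two) shows how the weight given to each predictor can be tuned on held-out data, and why the blend typically beats either predictor on its own:

# a minimal sketch of blending two predictors with a weighted average
# (illustrative only -- the winning Netflix ensembles blended 100+ models)
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0, 4.0])     # held-out true ratings
pred_cf = np.array([3.5, 3.4, 4.6, 2.5, 3.8])    # collaborative predictor
pred_cb = np.array([4.4, 2.5, 4.2, 1.8, 4.5])    # content-based predictor

def rmse(p):
    return np.sqrt(np.mean((p - actual) ** 2))

# try a grid of blend weights and keep the one with the lowest error
best = min(
    (rmse(w * pred_cf + (1 - w) * pred_cb), w)
    for w in np.linspace(0, 1, 101)
)
print("CF alone: %.3f" % rmse(pred_cf))
print("CB alone: %.3f" % rmse(pred_cb))
print("Blended:  %.3f (weight on CF = %.2f)" % best)

The point is the same one the winners made: combining substantially different predictors usually helps more than polishing any single one.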
Page 71 of 73 WE SCHOOL The limitations of SEO
One important characteristic of an expert and professional consultant is that he or she understands that the theories and techniques in their field are subject to quite concrete and specific limitations. In this section, I review the important ones that apply to search engine optimization and website marketing, and their practical significance.
SEO is hand-made to order. If you have planned a product launch or a wedding, or been a party to a lawsuit, you know that the best-laid plans are seldom executed without some major changes. Life is simply too complex. SEO is another one of those human activities because, to be effective, it is hand-made for a specific site and business. There are other important limitations which you need to understand and take into consideration.
Searching is an evolving process from the point of view of providers (the search engines), users and website owners. What worked yesterday may not work today, and may be counter-productive or harmful tomorrow. As a result, monitoring or regular checks of the key search engines and directories are required to maintain a high ranking once it is achieved.
Quality is everything. Since virtually everything that we do to improve a site's ranking will be known to anyone who knows how to get it, innovations tend to be short-lived. Moreover, search engines are always on the lookout for exploits that manipulate their ranking algorithms. The only thing that cannot be copied or exploited is high-quality, valuable content, especially when others link to it for those reasons. Only higher-quality and more valuable content trumps it.
Page 72 of 73 WE SCHOOL The cost of SEO is rising. More expertise is required than before, and this trend will continue. The techniques employed are more sophisticated, complex and time-consuming. There are fewer worthwhile search engines and directories that offer free listings. Paid placement costs are rising, and the best keywords are expensive.
The search lottery. Search engines collect only a fraction of the billions of pages on the web, for various technological reasons which change over time but which nonetheless mean, for the foreseeable future, that searching is akin to a lottery. SEO improves the odds but cannot remove the uncertainty altogether.
SEO is a marketing exercise and, accordingly, the same old business rules apply. You can sell almost anything to someone once, but businesses are built and prosper through repeat customers to whom the reputation, brand or goodwill is important. Content quality and value are the key, and they remain elusive, expensive and difficult to source; but, for websites, they are the only basis of effective marketing using SEO techniques.
Suffice it to say, if your site is included in a search engine and you achieve a high enough ranking for your requirements, then these limitations are simply costs of doing business in cyberspace.
Page 73 of 73 WE SCHOOL Bibliography
The bibliography reflects the topic itself: search engines. The entire project was completed with the help of the following sites, and with guidance from Mr C.P. Venkatesh in the Digital Marketing Workshop:
www.google.com
www.searchcounsel.com
www.searchenginejournal.com
www.pyimagesearch.com
www.seobook.com