©2012 LHST
Search
Recovery and Discovery
Prof. Lee SCHLENKER
E-Stratégies
Sept 5th 2014
- Preliminary Draft -
  • Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets.

    Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.

    Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query.

    In fuzzy-set theory, an element has a varying degree of membership, say dA, to a given set A instead of the traditional membership choice (is an element/is not an element).

    A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.

    Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
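
As an illustration of the algebraic family described in these notes, the following is a minimal sketch of a vector-space comparison, assuming raw term counts as vector weights and whitespace tokenization. The function names (`term_vector`, `cosine_similarity`, `rank`) are hypothetical and chosen for this example only; real engines typically use weighting schemes that this sketch leaves out.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag-of-words vector (term -> raw count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The similarity is the single scalar value the notes refer to: identical vectors score 1, and texts with no shared terms score 0.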
  • Transcript of "Estrat search"

    1. Search: Recovery and Discovery
       ©2012 LHST
       Prof. Lee SCHLENKER, E-Stratégies, Sept 5th 2014
       - Preliminary Draft -
       How can you use enterprise technologies to improve apprenticeship?
    2. Focus         Improve        Knowledge  Leverage      Measure
       Organization  Processes      Explicit   Transactions  Efficiency
       Services      Delivery       Implicit   Interactions  Effectiveness
       Networks      Relationships  Emerging   Interactions  Innovation
       Search        Relevancy      Connected  Proximity     CTR
    3. Are business solutions anything more than recovering something you once knew, or discovering something that is "out there" that others can't find?
    4. • The size of the indexed World Wide Web in 2012
         - Indexed by Google: about 40 billion pages
       • Yahoo deals with 12 TB of data per day (according to Ron Brachman)
       • Twitter hits 400 million tweets per day (June 2012; Dick Costolo, CEO of Twitter)
       • Over 2.5 billion photos uploaded to Facebook each month (2010; blog.facebook.com)
       • 55 million WordPress sites in the world
       http://www.worldwidewebsize.com/
    5. • Search is the attempt to make sense of information
       • As the amount of information explodes, search has become the user's interface metaphor
       • Twenty percent of searches are for entertainment, 15 percent are commercial in nature, and 65 percent are informational
       • On the Internet, all intent is commercial in one form or another
       "The perfect search engine," says Google co-founder Larry Page, "would understand exactly what you mean and give back exactly what you want."
    6. • Why does Paul Ford link the search for "meaning" to the "Semantic Web"?
       • How should "The New Economy" be defined? Does the notion still make sense today?
       • The author compares Google to Amazon and eBay. Why is the latter's business model under threat today?
       • What are the differences between "web search" and "enterprise search"?
       • Assess how feasible his notion of a "personal agent" is today.
    7. • Web search applies search technology to documents on the open web
       • Desktop search applies search technology to the content on a single computer
       • Enterprise search involves making diverse content searchable for a defined audience
       "With Search you won't ever have to leave your house or open a physical book…"
       Eric Borboen
    8. • Text-based (Bing, Google, Yahoo!): search by keywords, with limited search using queries in natural language
       • Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …)
       • Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language
       • Clustering systems (Vivísimo/IBM, Clusty)
       • Research systems (Lemur/MIT, Nutch)
    9. • Crawling the set of documents to skim the keywords from their contents,
       • Indexing the buzzwords (foam) in a semi-structured form, and
       • Resolving user entries/queries to return mostly relevant results
       Robert Korfhage
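
The three steps on this slide can be sketched with a small inverted index. This is only an illustration under stated assumptions: the `documents` dictionary stands in for pages already fetched by a crawler, tokenization is a simple regular expression, and queries are resolved with a strict Boolean AND. The names `tokenize`, `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Skim lowercase alphanumeric keywords from a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """documents: dict of doc_id -> text. Returns a mapping term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A strict AND returns only exact keyword matches; the "mostly relevant" results mentioned on the slide would need a ranking step on top of this lookup.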
    10. • Boolean
        • Vector
        • Probabilistic
        • Fuzzy retrieval
        • Language modeling
        • Latent semantic indexing
    11. • The first step in classifying web pages is to find an 'index item' that might relate expressly to the 'search term'.
        • These days, a continuous crawl method is employed, as opposed to incidental discovery based on a seed list.
        • Most search engines use sophisticated scheduling algorithms to "decide" when to revisit a particular page, based on its relevance.
        • The speed of the web server running the page, as well as resource constraints such as the amount of hardware or bandwidth, also figure in.
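
One way to picture a revisit scheduler is a priority queue ordered by due time. The sketch below is a toy policy, not how any particular engine works: it assumes the interval is halved when a page changed since the last visit and doubled when it did not, and it ignores the server speed and bandwidth constraints mentioned above. `RevisitScheduler` and its parameters are hypothetical.

```python
import heapq


class RevisitScheduler:
    """Priority queue of pages ordered by the time they are next due for a crawl."""

    def __init__(self, min_interval=1.0, max_interval=64.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self._queue = []      # heap of (due_time, url)
        self._interval = {}   # url -> current revisit interval

    def add(self, url, now=0.0, interval=8.0):
        """Register a page and schedule its first visit."""
        self._interval[url] = interval
        heapq.heappush(self._queue, (now + interval, url))

    def next_due(self):
        """Pop the page with the earliest due time as (due_time, url)."""
        return heapq.heappop(self._queue)

    def record_visit(self, url, now, changed):
        """Shorten the interval for pages that changed, lengthen it otherwise."""
        interval = self._interval[url]
        interval = interval / 2 if changed else interval * 2
        interval = max(self.min_interval, min(self.max_interval, interval))
        self._interval[url] = interval
        heapq.heappush(self._queue, (now + interval, url))
        return interval
```

A crawler loop would call `next_due()`, fetch the page, then call `record_visit()` with whether the content changed.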
    12. • Searching for text-based content in structured data formats (databases, XML, CSV, etc.) presents special challenges.
        • Databases allow logical queries that full-text search doesn't (multi-field Boolean logic, for instance).
        • No crawling is necessary for a database, since the data is already structured.
        • Databases are slow when solving complex queries or using customized indexing formats (compounding, normalization, transformation, transliteration, etc.)
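
The multi-field Boolean logic mentioned above can be illustrated with Python's built-in `sqlite3` module. The table layout, the sample rows and the `find` helper are all invented for this sketch; the point is only that one query can combine conditions on several structured fields with a text match.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not data from the slides).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, year INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO docs (author, year, body) VALUES (?, ?, ?)",
    [
        ("alice", 2012, "enterprise search roadmap"),
        ("alice", 2009, "desktop search notes"),
        ("bob", 2012, "enterprise search budget"),
    ],
)


def find(author, min_year, keyword):
    """Multi-field Boolean query: author AND year >= min_year AND body contains keyword."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE author = ? AND year >= ? AND body LIKE ? ORDER BY id",
        (author, min_year, f"%{keyword}%"),
    )
    return [row[0] for row in rows]
```

The `LIKE` clause is a plain substring test, which is much cruder than what a dedicated full-text index offers.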
    13. • Content ingestion: push or pull content collection
        • Content processing and analysis: normalizing content
        • Indexing: dictionary of all unique words, ranking and frequency
        • Query parsing: user entries, multiple dimensional filters and paging information
        • Matching: comparing the query to the stored index
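
The query parsing and matching stages listed above might look like the sketch below. The query syntax (`field:value` filters and a `page:N` token) is an assumption made for this example, as are the names `parse_query` and `match` and the record layout with a `text` key; the matching step scans a list of records instead of a stored index to keep the example short.

```python
def parse_query(raw, page_size=10):
    """Split a raw user entry into free-text terms, field filters and paging info."""
    terms, filters, page = [], {}, 1
    for token in raw.lower().split():
        field, sep, value = token.partition(":")
        if sep and value:
            if field == "page" and value.isdigit():
                page = max(1, int(value))
            else:
                filters[field] = value
        else:
            terms.append(token)
    offset = (page - 1) * page_size
    return {"terms": terms, "filters": filters, "offset": offset, "limit": page_size}


def match(parsed, records):
    """Return the requested page of records that contain every term and satisfy every filter."""
    hits = [
        r for r in records
        if all(t in r["text"].lower().split() for t in parsed["terms"])
        and all(str(r.get(f, "")).lower() == v for f, v in parsed["filters"].items())
    ]
    return hits[parsed["offset"]: parsed["offset"] + parsed["limit"]]
```

For instance, `parse_query("enterprise search type:pdf page:2")` separates the two free-text terms from the `type` filter and converts the page number into an offset.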
    14. Profitability
          Profit Margin (ttm): 27.48%
          Operating Margin (ttm): 32.45%
        Management Effectiveness
          Return on Assets (ttm): 15.21%
          Return on Equity (ttm): 22.36%
        Income Statement
          Revenue (ttm): 13.43B
          Revenue Per Share (ttm): 43.676
          Qtrly Revenue Growth (yoy): 57.70%
          Gross Profit (ttm): 6.38B
        Internet users spend about 15 million hours a month on the site.
        Nearly four out of five Internet searches happen on Google or on sites that license its technology.
    15. History
        • January 1996-December 1997: Sergey Brin and Larry Page create BackRub, the precursor to the Google search engine.
        • Sept. 7, 1998: Google is incorporated and takes up residence in a Menlo Park, California, garage with four employees.
        • September-October 2002: Google rolls out its keyword advertising program worldwide, based on the GoTo.com model.
        • March-April 2002: Google launches a beta version of Google News.
        • May-June 2003: Google launches AdSense, an advertising program that delivers ads based on the content of Web sites.
        Google is the fastest growing company ever: 400,000 percent revenue growth in five years.
    16. "To organize the world's information and make it universally accessible and useful"
        • "You Can Make Money Without Doing Evil"
        • "You Can Be Serious Without a Suit"
        • "No Pop Up Ads"
        Larry Page: "I'm not a big believer in strategy"
    17. • The PageRank algorithm looks at the links on a page, the anchor text around those links, and the popularity of the pages that link to another page to judge relevance.
        • Google has 175,000 computers dedicated to the job of crawling, more than all the computers on earth in the early 1970s.
        • Google developed its own OS on top of its servers, with a unique approach to designing, cooling and stacking the components.
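
The "popularity of the pages that link to another page" idea can be illustrated with the textbook power-iteration form of PageRank. This is a simplified sketch, not Google's production ranking: it uses link structure only (no anchor text), assumes a damping factor of 0.85, and spreads the rank of pages without outgoing links evenly over all pages. The function name and the `links` format are chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to. Returns page -> score (scores sum to 1)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` ends up with the highest score because two pages link to it, and `b` with the lowest because nothing links to it.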
    18. "'Being a different kind of company' encompasses more than the products we make and the business we're building; it means making sure that our core values inform our conduct in all aspects of our lives as Google employees."
        I. Serving our Users
        II. Respecting Each Other
        III. Avoiding Conflicts of Interest
        IV. Preserving Confidentiality
        V. Maintaining Books and Records
        VI. Protecting Google's Assets
        VII. Obeying the Law
        VIII. Using our Code
        Google tracks what products you shop for, the mail you send, which phrases you research in a book, which satellite photos and news stories you view, …
    19. Giving a different meaning to the concept of "Portal"
    20. • You create your ads
        • Your ads appear on Google
        • You attract customers
        • You're charged only if someone clicks your ad, not when your ad is displayed.
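
The billing rule in the last bullet comes down to simple arithmetic, sketched below. The flat `cost_per_click` price is an assumption made for illustration (the slide does not say how the price per click is set), and both function names are hypothetical. The click-through rate (CTR) is the measure listed for search on slide 2.

```python
def click_through_rate(impressions, clicks):
    """CTR as a fraction of impressions; 0.0 when the ad was never shown."""
    return clicks / impressions if impressions else 0.0


def pay_per_click_charge(clicks, cost_per_click):
    """The charge depends only on clicks; displaying the ad costs nothing under this model."""
    return clicks * cost_per_click
```

For example, an ad shown 1,000 times and clicked 20 times has a CTR of 2 percent, and at a hypothetical price of 0.50 per click the advertiser is charged 10.00, however many times the ad was displayed.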
    21. Automatically crawls the content of your pages and delivers ads (you can choose text or image ads) that are relevant to your audience and your site
    22. • Gmail: Offer custom email addresses to your organization with up to 25 gigabytes of storage for each account, search tools to help people find information fast, plus instant messaging and calendar tools built right into the email interface.
        • Google Talk: Your users can call or send instant messages to their contacts for free, anytime, anywhere in the world. File sharing and voicemail are included, too.
        • Google Calendar: Your users can organize their schedules and share events, meetings and entire calendars with others. Your organization can also publish calendars and events on the web.
        • Google Docs: Your users can create documents, spreadsheets and presentations and collaborate with each other in real time right inside a web browser window.
        • The Start Page: A central place for your users to preview their inboxes and calendars, access your essential content, and search the web.
        • Google Page Creator: Create and publish web pages for your domain quickly and easily with this what-you-see-is-what-you-get page design tool.
    23. • Google continues to bet on centralized servers and thin clients. That's why it is spending $600 million to build a new data center in North Carolina: the purpose is to provide 100% uptime for business applications.
        • Google built its web office suite via acquisitions. The startups it has acquired are Gtalkr (instant messaging), Writely (word processing), iRows (spreadsheets), JotSpot (wiki), Tonic Systems (presentations), and Zenter (presentations).
        • Among the big companies, Google, whose web office solutions are based on AJAX, has a clear online office strategy. To provide offline capabilities, Google developed Google Gears, a set of browser plugins and JavaScript libraries that enable AJAX applications to run offline.
    24. Social Media
        • Google plans to begin introducing a common set of standards (Open Social) to allow software developers to write programs for Google's social network, Orkut, as well as for others, including LinkedIn, hi5, Friendster, Plaxo and Ning, along with Salesforce and Oracle.
        • Google can benefit from their success, in part by selling advertising on those sites and in part by incorporating social media functions inside its own applications.
        • Google said it has advertising relationships with several social networks (including Facebook), and a $900 million partnership to sell ads on MySpace.
    25. • Browser - an application to handle all the information: Chrome
        • Internet - support for net neutrality initiatives
        • Mobile OS - Android as an open platform
        • Mobile Device - Nexus
    26. • Vic Gundotra: "Google's mobile moves are driven by one objective: pushing the industry to open up"
        • The phones sold on the Google website will all be available unlocked.
        • Google doesn't want to compete with other companies offering handsets.
        • It wants to change the mindset of consumers towards having an open handset that will work with any network, anywhere.
    27. • Constant transformation: from big mainframes to PCs, and from PCs to the Internet
        • People increasingly rely on powerful mobile phones instead of PCs to surf the Web
        • Online advertising may well lose its role as the Web's primary economic engine
        • Recent Google acquisitions include Android, maker of a mobile operating system; GrandCentral, a VOIP operator; and AdMob, a mobile advertising network
        • Google has invested heavily in mapping and location technologies
        • Google's mobile strategy isn't about hardware; it's about generating money from its core business: advertising
        Sizing up Google's Nexus 10 tablet
    28. • Google's US ad revenue = 15 billion
        • The size of the US Yellow Pages market is roughly 14 billion.
        • Jonathan Rosenberg: mobile ads are already a billion-dollar market for Google.
        • Google owns 97% search market share, while offering localized search auto-complete, ads that map to physical locations, and creating a mobile coupon offers network
        • Google Trusted Stores, Google Wallet, and now Google Local Delivery
    29. Rich content SERP will allow Google to move into:
        • Travel search
        • Paid media (ebooks, music, magazines, newspapers, videos, etc.)
        • Real estate
        • Large lead generation markets (like insurance, mortgage, credit cards, .edu)
        • Ecommerce search
    30.             Web Search      Enterprise Search
        Validity    Popular search  + Deep search
        Algorithms  Links           Semantics
        Scope       Public pages    + Private pages
        Type        Web pages       + Data stores
        Concerns    Ranking         + Security
    31. Architecture      Issues
        Query layer       How will people find the data?
        Indexing layer    What metadata (context) is relevant?
        Processing layer  How should we interpret the data?
        Connector layer   How can we bring this data "home"?
        These are multiple opportunities to add value to the Microsoft platform!
    32. • Before the Web we assumed that our digital footprint was as ephemeral as a phone call
        • Clickstreams can provide a level of intelligence about how people use the Web
        • Innovative companies have figured out how to deliver great Web-based services by divining clickstream patterns
        • We have yet to aggregate the critical mass of clickstreams in a database of intentions
    33. • Blogs are personal statements of who their authors are and who they wish to be in the searchable world.
        • The blog is an indexable statement of an individual's social standing, relationships, interests and history.
        • Mass personalization: blogs can become proxies for personal taxonomies
        • Intelligent engines will be able to discern patterns among blogs that provide third-order relevance inputs, helping to define and return far better search results
        John Battelle
    34. • The Web is in the process of becoming the next great computing platform, owned by no one and used by everyone.
        • The telephone, the automobile, the television, the stereo are all part of the network (your dog, your kid)
        • By tracking not only what searches you do, but what sites you visit, the engines of the future will be able to build a real-time profile of your interests
        • Recovery is everywhere you've been before; discovery is everything you may wish to find but have yet to encounter.
        • In the near future we'll store everything that can be digitized on one massive platform: the Google grid?
    35. • It's what your job in marketing, sales and management is all about
        • Decisions are based on judgment and precision
        • Search ends with proof of value rather than an empty box
        • Enterprise Search is an integral part of BI, Collaboration, ECM, and UC