This document discusses big data in the context of the web. It covers topics such as the volume, variety, and velocity of big data; how the wisdom of crowds can be leveraged through data aggregation; challenges like sparsity, noise, and privacy; and examples of mining web data through techniques like clustering pictures on Flickr and analyzing click distributions. The document emphasizes that data must address important problems and proposes focusing on problem-driven approaches to analyzing large, diverse web data.
Big data in the web
1. 6/28/13
1
Big Data in the Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
- 3 -
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
- 4 -
Big Data
§ Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable time
§ Large volume and growth
§ Petabytes to exabytes
§ Growth is estimated at 3 exabytes per day
§ Structured vs. non-structured data
§ Diversity
§ Types, formats, complexity, topics, etc.
§ Best Public Data Example: The Web
§ Content: text, multimedia
§ Structure: graphs
§ Usage: real time streams
- 5 -
Big Data
§ Focus on analytics
§ Many storage technologies:
§ DBs, DWs, distributed file systems, …
§ Many processing technologies:
§ Cloud computing, map-reduce (Hadoop), …
§ Data mining, clustering, classification, …
§ Machine learning, A/B testing, NLP, …
§ Simulation
§ Several technology providers
§ Initial best practices (see TDWI report, 2011)
§ Main challenges: scalability, online
- 6 -
Big Data: The Five V’s
Characteristic | Data Issue                                | Computing Issue
Volume         | Scale, Redundancy                         | Scalability
Variety        | Heterogeneity, Complexity                 | Adaptability, Extensibility
Veracity       | Completeness, Bias, Sparsity, Noise, Spam | Reliability, Trust
Velocity       | Real time                                 | Online
Value          | Usefulness, Privacy                       | Business dependent
- 7 -
Asking the Right Questions
§ Problem Driven
§ What data do we need? How much?
§ How do we collect it? How do we store and transfer it?
§ Understanding the Data
§ How sparse is the data? How much noise?
§ Is there redundancy? Are there biases?
§ Is there spam? Any outliers?
§ Analyzing the Data
§ Any privacy issues? Do we need to anonymize?
§ How well do our algorithms scale?
§ Can we visualize the results?
- 8 -
Too Much Data Available
§ The Web is a database!
§ Data does not imply information
§ Many analyses for the sake of it (data driven)
§ Analyzing data is not CS per se
§ Publish in the right forum!
§ Big Data or Right Data?
- 9 -
The Different Facets of the Web
- 11 -
The Structure of the Web
- 12 -
Big Data in the Web
[Figure: diagram of web data sources. Labels: Metadata, RDF, Wikipedia, ODP, Wordnet, Flickr, Text, Anchors + links, Y! Answers, UGC, Blogs, Groups, Logs (Clicks + Queries), Private. Annotations: Explicit, Implicit, Scale, Quality?]
- 15 -
Noise and Spam
§ Noise may come from many places:
§ Instruments that measure
§ How we interpret the data (example later)
§ Spam is everywhere
- 16 -
Web Spam
Deceiving text, links, clicks…
due to an economic incentive
Depending on the goal and the data,
spam is easier to generate
Depending on the type & target data,
spam is easier to fight
Disincentives for spammers?
• Social
• Economic
Web Spam is NOT Mail Spam
- 17 -
- 18 -
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
- 19 -
Web Data Trends
• User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
• Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
• Viability
– Business model based on advertising
- 20 -
The Wisdom of Crowds
• James Surowiecki, a New Yorker columnist,
published this book in 2004
– “Under the right circumstances, groups are
remarkably intelligent”
• Importance of diversity, independence and
decentralization
“large groups of people are smarter than an elite few,
no matter how brilliant—they are better at solving
problems, fostering innovation, coming to wise
decisions, even predicting the future”.
Aggregating data
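The effect of aggregating data can be illustrated with a small simulation. This is only a sketch under stated assumptions: `TRUE_VALUE`, `NOISE_STD`, and the Gaussian noise model are hypothetical and do not come from the slides. It relies on the guesses being independent and unbiased, which mirrors the independence and diversity conditions listed above.

```python
import random
import statistics

TRUE_VALUE = 100.0  # hypothetical quantity the crowd is estimating
NOISE_STD = 30.0    # assumed spread of individual guesses


def simulate_crowd(size, seed=0):
    """Return `size` independent, unbiased guesses of TRUE_VALUE."""
    rng = random.Random(seed)
    return [rng.gauss(TRUE_VALUE, NOISE_STD) for _ in range(size)]


def crowd_error(guesses):
    """Absolute error of the aggregated (mean) guess."""
    return abs(statistics.fmean(guesses) - TRUE_VALUE)


def typical_individual_error(guesses):
    """Mean absolute error of the individual guesses."""
    return statistics.fmean(abs(g - TRUE_VALUE) for g in guesses)


guesses = simulate_crowd(1000)
print(f"typical individual error: {typical_individual_error(guesses):.2f}")
print(f"error of the aggregate:   {crowd_error(guesses):.2f}")
```

The aggregate can never be worse than the typical individual here (triangle inequality), but the large gain depends on independence: correlated or biased guesses would not cancel out in the same way.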
- 21 -
Web Data Mining
• Content: text & multimedia mining
• Structure: link analysis, graph mining
• Usage: log analysis, query mining
• Relate all of the above
– Web characterization
– Particular applications
- 22 -
Flickr: Clustering Pictures
- 27 -
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communities
• Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
• Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
• Like outsourcing, but in a micro-distributed fashion
• Thousands of “turkers” working on hundreds of “HITs” (tasks)
• Rates are typically a few cents per task
• Quality of their work is positively evaluated (e.g. in IR)
- 28 -
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)
The crowd implicitly
knows the experts!
- 30 -
Scalability
§ How to scale?
§ Doubling the data will, in the best case, double the time
§ Time complexity vs. result quality trade-off
§ Example: entity detection in linear time at almost state-of-the-art quality
§ This implies that there exists a text size n* beyond which the linear algorithm, given the same running time, will produce more correct entities
§ Distributed parallel processing
§ Map-reduce does not always work
§ Parallelism is problem dependent
§ Online processing needs a different approach
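The entity detection example above can be made concrete with a toy cost model. Everything here is an assumption made for illustration: the state-of-the-art algorithm is taken to be quadratic, the constants and accuracies are invented, and more text is assumed to be available than the slow algorithm can process (as on the Web). Under those assumptions the crossover size is n* = (SLOW_ACCURACY × FAST_COST) / (FAST_ACCURACY × SLOW_COST).

```python
# All constants are illustrative assumptions, not measurements.
FAST_COST = 1.0        # time units per token for the linear algorithm
SLOW_COST = 0.001      # time units per token^2 for an assumed quadratic algorithm
FAST_ACCURACY = 0.90   # "almost state of the art"
SLOW_ACCURACY = 0.95   # "state of the art"


def slow_time(n):
    """Time the quadratic algorithm needs for n tokens."""
    return SLOW_COST * n * n


def correct_slow(n):
    """Correct entities (up to a constant density factor) from n tokens."""
    return SLOW_ACCURACY * n


def correct_fast_same_time(n):
    """Correct entities the linear algorithm yields in the time
    the quadratic one needs for n tokens."""
    tokens_processed = slow_time(n) / FAST_COST
    return FAST_ACCURACY * tokens_processed


def crossover():
    """Text size n* at which both algorithms yield the same number of correct entities."""
    return SLOW_ACCURACY * FAST_COST / (FAST_ACCURACY * SLOW_COST)


n_star = crossover()
print(f"n* is about {n_star:.0f} tokens")
for n in (500, 1000, 2000, 4000):
    winner = "linear" if correct_fast_same_time(n) > correct_slow(n) else "quadratic"
    print(n, winner)
```

Below n* the more accurate algorithm wins; above it, the linear algorithm covers so much more text in the same time that its lower accuracy is outweighed.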
- 31 -
Redundancy and Bias
§ Is there any dependency in the data?
§ Is there any duplication?
§ Lexical duplication in the Web is around 25%
§ Semantic duplication is larger
§ Are there any biases?
§ Example 1: clicks in search engines
§ Bias to the ranking and the interface
§ There is a ranking bias in the Web content
§ Example 2: tag recommendation
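One common way to look for the lexical duplication mentioned above is to compare documents by their word shingles and Jaccard similarity. The sketch below is an assumption about how such a check could be done, not the method behind the 25% figure; the documents, the shingle length `k`, and the `threshold` are all hypothetical. The pairwise comparison is quadratic, so it only suits small collections.

```python
def shingles(text, k=3):
    """Set of k-word shingles (contiguous word windows) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Index pairs whose shingle sets have Jaccard similarity >= threshold."""
    sets = [shingles(d, k) for d in docs]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical documents for illustration only.
docs = [
    "the web is a huge public data set",
    "the web is a huge public data collection",
    "click logs reveal implicit user feedback",
]
print(near_duplicates(docs))
```

Semantic duplication (same meaning, different wording) would not be caught by this kind of lexical comparison.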
- 32 -
We can suggest tags: nice but ....
- 33 -
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb
fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several
people with the last name Arnold and “homes sold
in shadow lake subdivision gwinnett county
georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow
who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749,
By MICHAEL BARBARO and TOM ZELLER Jr,
The New York Times, Aug 9 2006
- 34 -
Risks of Privacy
(ZIP code, date of birth, gender)
is enough to identify 87% of
US citizens using public DB
(Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until
each entry is identical to at least k-1
other entries
Federal Trade Commission in
US: Privacy policies should
“address the collection of data
itself and not just how the
data is used”, Dec 2010.
Data Protection Directive in EU
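The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming hypothetical records and column names (`zip`, `birth_year`, `gender`) and using birth year in place of full date of birth; the generalization rule (three ZIP digits, decade of birth) is just one possible choice.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")  # assumed column names


def is_k_anonymous(records, k, keys=QUASI_IDENTIFIERS):
    """True if every quasi-identifier combination occurs at least k times,
    i.e. each entry is identical to at least k-1 other entries."""
    counts = Counter(tuple(r[key] for key in keys) for r in records)
    return all(c >= k for c in counts.values())


def generalize(record):
    """Coarsen one record: keep 3 ZIP digits, bucket birth year by decade."""
    return {
        **record,
        "zip": record["zip"][:3] + "**",
        "birth_year": record["birth_year"] // 10 * 10,
    }


# Made-up records for illustration only.
records = [
    {"zip": "30047", "birth_year": 1944, "gender": "F"},
    {"zip": "30043", "birth_year": 1948, "gender": "F"},
    {"zip": "30519", "birth_year": 1971, "gender": "M"},
    {"zip": "30518", "birth_year": 1975, "gender": "M"},
]
generalized = [generalize(r) for r in records]

print(is_k_anonymous(records, 2))
print(is_k_anonymous(generalized, 2))
```

Generalizing trades precision for privacy: the coarser the attributes, the larger the groups, and the less useful the data may become for analysis.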
- 35 -
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
• Gender: 84%
• Age (±10): 79%
• Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
• Partial name: 8.9%
• Complete: 1.2%
More information:
• A Survey of query log privacy-enhancing techniques
from a policy perspective [Cooper, ACM TWEB 2008]
Good anonymization is still an open problem
- 36 -
Sparsity
§ The Long Tail is always Sparse
§ Why is there a long tail?
§ When the crowd dominates
§ Empowering the tail
§ Example: Relations from Query Logs
- 38 -
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
- 39 -
The Long Tail
Most measures in the Web follow a power law
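A power law can be sketched with synthetic Zipf-like counts, where the item at rank r has volume proportional to r^(-s). The exponent, the number of items, and the head size below are assumptions chosen for illustration. The fit uses least squares on log-log values, which works on this clean synthetic data but is only a rough estimator on real, noisy measurements.

```python
import math


def zipf_counts(num_items, exponent=1.0, scale=1_000_000.0):
    """Synthetic counts where the item at rank r has volume scale / r**exponent."""
    return [scale / rank ** exponent for rank in range(1, num_items + 1)]


def tail_share(counts, head_size):
    """Fraction of total volume outside the `head_size` most popular items."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[head_size:]) / sum(ordered)


def fit_exponent(counts):
    """Negated least-squares slope of log(count) against log(rank)."""
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


counts = zipf_counts(10_000)
print(f"share of volume outside the top 100 items: {tail_share(counts, 100):.2f}")
print(f"fitted exponent: {fit_exponent(counts):.2f}")
```

Even though each tail item is individually rare, the tail as a whole carries a large share of the total volume, which is why it cannot simply be ignored.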
- 42 -
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Movies watched, blogs read, words used, …
One explanation: “normal people” vs. “weirdos”
[Figure: People vs. Interests]
- 43 -
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Applies to word usage, web page access, …
The reality: we are all partially eclectic
[Figure: People vs. Interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009
- 44 -
Example: Click Distribution
User interaction follows a power law! (Zipf’s principle of least effort)
- 45 -
When the crowd dominates
Kills the long tail
See (obsolete now)
“shwarzneger” example
- 46 -
Empowering the Tail
The Filter “Bubble”, Eli Pariser
• Avoid the Poor get Poorer Syndrome
Solutions:
• Diversity
• Novelty
• Serendipity
Explore & Exploit
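One simple explore-and-exploit strategy is epsilon-greedy, used here only as an illustrative stand-in; the slides do not say which strategy was meant. The item names, click-through rates, and `epsilon` values are hypothetical. The sketch shows how pure exploitation can lock onto the first item shown, while a little exploration gives a tail item the chance to be discovered.

```python
import random


def pick(items, clicks, shows, epsilon, rng):
    """Explore uniformly with probability epsilon, else exploit the best observed CTR."""
    if rng.random() < epsilon:
        return rng.choice(items)

    def observed_ctr(item):
        return clicks[item] / shows[item] if shows[item] else 0.0

    # max() keeps the first item on ties, so unseen items never win by default.
    return max(items, key=observed_ctr)


def simulate(true_ctr, epsilon, rounds=5000, seed=0):
    """Return how often each item was shown over `rounds` impressions."""
    rng = random.Random(seed)
    items = list(true_ctr)
    clicks = dict.fromkeys(items, 0)
    shows = dict.fromkeys(items, 0)
    for _ in range(rounds):
        item = pick(items, clicks, shows, epsilon, rng)
        shows[item] += 1
        if rng.random() < true_ctr[item]:
            clicks[item] += 1
    return shows


# Hypothetical click-through rates: the tail item is actually the better one.
TRUE_CTR = {"popular_item": 0.05, "tail_item": 0.10}

greedy = simulate(TRUE_CTR, epsilon=0.0)
exploring = simulate(TRUE_CTR, epsilon=0.1)
print("pure exploitation:", greedy)
print("with exploration: ", exploring)
```

With `epsilon=0.0` the tail item is never shown, so its quality is never observed: a small version of the "poor get poorer" effect named above.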
- 47 -
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around same intent, task, facet, ….
Change granularity “ad hoc”
• Middle-aged men
• Fans of Messi
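Changing the granularity "ad hoc" can be sketched as re-grouping sparse per-user events under a coarser key. The users, segments, and events below are invented for illustration, and the segment labels simply reuse the two examples above.

```python
from collections import defaultdict


def aggregate(events, key):
    """Sum per-item counts over groups of users defined by `key(user)`."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, item, count in events:
        totals[key(user)][item] += count
    return {group: dict(items) for group, items in totals.items()}


# Hypothetical users and their assumed segment.
SEGMENT = {
    "u1": "fans_of_messi",
    "u2": "fans_of_messi",
    "u3": "fans_of_messi",
    "u4": "middle_aged_men",
}

# Hypothetical events: (user, clicked item, count).
events = [
    ("u1", "match_highlights", 1),
    ("u2", "match_highlights", 1),
    ("u2", "jersey_store", 1),
    ("u3", "jersey_store", 1),
    ("u4", "garden_tools", 1),
]

per_user = aggregate(events, key=lambda user: user)
per_segment = aggregate(events, key=lambda user: SEGMENT[user])
print(per_user)
print(per_segment)
```

Each user alone has one or two observations, too few to learn from, while the segment pools them into a usable signal. The cost is that whatever is specific to an individual is averaged away.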
- 48 -
Example: Mining Geo/time Data
• Optimal Touristic Paths from Flickr
• Good for tourists and locals
De Choudhury et al, HT 2010
- 49 -
Aggregating in the Long Tail
• The long tail is important not only for e-commerce, but because we are all there
• Personalization vs. Contextualization
User interaction is another long tail
[Figure: People vs. Interests]
- 69 -
Epilogue
• The Web is scientifically young
• The Web is intellectually diverse
• The technology mirrors the economic, legal and sociological reality
• Data must be interesting! (Gerhard Weikum)
• Problem driven
• Plenty of challenges
- 70 -
Mirror of Society
- 71 -
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006