Keynote baezayates
1. 6/28/13
1
Big Data in The Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
Big Data
• Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
• Large volume and growth
  – Petabytes to exabytes
  – Growth is estimated at 3 exabytes per day
• Structured vs. non-structured data
• Diversity
  – Types, formats, complexity, topics, etc.
• Best Public Data Example: The Web
  – Content: text, multimedia
  – Structure: graphs
  – Usage: real time streams
Big Data
• Focus on analytics
• Many storage technologies:
  – DBs, DWs, distributed file systems, ...
• Many processing technologies:
  – Cloud computing, map-reduce (Hadoop), ...
  – Data mining, clustering, classification, ...
  – Machine learning, A/B testing, NLP, ...
  – Simulation
• Several technology providers
• Initial best practices (see TDWI report, 2011)
• Main challenges: scalability, online
Big Data: The Five V's
Characteristic   Data Issue                                   Computing Issue
Volume           Scale, Redundancy                            Scalability
Variety          Heterogeneity, Complexity                    Adaptability, Extensibility
Veracity         Completeness, Bias, Sparsity, Noise, Spam    Reliability, Trust
Velocity         Real time                                    Online
Value            Usefulness, Privacy                          Business dependent
Asking the Right Questions
• Problem Driven
  – What data do we need? How much?
  – How do we collect it? How do we store and transfer it?
• Understanding the Data
  – How sparse is the data? How much noise?
  – Is there redundancy? Are there biases?
  – Is there spam? Any outliers?
• Analyzing the Data
  – Any privacy issues? Do we need to anonymize?
  – How well do our algorithms scale?
  – Can we visualize the results?
Too Much Data Available
• The Web is a database!
• Data does not imply information
• Many analyses for the sake of it (data driven)
• Analyzing data is not CS per se
• Publish in the right forum!
• Big Data or Right Data?
The Different Facets of the Web
The Structure of the Web
Big Data in the Web
[Figure labels: Metadata, RDF, Wordnet, Wikipedia, ODP, Flickr, Y! Answers, Blogs, Groups, UGC, Text, Anchors + links, Logs (Clicks + Queries), Private; axes: Explicit vs. Implicit, Scale, Quality?]
Noise and Spam
• Noise may come from many places:
  – Instruments that measure
  – How we interpret the data (example later)
• Spam is everywhere
Web Spam
Deceiving text, links, clicks... due to an economic incentive
Depending on the goal and the data, spam is easier to generate
Depending on the type & target data, spam is easier to fight
Disincentives for spammers?
• Social
• Economic
Web Spam is NOT Mail Spam
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
Web Data Trends
• User Generated Content
  – Massive (quality vs. quantity)
  – Social Networks
  – Real time (people + physical sensors)
• Impact
  – Fragmentation of ownership
  – Fragmentation of access (longer heavy tail)
  – Fragmentation of right to access
• Viability
  – Business model based on advertising
The Wisdom of Crowds
• James Surowiecki, a New Yorker columnist, published this book in 2004
  – "Under the right circumstances, groups are remarkably intelligent"
• Importance of diversity, independence and decentralization
"large groups of people are smarter than an elite few, no matter how brilliant - they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future".
Aggregating data
Web Data Mining
• Content: text & multimedia mining
• Structure: link analysis, graph mining
• Usage: log analysis, query mining
• Relate all of the above
  – Web characterization
  – Particular applications
Flickr: Clustering Pictures
"Crowd Sourcing"
Web-based "peer production" has produced a number of successful products and communities
• Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
• Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
• Like outsourcing, but in a micro-distributed fashion
• Thousands of "turkers" working on hundreds of "HITs" (tasks)
• Rates are typically a few cents per task
• Quality of their work is positively evaluated (e.g. in IR)
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
  • not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
  • Queries and actions (or no action!)
The crowd implicitly knows the experts!
Scalability
• How to scale?
  – Doubling the data will, in the best case, double the time
• Time complexity vs. result quality trade-off
  – Example: entity detection in linear time at almost state-of-the-art quality
  – That implies that there exists a text size n* for which the linear algorithm will produce more correct entities
• Distributed parallel processing
  – Map-reduce does not always work
  – Parallelism is problem dependent
• Online processing needs a different approach
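One possible reading of the n* claim above, as a minimal sketch: assume a fixed compute budget, a linear-time detector with slightly lower quality, and a quadratic-time detector with slightly higher quality. The budget, the quality values, the quadratic cost model, and the function names are all illustrative assumptions, not figures from the talk.

```python
import math

def correct_entities(n, quality, tokens_affordable):
    """Correct entities found in a text of n tokens, assuming one candidate
    entity per token and a cap on how many tokens the budget can cover."""
    return quality * min(n, tokens_affordable)

def crossover_size(budget, q_linear, q_quadratic, max_n=10**7):
    """Smallest text size n* at which the linear-time detector finds more
    correct entities than the quadratic-time one (None if it never does)."""
    linear_cap = budget                 # cost n   -> can process `budget` tokens
    quadratic_cap = math.isqrt(budget)  # cost n^2 -> can process sqrt(budget) tokens
    for n in range(1, max_n + 1):
        fast = correct_entities(n, q_linear, linear_cap)
        slow = correct_entities(n, q_quadratic, quadratic_cap)
        if fast > slow:
            return n
    return None

# Hypothetical numbers: 1M operations, 90% vs. 95% quality.
n_star = crossover_size(budget=1_000_000, q_linear=0.90, q_quadratic=0.95)
print(n_star)
```

Under these assumed numbers the higher-quality detector wins on short texts, and the linear one overtakes it just above 1,000 tokens, once the quadratic detector can no longer finish within the budget.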
Redundancy and Bias
• Is there any dependency in the data?
• Is there any duplication?
  – Lexical duplication in the Web is around 25%
  – Semantic duplication is larger
• Are there any biases?
  – Example 1: clicks in search engines
  – Bias to the ranking and the interface
  – There is a ranking bias in the Web content
  – Example 2: tag recommendation
We can suggest tags: nice but ....
Privacy Example: AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men".
Other queries: "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia."
Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs.
"A Face Is Exposed for AOL Searcher No. 4417749", by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9 2006
Risks of Privacy
(ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DB (Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until each entry is identical to at least k-1 other entries
Federal Trade Commission in US: Privacy policies should "address the collection of data itself and not just how the data is used", Dec 2010.
Data Protection Directive in EU
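The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries and that ZIP code, birth year, and gender are the quasi-identifiers; the sample records, field names, and the `generalize_zip` helper are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared by
    at least k records (each entry matches at least k-1 others)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_zip(record, digits=3):
    """Generalize the ZIP code by keeping only its first `digits` digits."""
    out = dict(record)
    out["zip"] = record["zip"][:digits] + "*" * (len(record["zip"]) - digits)
    return out

# Hypothetical records, invented for illustration.
records = [
    {"zip": "30047", "birth_year": 1944, "gender": "F"},
    {"zip": "30048", "birth_year": 1944, "gender": "F"},
    {"zip": "30047", "birth_year": 1950, "gender": "M"},
    {"zip": "30049", "birth_year": 1950, "gender": "M"},
]
qi = ["zip", "birth_year", "gender"]

generalized = [generalize_zip(r) for r in records]
print(is_k_anonymous(records, qi, 2))      # every raw record is unique
print(is_k_anonymous(generalized, qi, 2))  # ZIP3 groups now hold two records each
```

Generalizing an attribute trades precision for privacy: the coarser ZIP prefix makes records indistinguishable within a group, but also less useful for location-level analysis.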
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
• Gender: 84%
• Age (±10): 79%
• Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
• Partial name: 8.9%
• Complete: 1.2%
More information:
• A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
A good anonymization is still an open problem
Sparsity
• The Long Tail is always Sparse
• Why is there a long tail?
• When the crowd dominates
• Empowering the tail
• Example: Relations from Query Logs
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
The Long Tail
Most measures in the Web follow a power law
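To make the power-law claim concrete, here is a minimal sketch that assumes query popularity follows an idealized Zipf distribution (frequency proportional to 1/rank). The number of distinct queries, the exponent, and the head size are illustrative assumptions, not measurements from the talk.

```python
def zipf_frequencies(n_items, exponent=1.0):
    """Normalized frequencies for ranks 1..n_items under a Zipf law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def tail_share(freqs, head_size):
    """Fraction of total volume contributed by items ranked below the head."""
    return sum(freqs[head_size:])

# Hypothetical setup: 1,000 distinct queries, the top 100 form the "head".
freqs = zipf_frequencies(1000)
print(round(tail_share(freqs, head_size=100), 2))
```

In this idealized setup the 900 least popular queries still account for roughly 30% of all traffic, which is why the tail cannot be ignored even though each individual item in it is rare.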
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Movies watched, blogs read, words used, ...
One explanation
[Figure: People vs. Interests, split into "Normal people" and "Weirdos"]
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Applies to word usage, web page access, ...
The reality
[Figure: People vs. Interests] We are all partially eclectic
Broder, Gabrilovich, Goel, Pang; WSDM 2009
Example: Click Distribution
User interaction is a power law! (Zipf's principle of minimal effort)
When the crowd dominates
Kills the long tail
See (obsolete now) "shwarzneger" example
Empowering the Tail
The Filter "Bubble", Eli Pariser
• Avoid the Poor get Poorer Syndrome
Solutions:
• Diversity
• Novelty
• Serendipity
Explore & Exploit
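The slide names the explore/exploit idea without specifying a method. As a minimal sketch, epsilon-greedy selection is one common way to balance the two: it usually shows the best-known item but occasionally shows a random one so that tail items get a chance to collect feedback. The function name, the click-rate estimates, and the epsilon value are assumptions for illustration, not the approach used in the talk.

```python
import random

def epsilon_greedy_pick(click_rates, epsilon, rng):
    """Return the index of the item to show next.

    With probability epsilon, explore: pick any item uniformly at random.
    Otherwise, exploit: pick the item with the highest estimated click rate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(click_rates))
    return max(range(len(click_rates)), key=lambda i: click_rates[i])

# Hypothetical click-rate estimates for three items.
rates = [0.10, 0.50, 0.20]
rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy_pick(rates, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # share of impressions given to the top item
```

A larger epsilon gives the tail more exposure at the cost of showing the best-known item less often.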
How to Circumvent Sparsity?
Wisdom of "ad-hoc" crowds?
Aggregate data in the "right way"
When data is sparse
Aggregate users around same intent, task, facet, ...
Change granularity "ad hoc"
• Middle-aged men
• Fans of Messi
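Changing granularity can be sketched as regrouping the same events under a coarser key. The example below is a minimal illustration, assuming click events are dictionaries carrying a user id and a segment label; the events, field names, and segment assignments are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_clicks(events, group_key):
    """Count clicked items per group; group_key maps an event to its group."""
    groups = defaultdict(Counter)
    for event in events:
        groups[group_key(event)][event["item"]] += 1
    return groups

# Hypothetical click events, invented for illustration.
events = [
    {"user": "u1", "segment": "fans of Messi", "item": "match highlights"},
    {"user": "u2", "segment": "fans of Messi", "item": "match highlights"},
    {"user": "u3", "segment": "fans of Messi", "item": "transfer news"},
    {"user": "u4", "segment": "middle-aged men", "item": "transfer news"},
]

per_user = aggregate_clicks(events, lambda e: e["user"])
per_segment = aggregate_clicks(events, lambda e: e["segment"])

# Each user alone has a single click; the segment has enough to rank items.
print(per_segment["fans of Messi"].most_common(1))
```

Per-user counts are too sparse to rank anything here, while the ad hoc segment pools enough evidence to surface a preferred item.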
Example: Mining Geo/time Data
• Optimal Touristic Paths from Flickr
• Good for tourists and locals
De Choudhury et al, HT 2010
Aggregating in the Long Tail
• The long tail is important not only for e-commerce, but because we are all there
• Personalization vs. Contextualization
User interaction is another long tail
[Figure axes: People, Interests]
Epilogue
• The Web is scientifically young
• The Web is intellectually diverse
• The technology mirrors the economic, legal and sociological reality
• Data must be interesting! (Gerhard Weikum)
• Problem driven
• Plenty of challenges
Mirror of Society
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006