10 Years of Web Science
1. 10 Years of Web Science
A Look at the Next 10 Years
http://www.webscience.org/webscience10/tv-channel-webscience10/
Steffen Staab
Chair of WSTNet
WAIS, Univ. of Southampton
WeST, Univ. of Koblenz-Landau

Wolfgang Nejdl
L3S, Leibniz Universität Hannover

Nikolaus Forgó
L3S, Leibniz Universität Hannover
3. World Wide Web
Work
Dating: 17% of marriages in the US result from online dating
Traveling
Learning
Leisure
Science: open-access papers are cited more often (11:7)
https://flic.kr/p/F37KoU
7. 10 Years of the Web Science Research Initiative
Keynotes
• Ricardo Baeza-Yates, Yahoo!
• Andrew Tomkins, Google
• Daniel Olmedilla, Facebook
• Jure Leskovec, Stanford & Pinterest
• Daniel Miller, UCL, ERC Grant "Social Network Sites and Social Science"
• Helen Margetts, Oxford Internet Institute
Panels
10 Years of Web Science
Computational Social Science
Privacy and Internet Governance
8th ACM Web Science Conference 2016: 22–25 May 2016, Hannover
http://websci17.org/
Troy, NY, USA, 26–28 June 2017
8. WWSSS – WSTNet Web Science Summer School
Koblenz 2016; St. Petersburg 2017
Thomas Risse, 30/11/16
Next one: St. Petersburg, July 2017
9. Web Science – the grand challenge: the Web Science Observatory
Researchers around the world gathering and sharing data and evidence
Sharing tools, methods and techniques
Web Science Collaboratories
Longitudinal studies
Wendy Hall, CeBIT 2013
11. ALEXANDRIA (ERC Advanced Grant, 2.5 million euros)
World Wide Web – society's digital heritage
What remains of the Web in 100 or 1,000 years if nobody preserves it?
Data collection by the Deutsche Nationalbibliothek, the British Library, the Internet Archive, and others
Search and analysis by ALEXANDRIA
Development of new models and algorithms that make it possible to access not only the present but also the past of the Web
[Figure: semantic and temporal search for "Rudolph Giuliani" – number of documents over time, 1997–2014; peaks labelled Mayoral Campaign (1997, 2000), 9/11, Senate/cancer/allegations, and post-politics endeavours]
12. SoBigData – Social Mining & Big Data Ecosystem
Big data analytics and social mining as a tool to measure, understand and possibly predict human behavior
A research infrastructure (RI) for ethics-sensitive scientific discoveries and advanced applications of social data mining to the various dimensions of social life, as recorded by "big data".
13. Integrating key national infrastructures and centers of excellence
CNR & Uni Pisa (SoBigData.it): social data; big data analytics and social mining services
Uni Hannover/L3S (Alexandria): German web archive (80 TB); services and expertise on web archives
Uni Sheffield (GATE Cloud): natural language processing and text mining
FhG IGD & FhG IAIS: information visualization and visual analytics
Aalto University: data, services and competences on social network analysis
Uni Tartu (E-Gov.data): Estonian e-government and e-health data
ETH Zürich: search engine for open data
14. 1st Call for SoBigData-funded Transnational Access
Research stays (up to 2 months) at SoBigData partners on the topics:
* City of Citizens * Well-being and Economy
* Societal Debates * Migration Studies
24. Uber, the world's largest taxi company, owns no vehicles.
Facebook, the most popular media owner, creates no content.
Alibaba, the most valuable retailer, has no inventory.
Airbnb, the largest accommodation provider, owns no real estate.
Data Oligopolists
25. Uber: Whom do you take a ride with?
- the right picture – also for online dating ...
Facebook: Which source do you trust?
- rumor checks change the trust ...
Alibaba: Whom do you trust to buy from?
- others' ratings
Airbnb: Whom do you want for a sleepover?
Trust
42. Diagnosis
30% of Europeans have still never used the internet.
Europe has only 1% penetration of fibre-based high-speed networks, compared with 12% in Japan and 15% in South Korea.
EU spending on ICT research and development stands at only 40% of US levels.
There are four times as many legal music downloads in the US as in the EU.
58. Web Science – The Next 10 Years
Social challenges: discrimination, trust, moral AI
Legal challenges: regulation of infrastructure for economic competition; tracking everywhere
Political challenges: misinformation, participation, internet governance
Technical challenges: artificial intelligence, security, ...
Editor's Notes
Yesterday, hundreds of thousands of internet connections failed; today there were several reports of people who no longer knew what to do.
For example, in the US it is now reported that between 15% and 20% of newly married couples met their spouses online (cf. http://www.statisticbrain.com/online-dating-statistics/).
https://www.timeshighereducation.com/home/open-access-papers-gain-more-traffic-and-citations/2014850.article
Out of 350K different sites visited by 200K users over a 7-day period, 273K sites contained trackers that were sending information we deemed unsafe. Data elements that are only and always sent by a single user, or a reduced set of users, are considered unsafe with regard to privacy.
50% of news sites carry at least 11 different trackers.
- The majority of your friends on Facebook have more friends than you do (the friendship paradox).
Here: a. the majority of your friends are colored
b. the majority of your friends are non-colored (same network)
A practical example might be media biases.
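The friendship-paradox claim above can be checked on a toy graph. The adjacency list below is made-up data for illustration only; the key observation is that high-degree people are over-represented when you average over friends rather than over people:

```python
# Toy undirected social network as an adjacency list (made-up data).
graph = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann"],
    "cat": ["ann"],
    "dan": ["ann", "eve"],
    "eve": ["dan"],
}

degree = {v: len(nbrs) for v, nbrs in graph.items()}

# Average number of friends of a randomly chosen person...
avg_degree = sum(degree.values()) / len(degree)

# ...versus the average number of friends of a randomly chosen *friend*
# (popular people appear in many friend lists, so they dominate this average).
avg_friend_degree = (
    sum(degree[f] for nbrs in graph.values() for f in nbrs)
    / sum(degree.values())
)

print(avg_degree)         # 1.6
print(avg_friend_degree)  # 2.0 -- on average, your friends have more friends than you
```

The same sampling bias is what makes friend-of-friend averages misleading in real social networks.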
Big Data presents opportunities for data mining and machine learning that were previously unimaginable, given the vast size of the datasets from which we are able to learn, cluster, find associations, and generally search for insights not attainable before. Mining Big Data is not a plug-and-play, one-size-fits-all, (insert another cliche here) process, however; though there seems to be alarmingly little discussion of their importance in relation to Big Data anymore, statistical thinking, methods, and processes matter.
It is possible that the lack of discussion is because most people already understand this fundamental truth, which I find doubtful. Perhaps I simply have not come across such relevant topics of late, and they do, in fact, exist. I also find this doubtful. I fear that oversight, or an essential lack of understanding, is more likely to blame.
Big Data
This article is not a blanket criticism of learning from Big Data; it is, much more accurately, a reminder that time-tested statistical methods are more valid now than ever, in this Era of Big Data. In that regard, this discussion will focus on two particular statistical issues to be on the lookout for in your own work and in the work of others mining and learning from Big Data.
And for the practitioners out there, this is not about abstract statistical theory. This is about practicality. And the highly improbable probabilities that can be improperly gleaned from Big Data.
The Bonferroni Principle
There is a concept in statistics that goes like this: even in completely random datasets, you can expect particular events of interest to occur, and to occur in increasing numbers as the amount of data grows. These occurrences are nothing more than collections of random features that appear to be instances of interest, but are not. This bears repeating: even purely random data gives rise to what seem to be events of interest, and the number of these seemingly interesting events grows with the size of the dataset.
The Bonferroni Principle is a statistical method for accounting for these random events. To employ it, calculate the number of events of interest you would expect to occur by chance alone; if this expected random count is comparable to, or greater than, the number of real events you hope to find, then almost any observation you make is likely to be spurious and will provide no useful insight. The Bonferroni Correction is a related technique for helping to avoid such spurious observations.
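As a concrete sketch of the correction: when running m hypothesis tests at overall significance level alpha, the standard Bonferroni rule tightens the per-test threshold to alpha / m. The p-values below are hypothetical, for illustration only:

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Corrected per-test significance threshold when running m tests."""
    return alpha / m

# Hypothetical p-values from m = 4 independent hypothesis tests.
p_values = [0.003, 0.012, 0.030, 0.040]
alpha = 0.05

threshold = bonferroni_threshold(alpha, len(p_values))  # 0.05 / 4 = 0.0125
significant = [p for p in p_values if p < threshold]

print(threshold)    # 0.0125
print(significant)  # [0.003, 0.012]
```

Without the correction, all four tests would pass at alpha = 0.05; with it, only the two strongest survive, which is exactly the guard against chance discoveries that the principle calls for.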
Torture the data, and it will confess to anything.
— Ronald Coase, economist, Nobel Prize Laureate.
One of the most prominent and easy-to-understand examples of the Bonferroni Principle is the George W. Bush administration's Total Information Awareness data-collection and data-mining plan of 2002. The criticism of the plan's effectiveness, and its relationship to the Bonferroni Principle, is as follows.
Suppose we are looking for terrorists in a pool of a very large number of individuals, while in actuality only an incredibly small number of them are terrorists. Suppose, further, that terrorists are thought to reveal themselves by deliberately visiting particular locations in pairs, but that in reality everyone is simply a non-terrorist moving about randomly. Working out the probabilities with hard numbers, Rajaraman & Ullman give the example of one billion potential "evil-doers": even though the number of genuinely conspiring pairs may be very small (they give the example of 10 pairs), pure randomness could put the number of pairs flagged as repeatedly meeting at the same locations at around 250,000.
Now, this is clearly a problem. In purely practical terms, imagine having to recruit, train, and pay enough police personnel to investigate each of these flagged individuals!
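The arithmetic behind that quarter-million figure can be sketched directly. The parameters below (a billion people, 1,000 days of observation, 100,000 hotels, and a 1% chance that a given person visits a hotel on a given day) are those of Rajaraman & Ullman's hotel-visit version of the example, where a "suspicious" event is two people at the same hotel on each of two different days:

```python
from math import comb

people = 10**9      # potential "evil-doers" under observation
days = 1_000        # days of observation
hotels = 100_000    # hotels they might visit
p_visit = 0.01      # chance a given person visits some hotel on a given day

# Two specific people at the same hotel on one specific day:
p_same_hotel_one_day = p_visit * p_visit / hotels   # 1e-9
# ...and at the same hotel again on a second specific day:
p_both_days = p_same_hotel_one_day ** 2             # 1e-18

person_pairs = comb(people, 2)   # ~5e17 pairs of people
day_pairs = comb(days, 2)        # ~5e5 pairs of days

expected_random_suspects = person_pairs * day_pairs * p_both_days
print(f"{expected_random_suspects:,.0f}")  # roughly 250,000 purely random "suspect" pairs
```

With only 10 real conspiring pairs expected, essentially every one of those quarter-million flagged pairs is noise.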
If a Big Data mining practitioner had first computed the number of events to be expected from randomness alone (the Bonferroni Principle in action), the entire investigation would have been immediately recognized as flawed: the quarter of a million "suspicious" pairs found above is almost exactly what randomness alone predicts, so the flagged pairs carry essentially no signal.
Knowing when our out-of-the-gate quantitative assumptions are off base is critically useful in the Era of Big Data. The Bonferroni Principle is one example of how Big Data can result in highly unlikely outcomes masquerading as statistically sound.