1. Bever Finale 2017 - 2018
Lecture on Big Data
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Nijmegen, March 16th, 2018
2. Big Data
The 3 Vs of Big Data:
- Volume
We measure more and more, and the amount of data we collect grows
faster and faster
- Velocity
Data arrives faster than we can analyse it;
an earthquake warning is only useful if it has been computed
before the quake!
- Variety
Data is increasingly unstructured, in the form of text,
images or video.
3. Big Data: new possibilities!
To
generate data,
share,
combine,
analyse
.. leading to new insights and a new way of
reasoning.
(source: definition of big data by the Nationale DenkTank)
4. For example, in science!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
5. Diversity of data
Tweets!
Everything that is posted on social networks
- Facebook, Instagram, Pinterest, …
Everything that is produced as social media
- YouTube, Flickr, …
Communication:
- WhatsApp messages and other chat services such as Skype,
Snapchat, …
- Email
Location information
- The place where we are, e.g. via smartphone GPS
What we buy, e.g. Bonuskaart (loyalty card), discount coupons, …
and the list goes on; think also of
the Internet of Things (e.g. the central heating boiler,
electricity meters, etc.)
6. Diversity of data (Exercise 1)
Do you take part in creating data too? Yes, there is no way
around it. Just think about it.
How do you create data?
Can you think of data that you created yourself but
that you would rather others did not
use?
Can you also think of data that could be useful to you
once it is on the internet?
9. August 4, 2006: Logs for academics
3 months, 650 thousand users, 20 million search queries
Anonymous user IDs
August 7, 2006: AOL took the data down, but… the internet never
forgets!
August 9, 2006: New York Times identifies Thelma Arnold
“A Face Is Exposed for AOL Searcher No. 4417749”
Queries about a small community, Lilburn, GA (pop. 11k)
Queries for specific names (Jarrett Arnold)
NYT journalist contacts all 14 people in Lilburn with the surname
Arnold
Thelma Arnold confirms her queries
August 21, 2006: 2 AOL employees fired, the CTO leaves as well
September, 2006: class action lawsuit filed against AOL
AnonID   Query                       QueryTime            ItemRank  ClickURL
-------  --------------------------  -------------------  --------  ----------------------------------------
1234567  uw cse                      2006-04-04 18:18:18  1         http://www.cs.washington.edu/
1234567  uw admissions process       2006-04-04 18:18:18  3         http://admit.washington.edu/admission
1234567  computer science hci        2006-04-24 09:19:32
1234567  computer science hci        2006-04-24 09:20:04  2         http://www.hcii.cmu.edu
1234567  seattle restaurants         2006-04-24 09:25:50  2         http://seattletimes.nwsource.com/rests
1234567  perlman montreal            2006-04-24 10:15:14  4         http://oldwww.acm.org/perlman/guide.html
1234567  uw admissions notification  2006-05-20 13:13:13
…
AOL Search Dataset
Tnx Jamie Teevan
10. AOL Search Dataset
Anonymous IDs are no guarantee of anonymity
Logs contain directly identifying information:
Names, phone numbers, credit cards, citizen service numbers (BSNs)
Also indirectly identifying information:
Thelma’s queries from the NYT article
Date of birth, gender and postal code are enough
to uniquely identify 87% of Americans!
Tnx Jamie Teevan
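A minimal sketch of why indirectly identifying fields matter: it counts how many records are the only one with their combination of date of birth, gender and postal code. The records below are invented purely for illustration and are not taken from the AOL data or any real dataset.

```python
from collections import Counter

# Hypothetical records: (date of birth, gender, postal code).
# These values are made up for illustration only.
records = [
    ("1950-03-14", "F", "1000"),
    ("1950-03-14", "F", "1000"),
    ("1982-11-02", "M", "1000"),
    ("1975-07-21", "F", "1001"),
    ("1990-01-30", "M", "1001"),
]

# Count how often each combination of the three fields occurs.
combination_counts = Counter(records)

# A record is uniquely identifiable if nobody else shares its combination.
unique_records = [r for r in records if combination_counts[r] == 1]
share_unique = len(unique_records) / len(records)

print(f"{len(unique_records)} of {len(records)} records are unique "
      f"({share_unique:.0%})")
```

In this toy set, three of the five people can be singled out without any name or ID being present.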
11. Big Data in NL
Purchases at bol.com
Tailored destinations from Booking.com
Selling advertisements in real time, e.g. at nu.nl (Sanoma)
Blendle.nl newsletters
Children's search engine WizeNoze.com
Etc. etc.
18. How much data is that?
Byte = a number between 0 and 255
or a number between -128 and +127
So what about letters?
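One possible answer to the question about letters, sketched in Python: a letter is stored as a number too, according to an agreed code table. The sketch assumes ASCII for plain letters and UTF-8 for an accented letter; the example string is arbitrary.

```python
# A byte has 8 bits, so it can take 2**8 = 256 different values.
unsigned_range = (0, 2**8 - 1)      # read as a non-negative number
signed_range = (-(2**7), 2**7 - 1)  # read as a signed number

# The same bit pattern (all ones) means 255 unsigned or -1 signed.
all_ones = bytes([0xFF])
as_unsigned = int.from_bytes(all_ones, "big", signed=False)
as_signed = int.from_bytes(all_ones, "big", signed=True)

# Letters: each character is mapped to a number by a code table.
text = "Big Data"
codes = list(text.encode("ascii"))  # one byte per letter in ASCII

# An accented letter needs more than one byte in UTF-8.
accented_length = len("é".encode("utf-8"))

print(unsigned_range, signed_range)
print(as_unsigned, as_signed)
print(codes)
print(accented_length)
```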
19.
20. How much data is that?
6000 tweets / s
x 1 KB / tweet
= 6 MB / s
= 500 GB / day
And that is only the tweet text...
… so we are missing:
Pictures
Web pages
Videos
Etc.
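The estimate above can be reproduced in a few lines. This is a sketch under the slide's own round numbers, assuming decimal units (1 KB = 1000 bytes); the variable names are illustrative.

```python
TWEETS_PER_SECOND = 6000
BYTES_PER_TWEET = 1000          # assumption: 1 KB = 1000 bytes
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

bytes_per_second = TWEETS_PER_SECOND * BYTES_PER_TWEET
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(bytes_per_second / 10**6, "MB per second")
print(bytes_per_day / 10**9, "GB per day")
```

The exact result is 518.4 GB per day, which the slide rounds to 500 GB.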
21. BIG Data (Exercise 2)
New data:
30,000 gigabytes/s = 3 x 10^4 x 10^9 B/s = 3 x 10^13 B/s
Hard disk: 2 TB = 2 x 10^12 B
So after 0.07 seconds your hard disk is already full!!
Think about how big the hard disk in your computer is.
If you do not know, just assume that you have a hard
disk of 2 TB (terabytes) available.
How many seconds (or minutes, hours or days) of
data can you store, assuming 30,000 gigabytes per
second?
22. BIG Data
24 hours = 86,400 seconds
At 3 x 10^13 B/s that is 2.6 x 10^18 B of data
At 2 x 10^12 B per disk that is 1,300,000 disks
per day!
So that is how big big is!
How many 2 TB hard disks do you need to store all the data
of one day?
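Both exercises come down to a division, which can be checked with a short sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), as the slides do; the names are illustrative.

```python
NEW_DATA_BYTES_PER_SECOND = 30_000 * 10**9  # 30,000 GB/s = 3 x 10^13 B/s
DISK_BYTES = 2 * 10**12                     # one 2 TB hard disk
SECONDS_PER_DAY = 24 * 60 * 60

# Exercise 2: how long until one disk is full?
seconds_to_fill_disk = DISK_BYTES / NEW_DATA_BYTES_PER_SECOND

# Follow-up: how many disks are needed for one day of data?
bytes_per_day = NEW_DATA_BYTES_PER_SECOND * SECONDS_PER_DAY
disks_per_day = bytes_per_day // DISK_BYTES

print(f"One disk is full after {seconds_to_fill_disk:.2f} seconds")
print(f"One day of data is {bytes_per_day:.1e} bytes")
print(f"That takes {disks_per_day:,} disks per day")
```

The exact count is 1,296,000 disks per day, or roughly 15 disks every second.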
23. Back to Twitter: a little puzzle!
slideshare.net/raffikrikorian/twitter-by-the-numbers
27. A Prototype “Big Data Analysis” Task
Look at each data item
Extract “something interesting”
Aggregate the intermediate results
- This usually requires sorting all the data and redistributing it
across the data centre!
Generate the requested analysis results
(Dean and Ghemawat, OSDI 2004)
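The four steps can be sketched as a word count that runs in a single Python process. This is only an illustration of the pattern, not the distributed system from the cited paper; the function names and the two example documents are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Look at one data item and extract 'something interesting':
    here, a (word, 1) pair for every word."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    """Aggregate the intermediate results: group all values by key.
    In a real cluster this step moves data between machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Generate the requested result for one key: the total count."""
    return key, sum(values)

def run_job(documents):
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_and_sort(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input documents.
documents = ["big data is big", "data is everywhere"]
counts = run_job(documents)
print(counts)
```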
29. MapReduce
[Diagram: four map tasks read the input pairs (k1, v1) … (k6, v6) and emit the intermediate pairs (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8). Shuffle and Sort aggregates the values by key: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Three reduce tasks produce the outputs (r1, s1), (r2, s2), (r3, s3).]
33. Combiners
Commutative and associative operators?
- Then the reduce can already be applied before the shuffle!
Commutative operator:
A + B = B + A
Associative operator:
(A + B) + C = A + (B + C)
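A small single-process sketch of what a combiner buys: because addition is commutative and associative, each map task can already add up its own values per key, so fewer pairs go through the shuffle while the final sums stay the same. The helper names and the example pairs are illustrative assumptions.

```python
from collections import defaultdict

def combine(pairs):
    """Apply the reduce operator (here: +) locally to one map task's output."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def shuffle(map_outputs):
    """Group the values from all map tasks by key."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_sum(groups):
    return {key: sum(values) for key, values in groups.items()}

# Output of four hypothetical map tasks.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

without_combiner = reduce_sum(shuffle(map_outputs))

combined_outputs = [combine(pairs) for pairs in map_outputs]
with_combiner = reduce_sum(shuffle(combined_outputs))

pairs_sent_plain = sum(len(pairs) for pairs in map_outputs)
pairs_sent_combined = sum(len(pairs) for pairs in combined_outputs)

print(without_combiner, with_combiner)
print(pairs_sent_plain, "pairs shuffled without combiner,",
      pairs_sent_combined, "with combiner")
```

Only the second map task has a repeated key here, so the saving is one pair; with real data, where keys repeat often within a task, the reduction in network traffic can be much larger.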
34. [Diagram: MapReduce with combiners. The output of each of the four map tasks passes through a combine step and a partition step before Shuffle and Sort. In the second map task the pairs (c, 3) and (c, 6) are combined into (c, 9), so the reducers receive a → [1, 5], b → [2, 7], c → [2, 9, 8] instead of c → [2, 3, 6, 8], and produce (r1, s1), (r2, s2), (r3, s3).]
40. Average
Not associative!
“The average of a series of averages is not equal
to the average of the original series of numbers”
- When is it equal?!
42. Why does it work now?
The “average” operator no longer works on numbers,
but on the combination of partial sum and count
This new operator is associative and commutative!
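A minimal sketch of this idea, with made-up numbers: averaging the group averages gives the wrong answer when the groups differ in size, whereas merging (partial sum, count) pairs gives the right one in any order. The function names are illustrative.

```python
from functools import reduce

def to_pair(x):
    """Represent one number as (partial sum, count)."""
    return (x, 1)

def merge(p, q):
    """The new operator: add the sums and add the counts."""
    return (p[0] + q[0], p[1] + q[1])

def finish(p):
    """Turn a (sum, count) pair into the average at the very end."""
    return p[0] / p[1]

def mean(values):
    return sum(values) / len(values)

# Two hypothetical groups of unequal size.
group_a = [1, 2, 3, 4]
group_b = [10]

true_mean = mean(group_a + group_b)
mean_of_means = mean([mean(group_a), mean(group_b)])

pair_a = reduce(merge, map(to_pair, group_a))
pair_b = reduce(merge, map(to_pair, group_b))
merged_mean = finish(merge(pair_a, pair_b))

print(true_mean, mean_of_means, merged_mean)
```

Because `merge` is commutative and associative, it can run as a combiner on each machine; only `finish` has to wait until all pairs have been merged.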
43. What have you learned?
New possibilities thanks to more data
Companies like Google and Twitter need a great many
computers: every 12 seconds, more data appears on Twitter
than Shakespeare wrote in his whole life!
With knowledge of algebra we can make algorithms do the same
work with fewer computers
Computer science is great fun!
Editor's Notes
Released at SIGIR 2006
Thelma Arnold, a 62 year old woman from Lilburn, GA
Lawsuit asking for $5000/user
http://en.wikipedia.org/wiki/AOL_search_data_scandal
http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=1
Basic Collection Statistics
Dates: 01 March, 2006 - 31 May, 2006
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID's
Please reference the following publication when using this collection:
G. Pass, A. Chowdhury, C. Torgeson. A Picture of Search. The First International Conference on Scalable Information Systems, Hong Kong, June 2006.
User 927:
Inspired theatrical production by Katharine Clark Gray
User 711391:
Middle-aged woman, has an affair, ends it, tries to save her marriage.
Avg of avg is usually not equal to the avg (except if all groups are equal size)