Today 71 years ago the Dutch cruiser De Ruyter sank in the Java sea after being hit by a Japanese torpedo345 lives were lost among which that of commander Karel DoormanIt would be interesting to examine the reaction of the Dutch to this event
In 2013, many conversations take place on social media like Facebook, Twitter, Hyves and LinkedInThe conversations are stored and can be analyzed later
Social media generate huge data volumesApart from needing hardware you will need data processing specialists to deal with this data
Example projects: Ecology (gull with tracking device), Astronomy (radio telescope), Climate research (ocean level model)
eScience engineers are experts in dealing large-scale data problems
We have observed that Twitter is an interesting data source for researchersHowever, researchers find it difficult to select relevant data from Twitter
The red line in the graph shows the number of collected messages per minute by keyword tracking on Monday 10 February 2013. The green line shows the corresponding estimated total number of Dutch messages.
We collect more than 50 tweets per second (apart from the night hours)Hadoop is a framework for parallel computing used by Amazon, Facebook, Google, Twitter and Yahoo, among othersThe use of the framework has made searches over larger time frames possible: before one day, now several years
We are not allowed to distribute tweet text to others so rather than supplying tweets we supply tweet ids. The corresponding tweet text can then be looked up at Twitter’s website.See: https://dev.twitter.com/terms/api-terms (I.4.A)
Frequency of tweets containing the Dutch word “eten” (to eat) on Monday 7 January 2013. The three peaks correspond with breakfast, lunch and dinner.
Frequency of tweets containing the word “rechtbank” (court building) on 21 September 2013. At that date someone fired a mortar grenade at a court building in Amsterdam at 02:30 AM. The big peak shows the moment the topic was mentioned on the news. The small peaks indicate possible witness reports on Twitter. These reports are interesting for journalists looking for important current local topics.
Relative usage frequency of the word “griep”(flu) on Twitter in January and February 2013
The blue line shows the number of people with flu symptoms according to http://degrotegriepmeting.nl/Parts of the graphs correspond: the rise from 15/Jan to 4/Feb and gentle fall afterwardsThe graphs are different in the period 6/Jan – 10/Jan: because of the effect of news messages?
Locations of tweets containing the Dutch phrase “vast en zeker” (certainly)Dense population areas are clearly visible (Amsterdam, Rotterdam, The Hague, Utrecht, Eindhoven, Zwolle)Colors: blue (some tweets), green, yellow, red (many tweets)
In Flanders, the phrase “zeker en vast” is used rather than “vast en zeker”
Tweets containing dialect phrases can be used for automatically deriving words typical for the dialect
Demographics of persons behind the tweets of Tuesday 4 December 2012. The male-female ratio is close to 50-50. Teens (10-17 years) dominate the conversations, being responsible for about 50% of the tweets. People of students age (18-25 years) created about 25% as well as people from 26 years and older (about 25%).
Demographics related to tweets containing the word “bonus” in February 2013. Most relevant messages originate from male users (70%) and people that are older that 25 years (80%). This shows that different topics generate interest from users with different demographics.
“sneeuw” (snow) on Dutch Twitter on Monday 14 January 2013A snow front hit The Netherlands at 12:00, moving from south west to north eastThe moving front is visible on Twitter, by checking the locations of tweets containing the word “sneeuw”
Typical words appearing in tweets from February 2013 containing the token “srie13”
Twitter: listening to a kitchen table conversation involving 17 million people
Twitter: Listening to a Kitchen TableConversation Involving 17 Million PeopleErik Tjong Kim Sang (Netherlands eScience Center, Amsterdam)
28 February 1942: De Ruyter sinks Ship: HNLMS De Ruyter Commander: Karel Doorman Date: 28 February 1942 Location: Java Sea Lives lost: 345
How do we deal with the data volumes?Twitter:• Maximum of 140 characters per messageBut:• 200 million users (1-2 million NL)• 300 million tweets per day (3-4 million NL)
Netherlands eScience Center (NLeSC)by SURF and NWOSupporting data-intensive researchDistributing project calls offering bothmoney and expertiseInvolved in 21 projects dealing withvarious areas in scienceSee: http://esciencecenter.nl/
Engineers/Scientists @ NLeSCIn every NLeSC project, one eScience engineer isinvolved: a scientist (often with PhD) with strongcomputational skills
The project TwiNLwith Radboud University NijmegenTwitter: a social network on whichparticipants communicate by exchangingmessages of up to 140 charactersWith about 4 million messages in Dutch per day covering differenttopics, this is an interesting data source for researchersProject task: collect Dutch tweets, offer a search facility for the data andprovide different views on search results
Collecting the dataWe continuously search for common Dutch words on Twitter (keywordtracking)Additionally we collect all the tweets of most productive Dutch-speakingusers (user tracking)This generates about 2 milliontweets per day
Searching the dataA web interface has been built for searching the data: http://twiqs.nl
Challenges and solutionsChallenge 1: we have a lot of data (five terabytes)Challenge 2: the collection is growing every secondChallenge 3: we want to search both text and metadataSolution: parallel processing with Hadoop, make possible by :
Views on search resultsWe offer five different views on search results:• Tweets (restricted)• Keyword frequency graph• Keyword usage map• Vocabulary overview, including sentiment analysis• User analysisAll numbers used for creating the views are available for download
Concluding remarksSocial media conversations make up interesting research dataThe website http://twiqs.nl can be used for accessing this dataThe interface can be used by university staff and students for exploringTwitter conversationsExamples of applications are early news detection, tracking diseasespread, studying dialects and obtaining local weather information