Social Media Dataset

The goal of this presentation is to help researchers understand the possibilities of Social Media as a research area for fields related to NLP/IR/DM.

  1. 1. Social Media DataSet UPV, Aplicaciones de la Lingüística Computacional April 22nd, 2010 http://www.slideshare.com/jccortizo
  2. 2. José Carlos Cortizo Twitter: @josek_net
  3. 3. Goal ‣ Understand the possibilities of Social Media for Research
  4. 4. Index ‣ About the Speaker ‣ Research at UEM ‣ SGP, Wipley, BrainSins ‣ Social Media statistics ‣ Social Media as research field ‣ Applications ‣ Academic research vs Enterprise
  5. 5. José Carlos Cortizo 10 R&P projects, 4 workshops, 20 papers NLP, DM consultancy 2004 2005 2006 2007 2008 2009 2010
  6. 6. My Company SGP Videogamers Social Network 2,600 reg. users, ~10K monthly users SaaS software for SM & e-C RecSys, Social Search...
  7. 7. SGP Funding • €350K from CDTI (PID project) • €100K from FFF • still looking for extra €100K from BA
  8. 8. Web 2.0 vs. Social Web
  9. 9. What’s Web 2.0?
  10. 10. An evolution (of) and (based on) the Web...
  11. 11. ...focused on users http://www.flickr.com/photos/cayusa/431036565
  12. 12. Concept introduced by “Tim O’Reilly” in 2004 http://www.flickr.com/photos/thomashawk/153656919/
  13. 13. Users own the Information They should manage the information
  14. 14. It’s not a technology, it’s a philosophical concept
  15. 15. Wikis (and Wikipedia), blogs, etc.
  16. 16. Evolution of the Web
  17. 17. Concepts
  18. 18. AJAX • Web development techniques empowering Web 2.0 • Many developers misunderstand the concept of Web 2.0 and think it just means AJAX
  19. 19. Social Web • Describes how people socialize with each other throughout the WWW • 2 descriptions • Web 2.0 • Proposal for a future network similar to WWW
  20. 20. Social Media • Media designed to be disseminated through Social interaction • Internet forums, blogs, microblogging, wikis, podcasts, social networks, etc. • More info: http://www.slideshare.net/jccortizo/taller-redes-sociales-presentation
  21. 21. Social Media Statistics
  22. 22. Facebook
  23. 23. Facebook stats. • > 400M active users • 50% log in to FB on any given day • > 35M users update their status each day • > 60M status updates per day • 3 billion photos uploaded each month • 5 billion pieces of content shared each week
  24. 24. Facebook stats. • Avg. user has 130 friends on FB • Avg. user sends 8 friend req. per month • Avg. user spends > 55 min. per day • > 70% FB users are outside USA • > 500K applications
  25. 25. Facebook stats. • > 60M FB users use FB Connect on external websites • > 100M accessing FB through mobile • > 200 mobile operators in 60 countries deploying/promoting FB mobile products http://www.facebook.com/press/info.php?statistics
  26. 26. Twitter
  27. 27. Twitter stats. • > 105M registered users • 300K users sign up every day • > 180M unique visitors per month • 75% of traffic comes from 3rd party apps • > 600M search queries on Twitter/day • 37% of active users use mobile http://www.readwriteweb.com/archives/just_the_facts_statistics_from_twitter_chirp.php
  28. 28. Facts
  29. 29. SM is the New Web • Facebook traffic tops Google (for USA) • FB > 7% of US traffic • March 2010 • http://money.cnn.com/2010/03/16/technology/facebook_most_visited/
  30. 30. SM envisioning the Future • Mobile Web • Search • Real-time search • Social search • Online identity http://www.madrimasd.org/blogs/sistemas_inteligentes/2009/01/19/111413
  31. 31. Mobile Web • No real mobile web until Social Media • 25% of FB users and 37% of Twitter users access from mobile devices • Trend: more mobile web users than “regular” ones within the next 5 years [1] [1] J. C. Cortizo, L. I. Diaz, F. Carrero, B. Monsalve, “On the Future of Mobile Phones as the Heart of Community Built Databases” to appear in 2011
  32. 32. Real-Time Search
  33. 33. Social Search
  34. 34. Online Identity • User identity is a real business • Facebook Connect, OAuth, OpenID...
  35. 35. Social Media as Research Field
  36. 36. We need Data • We need data to validate our research • Why use “non-real”/small/“non-relevant”/old-fashioned datasets? • UCI • Reuters • ...
  37. 37. SM, Huge DataSet • Billions of users • Billions of content items • Textual, Multimedia (images, videos, etc.) • Billions of connections • Behaviors, preferences, trends...
  38. 38. SM Openness • It’s easy to get data from SM • SM based datasets • Developer APIs • Spidering the Web
  39. 39. Available DataSets • Social Tagging (CiteULike, Bibsonomy, MovieLens, Delicious, Flickr, Last.FM...) • http://kmi.tugraz.at/staff/markus/datasets/ • Yahoo! Firehose (750K ratings/day, 8K reviews/day, 150K comments/day, status updates, Flickr, Delicious...) • http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html
  40. 40. Available DataSets • MySpace data (real-time data, multimedia content, ...) • http://blog.infochimps.org/2010/03/12/announcing-bulk-redistribution-of-myspace-data/ • Spinn3r Blog Dataset/JDPA Sentiment Corpus • http://www.icwsm.org/data/
  41. 41. ...and more • http://delicious.com/pskomoroch/dataset
  42. 42. Key Benefits • There’s a lot of data on SM • It’s fun! • You can work on a real-world domain • Make (real) money with your research?
  43. 43. Where to publish? • ICWSM: AAAI Conference on Weblogs and Social Media • MSM/SMUC: Workshop on Search and Mining User-Generated Contents • WWW: 4 Social Networks sessions + 15 other S.M. related papers • ACM RecSys + Social Web workshop
  44. 44. Where to publish? • ICWSM: AAAI Conference on Weblogs and Social Media • MSM/SMUC: Workshop on Search and Mining User-Generated Contents • WWW: 4 Social Networks sessions + 15 other S.M. related papers • ACM RecSys + Social Web workshop
  45. 45. Where to publish? • Any other ‘typical’ conference from your research area • Social web/search/mining/network analysis workshops at almost any relevant conference
  46. 46. Other Research Uses Twitter Lists
  47. 47. Other Research Uses Twitter Searches
  48. 48. Other Research Uses Twitter Users
  49. 49. Other Research Uses Blogs!
  50. 50. Other Research Uses
  51. 51. Other Research Uses
  52. 52. • Don’t wait until the conferences to learn about advances • Follow interesting researchers through Twitter and their blogs • Peer-reviewing sucks! • You can learn even more from failed attempts, or work in progress • Open your mind...
  53. 53. Some Applications
  54. 54. Buzzer: Twitter RecSys
  55. 55. Buzzer • O. Phelan (@phelo), K. McCarthy, B. Smyth, “Using Twitter to Recommend Real-Time Topical News”, ACM RecSys 2009 • Goal: News Recommendation • Not using Reuters or similar datasets
  56. 56. Buzzer
  57. 57. Buzzer
  58. 58. Why use Twitter? • “Typical” news sites are boring • You’ll get compared to Google News • You’re innovating just by using Twitter • You’ll benefit from Twitter hype • You get a real and interesting system to deploy under real conditions
  59. 59. FlickrBabel: Multilingual multimedia search
  60. 60. Machine Translation • An open problem • But the current state of the art is good enough for some applications
  61. 61. Idea • Do we need a Spanish Metamap? • F. Carrero, J. C. Cortizo, J. M. Gomez, M. de Buenaga “In the Development of a Spanish Metamap”, CIKM 2008 • F. Carrero, J. C. Cortizo, J. M. Gomez, “Testing Concept Indexing in Crosslingual Medical Text Classification”, ICDIM 2008
  62. 62. Idea
  63. 63. Idea
  64. 64. Idea • Results show that using Google Translate is good enough for “simulating” a Spanish Metamap • for classification purposes
  65. 65. Extending to SM • We applied the same idea to FlickrBabel
  66. 66. Extending to SM • We applied the same idea to FlickrBabel • Search images from Flickr • Babxel also searches YouTube • Expands the query and improves the recall
  67. 67. FlickrBabel
  68. 68. FlickrBabel • We got a lot of buzz/users from • Mashable • Loogic • WwwhatsNew • And thousands more blogs/sites
  69. 69. evri: entity based search engine (+ API)
  70. 70. evri • Not a typical search engine • Real-time • Semantic • Entity recognition • Opinion Mining
  71. 71. Offers a great API
  72. 72. What’s good • They integrate a lot of technologies from the state-of-the-art on NLP/IR into something usable • The API can be used to develop evri based products and applications • If you have a good technology, build a good product/service around it
  73. 73. Entity recognition
  74. 74. Recommendations
  75. 75. Sentiment Analysis
  76. 76. Not good enough? • There are no “not good enough” technologies • There are only useful or not useful products/services • Show your technology to the world; they’ll be the best ‘reviewers’
  77. 77. Academic vs Enterprise
  78. 78. Academic: Too idealist · ‘Fantasy?’ world · Too many assumptions · Research · Guided by public funds · Non-applicable | Enterprise: Too pragmatic · ‘Real?’ world · Too few assumptions · ‘Innovate’ · Guided by revenues · Cuts innovation
  79. 79. Academic: Too idealist · ‘Fantasy?’ world · Too many assumptions · Research · Guided by public funds · Non-applicable | There are a lot of opportunities in the middle!! | Enterprise: Too pragmatic · ‘Real?’ world · Too few assumptions · ‘Innovate’ · Guided by revenues · Cuts innovation
  80. 80. Entrepreneurship: the Research Way
  81. 81. • Choose a real world problem (take care of data availability, competitors and utility) • Develop a great technology • Test in a lab environment and publish • Develop a prototype and grant access to beta testers
  82. 82. • Analyze the new results • Write a (presentation based) business plan • Get money from FFF • Develop your product (out of beta) • Get some clients/users
  83. 83. • Write a full business plan • You can get help from your University/other institutions • Get more funding from BAs and VCs • Hire the best coders/employees you can get • Monetize your product/service
  84. 84. • And don’t stop researching and innovating
  85. 85. José Carlos Cortizo Pérez, BY, 2010
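
The FlickrBabel slides (60 to 66) describe translating a query and searching with every variant so that recall improves. The Python sketch below illustrates that idea on toy data only. It is not the FlickrBabel implementation: the dictionary `TOY_TRANSLATIONS` stands in for a machine-translation service, `TOY_ITEMS` stands in for tagged images returned by a search API, and all function names are hypothetical.

```python
# Toy sketch of cross-lingual query expansion (all names and data are
# hypothetical; no real translation or image-search API is called).

# Stand-in for a machine-translation service: term -> translations.
TOY_TRANSLATIONS = {
    "perro": ["dog", "chien"],
    "playa": ["beach", "plage"],
}

# Stand-in for tagged images in a photo-sharing site: id -> set of tags.
TOY_ITEMS = {
    "img1": {"perro", "parque"},
    "img2": {"dog", "park"},
    "img3": {"chien", "jardin"},
    "img4": {"beach", "sunset"},
}


def expand_query(term, translations):
    """Return the original term plus any known translations of it."""
    return {term, *translations.get(term, [])}


def search(terms, items):
    """Return ids of items tagged with at least one of the query terms."""
    return {item_id for item_id, tags in items.items() if tags & terms}


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Assumed ground truth for the toy data: these three images show a dog.
relevant = {"img1", "img2", "img3"}

plain = search({"perro"}, TOY_ITEMS)
expanded = search(expand_query("perro", TOY_TRANSLATIONS), TOY_ITEMS)

print("recall without expansion:", recall(plain, relevant))
print("recall with expansion:", recall(expanded, relevant))
```

On this toy data the plain Spanish query retrieves one of the three relevant images, while the expanded query retrieves all three. A real system would also need to handle the extra noise that imperfect translations introduce, which this sketch ignores.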
