BIG DATA and VERACITY:
A novel approach to data veracity using crowd-sourcing techniques
Samarth Bhargav, Bhoomika Agarwal,
Abhiram Ravikumar and Vrishabh DN
April 18, 2014
Presented at BMS Institute of Technology, Bangalore
Introduction
Big Data
● What is Big Data?
● The 3 traditional V’s
o Volume
o Velocity
o Variety
● The fourth V: Veracity
● Crowdsourcing
[Diagram: Volume, Velocity, Variety, Veracity]
The 4 Vs of Big Data
Source: http://well-managed-business-intelligence.blogspot.in/2012/06/big-data-fourth.html
Crowdsourcing - Models in place
GOOGLE MAPS
WIKIPEDIA
DUOLINGO
RECAPTCHA
AMAZON TURK
reCAPTCHA
● Digitizing books one word at a time
● Puts the roughly 10 seconds a human spends on a CAPTCHA to productive use
● Digitizing old books is a herculean task for computers
● An efficient alternative to OCR
● Workflow: entry, multiple checks, verify, upload
● 20 years of The New York Times were digitized in just a couple of months
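A minimal sketch of the multiple-checks step in the reCAPTCHA workflow, assuming a word is accepted once enough independent human answers agree. The function name `accept_word` and both thresholds are illustrative assumptions, not reCAPTCHA's actual rules.

```python
from collections import Counter


def accept_word(transcriptions, min_votes=3, min_agreement=0.75):
    """Return the agreed transcription, or None if there is no consensus.

    Hypothetical rule: at least `min_votes` non-empty answers, and the most
    common answer must make up at least `min_agreement` of them.
    """
    # Normalise so that case and stray whitespace do not split the vote.
    cleaned = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(cleaned) < min_votes:
        return None
    word, count = Counter(cleaned).most_common(1)[0]
    return word if count / len(cleaned) >= min_agreement else None
```

A word that fails the check would simply be shown to more people before it is verified and uploaded.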
Google Maps
● “Enrich Google Maps with your local knowledge”
● The Google Map Maker project
● Data used by Google Maps and Google Earth
● Projects like Photo Sphere and Street View use huge
contributions from the masses
● Workflow
○ add/edit places
○ verified by a moderator
○ cross-referenced and updated
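The add/edit, moderate, cross-reference workflow for Google Maps can be sketched as a small state machine. This is an in-memory illustration only: `PlaceEdit`, `review` and `publish` are hypothetical names, and the dictionary stands in for the map data; none of this is the Map Maker API.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceEdit:
    place: str
    change: str
    status: str = "pending"  # pending -> approved/rejected -> published
    history: list = field(default_factory=list)


def review(edit, moderator_approves):
    """Moderator step: a pending edit becomes approved or rejected."""
    if edit.status != "pending":
        raise ValueError("only pending edits can be reviewed")
    edit.status = "approved" if moderator_approves else "rejected"
    return edit


def publish(edit, map_data):
    """Cross-reference an approved edit against map_data and apply it.

    Returns False (and changes nothing) unless the edit was approved.
    """
    if edit.status != "approved":
        return False
    # Cross-reference: is this a known place (update) or a new one (add)?
    action = "updated" if edit.place in map_data else "added"
    map_data[edit.place] = edit.change
    edit.status = "published"
    edit.history.append(action)
    return True
```

Keeping the moderator check separate from publishing means an unreviewed edit can never reach the shared data.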
WIKIPEDIA
● Termed the “mother of all encyclopedias”
● Hosts an immense pool of data, multilingual and entirely community-driven
● Run on donations from all over the world (crowdfunding)
● Dynamic and constantly updated, which gives it an edge over
traditional encyclopedias
● Unbiased and high-quality information
● Data verification and validation done instantly by both experts
and the general public
DUOLINGO
● Learn a language and translate the Web
● Entirely free and crowd-driven
● Created by Luis von Ahn, also behind the ESP Game and reCAPTCHA
● Workflow
o website to be translated is uploaded
o broken into parts & given to students
o students translate the document as part of their learning
o translated document returned to owner
● Win-win situation for both students and corporates
● Popular on both web and mobile platforms
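A rough sketch of the Duolingo workflow above: split the document into parts, hand the parts to students, and reassemble the result for the owner. The names `split_into_parts` and `crowd_translate`, the round-robin assignment and the `translate` callback are all assumptions made for illustration.

```python
import re


def split_into_parts(document):
    """Break a document into sentence-sized parts (naive split on . ! ?)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def crowd_translate(document, students, translate):
    """Assign parts to students round-robin and rejoin the translations.

    `translate(student, part)` is a stand-in for the work a learner does.
    """
    translated = []
    for i, part in enumerate(split_into_parts(document)):
        student = students[i % len(students)]
        translated.append(translate(student, part))
    return " ".join(translated)
```

A real system would give each part to several students and compare their answers before returning the document.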
Amazon Mechanical Turk
● Use of crowd-powered “artificial artificial intelligence” to run businesses
● HITs (Human Intelligence Tasks) enable machine learning applications
● Workflow
o Requester places task on the site or through API
o Provider (worker) picks a suitable task
o Payments made through Amazon gift certificates
● Advantages include
o Quality assurance
o Scalability options
o Lower cost
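The requester/provider workflow can be modelled with a tiny in-memory marketplace. This is a toy sketch, not the Mechanical Turk API: the classes `Hit` and `Marketplace`, the reward in cents and the "highest reward first" choice are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    title: str
    reward_cents: int
    assigned_to: Optional[str] = None  # worker name once the task is picked


class Marketplace:
    def __init__(self):
        self.hits = []

    def post(self, title, reward_cents):
        """Requester step: place a task on the marketplace."""
        hit = Hit(title, reward_cents)
        self.hits.append(hit)
        return hit

    def pick(self, worker, min_reward_cents=0):
        """Provider step: take the best-paying open task, if any is suitable."""
        open_hits = [
            h for h in self.hits
            if h.assigned_to is None and h.reward_cents >= min_reward_cents
        ]
        if not open_hits:
            return None
        hit = max(open_hits, key=lambda h: h.reward_cents)
        hit.assigned_to = worker
        return hit
```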
Analysis
● Handling data IS important
● Google Flu Trends
● Kickstarter and CosmoQuest
● A lot of scope and wide-ranging opportunities
Repercussions
● Senator Kennedy’s story
● FCRA (Fair Credit Reporting Act)
● Crowds are often unaware of data acquisition
● Confidential data and security leaks must be
handled with care
Conclusion
Crowdsourcing model   Volume      Velocity    Variety     Veracity
Google Maps           terabytes   high        low         medium
Duolingo              terabytes   medium      high        high
reCAPTCHA             petabytes   very high   very high   very high
Amazon Turk           petabytes   medium      very high   high
Wikipedia             petabytes   medium      high        very high
Questions & Answers time! :-)
Source: http://2.bp.blogspot.com/
Thank you, UTSAHA 2k’14.
