Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Simran confidentiality protection in crowdsourcing

135 views

Published on

Crowdsourcing is the practice of getting information or input for a task or project from a number of people by floating out the task at hand to a pool of people who are usually not full-time employees. Use of Crowdsourcing to get work done is on the rise due to the benefits that it offers. Trends and surveys show hat the conventional workforce is moving towards gig-economy, and by 2020, 43% of the US workforce is expected to be on-demand workforce [9]. In our work, we focus on higher level software development/engineering tasks and how they could potentially lead to Confidentiality Loss in the process of giving the task and getting it done. We look at different stages and components of the crowdsourcing cycle to examine potential information leak sources. We conducted a survey to study how people perceive this problem and what is their level of understanding when it comes to sharing information online. We also analyzed a dataset of tasks posted online previously to get insight if the problem persists in the real world or not. Some conversations between the task posters and the workers were studied along with a dataset of reviews for workers to get a deeper insight into the problem and its detection. Based on the analysis, we propose NLP based techniques to detect such potential leaks and nudge the task poster before the information is dissipated. Having such an additional layer of scrutiny at the level of companies before a task is given for crowdsourcing out will keep in check any loss of confidential information.

Published in: Engineering
  • Be the first to comment

  • Be the first to like this

Simran confidentiality protection in crowdsourcing

  1. 1. Confidentiality Protection in Crowdsourcing Simran Saxena linkedin/in/simransaxena @simran_s21 fb.com/simran.s21 Dr. Ponnurangam Kumaraguru (Chair) Dr. Alpana Dubey (Co-chair)
  2. 2. 2 Thesis Committee ◆ Dr. Arun Balaji Buduru, IIIT Delhi ◆ Dr. Niharika Sachdeva, InfoEdge ◆ Dr. Alpana Dubey, Accenture Technology Labs ◆ Dr. Ponnurangam Kumaraguru, IIIT Delhi
  3. 3. Demo ◆ Proof of Concept 3
  4. 4. 4
  5. 5. Outline ◆ Research Motivation ◆ Research Aim ◆ Crowdsourcing at the level of organizations ◆ Survey to understand confidentiality ◆ Unboxing a typical crowdsourcing task ◆ Understanding conversations between workers and task posters ◆ Protecting confidentiality loss in crowdsourcing ◆ Conclusion 5
  6. 6. Shift Towards a Gig-Economy: Benefits 6 Cost Effective Faster delivery Access to a diverse talent pool Ref: http://isroset.org/pub_paper/IJSRCSE/2-IJSRCSE-0315.pdf
  7. 7. Benefits of Crowdsourcing 7 Cheaper and faster 60% of the times Reduces cost by 30 - 80% Reduced time-to-market because of Parallelism Ref: 1, 2
  8. 8. Happiness and Satisfaction Levels 8 74% 95% 72% Ref: https://fieldnation.com/wp-content/uploads/2017/03/The_2016_Field_Nation_Freelancer_Study_R1V1__1_-2.pdf
  9. 9. Crowdsourcing as the Next Big Revolution 9 By 2020, freelancers will form 43% of the workforce in USA
  10. 10. Crowdsourcing in Organizations 10 Many popular companies make use of Crowdsourcing
  11. 11. Crowdsourcing and Software Development 11 Coding Designing, Prototyping Testing Data Analysis Mobile Development Web Development
  12. 12. Crowdsourcing Platforms 12 And many more..
  13. 13. Stakeholders in Crowdsourcing 13 Task Poster -> tp Worker -> w Company/Organization -> c
  14. 14. Components of a Crowdsourcing Task ◆ ~ 20 attributes per task ○ Task ID ○ Task Title ○ Task Description ○ Task Category ○ Task Type ○ Task Workload ○ …. 14
  15. 15. Resources shared commonly 15 Textual Content: Title, Description, Documents, Read-me files. Database UI Screens Images Code
  16. 16. Sample Software Development Task [1] 16
  17. 17. Why is it a problem? 17 (A) (B)
  18. 18. 18 (A) (B) Exposed data of the Customers: ● Name ● Address ● Email ID ● Phone number ● Details of availed services ○ Date ○ Time ○ Cost
  19. 19. Why is it a problem? 19 (B)(A)
  20. 20. Why is it a problem? 20 (A) (B)
  21. 21. Impacts of Exploited Vulnerability ◆ Economic Exploitation ◆ Information Theft ◆ Intrusion on personal privacy ◆ Social engineering ◆ System penetration/attack ◆ Information Bribery ◆ Sale of personal information ◆ System sabotage ◆ Unauthorized system access ◆ CIA Loss 21
  22. 22. Research Aim: Given Crowdsourced Software Development: ◆ Identify ○ stages involved in confidentiality loss ○ critical attributes of a task ○ type of critical data ◆ Develop techniques to analyze tasks 22 INPUT Task Title Task Description OUTPUT A list of sensitive terms
  23. 23. Outline ◆ Research Motivation ◆ Research Aim ◆ Crowdsourcing at the level of organizations ◆ Survey to understand confidentiality ◆ Unboxing a typical crowdsourcing task ◆ Understanding conversations between workers and task posters ◆ Protecting confidentiality loss in crowdsourcing ◆ Conclusion 23
  24. 24. The Crowdsourcing Cycle 24
  25. 25. The Crowdsourcing Cycle 25
  26. 26. The Crowdsourcing Cycle 26
  27. 27. The Crowdsourcing Cycle 27
  28. 28. The Crowdsourcing Cycle 28
  29. 29. The Crowdsourcing Cycle 29
  30. 30. The Crowdsourcing Cycle 30
  31. 31. The Crowdsourcing Cycle 31
  32. 32. The Crowdsourcing Cycle 32 Stages from which information leak could occur
  33. 33. ◆ PII details are sensitive ◆ Sharing a database publicly may compromise with confidentiality ◆ Sensitive information in code ◆ Sensitive components in the comments Survey to validate some assumptions 33
  34. 34. Survey to validate some assumptions ◆ Wireframes and UX examples may give out information about the company posting the job: logo ◆ Market of the end result of the task can be inferred given a task ◆ Domain of the task can be identified from the task, given the type details it is asking for 34
  35. 35. Survey to understand Confidentiality Suppose you are a project manager in a multinational organization and you need to get some job done by an external freelancer. These jobs require sharing some details like sample code, documents, UX wireframes, etc. and some information might be sensitive in these resources. You need to answer the following questions considering the above scenario. 35
  36. 36. Likert scale based Questions 36Type of Data SensitivityLevelsAverage Sensitivity Levels of Sharing Various Details
  37. 37. ◆ 6.4% of the respondents believed that sharing a database publicly is not a sensitive activity ◆ 2.2% people think that sharing API Keys is okay Some observations 37
  38. 38. Situational Questions 38
  39. 39. Sample Task Question This is a sample task posted by a Task Poster, according to you, what all sensitive data does it contain? 39 I am using a online platform called tomakeanapp.com, there is a module which you can upload a picture but I HAVE TO DO IT one by one. I need a imacro or anyscript that could upload a batch of photo from a folder. No other requirement. That's it. It is dead simple and first come first serve. Please apply If you want to know what i am saying. Go to testmyapp.com Username: trialappdemo Password: password134 >Click my project> Click "Edit", then you will see "TraiApp" in YOUR ACTIVITY Then click "Edit" and you will see the upload screen I talked above
  40. 40. Sample Task Question This is a sample task posted by a Task Poster, according to you, what all sensitive data does it contain? 40 I am using a online platform called tomakeanapp.com, there is a module which you can upload a picture but I HAVE TO DO IT one by one. I need a imacro or anyscript that could upload a batch of photo from a folder. No other requirement. That's it. It is dead simple and first come first serve. Please apply If you want to know what i am saying. Go to testmyapp.com Username: trialappdemo Password: password134 >Click my project> Click "Edit", then you will see "TraiApp" in YOUR ACTIVITY Then click "Edit" and you will see the upload screen I talked above
  41. 41. Sample Code Question Following is a sample code snippet, point out the parts in the code that you think are sensitive: 41
  42. 42. Sample Code Question Following is a sample code snippet, point out the parts in the code that you think are sensitive: 42
  43. 43. A wireframe of an App has been attached, what all can you infer from it? 43
  44. 44. A wireframe of an App has been attached, what all can you infer from it? 44
  45. 45. What can you infer about the type of industry that this app is for? 45
  46. 46. What can you infer about the type of industry that this app is for? 46
  47. 47. 1. Master Database of a company 2. Personal Information: a. Name b. Email ID c. SSN / Unique Identification Number d. Address e. Date of Birth f. Employee ID g. Health related information 3. Context for the given task 4. Company related details: a. Details of the clients of a company b. Details of a company c. Details of a company’s future projects d. Any sort of financial data e. Proprietary Data Other Confidential Entities 47
  48. 48. 7. Credentials of any sort: Usernames and Passwords 8. Bank Account Number / Credit or Debit Card Details 9. API Keys 10. Server Details 11. Google Play keys / App Secrets / OAuth Tokens 12. Named Entities 13. Variable names / Class Names 14. Test Inputs 15. Sensitive Datasets where the users can be uniquely identified 16. Images and Diagrams with confidential information Other Confidential Entities 48
  49. 49. Dataset of 65,000 tasks ◆ 4 unique Task Categories: ○ Data Science & Analytics ○ IT & Networking ○ Web, Mobile & Software Dev ○ Writing ◆ 22 Unique Task Subcategories ○ Game Development ○ Data Visualization ○ Network and System Administration ○ Machine Learning ○ ... 49
  50. 50. Analysis of attributes of a task ◆ ~ 20 fields per task ○ Task ID ○ Task Title ○ Task Description ○ Task Category ○ Task Sub-Category ○ Date of creation of Task ○ Task Type ○ Task Workload ○ Task Duration ○ Task Budget ○ Preferred Location of the worker 50 ○ Preferred range of feedback score ○ Level of English Skills required ○ Payment range ○ Range of working hours per day ○ Requirement of Cover Letter ○ Requirement of Portfolio ○ Candidates Registered for the task ○ Skills required for the task
  51. 51. ◆ Crucial attributes: ○ Task Title ○ Task Description ◆ Observations: ○ 23% tasks contained links ○ 2.2% tasks had credentials ○ 2.46% tasks had GitHub links Analysis of Tasks 51
  52. 52. Outline ◆ Research Motivation ◆ Research Aim ◆ Crowdsourcing at the level of organizations ◆ Survey to understand confidentiality ◆ Unboxing a typical crowdsourcing task ◆ Understanding conversations between workers and task posters ◆ Protecting confidentiality loss in crowdsourcing ◆ Conclusion 52
  53. 53. Understanding Conversations between Workers and Task Posters ◆ ~ 214 lines per task (average) ◆ ~30 files exchanged [min: 0, max: 90] ◆ 42% tasks involved Skype calls ◆ 14% involved sharing data over cloud 53
  54. 54. 54 Analysis of a Task for Sensitivity
  55. 55. Pipeline 55 List of sensitive terms
  56. 56. Algorithm 56
  57. 57. 57
  58. 58. Testing Induced sensitivity in the 65,000 tasks dataset ◆ Curated a list of sensitive terms: ○ Names of companies ○ Email Addresses from the Enron Dataset ○ Dummy API Keys, Passwords, etc ○ List of URLs ○ List of countries and cities ○ Randomly generated usernames ○ ... 58
  59. 59. Performance Metrics 59 Precision Recall F1 Score 0.68 0.82 0.74
  60. 60. Real World Application ◆ Analyze ○ Tasks ○ Content before putting on cloud ○ Content before sharing 60
  61. 61. Conclusion ◆ Narrow down on the crucial stages in the crowdsourcing cycle ◆ Enumerate critical attributes in a task ◆ Highlight type of confidential data ◆ NLP and Rule based algorithm to detect confidentiality loss 61
  62. 62. Challenges, Limitations and Future Work ◆ Dataset availability ◆ Lack of labeled data ◆ Expand the algorithm ○ Cater to images, databases, etc ◆ Incorporate Machine Learning techniques for classification ◆ Sanitization ◆ More fine grained analysis 62
  63. 63. Acknowledgement ◆ Committee Members ◆ Abhinav and Sakshi, Accenture Labs Bangalore ◆ Indira, Gurpriya, Arpit, Shubham, Sonu, Divyansh ◆ Members of Precog family ◆ Family and friends 63
  64. 64. References ◆ http://www.timbroder.com/2015/01/my-2375-amazon-ec2-m istake.html ◆ https://www.topcoder.com/about-topcoder/customer-stories/ ◆ https://www.merriam-webster.com/dictionary/crowdsourcing ◆ https://www.figure-eight.com/ ◆ https://www.flaticon.com/authors/freepik 64
  65. 65. Thanks! simran13104@iiitd.ac.in @simran_s21 65

×