Big Data: the weakest link
Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015

Published in: Engineering

  1. Big Data: the weakest link. Vivek Nair, Tim Menzies, {vivekaxl,tim.menzies}@gmail.com. HPCC Eng. Summit, Sept 29, 2015
  2. Where is the weakest link?
  3. Where is the weakest link?
  4. Where is the weakest link?
  5. Where is the weakest link?
  6. Where is the weakest link?
  7. Premise of Big Data
     Analysis is a “systems” task?
     • Better conclusions = same algorithms + more data + more CPU
     • If so, then …
       – No role for human error
       – All insight is auto-generated from CPUs
     Analysis is a “human” task?
     • Current results on “software analytics”
       – A human-intensive process
  8. Q: Is Big Data a “systems” or a “human” task? A: Yes
  9. Code used in my last paper (1100 LOC of Python calling scikit-learn)
  10. Use a higher-level language?
      • ECL solves this problem?
      • But if you can write it quick,
        – you can write it wrong, quick.
  11. Is this really a problem?
      • Q: What would we expect to see if…
        – Top experts, publishing in top journals
        – Many of the same data sets
        – 8 years of trying
      • A:
        – Perhaps some upward progress
        – Perhaps a little less variance
      So, what do we see?
  12. • Software analytics
        – Defect prediction
        – Many of the same learners
        – Many of the same data sets
      • 42 papers, top journals
      • 23 author groups
      • 2002 to 2010
      • Y-axis measures mean performance
      Source: Martin Shepperd, David Bowes, and Tracy Hall, "Researcher Bias: The Use of Machine Learning in Software Defect Prediction," IEEE Trans. on Soft. Eng., 40(6), June 2014
  13. http://fivethirtyeight.com/features/science-isnt-broken/
  14. A little theory
      • James D. Herbsleb, CMU
      • Socio-Technical Coordination
      • A predictor for higher defects:
        – Groups of programmers working on similar functions
        – but not sharing that expertise
  15. Q: How to find expertise groups within the HPCC community? A: Using data mining
  16. Static features and commit history can act as a cue for expertise
      ● Our motivation
        o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise
  17. Software analytics results: learn predictors for expertise
      ● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…” [1]
      ● “Quantify person's experience with a part of code using change history of the code” [2]
      ● “RevFinder, a file location-based code-reviewer recommendation approach” [3]
      ● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4]
      [1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010.
      [2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th International Conference on Software Engineering. ACM, 2002.
      [3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review." Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
      [4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts." Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.
  18. Q: And what data mining suite will we use to mine data about programmers?
      • A: Need you ask?
  19. Source Code
  20. But what are we clustering? Developer products
      • Lightweight parsing of source code
      • Developer profiles, accessed via social media sites
  21. Languages Used
  22. Skill Set (self promotion)
  23. Data processing
      1. GitHub repos (for code) ➔ social media (for years of work)
      2. Static code analysis: frequency counts of AST features (e.g. count loops, returns, var comparisons, map, etc.)
      3. Bayes classifier
      Early career; Later career
  24. Classification
      - Features: nodes of AST
      - Algorithms used: Simple CART, Random Forest, Naive Bayes, etc.
      - Can distinguish expert from novice programmers
        • precision = 78% early career
        • precision = 74% later career
      * Using Weka
  25. Current status
      The good news
      • Can auto-find groups of better programmers
      • Can do that for very large data sets
        – The ECL advantages
      The other news
      • Seeking larger data sets
      • Talking to HackerRank
      • Looking at ways to instrument the HPCC forums
        – Matchmaker tools
        – Affinity groups
  26. Where is the weakest link?
  27. Where is the weakest link?
  28. We can make that link stronger
  29. Acknowledgements: Thanks to funding from LexisNexis
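
The "static code analysis" step on the data-processing slide (frequency counts of AST features such as loops, returns, comparisons, and map) can be sketched as follows. This is only an illustration under stated assumptions: it assumes the code being mined is Python and uses Python's standard `ast` module, whereas the slides do not say which parser was used and report that classification was done in Weka. The feature names, the `ast_feature_counts` helper, and the sample snippet are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical feature set: the slides only name "loops, returns,
# var comparisons, map, etc." without giving exact definitions.
FEATURES = {
    "loops": (ast.For, ast.While),
    "returns": (ast.Return,),
    "comparisons": (ast.Compare,),
}


def ast_feature_counts(source: str) -> Counter:
    """Return frequency counts of selected AST node kinds in `source`."""
    counts = Counter({name: 0 for name in FEATURES})
    counts["map_calls"] = 0
    for node in ast.walk(ast.parse(source)):
        for name, node_types in FEATURES.items():
            if isinstance(node, node_types):
                counts[name] += 1
        # Count direct calls to a function named `map` as their own feature.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "map"
        ):
            counts["map_calls"] += 1
    return counts


# Made-up sample standing in for one developer's file.
SAMPLE = '''
def positives(xs):
    out = []
    for x in xs:
        if x > 0:
            out.append(x)
    return list(map(str, out))
'''

counts = ast_feature_counts(SAMPLE)
print(dict(counts))
```

Each file (or each developer, after summing over their files) then becomes one row of counts, which is the kind of input a Bayes classifier or the other learners named on the classification slide could be trained on.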
