Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development: fetch tickets from the Django Trac ticket-tracking system and try to predict how long each ticket will take to close.

The facts: developers aren't putting the RIGHT information into their tracking systems :)

Naive application of Machine Learning to Software Development

  1. Naive application of Machine Learning to Software Development
  2. Naive application of Machine Learning to Software Development, or... what developers don't tell :)
  3. What and why. 42 Coffee Cups: completely distributed development team
  4. What and why. 42 Coffee Cups: completely distributed development team. Hard facts about how software is done
  5. What and why. 42 Coffee Cups: completely distributed development team. Hard facts about how software is done. LOTS OF THEM
  6. What and why. Facts
  7. What and why. Facts, Profit
  8. What and why. Facts ??? Profit
  9. What and why. ??? Toy problem: get a ticket and predict how long it will take to close it
  10. What and why. ??? Toy problem: get a ticket and predict how long it will take to close it. Bonus: learn scikit-learn :)
  11. Install scikit-learn
      ● sudo apt-get install python-dev
  12. Install scikit-learn
      ● sudo apt-get install python-dev python-numpy python-numpy-dev
  13. Install scikit-learn
      ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy
  14. Install scikit-learn
      ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
  15. Install scikit-learn
      ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
      ● pip install -U scikit-learn
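      [Note, not on the slides: a quick sanity check to confirm the install worked before moving on.]

          # sanity check: all of these should import without errors
          import numpy
          import scipy
          import sklearn
          print(sklearn.__version__)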
  16. Data: closed tickets

      import urllib2

      url = ('https://code.djangoproject.com/query?format=csv'
             '&col=id&col=time&col=changetime&col=reporter'
             '&col=summary&col=status&col=owner&col=type'
             '&col=component&order=priority')
      tickets = urllib2.urlopen(url).read()
      open('2012-10-09.csv', 'w').write(tickets)
  17. Data: closed tickets

      id,time,changetime,reporter,summary,status,owner,type,component
      1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
      2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
  18. Data: closed date and description

      import urllib2
      from bs4 import BeautifulSoup

      def get_data(ticket):
          url = 'https://code.djangoproject.com/ticket/%s' % ticket
          ticket_html = urllib2.urlopen(url)
          bs = BeautifulSoup(ticket_html)
  19. Data: closed date and description

      # get closing date
      d = bs.find_all('div', 'date')[0]
      p = list(d.children)[3]
      href = p.find('a')['href']
      close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
      close_time = datetime.datetime.strptime(
          close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
      # ... more black magic, see code
  20. Data: closed date and description

      def get_data(ticket):
          [...]
          # get description and return
          de = bs.find_all('div', 'description')[0]
          return close_time, de.text
  21. Data: closed date and description

      import csv

      tickets_file = csv.reader(open('2012-10-09.csv'))
      output = csv.writer(open('2012-10-09.close.csv', 'w'))
      for (id, time, changetime, reporter, summary, status,
           owner, type, component) in tickets_file:
          closetime, descr = get_data(id)
          row = [id, time, changetime, closetime, reporter, summary,
                 status, owner, type, component, descr.encode('utf-8')]
          output.writerow(row)
  22. Scoring: Train/Test set split

      from sklearn import cross_validation

      (tickets_train, tickets_test,
       times_train, times_test) = cross_validation.train_test_split(
          tickets, times, test_size=0.2, random_state=0)
  23. Scoring: Mean squared error (sklearn.metrics.mean_squared_error)

      train_error = metrics.mean_squared_error(
          times_train, times_train_predict)
      test_error = metrics.mean_squared_error(
          times_test, times_test_predict)
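      [Note, not on the slides: a tiny worked example of mean_squared_error with made-up numbers, just to show what the score measures.]

          from sklearn import metrics

          y_true = [3.0, -0.5, 2.0, 7.0]
          y_pred = [2.5, 0.0, 2.0, 8.0]
          # ((0.5)**2 + (0.5)**2 + 0**2 + 1**2) / 4 = 0.375
          print(metrics.mean_squared_error(y_true, y_pred))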
  24. Fun #1: just ticket number?

      for number, created, ... in tickets_file:
          row = []
          created = dt.datetime.strptime(created, time_format)
          closetime = dt.datetime.strptime(closetime, time_format)
          time_to_fix = closetime - created
          row.append(float(number))
          tickets.append(row)
          times.append(total_seconds(time_to_fix))
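      [Note, not on the slides: the total_seconds() helper used above isn't shown; timedelta.total_seconds() only appeared in Python 2.7, so a hand-rolled version presumably looked roughly like this sketch.]

          def total_seconds(td):
              # timedelta -> seconds as a float; equivalent to
              # td.total_seconds() on Python >= 2.7
              return td.days * 86400 + td.seconds + td.microseconds / 1e6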
  25. Fun #1: just ticket number?

      import numpy as np
      from sklearn import preprocessing

      scaler = preprocessing.Scaler().fit(np.array(tickets))
      tickets = scaler.transform(tickets)
  26. Fun #1: just ticket number?

      from sklearn.svm import SVR

      clf = SVR()
      clf.fit(tickets_train, times_train)
      times_train_predict = clf.predict(tickets_train)
      times_test_predict = clf.predict(tickets_test)
  27. Fun #1: just ticket number?

      train_error = metrics.mean_squared_error(times_train, times_train_predict)
      test_error = metrics.mean_squared_error(times_test, times_test_predict)
      print 'Train error: %.1f\n Test error: %.2f' % (
          math.sqrt(train_error) / (24 * 3600),
          math.sqrt(test_error) / (24 * 3600))
      # ... in days
  28. Fun #1: just ticket number?
      Train error: 363.4
      Test error: 361.41
  29. Finding best parameters. SVM C controls regularization: larger C leads to
      ● closer fit to the train data
      ● with the risk of overfitting
  30. Finding best parameters

      Cs = np.logspace(-1, 10, 10)
      for c in Cs:
          learn(c)
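      [Note, not on the slides: learn() is never shown in full; a minimal sketch of what it plausibly does at this point, reusing the train/test split and the RMSE-in-days reporting from the earlier slides. Names and structure here are assumptions; slide 44 later shows a GridSearchCV-based variant.]

          import math
          from sklearn import metrics
          from sklearn.svm import SVR

          def learn(c):
              clf = SVR(C=c)
              clf.fit(tickets_train, times_train)
              train_error = metrics.mean_squared_error(
                  times_train, clf.predict(tickets_train))
              test_error = metrics.mean_squared_error(
                  times_test, clf.predict(tickets_test))
              # report root-mean-squared error in days
              print('%s: Train error: %.1f Test error: %.2f' % (
                  c,
                  math.sqrt(train_error) / (24 * 3600),
                  math.sqrt(test_error) / (24 * 3600)))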
  31. Finding best parameters
      0.1:            Train error: 363.4  Test error: 361.41
      1.71:           Train error: 363.4  Test error: 361.41
      27.8:           Train error: 363.4  Test error: 361.39
      464.2:          Train error: 363.2  Test error: 361.17
      7742.6:         Train error: 362.5  Test error: 360.41
      129155.0:       Train error: 362.1  Test error: 360.00
      2154434.7:      Train error: 362.0  Test error: 359.82
      35938136.6:     Train error: 361.7  Test error: 359.60
      599484250.3:    Train error: 361.5  Test error: 359.36
      10000000000.0:  Train error: 361.1  Test error: 358.91
  32. Finding best parameters: sklearn.grid_search.GridSearchCV (bonus: it can run in parallel)

      clf = GridSearchCV(estimator=SVR(),
                         param_grid=dict(C=np.logspace(-1, 10, 10)),
                         n_jobs=-1)
      clf.fit(tickets_train, times_train)
  33. Finding best parameters: sklearn.grid_search.GridSearchCV (bonus: it can run in parallel)

      clf = GridSearchCV(estimator=SVR(),
                         param_grid=dict(C=np.logspace(-1, 10, 10)),
                         n_jobs=-1)
      clf.fit(tickets_train, times_train)

      Train error: 361.1  Test error: 358.91
      Best C: 1.0e+10
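      [Note, not on the slides: after fit(), GridSearchCV keeps the winning configuration; the "Best C" line was presumably read from attributes like these (names as in current scikit-learn, which may differ slightly from the 2012 version).]

          print(clf.best_params_['C'])      # the C reported as "Best C"
          best_model = clf.best_estimator_  # estimator refit with the best parameters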
  34. Fun #2: creation date?

      row = []
      row.append(float(number))
      row.append(float(time.mktime(created.timetuple())))
      tickets.append(row)
  35. Fun #2: creation date?
      Train error: 360.6  Test error: 358.39
      Best C: 1.0e+10
  36. String vectorizer and Tfidf transform

      from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  37. String vectorizer and Tfidf transform

      reporters = []
      for number, ... in tickets_file:
          [...]
          reporters.append(reporter)
  38. String vectorizer and Tfidf transform

      CountVectorizer().fit_transform(reporters)
        -> TfidfTransformer().fit_transform( … )
        -> hstack((tickets, …))

      Note: the TF-IDF matrix is sparse!
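      [Note, not on the slides: a toy illustration of the chain above; the strings are made up, not the ticket data.]

          from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

          docs = ['adrian', 'jacob', 'anonymous', 'adrian']   # e.g. reporter names
          counts = CountVectorizer().fit_transform(docs)      # sparse term-count matrix
          tfidf = TfidfTransformer().fit_transform(counts)    # sparse TF-IDF matrix
          print(tfidf.shape)   # (4, 3): one row per string, one column per distinct token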
  39. String vectorizer and Tfidf transform

      import scipy.sparse as sp

      tickets = sp.hstack((
          tickets,
          TfidfTransformer().fit_transform(
              CountVectorizer().fit_transform(reporters))))

      # remember to re-scale!
      scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
      tickets = scaler.transform(tickets)
  40. Fun #3: reporter
      Train error: 338.7  Test error: 353.38
      Best C: 1.8e+07
  41. Fun #3: subject

      subjects = []
      for number, created, ... in tickets_file:
          [...]
          subjects.append(summary)
      [...]
      tickets = sp.hstack((tickets,
          TfidfTransformer().fit_transform(
              CountVectorizer(ngram_range=(1, 3)).fit_transform(subjects))))
  42. Fun #3: subject
      Train error: 21.0  Test error: XXXX
      Best C: 1.0e+10
  43. Fun #3: subject
      Train error: 21.0  Test error: 331.79
      Best C: 1.0e+10
  44. Different SVM kernels

      def learn(kernel='rbf', param_grid=None, verbose=False):
          [...]
          clf = GridSearchCV(
              estimator=SVR(kernel=kernel, verbose=verbose),
              param_grid=param_grid,
              n_jobs=-1)
          [...]
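      [Note, not on the slides: given this signature, the two result sets on the next slide were presumably produced by calls along these lines; the parameter grids are a guess.]

          learn(kernel='rbf', param_grid=dict(C=np.logspace(-1, 10, 10)))
          learn(kernel='linear', param_grid=dict(C=np.logspace(-1, 10, 10)))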
  45. Different SVM kernels
      RBF:    Train error: 21.0   Test error: 331.79  Best C: 1.0e+10
      Linear: Train error: 343.1  Test error: 355.56  Best C: 1.0e+02
  46. Fun #5: account for the Component

      components = []
      for number, ..., component, ... in tickets_file:
          [...]
          components.append(component)
      [...]
      tickets = sp.hstack((tickets,
          TfidfTransformer().fit_transform(
              CountVectorizer().fit_transform(components))))
  47. Fun #5: account for the Component
      RBF:    Train error: 18.9   Test error: 327.79  Best C: 1.0e+10
      Linear: Train error: 342.2  Test error: 354.89  Best C: 1.0e+02
  48. Fun #6: ticket Description

      descriptions = []
      for number, ..., description in tickets_file:
          [...]
          descriptions.append(description)
      [...]
      tickets = sp.hstack((tickets,
          TfidfTransformer().fit_transform(
              CountVectorizer(ngram_range=(1, 3)).fit_transform(
                  descriptions))))
  49. Fun #6: ticket Description
      RBF:    Train error: 10.8  Test error: 328.44  Best C: 1.0e+10
      Linear: Train error: 14.0  Test error: 331.52  Best C: 3.2e+03
  50. Conclusions
      ● All steps of a simple machine learning algo
  51. Conclusions
      ● All steps of a simple machine learning algo
      ● scikit-learn
  52. Conclusions
      ● All steps of a simple machine learning algo
      ● scikit-learn
      ● the data explicitly available in tickets is NOT ENOUGH to predict the closing date
  53. Developers, what are you hiding? :)
  54. Questions?
      Source code and dataset available at https://github.com/42/django-trac-learning.git
      Contacts:
      ● @akhavr
      ● http://42coffeecups.com/
