Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development: get tickets from Django trac ticket tracking system and try to predict how long it will take to close the ticket.

The fact is that developers aren't putting the RIGHT information into their tracking systems :)

Published in: Technology

Transcript

  • 1. Naive application of Machine Learning to Software Development
  • 2. Naive application of Machine Learning to Software Development
        or... what developers don't tell :)
  • 3. What and why
        42 Coffee Cups: completely distributed development team
  • 4. What and why
        42 Coffee Cups: completely distributed development team
        Hard facts about how software is done
  • 5. What and why
        42 Coffee Cups: completely distributed development team
        Hard facts about how software is done
        LOTS OF THEM
  • 6. What and why
        Facts
  • 7. What and why
        Facts
        Profit
  • 8. What and why
        Facts
        ???
        Profit
  • 9. What and why
        ???
        Toy problem: get ticket and predict how long it will take to close it
  • 10. What and why
        ???
        Toy problem: get ticket and predict how long it will take to close it
        Bonus: learn scikit-learn :)
  • 11. Install scikit-learn
        ● sudo apt-get install python-dev
  • 12. Install scikit-learn
        ● sudo apt-get install python-dev python-numpy python-numpy-dev
  • 13. Install scikit-learn
        ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy
  • 14. Install scikit-learn
        ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
  • 15. Install scikit-learn
        ● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
        ● pip install -U scikit-learn
  • 16. Data: closed tickets
        import urllib2
        url = ('https://code.djangoproject.com/query?format=csv'
               + '&col=id&col=time&col=changetime&col=reporter'
               + '&col=summary&col=status&col=owner&col=type'
               + '&col=component&order=priority')
        tickets = urllib2.urlopen(url).read()
        open('2012-10-09.csv', 'w').write(tickets)
  • 17. Data: closed tickets
        id,time,changetime,reporter,summary,status,owner,type,component
        1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
        2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
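The export shown above is ordinary CSV, so the standard `csv` module is enough to load it. A minimal sketch in Python 3 (the slides themselves use Python 2), with the header and the two sample rows inlined as a string instead of being read from `2012-10-09.csv`; the helper name `load_tickets` is made up for illustration:

```python
import csv
import io

# Header and the two rows shown on the slide, inlined for illustration.
SAMPLE = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""


def load_tickets(text):
    """Return one dict per ticket, keyed by the CSV header names."""
    return list(csv.DictReader(io.StringIO(text)))


tickets = load_tickets(SAMPLE)
for t in tickets:
    print(t["id"], t["type"], t["component"])
```

Reading rows as dicts keeps later feature extraction readable (`t["reporter"]`, `t["summary"]`) and avoids unpacking nine positional columns.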
  • 18. Data: closed date and description
        def get_data(ticket):
            url = 'https://code.djangoproject.com/ticket/%s' % ticket
            ticket_html = urllib2.urlopen(url)
            bs = BeautifulSoup(ticket_html)
  • 19. Data: closed date and description
        # get closing date
        d = bs.find_all('div', 'date')[0]
        p = list(d.children)[3]
        href = p.find('a')['href']
        close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
        close_time = datetime.datetime.strptime(close_time_str[:-6],
                                                '%Y-%m-%dT%H:%M:%S')
        # ... more black magic, see code
  • 20. Data: closed date and description
        def get_data(ticket):
            [...]
            # get description and return
            de = bs.find_all('div', 'description')[0]
            return close_time, de.text
  • 21. Data: closed date and description
        tickets_file = csv.reader(open('2012-10-09.csv'))
        output = csv.writer(open('2012-10-09.close.csv', 'w'))
        for id, time, changetime, reporter, summary, status, owner, type, component in tickets_file:
            closetime, descr = get_data(id)
            row = [id, time, changetime, closetime, reporter, summary,
                   status, owner, type, component, descr.encode('utf-8')]
            output.writerow(row)
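The odd-looking key `/timeline?from` on slide 19 comes from calling `parse_qs` on the whole href rather than on its query string, so the path stays glued to the first parameter name. A small Python 3 sketch of both variants using `urllib.parse`; the href value is an assumed example shaped like a Trac timeline link, not one copied from the Django tracker:

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Hypothetical href, shaped like a Trac timeline link (an assumption).
href = "/timeline?from=2012-05-20T08%3A12%3A37-05%3A00&precision=second"

# Parsing the whole href, as on the slide, leaves the path in the key.
raw = parse_qs(href)
assert "/timeline?from" in raw

# Splitting off the query string first gives the plain "from" key.
query = parse_qs(urlsplit(href).query)
close_time_str = query["from"][0]

# Drop the trailing UTC offset (six characters such as "-05:00") before
# parsing, which is what close_time_str[:-6] does on the slide.
close_time = datetime.strptime(close_time_str[:-6], "%Y-%m-%dT%H:%M:%S")
print(close_time)
```

Note that cutting off the offset discards the time zone, so the resulting timestamps are only comparable if the other timestamps they are combined with use the same zone.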
  • 22. Scoring: Train/Test set split
        cross_validation.train_test_split
        (tickets_train, tickets_test, times_train, times_test) = cross_validation.train_test_split(
            tickets, times, test_size=0.2, random_state=0)
  • 23. Scoring: Mean squared error
        sklearn.metrics.mean_squared_error
        train_error = metrics.mean_squared_error(times_train, times_train_predict)
        test_error = metrics.mean_squared_error(times_test, times_test_predict)
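`sklearn.cross_validation` was the module name when these slides were written; in current scikit-learn releases `train_test_split` lives in `sklearn.model_selection`. A sketch of the same split and metric with the newer import path, run on made-up numbers and scored against a trivial mean-value baseline (the baseline is an addition for illustration, not something from the slides):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: ten one-feature "tickets" and their times to fix in seconds.
tickets = [[float(i)] for i in range(10)]
times = [86400.0 * (i + 1) for i in range(10)]

# Hold out 20% of the tickets for testing; random_state fixes the shuffle.
tickets_train, tickets_test, times_train, times_test = train_test_split(
    tickets, times, test_size=0.2, random_state=0)

# Trivial baseline: always predict the mean of the training targets.
baseline = sum(times_train) / len(times_train)
test_error = mean_squared_error(times_test, [baseline] * len(times_test))
print(len(tickets_train), len(tickets_test), test_error)
```

Comparing a model against such a baseline is a quick way to see whether the features add anything beyond the average closing time.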
  • 24. Fun #1: just ticket number?
        for number, created, ... in tickets_file:
            row = []
            created = dt.datetime.strptime(created, time_format)
            closetime = dt.datetime.strptime(closetime, time_format)
            time_to_fix = closetime - created
            row.append(float(number))
            tickets.append(row)
            times.append(total_seconds(time_to_fix))
  • 25. Fun #1: just ticket number?
        import numpy as np
        from sklearn import preprocessing
        scaler = preprocessing.Scaler().fit(np.array(tickets))
        tickets = scaler.transform(tickets)
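Scaling here means standardization: each feature column is shifted to zero mean and divided by its standard deviation, so that the SVR sees features on comparable scales. A stdlib-only sketch of that arithmetic, not scikit-learn's implementation; as far as I know, `preprocessing.Scaler` appears under the name `StandardScaler` in later scikit-learn releases. The function name `standardize` is made up for illustration:

```python
from statistics import fmean, pstdev


def standardize(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


ticket_numbers = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(ticket_numbers)
print(scaled)
```

Without this step a raw ticket number in the tens of thousands would dominate any feature that lives in the range 0 to 1, such as a TF-IDF weight.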
  • 26. Fun #1: just ticket number?
        clf = SVR()
        clf.fit(tickets_train, times_train)
        times_train_predict = clf.predict(tickets_train)
        times_test_predict = clf.predict(tickets_test)
  • 27. Fun #1: just ticket number?
        train_error = metrics.mean_squared_error(times_train, times_train_predict)
        test_error = metrics.mean_squared_error(times_test, times_test_predict)
        print 'Train error: %.1f\n Test error: %.2f' % (
            math.sqrt(train_error)/(24*3600),
            math.sqrt(test_error)/(24*3600))  # .. in days
  • 28. Fun #1: just ticket number?
        Train error: 363.4
        Test error: 361.41
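The numbers printed above are root mean squared errors converted from seconds to days, so 363.4 means the prediction is off by roughly a year. A stdlib-only Python 3 sketch of the two pieces of arithmetic involved: the target (seconds between creation and closing) and the error in days. `TIME_FORMAT` is an assumption based on the timestamps in the CSV sample, the slides do not show `time_format`, and both function names are made up for illustration:

```python
import math
from datetime import datetime

# Assumed to match the "2005-07-13 12:03:27" style timestamps in the CSV.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_to_fix(created, closed):
    """Seconds between ticket creation and closing."""
    delta = (datetime.strptime(closed, TIME_FORMAT)
             - datetime.strptime(created, TIME_FORMAT))
    return delta.total_seconds()


def rmse_days(actual, predicted):
    """Root mean squared error, converted from seconds to days."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (24 * 3600)


# Hypothetical ticket: opened and closed exactly ten days apart.
actual = [seconds_to_fix("2012-01-01 00:00:00", "2012-01-11 00:00:00")]
predicted = [actual[0] - 3 * 24 * 3600]  # a prediction three days too short
print(rmse_days(actual, predicted))
```

Taking the square root before dividing matters: the mean squared error is in seconds squared, so only its root can be converted to days.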
  • 29. Finding best parameters
        SVM C controls regularization: larger C leads to
        ● closer fit to the train data
        ● with the risk of overfitting
  • 30. Finding best parameters
        Cs = np.logspace(-1, 10, 10)
        for c in Cs:
            learn(c)
  • 31. Finding best parameters
        0.1: Train error: 363.4 Test error: 361.41
        1.71: Train error: 363.4 Test error: 361.41
        27.8: Train error: 363.4 Test error: 361.39
        464.2: Train error: 363.2 Test error: 361.17
        7742.6: Train error: 362.5 Test error: 360.41
        129155.0: Train error: 362.1 Test error: 360.00
        2154434.7: Train error: 362.0 Test error: 359.82
        35938136.6: Train error: 361.7 Test error: 359.60
        599484250.3: Train error: 361.5 Test error: 359.36
        10000000000.0: Train error: 361.1 Test error: 358.91
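The candidate values of C are spaced evenly on a logarithmic scale: ten points from 10^-1 to 10^10, each about 16.7 times larger than the previous one. A stdlib-only sketch of what `np.logspace(-1, 10, 10)` computes; this reimplementation is for illustration and is not NumPy's code:

```python
def logspace(start, stop, num):
    """Return num values evenly spaced on a log10 scale, 10**start to 10**stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


Cs = logspace(-1, 10, 10)
for c in Cs:
    print("%.1f" % c)
```

A logarithmic grid is the usual choice for a regularization parameter because its useful range spans many orders of magnitude, as the table above shows: the error barely moves until C gets very large.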
  • 32. Finding best parameters
        sklearn.grid_search.GridSearchCV
        bonus: it can run in parallel
        clf = GridSearchCV(estimator=SVR(),
                           param_grid=dict(C=np.logspace(-1, 10, 10)),
                           n_jobs=-1)
        clf.fit(tickets_train, times_train)
  • 33. Finding best parameters
        sklearn.grid_search.GridSearchCV
        bonus: it can run in parallel
        clf = GridSearchCV(estimator=SVR(),
                           param_grid=dict(C=np.logspace(-1, 10, 10)),
                           n_jobs=-1)
        clf.fit(tickets_train, times_train)
        Train error: 361.1 Test error: 358.91
        Best C: 1.0e+10
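In current scikit-learn releases `GridSearchCV` is imported from `sklearn.model_selection` rather than `sklearn.grid_search`. A sketch of the same search with the newer import path, on a synthetic regression problem; the data, the smaller grid, `cv=3`, and `n_jobs=1` are all choices made for this illustration and do not come from the slides:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic one-feature regression problem (made up for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

# A smaller grid than on the slides, to keep the run short.
param_grid = {"C": np.logspace(-1, 3, 5)}

# n_jobs=1 runs serially; the slides use n_jobs=-1 to use all CPU cores.
search = GridSearchCV(estimator=SVR(), param_grid=param_grid, cv=3, n_jobs=1)
search.fit(X, y)
print(search.best_params_["C"])
```

After fitting, `search.best_estimator_` holds an SVR refitted on the full training data with the winning C, so it can be used directly for `predict`.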
  • 34. Fun #2: creation date?
        row = []
        row.append(float(number))
        row.append(float(time.mktime(created.timetuple())))
        tickets.append(row)
  • 35. Fun #2: creation date?
        Train error: 360.6 Test error: 358.39
        Best C: 1.0e+10
  • 36. String vectorizer and Tfidf transform
        from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
  • 37. String vectorizer and Tfidf transform
        reporters = []
        for number, ... in tickets_file:
            [...]
            reporters.append(reporter)
  • 38. String vectorizer and Tfidf transform
        CountVectorizer().fit_transform(reporters)
        -> TfidfTransformer().fit_transform( … )
        -> hstack((tickets, …))
        note: TF-IDF matrix is sparse!
  • 39. String vectorizer and Tfidf transform
        import scipy.sparse as sp
        tickets = sp.hstack((
            tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer().fit_transform(reporters))))
        # remember to re-scale!
        scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
        tickets = scaler.transform(tickets)
  • 40. Fun #3: reporter
        Train error: 338.7 Test error: 353.38
        Best C: 1.8e+07
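The chain on slides 38 and 39 turns each reporter name into a sparse row of TF-IDF weights and glues those columns onto the numeric features. A sketch of the same steps against a recent scikit-learn, assuming `StandardScaler` as the current name for `preprocessing.Scaler`; the four tickets and reporter names are made up:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Made-up numeric feature (ticket number) and reporter names.
numeric = np.array([[1.0], [2.0], [3.0], [4.0]])
reporters = ["adrian", "anonymous", "adrian", "jacob"]

counts = CountVectorizer().fit_transform(reporters)  # sparse term counts
tfidf = TfidfTransformer().fit_transform(counts)     # sparse TF-IDF weights

# Stack the numeric column next to the sparse text columns.
features = sp.hstack((sp.csr_matrix(numeric), tfidf)).tocsr()

# with_mean=False: centering would turn the sparse matrix dense.
features = StandardScaler(with_mean=False).fit_transform(features)
print(features.shape)
```

Each distinct reporter becomes its own column, so on real data the matrix gets wide quickly; keeping it sparse is what makes that affordable.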
  • 41. Fun #3: subject
        subjects = []
        for number, created, ... in tickets_file:
            [...]
            subjects.append(summary)
        [...]
        tickets = sp.hstack((
            tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer(ngram_range=(1,3)).fit_transform(subjects))))
  • 42. Fun #3: subject
        Train error: 21.0 Test error: XXXX
        Best C: 1.0e+10
  • 43. Fun #3: subject
        Train error: 21.0 Test error: 331.79
        Best C: 1.0e+10
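`ngram_range=(1,3)` asks the vectorizer for every run of one, two, or three consecutive words, not just single words. A stdlib-only sketch of that idea with a plain whitespace tokenizer, which is a simplification and not `CountVectorizer`'s own tokenization; the function name `word_ngrams` is made up for illustration:

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of text with min_n <= n <= max_n."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


grams = word_ngrams("Calendar popup closes")
print(grams)
```

Three words already yield six features, and many two- and three-word phrases occur in only one ticket summary. That may be part of why the train error collapses to 21.0 while the test error stays above 330: the model can memorize individual tickets.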
  • 44. Different SVM kernels
        def learn(kernel='rbf', param_grid=None, verbose=False):
            [...]
            clf = GridSearchCV(
                estimator=SVR(kernel=kernel, verbose=verbose),
                param_grid=param_grid, n_jobs=-1)
            [...]
  • 45. Different SVM kernels
        RBF
        Train error: 21.0 Test error: 331.79
        Best C: 1.0e+10
        Linear
        Train error: 343.1 Test error: 355.56
        Best C: 1.0e+02
  • 46. Fun #5: account for the Component
        components = []
        for number, ... component, ... in tickets_file:
            [...]
            components.append(component)
        [...]
        tickets = sp.hstack((
            tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer().fit_transform(components))))
  • 47. Fun #5: account for the Component
        RBF
        Train error: 18.9 Test error: 327.79
        Best C: 1.0e+10
        Linear
        Train error: 342.2 Test error: 354.89
        Best C: 1.0e+02
  • 48. Fun #6: ticket Description
        descriptions = []
        for number, ... description in tickets_file:
            [...]
            descriptions.append(description)
        [...]
        tickets = sp.hstack((
            tickets,
            TfidfTransformer().fit_transform(
                CountVectorizer(ngram_range=(1,3)).fit_transform(descriptions))))
  • 49. Fun #6: ticket Description
        RBF
        Train error: 10.8 Test error: 328.44
        Best C: 1.0e+10
        Linear
        Train error: 14.0 Test error: 331.52
        Best C: 3.2e+03
  • 50. Conclusions
        ● All steps of a simple machine learning algo
  • 51. Conclusions
        ● All steps of a simple machine learning algo
        ● scikit-learn
  • 52. Conclusions
        ● All steps of a simple machine learning algo
        ● scikit-learn
        ● data explicitly available in tickets is NOT ENOUGH to predict closing date
  • 53. Developers, what are you hiding? :)
  • 54. Questions?
        Source code and dataset available at https://github.com/42/django-trac-learning.git
        Contacts:
        ● @akhavr
        ● http://42coffeecups.com/
