
Mining python-software-pyconuk13


Published in: Technology

  1. Mining Python Software. Sarah Mount - @snim2
  2. What do you want to know today? What do we know about software?
     ● How to make it correct
     ● How long it will take to write
     ● Expected bugs per kloc
     Er … yeah.
  3. Health warning... This is a work in progress, so don’t take the numbers and charts too seriously just yet.
  4. Options for mining Python software
  5. <?xml version="1.0" encoding="UTF-8"?>
     <response>
       <status>success</status>
       <result>
         <project>
           <id>1</id>
           <name>Subversion</name>
           <created_at>2006-10-10T15:51:31Z</created_at>
           <updated_at>2007-08-22T17:31:17Z</updated_at>
           <homepage_url></homepage_url>
           <download_url> </download_url>
           <updated_at>2007-07-12T12:21:11Z</updated_at>
           <logged_at>2007-07-12T12:18:54Z</logged_at>
           <min_month>2001-08-01T00:00:00Z</min_month>
           <max_month>2007-07-01T00:00:00Z</max_month>
           ...
  6. {
       "repository":{
         "url":"",
         "has_downloads":false,
         "created_at":"2012/01/19 14:15:34 -0800",
         "has_issues":true,
         "description":"SPDY is an experiment with protocols for the web",
         "forks":10,
         "fork":false,
         "has_wiki":false,
         "homepage":"",
         "size":420,
         "private":false,
         "name":"spdy",
         "owner":"igrigorik",
         "open_issues":4,
         "watchers":206,
         "pushed_at":"2012/01/11 10:38:16 -0700",
         "language":"Ruby"
       },
       "created_at":"2012/02/11 10:38:16 -0700",
       "public":true,
       "actor":"igrigorik",
       "payload":{
         "head":"98f44cab69becb274c6f3b9035ef8e0bd7b2b1b7",
         "size":1,
         ... ],
         "ref":"refs/heads/master"
       },
       "url":"",
       "type":"PushEvent"
     }
  7. Google BigQuery interface
     /* top 100 repos for Ruby by number of pushes */
     SELECT repository_name, count(repository_name) as pushes,
            repository_description, repository_url
     FROM [githubarchive:github.timeline]
     WHERE type="PushEvent"
       AND repository_language="Ruby"
       AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
     GROUP BY repository_name, repository_description, repository_url
     ORDER BY pushes DESC
     LIMIT 100
  8. Some preliminary work
  9. Code clones
     Type 1: Identical code, copy & pasted
     Type 2: Identical code modulo names, layout, comments, etc.
     Type 3: Type 2 plus further modifications such as changes in statements
     Type 4: Different code, same semantics
     Roy & Cordy (2007)
  10. Sentiment (in comments)
  11. Some ideas for mining projects
  12. Mining ideas
      ● How do programming idioms develop and spread?
      ● How do projects reach a critical mass of developers and become “popular”?
      ● Are metrics like cyclomatic complexity, fan out and Halstead’s complexity measure useful, or are they all just proportional to kLOCs?
  13. Thank you.
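
Slide 6 shows a single event record and slide 7 counts pushes per repository with a query. The same count can be sketched locally in Python. This is only an illustration: it assumes the events arrive as one JSON object per line, the field names are taken from the record on slide 6, and the function name `count_pushes` and the sample records are hypothetical.

```python
import json
from collections import Counter


def count_pushes(lines, language):
    """Count PushEvents per repository name for one language.

    `lines` is any iterable of JSON strings shaped like the record on
    slide 6 (assumption: one event per line).
    """
    pushes = Counter()
    for line in lines:
        event = json.loads(line)
        repo = event.get("repository", {})
        if event.get("type") == "PushEvent" and repo.get("language") == language:
            pushes[repo.get("name")] += 1
    return pushes


# Hypothetical sample records, reduced to the fields the function reads.
sample = [
    json.dumps({"type": "PushEvent", "repository": {"name": "spdy", "language": "Ruby"}}),
    json.dumps({"type": "PushEvent", "repository": {"name": "spdy", "language": "Ruby"}}),
    json.dumps({"type": "WatchEvent", "repository": {"name": "spdy", "language": "Ruby"}}),
    json.dumps({"type": "PushEvent", "repository": {"name": "other", "language": "Python"}}),
]

# Equivalent in spirit to ORDER BY pushes DESC LIMIT 100 on slide 7.
top_ruby = count_pushes(sample, "Ruby").most_common(100)
```

Unlike the query on slide 7, this sketch does not filter by date or group by description and URL.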
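
The first two clone types on slide 9 can be approximated by comparing token streams. The sketch below is not the method used in the talk; it is one simple way to tell Type 1 from Type 2 using the standard `tokenize` module. The names `fingerprint` and `classify` are hypothetical. It maps every identifier to the same placeholder, which is looser than consistent renaming and also hides built-in names.

```python
import io
import keyword
import tokenize


def fingerprint(source):
    """Token sequence with comments, layout and identifier names removed."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines do not affect the fingerprint
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep the block structure but not the exact whitespace used.
            tokens.append(tokenize.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # every identifier becomes one placeholder
        else:
            tokens.append(tok.string)
    return tuple(tokens)


def classify(a, b):
    """Return "Type 1", "Type 2", or None when neither check matches."""
    if a == b:
        return "Type 1"  # identical text, as on slide 9
    if fingerprint(a) == fingerprint(b):
        return "Type 2"  # same code modulo names, layout and comments
    return None  # Type 3 and Type 4 need more than token comparison


original = "def area(w, h):\n    return w * h\n"
renamed = "def size(x, y):  # rectangle\n    return x * y\n"
changed = "def area(w, h):\n    return w + h\n"
```

Here `classify(original, renamed)` reports a Type 2 clone, while `classify(original, changed)` returns `None` because an operator differs.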
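
The last question on slide 12 asks whether complexity metrics just track code size. One way to start exploring it is to compute both numbers per function and compare them. The sketch below uses the standard `ast` module and a simplified cyclomatic complexity (one plus the number of decision points it recognises); it is an approximation, not a full implementation of the metric, and the names `complexity` and `measure` are hypothetical. It needs Python 3.8 or later for `end_lineno`.

```python
import ast

# Node types counted as one decision point each (a deliberately small set).
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def complexity(func):
    """Approximate cyclomatic complexity of one function node."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCHES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score


def measure(source):
    """Return (name, lines, complexity) for every function in the source."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines = node.end_lineno - node.lineno + 1
            rows.append((node.name, lines, complexity(node)))
    return rows


example = '''
def flat(x):
    y = x + 1
    return y

def branchy(x):
    if x > 0 and x < 10:
        return "small"
    for i in range(x):
        if i % 2:
            return i
    return None
'''

rows = measure(example)
```

Nested functions are counted both on their own and as part of the enclosing function, so a real study would need to decide how to treat them. Collecting these rows across many projects would give the data for checking how closely complexity follows line count.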