Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)

1,182 views

Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Be the first to comment

Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)

  1. 1. DIAGNOSING OPEN- SOURCE COMMUNITY HEALTH WITH SPARK William Benton Red Hat Emerging Technology willb@redhat.com
  2. 2. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 2 BACKGROUND
  3. 3. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 2 BACKGROUND
  4. 4. WHAT MAKES A GREAT OPEN-SOURCE COMMUNITY? 3
  5. 5. 4
  6. 6. QUICK FACTS ABOUT FEDORA • Seven architectures • ~17k packages (in Fedora 22) • Tens of thousands of contributors • Millions of users • Many moving parts in Fedora infrastructure 5
  7. 7. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 6
  8. 8. 7 fedmsg bus
  9. 9. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds
  10. 10. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot
  11. 11. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji
  12. 12. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji Apache mod_wsgi bodhi accounts ticketing package database …many others!
  13. 13. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji Apache mod_wsgi bodhi accounts ticketing package database …many others! IRC gateway CLI tools web gateway #fedora-fedmsg
  14. 14. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji Apache mod_wsgi bodhi accounts ticketing package database …many others! IRC gateway CLI tools web gateway notifications #fedora-fedmsg
  15. 15. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji Apache mod_wsgi bodhi accounts ticketing package database …many others! IRC gateway CLI tools web gateway notifications datanommer (warehouse) #fedora-fedmsg
  16. 16. 7 fedmsg bus package SCM fedmsg-hub ansible MediaWiki (via mod_php) COPR and secondary arch builds meetbot koji Apache mod_wsgi bodhi accounts ticketing package database …many others! IRC gateway CLI tools web gateway notifications Fedora badges datanommer (warehouse) #fedora-fedmsg
  17. 17. 8
  18. 18. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 9
  19. 19. MESSAGE SOURCES • Bulk historical data (PostgreSQL dump) • Specific historical records (REST/JSON) • Streaming messages (∅MQ) 10
  20. 20. 11 CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text ); CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text );
  21. 21. 11 CREATE TABLE i certificate signature topic _msg category source_name source_version msg_id ); CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text );
  22. 22. “I’LL JUST USE JSON!” 12 { "timestamp": "now", "you-have": [{"problems": 2}] }
  23. 23. JSON AND SCHEMA INFERENCE 13 { "animals" : { "duck" : { "mammal": false, "bill": true, "eggs": true }, "zebra" : { "mammal": true, "bill": false, "eggs": false }, "platypus" : { "mammal": true, "bill": true, "eggs": true } } }
  24. 24. JSON AND SCHEMA INFERENCE 13 { "animals" : [ {"k": "duck", "v": { "mammal": false, "bill": true, "eggs": true } }, {"k": "zebra", "v": { “mammal": true, "bill": false, "eggs": false }, /* ... */ ] }
  25. 25. DIVERGING SCHEMAS 14 { "branches" : ["f22", "master"] } { "branches" : [ { "name": "f22", "commit" : "05afffa1" } ] }
  26. 26. PREPROCESSING WITH JSON4S 15 // use json4s and partial functions // see the Silex library for an easy interface! def renameBranches(msg: JValue) = { msg transformField { case JField("branches", v@JArray(JString(_)::_)) => ("pkg_branches", v) case JField("branches", v@JArray(JObject(_)::_)) => ("git_branches", v) } }
  27. 27. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 16
  28. 28. MESSAGE COUNT DISTRIBUTION 17 100% 40% 60% 80% 10 103 107 message counts (log scale) userswithatmost 105
  29. 29. UNIQUE TOPIC DISTRIBUTION 18 100% 40% 60% 80% 5 10 15 20 25 unique topics userswithatmost
  30. 30. TOPIC SETS AS FEATURES • Idea: characterize users by the sets of unique message topics affecting them • Use ML Pipeline transformers to go from sets (modeled as arrays) to feature vectors 19
  31. 31. TOPIC SETS AS FEATURES 20 // expose this Spark-private DataType package org.apache.spark.hacks { type VectorType = org.apache.spark.mllib.linalg.VectorUDT }
  32. 32. TOPIC SETS AS FEATURES 21 trait SVParams extends Params { val inputCol = new Param[String](this, "inputCol", "...") val outputCol = new Param[String](this, "outputCol", "...") val vecSize = new IntParam(this, "vecSize", "...") // ... }
  33. 33. TOPIC SETS AS FEATURES 21 trait SVParams extends Params { // ... def pvals(pm: ParamMap) = ( // 1.3-specific paramMap.get(inputCol).getOrElse("topicSet"), paramMap.get(outputCol).getOrElse("features"), paramMap.get(vecSize).getOrElse(128) ) }
  34. 34. 22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT = new org.apache.spark.hacks.VectorType() def transformSchema(schema: StructType, params: ParamMap) = { val outc = paramMap.get(outputCol).getOrElse("features") StructType(schema.fields ++ Seq(StructField(outc, VT, true))) } def transform(df: DataFrame, params: ParamMap) = { val (inc, outc, vs) = pvals(paramMap) df.withColumn(outc, callUDF({ a: Seq[Int] => Vectors.sparse(vs, a.toArray, Array.fill(a.size)(1.0)) }, VT, df(inc))) } }
  35. 35. CLASSIFYING FEDORA USERS • Problem: can we identify Fedora packager sponsors by their message topic sets? • Logistic regression was easy to try • ~1% prediction error rate…but only ~10% of actual sponsors were predicted as such! 23
  36. 36. A BETTER APPROACH 24 Building a decision tree identified the following message topics as likely to predict that someone was a Fedora packager sponsor: • org.fedora.prod.fas.role.update • org.fedora.prod.fas.user.update
  37. 37. BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS 25
  38. 38. WHAT’S NEXT • More about fedmsg: http://fedmsg.com • Our Silex library includes helpers for JSON data ingest: http://silex.freevariable.com • More details about this project:
 http://chapeau.freevariable.com1 26 1 http://chapeau.freevariable.com/2015/06/summit-fedmsg.html
  39. 39. THANKS! 27

×