When a company is hacked and sensitive user data is released without authorization, it's clear that information security has been breached. However, companies may be willingly yet unknowingly leaking information in non-obvious ways today, and the villain is also a hero of the big data era: data science. Applying data science to released data sets may uncover latent information that was never intended to be made public. In this webinar, we'll look at a few high-profile examples, and discuss what went wrong and how it could have been avoided. For two cases, we'll explore the problem and solution in more detail with technologies like Apache Spark: the New York City Taxi and Limousine Commission's release of data on taxi rides in the city, and Netflix's release of movie rating data as part of its $1M Netflix Prize.
Latanya Sweeney. Uniqueness of Simple Demographics in the U.S. Population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.
http://www.citeulike.org/user/burd/article/5822736
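val digits = '0' to '9'
val alpha = 'A' to 'Z'

// Every string of the form digit, letter, digit, digit
val nxnn = sc.parallelize(digits).flatMap(a =>
  for (b <- alpha; c <- digits; d <- digits) yield new String(Array(a, b, c, d)))

// Every two-letter prefix, shared by the two longer formats
val xx = for (a <- alpha; b <- alpha) yield Array(a, b)

// Every string of the form letter, letter, digit, digit, digit
val xxnnn = sc.parallelize(xx).flatMap(prefix =>
  for (c <- digits; d <- digits; e <- digits) yield new String(prefix ++ Array(c, d, e)))

// Every string of the form letter, letter, letter, digit, digit, digit
val xxxnnn = sc.parallelize(xx).flatMap(prefix =>
  for (c <- alpha; d <- digits; e <- digits; f <- digits) yield new String(prefix ++ Array(c, d, e, f)))

// Render a byte array as zero-padded, uppercase hexadecimal
def toHex(bytes: Array[Byte]) =
  bytes.map { b =>
    val u = b & 0xFF
    if (u < 16) "0" + u.toHexString else u.toHexString
  }.mkString.toUpperCase

// Hash every candidate and collect a map from MD5 hex digest back to the candidate
sc.union(nxnn, xxnnn, xxxnnn).mapPartitions { medallions =>
  // One MessageDigest per partition; digest() resets it after each call
  val md5 = java.security.MessageDigest.getInstance("MD5")
  medallions.map { medallion =>
    val bytes = medallion.getBytes(java.nio.charset.StandardCharsets.UTF_8)
    (toHex(md5.digest(bytes)), medallion)
  }
}.collectAsMap()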
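The three formats enumerated above cover 26,000 + 676,000 + 17,576,000 = 18,278,000 candidate strings in total, which is small enough to hash exhaustively. As a minimal sketch of how the resulting map might be used, the following plain-Scala example (no Spark) builds the same kind of table for the smallest format only and reverses one hash. The object name `MedallionLookupSketch`, the helper `md5Hex`, and the sample value `"5X55"` are hypothetical and chosen for illustration; the sketch assumes the hashes being reversed are uppercase hexadecimal MD5 digests, as `toHex` above produces.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

object MedallionLookupSketch {
  // Uppercase, zero-padded hex MD5 of a string (same output format as toHex above).
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(UTF_8))
      .map(b => f"${b & 0xFF}%02X")
      .mkString

  val digits = '0' to '9'
  val alpha = 'A' to 'Z'

  // All digit-letter-digit-digit strings: 10 * 26 * 10 * 10 = 26,000 candidates.
  val candidates: Seq[String] =
    for (a <- digits; b <- alpha; c <- digits; d <- digits)
      yield new String(Array(a, b, c, d))

  // Map from hash back to the original string.
  val lookup: Map[String, String] =
    candidates.map(m => md5Hex(m) -> m).toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical "anonymized" value, standing in for a hashed field in a released data set.
    val hashed = md5Hex("5X55")
    // Reversing the hash is a single map lookup.
    println(lookup.get(hashed)) // prints Some(5X55)
  }
}
```

Because the input space is this small, hashing alone does not hide the original values: anyone can precompute the full table once and then reverse every hashed value in the data set by lookup.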