Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
4. For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
store.continuum.io/cshop/anaconda/
Downloads: Python
5. Let’s get started using Apache Spark, in just a few easy steps… Download code from:
databricks.com/spark-training-resources#itas
or for a fallback: spark.apache.org/downloads.html

Also, the GitHub project:
github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
6. Connect into the inflated “spark” directory, then run:
./bin/spark-shell
Downloads: Spark
10. Spark Deconstructed: Log Mining Example

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
11. Spark Deconstructed: Log Mining Example

scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)

At this point, take a look at the transformed RDD operator graph:
12. Spark Deconstructed: Log Mining Example
[diagram: the Driver coordinating three Workers]
13. Spark Deconstructed: Log Mining Example
[diagram: Driver and three Workers; the input split into HDFS blocks 1-3]
14. Spark Deconstructed: Log Mining Example
[diagram: same layout, Driver and three Workers with HDFS blocks 1-3]
15. Spark Deconstructed: Log Mining Example
[diagram: each Worker reads its HDFS block]
16. Spark Deconstructed: Log Mining Example
[diagram: each Worker processes its block and caches the data (cache 1-3)]
17. Spark Deconstructed: Log Mining Example
[diagram: Driver and Workers, with blocks 1-3 and caches 1-3 in place]
18. Spark Deconstructed: Log Mining Example
[diagram: Driver and Workers, blocks 1-3 and caches 1-3]
19. Spark Deconstructed: Log Mining Example
[diagram: for the second action, each Worker processes from its cache]
20. Spark Deconstructed: Log Mining Example
[diagram: Driver and Workers, with the cached data still resident]
23. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX
28. Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:
• ~2K msgs/month
• ~6 MB as JSON
• ~13 MB parsed
Three months’ list activity represents a graph of:
• 1061 senders
• 753,400 nodes
• 1,027,806 edges
A big graph! However, it satisfies the definition for a graph-parallel system: there is lots of data locality to leverage.
29. Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture
Martin Fowler
http://martinfowler.com/articles/microservices.html
30. Workflows: An Example…
Python-based service in a Docker container?
Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/
Docker Jumpstart
Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/
31. Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
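As a quick illustration of the Data Sources API described in that post, here is a minimal sketch; the "messages.avro" path is a hypothetical placeholder, and it assumes the spark-avro package is on the classpath:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// hypothetical example: expose an external Avro file as a temp table
// through the Data Sources API (needs the spark-avro package)
sqlCtx.sql("""
CREATE TEMPORARY TABLE messages
USING com.databricks.spark.avro
OPTIONS (path "messages.avro")
""")

// then query it like any other registered table
sqlCtx.sql("SELECT * FROM messages LIMIT 10").collect().foreach(println)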
32. This Workflow: Microservices meet Parallel Processing
[diagram: email archives → Scraper/Parser services → SparkSQL Data Prep → Features/Explore, using NLTK data and Unique Word IDs → TextRank, Word2Vec, etc. → community leaderboards and community insights]
not so big data… relatively big compute…
34. Workflows: Scraper pipeline
[diagram: urllib2 crawls the Apache email list archive, monthly list by date → Py service filters quoted content → Py service segments paragraphs → message JSON]

{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
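To pick that JSON output up on the Spark side, a minimal sketch, assuming one JSON object per line in a hypothetical "messages.json" file:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// infer a schema from the scraped messages and expose them as a table
val msg = sqlCtx.jsonFile("messages.json")
msg.registerTempTable("msg")

sqlCtx.sql("SELECT sender, subject FROM msg LIMIT 10").collect().foreach(println)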
41. TextRank impl: load parquet files

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
42. TextRank impl: use SparkSQL to collect node list + edge list

val sql = """
SELECT node_id, root
FROM node
WHERE id='%s' AND keep='1'
""".format(msg_id)

val n = sqlCtx.sql(sql.stripMargin).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
}
nodes.collect()

val sql = """
SELECT node0, node1
FROM edge
WHERE id='%s'
""".format(msg_id)

val e = sqlCtx.sql(sql.stripMargin).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
}
edges.collect()
43. TextRank impl: use GraphX to run PageRank

// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

val min_rank = median(r.map(_._2).collect())
44. TextRank impl: join ranked words with parsed text

var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
45. TextRank impl: “pull strings” for the top-ranked keyphrases

s.foreach { x =>
  //println(x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}

// flush any span still accumulating after the last row
if (span.size > 0) phrases += (span.mkString(" ") -> rank_sum)
46. TextRank impl: report the top keyphrases

// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
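A quick usage sketch, to eyeball the result in the shell:

// show the ten highest-ranked keyphrases
summary.take(10).foreach { case (phrase, rank) =>
  println(f"$rank%.4f\t$phrase")
}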
48. Reply Graph: load parquet files

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
49. Reply Graph: use SparkSQL to collect node list + edge list

val sql = "SELECT id, sender FROM node"
val n = sqlCtx.sql(sql).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
}
nodes.collect()

val sql = "SELECT replier, sender, num FROM edge"
val e = sqlCtx.sql(sql).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
}
edges.collect()
50. Reply Graph: use GraphX to run graph analytics

// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
52. Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
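For reference, the canonical single-source shortest paths formulation in the Pregel API, adapted from the GraphX programming guide; the linked sssp.scala may differ in detail, and the random input graph here is only for illustration:

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// build a random graph for illustration, with Double edge weights
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L  // an arbitrary source vertex

// initialize every vertex except the source to "infinite" distance
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program
  triplet => { // send a message only along edges that improve the distance
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // merge messages
)

println(sssp.vertices.collect.mkString("\n"))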
53. Look Ahead: Where is this heading?
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html

[diagram: ranked phrases → GraphX: run Con.Comp. → MLlib: run Word2Vec → aggregated by topic → MLlib: run KMeans → topic vectors; better than LDA?]
features… models… insights…
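A rough sketch of that pipeline in MLlib; the input path, tokenization, and k are hypothetical placeholders, not code from the talk:

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// hypothetical input: one whitespace-tokenized message per line
val corpus = sc.textFile("ranked_phrases.txt").map(_.split(" ").toSeq)

// learn word vectors with MLlib Word2Vec
val w2v = new Word2Vec().fit(corpus)

// cluster the vocabulary vectors into topic-like groups with KMeans
val vecs = sc.parallelize(w2v.getVectors.toSeq).map {
  case (word, arr) => Vectors.dense(arr.map(_.toDouble))
}
val model = KMeans.train(vecs, 10, 20)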
55. Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
certification:
56. MOOCs:
Anthony Joseph
UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar
UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
59. confs:
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
60. books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
61. presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do