Microservices, Containers, and Machine Learning
Paco Nathan, @pacoid

Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.

1. Microservices, Containers, and Machine Learning
Paco Nathan, @pacoid

2. Downloads

3. Downloads: Java JDK
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
• follow the license agreement instructions
• then click the download for your OS
• need JDK instead of JRE (for Maven, etc.)
• JDK 6, 7, or 8 is fine

4. Downloads: Python
For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
store.continuum.io/cshop/anaconda/

5. Downloads: Spark
Let’s get started using Apache Spark, in just a few easy steps…

Download code from: databricks.com/spark-training-resources#itas
or for a fallback: spark.apache.org/downloads.html

Also, the GitHub project:
github.com/ceteri/spark-exercises/tree/master/exsto

6. Downloads: Spark
Connect into the inflated “spark” directory, then run:

./bin/spark-shell

7. Spark Deconstructed

8. Spark Deconstructed: Log Mining Example

// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()

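Note: the filter and map calls above are lazy transformations; nothing executes on the cluster until the first count() action forces evaluation, which is what the following slides illustrate.
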
9. Spark Deconstructed: Log Mining Example
[diagram: Driver connected to three Workers]
We start with Spark running on a cluster… submitting code to be evaluated on it:

10. Spark Deconstructed: Log Mining Example
[repeats the code from slide 8; the deck highlights the part being discussed]

11. Spark Deconstructed: Log Mining Example
At this point, take a look at the transformed RDD operator graph:

scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
MappedRDD[3] at map at <console>:16 (3 partitions)
FilteredRDD[2] at filter at <console>:14 (3 partitions)
MappedRDD[1] at textFile at <console>:12 (3 partitions)
HadoopRDD[0] at textFile at <console>:12 (3 partitions)

12-20. Spark Deconstructed: Log Mining Example
[diagram sequence, each slide repeating the code from slide 8: the Driver distributes work to three Workers; the Workers read their HDFS blocks (block 1-3); for action 1, each Worker processes its block and caches the data (cache 1-3); for action 2, each Worker processes from its cache]

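One way to see the effect of messages.cache() from the shell: time the two actions. A rough sketch, assuming the definitions from slide 8; the first count() reads from HDFS and populates the cache, so the second should run noticeably faster:

val t0 = System.nanoTime
messages.filter(_.contains("mysql")).count()   // action 1: reads HDFS, fills the cache
val t1 = System.nanoTime
messages.filter(_.contains("php")).count()     // action 2: served from the workers' caches
val t2 = System.nanoTime
println(f"action 1: ${(t1 - t0) / 1e9}%.2f s, action 2: ${(t2 - t1) / 1e9}%.2f s")
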
21. GraphX

22. GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html

Key Points:
• graph-parallel systems
• importance of workflows
• optimizations

23. GraphX
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html

24. GraphX: demo

// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}

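Run in the spark-shell, the triplet filter keeps only the edge with weight 9 (from vertex 5 to vertex 3), so this should print: Leslie loves Chris
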
25. GraphX: demo
TextRank Demo:
cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb

IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

26. Workflows

27. Typical Workflows
[workflow diagram, circa 2010: representation / evaluation / optimization; ETL into cluster/cloud; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set / test set; models; Evaluate; Optimize; Scoring; production data; visualize, reporting; actionable results; decisions, feedback]

28. Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:
• ~2K msgs/month
• ~6 MB as JSON
• ~13 MB parsed

Three months’ list activity represents a graph of:
• 1061 senders
• 753,400 nodes
• 1,027,806 edges

A big graph! However, it satisfies the definition of a graph-parallel system; lots of data locality to leverage.

29. Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture
Martin Fowler
http://martinfowler.com/articles/microservices.html

30. Workflows: An Example…
Python-based service in a Docker container?

Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/

Docker Jumpstart
Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/

31. Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html

32. This Workflow: Microservices meet Parallel Processing
[diagram: services (email archives, community leaderboards) feed SparkSQL stages for Data Prep, Features, Explore; a Scraper / Parser built on NLTK produces data and Unique Word IDs; TextRank, Word2Vec, etc. yield community insights]
not so big data… relatively big compute…

33. Workflows: Scraper pipeline
[pipeline: urllib2 crawl of the Apache email list archive, monthly list by date → Py filter quoted content → Py segment paragraphs → message JSON]

34. Workflows: Scraper pipeline
[same pipeline as slide 33, with one crawled message as JSON:]

{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ…",
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n…"
}

35. Workflows: Parser pipeline
[pipeline: message JSON → TextBlob segment sentences → TextBlob tag and lemmatize words → TextBlob sentiment analysis → Py generate skip-grams → parsed JSON; backed by Treebank, WordNet]

36. Workflows: Parser pipeline
[same pipeline as slide 35, showing the slide 34 message JSON transformed into parsed JSON:]

{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}

37. Workflows: TextRank pipeline
[pipeline: parsed JSON → Spark create word graph → word graph RDD → GraphX run TextRank → Spark extract phrases → ranked phrases; NetworkX visualize graph as a side path]

38. Workflows: TextRank pipeline
"Compatibility of systems of linear constraints"

[{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]

[diagram: word graph linking the kept stems compat, system, linear, constraint]

TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

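To make the create-word-graph step concrete, here is a minimal sketch, not the deck's actual code: the Token case class, the noun/adjective filter, and the window size of 3 are illustrative assumptions. It links kept stems that co-occur within the window, yielding edges like those drawn above:

case class Token(index: Int, stem: String, tag: String, word: String)

val toks = Seq(
  Token(0, "compat", "NNP", "compatibility"), Token(1, "of", "IN", "of"),
  Token(2, "system", "NNS", "systems"), Token(3, "of", "IN", "of"),
  Token(4, "linear", "JJ", "linear"), Token(5, "constraint", "NNS", "constraints")
)

// keep only nouns and adjectives, as in the TextRank paper
val kept = toks.filter(t => t.tag.startsWith("NN") || t.tag.startsWith("JJ"))

// link stems whose token positions fall within a window of 3
val edgePairs = for {
  a <- kept
  b <- kept
  if a.index < b.index && b.index - a.index <= 3
} yield (a.stem, b.stem)
// => (compat,system), (system,linear), (system,constraint), (linear,constraint)
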
39. Workflows: TextRank – how it works
[illustration from https://en.wikipedia.org/wiki/PageRank]

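In equation form (from the Mihalcea & Tarau paper cited on slide 38), TextRank iterates the PageRank recurrence over the word graph, with damping factor $d \approx 0.85$:

$$ S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{\lvert Out(V_j) \rvert} $$

where $In(V_i)$ and $Out(V_j)$ are the sets of incoming and outgoing links; the scores converge, and the top-scoring words seed the keyphrases.
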
40. TextRank impl

41. TextRank impl: load parquet files

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"

42. TextRank impl: use SparkSQL to collect node list + edge list

val sql = """
SELECT node_id, root
FROM node
WHERE id='%s' AND keep='1'
""".format(msg_id)

val n = sqlCtx.sql(sql.stripMargin).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
}
nodes.collect()

val sql = """
SELECT node0, node1
FROM edge
WHERE id='%s'
""".format(msg_id)

val e = sqlCtx.sql(sql.stripMargin).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
}
edges.collect()

43. TextRank impl: use GraphX to run PageRank

// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

val min_rank = median(r.map(_._2).collect())

44. TextRank impl: join ranked words with parsed text

var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()

45. TextRank impl: “pull strings” for the top-ranked keyphrases

s.foreach { x =>
  //println(x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}

46. TextRank impl: report the top keyphrases

// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)

47. Reply Graph

48. Reply Graph: load parquet files

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString

49. Reply Graph: use SparkSQL to collect node list + edge list

val sql = "SELECT id, sender FROM node"
val n = sqlCtx.sql(sql).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
}
nodes.collect()

val sql = "SELECT replier, sender, num FROM edge"
val e = sqlCtx.sql(sql).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
}
edges.collect()

50. Reply Graph: use GraphX to run graph analytics

// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
node.join(scc).foreach(println)

51. Reply Graph: PageRank of top dev@spark email, 4Q2014

(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)

52. Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala

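The linked file follows the single-source shortest paths example from the GraphX programming guide; a minimal sketch of the same idea, assuming the reply graph g from slide 50 and picking vertex 389 (the top PageRank vertex above) as the source:

// initialize distances: 0 at the source, infinity everywhere else
val sourceId: VertexId = 389L
val initialGraph = g.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program: keep the shorter path
  triplet => {                                     // send messages along improving edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                         // merge messages: take the minimum
)

sssp.vertices.collect().foreach(println)
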
53. Look Ahead: Where is this heading?

Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html

[pipeline: ranked phrases → GraphX run Con.Comp. → MLlib run Word2Vec → aggregated by topic → MLlib run KMeans → topic vectors. better than LDA? features… models… insights…]

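For a taste of the MLlib steps in that pipeline, here is a minimal Word2Vec sketch; the input file name and tokenization are illustrative assumptions, not the deck's code:

import org.apache.spark.mllib.feature.Word2Vec

// assume a corpus of sentences, one per line, each a sequence of lemmatized
// tokens, e.g. derived from the parsed JSON "graf" fields
val sentences = sc.textFile("parsed_text.txt").map(_.split(" ").toSeq)

val model = new Word2Vec().fit(sentences)

// nearest neighbors in the learned vector space
model.findSynonyms("spark", 10).foreach(println)
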
54. Resources

55. certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise

56. MOOCs:
Anthony Joseph, UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar, UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066

57. community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training

58. http://spark-summit.org/

59. confs:
Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015
Spark Summit East, NYC, Mar 18-19: spark-summit.org/east
Big Data Tech Con, Boston, Apr 26-28: bigdatatechcon.com
Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015
Spark Summit 2015, SF, Jun 15-17: spark-summit.org

60. books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do

61. presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
