Microservices, Containers,
and Machine Learning
Paco Nathan, @pacoid
Downloads
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

• follow the license agreement instructions
• then click the download for your OS
• you need the JDK instead of the JRE (for Maven, etc.)
• JDK 6, 7, or 8 is fine
Downloads: Java JDK
For Python 2.7, check out Anaconda by
Continuum Analytics for a full-featured
platform:	

store.continuum.io/cshop/anaconda/
Downloads: Python
Let’s get started using Apache Spark, in just a few
easy steps… Download code from:	

databricks.com/spark-training-resources#itas	

or for a fallback: spark.apache.org/downloads.html	

Also, the GitHub project:	

github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
Change into the unpacked “spark” directory, then run:

./bin/spark-shell
Downloads: Spark
Spark Deconstructed
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
// split each line on tabs and keep the second field
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
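To see what the two actions compute without a cluster, here is a minimal plain-Python sketch of the same filter, map, and count logic over a small in-memory list. It does not use Spark; the sample log lines and the helper names `error_messages` and `count_matching` are made up for illustration.

```python
# Hypothetical sample log lines: a level and a message separated by a tab.
LOG_LINES = [
    "ERROR\tmysql connection refused",
    "INFO\tserver started",
    "ERROR\tphp fatal error in index.php",
    "ERROR\tmysql query timed out",
    "WARN\tdisk usage at 91%",
]


def error_messages(lines):
    """Keep the ERROR lines, then take the second tab-separated field."""
    errors = [line for line in lines if line.startswith("ERROR")]
    return [line.split("\t")[1] for line in errors]


def count_matching(messages, pattern):
    """Count the messages that contain the given substring."""
    return sum(1 for message in messages if pattern in message)


# Computed once and reused by both counts, loosely analogous to messages.cache().
messages = error_messages(LOG_LINES)
mysql_count = count_matching(messages, "mysql")
php_count = count_matching(messages, "php")
```

In Spark the filters and maps are lazy and only run when an action such as `count()` is called; the list comprehensions here run eagerly, so the sketch mirrors the logic rather than the execution model.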
Spark Deconstructed: Log Mining Example
[Diagram: one driver and three workers]
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster, and submit the code above to be evaluated on it.
Spark Deconstructed: Log Mining Example
At this point, take a look at the transformed RDD operator graph:
scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
MappedRDD[3] at map at <console>:16 (3 partitions)
FilteredRDD[2] at filter at <console>:14 (3 partitions)
MappedRDD[1] at textFile at <console>:12 (3 partitions)
HadoopRDD[0] at textFile at <console>:12 (3 partitions)
[Diagram: one driver and three workers; each worker is assigned one block of the input (block 1, block 2, block 3)]
[Diagram: each worker reads its HDFS block]
[Diagram: each worker processes its block and caches the data (cache 1, cache 2, cache 3)]
[Diagram: for action 2, each worker processes from its cache]
GraphX
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html

Key Points:

• graph-parallel systems
• importance of workflows
• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
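The triplet view used above pairs each edge with the attributes of its source and destination vertices. As a rough illustration only, the following plain-Python sketch rebuilds that view for the same toy data; the helper names `triplets` and `loves` are hypothetical, and nothing here is GraphX API.

```python
from collections import namedtuple

Peep = namedtuple("Peep", ["name", "age"])
Edge = namedtuple("Edge", ["src", "dst", "attr"])

# Same toy vertices and edges as the GraphX example.
nodes = {
    1: Peep("Kim", 23), 2: Peep("Pat", 31), 3: Peep("Chris", 52),
    4: Peep("Kelly", 39), 5: Peep("Leslie", 45),
}
edges = [
    Edge(2, 1, 7), Edge(2, 4, 2), Edge(3, 2, 4),
    Edge(3, 5, 3), Edge(4, 1, 1), Edge(5, 3, 9),
]


def triplets(nodes, edges):
    """Yield (source attribute, edge attribute, destination attribute) per edge."""
    for edge in edges:
        yield nodes[edge.src], edge.attr, nodes[edge.dst]


def loves(nodes, edges, threshold):
    """Describe every edge whose attribute is strictly above the threshold."""
    return [
        "%s loves %s" % (src.name, dst.name)
        for src, attr, dst in triplets(nodes, edges)
        if attr > threshold
    ]


results = loves(nodes, edges, 7)
```

With a threshold of 7 and a strict comparison, only the edge with attribute 9 qualifies, so `results` holds a single entry.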
GraphX: demo
TextRank Demo:

cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb

IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
GraphX: demo
Workflows
Typical Workflows:
[Diagram, circa 2010, spanning representation, optimization, and evaluation. Labeled stages: ETL into cluster/cloud; Data Prep; Features; Explore; Unsupervised Learning; Learners, Parameters; train set and test set; models; Evaluate; Optimize; Scoring on production data; visualize, reporting; use cases. Annotations: data pipelines; actionable results; decisions, feedback; foo algorithms; bar developers.]
Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:

• ~2K msgs/month
• ~6 MB as JSON
• ~13 MB parsed

Three months’ list activity represents a graph of:

• 1061 senders
• 753,400 nodes
• 1,027,806 edges

A big graph! However, it satisfies the definition of a graph-parallel system, with lots of data locality to leverage.
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture
Martin Fowler
http://martinfowler.com/articles/microservices.html
Workflows: An Example…
Python-based service in a Docker container?	

Just Enough Math, IPython+Docker

Paco Nathan, Andrew Odewahn, Kyle Kelly

https://github.com/ceteri/jem-docker

https://registry.hub.docker.com/u/ceteri/jem/	

Docker Jumpstart

Andrew Odewahn

http://odewahn.github.io/docker-jumpstart/
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
This Workflow: Microservices meet Parallel Processing
[Diagram: services (email archives, community leaderboards) around a SparkSQL workflow of Data Prep, Features, and Explore. A Scraper / Parser stage using NLTK produces data with unique word IDs; TextRank, Word2Vec, etc. produce community insights.]
not so big data… relatively big compute…
Workflows: Scraper pipeline
[Diagram: Apache email list archive → urllib2: crawl monthly list by date → Py: filter quoted content → Py: segment paragraphs → message JSON]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
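The slides name the scraper's "filter quoted content" and "segment paragraphs" steps without showing their code. The sketch below is one plausible reading, assuming quoted lines start with `>` and paragraphs are separated by blank lines; the function names and the sample body text are hypothetical.

```python
import json


def filter_quoted(text):
    """Drop lines quoted from earlier messages (assumed to start with '>')."""
    kept = [line for line in text.split("\n") if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def segment_paragraphs(text):
    """Split on blank lines and re-join the wrapped lines of each paragraph."""
    paragraphs = []
    current = []
    for line in text.split("\n"):
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical message body, loosely modeled on the "text" field above.
raw = (
    "\nOnly fit the data in memory where you want to run the iterative\n"
    "algorithm.\n\n> quoted question from an earlier message\n\nThanks.\n"
)
message = {
    "subject": "Re: memory vs data_size",
    "text": segment_paragraphs(filter_quoted(raw)),
}
encoded = json.dumps(message, sort_keys=True)
```

Real mailing-list archives quote in several styles (for example "On ... wrote:" headers), so a production filter would likely need more rules than the single prefix check used here.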
Workflows: Parser pipeline
[Diagram: message JSON → TextBlob: segment sentences → TextBlob: tag and lemmatize words (Treebank, WordNet) → TextBlob: sentiment analysis → Py: generate skip-grams → parsed JSON]
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
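The "generate skip-grams" step is not shown in the slides. Assuming the `tile` field holds pairs of word indices that occur near each other in a sentence, a minimal sketch of generating such pairs could look like this; the function name `skip_grams` and the `max_skip` parameter are illustrative choices, not taken from the original code.

```python
def skip_grams(ids, max_skip=0):
    """Pair each id with the ids that follow it, skipping at most max_skip items.

    With max_skip=0 this yields plain adjacent pairs (bigrams).
    """
    pairs = []
    for i in range(len(ids)):
        upper = min(len(ids), i + max_skip + 2)
        for j in range(i + 1, upper):
            pairs.append([ids[i], ids[j]])
    return pairs


adjacent = skip_grams([1, 2, 3, 4])
with_one_skip = skip_grams([1, 2, 3, 4], max_skip=1)
```

With `max_skip=0` the output for word indices 1 to 4 has the same shape as the visible start of the `tile` array above; larger values add pairs that jump over intermediate words.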
Workflows: TextRank pipeline
[Diagram: parsed JSON → Spark: create word graph → RDD word graph → GraphX: run TextRank → Spark: extract phrases → ranked phrases; NetworkX: visualize graph]
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[Diagram: word graph over the stems compat, system, linear, constraint, built up in steps 1, 2, 3]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank – how it works
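As a rough sketch of the idea, the code below builds an undirected word graph from the tagged example sentence and scores each stem with the PageRank-style update from the TextRank paper, S(v) = (1 − d) + d · Σ S(u) / degree(u) over the neighbors u of v. The tag filter (nouns and adjectives), the window of adjacent kept words, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
# Tagged tokens from the example sentence above.
TOKENS = [
    {"index": 0, "stem": "compat", "tag": "NNP", "word": "compatibility"},
    {"index": 1, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 2, "stem": "system", "tag": "NNS", "word": "systems"},
    {"index": 3, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 4, "stem": "linear", "tag": "JJ", "word": "linear"},
    {"index": 5, "stem": "constraint", "tag": "NNS", "word": "constraints"},
]


def build_word_graph(tokens, keep_prefixes=("NN", "JJ"), window=2):
    """Link kept stems that fall within the same window of kept words."""
    kept = [t["stem"] for t in tokens if t["tag"].startswith(keep_prefixes)]
    graph = {stem: set() for stem in kept}
    for i, a in enumerate(kept):
        for b in kept[i + 1:i + window]:
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def text_rank(graph, damping=0.85, iterations=50):
    """Repeat the update for a fixed number of rounds and return the scores."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1.0 - damping) + damping * sum(
                scores[nbr] / len(graph[nbr]) for nbr in graph[node]
            )
            for node in graph
        }
    return scores


graph = build_word_graph(TOKENS)
scores = text_rank(graph)
```

On this four-word chain the two inner stems end up with higher scores than the two end stems, since they have more neighbors feeding them rank.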
TextRank impl
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """
SELECT node_id, root
FROM node
WHERE id='%s' AND keep='1'
""".format(msg_id)

val n = sqlCtx.sql(sql.stripMargin).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
}
nodes.collect()

val sql = """
SELECT node0, node1
FROM edge
WHERE id='%s'
""".format(msg_id)

val e = sqlCtx.sql(sql.stripMargin).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
}
edges.collect()
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

val min_rank = median(r.map(_._2).collect())
TextRank impl: join ranked words with parsed text
var span:List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
TextRank impl: “pull strings” for the top-ranked keyphrases
s.foreach { x =>
  //println (x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}
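The loop above accumulates runs of adjacent, above-median words into phrases and sums their ranks. A compact plain-Python sketch of the same idea follows; the rows and rank values are invented for illustration, and, unlike the loop above, this version also flushes the final span after the last row.

```python
def median(values):
    """Middle value of the sorted list (mean of the two middle values if even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2.0
    return ordered[mid]


def extract_phrases(rows, min_rank):
    """rows: (index, word, rank) tuples ordered by index; returns {phrase: rank sum}."""
    phrases = {}
    span, last_index, rank_sum = [], -1, 0.0
    for index, word, rank in rows:
        # A low-ranked word or a gap in word positions ends the current span.
        if span and (rank < min_rank or index - last_index > 1):
            phrases[" ".join(span)] = rank_sum
            span, rank_sum = [], 0.0
        # Start a new span or extend the current one.
        if rank >= min_rank:
            span.append(word)
            last_index = index
            rank_sum += rank
    if span:
        phrases[" ".join(span)] = rank_sum
    return phrases


# Hypothetical ranked rows for the kept words of the example sentence.
rows = [
    (0, "compatibility", 0.5),
    (2, "systems", 1.0),
    (4, "linear", 1.5),
    (5, "constraints", 2.0),
]
min_rank = median([rank for _, _, rank in rows])
phrases = extract_phrases(rows, min_rank)
```

Using the median as the cutoff keeps roughly the top half of the ranked words as phrase candidates; a different threshold would trade recall for precision.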
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
Reply Graph
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"
val n = sqlCtx.sql(sql).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
}
nodes.collect()

val sql = "SELECT replier, sender, num FROM edge"
val e = sqlCtx.sql(sql).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
}
edges.collect()
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// strongly connected components
val scc = g.stronglyConnectedComponents(10).vertices
// join on the (id, sender) pair RDD rather than the raw table
nodes.join(scc).foreach(println)
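The max-degree step reduces a collection of (vertex, degree) pairs with a pairwise comparison. A small plain-Python sketch of that reduction, run on the edges of the toy GraphX example from earlier in the deck rather than on the reply graph, could look like this; the helper names are hypothetical.

```python
from collections import Counter
from functools import reduce

# (source, destination) pairs from the toy GraphX example.
EDGES = [(2, 1), (2, 4), (3, 2), (3, 5), (4, 1), (5, 3)]


def degree_counts(edges):
    """Return in-degree, out-degree, and total degree counters per vertex."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg, in_deg + out_deg


def max_degree(a, b):
    """Pairwise reducer over (vertex, degree) tuples; ties go to the second."""
    return a if a[1] > b[1] else b


in_deg, out_deg, total_deg = degree_counts(EDGES)
max_in = reduce(max_degree, in_deg.items())
max_out = reduce(max_degree, out_deg.items())
max_total = reduce(max_degree, total_deg.items())
```

Because the reducer only compares two pairs at a time, the same function can be applied in any grouping, which is what lets a distributed `reduce` combine partial results from each partition.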
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
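The linked file is not reproduced in the slides. As a generic illustration of the Pregel-style approach to single-source shortest paths (SSSP), and not a copy of that file, the sketch below runs supersteps in which active vertices send candidate distances along their out-edges and each receiving vertex keeps the minimum. It reuses the weighted edges of the toy GraphX example and assumes non-negative weights; the function name `sssp` is hypothetical.

```python
import math

# (source, destination, weight) triples from the toy GraphX example.
EDGES = [(2, 1, 7), (2, 4, 2), (3, 2, 4), (3, 5, 3), (4, 1, 1), (5, 3, 9)]


def sssp(vertices, edges, source):
    """Vertex-centric shortest paths: iterate until no distance improves."""
    dist = {v: (0.0 if v == source else math.inf) for v in vertices}
    active = {source}
    while active:
        # Send phase: active vertices offer dist + weight to out-neighbors,
        # and messages to the same vertex are merged by taking the minimum.
        inbox = {}
        for src, dst, weight in edges:
            if src in active:
                candidate = dist[src] + weight
                if candidate < inbox.get(dst, math.inf):
                    inbox[dst] = candidate
        # Vertex phase: keep the smaller distance; improved vertices stay active.
        active = set()
        for vertex, candidate in inbox.items():
            if candidate < dist[vertex]:
                dist[vertex] = candidate
                active.add(vertex)
    return dist


distances = sssp([1, 2, 3, 4, 5], EDGES, source=3)
```

Only vertices whose distance improved in the previous superstep send messages, so the computation stops once a full round produces no improvement.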
Look Ahead: Where is this heading?
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html

[Diagram: ranked phrases → GraphX: run Con.Comp. → MLlib: run Word2Vec → aggregated by topic → MLlib: run KMeans → topic vectors; better than LDA?]
features… models… insights…
Resources
certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
MOOCs:
Anthony Joseph
UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar
UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
http://spark-summit.org/
confs:
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015

Spark Summit East
NYC, Mar 18-19
spark-summit.org/east

Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com

Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015

Spark Summit 2015
SF, Jun 15-17
spark-summit.org
books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do

More Related Content

What's hot

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowDatabricks
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015Lance Co Ting Keh
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1AjayRawat971036
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 

What's hot (20)

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 


Microservices, Containers, and Machine Learning

  • 1. Microservices, Containers, and Machine Learning Paco Nathan, @pacoid
  • 3. Downloads: Java JDK
 oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
 • follow the license agreement instructions
 • then click the download for your OS
 • you need the JDK rather than the JRE (for Maven, etc.)
 • JDK 6, 7, or 8 is fine
  • 4. Downloads: Python
 For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform: store.continuum.io/cshop/anaconda/
  • 5. Downloads: Spark
 Let’s get started using Apache Spark, in just a few easy steps… Download code from: databricks.com/spark-training-resources#itas
 or for a fallback: spark.apache.org/downloads.html
 Also, the GitHub project: github.com/ceteri/spark-exercises/tree/master/exsto
  • 6. Downloads: Spark
 Change into the unpacked “spark” directory, then run: ./bin/spark-shell
  • 8. Spark Deconstructed: Log Mining Example
 // load error messages from a log into memory
 // then interactively search for various patterns
 // https://gist.github.com/ceteri/8ae5b9509a08c08a1132
 // base RDD
 val lines = sc.textFile("hdfs://...")
 // transformed RDDs
 val errors = lines.filter(_.startsWith("ERROR"))
 val messages = errors.map(_.split("\t")).map(r => r(1))
 messages.cache()
 // action 1
 messages.filter(_.contains("mysql")).count()
 // action 2
 messages.filter(_.contains("php")).count()
  • 9. Spark Deconstructed: Log Mining Example
 We start with Spark running on a cluster… submitting code to be evaluated on it:
 (diagram labels: Driver, Worker, Worker, Worker)
  • 10. Spark Deconstructed: Log Mining Example
 (code as on slide 8; slide annotation: discussing the other part)
  • 11. Spark Deconstructed: Log Mining Example
 At this point, take a look at the transformed RDD operator graph:
 scala> messages.toDebugString
 res5: String =
 MappedRDD[4] at map at <console>:16 (3 partitions)
 MappedRDD[3] at map at <console>:16 (3 partitions)
 FilteredRDD[2] at filter at <console>:14 (3 partitions)
 MappedRDD[1] at textFile at <console>:12 (3 partitions)
 HadoopRDD[0] at textFile at <console>:12 (3 partitions)
  • 12. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker; code as on slide 8; slide annotation: discussing the other part)
  • 13. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3; code as on slide 8; slide annotation: discussing the other part)
  • 14. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3; code as on slide 8; slide annotation: discussing the other part)
  • 15. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3; each Worker: read HDFS block; code as on slide 8; slide annotation: discussing the other part)
  • 16. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3, cache 1, cache 2, cache 3; each Worker: process, cache data; code as on slide 8; slide annotation: discussing the other part)
  • 17. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3, cache 1, cache 2, cache 3; code as on slide 8; slide annotation: discussing the other part)
  • 18. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3, cache 1, cache 2, cache 3; code as on slide 8; slide annotation: discussing the other part)
  • 19. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3, cache 1, cache 2, cache 3; each Worker: process from cache; code as on slide 8; slide annotation: discussing the other part)
  • 20. Spark Deconstructed: Log Mining Example
 (diagram labels: Driver, Worker, Worker, Worker, block 1, block 2, block 3, cache 1, cache 2, cache 3; code as on slide 8; slide annotation: discussing the other part)
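The walkthrough in slides 12 through 20 (nothing runs until the first action, the first action reads the blocks and caches the result, the second action is served from cache) can be imitated without a cluster. The following is only a rough sketch in plain Scala collections, not Spark: the log lines, the `readBlock` helper, and the `blockReads` counter are all made up for illustration, and a `lazy val` stands in loosely for `messages.cache()`.

```scala
object LazyCacheSketch {
  // Stand-in for the three HDFS blocks in the diagram (made-up log lines).
  val blocks: Seq[Seq[String]] = Seq(
    Seq("ERROR\tmysql connection refused", "INFO\tstartup complete"),
    Seq("ERROR\tphp fatal error", "WARN\tdisk usage high"),
    Seq("ERROR\tmysql timeout", "ERROR\tphp notice")
  )

  // Counts how many times a block is read from "storage" (hypothetical helper state).
  var blockReads = 0

  def readBlock(i: Int): Seq[String] = {
    blockReads += 1
    blocks(i)
  }

  // Declaring the pipeline does no work: a lazy val is evaluated on first use
  // and its result is kept afterwards, loosely mirroring a cached RDD.
  lazy val messages: Seq[String] =
    blocks.indices
      .flatMap(readBlock)
      .filter(_.startsWith("ERROR"))
      .map(_.split("\t"))
      .map(r => r(1))

  def main(args: Array[String]): Unit = {
    println(s"reads before any action: $blockReads")
    val mysqlCount = messages.count(_.contains("mysql")) // action 1: reads all blocks
    println(s"mysql: $mysqlCount, reads so far: $blockReads")
    val phpCount = messages.count(_.contains("php"))     // action 2: no further reads
    println(s"php: $phpCount, reads so far: $blockReads")
  }
}
```

Run as a program, the read counter stays at 0 until the first count, rises to 3 during it, and does not change for the second count. Real Spark differs in important ways (partitions are processed in parallel on workers, and cached partitions can be evicted and recomputed from lineage), so treat this purely as an intuition aid.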
  • 23. GraphX
 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
 J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
 graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
 Pregel: Large-scale graph computing at Google
 Grzegorz Czajkowski, et al.
 googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
 GraphX: Unified Graph Analytics on Spark
 Ankur Dave, Databricks
 databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
 Advanced Exercises: GraphX
 databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
  • 24. GraphX: demo
 // http://spark.apache.org/docs/latest/graphx-programming-guide.html
 import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD
 case class Peep(name: String, age: Int)
 val nodeArray = Array(
   (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
   (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
   (5L, Peep("Leslie", 45))
 )
 val edgeArray = Array(
   Edge(2L, 1L, 7), Edge(2L, 4L, 2),
   Edge(3L, 2L, 4), Edge(3L, 5L, 3),
   Edge(4L, 1L, 1), Edge(5L, 3L, 9)
 )
 val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
 val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
 val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)
 val results = g.triplets.filter(t => t.attr > 7)
 for (triplet <- results.collect) {
   println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
 }
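To see what the triplet view in the demo amounts to, here is a minimal sketch in plain Scala collections that needs no Spark runtime. `Link` and `Triplet` are hypothetical stand-ins for GraphX's `Edge` and `EdgeTriplet`, not the library's types; the node and edge data are copied from the slide.

```scala
object TripletSketch {
  final case class Peep(name: String, age: Int)
  // Hypothetical stand-ins for GraphX's Edge and EdgeTriplet.
  final case class Link(src: Long, dst: Long, attr: Int)
  final case class Triplet(srcAttr: Peep, dstAttr: Peep, attr: Int)

  val nodes: Map[Long, Peep] = Map(
    1L -> Peep("Kim", 23), 2L -> Peep("Pat", 31),
    3L -> Peep("Chris", 52), 4L -> Peep("Kelly", 39),
    5L -> Peep("Leslie", 45)
  )

  val links: Seq[Link] = Seq(
    Link(2L, 1L, 7), Link(2L, 4L, 2),
    Link(3L, 2L, 4), Link(3L, 5L, 3),
    Link(4L, 1L, 1), Link(5L, 3L, 9)
  )

  // A triplet joins each edge with the attributes of its two endpoints.
  val triplets: Seq[Triplet] =
    links.map(l => Triplet(nodes(l.src), nodes(l.dst), l.attr))

  // Keep only edges whose weight is strictly above the threshold.
  def loves(threshold: Int): Seq[String] =
    triplets
      .filter(_.attr > threshold)
      .map(t => s"${t.srcAttr.name} loves ${t.dstAttr.name}")

  def main(args: Array[String]): Unit =
    loves(7).foreach(println)
}
```

With the slide's threshold of 7 and a strict comparison, only the edge of weight 9 qualifies, so the sketch prints "Leslie loves Chris"; the weight-7 edge from Pat to Kim is excluded.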
  • 27. Typical Workflows:
 Diagram (circa 2010), phases: representation, optimization, evaluation
 Diagram labels: ETL into cluster/cloud; data; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; data pipelines; visualize, reporting; use cases; actionable results; decisions, feedback; developers; algorithms
  • 28. Workflows: Scraper pipeline
 Typical data rates, e.g., for dev@spark.apache.org:
 • ~2K msgs/month
 • ~6 MB as JSON
 • ~13 MB parsed
 Three months’ list activity represents a graph of:
 • 1061 senders
 • 753,400 nodes
 • 1,027,806 edges
 A big graph! However, it satisfies the definition for a graph-parallel system; lots of data locality to leverage
  • 29. Workflows: A Few Notes about Microservices and Containers
 The Strengths and Weaknesses of Microservices
 Abel Avram
 http://www.infoq.com/news/2014/05/microservices
 DockerCon EU Keynote: State of the Art in Microservices
 Adrian Cockcroft
 https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/
 Microservices Architecture
 Martin Fowler
 http://martinfowler.com/articles/microservices.html
  • 30. Workflows: An Example…
 Python-based service in a Docker container?
 Just Enough Math, IPython+Docker
 Paco Nathan, Andrew Odewahn, Kyle Kelly
 https://github.com/ceteri/jem-docker
 https://registry.hub.docker.com/u/ceteri/jem/
 Docker Jumpstart
 Andrew Odewahn
 http://odewahn.github.io/docker-jumpstart/
  • 31. Workflows: A Brief Note about ETL in SparkSQL
 Spark SQL Data Sources API: Unified Data Access for the Spark Platform
 Michael Armbrust
 databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
  • 32. This Workflow: Microservices meet Parallel Processing
 Diagram labels: email archives; Scraper / Parser; NLTK data; Unique Word IDs; services; SparkSQL; Data Prep; Features; Explore; TextRank, Word2Vec, etc.; community leaderboards; community insights
 not so big data… relatively big compute…
  • 33. Workflows: Scraper pipeline
 Diagram labels: Apache email list archive; urllib2 crawl monthly list by date; Py filter quoted content; Py segment paragraphs; message JSON
  • 34. Workflows: Scraper pipeline
 Diagram labels: Apache email list archive; urllib2 crawl monthly list by date; Py filter quoted content; Py segment paragraphs; message JSON
     {
       "date": "2014-10-01T00:16:08+00:00",
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
       "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
       "prev_thread": "",
       "sender": "Debasish Das <debasish.da...@gmail.com>",
       "subject": "Re: memory vs data_size",
       "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
     }
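 The "filter quoted content" and "segment paragraphs" stages might look roughly like the sketch below. This is an illustration under assumptions, not the code from the repository: the function names, the `>` quoting convention, and the "On … wrote:" attribution pattern are all hypothetical choices, and the quoted reply in the sample text is made up.

```python
import re


def filter_quoted(text):
    """Drop quoted reply lines ('>' prefix) and 'On ... wrote:' attribution lines."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith(">"):
            continue
        if re.match(r"^On .* wrote:$", stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


def segment_paragraphs(text):
    """Split on blank lines and rejoin soft-wrapped lines within each paragraph."""
    grafs = []
    for block in re.split(r"\n\s*\n", text):
        graf = " ".join(part.strip() for part in block.split("\n") if part.strip())
        if graf:
            grafs.append(graf)
    return grafs


# First sentence follows the slide's sample message; the quoted reply is invented.
raw = (
    "\nOnly fit the data in memory where you want to run the iterative\n"
    "algorithm....\n\n"
    "On Tue, Sep 30, 2014, someone wrote:\n"
    "> How much memory do I need?\n"
    "> Thanks\n"
)

paragraphs = segment_paragraphs(filter_quoted(raw))
print(paragraphs)
```

 Filtering before segmenting keeps quoted text from earlier messages out of the paragraph list, so each message contributes only its own words to later stages.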
  • 36. Workflows: Parser pipeline
 Diagram labels: message JSON; TextBlob segment sentences; TextBlob tag and lemmatize words; TextBlob sentiment analysis; Py generate skip-grams; Treebank, WordNet; parsed JSON
 parsed JSON:
     {
       "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "polr": 0.2,
       "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
       "size": 14,
       "subj": 0.7,
       "tile": [ [1, 2], [2, 3], [3, 4] ... ]
     }
 message JSON:
     {
       "date": "2014-10-01T00:16:08+00:00",
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
       "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
       "prev_thread": "",
       "sender": "Debasish Das <debasish.da...@gmail.com>",
       "subject": "Re: memory vs data_size",
       "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
     }
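 The "tile" field in the parsed JSON looks like pairs of adjacent word IDs. A minimal sketch of the "generate skip-grams" stage that would produce such pairs follows; the function name and the `max_skip` parameter are hypothetical, and the actual pipeline may pair words differently.

```python
def skip_grams(word_ids, max_skip=0):
    """Pair each word ID with the IDs that follow it, skipping at most max_skip words."""
    pairs = []
    for i, left in enumerate(word_ids):
        # With max_skip=0 only the immediate successor is paired (plain bigrams).
        for j in range(i + 1, min(i + 2 + max_skip, len(word_ids))):
            pairs.append([left, word_ids[j]])
    return pairs


# Adjacent pairs, matching the shape of the "tile" field on the slide.
tile = skip_grams([1, 2, 3, 4])

# Allowing one skipped word adds pairs such as [1, 3] and [2, 4].
wider = skip_grams([1, 2, 3, 4], max_skip=1)

print(tile)
print(wider)
```

 These pairs can serve directly as the edge list of a word graph, which is the input the TextRank stage needs.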
  • 37. Workflows: TextRank pipeline
 Diagram labels: parsed JSON; Spark create word graph; RDD word graph; NetworkX visualize graph; GraphX run TextRank; Spark extract phrases; ranked phrases
  • 38. Workflows: TextRank pipeline
 "Compatibility of systems of linear constraints"
     [{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
      {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
      {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
      {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
      {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
      {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
 Diagram labels (steps 1, 2, 3; word graph nodes): compat, system, linear, constraint
 TextRank: Bringing Order into Texts
 Rada Mihalcea, Paul Tarau
 http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
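 To make the idea concrete, here is a small pure-Python sketch of TextRank on the example sentence. It is a simplification under stated assumptions: only noun and adjective stems are kept, each kept stem is linked to the next kept stem (the paper uses a co-occurrence window), and the ranking is a plain power-iteration PageRank rather than the GraphX implementation, so the scores are normalized differently. The function names are hypothetical.

```python
tokens = [
    {"index": 0, "stem": "compat", "tag": "NNP", "word": "compatibility"},
    {"index": 1, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 2, "stem": "system", "tag": "NNS", "word": "systems"},
    {"index": 3, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 4, "stem": "linear", "tag": "JJ", "word": "linear"},
    {"index": 5, "stem": "constraint", "tag": "NNS", "word": "constraints"},
]


def build_word_graph(tokens, keep_tags=("NN", "JJ")):
    """Undirected graph linking each kept stem to the next kept stem."""
    stems = [t["stem"] for t in tokens if t["tag"].startswith(keep_tags)]
    graph = {stem: set() for stem in stems}
    for a, b in zip(stems, stems[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def page_rank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Power iteration; each node shares its rank equally among its neighbors."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(max_iter):
        new_ranks = {}
        for node in graph:
            incoming = sum(ranks[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_ranks[node] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new_ranks[k] - ranks[k]) for k in graph)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


graph = build_word_graph(tokens)
ranks = page_rank(graph)
for stem, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(stem, round(score, 4))
```

 With this chain-shaped graph the two interior stems, "system" and "linear", end up ranked above the two endpoints, since each has two neighbors feeding it rank.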
  • 41. TextRank impl: load parquet files
     import org.apache.spark.graphx._
     import org.apache.spark.rdd.RDD

     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
     import sqlCtx._

     val edge = sqlCtx.parquetFile("graf_edge.parquet")
     edge.registerTempTable("edge")

     val node = sqlCtx.parquetFile("graf_node.parquet")
     node.registerTempTable("node")

     // pick one message as an example; at scale we'd parallelize
     val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
  • 42. TextRank impl: use SparkSQL to collect node list + edge list
     val sql = """
     SELECT node_id, root
     FROM node
     WHERE id='%s' AND keep='1'
     """.format(msg_id)

     val n = sqlCtx.sql(sql.stripMargin).distinct()
     val nodes: RDD[(Long, String)] = n.map{ p =>
       (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
     }
     nodes.collect()

     val sql = """
     SELECT node0, node1
     FROM edge
     WHERE id='%s'
     """.format(msg_id)

     val e = sqlCtx.sql(sql.stripMargin).distinct()
     val edges: RDD[Edge[Int]] = e.map{ p =>
       Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
     }
     edges.collect()
  • 43. TextRank impl: use GraphX to run PageRank
     // run PageRank
     val g: Graph[String, Int] = Graph(nodes, edges)
     val r = g.pageRank(0.0001).vertices

     r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

     // save the ranks
     case class Rank(id: Int, rank: Float)
     val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
     rank.registerTempTable("rank")

     def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
       import n._
       val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
       if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
     }

     val min_rank = median(r.map(_._2).collect())
  • 44. TextRank impl: join ranked words with parsed text
     var span:List[String] = List()
     var last_index = -1
     var rank_sum = 0.0

     var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map()

     val sql = """
     SELECT n.num, n.raw, r.rank
     FROM node n JOIN rank r ON n.node_id = r.id
     WHERE n.id='%s' AND n.keep='1'
     ORDER BY n.num
     """.format(msg_id)

     val s = sqlCtx.sql(sql.stripMargin).collect()
  • 45. TextRank impl: “pull strings” for the top-ranked keyphrases
     s.foreach { x =>
       //println (x)
       val index = x.getInt(0)
       val word = x.getString(1)
       val rank = x.getFloat(2)
       var isStop = false

       // test for break from past
       if (span.size > 0 && rank < min_rank) isStop = true
       if (span.size > 0 && (index - last_index > 1)) isStop = true

       // clear accumulation
       if (isStop) {
         val phrase = span.mkString(" ")
         phrases += (phrase -> rank_sum)

         span = List()
         last_index = index
         rank_sum = 0.0
       }

       // start or append
       if (rank >= min_rank) {
         span = span :+ word
         last_index = index
         rank_sum += rank
       }
     }

     // flush the final span, which the loop above never emits on its own
     if (span.nonEmpty) phrases += (span.mkString(" ") -> rank_sum)
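 The phrase-collection loop can be hard to follow in slide form, so here is the same idea as a small Python sketch. It assumes rows arrive as (index, word, rank) tuples ordered by index, with stop words already filtered out so that a gap in the index marks a phrase boundary. The function name and the sample ranks are hypothetical, and the sketch always flushes the last open span at the end of the input.

```python
def collect_phrases(rows, min_rank):
    """Group consecutive words ranked at or above min_rank into phrases.

    rows: iterable of (index, word, rank), ordered by index.
    Returns a dict mapping each phrase to the sum of its word ranks.
    """
    phrases = {}
    span, last_index, rank_sum = [], -1, 0.0

    def flush():
        nonlocal span, rank_sum
        if span:
            phrases[" ".join(span)] = rank_sum
        span, rank_sum = [], 0.0

    for index, word, rank in rows:
        # A low-ranked word or a gap in the word index ends the current phrase.
        if span and (rank < min_rank or index - last_index > 1):
            flush()
        if rank >= min_rank:
            span.append(word)
            last_index = index
            rank_sum += rank

    flush()  # emit whatever span is still open at the end of the message
    return phrases


# Made-up ranks for the example sentence; indices 1 and 3 ("of") were filtered out.
rows = [
    (0, "compatibility", 1.25),
    (2, "systems", 1.5),
    (4, "linear", 1.0),
    (5, "constraints", 1.5),
]
phrases = collect_phrases(rows, min_rank=1.0)
top = sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

 Summing the ranks favors longer phrases, which is why "linear constraints" comes out ahead of either single word here.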
  • 46. TextRank impl: report the top keyphrases
     // summarize the text as a list of ranked keyphrases
     val summary = sc.parallelize(phrases.toSeq)
       .distinct()
       .sortBy(_._2, ascending=false)
  • 48. Reply Graph: load parquet files
     import org.apache.spark.graphx._
     import org.apache.spark.rdd.RDD

     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
     import sqlCtx._

     val edge = sqlCtx.parquetFile("reply_edge.parquet")
     edge.registerTempTable("edge")

     val node = sqlCtx.parquetFile("reply_node.parquet")
     node.registerTempTable("node")

     edge.schemaString
     node.schemaString
  • 49. Reply Graph: use SparkSQL to collect node list + edge list
     val sql = "SELECT id, sender FROM node"
     val n = sqlCtx.sql(sql).distinct()
     val nodes: RDD[(Long, String)] = n.map{ p =>
       (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
     }
     nodes.collect()

     val sql = "SELECT replier, sender, num FROM edge"
     val e = sqlCtx.sql(sql).distinct()
     val edges: RDD[Edge[Int]] = e.map{ p =>
       Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
     }
     edges.collect()
  • 50. Reply Graph: use GraphX to run graph analytics
     // run graph analytics
     val g: Graph[String, Int] = Graph(nodes, edges)
     val r = g.pageRank(0.0001).vertices
     r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

     // define a reduce operation to compute the highest degree vertex
     def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
       if (a._2 > b._2) a else b
     }

     // compute the max degrees
     val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
     val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
     val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

     // strongly connected components, joined against the (id, sender) pairs
     val scc = g.stronglyConnectedComponents(10).vertices
     nodes.join(scc).foreach(println)
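 The max-degree computation reduces to counting how often each vertex appears as a source or destination. A plain-Python sketch follows, with a made-up edge list in the (replier, sender, num) layout used by the SQL query; the function names are hypothetical and this is not how GraphX computes degrees internally.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree, total_degree) counters for an edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst, _num in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg, in_deg + out_deg


def max_degree(counts):
    """Vertex with the highest count, as a (vertex_id, degree) pair."""
    return max(counts.items(), key=lambda kv: kv[1])


# Hypothetical reply edges: (replier, sender, number of replies).
edges = [(1, 2, 3), (3, 2, 1), (2, 1, 1), (4, 2, 2), (2, 3, 1)]

in_deg, out_deg, total_deg = degree_counts(edges)
max_in = max_degree(in_deg)
max_out = max_degree(out_deg)
max_total = max_degree(total_deg)
print(max_in, max_out, max_total)
```

 As in the GraphX version, each distinct edge counts once regardless of its `num` attribute, so these are counts of reply relationships rather than of individual messages.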
  • 51. Reply Graph: PageRank of top dev@spark email, 4Q2014
     (389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
     (857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
     (652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
     (101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
     (471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
     (931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
     (48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
     (1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
     (1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
     (122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
     (904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
     (827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
     (887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
     (303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
     (206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
     (483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
     (185,(5.259438927615685,SK <skrishna...@gmail.com>))
     (636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

     // seaaaaaaaaaan!
     maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
     maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
     maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
  • 52. Reply Graph: What SSSP looks like in GraphX/Pregel
 github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
  • 53. Look Ahead: Where is this heading?
 Feature learning with Word2Vec
 Matt Krzus
 www.yseam.com/blog/WV.html
 Diagram labels: ranked phrases; GraphX run Con.Comp.; MLlib run Word2Vec; aggregated by topic; MLlib run KMeans; topic vectors
 better than LDA?
 features… models… insights…
  • 55. certification: Apache Spark developer certificate program
 • http://oreilly.com/go/sparkcert
 • defined by Spark experts @Databricks
 • assessed by O’Reilly Media
 • establishes the bar for Spark expertise
  • 56. MOOCs:
 Anthony Joseph
 UC Berkeley
 begins 2015-02-23
 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
 Ameet Talwalkar
 UCLA
 begins 2015-04-14
 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
  • 57. community: spark.apache.org/community.html
 events worldwide: goo.gl/2YqJZK
 video+preso archives: spark-summit.org
 resources: databricks.com/spark-training-resources
 workshops: databricks.com/spark-training
  • 59. confs:
 Strata CA
 San Jose, Feb 18-20
 strataconf.com/strata2015
 Spark Summit East
 NYC, Mar 18-19
 spark-summit.org/east
 Big Data Tech Con
 Boston, Apr 26-28
 bigdatatechcon.com
 Strata EU
 London, May 5-7
 strataconf.com/big-data-conference-uk-2015
 Spark Summit 2015
 SF, Jun 15-17
 spark-summit.org
  • 60. books:
 Fast Data Processing with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/product/9781782167068.do
 Spark in Action
 Chris Fregly
 Manning (2015*)
 sparkinaction.com/
 Learning Spark
 Holden Karau, Andy Konwinski, Matei Zaharia
 O’Reilly (2015*)
 shop.oreilly.com/product/0636920028512.do
  • 61. presenter:
 Just Enough Math
 O’Reilly, 2014
 justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA
 monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/
 Enterprise Data Workflows with Cascading
 O’Reilly, 2013
 shop.oreilly.com/product/0636920028536.do