Microservices, Containers,
and Machine Learning
Paco Nathan, @pacoid
Downloads
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

• follow the license agreement instructions

• then click the download for your OS

• need JDK instead of JRE (for Maven, etc.)

• JDK 6, 7, or 8 is fine
Downloads: Java JDK
For Python 2.7, check out Anaconda by
Continuum Analytics for a full-featured
platform:	

store.continuum.io/cshop/anaconda/
Downloads: Python
Let’s get started using Apache Spark, in just a few
easy steps… Download code from:	

databricks.com/spark-training-resources#itas	

or for a fallback: spark.apache.org/downloads.html	

Also, the GitHub project:	

github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
Change into the unpacked “spark” directory,
then run:

./bin/spark-shell
Downloads: Spark
Spark Deconstructed
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
// split each error line on tabs and keep the second field (the message)
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster, submitting code to be evaluated on it:
[Diagram: a Driver connected to three Workers]
At this point, take a look at the transformed RDD operator graph:
scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
MappedRDD[3] at map at <console>:16 (3 partitions)
FilteredRDD[2] at filter at <console>:14 (3 partitions)
MappedRDD[1] at textFile at <console>:12 (3 partitions)
HadoopRDD[0] at textFile at <console>:12 (3 partitions)
[Diagram sequence: the same log mining code, evaluated step by step on a Driver and three Workers]
• the Driver sends the work to the three Workers
• each Worker is assigned one block of the input: block 1, block 2, block 3
• action 1: each Worker reads its HDFS block
• each Worker processes its block and caches the data: cache 1, cache 2, cache 3
• action 2: each Worker processes from cache instead of reading HDFS again
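The caching behavior in this walk-through can be mimicked in plain Python 3, without Spark. This is only an illustrative sketch: the sample log lines, the `read_lines` stand-in for `sc.textFile`, and the `CachedMessages` helper are all hypothetical and are not Spark APIs.

```python
# Plain-Python sketch of the log mining flow: one lazy, cached dataset and two actions.
# The sample log lines are made up for illustration.
LOG_LINES = [
    "ERROR\tmysql connection refused",
    "INFO\tservice started",
    "ERROR\tphp fatal error in index.php",
    "ERROR\tmysql timeout",
]

reads = {"count": 0}  # how many times the source is scanned


def read_lines():
    """Stand-in for sc.textFile: scanning the source is the expensive step."""
    reads["count"] += 1
    return iter(LOG_LINES)


class CachedMessages:
    """Computes the error messages on first use, then serves them from memory."""

    def __init__(self):
        self._cache = None

    def get(self):
        if self._cache is None:
            errors = (line for line in read_lines() if line.startswith("ERROR"))
            self._cache = [line.split("\t")[1] for line in errors]
        return self._cache


messages = CachedMessages()


def count_containing(term):
    return sum(1 for m in messages.get() if term in m)


mysql_count = count_containing("mysql")  # action 1: scans the source, fills the cache
php_count = count_containing("php")      # action 2: answered from the cache
print(mysql_count, php_count, reads["count"])
```

Nothing is read until the first count is requested, and the second count does not trigger another scan, which is the point of calling `cache()` before the two actions.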
GraphX
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html

Key Points:

• graph-parallel systems	

• importance of workflows	

• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

// keep only the triplets whose edge attribute is greater than 7
val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
GraphX: demo
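For readers without a Spark shell at hand, the triplet view used above can be approximated in plain Python 3. This is a rough sketch, not the GraphX API: the `triplets` helper is hypothetical, and the data simply repeats the node and edge arrays from the Scala example.

```python
# Plain-Python mirror of the triplet filter; same nodes and edges as the Scala example.
people = {
    1: ("Kim", 23),
    2: ("Pat", 31),
    3: ("Chris", 52),
    4: ("Kelly", 39),
    5: ("Leslie", 45),
}
edges = [(2, 1, 7), (2, 4, 2), (3, 2, 4), (3, 5, 3), (4, 1, 1), (5, 3, 9)]


def triplets(vertices, edge_list):
    """Join each edge with the attributes of its source and destination vertex."""
    for src, dst, attr in edge_list:
        yield vertices[src], attr, vertices[dst]


# keep only the triplets whose edge attribute is greater than 7
loves = [(s[0], d[0]) for s, attr, d in triplets(people, edges) if attr > 7]
for src_name, dst_name in loves:
    print(f"{src_name} loves {dst_name}")
```

A triplet is just an edge joined with both of its endpoint attributes, which is why the filter can read the edge weight and the names in one pass.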
TextRank Demo:

cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb

IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
GraphX: demo
Workflows
Typical Workflows:
[Diagram, circa 2010: a workflow grouped under representation, optimization, and evaluation. Labels: ETL into cluster/cloud; data; Data Prep; Features; Explore; Unsupervised Learning; visualize, reporting; Learners, Parameters; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; foo algorithms; bar developers]
Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:	

• ~2K msgs/month	

• ~6 MB as JSON	

• ~13 MB parsed	

Three months’ list activity represents a graph of:	

• 1061 senders	

• 753,400 nodes	

• 1,027,806 edges	

A big graph! However, it satisfies the definition of a graph-parallel system, with lots of data locality to leverage.
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture

Martin Fowler

http://martinfowler.com/articles/microservices.html
Workflows: An Example…
Python-based service in a Docker container?	

Just Enough Math, IPython+Docker

Paco Nathan, Andrew Odewahn, Kyle Kelly

https://github.com/ceteri/jem-docker

https://registry.hub.docker.com/u/ceteri/jem/	

Docker Jumpstart

Andrew Odewahn

http://odewahn.github.io/docker-jumpstart/
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
This Workflow: Microservices meet Parallel Processing
[Diagram labels: services; email archives; community leaderboards; Scraper / Parser; NLTK; data; SparkSQL; Data Prep; Features; Unique Word IDs; Explore; TextRank, Word2Vec, etc.; community insights]
not so big data… relatively big compute…
Workflows: Scraper pipeline
[Diagram: Apache email list archive → urllib2 crawl monthly list by date → Py filter quoted content → Py segment paragraphs → message JSON]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
Workflows: Parser pipeline
[Diagram: message JSON is turned into parsed JSON through these steps: TextBlob segment sentences; TextBlob tag and lemmatize words (Treebank, WordNet); TextBlob sentiment analysis; Py generate skip-grams]
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
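The "generate skip-grams" step is not spelled out on the slides. As a minimal sketch in Python 3, assuming each entry of "tile" pairs a word index with the indices that follow it inside a small window, it could look like this. The function name `skip_grams` and the `window` parameter are hypothetical, not taken from the pipeline's code.

```python
def skip_grams(word_ids, window=1):
    """Pair each word index with the next `window` indices that follow it."""
    pairs = []
    for i, left in enumerate(word_ids):
        for right in word_ids[i + 1 : i + 1 + window]:
            pairs.append([left, right])
    return pairs


adjacent = skip_grams([1, 2, 3, 4])          # neighbors only
wider = skip_grams([1, 2, 3, 4], window=2)   # also skip over one word
print(adjacent)
print(wider)
```

With a window of 1 the pairs are plain adjacent words; a larger window lets an edge skip over intermediate words, which yields a denser word graph.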
Workflows: TextRank pipeline
[Diagram: parsed JSON → Spark create word graph → RDD word graph → GraphX run TextRank → Spark extract phrases → ranked phrases; NetworkX visualize graph]
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[Diagram: word graph over the stems compat, system, linear, constraint; panels labeled 1:, 2:, 3:]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank – how it works
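The idea can be sketched in plain Python 3: build a graph over the kept stems and run PageRank on it. This is a simplified illustration rather than the implementation used in the talk. It assumes edges only between neighboring kept stems, treats the graph as undirected, and uses a normalized power iteration with a damping factor of 0.85; the `pagerank` function is hypothetical.

```python
def pagerank(neighbors, damping=0.85, tol=1e-6, max_iter=100):
    """Power iteration on an undirected graph given as {node: set of neighbors}."""
    n = len(neighbors)
    ranks = {node: 1.0 / n for node in neighbors}
    for _ in range(max_iter):
        new_ranks = {}
        for node in neighbors:
            # each neighbor shares its rank equally among its own neighbors
            incoming = sum(ranks[nb] / len(neighbors[nb]) for nb in neighbors[node])
            new_ranks[node] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_ranks[k] - ranks[k]) for k in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# stems kept from "Compatibility of systems of linear constraints"
stems = ["compat", "system", "linear", "constraint"]
neighbors = {s: set() for s in stems}
for a, b in zip(stems, stems[1:]):  # assumption: link neighboring kept stems
    neighbors[a].add(b)
    neighbors[b].add(a)

ranks = pagerank(neighbors)
for stem in sorted(ranks, key=ranks.get, reverse=True):
    print(stem, round(ranks[stem], 3))
```

In this tiny chain the two inner stems end up with higher ranks than the two outer ones, because they receive rank from two neighbors instead of one.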
TextRank impl
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """
SELECT node_id, root
FROM node
WHERE id='%s' AND keep='1'
""".format(msg_id)

val n = sqlCtx.sql(sql.stripMargin).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
}
nodes.collect()

val sql = """
SELECT node0, node1
FROM edge
WHERE id='%s'
""".format(msg_id)

val e = sqlCtx.sql(sql.stripMargin).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
}
edges.collect()
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

// use the median rank as the cutoff for keeping a word
val min_rank = median(r.map(_._2).collect())
TextRank impl: join ranked words with parsed text
var span:List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
TextRank impl: “pull strings” for the top-ranked keyphrases
s.foreach { x =>
  //println (x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}

// emit the last span, which the loop leaves pending when the text ends on a kept word
if (span.nonEmpty) phrases += (span.mkString(" ") -> rank_sum)
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
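The keyphrase steps above (median cutoff, then collecting runs of adjacent high-ranked words) can be tried outside Spark with a small Python 3 sketch. The rows and their rank values are made up for illustration and stand in for the result of the SQL join; `extract_phrases` is a hypothetical helper that follows the same logic and also emits the final pending span.

```python
from statistics import median

# Hypothetical (index, word, rank) rows; the rank values are invented.
rows = [
    (0, "compatibility", 0.75),
    (1, "of", 0.125),
    (2, "systems", 0.5),
    (3, "of", 0.125),
    (4, "linear", 0.875),
    (5, "constraints", 1.0),
]


def extract_phrases(rows, min_rank):
    """Collect runs of adjacent words whose rank is at least min_rank."""
    phrases = {}
    span, rank_sum, last_index = [], 0.0, -1
    for index, word, rank in rows:
        # a low-ranked word or a gap in positions ends the current span
        if span and (rank < min_rank or index - last_index > 1):
            phrases[" ".join(span)] = rank_sum
            span, rank_sum = [], 0.0
        # start or extend a span with a high-ranked word
        if rank >= min_rank:
            span.append(word)
            rank_sum += rank
            last_index = index
    if span:  # emit the final span
        phrases[" ".join(span)] = rank_sum
    return phrases


min_rank = median([rank for _, _, rank in rows])
phrases = extract_phrases(rows, min_rank)
summary = sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)
print(summary)
```

Summing the ranks over a span favors multi-word phrases, so "linear constraints" outranks the single word "compatibility" in this toy input.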
Reply Graph
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"
val n = sqlCtx.sql(sql).distinct()
val nodes: RDD[(Long, String)] = n.map{ p =>
  (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
}
nodes.collect()

val sql = "SELECT replier, sender, num FROM edge"
val e = sqlCtx.sql(sql).distinct()
val edges: RDD[Edge[Int]] = e.map{ p =>
  Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
}
edges.collect()
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// strongly connected components (at most 10 iterations)
val scc = g.stronglyConnectedComponents(10).vertices
// join on the (id, sender) pair RDD, not on the raw "node" table
nodes.join(scc).foreach(println)
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
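The max-degree reduction can be reproduced on a toy graph in plain Python 3. This is a sketch under stated assumptions: the edge list and vertex ids are invented, and degrees count edges rather than summing the reply counts, which is how the `inDegrees`, `outDegrees`, and `degrees` values are used above.

```python
from collections import Counter
from functools import reduce

# Hypothetical reply edges as (replier, sender, num); the ids are made up.
edges = [(1, 2, 3), (3, 2, 1), (2, 1, 4), (4, 2, 2)]

in_degrees = Counter(dst for _, dst, _ in edges)
out_degrees = Counter(src for src, _, _ in edges)
degrees = in_degrees + out_degrees


def max_degree(a, b):
    """Reduce step: keep whichever (vertex, degree) pair has the larger degree."""
    return a if a[1] > b[1] else b


max_in = reduce(max_degree, in_degrees.items())
max_out = reduce(max_degree, out_degrees.items())
max_total = reduce(max_degree, degrees.items())
print(max_in, max_out, max_total)
```

Because the reduce step only compares two pairs at a time, it can be applied in any grouping, which is what lets the same function run as a distributed reduce.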
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
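As a rough idea of the Pregel style, single-source shortest paths can be written as repeated supersteps in plain Python 3. This sketch is not the code in the linked file: the `pregel_sssp` function is hypothetical, it assumes non-negative edge weights, and it reuses the edge array from the earlier GraphX example as sample data.

```python
import math


def pregel_sssp(edges, source):
    """Shortest distances from source over directed (src, dst, weight) edges."""
    vertices = {v for src, dst, _ in edges for v in (src, dst)}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0
    active = {source}
    while active:
        # send: every active vertex offers (its distance + edge weight) to its out-neighbors
        inbox = {}
        for src, dst, weight in edges:
            if src in active:
                candidate = dist[src] + weight
                # merge: keep only the smallest message per destination
                inbox[dst] = min(inbox.get(dst, math.inf), candidate)
        # vertex program: accept an improvement; only improved vertices stay active
        active = set()
        for vertex, candidate in inbox.items():
            if candidate < dist[vertex]:
                dist[vertex] = candidate
                active.add(vertex)
    return dist


edges = [(2, 1, 7), (2, 4, 2), (3, 2, 4), (3, 5, 3), (4, 1, 1), (5, 3, 9)]
dist = pregel_sssp(edges, source=3)
print(dist)
```

The loop stops once a superstep improves no vertex, which mirrors how a Pregel computation halts when no messages remain.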
Look Ahead: Where is this heading?
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html
[Diagram: ranked phrases → GraphX run Con.Comp. → MLlib run Word2Vec → aggregated by topic → MLlib run KMeans → topic vectors; better than LDA?]
features… models… insights…
Resources
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
certification:
MOOCs:
Anthony Joseph
UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar
UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
http://spark-summit.org/
confs:
Strata CA

San Jose, Feb 18-20

strataconf.com/strata2015
Spark Summit East

NYC, Mar 18-19

spark-summit.org/east
Big Data Tech Con

Boston, Apr 26-28

bigdatatechcon.com
Strata EU

London, May 5-7

strataconf.com/big-data-conference-uk-2015
Spark Summit 2015

SF, Jun 15-17

spark-summit.org
books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do

presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do

More Related Content

What's hot

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
Stepan Pushkarev
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
Databricks
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Alex Zeltov
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015Lance Co Ting Keh
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Alex Zeltov
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
Alexis Seigneurin
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
AjayRawat971036
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Databricks
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
Databricks
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
Noam Shaish
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 

What's hot (20)

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 

Microservices, Containers, and Machine Learning

  • 1. Microservices, Containers, and Machine Learning Paco Nathan, @pacoid
  • 3. Downloads: Java JDK
       oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
       • follow the license agreement instructions
       • then click the download for your OS
       • need JDK instead of JRE (for Maven, etc.)
       • JDK 6, 7, or 8 is fine
  • 4. Downloads: Python
       For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
       store.continuum.io/cshop/anaconda/
  • 5. Downloads: Spark
       Let’s get started using Apache Spark, in just a few easy steps… Download code from:
       databricks.com/spark-training-resources#itas
       or for a fallback: spark.apache.org/downloads.html
       Also, the GitHub project:
       github.com/ceteri/spark-exercises/tree/master/exsto
  • 6. Downloads: Spark
       Change into the unpacked “spark” directory, then run:
       ./bin/spark-shell
  • 8. Spark Deconstructed: Log Mining Example
       // load error messages from a log into memory
       // then interactively search for various patterns
       // https://gist.github.com/ceteri/8ae5b9509a08c08a1132

       // base RDD
       val lines = sc.textFile("hdfs://...")

       // transformed RDDs
       val errors = lines.filter(_.startsWith("ERROR"))
       val messages = errors.map(_.split("\t")).map(r => r(1))
       messages.cache()

       // action 1
       messages.filter(_.contains("mysql")).count()

       // action 2
       messages.filter(_.contains("php")).count()
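To see what the pipeline on this slide computes without a cluster, here is a minimal plain-Scala sketch. It is not Spark code: a `Seq` stands in for the RDD returned by `sc.textFile`, the object name `LogMiningSketch` is hypothetical, and the five log lines are invented for illustration, assuming a tab-separated "LEVEL, message" layout.

```scala
// Plain-Scala stand-in for the RDD pipeline: a Seq replaces sc.textFile(...).
// The sample log lines below are invented for illustration.
object LogMiningSketch {
  val lines: Seq[String] = Seq(
    "ERROR\tmysql connection refused",
    "INFO\tphp worker started",
    "ERROR\tphp fatal: out of memory",
    "ERROR\tmysql query timeout",
    "WARN\tdisk usage at 91%"
  )

  // keep only error records, then take the second tab-separated field
  val errors: Seq[String] = lines.filter(_.startsWith("ERROR"))
  val messages: Seq[String] = errors.map(_.split("\t")).map(r => r(1))

  // the two "actions": count the messages that mention each keyword
  val mysqlCount: Int = messages.count(_.contains("mysql"))
  val phpCount: Int = messages.count(_.contains("php"))

  def main(args: Array[String]): Unit =
    println(s"mysql: $mysqlCount, php: $phpCount") // prints "mysql: 2, php: 1"
}
```

Splitting on `"\t"` is what makes `r(1)` the message text. Splitting on the letter `"t"` would cut each line at every "t" instead.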
  • 9. Spark Deconstructed: Log Mining Example
       We start with Spark running on a cluster, submitting code to be evaluated on it:
       [diagram: Driver and three Workers]
  • 10. Spark Deconstructed: Log Mining Example
       [code as on slide 8, from "// base RDD" onward; discussing the other part]
  • 11. Spark Deconstructed: Log Mining Example
       At this point, take a look at the transformed RDD operator graph:
       scala> messages.toDebugString
       res5: String =
       MappedRDD[4] at map at <console>:16 (3 partitions)
       MappedRDD[3] at map at <console>:16 (3 partitions)
       FilteredRDD[2] at filter at <console>:14 (3 partitions)
       MappedRDD[1] at textFile at <console>:12 (3 partitions)
       HadoopRDD[0] at textFile at <console>:12 (3 partitions)
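The operator graph above reads from the newest RDD back to the source, one line per step. A toy model of that idea is sketched below. The types `Source` and `Derived` and the labels are hypothetical; they are not Spark's internal classes and only mimic the parent-pointer structure that the debug string suggests.

```scala
// Toy model of lineage (hypothetical types, not Spark's internals):
// each dataset remembers a label for the operation that produced it
// and a pointer to its parent.
object LineageSketch {
  sealed trait Node { def label: String }
  final case class Source(label: String) extends Node
  final case class Derived(label: String, parent: Node) extends Node

  // walk from the newest node back to the source, one entry per step
  def debugLines(node: Node): List[String] = node match {
    case Source(l)          => List(l)
    case Derived(l, parent) => l :: debugLines(parent)
  }

  // mirrors the five steps in the slide's output
  val lines: Node    = Derived("map at textFile", Source("hadoop file at textFile"))
  val errors: Node   = Derived("filter", lines)
  val fields: Node   = Derived("map (split)", errors)
  val messages: Node = Derived("map (field 1)", fields)

  def main(args: Array[String]): Unit =
    debugLines(messages).foreach(println)
}
```

Calling `debugLines(messages)` yields five entries, newest first, which matches the shape of the `toDebugString` output on the slide.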
  • 12. Driver Worker Worker Worker Spark Deconstructed: Log Mining Example // base RDD! val lines = sc.textFile("hdfs://...")! ! // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR"))! val messages = errors.map(_.split("t")).map(r => r(1))! messages.cache()! ! // action 1! messages.filter(_.contains("mysql")).count()! ! // action 2! messages.filter(_.contains("php")).count() discussing the other part
  • 13. Driver Worker Worker Worker block 1 block 2 block 3 Spark Deconstructed: Log Mining Example // base RDD! val lines = sc.textFile("hdfs://...")! ! // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR"))! val messages = errors.map(_.split("t")).map(r => r(1))! messages.cache()! ! // action 1! messages.filter(_.contains("mysql")).count()! ! // action 2! messages.filter(_.contains("php")).count() discussing the other part
  • 14. Driver Worker Worker Worker block 1 block 2 block 3 Spark Deconstructed: Log Mining Example // base RDD! val lines = sc.textFile("hdfs://...")! ! // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR"))! val messages = errors.map(_.split("t")).map(r => r(1))! messages.cache()! ! // action 1! messages.filter(_.contains("mysql")).count()! ! // action 2! messages.filter(_.contains("php")).count() discussing the other part
  • 15. Spark Deconstructed: Log Mining Example
 Diagram: Driver and three Workers holding block 1, block 2, block 3; each Worker reads its HDFS block.
     // base RDD
     val lines = sc.textFile("hdfs://...")

     // transformed RDDs
     val errors = lines.filter(_.startsWith("ERROR"))
     val messages = errors.map(_.split("\t")).map(r => r(1))
     messages.cache()

     // action 1
     messages.filter(_.contains("mysql")).count()

     // action 2
     messages.filter(_.contains("php")).count()
  • 16. Spark Deconstructed: Log Mining Example
 Diagram: each Worker processes its block and caches the data (cache 1, cache 2, cache 3).
  • 17. Spark Deconstructed: Log Mining Example
 Diagram: Driver and three Workers, each holding its block and its cache.
  • 18. Spark Deconstructed: Log Mining Example
 Diagram: Driver and three Workers, each holding its block and its cache.
  • 19. Spark Deconstructed: Log Mining Example
 Diagram: each Worker processes from cache.
  • 20. Spark Deconstructed: Log Mining Example
 Diagram: Driver and three Workers, each holding its block and its cache.
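 The same filter, map, cache, count flow can be tried without a cluster. What follows is a minimal single-process Python sketch, not Spark code: the sample log lines are invented for illustration, and a plain list stands in for the cached RDD.

```python
# Hypothetical sample log, tab-separated as "level<TAB>message" (invented for this sketch)
log_lines = [
    "ERROR\tmysql connection refused",
    "INFO\tserver started",
    "ERROR\tphp fatal error in index.php",
    "ERROR\tmysql timeout",
]

# "transformed RDDs": keep ERROR lines, then take the second tab-separated field
errors = (line for line in log_lines if line.startswith("ERROR"))
messages = [line.split("\t")[1] for line in errors]  # the list plays the role of cache()

# action 1 and action 2 both reuse the materialized list instead of rereading the log
mysql_count = sum("mysql" in m for m in messages)
php_count = sum("php" in m for m in messages)
print(mysql_count, php_count)
```

 The generator for `errors` does no work until `messages` is built, which loosely mirrors how Spark defers transformations until an action runs.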
  • 23. GraphX
 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
 J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
 graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
 Pregel: Large-scale graph computing at Google
 Grzegorz Czajkowski, et al.
 googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
 GraphX: Unified Graph Analytics on Spark
 Ankur Dave, Databricks
 databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
 Advanced Exercises: GraphX
 databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
  • 24. GraphX: demo
     // http://spark.apache.org/docs/latest/graphx-programming-guide.html

     import org.apache.spark.graphx._
     import org.apache.spark.rdd.RDD

     case class Peep(name: String, age: Int)

     val nodeArray = Array(
       (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
       (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
       (5L, Peep("Leslie", 45))
     )
     val edgeArray = Array(
       Edge(2L, 1L, 7), Edge(2L, 4L, 2),
       Edge(3L, 2L, 4), Edge(3L, 5L, 3),
       Edge(4L, 1L, 1), Edge(5L, 3L, 9)
     )

     val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
     val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
     val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

     val results = g.triplets.filter(t => t.attr > 7)

     for (triplet <- results.collect) {
       println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
     }
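 To see what the triplet filter in this demo selects, here is a small plain-Python sketch over the same vertices and edges. It is not GraphX; the `Peep` and `Edge` tuples and the `triplets` list are stand-ins built for illustration.

```python
from collections import namedtuple

Peep = namedtuple("Peep", ["name", "age"])
Edge = namedtuple("Edge", ["src", "dst", "attr"])

nodes = {
    1: Peep("Kim", 23), 2: Peep("Pat", 31), 3: Peep("Chris", 52),
    4: Peep("Kelly", 39), 5: Peep("Leslie", 45),
}
edges = [
    Edge(2, 1, 7), Edge(2, 4, 2), Edge(3, 2, 4),
    Edge(3, 5, 3), Edge(4, 1, 1), Edge(5, 3, 9),
]

# a "triplet" joins each edge with the attributes of its source and destination vertices
triplets = [(nodes[e.src], e.attr, nodes[e.dst]) for e in edges]

# same predicate as the demo: keep edges whose attribute is greater than 7
loves = [f"{src.name} loves {dst.name}" for src, attr, dst in triplets if attr > 7]
print(loves)
```

 Only the edge with attribute 9 passes the filter, so a single line is printed.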
  • 27. Typical Workflows (circa 2010)
 Diagram phases: representation, optimization, evaluation.
 Diagram labels: ETL into cluster/cloud; data; visualize, reporting; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback.
 Other labels: developers, algorithms, foo, bar.
  • 28. Workflows: Scraper pipeline
 Typical data rates, e.g., for dev@spark.apache.org:
 • ~2K msgs/month
 • ~6 MB as JSON
 • ~13 MB parsed
 Three months’ list activity represents a graph of:
 • 1061 senders
 • 753,400 nodes
 • 1,027,806 edges
 A big graph! However, it satisfies the definition of a graph-parallel system, with lots of data locality to leverage.
  • 29. Workflows: A Few Notes about Microservices and Containers
 The Strengths and Weaknesses of Microservices
 Abel Avram
 http://www.infoq.com/news/2014/05/microservices
 DockerCon EU Keynote: State of the Art in Microservices
 Adrian Cockcroft
 https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/
 Microservices Architecture
 Martin Fowler
 http://martinfowler.com/articles/microservices.html
  • 30. Workflows: An Example…
 Python-based service in a Docker container?
 Just Enough Math, IPython+Docker
 Paco Nathan, Andrew Odewahn, Kyle Kelly
 https://github.com/ceteri/jem-docker
 https://registry.hub.docker.com/u/ceteri/jem/
 Docker Jumpstart
 Andrew Odewahn
 http://odewahn.github.io/docker-jumpstart/
  • 31. Workflows: A Brief Note about ETL in SparkSQL
 Spark SQL Data Sources API: Unified Data Access for the Spark Platform
 Michael Armbrust
 databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
  • 32. This Workflow: Microservices meet Parallel Processing
 Diagram labels: email archives; services; Scraper / Parser; NLTK data; Unique Word IDs; SparkSQL; Data Prep; Features; Explore; TextRank, Word2Vec, etc.; community insights; community leaderboards.
 not so big data… relatively big compute…
  • 33. Workflows: Scraper pipeline
 Diagram: Apache email list archive; urllib2 crawl monthly list by date; Py filter quoted content; Py segment paragraphs; message JSON.
  • 34. Workflows: Scraper pipeline
 Diagram: Apache email list archive; urllib2 crawl monthly list by date; Py filter quoted content; Py segment paragraphs; message JSON.
 Example message JSON (long values are cut off at the slide edge):
     {
       "date": "2014-10-01T00:16:08+00:00",
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
       "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
       "prev_thread": "",
       "sender": "Debasish Das <debasish.da...@gmail.com>",
       "subject": "Re: memory vs data_size",
       "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
     }
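 The two Python stages named in the scraper diagram, filtering quoted content and segmenting paragraphs, might look roughly like the sketch below. The deck does not show that code, so the rules here are assumptions: quoted lines are taken to start with ">", paragraphs are taken to be separated by blank lines, and the sample text and the `grafs` key are made up.

```python
import json
import re

def filter_quoted(text):
    """Drop lines quoted from earlier messages (assumed to start with '>')."""
    kept = [line for line in text.split("\n") if not line.lstrip().startswith(">")]
    return "\n".join(kept)

def segment_paragraphs(text):
    """Split on blank lines, then collapse the line wraps inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]

# made-up message body with one quoted block in the middle
raw = (
    "Only fit the data in memory where you want to run the iterative\n"
    "algorithm.\n"
    "\n"
    "> quoted earlier reply\n"
    "> more quoted text\n"
    "\n"
    "Thanks."
)

message = {
    "subject": "Re: memory vs data_size",
    "grafs": segment_paragraphs(filter_quoted(raw)),  # hypothetical field name
}
print(json.dumps(message, indent=2))
```

 Real mailing-list archives also carry attribution lines and signatures, which this sketch leaves untouched.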
  • 36. Workflows: Parser pipeline
 Diagram: message JSON (input); TextBlob segment sentences; TextBlob tag and lemmatize words (Treebank, WordNet); Py generate skip-grams; TextBlob sentiment analysis; parsed JSON (output).
 Example parsed JSON:
     {
       "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "polr": 0.2,
       "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
       "size": 14,
       "subj": 0.7,
       "tile": [ [1, 2], [2, 3], [3, 4] ... ]
     }
 Example message JSON (long values are cut off at the slide edge):
     {
       "date": "2014-10-01T00:16:08+00:00",
       "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
       "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
       "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
       "prev_thread": "",
       "sender": "Debasish Das <debasish.da...@gmail.com>",
       "subject": "Re: memory vs data_size",
       "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
     }
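 The "tile" field in the parsed JSON looks like pairs of word indices, and the diagram has a "Py generate skip-grams" stage. The sketch below shows one way such pairs and a content hash could be produced. How the deck's parser actually builds "tile" and "sha1" is not shown, so the window rule, the hashed string, and the function name `skip_grams` are assumptions.

```python
import hashlib

def skip_grams(indices, window=1):
    """Pair each word index with the indices that follow it within `window` positions."""
    pairs = []
    for i, left in enumerate(indices):
        for right in indices[i + 1 : i + 1 + window]:
            pairs.append([left, right])
    return pairs

words = ["Only", "fit", "the", "data"]
indices = list(range(1, len(words) + 1))  # 1-based, like the first column of the "graf" rows

tile = skip_grams(indices, window=1)   # adjacent pairs only
wider = skip_grams(indices, window=2)  # also pairs words that are one position apart

# a stable identifier for the paragraph text (assumed input to the hash)
digest = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
print(tile, wider, digest)
```

 With `window=1` the pairs match the shape of the "tile" sample on the slide; larger windows add pairs that skip over intermediate words.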
  • 37. Workflows: TextRank pipeline
 Diagram: parsed JSON (input); Spark create word graph; RDD word graph; NetworkX visualize graph; GraphX run TextRank; Spark extract phrases; ranked phrases (output).
  • 38. Workflows: TextRank pipeline
 Example sentence: "Compatibility of systems of linear constraints"
     [{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
      {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
      {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
      {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
      {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
      {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
 Diagram: word graph over the stems compat, system, linear, constraint, with steps labeled 1, 2, 3.
 TextRank: Bringing Order into Texts
 Rada Mihalcea, Paul Tarau
 http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
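 The idea behind TextRank, ranking the words of a text by running PageRank over a word co-occurrence graph, can be sketched in a few lines of plain Python using the tokens from the example sentence. This is a simplified illustration, not the deck's Spark and GraphX implementation: the set of kept tags, the co-occurrence window, the damping factor, and the fixed iteration count are all assumptions.

```python
tokens = [
    {"index": 0, "stem": "compat", "tag": "NNP", "word": "compatibility"},
    {"index": 1, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 2, "stem": "system", "tag": "NNS", "word": "systems"},
    {"index": 3, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 4, "stem": "linear", "tag": "JJ", "word": "linear"},
    {"index": 5, "stem": "constraint", "tag": "NNS", "word": "constraints"},
]

def build_word_graph(tokens, keep_tags=("NN", "NNS", "NNP", "JJ"), window=3):
    """Link kept stems whose token positions are less than `window` apart (undirected)."""
    kept = [(t["index"], t["stem"]) for t in tokens if t["tag"] in keep_tags]
    graph = {stem: set() for _, stem in kept}
    for i, (pos_a, a) in enumerate(kept):
        for pos_b, b in kept[i + 1:]:
            if a != b and pos_b - pos_a < window:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; assumes every node has at least one neighbor."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

graph = build_word_graph(tokens)
ranks = pagerank(graph)
for stem, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(stem, round(score, 4))
```

 With these settings the graph is a chain (compat, system, linear, constraint), so the two interior stems end up ranked above the two at the ends.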
  • 41. TextRank impl: load parquet files
     import org.apache.spark.graphx._
     import org.apache.spark.rdd.RDD

     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
     import sqlCtx._

     val edge = sqlCtx.parquetFile("graf_edge.parquet")
     edge.registerTempTable("edge")

     val node = sqlCtx.parquetFile("graf_node.parquet")
     node.registerTempTable("node")

     // pick one message as an example; at scale we'd parallelize
     val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
  • 42. TextRank impl: use SparkSQL to collect node list + edge list
     val sql = """
     SELECT node_id, root
     FROM node
     WHERE id='%s' AND keep='1'
     """.format(msg_id)

     val n = sqlCtx.sql(sql.stripMargin).distinct()
     val nodes: RDD[(Long, String)] = n.map{ p =>
       (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])
     }
     nodes.collect()

     val sql = """
     SELECT node0, node1
     FROM edge
     WHERE id='%s'
     """.format(msg_id)

     val e = sqlCtx.sql(sql.stripMargin).distinct()
     val edges: RDD[Edge[Int]] = e.map{ p =>
       Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)
     }
     edges.collect()
  • 43. TextRank impl: use GraphX to run PageRank
     // run PageRank
     val g: Graph[String, Int] = Graph(nodes, edges)
     val r = g.pageRank(0.0001).vertices

     r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

     // save the ranks
     case class Rank(id: Int, rank: Float)
     val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
     rank.registerTempTable("rank")

     def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
       import n._
       val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
       if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
     }

     val min_rank = median(r.map(_._2).collect())
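 The median of the PageRank scores serves as the cutoff (`min_rank`) for which words may appear in a keyphrase. A small Python rendering of the same split-at-the-middle median may make the even and odd cases easier to check; the rank values below are made up for illustration.

```python
def median(values):
    """Median by sorting and splitting at the midpoint; expects a non-empty input."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    lower, upper = ordered[:mid], ordered[mid:]
    if len(ordered) % 2 == 0:
        # even count: average the two middle values
        return (lower[-1] + upper[0]) / 2
    # odd count: the first element of the upper half is the middle value
    return upper[0]

# made-up ranks, chosen as exact binary fractions so the arithmetic is exact
ranks = {"compat": 0.125, "system": 0.375, "linear": 0.375, "constraint": 0.125}

min_rank = median(ranks.values())
keep = [word for word, rank in ranks.items() if rank >= min_rank]
print(min_rank, keep)
```

 Using the median as the threshold keeps roughly the top half of the ranked words as keyphrase candidates.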
  • 44. TextRank impl: join ranked words with parsed text
     var span:List[String] = List()
     var last_index = -1
     var rank_sum = 0.0

     var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map()

     val sql = """
     SELECT n.num, n.raw, r.rank
     FROM node n JOIN rank r ON n.node_id = r.id
     WHERE n.id='%s' AND n.keep='1'
     ORDER BY n.num
     """.format(msg_id)

     val s = sqlCtx.sql(sql.stripMargin).collect()
  • 45. TextRank impl: “pull strings” for the top-ranked keyphrases
     s.foreach { x =>
       //println (x)
       val index = x.getInt(0)
       val word = x.getString(1)
       val rank = x.getFloat(2)
       var isStop = false

       // test for break from past
       if (span.size > 0 && rank < min_rank) isStop = true
       if (span.size > 0 && (index - last_index > 1)) isStop = true

       // clear accumulation
       if (isStop) {
         val phrase = span.mkString(" ")
         phrases += (phrase -> rank_sum)

         span = List()
         last_index = index
         rank_sum = 0.0
       }

       // start or append
       if (rank >= min_rank) {
         span = span :+ word
         last_index = index
         rank_sum += rank
       }
     }

     // flush the final span, which the loop never emits on its own
     if (span.nonEmpty) phrases += (span.mkString(" ") -> rank_sum)
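 The keyphrase step walks the ranked words in text order and glues adjacent high-ranked words into phrases. The Python sketch below follows the same rules: a word below the threshold or a gap in the word indices closes the current span, and each phrase scores the sum of its word ranks. It flushes the span that is still open when the rows run out. The rows and threshold are made up, and the function name `extract_phrases` is hypothetical.

```python
def extract_phrases(rows, min_rank):
    """rows: (index, word, rank) tuples sorted by index; returns {phrase: summed rank}."""
    phrases = {}
    span, last_index, rank_sum = [], -1, 0.0

    def flush():
        nonlocal span, rank_sum
        if span:
            phrases[" ".join(span)] = rank_sum
        span, rank_sum = [], 0.0

    for index, word, rank in rows:
        # a low-ranked word or a gap in the indices ends the current span
        if span and (rank < min_rank or index - last_index > 1):
            flush()
        # start a new span or extend the current one
        if rank >= min_rank:
            span.append(word)
            last_index = index
            rank_sum += rank

    flush()  # emit the trailing span
    return phrases

# made-up rows: (word position, raw word, rank)
rows = [
    (0, "linear", 0.5),
    (1, "constraints", 0.25),
    (2, "of", 0.0),
    (3, "systems", 0.5),
    (5, "compatibility", 0.25),
]

phrases = extract_phrases(rows, min_rank=0.25)
print(sorted(phrases.items(), key=lambda kv: kv[1], reverse=True))
```

 As with a map keyed by phrase text, a phrase that occurs twice keeps only the score of its last occurrence.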
  • 46. TextRank impl: report the top keyphrases
     // summarize the text as a list of ranked keyphrases
     val summary = sc.parallelize(phrases.toSeq)
       .distinct()
       .sortBy(_._2, ascending=false)
  • 48. Reply Graph: load parquet files
     import org.apache.spark.graphx._
     import org.apache.spark.rdd.RDD

     val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
     import sqlCtx._

     val edge = sqlCtx.parquetFile("reply_edge.parquet")
     edge.registerTempTable("edge")

     val node = sqlCtx.parquetFile("reply_node.parquet")
     node.registerTempTable("node")

     edge.schemaString
     node.schemaString
  • 49. Reply Graph: use SparkSQL to collect node list + edge list
     val sql = "SELECT id, sender FROM node"
     val n = sqlCtx.sql(sql).distinct()
     val nodes: RDD[(Long, String)] = n.map{ p =>
       (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])
     }
     nodes.collect()

     val sql = "SELECT replier, sender, num FROM edge"
     val e = sqlCtx.sql(sql).distinct()
     val edges: RDD[Edge[Int]] = e.map{ p =>
       Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])
     }
     edges.collect()
  • 50. Reply Graph: use GraphX to run graph analytics
     // run graph analytics
     val g: Graph[String, Int] = Graph(nodes, edges)
     val r = g.pageRank(0.0001).vertices
     r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

     // define a reduce operation to compute the highest degree vertex
     def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
       if (a._2 > b._2) a else b
     }

     // compute the max degrees
     val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
     val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
     val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

     // strongly connected components, joined with the (id, sender) pairs
     val scc = g.stronglyConnectedComponents(10).vertices
     nodes.join(scc).foreach(println)
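 The max-degree computation is a pairwise reduce over (vertex, degree) tuples. A plain-Python sketch of the same idea follows; the edge list is made up, with each tuple read as (replier, sender) in the spirit of the reply graph, and the name `max_degree` is hypothetical.

```python
from collections import Counter
from functools import reduce

# made-up reply edges: (replier, sender); vertex 2 is the busiest participant
edges = [
    (1, 2), (3, 2), (4, 2),
    (2, 1), (2, 3), (2, 4), (2, 5),
    (3, 1),
]

# degree counts, one increment per edge endpoint
in_degrees = Counter(dst for _, dst in edges)
out_degrees = Counter(src for src, _ in edges)
degrees = in_degrees + out_degrees

def max_degree(a, b):
    """Reduce step: keep whichever (vertex, degree) pair has the higher degree."""
    return a if a[1] > b[1] else b

max_in = reduce(max_degree, in_degrees.items())
max_out = reduce(max_degree, out_degrees.items())
max_total = reduce(max_degree, degrees.items())
print(max_in, max_out, max_total)
```

 Because the reduce step only compares pairs, it can be applied in any grouping, which is what lets a distributed reduce combine partial results from each partition.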
  • 51. Reply Graph: PageRank of top dev@spark email, 4Q2014
     (389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
     (857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
     (652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
     (101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
     (471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
     (931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
     (48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
     (1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
     (1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
     (122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
     (904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
     (827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
     (887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
     (303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
     (206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
     (483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
     (185,(5.259438927615685,SK <skrishna...@gmail.com>))
     (636,(5.235941228955769,Matei Zaharia <matei.zaha...@gmail.com>))

     // seaaaaaaaaaan!
     maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
     maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
     maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
  • 52. Reply Graph: What SSSP looks like in GraphX/Pregel
 github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
  • 53. Look Ahead: Where is this heading?
 Feature learning with Word2Vec
 Matt Krzus
 www.yseam.com/blog/WV.html
 Diagram: ranked phrases; GraphX run Con.Comp.; MLlib run Word2Vec; aggregated by topic; MLlib run KMeans; topic vectors.
 better than LDA?
 features… models… insights…
  • 55. certification: Apache Spark developer certificate program
 • http://oreilly.com/go/sparkcert
 • defined by Spark experts @Databricks
 • assessed by O’Reilly Media
 • establishes the bar for Spark expertise
  • 56. MOOCs:
 Anthony Joseph, UC Berkeley
 begins 2015-02-23
 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
 Ameet Talwalkar, UCLA
 begins 2015-04-14
 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
  • 57. community: spark.apache.org/community.html
 events worldwide: goo.gl/2YqJZK
 video+preso archives: spark-summit.org
 resources: databricks.com/spark-training-resources
 workshops: databricks.com/spark-training
  • 59. confs:
 Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015
 Spark Summit East, NYC, Mar 18-19: spark-summit.org/east
 Big Data Tech Con, Boston, Apr 26-28: bigdatatechcon.com
 Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015
 Spark Summit 2015, SF, Jun 15-17: spark-summit.org
  • 60. books:
 Fast Data Processing with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/product/9781782167068.do
 Spark in Action
 Chris Fregly
 Manning (2015*)
 sparkinaction.com/
 Learning Spark
 Holden Karau, Andy Konwinski, Matei Zaharia
 O’Reilly (2015*)
 shop.oreilly.com/product/0636920028512.do
  • 61. presenter:
 Just Enough Math
 O’Reilly, 2014
 justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA
 Enterprise Data Workflows with Cascading
 O’Reilly, 2013
 shop.oreilly.com/product/0636920028536.do
 monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/