Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Analyzing log data with
Apache Spark
William Benton
Red Hat Emerging Technology
BACKGROUND
Challenges of log data
Challenges of log data
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg L...
Challenges of log data
11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00
SELECT hostname, DATEPART(HH, timestamp) AS hour, C...
Challenges of log data
postgres
httpd
syslog
INFO INFO WARN CRIT DEBUG INFO
GET GET GET POST
WARN WARN INFO INFO INFO
GET ...
Challenges of log data
postgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
ht...
Challenges of log datapostgres
httpd
syslog
INFO WARN
GET (404)
CRIT INFO
GET GET GET POST
INFO INFO INFO WARN
CouchDB
htt...
DATA INGEST
Collecting log data
collecting
Ingesting live log
data via rsyslog,
logstash, fluentd
normalizing
Reconciling log
record me...
Collecting log data
warehousing
Storing normalized
records in ES indices
analysis
cache warehoused
data as Parquet files
on...
Collecting log data
warehousing
Storing normalized
records in ES indices
analysis
cache warehoused
data as Parquet files
on...
Schema mediation
Schema mediation
Schema mediation
Schema mediation
timestamp, level, host, IP
addresses, message, &c.
rsyslog-style metadata, like
app name, facility, &c.
logs
.select("level").distinct
.map { case Row(s: String) => s }
.collect
Exploring structured data
logs
.groupBy($"level"...
Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total"....
Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total"....
FEATURE ENGINEERING
From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
->...
From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
->...
Similarity and distance
Similarity and distance
Similarity and distance
(q - p) • (q - p)
Similarity and distance
pi - qi
i=1
n
Similarity and distance
pi - qi
i=1
n
Similarity and distance
p • q
p q
WARN INFO INFOINFO
WARN DEBUGINFOINFOINFO
WARN WARNINFO INFO INFO
WAR
INFO
INFO
Other interesting features
host01
host02
h...
INFO INFOINFO
DEBUGINFOINFO
WARNNFO INFO INFO
WARN INFO INFOINFO
INFO INFO
INFOINFOINFO
INFOINFO INFO INFO
INFO INFO
WARN ...
INFO INFOINFO
INFO INFO
INFOINFO
INFO INFO INFO
WARN
DEBUG
WARN
INFO
INFO
EBUG
WARN
INFO
INFO
INFO
INFO
INFO
INFO INFO INF...
Other interesting features
: Great food, great service, a must-visit!
: Our whole table got gastroenteritis.
: This place ...
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last h...
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last h...
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last h...
VISUALIZING STRUCTURE
and FINDING OUTLIERS
Multidimensional data
Multidimensional data
[4,7]
Multidimensional data
[4,7]
Multidimensional data
[4,7] [2,3,5]
Multidimensional data
[4,7] [2,3,5]
Multidimensional data
[4,7] [2,3,5]
[7,1,6,5,12,

8,9,2,2,4,
7,11,6,1,5]
Multidimensional data
[4,7] [2,3,5]
[7,1,6,5,12,

8,9,2,2,4,
7,11,6,1,5]
A linear approach: PCA
0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 0 1 1 0 0
1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1
0 1 0 0 1 0 0 1 0 0...
A linear approach: PCA
0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 0 1 1 0 0
1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1
0 1 0 0 1 0 0 1 0 0...
Tree-based approaches
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
yes
no
no
yes
yes
no
yes
no
yes
...
Tree-based approaches
yes
no
yes
no
if orange
if !orange
if red
if !red
if !gray
if !gray
yes
no
no
yes
yes
no
yes
no
yes
...
Self-organizing maps
Self-organizing maps
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Finding outliers with SOMs
Outliers in log data
Outliers in log data
0.95
Outliers in log data
0.95
0.97
Outliers in log data
0.95
0.97
0.92
Outliers in log data
0.95
0.97
0.92
0.37
An outlier is any
record whose best
match was at least
4σ below the mean.
0.94
0....
Out of 310 million log
records, we identified
0.0012% as outliers.
Thirty most extreme outliers
10 Can not communicate with power supply 2.
9 Power supply 2 failed.
8 Power supply redundanc...
SOM TRAINING in SPARK
On-line SOM training
On-line SOM training
On-line SOM training
On-line SOM training
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(som...
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(som...
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(som...
On-line SOM training
while t < iterations:
for ex in examples:
t = t + 1
if t == iterations:
break
bestMatch = closest(som...
On-line SOM training
On-line SOM training
sensitive to
learning rate
not parallel
sensitive to
example order
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood...
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood...
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood...
Batch SOM training
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood...
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
over all partitions
Batch SOM training
driver (using aggregate)
workers
driver (using aggregate)
workers
driver (using aggregate)
workers
driver (using aggregate)
workers
What if you have a 3 mb model and 2,048 partitions?
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
driver (using treeAggregate)
workers
SHARING MODELS
BEYOND SPARK
Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state...
Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state...
Sharing models
case class FrozenModel(entries: Array[Double], /* ... */ ) { }
class Model(private var entries: breeze.lina...
Sharing models
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite...
Sharing models
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite...
PRACTICAL MATTERS
Spark and ElasticSearch
Data locality is an issue and caching is even more
important than when running from local storage....
Structured queries in Spark
Always program defensively: mediate schemas,
explicitly convert null values, etc.
Use the Data...
Memory and partitioning
Large JVM heaps can lead to appalling GC pauses and
executor timeouts.
Use multiple JVMs or off-he...
Interoperability
Avoid brittle or language-specific model serializers
when sharing models with non-Spark environments.
JSON...
Feature engineering
Favor feature engineering effort over complex or
novel learning algorithms.
Prefer approaches that tra...
@willb • willb@redhat.com

https://chapeau.freevariable.com
THANKS!
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Upcoming SlideShare
Loading in …5
×

Analyzing Log Data With Apache Spark

4,101 views

Published on

Spark Summit 2016 talk by William Benton (Red Hat)

Published in: Data & Analytics
  • Be the first to comment

Analyzing Log Data With Apache Spark

  1. 1. Analyzing log data with Apache Spark William Benton Red Hat Emerging Technology
  2. 2. BACKGROUND
  3. 3. Challenges of log data
  4. 4. Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  5. 5. Challenges of log data 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour
  6. 6. Challenges of log data postgres httpd syslog INFO INFO WARN CRIT DEBUG INFO GET GET GET POST WARN WARN INFO INFO INFO GET (404) INFO (ca. 2000)
  7. 7. Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN (ca. 2016)
  8. 8. Challenges of log datapostgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN How many services are generating logs in your datacenter today?
  9. 9. DATA INGEST
  10. 10. Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd normalizing Reconciling log record metadata across sources warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  11. 11. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  12. 12. Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume local to Spark cluster
  13. 13. Schema mediation
  14. 14. Schema mediation
  15. 15. Schema mediation
  16. 16. Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.
  17. 17. logs .select("level").distinct .map { case Row(s: String) => s } .collect Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … debug, notice, emerg, err, warning, crit, info, severe, alert
  18. 18. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert
  19. 19. Exploring structured data logs .groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total")) .orderBy($"total".desc) .show info kubelet 17933574 info kube-proxy 10961117 err journal 6867921 info systemd 5184475 … logs .select("level").distinct .as[String].collect debug, notice, emerg, err, warning, crit, info, severe, alert This class must be declared outside the REPL!
  20. 20. FEATURE ENGINEERING
  21. 21. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010
  22. 22. From log records to vectors What does it mean for two sets of categorical features to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles -> 00010000 -> 00101000
  23. 23. Similarity and distance
  24. 24. Similarity and distance
  25. 25. Similarity and distance (q - p) • (q - p)
  26. 26. Similarity and distance pi - qi i=1 n
  27. 27. Similarity and distance pi - qi i=1 n
  28. 28. Similarity and distance p • q p q
  29. 29. WARN INFO INFOINFO WARN DEBUGINFOINFOINFO WARN WARNINFO INFO INFO WAR INFO INFO Other interesting features host01 host02 host03
  30. 30. INFO INFOINFO DEBUGINFOINFO WARNNFO INFO INFO WARN INFO INFOINFO INFO INFO INFOINFOINFO INFOINFO INFO INFO INFO INFO WARN DEBUG Other interesting features host01 host02 host03
  31. 31. INFO INFOINFO INFO INFO INFOINFO INFO INFO INFO WARN DEBUG WARN INFO INFO EBUG WARN INFO INFO INFO INFO INFO INFO INFO INFO WARN WARN INFO WARN INFO Other interesting features host01 host02 host03
  32. 32. Other interesting features : Great food, great service, a must-visit! : Our whole table got gastroenteritis. : This place is so wonderful that it has ruined all other tacos for me and my family.
  33. 33. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK.
  34. 34. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.
  35. 35. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes.
  36. 36. Other interesting features INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers. INFO: Phoenix datacenter is on fire; may not rise from ashes. See https://links.freevariable.com/nlp-logs/ for more!
  37. 37. VISUALIZING STRUCTURE and FINDING OUTLIERS
  38. 38. Multidimensional data
  39. 39. Multidimensional data [4,7]
  40. 40. Multidimensional data [4,7]
  41. 41. Multidimensional data [4,7] [2,3,5]
  42. 42. Multidimensional data [4,7] [2,3,5]
  43. 43. Multidimensional data [4,7] [2,3,5] [7,1,6,5,12,
 8,9,2,2,4, 7,11,6,1,5]
  44. 44. Multidimensional data [4,7] [2,3,5] [7,1,6,5,12,
 8,9,2,2,4, 7,11,6,1,5]
  45. 45. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  46. 46. A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
  47. 47. Tree-based approaches
  48. 48. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray
  49. 49. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  50. 50. Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no
  51. 51. Self-organizing maps
  52. 52. Self-organizing maps
  53. 53. Finding outliers with SOMs
  54. 54. Finding outliers with SOMs
  55. 55. Finding outliers with SOMs
  56. 56. Finding outliers with SOMs
  57. 57. Finding outliers with SOMs
  58. 58. Outliers in log data
  59. 59. Outliers in log data 0.95
  60. 60. Outliers in log data 0.95 0.97
  61. 61. Outliers in log data 0.95 0.97 0.92
  62. 62. Outliers in log data 0.95 0.97 0.92 0.37 An outlier is any record whose best match was at least 4σ below the mean. 0.94 0.89 0.91 0.93 0.96
  63. 63. Out of 310 million log records, we identified 0.0012% as outliers.
  64. 64. Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.
  65. 65. SOM TRAINING in SPARK
  66. 66. On-line SOM training
  67. 67. On-line SOM training
  68. 68. On-line SOM training
  69. 69. On-line SOM training
  70. 70. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt
  71. 71. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt at each step, we update each unit by adding its value from the previous step…
  72. 72. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt to the example that we considered…
  73. 73. On-line SOM training while t < iterations: for ex in examples: t = t + 1 if t == iterations: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + ex * alpha(t) * wt scaled by a learning factor and the distance from this unit to its best match
  74. 74. On-line SOM training
  75. 75. On-line SOM training sensitive to learning rate not parallel sensitive to example order
  76. 76. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  77. 77. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) update the state of every cell in the neighborhood of the best matching unit, weighting by distance
  78. 78. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) keep track of the distance weights we’ve seen for a weighted average
  79. 79. Batch SOM training for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) since we can easily merge multiple states, we can train in parallel across many examples
  80. 80. Batch SOM training
  81. 81. over all partitions Batch SOM training
  82. 82. over all partitions Batch SOM training
  83. 83. over all partitions Batch SOM training
  84. 84. over all partitions Batch SOM training
  85. 85. over all partitions Batch SOM training
  86. 86. over all partitions Batch SOM training
  87. 87. driver (using aggregate) workers
  88. 88. driver (using aggregate) workers
  89. 89. driver (using aggregate) workers
  90. 90. driver (using aggregate) workers What if you have a 3 mb model and 2,048 partitions?
  91. 91. driver (using treeAggregate) workers
  92. 92. driver (using treeAggregate) workers
  93. 93. driver (using treeAggregate) workers
  94. 94. driver (using treeAggregate) workers
  95. 95. driver (using treeAggregate) workers
  96. 96. SHARING MODELS BEYOND SPARK
  97. 97. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here }
  98. 98. Sharing models class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here } case class FrozenModel(entries: Array[Double], /* ... */ ) { }
  99. 99. Sharing models case class FrozenModel(entries: Array[Double], /* ... */ ) { } class Model(private var entries: breeze.linalg.DenseVector[Double], /* ... lots of (possibly) mutable state ... */ ) implements java.io.Serializable { // lots of implementation details here def freeze: FrozenModel = // ... } object Model { def thaw(im: FrozenModel): Model = // ... }
  100. 100. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) }
  101. 101. Sharing models import org.json4s.jackson.Serialization import org.json4s.jackson.Serialization.{read=>jread, write=>jwrite} implicit val formats = Serialization.formats(NoTypeHints) def toJson(m: Model): String = { jwrite(som.freeze) } def fromJson(json: String): Try[Model] = { Try({ Model.thaw(jread[FrozenModel](json)) }) } Also consider how you’ll share feature encoders and other parts of your learning pipeline!
  102. 102. PRACTICAL MATTERS
  103. 103. Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. If your data are write-once, consider exporting ES indices to Parquet files and analyzing those instead.
  104. 104. Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
  105. 105. Memory and partitioning Large JVM heaps can lead to appalling GC pauses and executor timeouts. Use multiple JVMs or off-heap storage (in Spark 2.0!) Tree aggregation can save you both memory and execution time by partially aggregating at worker nodes.
  106. 106. Interoperability Avoid brittle or language-specific model serializers when sharing models with non-Spark environments. JSON is imperfect but ubiquitous. However, json4s will serialize case classes for free! See also SPARK-13944, merged recently into 2.0.
  107. 107. Feature engineering Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models. Design your feature engineering pipeline so you can translate feature vectors back to factor values.
  108. 108. @willb • willb@redhat.com
 https://chapeau.freevariable.com THANKS!

×