SlideShare a Scribd company logo
pipeline.io
After Dark 2.0End-to-End,Real-time, Advanced Analytics and ML
Big Data Reference Pipeline
Atlanta Spark User Group
Sept 22, 2016
Thanks Emory Continuing Education!
Chris Fregly
Research Scientist @ PipelineIO
We’re Hiring - Only Nice People!
pipeline.io
advancedspark.com
pipeline.io
Who Am I?
2
Research Scientist @ PipelineIO
github.com/fluxcapacitor/pipeline
Meetup Founder
Advanced
Book Author
Advanced .
pipeline.io
Who Was I?
3
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
pipeline.io
Advanced Spark and Tensorflow Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of Docker Image
with many Meetup Demos!!!
@ advancedspark.com
Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch,Tachyon, S3,
BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share patterns and idioms of well-designed,
distributed, big data processing systems
pipeline.io
Atlanta Hadoop User Meetup Last Night
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
pipeline.io
MLconf ATL Tomorrow
pipeline.io
Current PipelineIO Research
Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing
7
pipeline.io
PipelineIO Deliverables
100% Open Source!!
Github
https://github.com/fluxcapacitor/
DockerHub
https://hub.docker.com/r/fluxcapacitor
Workshop
http://pipeline.io
8
pipeline.io
Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML
Generating Real-time Recommendations
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
9
pipeline.io
Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML
pipeline.io
Audience Participation Required!
11
You -> Audience Instructions
①Navigate to
demo.pipeline.io
②Swipe software used in
Production Only!
Data ->
Scientist
This is Totally Anonymous!!
pipeline.io
Topics of This Talk (15-20 mins each)
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
12
pipeline.io
Mechanical Sympathy
“Hardware and software working together.”
- Martin Thompson
http://mechanical-sympathy.blogspot.com
“Whatever your data structure, my array will win.”
- Scott Meyers
Every C++ Book, basically
13
pipeline.io
Spark and Mechanical Sympathy
14
Project
Tungsten
(Spark 1.4-1.6+)
100TB
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory & GC
Maximize CPU Cache
Saturate Network I/O
Saturate Disk I/O
pipeline.io
CPU Cache Refresher
15
aka
“LLC”
My
Laptop
pipeline.io
CPU Cache Sympathy (AlphaSort paper)
Key (10 bytes) + Pointer (4 bytes) = 14 bytes
16
Key Ptr
Pre-process & Pull Key From Record
PtrKey-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Ptr
Must Dereference to Compare Key
Key
Pointer (4 bytes) = 4 bytes
Key Ptr
Pad
/Pad CPU Cache-line Friendly!
Dereference full key only
to resolve prefix duplicates
pipeline.io
Sort Performance Comparison
17
pipeline.io
Sequential vs Random Cache Misses
18
pipeline.io
Demo!
Sorting
pipeline.io
Instrumenting and Monitoring CPU
Use Linux perf command!
20
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
pipeline.io
Results of Random vs. Sequential Sort
21
Naïve Random Access
Pointer Sort
Cache Friendly Sequential
Key/Pointer Sort
Ptr Key
Must Dereference to Compare Key
Key Ptr
Pre-process & Pull Key From Record
-35%
-90%
-68%
-26%
% Change
-55%
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
pipeline.io
Demo!
Matrix Multiplication
pipeline.io
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];
23
Bad: Row-wise traversal,
not using full CPU cache line,
ineffective pre-fetching
pipeline.io
CPU Cache Friendly Matrix
Multiplication
// Transpose B
for (i <- 0 until numRowsB)
for (j <- 0 until numColsB)
matBT[ i ][ j ] = matB[ j ][ i ];
24
Good: Full CPU Cache Line,
Effective Prefetching
OLD: matB [ k ][ j ];
// Modify algo for Transpose B
for (i <- 0 until numRowsA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
pipeline.io
Results Of Matrix Multiplication
Cache-Friendly Matrix MultiplyNaïve Matrix Multiply
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, 
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
-96%
-93%
-93%
-70%
-53%
% Change
-63%
+8543%?
pipeline.io
Demo!
Thread Synchronization
pipeline.io
Thread and Context Switch Sympathy
Problem
Atomically Increment 2 Counters
(each at different increments) by
1000’s of Simultaneous Threads
27
Possible Solutions
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?
Context Switches are Expen$ive!!
aka
“LLC”
pipeline.io
Synchronized Immutable Counters
case class Counters(left: Int, right: Int)
object SynchronizedImmutableCounters {
var counters = new Counters(0,0)
def getCounters(): Counters = {
this.synchronized { counters }
}
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
this.synchronized {
counters = new Counters(counters.left + leftIncrement,
counters.right + rightIncrement)
}
}
} 28
Locks whole
outer object!!
pipeline.io
Synchronized Mutable Counters
class MutableCounters(left: Int, right: Int) {
def increment(leftIncrement: Int, rightIncrement: Int): Unit={
this.synchronized {…}
}
def getCountersTuple(): (Int, Int) = {
this.synchronized{ (counters.left, counters.right) }
}
}
object SynchronizedMutableCounters {
val counters = new MutableCounters(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
counters.increment(leftIncrement, rightIncrement)
}
29
Locks just
MutableCounters
pipeline.io
Lock-Free AtomicReference Counters
case class Counters(left: Int, right: Int)
object LockFreeAtomicReferenceCounters {
val counters = new AtomicReference[Counters](new Counters(0,0))
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalCounters: Counters = null
var updatedCounters: Counters = null
do {
originalCounters = getCounters()
updatedCounters = new Counters(originalCounters.left+ leftIncrement,
originalCounters.right+ rightIncrement)
} // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
30Lock Free!!
pipeline.io
Lock-Free AtomicLong Counters
object LockFreeAtomicLongCounters {
// a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
val counters = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
var originalCounters = 0L
var updatedCounters = 0L
do {
originalCounters = counters.get()
…
// Store two 32-bit Int into one 64-bit Long
// Use >>> 32 and << 32 to set and retrieve each Int from the Long
}
// Retry lock-free, optimistic compareAndSet() until AtomicLong updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}31 Lock Free!!
A: The JVM does not
guarantee atomic
updates of 64-bit
longs and doubles
Q: Why not use
@volatile long?
pipeline.io
Results of Thread Synchronization
Immutable Case Class
32
Lock-Free AtomicLong
-64%
-46%
-17%
-33%
perf stat –event 
context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, 
LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change
-31%
-32%
-33%
-27%
case class Counters(left:Int, right: Int)
...
this.synchronized {
counters = new Counters(counters.left + leftIncrement,
counters.right + rightIncrement)
}
val counters = new AtomicLong()
…
do {
…
} while !(counters.compareAndSet(originalCounters,
updatedCounters))
pipeline.io
Profile Visualizations: Flame Graphs
33
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+
(-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad!
I/O stalls, Heavy CPU
serialization, etc
pipeline.io
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms
Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection
Reuse ByteArrays
In-place updates for aggregations
Maximize CPU Cache Effectiveness
8-byte alignment
AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation
Dynamic optimizations using entire query plan
Developer implements genCode() to create Scala source code
(String)
34
pipeline.io
Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression
Spark 1.2 updates saturated Network, Disk I/O
10x increase in I/O throughput relative to CPU
More partition, pruning, and pushdown support
Newer columnar file formats help reduce I/O
35
pipeline.io
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap internally
In-place updates of serialized aggregation
No object creation on hot-path
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 steps to avoid single-key OOMs
① Hash-based (grouping) agg spills to disk if needed
② Sort-based agg performs external merge sort on spills
36
pipeline.io
Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
UnsafeExternalSorter
UnsafeShuffleWriter
UnsafeInMemorySorter
RecordPointerAndKeyPrefix
37
PtrKey-Prefix
2x CPU Cache-line Friendly!
SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
Note: Mixing multiplesubclasses of SortDataFormat
simultaneously will prevent JIT inlining.
Supports merging compressed records
(if compression CODEC supports it, ie. LZF)
In-place external sorting of spilled BytesToBytes data
AlphaSort-based, 8-byte aligned sort key
In-place sorting of BytesToBytesMap data
pipeline.io
Code Generation
Problem
Boxing creates excessive objects
Expression tree evaluations are costly
JVM can’t inline polymorphic impls
Lack of polymorphism == poor code design
Solution
Code generation enables inlining
Rewrite and optimize code using overall plan, 8-byte align
Defer source code generation to each operator, UDF,
UDAF
Use Janino to compile generated source code ->bytecode
38
pipeline.io
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy J
SparkContext.addExecutors() until max is
reached
Scaling down is hard L
SparkContext.removeExecutors()
Lose RDD cache inside Executor JVM
Must rebuild active RDD partitions in another Executor
JVM
Uses External Shuffle Service from Spark 1.1-1.2
If Executor JVM dies/restarts, shuffle keeps shufflin’! 39
pipeline.io
“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/create 
--header "Content-Type:application/json;charset=UTF-8" 
--data ’{"action" : "CreateSubmissionRequest”,
"mainClass" : "org.apache.spark.examples.SparkPi”,
"sparkProperties" : {
"spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
"spark.app.name" : "SparkPi",…
}}’
Get Spark Job Status
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
40
(the snitch)
pipeline.io
Outline
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
41
pipeline.io
Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Support nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping 42
Min/Max Heuristics
For Chunk Skipping
pipeline.io
Partitions
Partition Based on Data Access Patterns
/genders.parquet/gender=M/…
/gender=F/… <-- Use Case: Access Users by Gender
/gender=U/…
Dynamic Partition Creation (Write)
Dynamically create partitions on write based on column (ie. Gender)
SQL: INSERT TABLE genders PARTITION (gender) SELECT …
DF: gendersDF.write.format("parquet").partitionBy("gender")
.save("/genders.parquet")
Partition Discovery (Read)
Dynamically infer partitions on read based on paths (ie./gender=F/…)
SQL: SELECT id FROM genders WHERE gender=F
DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id").
.where("gender=F")
43
pipeline.io
Pruning
Partition Pruning
Filter out rows by partition
SELECT id, gender FROM genders WHERE gender = ‘F’
Column Pruning
Filter out columns by column filter
Extremely useful for columnar storage formats (Parquet)
Skip entire blocks of columns
SELECT id, gender FROM genders
44
pipeline.io
Pushdowns
aka. Predicate or Filter Pushdowns
Predicate returns true or false for given function
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan
def buildScan(requiredColumns: Array[String],
filters: Array[Filter]): RDD[Row]
45
pipeline.io
Demo!
File Formats, Partitions, Pushdowns, and Joins
pipeline.io
Predicate Pushdowns & Filter Collapsing
47
Filter pushdown
No extra pass
Filter combining
Only 1 extra pass
2 extra passes through the data after retrieval
pipeline.io
Join Between Partitioned &
Unpartitioned
48
pipeline.io
Join Between Partitioned & Partitioned
49
pipeline.io
Broadcast Join vs. Normal Shuffle Join
50
pipeline.io
Cartesian Join vs. Inner Join
51
pipeline.io
Visualizing the Query Plan
52
Effectiveness
of Filter
Cost-based
Join Optimization
Similar to
MapReduce
Map-side Join
& DistributedCache
Peak Memory for
Joins and Aggs
UnsafeFixedWidthAggregationMap
getPeakMemoryUsedBytes()
pipeline.io
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl) 53
pipeline.io
Native Spark SQL Data Sources
54
pipeline.io
JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or –
val ratingsDF = sqlContext.read.json
("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")
55
json() convenience method
pipeline.io
Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
56
pipeline.io
ElasticSearch Data Source
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
"es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite)
.options(esConfig).save("<index>/<document-type>")
57
pipeline.io
Cassandra Data Source
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Append)
.options(Map("keyspace"->"<keyspace>",
"table"->"<table>")).save(…) 58
pipeline.io
Tips for Cassandra Analytics
By-pass Cassandra CQL “front door”
CQL Optimized for Transactions
Bulk read and write directly against SSTables
Check out Netflix OSS project “Aegisthus”
Cassandra becomesa first-class analytics option
Replicated analytics cluster no longer needed
59
pipeline.io
Creating a Custom Data Source
① Study existing implementations
o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods
o.a.s.sql.sources.{BaseRelation,PrunedFilterScan}
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation 60
pipeline.io
Demo!
Create a Custom Data Source
pipeline.io
Publishing Custom Data Sources
62
spark-packages.org
pipeline.io
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Every UDF must use Expressions and
implement Expression.genCode()
to participate in the fun
Lambdas (RDD or Dataset API)
and sqlContext.udf.registerFunction()
are not enough!!
pipeline.io
Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring
② Extend and implement base trait
o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don’t forget about Python!
python.pyspark.sql.functions.py
64
pipeline.io
Demo!
Creating a Custom UDF participating in Code Generation
pipeline.io
Spark 1.6 and 2.0 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State
pipeline.io
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
67
Broadcast Join
(popular keys)
Shuffle Join
(not-so-popular keys)
Adaptive
Hybrid
Join
pipeline.io
Adaptive Memory Management
Spark <1.6
Manual configure between 2 memory regions
Spark execution engine (shuffles, joins, sorts, aggs)
spark.shuffle.memoryFraction
RDD Data Cache
spark.storage.memoryFraction
Spark 1.6+
Unified memory regions
Dynamically expand/contract memory regions
Supports minimum for RDD storage (LRU Cache)
68
pipeline.io
Metrics
Shows exact memory usage per operator & node
Helps debugging and identifying skew
69
pipeline.io
Spark SQL API
Datasets type safe API (similar to RDDs) utilizing Tungsten
val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc)
Typed Aggregators used alongside UDFs and UDAFs
val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
def zero: Int = 0
def reduce(b: Int, a: Int) = b + a
def merge(b1: Int, b2: Int) = b1 + b2
def finish(b: Int) = b
}.toColumn
val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 70
pipeline.io
Spark Streaming State Management
New trackStateByKey()
Store deltas, compact later
More efficient per-key state update
Session TTL
Integrated A/B Testing (?!)
Show Failed Output in Admin UI
Better debugging
71
pipeline.io
Thank You!!
Chris Fregly
Research Scientist @ PipelineIO
(http://pipeline.io)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Contribute on Github!
Run All Demos in Docker
~6000 Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax 72

More Related Content

What's hot

The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015
craig lehmann
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly
 
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowMigrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Databricks
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
Chris Fregly
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
DataWorks Summit/Hadoop Summit
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
Knoldus Inc.
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Chris Fregly
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
Ofir Nir
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
Robert Evans
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
Tatsuhiro Chiba
 
On heap cache vs off-heap cache
On heap cache vs off-heap cacheOn heap cache vs off-heap cache
On heap cache vs off-heap cache
rgrebski
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
Kazuaki Ishizaki
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 

What's hot (20)

The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowMigrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
On heap cache vs off-heap cache
On heap cache vs off-heap cacheOn heap cache vs off-heap cache
On heap cache vs off-heap cache
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

Viewers also liked

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
Chris Fregly
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Chris Fregly
 
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Chris Fregly
 
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Chris Fregly
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Chris Fregly
 
Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016
Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
Chris Fregly
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Chris Fregly
 
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Chris Fregly
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Chris Fregly
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Chris Fregly
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Chris Fregly
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
Chris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
Chris Fregly
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chris Fregly
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Chris Fregly
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Chris Fregly
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
Chris Fregly
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Chris Fregly
 

Viewers also liked (20)

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
 
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
 
Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 

Similar to Atlanta Spark User Meetup 09 22 2016

Protocol Independence
Protocol IndependenceProtocol Independence
Protocol Independence
Open Networking Summits
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
University of Washington
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Till Rohrmann
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
Bartosz Konieczny
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
DataWorks Summit
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
Dragos Sbîrlea
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
Josh Fischer
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
Maurice Naftalin
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
source{d}
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Stephan Ewen
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
Aerospike
 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Mingliang Liu
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
Kostas Tzoumas
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
Jamie Grier
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Databricks
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
Jags Ramnarayan
 
Serial-War
Serial-WarSerial-War
Serial-War
Xuechao Wu
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
Sperasoft
 
Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...
Universität Rostock
 

Similar to Atlanta Spark User Meetup 09 22 2016 (20)

Protocol Independence
Protocol IndependenceProtocol Independence
Protocol Independence
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
Serial-War
Serial-WarSerial-War
Serial-War
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...
 

More from Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
Chris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
Chris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Chris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Chris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
Chris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
Chris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
Chris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
Chris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
Chris Fregly
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
Chris Fregly
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Chris Fregly
 

More from Chris Fregly (20)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
 

Recently uploaded

Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 

Recently uploaded (20)

Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 

Atlanta Spark User Meetup 09 22 2016

  • 1. pipeline.io After Dark 2.0End-to-End,Real-time, Advanced Analytics and ML Big Data Reference Pipeline Atlanta Spark User Group Sept 22, 2016 Thanks Emory Continuing Education! Chris Fregly Research Scientist @ PipelineIO We’re Hiring - Only Nice People! pipeline.io advancedspark.com
  • 2. pipeline.io Who Am I? 2 Research Scientist @ PipelineIO github.com/fluxcapacitor/pipeline Meetup Founder Advanced Book Author Advanced .
  • 3. pipeline.io Who Was I? 3 Streaming Data Engineer Netflix Open Source Committer Data Solutions Engineer Apache Contributor Principal Data Solutions Engineer IBM Technology Center
  • 4. pipeline.io Advanced Spark and Tensorflow Meetup Meetup Metrics Top 5 Most Active Spark Meetup! 4000+ Members in just 1 year!! 6000+ Downloads of Docker Image with many Meetup Demos!!! @ advancedspark.com Meetup Goals Code dive deep into Spark and related open source code bases Study integrations with Cassandra, ElasticSearch,Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc Surface and share patterns and idioms of well-designed, distributed, big data processing systems
  • 5. pipeline.io Atlanta Hadoop User Meetup Last Night http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
  • 7. pipeline.io Current PipelineIO Research Model Deploying and Testing Model Scaling and Serving Online Model Training Dynamic Model Optimizing 7
  • 8. pipeline.io PipelineIO Deliverables 100% Open Source!! Github https://github.com/fluxcapacitor/ DockerHub https://hub.docker.com/r/fluxcapacitor Workshop http://pipeline.io 8
  • 9. pipeline.io Topics of This Talk (20-30 mins each) ① Spark Streaming and Spark ML Generating Real-time Recommendations ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 9
  • 10. pipeline.io Live, Interactive Demo! Kafka, Cassandra, ElasticSearch, Redis, Spark ML
  • 11. pipeline.io Audience Participation Required! 11 You -> Audience Instructions ①Navigate to demo.pipeline.io ②Swipe software used in Production Only! Data -> Scientist This is Totally Anonymous!!
  • 12. pipeline.io Topics of This Talk (15-20 mins each) ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 12
  • 13. pipeline.io Mechanical Sympathy “Hardware and software working together.” - Martin Thompson http://mechanical-sympathy.blogspot.com “Whatever your data structure, my array will win.” - Scott Meyers Every C++ Book, basically 13
  • 14. pipeline.io Spark and Mechanical Sympathy 14 Project Tungsten (Spark 1.4-1.6+) 100TB GraySort Challenge (Spark 1.1-1.2) Minimize Memory & GC Maximize CPU Cache Saturate Network I/O Saturate Disk I/O
  • 16. pipeline.io CPU Cache Sympathy (AlphaSort paper) Key (10 bytes) + Pointer (4 bytes) = 14 bytes 16 Key Ptr Pre-process & Pull Key From Record PtrKey-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes Ptr Must Dereference to Compare Key Key Pointer (4 bytes) = 4 bytes Key Ptr Pad /Pad CPU Cache-line Friendly! Dereference full key only to resolve prefix duplicates
  • 20. pipeline.io Instrumenting and Monitoring CPU Use Linux perf command! 20 http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 21. pipeline.io Results of Random vs. Sequential Sort 21 Naïve Random Access Pointer Sort Cache Friendly Sequential Key/Pointer Sort Ptr Key Must Dereference to Compare Key Key Ptr Pre-process & Pull Key From Record -35% -90% -68% -26% % Change -55% perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
  • 23. pipeline.io CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ]; 23 Bad: Row-wise traversal, not using full CPU cache line, ineffective pre-fetching
  • 24. pipeline.io CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ]; 24 Good: Full CPU Cache Line, Effective Prefetching OLD: matB [ k ][ j ]; // Modify algo for Transpose B for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
  • 25. pipeline.io Results Of Matrix Multiplication Cache-Friendly Matrix MultiplyNaïve Matrix Multiply perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend -96% -93% -93% -70% -53% % Change -63% +8543%?
  • 27. pipeline.io Thread and Context Switch Sympathy Problem Atomically Increment 2 Counters (each at different increments) by 1000’s of Simultaneous Threads 27 Possible Solutions ① Synchronized Immutable ② Synchronized Mutable ③ AtomicReference CAS ④ Volatile? Context Switches are Expen$ive!! aka “LLC”
  • 28. pipeline.io Synchronized Immutable Counters case class Counters(left: Int, right: Int) object SynchronizedImmutableCounters { var counters = new Counters(0,0) def getCounters(): Counters = { this.synchronized { counters } } def increment(leftIncrement: Int, rightIncrement: Int): Unit = { this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } } } 28 Locks whole outer object!!
  • 29. pipeline.io Synchronized Mutable Counters class MutableCounters(left: Int, right: Int) { def increment(leftIncrement: Int, rightIncrement: Int): Unit={ this.synchronized {…} } def getCountersTuple(): (Int, Int) = { this.synchronized{ (counters.left, counters.right) } } } object SynchronizedMutableCounters { val counters = new MutableCounters(0,0) … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { counters.increment(leftIncrement, rightIncrement) } 29 Locks just MutableCounters
  • 30. pipeline.io Lock-Free AtomicReference Counters case class Counters(left: Int, right: Int) object LockFreeAtomicReferenceCounters { val counters = new AtomicReference[Counters](new Counters(0,0)) def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalCounters: Counters = null var updatedCounters: Counters = null do { originalCounters = getCounters() updatedCounters = new Counters(originalCounters.left+ leftIncrement, originalCounters.right+ rightIncrement) } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } 30Lock Free!!
  • 31. pipeline.io Lock-Free AtomicLong Counters object LockFreeAtomicLongCounters { // a single Long (64-bit) will maintain 2 separate Ints (32-bits each) val counters = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters = 0L var updatedCounters = 0L do { originalCounters = counters.get() … // Store two 32-bit Int into one 64-bit Long // Use >>> 32 and << 32 to set and retrieve each Int from the Long } // Retry lock-free, optimistic compareAndSet() until AtomicLong updates while !(counters.compareAndSet(originalCounters, updatedCounters)) }31 Lock Free!! A: The JVM does not guarantee atomic updates of 64-bit longs and doubles Q: Why not use @volatile long?
  • 32. pipeline.io Results of Thread Synchronization Immutable Case Class 32 Lock-Free AtomicLong -64% -46% -17% -33% perf stat –event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend % Change -31% -32% -33% -27% case class Counters(left:Int, right: Int) ... this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } val counters = new AtomicLong() … do { … } while !(counters.compareAndSet(originalCounters, updatedCounters))
  • 33. pipeline.io Profile Visualizations: Flame Graphs 33 Example: Spark Word Count Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer) Plateaus are Bad! I/O stalls, Heavy CPU serialization, etc
  • 34. pipeline.io Project Tungsten: CPU and Memory Create Custom Data Structures & Algorithms Operate on serialized and compressed ByteArrays! Minimize Garbage Collection Reuse ByteArrays In-place updates for aggregations Maximize CPU Cache Effectiveness 8-byte alignment AlphaSort-based Key-Prefix Utilize Catalyst Dynamic Code Generation Dynamic optimizations using entire query plan Developer implements genCode() to create Scala source code (String) 34
  • 35. pipeline.io Why is CPU the Bottleneck? CPU for serialization, hashing, & compression Spark 1.2 updates saturated Network, Disk I/O 10x increase in I/O throughput relative to CPU More partition, pruning, and pushdown support Newer columnar file formats help reduce I/O 35
  • 36. pipeline.io Custom Data Structs & Algos: Aggs UnsafeFixedWidthAggregationMap Uses BytesToBytesMap internally In-place updates of serialized aggregation No object creation on hot-path TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 steps to avoid single-key OOMs ① Hash-based (grouping) agg spills to disk if needed ② Sort-based agg performs external merge sort on spills 36
  • 37. pipeline.io Custom Data Structures & Algorithms o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeExternalSorter UnsafeShuffleWriter UnsafeInMemorySorter RecordPointerAndKeyPrefix 37 PtrKey-Prefix 2x CPU Cache-line Friendly! SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]> Note: Mixing multiplesubclasses of SortDataFormat simultaneously will prevent JIT inlining. Supports merging compressed records (if compression CODEC supports it, ie. LZF) In-place external sorting of spilled BytesToBytes data AlphaSort-based, 8-byte aligned sort key In-place sorting of BytesToBytesMap data
  • 38. pipeline.io Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Lack of polymorphism == poor code design Solution Code generation enables inlining Rewrite and optimize code using overall plan, 8-byte align Defer source code generation to each operator, UDF, UDAF Use Janino to compile generated source code ->bytecode 38
  • 39. pipeline.io Autoscaling Spark Workers (Spark 1.5+) Scaling up is easy J SparkContext.addExecutors() until max is reached Scaling down is hard L SparkContext.removeExecutors() Lose RDD cache inside Executor JVM Must rebuild active RDD partitions in another Executor JVM Uses External Shuffle Service from Spark 1.1-1.2 If Executor JVM dies/restarts, shuffle keeps shufflin’! 39
  • 40. pipeline.io “Hidden” Spark Submit REST API http://arturmkrtchyan.com/apache-spark-hidden-rest-api Submit Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data ’{"action" : "CreateSubmissionRequest”, "mainClass" : "org.apache.spark.examples.SparkPi”, "sparkProperties" : { "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar", "spark.app.name" : "SparkPi",… }}’ Get Spark Job Status curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request> Kill Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request> 40 (the snitch)
  • 41. pipeline.io Outline ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 41
  • 42. pipeline.io Parquet Columnar File Format Based on Google Dremel paper ~2010 Collaboration with Twitter and Cloudera Columnar storage format for fast columnar aggs Supports evolving schema Supports pushdowns Support nested partitions Tight compression Min/max heuristics enable file and chunk skipping 42 Min/Max Heuristics For Chunk Skipping
  • 43. pipeline.io Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/… /gender=F/… <-- Use Case: Access Users by Gender /gender=U/… Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender") .save("/genders.parquet") Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie./gender=F/…) SQL: SELECT id FROM genders WHERE gender=F DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). .where("gender=F") 43
  • 44. pipeline.io Pruning Partition Pruning Filter out rows by partition SELECT id, gender FROM genders WHERE gender = ‘F’ Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (Parquet) Skip entire blocks of columns SELECT id, gender FROM genders 44
  • 45. pipeline.io Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given function Filters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] 45
  • 47. pipeline.io Predicate Pushdowns & Filter Collapsing 47 Filter pushdown No extra pass Filter combining Only 1 extra pass 2 extra passes through the data after retrieval
  • 50. pipeline.io Broadcast Join vs. Normal Shuffle Join 50
  • 52. pipeline.io Visualizing the Query Plan 52 Effectiveness of Filter Cost-based Join Optimization Similar to MapReduce Map-side Join & DistributedCache Peak Memory for Joins and Aggs UnsafeFixedWidthAggregationMap getPeakMemoryUsedBytes()
  • 53. pipeline.io Data Source API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 53
  • 54. pipeline.io Native Spark SQL Data Sources 54
  • 55. pipeline.io JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json ("file:/root/pipeline/datasets/dating/ratings.json.bz2") SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 55 json() convenience method
  • 56. pipeline.io Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 56
  • 57. pipeline.io ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>") 57
  • 59. pipeline.io Tips for Cassandra Analytics By-pass Cassandra CQL “front door” CQL Optimized for Transactions Bulk read and write directly against SSTables Check out Netflix OSS project “Aegisthus” Cassandra becomesa first-class analytics option Replicated analytics cluster no longer needed 59
  • 60. pipeline.io Creating a Custom Data Source ① Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ② Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilterScan} Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation 60
  • 62. pipeline.io Publishing Custom Data Sources 62 spark-packages.org
  • 63. pipeline.io Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in SPARK-8159, SPARK-9571 Every UDF must use Expressions and implement Expression.genCode() to participate in the fun Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
  • 64. pipeline.io Creating a Custom UDF with Code Gen ① Study existing implementations o.a.s.sql.catalyst.expressions.Substring ② Extend and implement base trait o.a.s.sql.catalyst.expressions.Expression.genCode ③ Don’t forget about Python! python.pyspark.sql.functions.py 64
  • 65. pipeline.io Demo! Creating a Custom UDF participating in Code Generation
  • 66. pipeline.io Spark 1.6 and 2.0 Improvements Adaptiveness, Metrics, Datasets, and Streaming State
  • 67. pipeline.io Adaptive Query Execution Adapt query execution using data from previous stages Dynamically choose spark.sql.shuffle.partitions (default 200) 67 Broadcast Join (popular keys) Shuffle Join (not-so-popular keys) Adaptive Hybrid Join
  • 68. pipeline.io Adaptive Memory Management Spark <1.6 Manual configure between 2 memory regions Spark execution engine (shuffles, joins, sorts, aggs) spark.shuffle.memoryFraction RDD Data Cache spark.storage.memoryFraction Spark 1.6+ Unified memory regions Dynamically expand/contract memory regions Supports minimum for RDD storage (LRU Cache) 68
  • 69. pipeline.io Metrics Shows exact memory usage per operator & node Helps debugging and identifying skew 69
  • 70. pipeline.io Spark SQL API Datasets type safe API (similar to RDDs) utilizing Tungsten val ds = sqlContext.read.text("ratings.csv").as[String] val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc) Typed Aggregators used alongside UDFs and UDAFs val simpleSum = new Aggregator[Int, Int, Int] with Serializable { def zero: Int = 0 def reduce(b: Int, a: Int) = b + a def merge(b1: Int, b2: Int) = b1 + b2 def finish(b: Int) = b }.toColumn val sum = Seq(1,2,3,4).toDS().select(simpleSum) Query files directly without registerTempTable() %sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 70
  • 71. pipeline.io Spark Streaming State Management New trackStateByKey() Store deltas, compact later More efficient per-key state update Session TTL Integrated A/B Testing (?!) Show Failed Output in Admin UI Better debugging 71
  • 72. pipeline.io Thank You!! Chris Fregly Research Scientist @ PipelineIO (http://pipeline.io) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Contribute on Github! Run All Demos in Docker ~6000 Docker Downloads!! Find me on LinkedIn, Twitter, Github, Email, Fax 72