How to create a Pipeline capable of
processing 2.5 Billion records/day
in just 3 Months
Josef Habdank
Lead Data Scientist & Data Platform Architect
at Infare Solutions
@jahabdank
jha@infare.com
linkedin.com/in/jahabdank
• Spark is currently the most fundamental skill in the BigData and DataScience world
• Main reason: it helps solve most BigData problems:
processing, transformation, abstraction and Machine Learning
You are in the right place!
Spark is currently the de facto standard for BigData
[Google Trends chart for “Apache Spark”: the growth in 2013-2015 was insane. Pretty much all serious BigData players are using Spark now, a big change from the good old‘n’slow Hadoop MapReduce days.]
What is this talk about?
Presentation consists of 4 parts:
• Quick intro to Spark internals and optimization
• N-billion rows/day system architecture
• How exactly we did what we did
• Focus on Spark’s Performance, getting maximum
bang for the buck
• Data Warehouse and Messaging
• (optional) How to deploy Spark so it does not
backfire
The Story
• “Hey guys, we might land a new cool project”
• “It might be 5-10x as much data as we have so far”
• “In 1+year it probably will be much more than 10x”
• “Oh, and can you do it in 6 months?”
“Let’s do that
in 3 months!”
The Result
• 5 low-cost servers (8-core, 64GB RAM)
• Located on Amazon with hosted Apache Spark
• A fraction of the cost of any other technology
• Initial max capacity load-tested at 2.5bn/day
• Currently improved to a max capacity of 6-8bn/day, ~250-350mil/hour (with no extra hardware required)
• As Spark scales with hardware, we could do 15bn with 10-15 machines
• Delivered in 3 months, in production for 1.5 years now
Developing code for
distributed systems
Normal workflow
• Code locally on your machine
• Compile and assemble
• Upload JAR + make sure deps are
present on all nodes
• Run the job and test if it works; spend
time looking for results
Notebook workflow
• Write code online
• Shift+Enter to compile (on master),
send to cluster nodes, run and
show results in browser 
+ Can support Git lifecycle
+ Allows mixing Python/Scala/Sql
(which is awesome  )
Code in Notebooks, they are awesome
• Development on Cluster systems is by nature not easy
• The best you can do locally is to know that the code compiles, the unit tests pass and
the code runs on some sample data
• You do not actually know if it works until you test-run it on the PreProd/Dev
cluster, as the data defines the correctness, not the syntax
Traps of Notebooks
• Code is compiled on the fly
• When the chunk of code is executed as a Spark
Job (on a whole cluster) all the dependent
objects will be serialized and packaged with the
job
• Sometimes the dependency structure is very
non-trivial and the Notebook will start
serializing huge amounts of data (completely
silently, and attaching it to the Job)
• PRO TIP: have as few global variables as
possible, if needed use objects
Traps of Notebooks
Code as distributed JAR vs Code as lambda
This is compiled to JAR and
distributed and bootstrapped
to JVMs across cluster
(when initializing JAR it will
open the connection)
This is serialized on master
and attached to the job
(connection object will fail to
work after deserialization)
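A minimal sketch of that difference, with an illustrative JDBC sink (DbSink, the URL and the table are made up, and rdd stands for any RDD[String]): the connection held inside an object is created lazily on each executor JVM, while one created on the master and captured by the lambda is serialized with the job and is broken (or fails to serialize at all) on the workers.

import java.sql.{Connection, DriverManager}

object DbSink {
  // lazy val: initialized on first use inside each executor JVM,
  // never shipped from the master together with the job
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://dbhost/metrics") // assumed URL
}

// Good: each executor opens its own connection when it processes a partition
rdd.foreachPartition { rows =>
  val stmt = DbSink.connection.createStatement()
  rows.foreach(r => stmt.addBatch(s"insert into events values ('$r')")) // assumed table
  stmt.executeBatch()
}

// Bad: created on the master, captured by the lambda, serialized with the job
// val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/metrics")
// rdd.foreach(r => conn.createStatement().execute(s"insert into events values ('$r')"))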
Writing high throughput
code in Spark
Spark’s core: the collections
• Spark is just a processing framework
• Works on distributed collections:
• Collections are partitioned
• The number of partitions is defined by the source
• Collections are lazily evaluated (nothing runs until you request results; see the sketch below)
• With a Spark collection you only write a ‘recipe’ for what Spark has to do (called the lineage)
• Types of collections:
• RDDs: just collections of Java objects. Slowest, but most flexible
• DataFrames/Datasets: mainly tabular data; structured data is possible but not trivial. Much faster serialization/deserialization, more compact, faster memory management, SparkSQL compatible
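A minimal sketch of the laziness, assuming a running SparkSession named spark: the transformations only record the lineage, and nothing executes until the action at the end.

// Only lineage is built here; no job runs yet
val rdd    = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8) // 8 partitions
val recipe = rdd.map(_ * 2).filter(_ % 3 == 0)

println(recipe.toDebugString) // inspect the recorded lineage
val count = recipe.count()    // the action triggers the actual distributed execution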
Spark’s core: in-memory map reduce
• Spark Implements Map-LocalReduce-
LocalShuffle-Shuffle-Reduce paradigm
• Each step in the ‘recipe’/lineage is a
combination of the above
• Why this way? The vast majority of BigData
problems can be converted to this paradigm:
• All SqlQueries/data extracts
• In many cases DataScience (modelling)
Map-LocalReduce-Shuffle-Reduce

%scala
val lirdd =
  sc.parallelize(
    loremIpsum.split(" ")
  )
val wordCount =
  lirdd
    .map(w => (w,1))
    .reduceByKey(_ + _)
    .collect

%sql
select
  word,
  count(*) as word_count
from words
group by word

[Diagram: word count on a master plus two nodes. Map (.map([...])) turns each word into (“lorem”, 1), (“Ipsum”, 1), (“sicut”, 1), ...; LocalReduce combines counts within each partition; Shuffle moves each key to a single node; Reduce (.reduceByKey([...])) produces the final counts (“lorem”, 2), (“Ipsum”, 2), (“sicut”, 2).]
Slowest part: data is serialized from objects to BLOBs, sent over the network and deserialized.
Map only operations
%scala
// val rawBytesRDD is defined
// and contains blobs with
// serialized Avro objects
rawBytesRDD
.map(fromAvroToObj)
.toDF.write
.parquet(outputPath)
Spark knows shuffle is expensive and tries to avoid it if it can.

[Diagram: incoming blobs (0x00[…]) arrive on node 1 and node 2; .map([...]) turns each blob partition into objects (Obj 001-100 through Obj 501-600) and each partition is written straight to its own output file (File 1 through File 6), with no shuffle.]
Local Shuffle-Map operations
%scala
// val rawBytesRDD is defined
// and contains blobs with
// serialized Avro objects
rawBytesRDD
  .coalesce(2) //**
  .map(fromAvroToObj)
  .toDF.write
  .parquet(outputPath)
// ** never set it this low!!
// This is just an example
// Aim at at least 2x the node count.
// Moreover, if possible,
// coalesce() or repartition()
// on the binary blobs
For fragmented collections (with too many partitions)

[Diagram: many small incoming blob partitions (0x00[…]) on node 1 and node 2 are first coalesced into 2 partitions (Local Shuffle, .coalesce(2)), then mapped (.map([...])) into Obj 001-300 and Obj 301-600 and written as just File 1 and File 2.]
Why Python/PySpark is (generally)
slower than Scala
• All rows will be serialized between the JVM and Python (there are exceptions)
• This happens within the same machine, so it is very fast
• Nonetheless it is a significant overhead

[Diagram: on each node, every map step runs as JVM -> Python serde, do the map in Python, Python -> JVM serde; the resulting objects (ObjA ... ObjF) then go through the conditional shuffle (.coalesce(2)).]
Why Python/PySpark is (generally)
slower than Scala
• In Spark 2.0, with the new version of Catalyst and dynamic code generation, Spark will try to convert Python code to native Spark functions
• This means that on some occasions Python might be as fast as Scala, as the Python code is in fact translated into native Spark calls
• Catalyst and code generation cannot do this for RDD map operations, nor for custom UDFs on DataFrames
• PRO TIP: avoid using RDDs, as Spark will serialize whole objects. For UDFs it will only serialize a few columns, and will do it in a very efficient way
df2 = df1 \
    .filter(df1.old_column >= 30) \
    .withColumn("new_column1", ((df1.old_column - 2) % 7) + 1)
df3 = df2 \
    .withColumn("new_column2", custom_function(df2.new_column1))
N bn Data Platform design
What Infare does
Leading provider of
Airfare Intelligence Solutions
to the Aviation Industry
Business Intelligence on flight ticket prices, such that airlines know competitors’ prices and market trends
Advanced Analytics and Data Science predicting prices and ticket demand, as well as financial data
Collects and processes 1.5+ billion distinct airfares daily
What we were supposed to do
[Diagram: multiple Data Collection components feed a Scalable DataWarehouse (aimed at 500+bn rows), which in turn feeds several customer-specific Data Warehouses. *Scalable to billions of rows a day]
What we need
[Diagram: multiple Data Collection components feed scalable fast-access temporary storage and scalable low-cost permanent storage, glued together by a Processing Framework.]
ALL BigData systems in the world look like that
What we first did 
[Diagram: Data Collection components push Avro blobs compressed with Snappy through a Data Streamer into a Message Broker (Kinesis) acting as temporary storage. Micro-batch jobs handle monitoring/stats, real-time analytics and preaggregation, writing uncompressed Parquet micro batches; a mini-batch job builds the partitioned, aggregated Parquet DataWarehouse on S3 (permanent storage) for offline analytics and Data Science. A Monitoring System watches the whole pipeline.]
Did it work?
[Same diagram as above, annotated with what went wrong: S3 has latency (and is inconsistent for deletes).]
Why DynamoDB was a failure: Spark’s Parallelize hell
• No Spark-native driver, so no clustered queries
• Parallelize in the current implementation has a memory leak
• DynamoDB historically DID NOT SUPPORT Spark WHATSOEVER; we effectively ended up writing our own Spark driver from scratch, WEEKS of wasted effort
• I have to admit that since our initial huge disappointment a year ago Amazon has released a Spark driver, and I do not know how good it is. My opinion is still that a closed-source DB with limited support and usage will always be inferior to other technologies
How are we doing it now 
[Diagram: same pipeline as above, but the Message Broker is now Kafka (the new Kafka 0.10.2 has great streaming support) and the Monitoring System is Elasticsearch (Elasticsearch 5.2 has amazing Spark support). Data Collection components push Snappy-compressed Avro blobs through the Data Streamer into the broker (temporary storage); micro-batch jobs do monitoring/stats, real-time analytics and preaggregation into uncompressed Parquet micro batches; a mini-batch job builds the partitioned, aggregated Parquet/ORC DataWarehouse on S3 (permanent storage) for offline analytics and Data Science.]
Getting maximum out of Kinesis/Kafka
Serialize and send micro batches of data, not individual messages (see the sketch below)
Kinesis:
• Messages are max 25kB (if larger, the driver will slice the message into multiple PUT requests)
• We Avro-serialize and Snappy-compress data up to max 25kB (~200 data rows per message)
• Obtained 10x the throughput compared to sending individual rows (each ~180 bytes)
Kafka:
• The max message size is 1MB, but that is very large
• Jay Kreps @ LinkedIn researched the optimal message size for Kafka, and it is between 10-100kB
• His research shows that at those message sizes you can send as much as the hardware/network allows
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
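A hedged sketch of that batching, assuming a SpecificRecord type for the rows, a records collection to send, and a send() placeholder standing in for the Kinesis/Kafka producer call: roughly 200 rows are packed into one Snappy-compressed Avro container per message, keeping each PUT around the 25kB sweet spot.

import java.io.ByteArrayOutputStream
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.specific.{SpecificDatumWriter, SpecificRecordBase}

// Serialize one micro batch into a single Snappy-compressed Avro container blob
def toAvroBlob[T <: SpecificRecordBase](batch: Seq[T]): Array[Byte] = {
  val out    = new ByteArrayOutputStream()
  val writer = new DataFileWriter[T](new SpecificDatumWriter[T]())
  writer.setCodec(CodecFactory.snappyCodec())
  writer.create(batch.head.getSchema, out)
  batch.foreach(writer.append)
  writer.close()
  out.toByteArray
}

// ~200 rows of ~180 bytes each compress to roughly 25kB; tune per schema
records.grouped(200).foreach(batch => send(toAvroBlob(batch))) // send() is a placeholder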
Spark Streaming
• Creates a stream of MiniBatches, which are RDDs created every n seconds
• The Spark Driver continuously polls the message broker, by default every 200ms
• Each received block (every 200ms) becomes an RDD partition
• Consider using repartition/coalesce, as the number of partitions grows very large quickly (for a 60 sec batch there will be up to 300 partitions, thus 300 files; see the setup sketch below)
• NOTE: in Spark 2.1 they added Structured Streaming (streaming on DataFrames, not RDDs, very cool but still limited in functionality)

kstream
  .repartition(3 * nodeCount)
  .foreachRDD(rawBytes => {
    [...]
  })

[Diagram: Data Collection components push to the Message Broker; the Data Streamer/driver polls it and emits an n-second stream of MiniBatch RDDs.]
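A hedged end-to-end sketch of such a stream, using the Kafka 0.10 direct-stream integration with made-up broker, topic and group names, an existing SparkSession named spark and an assumed 5-node cluster: a 60-second batch interval yields up to 300 blocks, hence the repartition to roughly 3x the node count before processing.

import org.apache.kafka.common.serialization.ByteArrayDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(spark.sparkContext, Seconds(60)) // 60-second mini batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",                    // assumed broker
  "key.deserializer"   -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id"           -> "airfare-pipeline")                // assumed group id

val kstream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[Array[Byte], Array[Byte]](Seq("airfares"), kafkaParams))

kstream
  .map(_.value)            // raw Avro blobs
  .repartition(3 * 5)      // ~3x the node count, as recommended above
  .foreachRDD(rawBytes => {
    // decode the Avro blobs, write Parquet, submit stats
  })

ssc.start()
ssc.awaitTermination()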
Default error handling in Spark Streaming
• Retries the MiniBatch n times (4 by default)
• If all retries fail, the streaming job is killed
• Conclusion: you must do your own error handling

[Diagram: Data Collection components → Message Broker → micro/mini batch jobs.]
In-stream error handling, low error rate
• For a low error rate, handle each error individually
• Open a connection to storage, save the error packet for later processing, close the connection
• Will clog the stream for high-error-rate streams

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.flatMap(b => {
      try { // can use Option instead of Seq
        [do something, rare error]
      } catch { case e: Exception =>
        [save the error packet]
        Seq[Obj]()
      }
    })
    [do something with res]
  })

[Diagram: Data Collection → Message Broker → MiniBatches; individual error packets are written out as Error Batches.]
Advanced error handling, high error rate
• For a high error rate, you can’t store each error individually
• Unfortunately Spark does not support a Multicast operation (stream splitting)
• Try the high-error-probability action (such as an API request)
• Use transform to return Either[String, Obj]
• The Either class is like a tuple, but guarantees that only one or the other is present (String for error, Obj for success)
• cache() to prevent reprocessing
• Individually process the error and success streams
• NOTE: cache() should be used cautiously

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.map(b => {
      Try([frequent error])
        .transform(
          { b => Success(Right(b)) },
          { e => Success(Left(e.getMessage)) }
        ).get
    })
    res.cache() // cautious, can be slow
    res
      .filter(ei => ei.isLeft)
      .map(ei => ei.left.get)
      .[process errors as stream]
    res
      .filter(ei => ei.isRight)
      .map(ei => ei.right.get)
      .[process successes as stream]
  })
To cache or not to cache
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB, which is stored in the cluster’s shared memory
• When reusing data which was cached, it has to be deserialized from the BLOB
• By default it uses the generic Java serializer
[Diagram: Job1 runs Step1 → Step2 and cache serializes the result; Job2 deserializes the cached BLOB and continues with Step3 instead of re-running Step1.]

Standard Spark Streaming scenario:
• Incoming Avro stream from a Message Broker with Avro objects
• Step 1) Stats computation, storing the stats
• Step 2) Storing the data from the stream
Question: Will caching make it faster? Nope.

[Diagram comparing the two variants: without cache, each step deserializes the Avro again (Deserialize Avro → Compute Stats → Save Stats, and Deserialize Avro → Save Data); with cache, the Avro is deserialized once but the cache adds Serialize Java and Deserialize Java steps before Save Data. The cache-free variant is labelled Faster.]
To cache or not to cache
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB, which is stored in the cluster’s shared memory
• When reusing data which was cached, it has to be deserialized from the BLOB
• By default it uses the generic Java serializer, which is SLOW
• Even super-fast serdes like Kryo are much slower than usual, as they are generic (the serializer does not know the type at compile time)
• Avro is amazingly fast as it is a specific serializer (it knows the type)
• Often you will be quicker reprocessing the data from your source than using the cache, especially for complex objects (pure strings/byte arrays cache fast)
• Caching is faster with the DataFrames/Tungsten API, but even then it might be slower than reprocessing
• PRO TIP: when using cache, make sure it actually helps, and monitor CPU consumption too (see the sketch below)
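A minimal sketch of that PRO TIP, with an illustrative path and column name: time the same two actions with and without the cache (and watch the CPU and the Storage tab) before leaving a cache() in production.

import org.apache.spark.storage.StorageLevel

def timed[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.1f s")
  result
}

val df = spark.read.parquet("s3://bucket/raw")        // assumed location

timed("no cache") {
  df.count()
  df.agg(Map("price" -> "max")).collect()             // assumed column
}

val cached = df.persist(StorageLevel.MEMORY_ONLY_SER) // explicit serialized cache
timed("with cache") {
  cached.count()
  cached.agg(Map("price" -> "max")).collect()
}
cached.unpersist()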
Tungsten Memory Manager + Row Serializer
• Introduced in Spark 1.5, used in
DataFrames/DataSets
• Stores data in memory as readable binary blobs, not as Java objects
• Since Spark 2.0 the blobs are in a columnar format (much better compression)
• Does some black-magic wizardry with the L1/L2/L3 CPU caches
• Much faster: 10x-100x faster than RDDs
• Rule of thumb: whenever possible use DataFrames and you will get Tungsten
case class MyRow(val col: Option[String], val exception: Option[String])

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.map(b => {
      Try([frequent error]).transform(
        { b => Success(MyRow(Some(b), None)) },
        { e => Success(MyRow(None, Some(e.getMessage))) }
      ).get
    }).toDF()
    res.cache()
    res.select("exception")
      .filter("exception is not null")
      .[process errors as stream]
    res.select("col")
      .filter("exception is null")
      .[process successes as stream]
  })
Data Warehouse and Messaging
Data Storage: Row Store vs Column Store
• What are DataBases: collections of objects
• Main difference:
• Row store requires only one row at a time to serialize
• Column Store requires a batch of data to serialize
• Serialization:
• Row store can serialize online (as rows come into the serializer they can be appended to the binary buffer)
• Column Store requires the whole batch to be present at the moment of serialization; the data can be processed (index creation, sorting, duplicate removal etc.)
• Reading:
• Row store always reads all data from a file
• Column Store allows reading only selected columns
JSON/CSV Row Store
Pros:
• Human readable (do not underestimate that)
• No dev time required
• Compression algorithms work very well on ASCII text
(compressed CSV ‘only’ 2x larger than compressed
Avro)
Cons:
• Large (CSV) and very large (JSON) volume
• Slow serialization/deserialization
Overall: worth considering, especially during dev phase
Avro Row Store
Pros:
• BLAZING FAST serialization/deserialization
• Apache Avro lib is amazing (buffer based serde)
• Binary/compact storage
• Compresses by about 70% with Snappy (200 compressed objects with 50 cols result in ~20kB)
Cons:
• Hard to debug, once BLOB is corrupt it is very
hard to find out what went wrong
Avro Scala classes
• Avro
serialization/deserialization
requires Avro contract
compatible with Apache
Avro library
• In principle the Avro classes
are logically similar to the
Parquet classes
(definition/field accessor)
class ObjectModelV1Avro (
  var dummy_id: Long,
  var jobgroup: Int,
  var instance_id: Int,
  [...]
) extends SpecificRecordBase with SpecificRecord {
  def get(field: Int): AnyRef = field match {
    case pos if pos == 0 => { dummy_id }.asInstanceOf[AnyRef]
    [...]
  }
  def put(field: Int, value: Any): Unit = field match {
    case pos if pos == 0 => this.dummy_id = { value }.asInstanceOf[Long]
    [...]
  }
  def getSchema: org.apache.avro.Schema =
    new Schema.Parser().parse("[...]")
}
Avro real-time serialization
• Apache Avro allows serializing on the fly, row by row
• An incoming data stream can be serialized on the fly into a binary buffer

// C# code
byte[] raw;
using (var ms = new MemoryStream())
{
    using (var dfw = DataFileWriter<T>.OpenWriter(_datumWriter, ms, _codec))
    {
        // can be yielded
        microBatch.ForEach(dfw.Append);
    }
    raw = ms.ToArray();
}

// Scala
val writer: DataFileWriter[T] =
  new DataFileWriter[T](datumWriter)
writer.create(objs(0).getSchema, outputStream)
// can be streamed
for (obj <- objs) {
  writer.append(obj)
}
writer.close
val encodedByteArray: Array[Byte] =
  outputStream.toByteArray
Parquet Column Store
Pros:
• Meant for large data sets
• Single column searchable
• Compressed (eliminated duplicates etc.)
• Contains in-file stats and metadata located in TAIL
• Very well supported in Spark:
• predicate/filter pushdown
• VECTORIZED READER AND TUNGSTEN INTEGRATION (5-10x faster than the Java Parquet library)
Cons:
• Not indexed

spark
  .read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")
More on predicate/filter pushdown
• Processing is separate from the storage
• Predicate/filter pushdown gives Spark a uniform way to push the query to the source
• Spark remains oblivious to how the driver executes the query; it only cares whether the driver can or can’t execute the pushdown request
• If the driver can’t execute the request, Spark will load all the data and filter it in Spark (see the verification sketch below)
[Diagram: the Apache Spark DataFrame makes a pushdown request to the Parquet/ORC API; the Parquet/ORC driver executes the pushed request on the binary data in Storage.]
• Such an abstraction allows easy replacement of the storage; Spark does not care whether the storage is S3 files or a database
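A quick sketch of how to verify that the pushdown actually happens, reusing the dwLoc query from above: the physical plan printed by explain() should list the predicate under PushedFilters in the file-scan node; if it is missing, Spark is loading everything and filtering it itself.

import spark.implicits._

spark.read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")
  .explain()   // look for PushedFilters: [IsNotNull(col1), EqualTo(col1,text)] in the scan node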
ORC Column Store
Pros:
• Meant for large data sets
• Single column searchable
• Even better compression than Parquet (20-30% less)
• Contains in-file stats, metadata and indexes (3 levels: file, block and every 10k rows) located in the TAIL
• Theoretically well supported in Spark:
• predicate/filter pushdown
• uses the indexes for filter pushdown searches, amazing
Cons:
• No vectorized reader; rumors about adding it in Spark 2.2. If this turns out to be true then ORC should be faster

spark
  .read
  .orc(dwLoc)
  .filter('col1 === "text")
  .select("col2")
DataWarehouse building
• Think of it as a collection of read only files
• Recommended to use Parquet/ORC files in a folder structure (aim at 100-1000MB files, use coalesce)
• Folders are partitions
• Spark supports append for Parquet/ORC
• Compression:
• Use Snappy (decompression speed ~500MB/sec per core)
• Gzip (decompression speed ~60MB/sec per core)
• Note: Snappy is not splittable, keep files under 1GB
• Ordering: if you can (you often cannot), order your data, as then columnar deduplication will work better
• In our case this saves 50% of the space, and thus 50% of the reading time

df
  .coalesce(fileCount)
  .write
  .option("compression", "snappy")
  .mode("append")
  .partitionBy(
    "year",
    "month",
    "day")
  //.orderBy("some_column")
  .parquet(outputLoc)
  //.orc(outputLoc)
DataWarehouse query execution
dwFolder/year=2016/month=3/day=10/part-[ManyFiles]
dwFolder/year=2016/month=3/day=11/part-[ManyFiles]
[...]
• Partition pruning: Spark will only look for the files in
appropriate folders
• Row group pruning: uses row group stats to skip data (if the filtered data is outside the min/max values of a Row Group’s stats in a Parquet file, the data is skipped; turned off by default, as it is expensive and only benefits ordered files)
• Reads only col1 and col2 from the file; col1 is used as the filter (never seen by Spark, handled by the API) and col2 is returned to Spark for processing
• If the DW is ORC, it will use the in-file indexes to speed up the scan (Parquet will still scan through the entire column in each scanned file to filter col1)

sqlContext
  .read
  .parquet(dwLoc)
  .filter(
    'year === 2016 &&
    'month === 1 &&
    'day === 1 &&
    'col1 === "text")
  .select("col2")

sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
DataWarehouse schema evolution
the SLOW way
• Schema evolution = columns changing over time
• Spark allows the schema-on-read paradigm
• Only adding columns is allowed
• Removing is done by column pruning in the SELECT
• Renaming is handled in Spark
• Each file in the DW (both ORC and Parquet) is schema aware
• Each file can have different columns
• By default Spark (for speed purposes) assumes all files have the same schema
• In order to enable schema merging, manually set a flag during the read
• there is a heavy speed penalty for doing this
• How to make it happen: simply append data with different columns to an already existing store
sqlContext
.read
.option(
"mergeSchema",
"true")
.parquet(dwLoc)
.filter(
'year === 2016 &&
'day === 1 &&
'col1 === "text")
.select("col2")
.withColumnRenamed(
"col2",
"newAwesomeColumn")
DataWarehouse schema evolution
the RIGHT way
• The much faster way is to create multiple warehouses and merge them by calling UNION
• The union requires the columns and types to be the same in ALL dataframes/datawarehouses
• The dataframes have to be aligned by adding/renaming columns, using default values etc.
• The advantage of doing this is that Spark is now dealing with a small(er) number of datawarehouses, within each of which it can assume the same types, which can save a massive amount of resources
• Spark is smart enough to execute partition pruning and filter/predicate pushdown on all the unioned warehouses, therefore this is the recommended way
val df1 = sqlContext
.read.parquet(dwLoc1)
.withColumn([...])
val df2 = sqlContext
.read.parquet(dwLoc2)
.withColumn([...])
val dfs = Seq(df1, df2)
val df_union = dfs
.reduce(_ union _)
// df_union is your queryable
// warehouse, including
// partitions etc
Scala Case classes limitation
• Spark can only automatically build DataFrames from RDDs consisting of case classes
• This means that for saving Parquet/ORC you have to use case classes
• In Scala 2.10 case classes can have max 22 fields (a limitation not present in Scala 2.11), thus only 22 columns
• Case classes implicitly extend the Product type; if you need a DataWarehouse with more than 22 columns, create a POJO class extending the Product type
case class MyRow(
val col1: Option[String],
val col2: Option[String]
)
// val rdd: RDD[MyRow]
val df = rdd.toDF()
df.coalesce(fileCount)
.write
.parquet(outputLoc)
Scala Product Class example
class ObjectModelV1 (
var dummy_id: Long,
var jobgroup: Int,
var instance_id: Int,
[...]
) extends java.io.Serializable with Product {
def canEqual(that: Any) = that.isInstanceOf[ObjectModelV1]
def productArity = 50
def productElement(idx: Int) = idx match {
case 0 => dummy_id
case 1 => jobgroup
case 2 => instance_id
[...]
}
}
Scala Avro + Parquet contract combined
• The Avro and Parquet contract can be the same class (no inheritance collision)
• Saves unnecessary object conversion/data copying, which at the 5bn range is actually a large cost
• Spark Streaming can receive objects as Avro and directly convert them to Parquet/ORC (see the sketch below)
+
class ObjectModelV1 (
var dummy_id: Long,
var jobgroup: Int,
var instance_id: Int,
[...]
) extends SpecificRecordBase with SpecificRecord
with Serializable with Product {
[...]
}
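A hedged sketch of that direct path, assuming the combined ObjectModelV1 contract above, a kstream of raw Avro container blobs, an existing SparkSession named spark and an outputPath defined elsewhere: each blob is decoded straight into the contract objects, and because the class is also a Product (per the approach described earlier) the RDD goes directly into a DataFrame and out as Parquet with no intermediate copy.

import java.io.ByteArrayInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.specific.SpecificDatumReader
import scala.collection.JavaConverters._

// Decode one Avro container blob into the shared Avro/Parquet contract objects
def fromAvroBlob(blob: Array[Byte]): Iterator[ObjectModelV1] = {
  val reader = new DataFileStream[ObjectModelV1](
    new ByteArrayInputStream(blob),
    new SpecificDatumReader[ObjectModelV1](classOf[ObjectModelV1]))
  reader.iterator().asScala
}

kstream.foreachRDD { rawBytes =>
  val objects = rawBytes.flatMap(fromAvroBlob)  // the Avro objects are the row objects
  spark.createDataFrame(objects)                // works because ObjectModelV1 is a Product
    .write.mode("append")
    .parquet(outputPath)
}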
Summary: Key to high performance
[Diagram: the pipeline annotated with where incremental aggregation happens. The C# Kinesis Uploader in Data Collection buffers to 25kB messages and submits stats; the micro-batch jobs buffer by 1 min on 12 threads and submit stats; the mini-batch job builds the daily DW on S3, aiming at 100-500MB files, and submits stats.]
• Incremental aggregation/batching
• Always make sure to have as many write threads as cores in the cluster
• Avoid the reduce phase at all costs; avoid shuffle unless you have a good reason**
• “If it can wait, do it later in the pipeline”
• Use DataFrames whenever possible
• When using caching, make sure it actually helps
2x Senior Data Scientist, working with
Apache Spark + R/Python doing Airfare/price forecasting
4x Senior Data Platform Engineer, working with
Apache Spark/S3/Cassandra/Scala backend + MicroAPIs
http://www.infare.com/jobs/
job@infare.com
Want to work with cutting edge
100% Apache Spark projects? We are hiring!!!
1x Network Administrator for Big Data systems
2x DevOps Engineer for Big Data, working with
technologies such as Hadoop, Spark, Kubernetes, OpenStack and more
Thank You!!!
Q/A?
And remember, we are hiring!!!
http://www.infare.com/jobs/
job@infare.com
How to deploy Spark
so it does not backfire
Hardware
Own
+ Fully customizable
+ Cheaper, if you already have enough
OPS capacity, best case scenario 30-40%
cheaper
- Dealing with bandwidth limits
- Dealing with hardware failures
- No on-demand scalability
Hosted
+ Much more failsafe
+ On-demand scalability
+ No burden on current OPS
- Dealing with dependencies
on existing systems
(e.g. inter-data-center
communication)
Data Platform
MapReduce + HDFS/S3
+ Simple platform
+ Can be fully hosted
- Much slower
- Possibly more coding
required, less maintainable
(Java/Pig/Hive)
- Less future oriented
Spark + HDFS/S3
+ More advanced platform
+ Can be fully hosted
+ Possibly less coding
thanks to Scala
+ ML enabled (SparkML,
Python), future oriented
+ MessageBroker enabled
Spark + Cassandra
+ The state of art for
BigData systems
+ Might not need a message
broker (can easily withstand
100k’s of inserts/sec)
+ Amazing future
possibilities
- Cannot (yet) be hosted
- Possibly still needs
HDFS/S3
Spark on Amazon
Option 1: Deployment only
• ~132$/month license per 8-core/64GB RAM
• No spot instances, no support included
• SSH access, self customizable
• Notebook: Zeppelin
• Deployment only, so requires a lot of IT/Unix-related knowledge to get going
Option 2: Out-of-the-box platform (DataBricks)
• ~500$/month license per 8-core/64GB RAM (min 5)
• Spot instances allowed
• Support with debugging
• SSH access (new in 2017), limited customization
• Notebook: DataBricks
• All you need for a Spark system + amazing support; a little pricey but ‘just works’ and worth it
Option 3: Out-of-the-box platform
• ~170$/month license per 8-core/64GB RAM
• Spot instances allowed
• Platform support, SSH access, support in customization
• Notebook: Zeppelin
• Cheap and fully customizable platform, needs more low-level knowledge

More Related Content

What's hot

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1ScyllaDB
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Running Apache Spark Jobs Using Kubernetes
Running Apache Spark Jobs Using KubernetesRunning Apache Spark Jobs Using Kubernetes
Running Apache Spark Jobs Using KubernetesDatabricks
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.JananiJ19
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 

What's hot (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Running Apache Spark Jobs Using Kubernetes
Running Apache Spark Jobs Using KubernetesRunning Apache Spark Jobs Using Kubernetes
Running Apache Spark Jobs Using Kubernetes
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 

Similar to Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day

OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkIvan Morozov
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
End-to-end working of Apache Spark
End-to-end working of Apache SparkEnd-to-end working of Apache Spark
End-to-end working of Apache SparkKnoldus Inc.
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 

Similar to Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day (20)

OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Spark
SparkSpark
Spark
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
End-to-end working of Apache Spark
End-to-end working of Apache SparkEnd-to-end working of Apache Spark
End-to-end working of Apache Spark
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day

  • 12. Spark’s core: the collections • Spark is just a processing framework • Works on distributed collections: • Collections are partitioned • The number of partitions is defined by the source • Collections are lazily evaluated (nothing is done until you request results) • With a Spark collection you only write a ‘recipe’ for what Spark has to do (called the lineage) • Types of collections: • RDDs, just collections of Java objects. Slowest, but most flexible • DataFrames/Datasets, mainly tabular data; nested/structured data is possible but not trivial. Much faster serialization/deserialization, more compact, faster memory management, Spark SQL compatible (see the lazy-evaluation sketch below)
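To make the lazy ‘recipe’/lineage idea above concrete, here is a minimal sketch in the deck’s notebook style; the paths and column names are invented for the example:
%scala
// Nothing runs yet: these lines only build the lineage (the ‘recipe’)
val rdd = sc.textFile("s3://some-bucket/events/*.txt")     // hypothetical path
  .map(_.split(","))
  .filter(_.length > 3)

val df = spark.read.parquet("s3://some-bucket/warehouse/") // hypothetical path
  .select("col1", "col2")
  .filter('col1 === "text")

// Only an action triggers distributed execution of the whole recipe
val lineCount = rdd.count()
val sample    = df.limit(10).collect()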
  • 13. Spark’s core: in-memory map reduce • Spark implements the Map-LocalReduce-LocalShuffle-Shuffle-Reduce paradigm • Each step in the ‘recipe’/lineage is a combination of the above • Why this way? The vast majority of BigData problems can be converted to this paradigm: • All SQL queries/data extracts • In many cases Data Science (modelling)
  • 14. Map-LocalReduce-Shuffle-Reduce: word count across the master and two nodes. %scala: val lirdd = sc.parallelize(loremIpsum.split(" ")); val wordCount = lirdd.map(w => (w, 1)).reduceByKey(_ + _).collect; the equivalent %sql: select word, count(*) as word_count from words group by word. [Diagram: words such as (“lorem”), (“Ipsum”), (“sicut”) are mapped to (word, 1) pairs, locally reduced to per-node counts, shuffled, and reduced to the final counts (“lorem”, 2), (“Ipsum”, 2), (“sicut”, 2).]
  • 15. Map-LocalReduce-Shuffle-Reduce, same example: the slowest part is the shuffle, where data is serialized from objects to BLOBs, sent over the network and deserialized.
  • 16. Map only operations: Spark knows the shuffle is expensive and tries to avoid it if it can. %scala // val rawBytesRDD is defined and contains blobs with serialized Avro objects rawBytesRDD .map(fromAvroToObj) .toDF.write .parquet(outputPath) [Diagram: incoming binary blobs on node 1 and node 2 are mapped to objects partition by partition and written out as one file per partition (File 1..File 6), with no shuffle.]
  • 17. Local Shuffle-Map operations, for fragmented collections (with too many partitions). %scala // val rawBytesRDD is defined and contains blobs with serialized Avro objects rawBytesRDD .coalesce(2) //** .map(fromAvroToObj) .toDF.write .parquet(outputPath) // ** never set it this low, this is just an example  Aim at at least 2x the node count. Moreover, if possible coalesce() or repartition() on the binary blobs. [Diagram: many small incoming blobs on node 1 and node 2 are locally shuffled by coalesce(2) into two partitions, mapped, and written out as File 1 and File 2.]
  • 18. Why Python/PySpark is (generally) slower than Scala • All rows will be serialized between the JVM and Python • There are exceptions • It happens within the same machine, so it is very fast • Nonetheless it is a significant overhead. [Diagram: on each node the map step becomes JVM -> Python serde, do the map in Python, Python -> JVM serde.]
  • 19. Why Python/PySpark is (generally) slower than Scala • In Spark 2.0, with the new version of Catalyst and dynamic code generation, Spark will try to convert Python code to native Spark functions • This means that on some occasions Python may work as fast as Scala, as the Python code is in fact translated into native Spark calls • Catalyst and code generation cannot do this for RDD map operations or for custom UDFs on DataFrames • PRO TIP: avoid using RDDs, as Spark will serialize whole objects. For UDFs it will only serialize the few columns involved, and will do it in a very efficient way (see the Scala sketch below). df2 = df1 .filter(df1.old_column >= 30) .withColumn("new_column1", ((df1.old_column - 2) % 7) + 1) df3 = df2 .withColumn("new_column2", custom_function(df2.new_column1))
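The same contrast sketched in Scala, assuming df1 is the input DataFrame from the snippet above; customFunction is a made-up stand-in for logic that built-in column expressions cannot express:
%scala
import org.apache.spark.sql.functions.udf

// Built-in column expressions: Catalyst can optimise and code-generate these
val df2 = df1
  .filter(df1("old_column") >= 30)
  .withColumn("new_column1", ((df1("old_column") - 2) % 7) + 1)

// A custom UDF: Spark serializes only the input column and calls the function row by row;
// Catalyst cannot look inside it, but this is still far cheaper than an RDD map over whole objects
val customFunction = udf((x: Int) => x * x + 1)   // hypothetical logic
val df3 = df2.withColumn("new_column2", customFunction(df2("new_column1")))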
  • 20. N bn Data Platform design
  • 21. What Infare does Leading provider of Airfare Intelligence Solutions to the Aviation Industry Business Intelligence on flight ticket prices, such that airlines know competitors’ prices and market trends Advanced Analytics and Data Science predicting prices and ticket demand, as well as financial data Collects and processes 1.5+ billion distinct airfares daily
  • 22. What we were supposed to do: multiple Data Collection sources feeding a scalable DataWarehouse (aimed at 500+bn rows), which in turn feeds customer-specific Data Warehouses. *Scalable to billions of rows a day
  • 23. What we need: Data Collection sources feeding scalable fast-access temporary storage, a Processing Framework, and scalable low-cost permanent storage. ALL BigData systems in the world look like that 
  • 24. What we first did : Data Collection sources -> Data Streamer -> Message Broker (Kinesis) holding Avro blobs compressed with Snappy (temporary storage) -> mini/micro batches doing monitoring/stats, real-time analytics and preaggregation -> S3 as permanent storage (uncompressed Parquet micro batches, then a partitioned, aggregated Parquet DataWarehouse) -> offline analytics and Data Science, plus a Monitoring System.
  • 25. Did it work? Same architecture as above, with one catch: S3 has latency (and is inconsistent for deletes).
  • 26. Why DynamoDB was a failure: Spark’s Parallelize hell • No native Spark driver, so no clustered queries • Parallelize in the then-current implementation had a memory leak • DynamoDB historically DID NOT SUPPORT Spark WHATSOEVER; we effectively ended up writing our own Spark driver from scratch, WEEKS of wasted effort • I have to admit that since our initial huge disappointment a year ago Amazon has released a Spark driver, and I do not know how good it is. My opinion is still that a closed-source DB with limited support and usage will always be inferior to other technologies
  • 27. How are we doing it now : Data Collection sources -> Data Streamer -> Message Broker (Avro blobs compressed with Snappy, temporary storage) -> mini/micro batches doing monitoring/stats, real-time analytics and preaggregation -> S3 as permanent storage (uncompressed Parquet micro batches, then a partitioned, aggregated Parquet/ORC DataWarehouse) -> offline analytics and Data Science, plus a Monitoring System. Elasticsearch 5.2 has amazing Spark support; the new Kafka 0.10.2 has great streaming support.
  • 28. Getting maximum out of the Kinesis/Kafka Message Broker: serialize and send micro batches of data, not individual messages. Kinesis: • Messages are max 25kB (if larger, the driver will slice the message into multiple PUT requests) • We Avro-serialize and Snappy-compress data up to max 25kB (~200 data rows per message) • Obtained 10x throughput compared to sending individual rows (each 180 bytes). Kafka: • The max message size is 1MB, but that is very large • Jay Kreps @ LinkedIn researched the optimal message size for Kafka and it is between 10-100kB • His research shows that at those message sizes you can send as much as the hardware/network allows. https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
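A hedged Scala sketch of the batching idea, modelled on the Avro writer shown later in the deck; the batch size of 200, the rows iterator and the sendToBroker call are illustrative stand-ins, not the production code:
%scala
import java.io.ByteArrayOutputStream
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.specific.SpecificDatumWriter

// rows: Iterator[ObjectModelV1Avro]; roughly 200 rows serialize + compress to about 25kB
rows.grouped(200).foreach { batch =>
  val out    = new ByteArrayOutputStream()
  val writer = new DataFileWriter[ObjectModelV1Avro](new SpecificDatumWriter[ObjectModelV1Avro]())
  writer.setCodec(CodecFactory.snappyCodec())  // Snappy-compress the Avro blocks
  writer.create(batch.head.getSchema, out)
  batch.foreach(writer.append)
  writer.close()

  sendToBroker(out.toByteArray)                // hypothetical single PUT of one ~25kB blob
}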
  • 29. Spark Streaming • Creates a stream of MiniBatches, which are RDDs created every n seconds • The Spark driver continuously polls the message broker, by default every 200ms • Each received block (every 200ms) becomes an RDD partition • Consider using repartition/coalesce, as the number of partitions grows very quickly (for a 60 sec batch there will be up to 300 partitions, thus 300 files) • NOTE: in Spark 2.1 they added Structured Streaming (streaming on DataFrames, not RDDs; very cool but still limited in functionality). kstream .repartition(3 * nodeCount) .foreachRDD(rawBytes => { [...] }) [Diagram: Data Collection sources -> Message Broker -> one mini batch every n seconds.]
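For context, a minimal sketch of the surrounding streaming setup; myBrokerReceiver and nodeCount are illustrative stand-ins, and the real stream would come from the Kinesis/Kafka connector:
%scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))     // one mini batch per minute
// spark.streaming.blockInterval (default 200ms) controls how often received data becomes
// a block, i.e. one RDD partition: 60s / 200ms gives up to ~300 partitions per receiver

val kstream = ssc.receiverStream(myBrokerReceiver)  // stand-in for the Kinesis/Kafka connector
kstream
  .repartition(3 * nodeCount)                       // collapse the ~300 small blocks into fewer partitions
  .foreachRDD { rawBytes =>
    // deserialize Avro, write Parquet, update stats, ...
    println(s"batch with ${rawBytes.count()} blobs")
  }

ssc.start()
ssc.awaitTermination()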
  • 30. Default error handling in Spark Streaming • Spark retries the failing MiniBatch task n times (4 by default) • If all retries fail, the streaming job is killed • Conclusion: you must do your own error handling (a config sketch follows below).
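A small, hedged sketch of where that default lives and of a coarse driver-side safety net; processAndStore and reportAndSkipBatch are hypothetical, and the exact failure behaviour depends on the Spark version:
%scala
// spark.task.maxFailures (default 4) is the "retry n times" knob mentioned above;
// it has to be set when the cluster/context is created
val conf = new org.apache.spark.SparkConf()
  .set("spark.task.maxFailures", "8")          // tolerate a few more transient failures per task

// Coarse safety net on the driver: if a batch still fails after all retries,
// decide explicitly whether to log-and-continue or let the streaming job die
kstream.foreachRDD { rdd =>
  try {
    processAndStore(rdd)                       // hypothetical output action (write Parquet, stats, ...)
  } catch {
    case e: Exception => reportAndSkipBatch(e) // hypothetical handler; without it the job stops
  }
}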
  • 31. In-stream error handling, low error rate • For a low error rate, handle each error individually • Open a connection to storage, save the error packet for later processing, close the connection • Will clog the stream for high-error-rate streams. kstream .repartition(concurrency) .foreachRDD(objects => { val res = objects.flatMap(b => { try { // can use Option instead of Seq [do something, rare error] } catch { case e: Exception => [save the error packet] Seq[Obj]() } }) [do something with res] })
  • 32. Advanced error handling, high error rate • For a high error rate, you can’t store each error individually • Unfortunately Spark does not support a Multicast operation (stream splitting). [Diagram: mini batches alongside dedicated error batches.]
  • 33. Advanced error handling, high error rate • Try the high-error-probability action (such as an API request) • Use transform to return Either[String, Obj] • The Either class is like a tuple, but guarantees that only one or the other is present (String for the error, Obj for the success) • cache() to prevent reprocessing • individually process the error and success streams • NOTE: cache() should be used cautiously kstream .repartition(concurrency) .foreachRDD(objects => { val res = objects.map(b => { Try([frequent error]) .transform( { b => Success(Right(b)) }, { e => Success(Left(e.getMessage)) } ).get }) res.cache() // cautious, can be slow res .filter(ei => ei.isLeft) .map(ei => ei.left.get) .[process errors as stream] res .filter(ei => ei.isRight) .map(ei => ei.right.get) .[process successes as stream] })
  • 34. To cache or not to cache • Cache is the most abused function in Spark • It is NOT (!) a simple storing of a pointer to a collection in the memory of the process • It is a SERIALISATION of the data to a BLOB, stored in the cluster’s shared memory • When reusing data which was cached, it has to deserialize the data from the BLOB • By default it uses the generic Java serializer. [Diagram: Job1 runs Step1 -> Step2 -> cache (serialize); Job2 deserializes the cached result and continues with Step3.]
  • 35. Standard Spark Streaming scenario: an incoming Avro stream from the Message Broker; Step 1) stats computation, storing stats; Step 2) storing the data from the stream. Question: will caching make it faster? [Diagram: variant A deserializes the Avro twice, once for computing/saving stats and once for saving the data; variant B deserializes once and caches before the two steps.]
  • 36. Same scenario. Question: will caching make it faster? Nope: the variant without cache is faster.
  • 37. Same scenario. Why: the cached variant adds a Java serialize when caching and a Java deserialize when reusing, which costs more than simply deserializing the Avro twice.
  • 38. To cache or not to cache • Cache is the most abused function in Spark • It is NOT (!) a simple storing of a pointer to a collection in the memory of the process • It is a SERIALISATION of the data to a BLOB, stored in the cluster’s shared memory • When reusing data which was cached, it has to deserialize the data from the BLOB • By default it uses the generic Java serializer, which is SLOW • Even super-fast serdes like Kryo are much slower, as they are generic (the serializer does not know the type at compile time) • Avro is amazingly fast as it is a specific serializer (it knows the type) • Often you will be quicker reprocessing the data from your source than using the cache, especially for complex objects (pure strings/byte arrays cache fast) • Caching is faster with the DataFrames/Tungsten API, but even then it might be slower than reprocessing • PRO TIP: when using cache, make sure it actually helps. And monitor CPU consumption too. [Diagram: Job1 runs Step1 -> Step2 -> cache (serialize); Job2 deserializes the cached result and continues with Step3.]
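If caching does turn out to help, it is worth at least controlling how the data is serialized and stored; a minimal sketch, assuming the Avro contract classes (such as ObjectModelV1Avro from the deck) are on the classpath:
%scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Kryo is faster than the default Java serializer but still generic; registering classes
// at least avoids writing full class names with every record (set when building the cluster/context)
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[ObjectModelV1Avro]))

// Be explicit about the storage level instead of a bare cache(), and release it when done
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
pairs.persist(StorageLevel.MEMORY_ONLY_SER)  // cached as serialized blobs in executor memory
pairs.reduceByKey(_ + _).count()             // job 1 populates the cache
pairs.countByKey()                           // job 2 reuses (and deserializes) the cached blobs
pairs.unpersist()                            // free cluster memory as soon as both jobs are done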
  • 39. Tungsten Memory Manager + Row Serializer • Introduced in Spark 1.5, used in DataFrames/Datasets • Stores data in memory as readable binary blobs, not as Java objects • Since Spark 2.0 the blobs are in a columnar format (much better compression) • Does some black-magic wizardry with the L1/L2/L3 CPU caches • Much faster: 10x-100x faster than RDDs • Rule of thumb: whenever possible use DataFrames and you will get Tungsten case class MyRow(val col: Option[String], val exception: Option[String]) kstream .repartition(concurrency) .foreachRDD(objects => { val res = objects.map(b => { Try([frequent error]).transform( { b => Success(MyRow(Some(b), None))}, { e => Success(MyRow(None, Some(e.getMessage)))} ).get }).toDF() res.cache() res.select("exception") .filter("exception is not null") .[process errors as stream] res.select("col") .filter("exception is null") .[process successes as stream] })
  • 40. Data Warehouse and Messaging
  • 41. Data Storage: Row Store vs Column Store • What DataBases are: collections of objects • Main difference: • A row store requires only one row at a time to serialize • A column store requires a batch of data to serialize • Serialization: • A row store can serialize online (rows can be appended to the binary buffer as they come into the serializer) • A column store requires the whole batch to be present at the moment of serialization, so the data can be processed (index creation, sorting, duplicate removal etc.) • Reading: • A row store always reads all data from a file • A column store allows reading only selected columns
  • 42. JSON/CSV Row Store Pros: • Human readable (do not underestimate that) • No dev time required • Compression algorithms work very well on ASCII text (compressed CSV is ‘only’ 2x larger than compressed Avro) Cons: • Large (CSV) and very large (JSON) volume • Slow serialization/deserialization Overall: worth considering, especially during the dev phase
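Sticking to CSV/JSON during the dev phase costs very little Spark code; a small sketch with made-up paths:
%scala
// Readable, zero-dev-time formats for the dev phase (paths are illustrative)
val devDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")           // convenient in dev, too slow/fragile for production
  .csv("s3://dev-bucket/sample/*.csv.gz")  // compressed CSV is read transparently

devDf.write
  .option("compression", "gzip")
  .json("s3://dev-bucket/debug-dump/")     // easy to eyeball, but bulky and slow to (de)serialize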
  • 43. Avro Row Store Pros: • BLAZING FAST serialization/deserialization • The Apache Avro lib is amazing (buffer-based serde) • Binary/compact storage • Compresses about 70% with Snappy (200 compressed objects with 50 cols result in ~20kb) Cons: • Hard to debug; once a BLOB is corrupt it is very hard to find out what went wrong
  • 44. Avro Scala classes • Avro serialization/deserialization requires an Avro contract compatible with the Apache Avro library • In principle the Avro classes are logically similar to the Parquet classes (definition/field accessors) class ObjectModelV1Avro ( var dummy_id: Long, var jobgroup: Int, var instance_id: Int, [...] ) extends SpecificRecordBase with SpecificRecord { def get(field: Int): AnyRef = { field match { case pos if pos == 0 => { dummy_id }.asInstanceOf[AnyRef] [...] } def put(field: Int, value: Any): Unit = { field match { case pos if pos == 0 => this.dummy_id = { value }.asInstanceOf[Long] [...] } def getSchema: org.apache.avro.Schema = new Schema.Parser().parse("[...]") }
  • 45. Avro real-time serialization • Apache Avro allows serializing on the fly, row by row • An incoming data stream can be serialized on the fly into a binary buffer // C# code byte[] raw; using (var ms = new MemoryStream()) { using (var dfw = DataFileWriter<T> .OpenWriter(_datumWriter, ms, _codec)) { // can be yielded microBatch.ForEach(dfw.Append); } raw = ms.ToArray(); } // Scala val writer: DataFileWriter[T] = new DataFileWriter[T](datumWriter) writer.create(objs(0).getSchema, outputStream) // can be streamed for (obj <- objs) { writer.append(obj) } writer.close val encodedByteArray: Array[Byte] = outputStream.toByteArray
  • 46. Parquet Column Store Pros: • Meant for large data sets • Single-column searchable • Compressed (duplicates eliminated etc.) • Contains in-file stats and metadata located in the TAIL • Very well supported in Spark: • predicate/filter pushdown • VECTORIZED READER AND TUNGSTEN INTEGRATION (5-10x faster than the Java Parquet library) Cons: • Not indexed spark .read .parquet(dwLoc) .filter('col1 === "text") .select("col2")
  • 47. More on predicate/filter pushdown • Processing is separate from the storage • Predicate/filter pushdown gives Spark a uniform way to push the query to the source • Spark remains oblivious to how the driver executes the query; it only cares whether the driver can or cannot execute the pushdown request • If the driver can’t execute the request, Spark will load all the data and filter it in Spark. [Diagram: the Spark DataFrame makes a pushdown request to the Parquet/ORC API, and the Parquet/ORC driver executes the pushed request directly on the binary data in storage.]
  • 48. More on predicate/filter pushdown • Processing is separate from the storage • Predicate/filter pushdown gives Spark a uniform way to push the query to the source • Spark remains oblivious to how the driver executes the query; it only cares whether the driver can or cannot execute the pushdown request • If the driver can’t execute the request, Spark will load all the data and filter it in Spark • Such an abstraction allows easy replacement of the storage; Spark does not care if the storage is S3 files or a DataBase. [Diagram: the Spark DataFrame makes a pushdown request to the API.]
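A quick way to check whether the pushdown request was actually accepted is to inspect the physical plan; a small sketch reusing the deck’s dwLoc and column names:
%scala
// (in a compiled job you would need: import spark.implicits._ for the 'col1 syntax)
spark.read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")
  .explain()
// In the printed physical plan, check the file scan node for
//   PushedFilters: [IsNotNull(col1), EqualTo(col1,text)]
// and a pruned ReadSchema containing only the needed columns;
// if the filter is absent there, Spark is loading the data and filtering it itself.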
  • 49. ORC Column Store Pros: • Meant for large data sets • Single-column searchable • Even better compression than Parquet (20-30% less) • Contains in-file stats, metadata and indexes (3 levels: file, block and every 10k rows) located in the TAIL • Theoretically well supported in Spark: • predicate/filter pushdown • uses indexes for filter-pushdown searches, amazing  Cons: • No vectorized reader; rumors about adding it from Spark 2.2. If this turns out to be true then ORC should be faster spark .read .orc(dwLoc) .filter('col1 === "text") .select("col2")
  • 50. DataWarehouse building • Think of it as a collection of read-only files • Recommended to use Parquet/ORC files in a folder structure (aim at >100-1000Mb files, use coalesce) • Folders are partitions • Spark supports append for Parquet/ORC • Compression: • Use Snappy (decompression speed ~500MB/sec per core) • Gzip (decompression speed ~60MB/sec per core) • Note: Snappy is not splittable, keep files under 1GB • Ordering: if you can (often you cannot), order your data, as then columnar deduplication works better • In our case this saves 50% of space, and thus 50% of reading time df .coalesce(fileCount) .write .option("compression", "snappy") .mode("append") .partitionBy( "year", "month", "day") //.orderBy("some_column") .parquet(outputLoc) //.orc(outputLoc)
  • 51. DataWarehouse query execution dwFolder/year=2016/month=3/day=10/part-[ManyFiles] dwFolder/year=2016/month=3/day=11/part-[ManyFiles] [...] • Partition pruning: Spark will only look for files in the appropriate folders • Row group pruning: uses row-group stats to skip data (if the filtered data is outside the min/max values of the Row Group stats in a Parquet file, the data will be skipped; turned off by default, as it is expensive and only gives a benefit for ordered files) • Reads only col1 and col2 from the file; col1 is used as the filter (never seen by Spark, handled by the API) and col2 is returned to Spark for processing • If the DW is ORC, it will use in-file indexes to speed up the scan (Parquet will still scan through the entire column in each scanned file to filter col1) sqlContext .read .parquet(dwLoc) .filter( 'year === 2016 && 'month === 1 && 'day === 1 && 'col1 === "text") .select("col2") sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
  • 52. DataWarehouse schema evolution the SLOW way • Schema evolution = columns changing over time • Spark allows the schema-on-read paradigm • Only adding columns is allowed • Removing is done by predicate pushdown in SELECT • Renaming is handled in Spark • Each file in the DW (both ORC and Parquet) is schema aware • Each file can have different columns • By default Spark (for speed purposes) assumes all files have the same schema • In order to enable schema merging, manually set a flag during the read • there is a heavy speed penalty for doing this • How to do it: simply append data with different columns to the already existing store sqlContext .read .option( "mergeSchema", "true") .parquet(dwLoc) .filter( 'year === 2016 && 'day === 1 && 'col1 === "text") .select("col2") .withColumnRenamed( "col2", "newAwesomeColumn")
  • 53. DataWarehouse schema evolution the RIGHT way • The much faster way is to create multiple warehouses and merge them by calling UNION • The union requires the columns and types to be the same in ALL dataframes/datawarehouses • The dataframes have to be aligned by adding/renaming columns, using default values etc. • The advantage of doing this is that Spark is now dealing with a small(er) number of datawarehouses, within which it can assume the same types, which can save a massive amount of resources • Spark is smart enough to figure out how to execute partition pruning and filter/predicate pushdown on all unioned warehouses, therefore this is the recommended way val df1 = sqlContext .read.parquet(dwLoc1) .withColumn([...]) val df2 = sqlContext .read.parquet(dwLoc2) .withColumn([...]) val dfs = Seq(df1, df2) val df_union = dfs .reduce(_ union _) // df_union is your queryable // warehouse, including // partitions etc
  • 54. Scala case class limitation • Spark can only automatically build DataFrames from RDDs consisting of case classes • This means that for saving Parquet/ORC you have to use case classes • In Scala 2.10 case classes can have max 22 fields (a limitation not present in Scala 2.11), thus only 22 columns • Case classes implicitly extend the Product type; if you need a DataWarehouse with more than 22 columns, create a POJO class extending the Product type case class MyRow( val col1: Option[String], val col2: Option[String] ) // val rdd: RDD[MyRow] val df = rdd.toDF() df.coalesce(fileCount) .write .parquet(outputLoc)
  • 55. Scala Product Class example class ObjectModelV1 ( var dummy_id: Long, var jobgroup: Int, var instance_id: Int, [...] ) extends java.io.Serializable with Product { def canEqual(that: Any) = that.isInstanceOf[ObjectModelV1] def productArity = 50 def productElement(idx: Int) = idx match { case 0 => dummy_id case 1 => jobgroup case 2 => instance_id [...] } }
  • 56. Scala Avro + Parquet contract combined • The Avro + Parquet contract can be the same class (no inheritance collision) • Saves an unnecessary object conversion/data copy, which at the 5bn-rows range is actually a large cost • Spark Streaming can receive objects as Avro and directly convert them to Parquet/ORC + class ObjectModelV1 ( var dummy_id: Long, var jobgroup: Int, var instance_id: Int, [...] ) extends SpecificRecordBase with SpecificRecord with Serializable with Product { [...] }
  • 57. Summary: the key to high performance • Incremental aggregation/batching at every stage: the C# Kinesis uploader buffers to 25kB blobs and submits stats; the stream processors buffer by 1 min with 12 threads and submit stats; the daily DW build aims at 100-500Mb files and submits stats • Always make sure to have as many write threads as cores in the cluster • Avoid the reduce phase at all costs, avoid shuffle unless you have a good reason** • “If it can wait, do it later in the pipeline” • Use DataFrames whenever possible • When using caching, make sure it actually helps 
  • 58. Want to work with cutting-edge 100% Apache Spark projects? We are hiring!!! • 2x Senior Data Scientist, working with Apache Spark + R/Python doing airfare/price forecasting • 4x Senior Data Platform Engineer, working with an Apache Spark/S3/Cassandra/Scala backend + MicroAPIs • 1x Network Administrator for Big Data systems • 2x DevOps Engineer for Big Data, working on Hadoop, Spark, Kubernetes, OpenStack and more. http://www.infare.com/jobs/ job@infare.com
  • 59. Thank You!!! Q/A? And remember, we are hiring!!! http://www.infare.com/jobs/ job@infare.com
  • 61. How to deploy Spark so it does not backfire
  • 62. Hardware Own + Fully customizable + Cheaper, if you already have enough OPS capacity; best case scenario 30-40% cheaper - Dealing with bandwidth limits - Dealing with hardware failures - No on-demand scalability Hosted + Much more failsafe + On-demand scalability + No burden on current OPS - Dealing with dependencies on existing systems (e.g. inter-data-center communication)
  • 63. Data Platform MapReduce + HDFS/S3 + Simple platform + Can be fully hosted - Much slower - Possibly more coding required, less maintainable (Java/Pig/Hive) - Less future oriented Spark + HDFS/S3 + More advanced platform + Can be fully hosted + Possibly less coding thanks to Scala + ML enabled (SparkML, Python), future oriented + MessageBroker enabled Spark + Cassandra + The state of the art for BigData systems + Might not need a message broker (can easily withstand 100k’s of inserts/sec) + Amazing future possibilities - Cannot (yet) be hosted - Possibly still needs HDFS/S3
  • 64. Spark on Amazon, three options compared: • Option 1: Deployment only; ~132$/month license per 8core/64Gb RAM; no spot instances; no support; incl. SSH access; self-customizable; Zeppelin notebook • Option 2: Out-of-the-box platform; ~500$/month license per 8core/64Gb RAM (min 5); spot instances allowed; support with debug; SSH access (new in 2017); limited customization; DataBricks notebook • Option 3: Out-of-the-box platform; ~170$/month license per 8core/64Gb RAM; spot instances allowed; platform support; SSH access; support in customization; Zeppelin notebook
  • 65. Spark on Amazon, same three options: • Option 1: deployment only, so it requires a lot of IT/Unix-related knowledge to get going • Option 2: all you need for a Spark system + amazing support; a little pricey but ‘just works’ and is worth it • Option 3: a cheap and fully customizable platform, needs more low-level knowledge