SlideShare a Scribd company logo
1 of 30
Juggling with Bits and Bytes
How Apache Flink operates on binary data
Fabian Hueske
fhueske@apache.org @fhueske
1
Big Data frameworks on JVMs
• Many (open source) Big Data frameworks run on JVMs
– Hadoop, Drill, Spark, Hive, Pig, and ...
– Flink as well
• Common challenge: How to organize data in-memory?
– In-memory processing (sorting, joining, aggregating)
– In-memory caching of intermediate results
• Memory management of a system influences
– Reliability
– Resource efficiency, performance & performance predictability
– Ease of configuration
2
The straight-forward approach
Store and process data as objects on the heap
• Put objects in an Array and sort it
A few notable drawbacks
• Predicting memory consumption is hard
– If you fail, an OutOfMemoryError will kill you!
• High garbage collection overhead
– Easily 50% of time spend on GC
• Objects have space overhead
– At least 8 bytes for each (nested) object! (Depends on arch)
3
FLINK’S APPROACH
4
Flink adopts DBMS technology
• Allocates fixed number of memory segments upfront
• Data objects are serialized into memory segments
• DBMS-style algorithms work on binary representation
5
Why is that good?
• Memory-safe execution
– Used and available memory segments are easy to count
• Efficient out-of-core algorithms
– Memory segments can be efficiently written to disk
• Reduced GC pressure
– Memory segments are never deallocated
– Data objects are short-lived or reused
• Space-efficient data representation
• Efficient operations on binary data
6
What does it cost?
• Significant implementation investment
– Using java.util.HashMap
vs.
– Implementing a spillable hash table backed by byte arrays
and custom serialization stack
• Other systems use similar techniques
– Apache Drill, Apache Ignite, Apache Geode
• Apache Spark plans to evolve into a similar direction
7
MEMORY ALLOCATION
8
Memory segments
• Unit of memory distribution in Flink
– Fixed number allocated when worker starts
• Backed by a regular byte array (default 32KB)
• R/W access through Java’s efficient unsafe methods
• Multiple memory segments can be concatenated to
a larger chunk of memory
9
Memory allocation
10
DATA SERIALIZATION
11
Custom de/serialization stack
• Many alternatives for Java object serialization
– Kryo, Apache Avro, Apache Thrift, Protobufs, …
• But Flink has its own serialization stack
– Operating on serialized data requires knowledge of layout
– Control over layout can improve efficiency of operations
– Data types are known before execution
12
Rich & extensible type system
• Serialization framework requires knowledge of types
• Flink analyzes return types of functions
– Java: Reflection based type analyzer
– Scala: Compiler information
• Rich type system
– Atomics: Primitives, Writables, Generic types, …
– Composites: Tuples, Pojos, CaseClasses
– Extensible by custom types
13
Serializers & comparators
• All types have dedicated de/serializers
– Primitives are natively serialized
– Writables use their own serialization functions
– Generic types use Kryo
– …
• Serialization goes automatically through Java unsafe
• Comparators compare and hash objects
– On binary representation if possible
• Composite serializers and comparators delegate to
serializers and comparators of member types
14
Serializing a Tuple3<Integer, Double, Person>
15
OPERATING ON BINARY DATA
16
Data Processing Algorithms
• Flink’s algorithms are based on RDBMS technology
– External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
• Algorithms receive a budget of memory segments
• Operate in-memory as long as data fits into budget
– And gracefully spill to disk if data exceeds memory
17
In-Memory Sort – Fill the Sort Buffer
18
In-Memory Sort – Sort the Buffer
19
In-Memory Sort – Read Sorted Buffer
20
SHOW ME NUMBERS!
21
Sort benchmark
• Task: Sort 10 million Tuple2<Integer, String> records
– String length 12 chars
• Tuple has 16 Bytes of raw data
• ~152 MB raw data
– Integers uniformly, Strings long-tail distributed
– Sort on Integer field and on String field
• Input provided as mutable object iterator
• Use JVM with 900 MB heap size
– Minimum size to reliable run the benchmark
22
Sorting methods
1. Objects-on-Heap:
– Put cloned data objects in ArrayList and use Java’s Collection sort.
– ArrayList is initialized with right size.
2. Flink-serialized:
– Using Flink’s custom serializers.
– Integer with full binary sorting key, String with 8 byte prefix key.
3. Kryo-serialized:
– Serialize fields with Kryo.
– No binary sorting keys, objects are deserialized for comparison.
• All implementations use a single thread
• Average execution time of 10 runs reported
• GC triggered between runs (does not go into time)
23
Execution time
24
Garbage collection and heap usage
25
Objects-on-heap
Flink-serialized
Memory usage
26
• Breakdown: Flink serialized - Sort Integer
– 4 bytes Integer
– 12 bytes String
– 4 bytes String length
– 4 bytes pointer
– 4 bytes Integer sorting key
– 28 bytes * 10M records = 267 MB
Object-on-heap Flink-serialized Kryo-serialized
Sort Integer Approx. 700 MB 277 MB 266 MB
Sort String Approx. 700 MB 315 MB 266 MB
WHAT’S NEXT?
27
We’re not done yet!
• Move memory segments to off-heap memory
– Smaller JVM, lower GC pressure, easier configuration
• Table API provides full semantics for execution
– Use code generation to operate fully on binary data
• Serialization layouts tailored towards operations
– More efficient operations on binary data
• …
28
Summary
• Active memory management avoids OOMErrors.
• Highly efficient data serialization stack
– Facilitates operations on binary data
– Makes more data fit into memory
• DBMS-style operators operate on binary data
– High performance in-memory processing
– Graceful destaging to disk if necessary
• Read the full story:
http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
29
30
http://flink.apache.org @ApacheFlink
Apache Flink

More Related Content

What's hot

Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsStephan Ewen
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingFlink Forward
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestDataGyula Fóra
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingFlink Forward
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkRobert Metzger
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Stephan Ewen
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Apache Flink Taiwan User Group
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Till Rohrmann
 
Stephan Ewen - Running Flink Everywhere
Stephan Ewen - Running Flink EverywhereStephan Ewen - Running Flink Everywhere
Stephan Ewen - Running Flink EverywhereFlink Forward
 

What's hot (20)

Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Apache flink
Apache flinkApache flink
Apache flink
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016
 
Stephan Ewen - Running Flink Everywhere
Stephan Ewen - Running Flink EverywhereStephan Ewen - Running Flink Everywhere
Stephan Ewen - Running Flink Everywhere
 

Viewers also liked

Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkFabian Hueske
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesFabian Hueske
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Fabian Hueske
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityFabian Hueske
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Fabian Hueske
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processingTimo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processingVerverica
 
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015Till Rohrmann
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Till Rohrmann
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFlink Forward
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Ververica
 
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...Ververica
 
Eron Wright - Flink Security Enhancements
Eron Wright - Flink Security EnhancementsEron Wright - Flink Security Enhancements
Eron Wright - Flink Security EnhancementsFlink Forward
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkVerverica
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsVerverica
 
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...Till Rohrmann
 

Viewers also liked (16)

Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated Queries
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processingTimo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
 
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and Bytes
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
 
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
 
Eron Wright - Flink Security Enhancements
Eron Wright - Flink Security EnhancementsEron Wright - Flink Security Enhancements
Eron Wright - Flink Security Enhancements
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIs
 
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
 

Similar to Juggling with Bits and Bytes - How Apache Flink operates on binary data

ModeShape 3 overview
ModeShape 3 overviewModeShape 3 overview
ModeShape 3 overviewRandall Hauch
 
Overview of the ehcache
Overview of the ehcacheOverview of the ehcache
Overview of the ehcacheHyeonSeok Choi
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the CloudInes Sombra
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Ryan Cuprak
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to ApexSujit Kumar
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise Group
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
hibernateormfeatures-140223193044-phpapp02.pdf
hibernateormfeatures-140223193044-phpapp02.pdfhibernateormfeatures-140223193044-phpapp02.pdf
hibernateormfeatures-140223193044-phpapp02.pdfPatiento Del Mar
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesBrett Meyer
 
Cassandra
CassandraCassandra
Cassandraexsuns
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in JavaRuben Badaró
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?DoiT International
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)srisatish ambati
 
Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014Charles Nutter
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-researchsaintdevil163
 

Similar to Juggling with Bits and Bytes - How Apache Flink operates on binary data (20)

Solr
SolrSolr
Solr
 
ModeShape 3 overview
ModeShape 3 overviewModeShape 3 overview
ModeShape 3 overview
 
Overview of the ehcache
Overview of the ehcacheOverview of the ehcache
Overview of the ehcache
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to Apex
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet app
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
hibernateormfeatures-140223193044-phpapp02.pdf
hibernateormfeatures-140223193044-phpapp02.pdfhibernateormfeatures-140223193044-phpapp02.pdf
hibernateormfeatures-140223193044-phpapp02.pdf
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
 
Cassandra
CassandraCassandra
Cassandra
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
 
Tech4Africa 2014
Tech4Africa 2014Tech4Africa 2014
Tech4Africa 2014
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
 
Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-research
 

Recently uploaded

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 

Recently uploaded (20)

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 

Juggling with Bits and Bytes - How Apache Flink operates on binary data

  • 1. Juggling with Bits and Bytes How Apache Flink operates on binary data Fabian Hueske fhueske@apache.org @fhueske 1
  • 2. Big Data frameworks on JVMs • Many (open source) Big Data frameworks run on JVMs – Hadoop, Drill, Spark, Hive, Pig, and ... – Flink as well • Common challenge: How to organize data in-memory? – In-memory processing (sorting, joining, aggregating) – In-memory caching of intermediate results • Memory management of a system influences – Reliability – Resource efficiency, performance & performance predictability – Ease of configuration 2
  • 3. The straight-forward approach Store and process data as objects on the heap • Put objects in an Array and sort it A few notable drawbacks • Predicting memory consumption is hard – If you fail, an OutOfMemoryError will kill you! • High garbage collection overhead – Easily 50% of time spend on GC • Objects have space overhead – At least 8 bytes for each (nested) object! (Depends on arch) 3
  • 5. Flink adopts DBMS technology • Allocates fixed number of memory segments upfront • Data objects are serialized into memory segments • DBMS-style algorithms work on binary representation 5
  • 6. Why is that good? • Memory-safe execution – Used and available memory segments are easy to count • Efficient out-of-core algorithms – Memory segments can be efficiently written to disk • Reduced GC pressure – Memory segments are never deallocated – Data objects are short-lived or reused • Space-efficient data representation • Efficient operations on binary data 6
  • 7. What does it cost? • Significant implementation investment – Using java.util.HashMap vs. – Implementing a spillable hash table backed by byte arrays and custom serialization stack • Other systems use similar techniques – Apache Drill, Apache Ignite, Apache Geode • Apache Spark plans to evolve into a similar direction 7
  • 9. Memory segments • Unit of memory distribution in Flink – Fixed number allocated when worker starts • Backed by a regular byte array (default 32KB) • R/W access through Java’s efficient unsafe methods • Multiple memory segments can be concatenated to a larger chunk of memory 9
  • 12. Custom de/serialization stack • Many alternatives for Java object serialization – Kryo, Apache Avro, Apache Thrift, Protobufs, … • But Flink has its own serialization stack – Operating on serialized data requires knowledge of layout – Control over layout can improve efficiency of operations – Data types are known before execution 12
  • 13. Rich & extensible type system • Serialization framework requires knowledge of types • Flink analyzes return types of functions – Java: Reflection based type analyzer – Scala: Compiler information • Rich type system – Atomics: Primitives, Writables, Generic types, … – Composites: Tuples, Pojos, CaseClasses – Extensible by custom types 13
  • 14. Serializers & comparators • All types have dedicated de/serializers – Primitives are natively serialized – Writables use their own serialization functions – Generic types use Kryo – … • Serialization goes automatically through Java unsafe • Comparators compare and hash objects – On binary representation if possible • Composite serializers and comparators delegate to serializers and comparators of member types 14
  • 15. Serializing a Tuple3<Integer, Double, Person> 15
  • 17. Data Processing Algorithms • Flink’s algorithms are based on RDBMS technology – External Merge Sort, Hybrid Hash Join, Sort Merge Join, … • Algorithms receive a budget of memory segments • Operate in-memory as long as data fits into budget – And gracefully spill to disk if data exceeds memory 17
  • 18. In-Memory Sort – Fill the Sort Buffer 18
  • 19. In-Memory Sort – Sort the Buffer 19
  • 20. In-Memory Sort – Read Sorted Buffer 20
  • 22. Sort benchmark • Task: Sort 10 million Tuple2<Integer, String> records – String length 12 chars • Tuple has 16 Bytes of raw data • ~152 MB raw data – Integers uniformly, Strings long-tail distributed – Sort on Integer field and on String field • Input provided as mutable object iterator • Use JVM with 900 MB heap size – Minimum size to reliable run the benchmark 22
  • 23. Sorting methods 1. Objects-on-Heap: – Put cloned data objects in ArrayList and use Java’s Collection sort. – ArrayList is initialized with right size. 2. Flink-serialized: – Using Flink’s custom serializers. – Integer with full binary sorting key, String with 8 byte prefix key. 3. Kryo-serialized: – Serialize fields with Kryo. – No binary sorting keys, objects are deserialized for comparison. • All implementations use a single thread • Average execution time of 10 runs reported • GC triggered between runs (does not go into time) 23
  • 25. Garbage collection and heap usage 25 Objects-on-heap Flink-serialized
  • 26. Memory usage 26 • Breakdown: Flink serialized - Sort Integer – 4 bytes Integer – 12 bytes String – 4 bytes String length – 4 bytes pointer – 4 bytes Integer sorting key – 28 bytes * 10M records = 267 MB Object-on-heap Flink-serialized Kryo-serialized Sort Integer Approx. 700 MB 277 MB 266 MB Sort String Approx. 700 MB 315 MB 266 MB
  • 28. We’re not done yet! • Move memory segments to off-heap memory – Smaller JVM, lower GC pressure, easier configuration • Table API provides full semantics for execution – Use code generation to operate fully on binary data • Serialization layouts tailored towards operations – More efficient operations on binary data • … 28
  • 29. Summary • Active memory management avoids OOMErrors. • Highly efficient data serialization stack – Facilitates operations on binary data – Makes more data fit into memory • DBMS-style operators operate on binary data – High performance in-memory processing – Graceful destaging to disk if necessary • Read the full story: http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html 29