Java Memory Analysis:
Problems and Solutions
How Java apps
(ab)use memory
Assume that at the high level your data
is represented efficiently
• Data doesn’t sit in memory for longer than needed
• No unnecessary duplicate data structures

- E.g. don’t keep same objects in both a List and a Set
• Data structures are appropriate

- E.g. don’t use ConcurrentHashMap when no concurrency
• Data format is appropriate

- E.g. don’t use Strings for int/double numbers

Main sources of memory waste
(from bottom to top level)
• JVM internal object implementation
• Inefficient common data structures

- Collections

- Boxed numbers
• Data duplication - often biggest overhead
• Memory leaks
Internal Object Format in HotSpot JVM
Internal Object Format: Alignment
• To enable 4-byte pointers (compressedOops) with
>4G heap, objects are 8-byte aligned
• Thus, for example:

- java.lang.Integer effective size is 16 bytes

(12b header + 4b int)

- java.lang.Long effective size is 24 bytes - not 20!

(12b header + 8b long + 4b padding)
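The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming a 64-bit HotSpot JVM with compressedOops (12-byte header, 8-byte alignment); the class and method names are illustrative and not part of any JDK API.

```java
// Hypothetical helper: estimates an object's size under the assumptions
// stated above (12-byte header, 8-byte alignment). Not a JDK API.
final class ObjectSizeEstimate {
    static final int HEADER_BYTES = 12;   // assumed: 64-bit HotSpot with compressedOops
    static final int ALIGNMENT_BYTES = 8; // objects start on 8-byte boundaries

    // Rounds header + field bytes up to the next multiple of the alignment.
    static int alignedSize(int fieldBytes) {
        int raw = HEADER_BYTES + fieldBytes;
        return (raw + ALIGNMENT_BYTES - 1) / ALIGNMENT_BYTES * ALIGNMENT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("Integer: " + alignedSize(Integer.BYTES)); // 12 + 4
        System.out.println("Long:    " + alignedSize(Long.BYTES));    // 12 + 8, then padded
    }
}
```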
Summary: small objects are bad
• A small object’s overhead is up to 400% of its
payload (the useful data it holds)
• There are apps with up to 40% of the heap wasted
due to this
• See if you can change your code to “consolidate”
objects or put their contents into flat arrays
• Avoid heap size > 32G! (really ~30G)

- Unless your data is mostly int[], byte[] etc.

- Reason: above this limit compressedOops is disabled and every pointer takes 8 bytes
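One way to "consolidate" small objects is to keep their fields in a primitive array instead of allocating one object per record. A minimal sketch; the class FlatPoints and its layout are hypothetical, chosen only for illustration.

```java
// Hypothetical example: N (x, y) pairs stored in one int[] instead of
// N separate point objects, so there is one object header instead of N.
final class FlatPoints {
    private final int[] xy; // layout: x0, y0, x1, y1, ...

    FlatPoints(int count) {
        xy = new int[2 * count];
    }

    void set(int index, int x, int y) {
        xy[2 * index] = x;
        xy[2 * index + 1] = y;
    }

    int x(int index) { return xy[2 * index]; }
    int y(int index) { return xy[2 * index + 1]; }
    int size() { return xy.length / 2; }
}
```

Under the 12-byte header and 8-byte alignment assumed earlier, a separate object with two int fields would take 24 bytes plus a 4-byte pointer to it, versus 8 bytes per pair in the flat layout.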
Common Collections
• JDK: java.util.ArrayList, java.util.HashMap,
java.util.concurrent.ConcurrentHashMap etc.
• Third-party - mainly Google:
com.google.common.collect.*
• Scala has its own equivalent of JDK collections
• JDK collections are nothing magical

- Written in Java, easy to load and read in IDE
ArrayList Internals
HashMap Internals
Memory Footprint of JDK Collections
• JDK pays not much attention to memory footprint

- Just some optimizations for empty ArrayLists and HashMaps

- ConcurrentHashMap and some Google collections are the
worst “memory hogs”
• Memory is wasted due to:

- Default capacity of the internal array (10 for ArrayList, 16 for HashMap)
is too high for small collections. It never shrinks after initialization.

- $Entry objects used by all Maps take at least 32b each! 

- Sets just reuse Map structure, no footprint optimization
Boxed numbers - related to collections
• java.lang.Integer, java.lang.Double etc.
• Were introduced mainly to avoid creating
specialized classes like IntToObjectHashMap
• However, proven to be extremely wasteful:

- Single int takes 4b. java.lang.Integer effective size is 16b
(12b header + 4b int), plus 4b pointer to it

- Single long takes 8b. java.lang.Long effective size is 24b
(12b header + 8b long + 4b padding), plus 4b pointer to it
JDK Collections: Summary
• Initialized, but empty collections waste memory
• Things like HashMap<Object, Integer> are bad
• HashMap$Entry etc. may take up to 30% of memory
• Some third-party libraries provide alternatives

- In particular, fastutil.di.unimi.it (University of Milan, Italy)

- Has Object2IntOpenHashMap, Long2ObjectOpenHashMap,
Int2DoubleOpenHashMap, etc. - no boxed numbers

- Has Object2ObjectOpenHashMap : no $Entry objects
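To show why such maps are smaller, here is a deliberately simplified open-addressing map from int to int. It is a hypothetical sketch (fixed capacity, no removal, no resizing) and not fastutil's implementation; it only illustrates that keys and values can live in plain arrays, with no $Entry objects and no boxing.

```java
// Hypothetical, simplified open-addressing map from int keys to int values.
// Illustration only: fixed capacity, no removal, no resizing.
final class IntIntOpenMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private int size;

    IntIntOpenMap(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Linear probing: returns the slot holding key, or the first free slot.
    // Terminates because put() always leaves at least one slot free.
    private int slotFor(int key) {
        int slot = Math.floorMod(key * 0x9E3779B9, keys.length);
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) % keys.length;
        }
        return slot;
    }

    void put(int key, int value) {
        int slot = slotFor(key);
        if (!used[slot]) {
            if (size == keys.length - 1) {
                throw new IllegalStateException("map is full");
            }
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int getOrDefault(int key, int defaultValue) {
        int slot = slotFor(key);
        return used[slot] ? values[slot] : defaultValue;
    }

    int size() { return size; }
}
```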
Data Duplication
• Can happen for many reasons:

- s = s1 + s2 or s = s.toUpperCase() etc. always
generates a new String object

- intObj = new Integer(intScalar) always generates
a new Integer object

- Duplicate byte[] buffers in I/O, serialization, etc.
• Very hard to detect without tooling

- Small amount of duplication is inevitable

- 20-40% waste is not uncommon in unoptimized apps
• Duplicate Strings are most common and easy to fix
Dealing with String duplication
• Use tooling to determine where dup strings are either

- generated, e.g. s = s.toUpperCase();

- permanently attached, e.g. this.name = name;
• Use String.intern() to de-duplicate

- Uses a JVM-internal fast, scalable canonicalization hashtable

- Table is fixed and preallocated - no extra memory overhead 

- Small CPU overhead is normally offset by reduced GC time
and improved cache locality
• s = s.toUpperCase().intern();

this.name = name.intern(); …
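The effect of String.intern() can be seen in a small experiment. A minimal sketch; the class and method names are made up for illustration.

```java
// Demonstrates that equal strings built at run time are distinct objects
// until they are canonicalized with String.intern().
final class InternDemo {
    // Run-time concatenation creates a new String object on every call.
    static String build(String prefix, int id) {
        return prefix + id;
    }

    public static void main(String[] args) {
        String a = build("user-", 42);
        String b = build("user-", 42);
        System.out.println(a.equals(b));              // compares contents
        System.out.println(a == b);                   // compares object identity
        System.out.println(a.intern() == b.intern()); // both map to one canonical object
    }
}
```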
Other duplicate data
• Can be almost anything. Examples:

- Timestamp objects

- Partitions (with HashMaps and ArrayLists) in Apache Hive

- Various byte[], char[] etc. data buffers everywhere
• So far, no convenient tooling for automatic
detection of arbitrary duplicate objects
• But one can often guess correctly

- Just look at classes that take most memory…
Dealing with non-string duplicates
• Use WeakHashMap to store canonicalized objects

- com.google.common.collect.Interner wraps a
(Weak)HashMap
• For big data structures, interning may cause some
CPU performance impact

- Interning calls hashCode() and equals()

- GC time reduction would likely offset this
• If duplicate objects are mutable, like HashMap…

- May need CopyOnFirstChangeHashMap, etc.
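A WeakHashMap-based canonicalization table could look roughly like this. It is a minimal sketch under two assumptions: the objects are immutable (or at least never modified after interning), and they implement equals() and hashCode() consistently. WeakInterner is a hypothetical name, not a JDK or Guava class.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical interner. Values are wrapped in WeakReference because a
// WeakHashMap value that strongly references its own key would keep the
// entry alive forever.
final class WeakInterner<T> {
    private final Map<T, WeakReference<T>> pool = new WeakHashMap<>();

    // Returns the canonical instance equal to value, registering value
    // as canonical if no equal instance is currently in the pool.
    synchronized T intern(T value) {
        WeakReference<T> ref = pool.get(value);
        T canonical = (ref == null) ? null : ref.get();
        if (canonical != null) {
            return canonical;
        }
        pool.put(value, new WeakReference<>(value));
        return value;
    }
}
```

Usage follows the String.intern() pattern: this.field = interner.intern(field);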
Duplicate Data: Summary
• Duplicate data may cause huge memory waste

- Observed up to 40% overhead in unoptimized apps
• Duplicate Strings are easy to

- Detect (but need tooling to analyze a heap dump)

- Get rid of - just use String.intern()
• Other kinds of duplicate data more difficult to find

- But it’s worth the effort!

- Mutable duplicate data is more difficult to deal with
Memory Leaks
• Unlike C++, Java doesn’t have real leaks. What gets called a leak is:

- Data that’s not used anymore, but not released

- Too much persistent data cached in memory
• No reliable way to distinguish leaked data…

- But any data structure that just keeps growing is bad
• So, just pay attention to the biggest (and growing)
data structures

- Heap dump: see which GC root(s) hold most memory

- Runtime profiling can be more accurate, but more expensive
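For the "too much persistent data cached in memory" case, one common mitigation (not something these slides prescribe) is to put an upper bound on the cache. A minimal sketch using java.util.LinkedHashMap; BoundedCache is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: evicts the least recently accessed entry
// once maxEntries is exceeded, so the structure cannot grow without limit.
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = keep entries in access order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```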
JXRay Memory
Analysis Tool
What is it
• Offline heap analysis tool

- Runs once on a given heap dump, produces a text report
• Simple command-line interface: 

- Just one jar + .sh script

- No complex installation

- Can run anywhere (laptop or remote headless machine)

- Needs JDK 8
• See http://www.jxray.com for more info
JXRay: main features
• Shows you what occupies the heap

- Object histogram: which objects take most memory

- Reference chains: which GC roots/data structures keep
biggest object “lumps” in memory
• Shows you where memory is wasted

- Object headers

- Duplicate Strings

- Bad collections (empty; 1-element; small (2-4 element))

- Bad object arrays (empty (all nulls); length 0 or 1; single non-null element)

- Boxed numbers

- Duplicate primitive arrays (e.g. byte[] buffers)
Keeping results succinct
• No GUI - generates a plain text report

- Easy to save and exchange

- Small: ~50K regardless of the dump size

- Details a given problem once its overhead is above
threshold (by default 0.1% of used heap)
• Knows about internals of most standard collections

- More compact/informative representation
• Aggregates reference chains from GC roots to
problematic objects
Reference chain aggregation:
assumptions
• A problem is important if many objects have it

- E.g. 1000s/1,000,000s of duplicate strings
• Usually there are not too many places in the code
responsible for such a problem

- Foo(String s) {

this.s = s.toUpperCase(); …

}

- Bar(String s1, String s2) {

this.s = s1 + s2; …

}
Reference chain aggregation: what is it
• In the heap, we may have e.g.

Baz.stat1 -> HashMap@243 -> ArrayList@650 -> Foo.s = “xyz”

Baz.stat2 -> LinkedList@798 -> HashSet@134 -> Bar.s = “0”

Baz.stat1 -> HashMap@529 -> ArrayList@351 -> Foo.s = “abc”

Baz.stat2 -> LinkedList@284 -> HashSet@960 -> Bar.s = “1”

… 1000s more chains like this
• JXRay aggregates them all into just two lines:

Baz.stat1 -> {HashMap} -> {ArrayList} -> Foo.s (“abc”, “xyz” and

3567 more dup strings)

Baz.stat2 -> {LinkedList} -> {HashSet} -> Bar.s (“0”, “1” and …)
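The idea can be illustrated on textual chains like the ones above. This is only a sketch of the principle, not JXRay's actual algorithm, which works on the heap dump's object graph; ChainAggregator and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: collapses chains that differ only in object
// ids and field values into one aggregated chain with a count.
final class ChainAggregator {
    // Turns "HashMap@243" into "{HashMap}" and drops the trailing field value.
    static String normalize(String chain) {
        return chain.replaceAll("\\s*=.*$", "")
                    .replaceAll("(\\w+)@\\d+", "{$1}");
    }

    // Counts how many concrete chains collapse into each aggregated chain.
    static Map<String, Integer> aggregate(List<String> chains) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String chain : chains) {
            counts.merge(normalize(chain), 1, Integer::sum);
        }
        return counts;
    }
}
```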
Treating collections specially
• Object histogram: standard vs JXRay view

HashMap$Entry 21500 objs 430K

HashMap$Entry[] 3200 objs 180K

HashMap 3200 objs 150K 

vs

{HashMap} 3200 objs 760K
• Reference chains:

Foo <- HashMap$Entry.value <- HashMap$Entry[] <-

<- HashMap <- Object[] <- ArrayList <- rootX

vs

Foo <- {HashMap.values} <- {ArrayList} <- rootX
Bad collections
• Empty: no elements at all

- Is it used at all? If yes, allocate lazily.
• 1-element

- Always has only 1 element - replace with object

- Almost always has 1 element - solution more complex.
Switch between Object and collection/array lazily.
• Small: 2..4 elements

- Consider smaller initial capacity

- Consider replacing with a plain array
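Lazy allocation with a smaller initial capacity might look like this. A minimal sketch; TaggedItem and its field are hypothetical, and the capacity of 2 is an assumed typical size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class: the list is only allocated when the first element
// arrives, and with a small initial capacity instead of the default.
final class TaggedItem {
    private List<String> tags; // stays null while empty

    void addTag(String tag) {
        if (tags == null) {
            tags = new ArrayList<>(2); // assumed typical size; tune per workload
        }
        tags.add(tag);
    }

    // Callers never see null: an empty shared list stands in for "no tags".
    List<String> tags() {
        return tags == null ? Collections.<String>emptyList()
                            : Collections.unmodifiableList(tags);
    }
}
```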
Bad object arrays
• Empty: only nulls

- Same as empty collections - delete or allocate lazily
• Length 0

- Replace with a singleton zero-length array
• Length 1

- Replace with an object?
• Single non-null element

- Replace with an object? Reduce length?
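The zero-length case is the simplest to fix. A minimal sketch with hypothetical names: a zero-length array has no elements to modify, so one instance can be shared by all callers.

```java
import java.util.List;

final class EmptyArrays {
    // One shared zero-length array for the whole application.
    static final String[] NO_STRINGS = new String[0];

    // Returns the shared instance instead of allocating a new empty array.
    static String[] toArray(List<String> items) {
        return items.isEmpty() ? NO_STRINGS : items.toArray(new String[0]);
    }
}
```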
Memory Analysis and
Reducing Footprint:
concrete cases
A Monitoring app
• Scalability wasn’t great

- Some users had to increase -Xmx again and again.

- Unclear how to choose the correct size
• Big heap -> long full GC pauses -> frozen UI
• Some OOMs in small clusters

- Not a scale problem - a bug?
Investigation, part 1
• Started with the smaller dumps with OOMs

- Immediately found duplicate strings

- One string repeated 1000s times used 90% of the heap 

- Long SQL query saved in DB many times, then retrieved

- Adding two String.intern() calls solved the problem... almost
• Duplicate byte[] buffers in a 3rd-party library code

- That still caused noticeable overhead

- Ended up limiting saved query size at high level

- Library/auto-gen code may be difficult to change…
Investigation, part 2
• Next, looked into heap dumps with scalability
problems

- Both real and artificial benchmark setup
• Found all the usual issues

- String duplication

- Empty or small (1-4 elements) collections

- Tons of small objects (object headers used 31% of heap!)

- Boxed numbers
Standard solutions applied
• Duplicate strings: add more String.intern() calls

- Easy: check jxray report, find what data structures reference
bad strings, edit code

- Non-trivial when a String object is mostly managed by auto-
generated code
• Bad collections: less trivial

- Sometimes it’s enough to replace new HashMap() with new
HashMap(expectedSize)

- Found ArrayLists that almost always have size 0/1
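One detail when presizing: HashMap's constructor argument is a table capacity, not an element count. With the default load factor of 0.75, passing the element count directly can still trigger one resize. A minimal sketch with hypothetical helper names:

```java
import java.util.HashMap;
import java.util.Map;

final class Presized {
    // Smallest capacity whose resize threshold (capacity * 0.75) is not
    // exceeded by expectedSize entries. HashMap itself then rounds the
    // capacity up to a power of two.
    static int capacityFor(int expectedSize) {
        return (int) Math.ceil(expectedSize / 0.75);
    }

    static <K, V> Map<K, V> newHashMap(int expectedSize) {
        return new HashMap<>(capacityFor(expectedSize));
    }
}
```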
Dealing with mostly 0/1-size ArrayLists
• Replaced ArrayList list; -> Object valueOrArray;
• Depending on the situation, valueOrArray may

- be null

- point to a single object (element)

- point to an array of objects (elements)
• ~70 LOC hand-written for this

- But memory savings were worth the effort
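A condensed sketch of the valueOrArray idea follows. It is not the original ~70 LOC; CompactList is a hypothetical name, and the sketch assumes elements are never null and never arrays themselves, since the field's runtime type is what distinguishes the three states.

```java
import java.util.Arrays;

// Hypothetical holder: the single field is null (no elements), the element
// itself (one element), or an Object[] (two or more elements).
final class CompactList {
    private Object valueOrArray;

    void add(Object element) {
        if (element == null || element instanceof Object[]) {
            throw new IllegalArgumentException("null and array elements are not supported");
        }
        if (valueOrArray == null) {
            valueOrArray = element;                      // 0 -> 1 element
        } else if (valueOrArray instanceof Object[]) {
            Object[] old = (Object[]) valueOrArray;      // n -> n + 1 elements
            Object[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = element;
            valueOrArray = grown;
        } else {
            valueOrArray = new Object[] { valueOrArray, element }; // 1 -> 2 elements
        }
    }

    int size() {
        if (valueOrArray == null) return 0;
        if (valueOrArray instanceof Object[]) return ((Object[]) valueOrArray).length;
        return 1;
    }

    Object get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (valueOrArray instanceof Object[])
                ? ((Object[]) valueOrArray)[index]
                : valueOrArray;
    }
}
```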
Dealing with non-string duplicate data
• Heap contained a lot of small objects

class TimestampAndData {

long timestamp;

long value; 

… }
• Guessed that there may be many duplicates

- E.g. many values are just 0/1
• Added a simple canonicalization cache. Result:

- 8x fewer TimestampAndData objects

- 16% memory savings
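A simple canonicalization cache for such a class could be sketched as follows. Assumptions, all hypothetical: the class has only the two fields shown (the slide elides the rest), instances are made immutable so sharing is safe, and the cache holds strong references, so it is only suitable when the set of distinct values is bounded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable variant of the class, with a factory method that
// returns one shared instance per distinct (timestamp, value) pair.
final class TimestampAndData {
    private static final Map<TimestampAndData, TimestampAndData> CACHE = new HashMap<>();

    final long timestamp;
    final long value;

    private TimestampAndData(long timestamp, long value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    // Returns the cached instance if an equal one exists, else caches this one.
    static synchronized TimestampAndData of(long timestamp, long value) {
        TimestampAndData candidate = new TimestampAndData(timestamp, value);
        TimestampAndData existing = CACHE.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimestampAndData)) return false;
        TimestampAndData other = (TimestampAndData) o;
        return timestamp == other.timestamp && value == other.value;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(timestamp) + Long.hashCode(value);
    }
}
```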
A Monitoring app: conclusions
• Fixing string/other data duplication, boxed nums,
small/empty collections: together saved ~50%

- Depends on the workload

- Scalability improved: more data - higher savings
• Can still save more - replace standard HashMaps
with more memory-friendly maps

- HashMap$Entry objects may take a lot of memory!
Apache Hive: Hive Server 2 (HS2)
• HS2 may run out of memory
• Most scenarios involve 1000s of partitions and 10s
of concurrent queries
• Not many heap dumps from real users
• Create a benchmark which reproduces the
problem, measure where memory goes, optimize
Experimental setup
• Created a Hive table with 2000 small partitions
• Running 50 concurrent queries like “select
count(myfield_1) from mytable;” crashes an HS2
server with -Xmx500m
• More partitions or concurrent queries - more
memory needed
HS2: Investigation
• Looked into the heap dump generated after OOM
• Not too many different problems:

- Duplicate strings: 23%

- java.util.Properties objects take 20% of memory

- Various bad collections: 18%
• Apparently, many Properties are duplicate

- A separate copy per partition per query

- For a read-only partition, all per-query copies are identical
HS2: Fixing duplicate strings
• Some String.intern() calls added
• Some strings come from HDFS code

- Need separate changes in Hadoop code
• Most interesting: String fields of java.net.URI

- private fields initialized internally - no access

- But still can read/write using Java Reflection

- Wrote StringInternUtils.internStringsInURI(URI) method
HS2: Fixing duplicate java.util.Properties objects
• Main problem: Properties object is mutable

- All PartitionDesc objects representing the same partition
cannot simply use one “canonicalized” Properties object

- If one is changed, others should not!
• Had to implement a new class

class CopyOnFirstWriteProperties extends Properties {

Properties interned; // Used until/unless a mutator called

// Inherited table is filled and used after first mutation

…

}
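The copy-on-first-write principle can be shown on a much smaller scale. The sketch below is not the Hive class: it wraps a plain Map instead of extending Properties and supports only get and put, and CopyOnFirstWriteMap is a hypothetical name. Reads go to the shared interned map until the first mutation, at which point the holder switches to a private copy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified illustration of copy-on-first-write.
final class CopyOnFirstWriteMap {
    private Map<String, String> interned; // shared, never modified through this class
    private Map<String, String> own;      // private copy, null until the first write

    CopyOnFirstWriteMap(Map<String, String> interned) {
        this.interned = interned;
    }

    String get(String key) {
        return (own != null ? own : interned).get(key);
    }

    void put(String key, String value) {
        if (own == null) {
            own = new HashMap<>(interned); // first mutation: copy, then detach
            interned = null;
        }
        own.put(key, value);
    }
}
```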
HS2: Improvements based on simple
read-only benchmark
• Fixing duplicate strings and properties together
saved ~37% of memory
• Another ~5% can be saved by deduplicating
strings in HDFS
• Another ~10% can be saved by dealing with bad
collections
Investigating/fixing concrete apps:
conclusions
• Any app can develop memory problems over time

- Check and optimize periodically
• Many such problems are easy enough to fix

- Intern strings, initialize collections lazily, etc.
• Duplication other than strings is frequent

- More difficult to fix, but may be well worth the effort

- Need to improve tooling to detect it automatically

 
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
 
Busty Girls Call Mumbai 9930245274 Unlimited Short Providing Girls Service Av...
Busty Girls Call Mumbai 9930245274 Unlimited Short Providing Girls Service Av...Busty Girls Call Mumbai 9930245274 Unlimited Short Providing Girls Service Av...
Busty Girls Call Mumbai 9930245274 Unlimited Short Providing Girls Service Av...
 
Il Data Streaming per un’AI real-time di nuova generazione
Il Data Streaming per un’AI real-time di nuova generazioneIl Data Streaming per un’AI real-time di nuova generazione
Il Data Streaming per un’AI real-time di nuova generazione
 
Hotel Management Software Development Company
Hotel Management Software Development CompanyHotel Management Software Development Company
Hotel Management Software Development Company
 
BATber53 AWS Modernize your applications with purpose-built AWS databases
BATber53 AWS Modernize your applications with purpose-built AWS databasesBATber53 AWS Modernize your applications with purpose-built AWS databases
BATber53 AWS Modernize your applications with purpose-built AWS databases
 
Russian Girls Call Mumbai 🎈🔥9930687706 🔥💋🎈 Provide Best And Top Girl Service ...
Russian Girls Call Mumbai 🎈🔥9930687706 🔥💋🎈 Provide Best And Top Girl Service ...Russian Girls Call Mumbai 🎈🔥9930687706 🔥💋🎈 Provide Best And Top Girl Service ...
Russian Girls Call Mumbai 🎈🔥9930687706 🔥💋🎈 Provide Best And Top Girl Service ...
 
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
 
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
 
SEO Cheat Sheet with Learning Resources by Balti Bloggers.pdf
SEO Cheat Sheet with Learning Resources by Balti Bloggers.pdfSEO Cheat Sheet with Learning Resources by Balti Bloggers.pdf
SEO Cheat Sheet with Learning Resources by Balti Bloggers.pdf
 

Java Memory Analysis: Problems and Solutions

  • 3. Assume that at the high level your data is represented efficiently • Data doesn’t sit in memory for longer than needed • No unnecessary duplicate data structures
 - E.g. don’t keep same objects in both a List and a Set • Data structures are appropriate
 - E.g. don’t use ConcurrentHashMap when no concurrency • Data format is appropriate
 - E.g. don’t use Strings for int/double numbers

  • 4. Main sources of memory waste (from bottom to top level) • JVM internal object implementation • Inefficient common data structures
 - Collections
 - Boxed numbers • Data duplication - often biggest overhead • Memory leaks
  • 5. Internal Object Format in HotSpot JVM
  • 6. Internal Object Format: Alignment • To enable 4-byte pointers (compressedOops) with >4G heap, objects are 8-byte aligned • Thus, for example:
 - java.lang.Integer effective size is 16 bytes
 (12b header + 4b int)
 - java.lang.Long effective size is 24 bytes - not 20!
 (12b header + 8b long + 4b padding)
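
The alignment arithmetic above can be reproduced with a few lines of Java. This is a rough estimator, not a measurement: it assumes the 12-byte header and 8-byte alignment quoted on the slide (64-bit HotSpot with compressedOops), and the class and method names are made up for illustration.

```java
// Rough estimate of effective object size, assuming a 12-byte header
// and 8-byte alignment (64-bit HotSpot with compressedOops, per the slide).
final class ObjectSizeEstimate {
    static final int HEADER_BYTES = 12; // assumption taken from the slide
    static final int ALIGNMENT = 8;     // objects are 8-byte aligned

    /** Rounds header + field bytes up to the next multiple of ALIGNMENT. */
    static int effectiveSize(int fieldBytes) {
        int raw = HEADER_BYTES + fieldBytes;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    public static void main(String[] args) {
        // java.lang.Integer carries one 4-byte int field
        System.out.println("Integer: " + effectiveSize(Integer.BYTES) + " bytes");
        // java.lang.Long carries one 8-byte long field, plus padding
        System.out.println("Long:    " + effectiveSize(Long.BYTES) + " bytes");
    }
}
```

For real measurements, a heap dump or a layout-inspection tool is more reliable than this kind of arithmetic.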
  • 7. Summary: small objects are bad • A small object’s overhead is up to 400% of its workload • There are apps with up to 40% of the heap wasted due to this • See if you can change your code to “consolidate” objects or put their contents into flat arrays • Avoid heap size > 32G! (really ~30G)
 - Unless your data is mostly int[], byte[] etc.
  • 8. Common Collections • JDK: java.util.ArrayList, java.util.HashMap, java.util.concurrent.ConcurrentHashMap etc. • Third-party - mainly Google: com.google.common.collect.* • Scala has its own equivalent of JDK collections • JDK collections are nothing magical
 - Written in Java, easy to load and read in IDE
  • 11. Memory Footprint of JDK Collections • JDK doesn’t pay much attention to memory footprint
 - Just some optimizations for empty ArrayLists and HashMaps
 - ConcurrentHashMap and some Google collections are the worst “memory hogs” • Memory is wasted due to:
 - Default size of the internal array (10 for AL, 16 for HM) too high for small maps. Never shrinks after initialization.
 - $Entry objects used by all Maps take at least 32b each! 
 - Sets just reuse Map structure, no footprint optimization
  • 12. Boxed numbers - related to collections • java.lang.Integer, java.lang.Double etc. • Were introduced mainly to avoid creating specialized classes like IntToObjectHashMap • However, proven to be extremely wasteful:
 - Single int takes 4b. java.lang.Integer effective size is 16b (12b header + 4b int), plus 4b pointer to it
 - Single long takes 8b. java.lang.Long effective size is 24b (12b header + 8b long + 4b padding), plus 4b pointer to it
  • 13. JDK Collections: Summary • Initialized, but empty collections waste memory • Things like HashMap<Object, Integer> are bad • HashMap$Entry etc. may take up to 30% of memory • Some third-party libraries provide alternatives
 - In particular, fastutil.di.unimi.it (University of Milan, Italy)
 - Has Object2IntHashMap, Long2ObjectHashMap, Int2DoubleHashMap, etc. - no boxed numbers
 - Has Object2ObjectOpenHashMap : no $Entry objects
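
To show why such maps are smaller, here is a minimal sketch of an open-addressing int-to-int map. It is not fastutil's implementation; the class `IntIntOpenMap` is hypothetical, supports only put and lookup (no removal), and exists only to show that keys and values can live in flat primitive arrays with no boxed numbers and no $Entry objects.

```java
// Hypothetical minimal int -> int map using open addressing with linear probing.
// Keys and values are stored in flat primitive arrays: no Integer boxes, no Entry objects.
final class IntIntOpenMap {
    private int[] keys = new int[16];
    private int[] values = new int[16];
    private boolean[] used = new boolean[16];
    private int size;

    /** Finds the slot holding the key, or the first free slot on its probe path. */
    private static int slot(int key, int[] ks, boolean[] us) {
        int mask = ks.length - 1;          // table length is always a power of two
        int h = key * 0x9E3779B9;          // cheap multiplicative hash mixing
        int i = (h ^ (h >>> 16)) & mask;
        while (us[i] && ks[i] != key) {
            i = (i + 1) & mask;            // linear probing
        }
        return i;
    }

    void put(int key, int value) {
        if ((size + 1) * 2 > keys.length) {
            grow();                        // keep the table at most half full
        }
        int i = slot(key, keys, used);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int getOrDefault(int key, int defaultValue) {
        int i = slot(key, keys, used);
        return used[i] ? values[i] : defaultValue;
    }

    int size() {
        return size;
    }

    /** Doubles the table and re-inserts all existing entries. */
    private void grow() {
        int[] oldKeys = keys;
        int[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        values = new int[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) {
                int i = slot(oldKeys[j], keys, used);
                used[i] = true;
                keys[i] = oldKeys[j];
                values[i] = oldValues[j];
            }
        }
    }

    public static void main(String[] args) {
        IntIntOpenMap map = new IntIntOpenMap();
        map.put(42, 1);
        map.put(42, 2);                    // overwrites, size stays 1
        System.out.println(map.getOrDefault(42, -1) + " / size " + map.size());
    }
}
```

In production code a maintained library is the better choice; the sketch only makes the memory layout visible.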
  • 14. Data Duplication • Can happen for many reasons:
 - s = s1 + s2 or s = s.toUpperCase() etc. always generates a new String object
 - intObj = new Integer(intScalar) always generates a new Integer object
 - Duplicate byte[] buffers in I/O, serialization, etc. • Very hard to detect without tooling
 - Small amount of duplication is inevitable
 - 20-40% waste is not uncommon in unoptimized apps • Duplicate Strings are most common and easy to fix
  • 15. Dealing with String duplication • Use tooling to determine where dup strings are either
 - generated, e.g. s = s.toUpperCase();
 - permanently attached, e.g. this.name = name; • Use String.intern() to de-duplicate
 - Uses a JVM-internal fast, scalable canonicalization hashtable
 - Table is fixed and preallocated - no extra memory overhead 
 - Small CPU overhead is normally offset by reduced GC time and improved cache locality • s = s.toUpperCase().intern();
 this.name = name.intern(); …
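
A small, self-contained illustration of the effect, using only `String.intern()` from the JDK. The class and method names are hypothetical; `Locale.ROOT` is added here so the upper-casing does not depend on the default locale.

```java
import java.util.Locale;

// Demonstrates that interning makes equal strings share one canonical instance.
final class InternDemo {
    /** Upper-cases the input and returns the canonical (interned) instance. */
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT).intern();
    }

    public static void main(String[] args) {
        // toUpperCase() produces a new String each time the text changes...
        String a = "select".toUpperCase(Locale.ROOT);
        String b = "Select".toUpperCase(Locale.ROOT);
        System.out.println("equal content:        " + a.equals(b));
        // ...while intern() maps equal content to a single shared object.
        System.out.println("same after intern():  " + (normalize("select") == normalize("Select")));
    }
}
```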
  • 16. Other duplicate data • Can be almost anything. Examples:
 - Timestamp objects
 - Partitions (with HashMaps and ArrayLists) in Apache Hive
 - Various byte[], char[] etc. data buffers everywhere • No convenient tooling so far for automatic detection of arbitrary duplicate objects • But one can often guess correctly
 - Just look at classes that take most memory…
  • 17. Dealing with non-string duplicates • Use WeakHashMap to store canonicalized objects
 - com.google.common.collect.Interner wraps a (Weak)HashMap • For big data structures, interning may cause some CPU performance impact
 - Interning calls hashCode() and equals()
 - GC time reduction would likely offset this • If duplicate objects are mutable, like HashMap…
 - May need CopyOnFirstChangeHashMap, etc.
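
A minimal sketch of the WeakHashMap approach, assuming the interned objects are immutable and implement `hashCode()` and `equals()` consistently. The class `WeakInterner` is hypothetical and is not the Guava `Interner`; values are wrapped in `WeakReference` so that the map does not keep its own keys alive.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical canonicalization cache: returns one shared instance per equal value.
// Entries disappear once no one else references the canonical object.
final class WeakInterner<T> {
    // The value must be a WeakReference: a strong reference to the key
    // from the value would prevent the entry from ever being collected.
    private final Map<T, WeakReference<T>> pool = new WeakHashMap<>();

    /** Returns the canonical instance equal to the candidate, registering it if new. */
    synchronized T intern(T candidate) {
        WeakReference<T> ref = pool.get(candidate);   // calls hashCode() and equals()
        T existing = (ref != null) ? ref.get() : null;
        if (existing != null) {
            return existing;
        }
        pool.put(candidate, new WeakReference<>(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        WeakInterner<java.math.BigInteger> interner = new WeakInterner<>();
        java.math.BigInteger first = new java.math.BigInteger("1234567890");
        java.math.BigInteger second = new java.math.BigInteger("1234567890");
        System.out.println(interner.intern(first) == interner.intern(second));
    }
}
```

The single `synchronized` method keeps the sketch short; a heavily contended cache would likely need a different concurrency strategy.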
  • 18. Duplicate Data: Summary • Duplicate data may cause huge memory waste
 - Observed up to 40% overhead in unoptimized apps • Duplicate Strings are easy to
 - Detect (but need tooling to analyze a heap dump)
 - Get rid of - just use String.intern() • Other kinds of duplicate data more difficult to find
 - But it’s worth the effort!
 - Mutable duplicate data is more difficult to deal with
  • 19. Memory Leaks • Unlike C++, Java doesn’t have real leaks
 - Data that’s not used anymore, but not released
 - Too much persistent data cached in memory • No reliable way to distinguish leaked data…
 - But any data structure that just keeps growing is bad • So, just pay attention to the biggest (and growing) data structures
 - Heap dump: see which GC root(s) hold most memory
 - Runtime profiling can be more accurate, but more expensive
  • 21. What is it • Offline heap analysis tool
 - Runs once on a given heap dump, produces a text report • Simple command-line interface: 
 - Just one jar + .sh script
 - No complex installation
 - Can run anywhere (laptop or remote headless machine)
 - Needs JDK 8 • See http://www.jxray.com for more info
  • 22. JXRay: main features • Shows you what occupies the heap
 - Object histogram: which objects take most memory
 - Reference chains: which GC roots/data structures keep biggest object “lumps” in memory • Shows you where memory is wasted
 - Object headers
 - Duplicate Strings
 - Bad collections (empty; 1-element; small (2-4 element))
 - Bad object arrays (empty (all nulls); length 0 or 1; single non-null element)
 - Boxed numbers
 - Duplicate primitive arrays (e.g. byte[] buffers)
  • 23. Keeping results succinct • No GUI - generates a plain text report
 - Easy to save and exchange
 - Small: ~50K regardless of the dump size
 - Details a given problem once its overhead is above threshold (by default 0.1% of used heap) • Knows about internals of most standard collections
 - More compact/informative representation • Aggregates reference chains from GC roots to problematic objects
  • 24. Reference chain aggregation: assumptions • A problem is important if many objects have it
 - E.g. 1000s/1,000,000s of duplicate strings • Usually there are not too many places in the code responsible for such a problem
 - Foo(String s) {
 this.s = s.toUpperCase(); …
 }
 - Bar(String s1, String s2) {
 this.s = s1 + s2; …
 }
  • 25. Reference chain aggregation: what is it • In the heap, we may have e.g.
 Baz.stat1 -> HashMap@243 -> ArrayList@650 -> Foo.s = “xyz”
 Baz.stat2 -> LinkedList@798 -> HashSet@134 -> Bar.s = “0”
 Baz.stat1 -> HashMap@529 -> ArrayList@351 -> Foo.s = “abc”
 Baz.stat2 -> LinkedList@284 -> HashSet@960 -> Bar.s = “1”
 … 1000s more chains like this • JXRay aggregates them all into just two lines:
 Baz.stat1 -> {HashMap} -> {ArrayList} -> Foo.s (“abc”, “xyz” and
 3567 more dup strings)
 Baz.stat2 -> {LinkedList} -> {HashSet} -> Bar.s (“0”, “1” and …)
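
The idea can be sketched on plain text chains. This is not JXRay's actual algorithm, which works on the heap dump's object graph; the class `ChainAggregator` and its textual chain format are assumptions made for illustration. Chains that differ only in instance ids and string values collapse into one line with a count.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical text-based illustration of reference chain aggregation.
final class ChainAggregator {
    /** Drops the trailing string value and replaces "Type@id" with "{Type}". */
    static String shape(String chain) {
        String withoutValue = chain.replaceAll("\\s*=\\s*\".*\"$", "");
        return withoutValue.replaceAll("(\\w+)@\\d+", "{$1}");
    }

    /** Counts how many concrete chains share each aggregated shape. */
    static Map<String, Integer> aggregate(List<String> chains) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String chain : chains) {
            counts.merge(shape(chain), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> chains = Arrays.asList(
                "Baz.stat1 -> HashMap@243 -> ArrayList@650 -> Foo.s = \"xyz\"",
                "Baz.stat2 -> LinkedList@798 -> HashSet@134 -> Bar.s = \"0\"",
                "Baz.stat1 -> HashMap@529 -> ArrayList@351 -> Foo.s = \"abc\"",
                "Baz.stat2 -> LinkedList@284 -> HashSet@960 -> Bar.s = \"1\"");
        aggregate(chains).forEach((shape, count) ->
                System.out.println(shape + " (" + count + " chains)"));
    }
}
```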
  • 26. Treating collections specially • Object histogram: standard vs JXRay view
 HashMap$Entry 21500 objs 430K
 HashMap$Entry[] 3200 objs 180K
 HashMap 3200 objs 150K 
 vs
 {HashMap} 3200 objs 760K • Reference chains:
 Foo <- HashMap$Entry.value <- HashMap$Entry[] <-
 <- HashMap <- Object[] <- ArrayList <- rootX
 vs
 Foo <- {HashMap.values} <- {ArrayList} <- rootX
  • 27. Bad collections • Empty: no elements at all
 - Is it used at all? If yes, allocate lazily. • 1-element
 - Always has only 1 element - replace with object
 - Almost always has 1 element - solution more complex. Switch between Object and collection/array lazily. • Small: 2..4 elements
 - Consider smaller initial capacity
 - Consider replacing with a plain array
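
Lazy allocation with a small initial capacity might look like the sketch below. The class `LazyTags` and its field are hypothetical; the point is that an object whose list is usually empty pays for one null reference instead of an ArrayList plus its backing array.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder whose list is allocated only when the first element arrives.
final class LazyTags {
    private List<String> tags; // stays null while there are no elements

    void addTag(String tag) {
        if (tags == null) {
            tags = new ArrayList<>(2); // small initial capacity instead of the default 10
        }
        tags.add(tag);
    }

    /** Returns a read-only view; the shared empty list is used when nothing was added. */
    List<String> tags() {
        return tags == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(tags);
    }

    public static void main(String[] args) {
        LazyTags holder = new LazyTags();
        System.out.println(holder.tags());   // no list allocated yet
        holder.addTag("hot");
        System.out.println(holder.tags());
    }
}
```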
  • 28. Bad object arrays • Empty: only nulls
 - Same as empty collections - delete or allocate lazily • Length 0
 - Replace with a singleton zero-length array • Length 1
 - Replace with an object? • Single non-null element
 - Replace with an object? Reduce length?
  • 29. Memory Analysis and Reducing Footprint: concrete cases
  • 30. A Monitoring app • Scalability wasn’t great
 - Some users had to increase -Xmx again and again.
 - Unclear how to choose the correct size • Big heap -> long full GC pauses -> frozen UI • Some OOMs in small clusters
 - Not a scale problem - a bug?
  • 31. Investigation, part 1 • Started with the smaller dumps with OOMs
 - Immediately found duplicate strings
 - One string repeated 1000s of times used 90% of the heap
 - Long SQL query saved in DB many times, then retrieved
 - Adding two String.intern() calls solved the problem... almost • Duplicate byte[] buffers in a 3rd-party library code
 - That still caused noticeable overhead
 - Ended up limiting saved query size at high level
 - Library/auto-gen code may be difficult to change…
  • 32. Investigation, part 2 • Next, looked into heap dumps with scalability problems
 - Both real and artificial benchmark setup • Found all the usual issues
 - String duplication
 - Empty or small (1-4 elements) collections
 - Tons of small objects (object headers used 31% of heap!)
 - Boxed numbers
  • 33. Standard solutions applied • Duplicate strings: add more String.intern() calls
 - Easy: check jxray report, find what data structures reference bad strings, edit code
 - Non-trivial when a String object is mostly managed by auto-generated code • Bad collections: less trivial
 - Sometimes it’s enough to replace new HashMap() with new HashMap(expectedSize)
 - Found ArrayLists that are almost always of size 0/1
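
One detail worth keeping in mind when presizing: the int argument of the `HashMap` constructor is the initial capacity, not the number of entries, and with the default load factor of 0.75 the table is resized once the entry count exceeds 75% of the capacity. A small helper, with hypothetical names, that converts an expected entry count into a capacity:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: picks an initial capacity that holds the expected
// number of entries without a resize, assuming the default load factor 0.75.
final class MapSizing {
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    static <K, V> Map<K, V> newMapFor(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries));
    }

    public static void main(String[] args) {
        System.out.println("capacity for 3 entries:  " + capacityFor(3));
        System.out.println("capacity for 12 entries: " + capacityFor(12));
        Map<String, Integer> small = newMapFor(3);
        small.put("a", 1);
        System.out.println(small);
    }
}
```

HashMap rounds the requested capacity up to a power of two internally, so the helper only needs to get the lower bound right.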
  • 34. Dealing with mostly 0/1-size ArrayLists • Replaced ArrayList list; —> Object valueOrArray; • Depending on the situation, valueOrArray may
 - be null
 - point to a single object (element)
 - point to an array of objects (elements) • ~70 LOC hand-written for this
 - But memory savings were worth the effort
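
A condensed sketch of the valueOrArray technique follows. It is not the ~70 lines written for the monitoring app; the class `CompactList` is hypothetical and supports only add, get and size. It rejects null and array elements so that the three states of the field stay unambiguous.

```java
import java.util.Arrays;

// Hypothetical replacement for a mostly 0/1-element ArrayList.
// The single field is null (empty), the element itself (size 1),
// or an Object[] holding two or more elements.
final class CompactList {
    private Object valueOrArray;

    void add(Object element) {
        if (element == null || element instanceof Object[]) {
            // null and arrays would be confused with the "empty" and "many" states
            throw new IllegalArgumentException("null and array elements are not supported");
        }
        if (valueOrArray == null) {
            valueOrArray = element;                          // 0 -> 1 element
        } else if (valueOrArray instanceof Object[]) {
            Object[] old = (Object[]) valueOrArray;
            Object[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = element;                     // n -> n + 1 elements
            valueOrArray = grown;
        } else {
            valueOrArray = new Object[] {valueOrArray, element}; // 1 -> 2 elements
        }
    }

    int size() {
        if (valueOrArray == null) {
            return 0;
        }
        return (valueOrArray instanceof Object[]) ? ((Object[]) valueOrArray).length : 1;
    }

    Object get(int index) {
        if (valueOrArray instanceof Object[]) {
            return ((Object[]) valueOrArray)[index];
        }
        if (valueOrArray == null || index != 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return valueOrArray;
    }

    public static void main(String[] args) {
        CompactList list = new CompactList();
        list.add("only");
        System.out.println(list.size() + " element: " + list.get(0));
    }
}
```

Growing the array by one element at a time keeps it exactly sized, which suits lists that rarely exceed one element but would be slow for large ones.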
  • 35. Dealing with non-string duplicate data • Heap contained a lot of small objects
 class TimestampAndData {
 long timestamp;
 long value; 
 … } • Guessed that there may be many duplicates
 - E.g. many values are just 0/1 • Added a simple canonicalization cache. Result:
 - 8x fewer TimestampAndData objects
 - 16% memory savings
  • 36. A Monitoring app: conclusions • Fixing string/other data duplication, boxed nums, small/empty collections: together saved ~50%
 - Depends on the workload
 - Scalability improved: more data - higher savings • Can still save more - replace standard HashMaps with more memory-friendly maps
 - HashMap$Entry objects may take a lot of memory!
  • 37. Apache Hive: Hive Server 2 (HS2) • HS2 may run out of memory • Most scenarios involve 1000s of partitions and 10s of concurrent queries • Not many heap dumps from real users • Create a benchmark which reproduces the problem, measure where memory goes, optimize
  • 38. Experimental setup • Created a Hive table with 2000 small partitions • Running 50 concurrent queries like “select count(myfield_1) from mytable;” crashes an HS2 server with -Xmx500m • More partitions or concurrent queries - more memory needed
  • 39. HS2: Investigation • Looked into the heap dump generated after OOM • Not too many different problems:
 - Duplicate strings: 23%
 - java.util.Properties objects take 20% of memory
 - Various bad collections: 18% • Apparently, many Properties are duplicate
 - A separate copy per partition per query
 - For a read-only partition, all per-query copies are identical
  • 40. HS2: Fixing duplicate strings • Some String.intern() calls added • Some strings come from HDFS code
 - Need separate changes in Hadoop code • Most interesting: String fields of java.net.URI
 - private fields initialized internally - no access
 - But still can read/write using Java Reflection
 - Wrote StringInternUtils.internStringsInURI(URI) method
  • 41. HS2: Fixing duplicate java.util.Properties objects • Main problem: Properties object is mutable
 - All PartitionDesc objects representing the same partition cannot simply use one “canonicalized” Properties object
 - If one is changed, others should not! • Had to implement a new class
 class CopyOnFirstWriteProperties extends Properties {
 Properties interned; // Used until/unless a mutator called
 // Inherited table is filled and used after first mutation
 …
 }
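
The copy-on-first-write idea can be illustrated with a simplified wrapper. This is not the Hive class: it does not extend Properties, it covers only `getProperty` and `setProperty`, and the name `CopyOnFirstWriteProps` is hypothetical. Reads go to the shared canonical instance until the first mutation, at which point a private copy is made.

```java
import java.util.Properties;

// Hypothetical, simplified illustration of copy-on-first-write.
// Many wrappers can share one canonical Properties object as long as they only read.
final class CopyOnFirstWriteProps {
    private Properties shared; // canonical instance, never mutated through this wrapper
    private Properties own;    // private copy, created on the first write

    CopyOnFirstWriteProps(Properties shared) {
        this.shared = shared;
    }

    String getProperty(String key) {
        return (own != null ? own : shared).getProperty(key);
    }

    void setProperty(String key, String value) {
        if (own == null) {
            own = new Properties();
            own.putAll(shared); // copies the entries, but not any defaults chain
            shared = null;      // this wrapper no longer needs the canonical copy
        }
        own.setProperty(key, value);
    }

    public static void main(String[] args) {
        Properties canonical = new Properties();
        canonical.setProperty("format", "orc");
        CopyOnFirstWriteProps first = new CopyOnFirstWriteProps(canonical);
        CopyOnFirstWriteProps second = new CopyOnFirstWriteProps(canonical);
        first.setProperty("format", "parquet"); // only "first" gets a private copy
        System.out.println(first.getProperty("format") + " vs " + second.getProperty("format"));
    }
}
```

A subclass of Properties, as on the slide, has to override every inherited accessor and mutator to preserve this behavior, which is why the real class is considerably longer.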
  • 42. HS2: Improvements based on simple read-only benchmark • Fixing duplicate strings and properties together saved ~37% of memory • Another ~5% can be saved by deduplicating strings in HDFS • Another ~10% can be saved by dealing with bad collections
  • 43. Investigating/fixing concrete apps: conclusions • Any app can develop memory problems over time
 - Check and optimize periodically • Many such problems are easy enough to fix
 - Intern strings, initialize collections lazily, etc. • Duplication other than strings is frequent
 - More difficult to fix, but may be well worth the effort
 - Need to improve tooling to detect it automatically