I was once asked to explain why my program consumed so much memory when the actual data it loaded was so small. The usual assumption is that the culprit is duplicate data or unused data that the developer carelessly left in the app.
I once asked a coworker to run a simple experiment: read a 2GB file into a HashSet. The program required almost 12GB to run. That's a 6x increase in memory requirement!
It turns out that the Java VM is very expensive when it comes to storing data.
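A minimal sketch of that experiment (the file path and the heap measurement are illustrative, not the exact code my coworker ran):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class HashSetMemoryExperiment {

    // Read every line from the source into a HashSet of Strings.
    static Set<String> load(Reader source) throws IOException {
        Set<String> lines = new HashSet<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line); // each String carries object-header + backing-array overhead
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; the real experiment used a ~2GB text file.
        Set<String> lines = load(new FileReader(args.length > 0 ? args[0] : "data.txt"));

        // Rough view of heap usage after loading (not a precise measurement).
        Runtime rt = Runtime.getRuntime();
        long usedBytes = rt.totalMemory() - rt.freeMemory();
        System.out.println("Lines loaded: " + lines.size());
        System.out.println("Approx heap used: " + (usedBytes >> 20) + " MB");
    }
}
```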
I found a study by IBM that describes this phenomenon. It’s linked here.
Here are some highlights from the slide deck:
– for a string, 20%-75% of the memory footprint is overhead! That means as little as 25% of it is actual data
– an example TreeMap is 82% overhead, while a sorted array (which requires binary search) is only 2% overhead
– trying to magically solve memory problems without understanding consequences is bad
– upgrading from a 32-bit to a 64-bit JVM to address more memory also increases per-object overhead
– using a 64-bit JVM increases memory usage by 40-50% on average compared to 32-bit
– e.g. an 8-character string requires 64 bytes on 32-bit Java 5 and 96 bytes on 64-bit Java 5
– leaving caches unbounded for a higher hit rate can cause excessive GC that slows performance
– frameworks add additional memory requirements
– none of this even includes duplicate data or unused data
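To make the string numbers concrete, here is a back-of-envelope calculation. It assumes the deck's example is an 8-character string, i.e. 16 bytes of char data (Java chars are 2 bytes each); the 64-byte and 96-byte totals come from the highlights above.

```java
public class StringOverheadMath {

    // Percentage of the total footprint that is overhead rather than data.
    static double overheadPercent(int totalBytes, int dataBytes) {
        return 100.0 * (totalBytes - dataBytes) / totalBytes;
    }

    public static void main(String[] args) {
        int dataBytes = 8 * 2; // 8 chars * 2 bytes per Java char (UTF-16)

        // Footprints quoted in the deck for Java 5.
        System.out.printf("32-bit (64B total): %.1f%% overhead%n",
                overheadPercent(64, dataBytes));   // 75% — the top of the 20%-75% range
        System.out.printf("64-bit (96B total): %.1f%% overhead%n",
                overheadPercent(96, dataBytes));   // roughly 83%
        System.out.printf("64-bit vs 32-bit growth: %.0f%%%n",
                100.0 * (96 - 64) / 64);           // 50% — matching the 40-50% claim
    }
}
```

Note how the 96-vs-64-byte growth lands exactly at the top of the "40-50% more on 64-bit" range quoted above.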
And some lessons learned:
– consider context and usage statistics of collections
– sometimes an IdentityHashMap is better (50% less memory and faster)
– use array/list when number of elements is small
– research found that only 5% of sets contained more than a few elements
– set initial capacity where possible
– trim to size after load
– use weak or soft references so the GC can remove unused objects (e.g. Apache Commons' ReferenceMap)
Now let’s all be a little more aware of how we’re building our applications so they don’t lead to memory obesity.