Monday, March 23, 2009

Make sure to re-check your assumptions when performance tuning

We found the memory leak. This is what our properly tuned system looks like during a 66 hour test. Notice that the black line does not slowly trend upward, like it did in the last post.

The velocity templating framework that we use for generating xml to send to a third party system was slowly eating away at our memory. It wasn't really a problem with velocity, it was a problem with the way that we had our system tuned.

The first thing to remember is that Java splits up its memory into three parts, called generations. They are Young, Tenured, and Permanent.

Whenever you create a new instance of a class, it is created in the Young generation. This generation is used and quickly discarded each time there is a garbage collection. When the young generation is full, the garbage collector goes through the young generation and copies all the objects that are still referenced to the tenured generation. The place where you find memory leaks in most applications is when you keep references to too many objects. You will start getting out of memory exceptions when the garbage collector starts to copy objects from the Young generation to the Tenured generation, and there is not enough space to fit all of the objects that are still referenced. Frequently, you don't even realize that you are keeping a reference to those objects.

What made this leak particularly difficult to detect was that it was a problem in the permanent space. This area of memory is used to store objects that don't relate to instances of objects. This is where the JVM stores definitions of classes and methods. This area is pretty obscure, and the commercial tool that I used for profiling didn't even notice the leak in that area.

Velocity was creating dynamic classes, which are used once per call, then promptly discarded. These classes are created in the Permanent generation. The problem came from an optimization that we made about 5-7 years ago to speed up the server. We had set the flag noclassgc , which disables garbage collection for class definitions. This option makes sense if you never create dynamic classes, because you don't have to spend time reloading classes into memory.

We were creating classes that would never be used again, and telling the garbage collector not to bother with cleaning up after us. The definition of the classes are very small, which is why it took about 66 hours for the leak to take down the server.

The moral of the story is to revisit your performance optimizations occasionally and make sure they still make sense. The best decision today will not necessarily be the best decision tomorrow, so be ready to change your mind.

No comments: