Monday, March 23, 2009

Make sure to re-check your assumptions when performance tuning

We found the memory leak. This is what our properly tuned system looks like during a 66-hour test. Notice that the black line does not slowly trend upward, like it did in the last post.

The Velocity templating framework that we use for generating XML to send to a third-party system was slowly eating away at our memory. It wasn't really a problem with Velocity; it was a problem with the way that we had our system tuned.

The first thing to remember is that Java splits its memory into three parts, called generations: Young, Tenured, and Permanent.

Whenever you create a new instance of a class, it is allocated in the Young generation, which is filled and cheaply emptied each time there is a garbage collection. When the Young generation is full, the garbage collector copies every object that is still referenced into the Tenured generation. The leak you find in most applications comes from keeping references to too many objects, frequently without even realizing it: you start getting out-of-memory errors when the collector tries to promote objects from the Young generation to the Tenured generation and there is not enough space left for everything that is still referenced.
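The classic shape of that kind of leak is a long-lived collection that quietly keeps everything you add to it reachable. This is a minimal sketch (the class and field names are my own, not from our system):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an accidental leak: objects parked in a long-lived
// collection stay referenced, so the collector must keep promoting them
// to the Tenured generation instead of discarding them.
public class RetainedReferences {
    // A static field lives as long as the application does, so everything
    // it references survives every garbage collection.
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            // Each buffer starts life in the Young generation...
            byte[] buffer = new byte[1024];
            // ...but adding it to the static list keeps it reachable,
            // so it gets promoted rather than collected.
            CACHE.add(buffer);
        }
        // The references pile up forever; with enough of them, promotion
        // eventually fails and the JVM throws OutOfMemoryError.
        System.out.println("Retained buffers: " + CACHE.size());
    }
}
```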

What made this leak particularly difficult to detect was that it was in the Permanent space. This area holds data that doesn't correspond to object instances: it is where the JVM stores class and method definitions. The area is obscure enough that the commercial profiling tool I used didn't even notice the leak there.
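One way to watch that space without a profiler is `jstat`, which ships with the JDK. On the JVMs of this era, the `-gcutil` output includes a permanent-generation column (the process id below is a placeholder):

```
# Sample GC utilization every 5 seconds for the given JVM process.
jstat -gcutil <pid> 5000

# In pre-Java-8 output, the "P" column is permanent-generation occupancy
# as a percentage. If it only ever climbs across a long run, classes are
# being loaded and never unloaded.
```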

Velocity was creating dynamic classes, each used once per call and then promptly discarded. These classes are created in the Permanent generation. The problem came from an optimization that we made about 5-7 years ago to speed up the server: we had set the -Xnoclassgc flag, which disables garbage collection of class definitions. This option makes sense if you never create dynamic classes, because you don't spend time reloading classes into memory.
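For reference, the flag sits on the JVM startup line. The heap sizes and jar name below are illustrative, not our actual configuration:

```
# -Xnoclassgc tells the collector never to unload class definitions.
# That is harmless for a fixed set of classes, but fatal when a library
# generates throwaway classes on every call.
java -Xms512m -Xmx512m -Xnoclassgc -jar server.jar

# Dropping the flag restores the default behavior, letting unreferenced
# class definitions be collected out of the Permanent generation:
java -Xms512m -Xmx512m -jar server.jar
```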

We were creating classes that would never be used again, and telling the garbage collector not to bother cleaning up after us. The class definitions are very small, which is why it took about 66 hours for the leak to take down the server.

The moral of the story is to revisit your performance optimizations occasionally and make sure they still make sense. The best decision today will not necessarily be the best decision tomorrow, so be ready to change your mind.

Wednesday, March 18, 2009

What takes 125 users 66 hours to discover?

Unfortunately, I think the answer is a very slow memory leak.

For our application, we typically run three classes of performance test: a quick test with 100 users for 1 hour, a longer test with 125 users for 8 hours, and a really long test with 125 users for 66 hours.

The 1- and 8-hour tests were fine this time, but looking at the garbage collection logs and zooming out over a very long time period, we can see that the application is holding on to memory. The blue area represents the memory that is used and released. The black area, slowly moving upward, is the memory that is used and not released.
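Getting logs you can graph like this just takes a few startup flags. These are the standard ones for JVMs of this vintage (the log path is illustrative):

```
# Write timestamped, detailed GC events to a file for later plotting.
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log -jar server.jar
```

Each line of the resulting log records heap occupancy before and after a collection, which is exactly what you need to spot a floor that creeps upward over tens of hours.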

The last performance problem that we fixed was related to XML parsing and XPath. This time it looks like we need to tune Velocity.