Friday, March 4, 2011

Learning through games

Playing games is a great way to learn something new. It is also a great way to take something that you know intellectually and turn it into something that you feel instinctively. Sometimes you think you are just playing a game, but in reality, you can learn something from everything you try.

My son, Jeffrey, never wanted to learn how to read. He had all the skills for it, he could recognize all the letters and he knew what the sounds of the letters made, but he would never actually sit down and read anything. He said that he couldn’t. That was until we bought the game Kingdom Hearts Re:coded for his Nintendo DS. He really wanted to play and since it is a role playing game, there was a lot of reading needed just to understand what was happening. Initially, he needed to sit with me or his grandma and ask us what most of the words were. As we played more, he started reading more and more of the words on his own. By the time we finished the game he could read anything meant for kids his age.

A few weeks ago, at the Cleveland Agile User’s group, we had a games night. We played a game called “We’re having a party” We had about 35 adults all excited about making invitations for a party that won’t actually happen. We were all rushing seeing how many invitations we could make before the timer ran out, then we were brainstorming about how we could make more invitations next time. While we were having fun, we picked up some important lessons about lean software development, cycle time, and defect prevention.

When you are trying to introduce a new behavior into your team, try to make it into a game.

Tuesday, April 28, 2009

What comes from a 10 year agile project?

Not very many people can claim to be on a 10 year agile project. I have the privileged of being the tech lead for ThoughtWorks' longest running project. About 10 years ago, we started working with our client to build a leasing application that would revolutionize their industry, giving them the ability to grow quickly in boom times and survive through lean times. After 10 years and more than 25 major releases, we continue to do new development and have fun doing it.

This project has been the home of some interesting innovations, including CruiseControl. My project was one of two large projects from ThoughtWorks that started CruiseControl and open sourced it. We think we may have the longest running instance on CruiseControl in the world.

We also have a lot to talk about. The podcast Keeping Grey Code Fit - Perspectives on Mature Projects highlights just a few of our stories.

Even Martin Fowler has chimed in on this project. He refers to us in his blog as the Nashville Project.

Monday, March 23, 2009

Make sure to re-check your assumptions when performance tuning

We found the memory leak. This is what our properly tuned system looks like during a 66 hour test. Notice that the black line does not slowly trend upward, like it did in the last post.


The velocity templating framework that we use for generating xml to send to a third party system was slowly eating away at our memory. It wasn't really a problem with velocity, it was a problem with the way that we had our system tuned.

The first thing to remember is that Java splits up its memory into three parts, called generations. They are Young, Tenured, and Permanent.


Whenever you create a new instance of a class, it is created in the Young generation. This generation is used and quickly discarded each time there is a garbage collection. When the young generation is full, the garbage collector goes through the young generation and copies all the objects that are still referenced to the tenured generation. The place where you find memory leaks in most applications is when you keep references to too many objects. You will start getting out of memory exceptions when the garbage collector starts to copy objects from the Young generation to the Tenured generation, and there is not enough space to fit all of the objects that are still referenced. Frequently, you don't even realize that you are keeping a reference to those objects.


What made this leak particularly difficult to detect was that it was a problem in the permanent space. This area of memory is used to store objects that don't relate to instances of objects. This is where the JVM stores definitions of classes and methods. This area is pretty obscure, and the commercial tool that I used for profiling didn't even notice the leak in that area.

Velocity was creating dynamic classes, which are used once per call, then promptly discarded. These classes are created in the Permanent generation. The problem came from an optimization that we made about 5-7 years ago to speed up the server. We had set the flag noclassgc , which disables garbage collection for class definitions. This option makes sense if you never create dynamic classes, because you don't have to spend time reloading classes into memory.

We were creating classes that would never be used again, and telling the garbage collector not to bother with cleaning up after us. The definition of the classes are very small, which is why it took about 66 hours for the leak to take down the server.

The moral of the story is to revisit your performance optimizations occasionally and make sure they still make sense. The best decision today will not necessarily be the best decision tomorrow, so be ready to change your mind.

Wednesday, March 18, 2009

What takes 125 users 66 hours to discover?

Unfortunately, I think the answer is a very slow memory leak.


For our application, we typically run 3 classes of performance test. Our quick test with 100 users for 1 hour, our longer test with 125 users for 8 hours, and our really long test with 125 users for 66 hours.

The 1 and 8 hour tests were fine this time, but looking at the garbage collection logs and zooming out over a very long time period, we see that the application is holding on to memory. The blue area represents the memory that is used and released. The black area, that is slowing moving upwards, is the memory that is used and not released.

The last performance problem that we fixed was related to xml parsing and xpath. This time it looks like we need to tune Velocity.

Friday, December 5, 2008

Is is a memory leak or a performance problem?

We are about to start user acceptance testing for the next release of our project. The application passed the performance tests in our environment, but failed at our client site. Time to panic?

It doesn't have to be. This is how I analyzed our problem to determine if it was a memory leak or performance problem.

First some context:
The application is a Java J2EE web application running on Weblogic and integrating with about a dozen external systems through a variety of middleware platforms. The most dramatic change for this release involves rewriting one of those integrations to support a new and dramatically different version of an external system. This integration is very complicated and heavy on the xml.

So now I know there is a problem, but I don't know how to reproduce it in our environment. This kind of problem could easily be caused by a memory leak. The easiest way to tell if there is a leak is to look at the garbage collection logs.

The gc.log looks something like this:

[INFO ][memory ] 20.407: parallel nursery GC 378204K->202844K (2097152K), 220.715 ms
[INFO ][memory ] 24.512: parallel nursery GC 463356K->266172K (2097152K), 108.514 ms
[INFO ][memory ] 158.352: parallel nursery GC 594933K->397813K (2097152K), 223.864 ms
[INFO ][memory ] 354.614: parallel nursery GC 710277K->554373K (2097152K), 297.761 ms
[INFO ][memory ] 356.632: parallel nursery GC 918935K->772951K (2097152K), 246.615 ms
[INFO ][memory ] 359.976: parallel nursery GC 1121408K->1000192K (2097152K), 254.630 ms
[INFO ][memory ] 362.469: parallel nursery GC 1372915K->1239347K (2097152K), 245.692 ms
[INFO ][memory ] 366.508: parallel nursery GC 1595123K->1468403K (2097152K), 249.064 ms
[INFO ][memory ] 368.933: parallel nursery GC 1824064K->1698240K (2097152K), 240.599 ms
[INFO ][memory ] 371.724: parallel nursery GC 2047811K->2097152K (2097152K), 229.808 ms


While you could look through this file and try to analyze it in your head, I wouldn't recommend it. This is where a visualization tool comes in really handy. I suggest GCViewer (Thanks Andy)

  • First delete any lines off the top of the file that don't look like the lines I pasted above. They are interesting to read but will confuse GCViewer.
  • Then open the edited gc.log using GCViewer and find a section of the log where the system performing badly. Timestamps are useful for this.
The log will look something like:



The important things to look at in the graph are:
  • The saw tooth lines up. Each "tooth" where the memory usage drops a little represents a partial garbage collection. These partial garbage collections only grab the free memory that is easily available. These collections are fast, but not very effective.
  • The dramatic drops represent a full garbage collection. When you run out of memory, your system will do a full garbage collection, which should free up all the space that is available. This garbage collection is much slower, but is much more effective.
  • The gray box indicates a system pause for garbage collection. If you see a large gray box like the one on the right of the graph, this is a noticeable pause.
  • The amount of memory that is available after a full garbage collection remains approximately constant. This indicates that we are using memory quickly, but releasing it when we are done. If the amount of used memory was increasing after a full collection, that would indicate a memory leak. A leak happens when you hold on to a memory reference when you are done processing instead of releasing it.
Based on the analysis of this graph, I knew that there was not a memory leak, but instead that the system was using too much memory, too quickly. This was causing the JVM to do a full garbage collection every 15-20 seconds, and taking up to 2 seconds for a full collection.

In my next post, I'll talk about how I figured out what was using too much memory and how we fixed it.