The More Things Change, the More They Stay the Same

Posted by Berin Loritsch Wed, 21 Jul 2010 23:25:00 GMT

Garbage collection is nothing new, and the Java folks have pretty much proven that garbage collection is a valid solution to memory management problems. Thanks to the JVM, the bar was set pretty high when Microsoft entered the fray with the Common Language Runtime (CLR). Based on some recent observations tracking down some performance issues in a C# application, it turns out all the experience writing Java paid off. So here are the high level observations:

  • Both the JVM and CLR use a variation of mark/sweep and generational garbage collection
  • Both the JVM and CLR will pause program execution during collection processing (even if there are threaded garbage collectors, the program is paused in that thread while collection occurs)
  • Creating a number of small, simple objects will be quickly collected in the first generational sweep
  • Creating a number of large, hierarchical object trees will be more slowly collected in later generation sweeps
  • Later generation sweeps take a long time in comparison to the first generation
  • The CLR garbage collector is more deterministic than the JVM garbage collector

All the bullet points up to the last one should be common sense if you’ve been developing in a garbage collected language for any amount of time. The last one actually surprised me a bit. When I performed some micro benchmarks to confirm my suspicions about the performance characteristics, the number of garbage collections was always the same. The amount of time the process took varied, the amount of memory varied (albeit very little), but the number of garbage collections performed in each test was identical. What this means is that it is a bit easier to perform micro benchmarks in the CLR and have reasonable confidence in the results.

Needless to say, when you are having performance problems and it is not immediately obvious what the problem is, suspect unplanned garbage collection first. I was given two algorithms that did the same thing, but a little differently. The original algorithm was accurate, but slow. The new algorithm was inaccurate but about 4-5 times faster. The question was, can we keep the accuracy but get the speed? So the first thing I did was put together a test harness to ensure that everything was measured the same way. I can’t post my code here, but there are a few things to consider when writing micro benchmarks:

  1. Always provide a warm up period so the environments can perform any optimizations (yes, the CLR does runtime optimizations)
  2. Perform your garbage collection, and wait for all finalizers (there’s two calls on the GC object to do this)
  3. Collect your pre-test measurements: time, memory usage, number of garbage collections in each generation
  4. Perform you test (use several iterations)
  5. Collect your post-test measurements: time, memory usage, number of garbage collections in each generation
  6. Report the difference between the pre-test measurements and the post-test measurements

I was able to document the difference between the two algorithms. The slow algorithm was four times slower, used 2 MB of memory, and caused more than 350 garbage collections—over a hundred of which were in the mature generation collection. Looking at the algorithms the biggest difference that would explain the difference in memory consumption like that was the object being used to hold the data. The faster algorithm used a pair of objects that held the raw data directly encapsulated. The slower algorithm used a matrix library, which held the values in an array of arrays. An array of arrays is a complex object from the viewpoint of the garbage collector. So on a hunch (that Java experience coming to play), I made a copy of the older algorithm that used a simple array.

By just changing the object holding the values, the original accurate algorithm was marginally faster than the newer “optimized” inaccurate one. It caused fewer garbage collections, and they were all first generation collections which are faster. I thought, “this is great, let’s try to do even better”. If the source of the performance drain was related to garbage collection, let’s see if we can eliminate it altogether. The key to making this happen is to use the Flyweight Pattern. Basically, the flyweight pattern says to create a small set of unique objects and then just change the values in them as necessary. In order to make that happen, I had to adjust the original API for the original algorithm to use references to the return object so I can let it reuse the response object. Just to up the ante, I used the original expensive value object to perform this test.

The results had the original algorithm performing twice as fast as the optimized algorithm, using only 8 KB of memory, and 0 garbage collections. All of this just by changing how we are using the objects that are expensive to create. I realize it is not always practical to use the flyweight pattern, but in the times where it can be used the results can be quite impressive. I was able to drop something that took 4 seconds down to 400 milliseconds. Those results are not typical, but it is always worth throwing a couple tests to discover the nature of the problem. If my first hunch didn’t pay off, I would never have tried the flyweight pattern. I would have looked at other avenues of speeding things up.

The bottom line is this: the more pressure you put on your garbage collector by creating and discarding objects in rapid succession, the more your garbage collector will punish you by sucking up all the CPU cycles.

Multi-threading and .Net 4.0

Posted by Berin Loritsch Wed, 07 Jul 2010 13:19:00 GMT

.Net 4.0 has some new features that will make it easier to work with multiple threads. This will in turn make it easier to use up all those cores on your multi-core processors. The usual warnings and caveats apply with ensuring your classes are thread safe. So, what is there to get excited about?

First, .Net finally gets it’s Barrier class. A barrier allows all the threads running in parallel to complete at the same time. Java’s had theirs since Java 5 (more than a couple years old). I find it most useful when I am using multiple threads to do some number crunching, but I need to make sure I’ve incorporated all their work before I go on. The initial purpose is to have the multiple threads sync up before doing the next round of processing. Using multi-threaded ray tracer problem from yesterday, let’s look at the problem it will solve:


ThreadPool.SetMaxThreads(screenWidth,25);

for (int y = 0; y < screenHeight; y++)
{
    for (int x = 0; x < screenWidth; x++)
    {
        int cx = x;
        int cy = y;

        ThreadPool.QueueUserWorkItem(state =>
        {
            Color c = RenderColor(cx, cy, scene);
            RenderPixelDelegate dl = RenderPixel;
            pictureBox.Invoke(dl, new Object[] { cx, cy, c });
        }, (y+1) * (x+1));
    }
}

Now, the RenderPixel(x,y,color) method takes care of plotting the pixel on the bitmap and telling the screen to redraw. The problem is the screen redrawing. That takes a lot of time, and the whole program performs better when you only redraw when the whole raster line is done. Problem is, you can’t be sure that the raster line is complete just because the last pixel in the row finished when you are computing the pixels in parallel. Truly we only care about the last line. In Java we could poll the ThreadPool to see if there are any remaining workers. Unfortunately .Net doesn’t give us that option. This is where the Barrier comes in to play. By using a Barrier we can ensure all the pixels are rendered before issuing the final redraw command:


ThreadPool.SetMaxThreads(screenWidth,25);
Barrier barrier = new Barrier(screenHeight * screenWidth + 1);

for (int y = 0; y < screenHeight; y++)
{
    for (int x = 0; x < screenWidth; x++)
    {
        int cx = x;
        int cy = y;

        ThreadPool.QueueUserWorkItem(state =>
        {
            Color c = RenderColor(cx, cy, scene);
            RenderPixelDelegate dl = RenderPixel;
            pictureBox.Invoke(dl, new Object[] { cx, cy, c });
            barrier.SignalAndWait();
        }, (y+1) * (x+1));
    }
}

barrier.SignalAndWait();
barrier.Dispose();

pictureBox.Refresh();

A couple things I’ll point out here. When you want the barrier to force the main thread to wait, you have to include it in the number of Barrier participants. I added a participant for each thread. The Java variation allowed for a thread to signal that it got to the synchronization point but not pause. That’s more useful in a situation like this where we only need to be notified that the thread is done. Having too many open Barriers may have bad effects on the ThreadPool because the worker queue items can’t officially close until all the barriers are synced up. Unfortunately I do not have .Net 4 installed on my machine so I can’t say for sure. It’s my suspicion based on my experience with the Java equivalent.

Another key threading improvement is the Parallel class. In some ways I think it is a step up from the Java Fork/Join Task architecture. The nice thing about the Parallel class is that you can use the Action delegate instead of the WaitCallBack delegate. The WaitCallBack item passes in an object for the thread state, however you rarely need it. The Parallel class will let you invoke the same delegate several times, or perform a parallel loop for you. Essentially, the code block I’ve been using can look like this now:


Barrier barrier = new Barrier(screenHeight * screenWidth + 1);

Parallel.For(0, screenHeight, y =>
{
    Parallel.For(0, screenWidth, x =>
    {
        int cx = x;
        int cy = y;

        Color c = RenderColor(cx, cy, scene);
        RenderPixelDelegate dl = RenderPixel;
        pictureBox.Invoke(dl, new Object[] { cx, cy, c });
        barrier.SignalAndWait();
    }
}

barrier.SignalAndWait();
barrier.Dispose();

pictureBox.Refresh();

Of course, I’m assuming that the Parallel For loop has the same mutable integer problem that I experienced with the delegate BeginInvoke problem. The problem still exists for the ThreadPool worker pool, so there is no reason for me to assume any different here. I like the simplicity and lack of clutter that the Parallel class provides.

Please do note that I chose a highly parallelizable problem to show off these features. Not all complex actions are as responsive. Below is a checklist to determine if a type of problem/algorithm can be ridiculously parallel:

  • No state needs to be shared between threads
  • All state necessary for the function was passed in as parameters
  • There are no reference or output parameters

Ray tracing fits this bill well because each pixel can be computed and plotted completely independent of the other pixels. If we added anti-aliasing to the mix, we would need to create more samples than we display. Web applications also fit this bill well because HTTP is a stateless protocol. Each request is handled independently from the others. There are several other problems that fit this bill.

When you have a set of data that is shared between threads and it has to remain consistent (i.e. race conditions would cause major problems or unstable behavior), you have to be more careful with your threads. Sure you have locks, mutexes, etc. but they essentially turn a multi-threaded application into a single threaded application and carry more overhead than if you never dealt with threads in the first place. By rethinking the problem a bit, you might be able to make the overall solution a bit more friendly to multiple threads. Below are some ways of making room:

  • Copy all data into each thread
    • Avoids synchronizations because the data is being used only in one thread at a time
    • Adds memory overhead and requires efficient and safe copy routines
  • Don’t do micro threads
    • The ThreadPool and Parallel classes handle micro processing needs, reusing threads as necessary
    • The costs outweigh the benefits. Always look for the major points that can be parallelized before looking at smaller items.
  • Understand the goals you have for multi-threading
    • The raw processing time might be shorter if you did everything in one thread
    • You can process more things at once if you have multiple cores, but you won’t gain much from having more threads than processing units (in fact you might lose some)
    • Is it the data or the process that can be run in parallel? Sometimes it’s not recommended to split an algorithm up into multiple threads, but you still might be able to process the data in chunks.
    • Stop if you can’t understand what the code is supposed to be doing. Debugging parallel code is very difficult, but if you have no working mental model for how the code is supposed to work it is impossible.

The trick with multi-threaded programming is minimizing the dependencies between the threads. The less they have to coordinate with each other the more efficiently the threads can use your computer. That’s a lesson from Erlang.

.Net Asynchronous Features Require Some Rethinking

Posted by Berin Loritsch Tue, 06 Jul 2010 17:31:00 GMT

Disclaimer: I’m originally a Java guy, and I know all kinds of asynchronous patterns and approaches that work in the Java world. The .Net approach to multi-threading and asynchronous applications is quite a bit different. In some ways the way that Microsoft approached the problem is more problematic. I stumbled across someone’s code for a small C# ray tracer with a fixed scene. I figured that this would be the perfect place to start playing with asynchronous code. Microsoft likes to hide things from you. I find this a bit problematic while trying to find out what is going wrong. Case and point: my assumptions about the way lambdas and closures ought to work (from my Ruby background) turned out to be wrong. Consider this code snippet:


for (int y = 0; y < screenHeight; y++)
{
    for (int x = 0; x < screenWidth; x++)
    {
        RenderColorDelegate px = RenderColor;
        px.BeginInvoke(x, y, scene, ar =>
           {
                Color c = px.EndInvoke(ar);
                RenderPixelDelegate dl = RenderPixel;
                pictureBox.Invoke(dl,new Object[]{x,y,c});
            }, null);
    }
}

Before I go into the surprise, let me tel you what this code is doing. This code is launching the rays from the view port one pixel at a time to calculate the color for that pixel. The px.BeginInvoke call is done on a delegate I had to create to alias the method I wanted to make asynchronous. I needed to pass in the variable “x”, “y”, and the scene that was passed in to the Render method. The original code makes use of synchronous calls, and behaves exactly as you would expect. X is always between 0 and one less than screenWidth. Y is always between 0 and one less than screenHeight. Now here is the surprise: when calling the methods asynchronously X can become greater than screenWidth!

Let me state that again in other terms: the values you pass in to the asynchronous method are not the values that end up being used. It behaves as a reference to the integer you pass in, and any call to x++ will increment the value on any asynchronous calls that have not been executed yet. That should alarm you. This can be (and was in this case) the source of serious and difficult to trace errors. With Ruby and Java anonymous classes, the values for x and y are frozen at the moment the asynchronous block is created. However, with .Net it is not. In order to fix the problem we have to copy x and y to new integers strictly for the purpose of passing in to the asynchrnous call:


for (int y = 0; y < screenHeight; y++)
{
    for (int x = 0; x < screenWidth; x++)
    {
        // Necessary to work around bug
        int cx = x;
        int cy = y;

        RenderColorDelegate px = RenderColor;
        px.BeginInvoke(cx, cy, scene, ar =>
            {
                Color c = px.EndInvoke(ar);
                RenderPixelDelegate dl = RenderPixel;
                pictureBox.Invoke(dl,new Object[]{cx,cy,c});
            }, null);
    }
}

The “pictureBox.Invoke” method should be familiar to .Net and Java UI developers. Essentially screen drawing and refreshing has to be done within the GUI thread. The “Invoke” method makes that happen when you are in another thread. The style of asynchronous call that I used included the callback function. This makes the redraw happen after the calculation is done.

When it comes to more traditional Thread manipulation, Java’s bag of tricks is a bit more complete. That may have changed with .Net 4.0 but I need to stick with 3.5. For example, the Future construct is really what I need. A Future is basically the promise of a value for when you need it later. It allows you to start processing a complex algorithm before you need the result, and pause for the result only when you really need it. It works for lazy initialization, but it’s more effective when you have a number of complex calculations you need to merge together. Java also has some good thread pool support, tasks, and other features. The Fork/Join approach in Java7 is also a very effective way to divvy up work. .Net 4.0 may have something similar to that, but both of those are taboo for new work for the time being.

All in all, I only scratched the surface. The delegate BeginInvoke() approach was designed for asynchronous event processing. It wasn’t really designed for more serious multi-threading. In fact, the way I used it for the ray tracer was the wrong tool for the job.

The bottom line is that I have to change the way I think about multi-threaded application design in the .Net platform. Some of the principles I’ve picked up from other languages still convey—but there’s a new twist on them. It’s to be expected. However, do be cautious of the delegate problem I ran into. It’s not good to have your values change underneath of you.

Fun with Delegates, or How to make .Net more like Ruby 2

Posted by Berin Loritsch Thu, 01 Jul 2010 16:53:00 GMT

I’ll admit it, I’m a Ruby fan. There are certain aspects of programming with that language that are really fun. While I don’t have the privilege of working with it every day, I’m working with it enough. With yesterday’s look at the world of LINQ, I decided to play a bit more with delegates and extension methods. For the other Ruby fans out there, it is true that C# is both a statically typed language and those types are locked in at compile time. However, there are ways to make C# behave more like Ruby. Sure you can use the var keyword which just means that the type isn’t decided until the first assignment. But I’m going to look at imitating the way Ruby does looping in C#.

Our example Ruby code that we want to imitate:

list = ["one", "two", "three"]

list.each {|item| puts item}

# Alternately:

list.each do |item|
  puts item
end

Perusing the .Net APIs, your IEnumerable<T> or List<T> objects don’t have an Each() method. Although it’s not difficult to add it after the fact. The process is not hard. The first step is to create an extension method. Extension methods are much like Ruby mixins, and are implemented very similarly. As long as our mixin class in in scope, so is our extension method. Let’s look at how it would look in C#:


string[] list = {"one", "two", "three"};

list.Each( item => Console.WriteLine(item) );

// Alternately:

list.Each(
    delegate(string item)
    {
        Console.WriteLine(item);
    });

The code to implement this really is not difficult to do, but it requires some understanding of the underlying mechanisms that make it possible:


public static class IEnumerableExtensions
{
    public static void Each<T>(this IEnumerable<T> list,
            Action<T> callback)
    {
        foreach(T item in list)
        {
             callback(item);
        }
    }
}

To better understand what’s going on here, I’m going to have to call out a few things. First, the mixin class IEnumerableExtensions must be static. Without the static keyword in front of the class the CLR will not be able to look inside for extension methods. That keyword also tells the compiler that all members of this class will also be static. The class cannot be instantiated in any way. Functionally, it behaves a lot like a Ruby module, which is why I’m calling it a mixin class. Note: the name of the class really doesn’t matter, but convention adds the word “Extensions” to the class or interface we are adding functionality to.

Next, the first argument of an extension method includes the keyword “this” and the type we are extending as the first parameter. Without the keyword “this” we would have just another random static method. Also note, this is a feature that is not possible in Java without bytecode manipulation or dynamic proxy chicanery. Perhaps that will change in the future. For those that used to know how C++ objects worked, the structure here makes a lot of sense. It’s similar to the idiom popular among C programmers to pass the struct being modified as the first parameter. I chose the IEnumberable interface because the same method will work on anything that implements that interface: which is exactly how Ruby attacks the problem. Essentially Ruby has an Enumerable module (mixin) that defines all the extra methods the collection classes can use.

We used generics here to provide the same idioms and levels of type checking that come with the .Net platform. Even when you are mixing in concepts derived from other languages, you should never completely throw away the principles in the language you are using. The generics allow us to use the method preserving all the type safety built into the language without requiring us to write a number of overloads. This helps using the method feel a little more Ruby-like without ignoring C# principles.

Lastly, I want to point out the type Action. It is a delegate defined by the .Net framework, along with its companion “Func”. The only difference between the two is that Action does not return anything and Func does. Instead of creating my own delegates, I simply used what was already available. With the pair of delegates supplied, we can have a lot of fun. For example, let’s say we wanted to derive a list of answers from executing the same function across all the members of a set of values. One example would be to derive the Root Mean Square (RMS) value on a set of numbers. Our extension method would look like this:


public static List<TResult> Collect<T,TResult>(
        this IEnumerable<T> list, Func<T,TResult> callback)
{
    List<TResult> answers = new List<TResult>();

    foreach (T item in list)
    {
        answers.Add(callback(item));
    }

    return answers;
}

The way you would use the “Derive” method we just defined to calculate the RMS would look something like this:


double[] values = {1.0, 4.0, 3.0};

double rms = Math.Sqrt(
    values.Collect(x => x * x).Sum() / values.Count());

Console.WriteLine("RMS is: {0}", rms);

The “Derive” method as it is written will also allow you to do a mass conversion of all the elements in one IEnumerable object into a new list. For example, if you had an array of numbers and you wanted a set of strings it can be done with just one line:


double[] values = {1.0, 4.0, 3.0};

List<string> conversion = values.Derive( item => item.ToString() );

conversion.Each( item => Console.WriteLine(item) );

Your creativity is bounded only by your imagination.