Re: Optiimising StringBuilder Size revisited

From:

Tom Anderson <twic@urchin.earth.li>

Newsgroups:

comp.lang.java.programmer

Date:

Mon, 22 Sep 2008 13:47:49 +0100

Message-ID:

<Pine.LNX.4.64.0809221328040.2745@urchin.earth.li>

On Sun, 21 Sep 2008, Roedy Green wrote:

Knowing that, how big should you make the initialisation parameter?

Really, you want to compute it, or at least estimate it, at runtime based
on what you're going to put in the buffer. Obviously, you can't do that
with an automatic tool, at least not without serious cleverness.

However, one thing you might consider is having the tool report the sites
where the most waste occurs - not just the most waste per instance, but
the most waste total, ie waste per instance * number of buffers allocated
at that site during the run. The programmer could then go and hand-tune
the sizes there if they wanted. This would be after applying your
automatic tuning process, of course.

Setting it to the longest string conserves CPU time in an infinitely RAM
rich environment. However outside the government, you probably also
want to conserve RAM by making letting a few of the StringBuildes grow
in return for most StringBuilders being smaller.

Since arrays need to be zeroed when they're allocated, allocating an
overly large array does also waste cycles.

Indeed, since StringBuffers are generally fairly short-lived, the worry is
not really about using up RAM, it's about generating garbage and making
the collector run more often that it otherwise would. Again, a CPU cost.

You need some measure of tradeoff of doublings vs wasted bytes in a
given invocation.

This is the crux of it - you can fairly easily calculate the number of
unnecessary doublings and number of wasted bytes that any given initial
size will give you, but how do you compare those numbers to decide which
initial size is best? It's easy when, for some initial sizes a and b, a
gives both fewer or equal doublings and fewer wasted bytes than b, but
that will be fairly uncommon - usually, wasting fewer bytes means making
more doublings.

I suggest a fairly simple metric: total number of chars allocated. The
logic here is that the CPU cost of both allocation (zeroing) and
deallocation (making GC happen more frequently) is directly proportional
to this. Working out the total allocation for a given combination of
initial and final size is fairly straightforward:

int allocation(int initial, int final) {
if (final <= initial) return initial ;
else return initial + allocation((initial * 2), final) ;
}

if you record all the final sizes in an array called finals, you can
evaluate the lifecycle cost of a given initial size choice easily:

int lifecycleAllocation(int initial, int[] finals) {
int sum = 0 ;
for (int final: finals) sum += allocation(initial, finals) ;
return sum ;
}

Then, as you say, you just run over every possible initial value from the
size of the smallest string to the largest, and pick the one which gives
the lowst lifecycle allocation.

What's slightly counterintuitive about this approach is that there's no
concept of wasted space. But you don't need it - you just need to allocate
as few characters as possible.

If you wanted to be really anal, you could modify the allocation()
function to take account of the size of an array header - i don't know for
sure how big that is, but i think it's something like three or four
machine words, which is equivalent to six or eight chars.

A mindless algorithm might work like this:

Simulate the doublings and wasted bytes of every for every possible
StringBuilder size between min and max length and pick the op for each
instance.

"The op"?

tom

--
Also, a 'dark future where there is only war!' ... have you seen the
news lately? -- applez