Re: I/O Performance
Eric Sosman wrote:
The time spent opening and closing all those little files
is (presumably) pretty much independent of their sizes, and if
it's significant -- a couple nice, deep directory searches on
a network-attached drive, say -- it might turn out to be the
performance-limiting term. I'd suggest doing some profiling on
elapsed time (not on CPU time) to get an idea of the division
between open/close time and I/O time, to see where further effort
should be focused.
If open/close times turn out to be significant, the only
thing I can think of to ameliorate the problem would be to do
them in parallel. ...
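A rough way to get the elapsed-time split described above is to bracket the open, read, and close steps with System.nanoTime(). This is only a sketch: the class name OpenCloseTiming and the method measure are made up for illustration, and a real measurement would need warm-up runs and cold-cache runs to mean much.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not from the original thread.
class OpenCloseTiming {
    /** Returns {nanos spent in open+close, nanos spent reading, bytes read}. */
    static long[] measure(List<Path> files) throws IOException {
        long openClose = 0, read = 0, bytes = 0;
        byte[] buf = new byte[8192];
        for (Path f : files) {
            long t0 = System.nanoTime();
            InputStream in = Files.newInputStream(f);   // open
            long t1 = System.nanoTime();
            try {
                for (int n; (n = in.read(buf)) > 0; ) { // read to end of file
                    bytes += n;
                }
            } finally {
                long t2 = System.nanoTime();
                in.close();                             // close
                long t3 = System.nanoTime();
                openClose += (t1 - t0) + (t3 - t2);
                read += t2 - t1;
            }
        }
        return new long[] {openClose, read, bytes};
    }
}
```

If the first number dominates the second across a representative set of files, the open/close side is where the effort should go.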
The output files will be a limiting factor. The requirement that
"[l]ines will be appended unchanged to a small number of output files" implies
that consumers of the Readers will have to wait their turn at the output side
to prevent interleaving.
This is a classic producer-consumer scenario with a few standard solution
architectures.
You might, for example, have a pool of worker threads.
java.util.concurrent and brethren have a lot of useful support types for this
sort of thing:
LinkedBlockingQueue, Executor, ExecutorService, FutureTask, ...
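As a minimal sketch of one such architecture, assuming a single output file and inputs small enough to hold in memory: a fixed pool of reader tasks puts each file's lines on a LinkedBlockingQueue, and one writer task drains the queue, so lines from different inputs never interleave. The names MergeSketch, merge, and DONE are invented for the example. Note that readAllLines strips line terminators and newLine() writes the platform's, so this is not strictly "unchanged" on every platform.

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical class; a sketch under the assumptions stated above.
class MergeSketch {
    // Sentinel telling the writer to stop; compared by identity, never by equals().
    private static final List<String> DONE = new ArrayList<>();

    static void merge(List<Path> inputs, Path output, int readers) throws Exception {
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService readerPool = Executors.newFixedThreadPool(readers);
        ExecutorService writerPool = Executors.newSingleThreadExecutor();
        try {
            // Consumer: the only thread that touches the output file.
            Future<?> writer = writerPool.submit(() -> {
                try (BufferedWriter out = Files.newBufferedWriter(output,
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    for (List<String> lines = queue.take(); lines != DONE;
                            lines = queue.take()) {
                        for (String line : lines) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                return null;
            });

            // Producers: each task reads one whole file, then queues it as a unit.
            List<Future<Void>> reads = new ArrayList<>();
            for (Path in : inputs) {
                reads.add(readerPool.submit(() -> {
                    queue.put(Files.readAllLines(in, StandardCharsets.UTF_8));
                    return null;
                }));
            }
            try {
                for (Future<Void> read : reads) {
                    read.get(); // rethrows any read failure
                }
            } finally {
                queue.put(DONE); // always release the writer
            }
            writer.get();
        } finally {
            readerPool.shutdown();
            writerPool.shutdown();
        }
    }
}
```

With several output files you would presumably give each its own queue and writer; the reader pool size stays independent of that.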
You might be able to achieve something clever with NIO and Buffer magic. The
program has a many-file, short-read-life pattern on input and a few-channel,
long-append-life pattern on output. You can provide optimal output structures
because they'll last a while. Input will be dominated by directory seek and
file open times.
So you get a lot of worker threads out there gathering memory buffers full of
complete input and queuing them up to go to the output channels. Each input
worker takes a while, but gets a fast input channel or even an entire
file-full in a single shot, once its directory seek/open cycle is finally done.
The output side will be much faster - no seek/open time, just wait on a
resource adapter (gateway to a resource, e.g., work queue, output channel)
that has optimized access to the output resource. This is where NIO might
shine. You might even be able to avoid double-buffering if you can tie the
NIO output Buffer directly to the memory (chunks) delivered by the input
worker when it finally connects to the output resource adapter. I feel sure
there's some such magic available in the Java API but specifics escape me at
the moment.
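One plain way to get close to that with NIO, offered as a sketch rather than the "magic" itself: the input worker fills a ByteBuffer from a FileChannel, and the very same buffer is later handed to a long-lived FileChannel opened for append, so our own code never copies the bytes a second time. The names NioAppendSketch, slurp, and append are hypothetical, and the (int) cast assumes each input file is well under 2 GB.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class; a sketch under the assumptions stated above.
class NioAppendSketch {
    /** Input side: read one whole file into a single buffer, flipped for writing. */
    static ByteBuffer slurp(Path input) throws IOException {
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) in.size()); // assumes small files
            while (buf.hasRemaining() && in.read(buf) >= 0) {
                // keep reading until the buffer is full or end of file
            }
            buf.flip();
            return buf;
        }
    }

    /** Output side: append the worker's buffer to an already-open channel. */
    static void append(FileChannel out, ByteBuffer buf) throws IOException {
        // Lock so that one file's bytes go out as a unit even if a write is partial.
        synchronized (out) {
            while (buf.hasRemaining()) {
                out.write(buf);
            }
        }
    }
}
```

The output channel would be opened once, for example with FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND), and kept for the life of the run. Whether ByteBuffer.allocateDirect would help here is something to measure, not assume.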
I'm being vague, partly on purpose, to avoid enmiring us in details, like
exactly what sort of queue, pipe, NIO channel, message-passer, listener,
stream, temporary file would catch the output. The structure is similar with
a variety of implementations. The key is to make the number of input workers
independent of the number of output channels. Slow-to-start, numerous inputs
should match pretty well with scarce, very fast, no-overhead output channels.
The output workers are the consumers, the input workers the producers.
Subscribers, publishers. Services, clients.
There is more art than science to balancing the numbers on each side. Some
call this balance the "impedance match". I do.
<http://en.wikipedia.org/wiki/Impedance_match>
The concept of impedance matching was originally developed for
electrical power, but can be applied to any other field where
a form of energy (not just electrical) is transferred between
a source and a load.
Miners and carts. Lots of miners lined up waiting to dump their carts of
input data into a few railroad cars that'll carry large loads away.
--
Lew