Re: Handling large text streams of integers
On Apr 1, 8:47 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
On Tue, 31 Mar 2009 17:54:27 -0400, Victor Bazarov
<v.Abaza...@comAcast.net> wrote:
James Kanze wrote:
...
A lot of systems maintain the current position in the file
in an std::streamsize. Which means that
std::numeric_limits<std::streamsize>::max() is also the
maximum number of bytes in the file. And that there can't
be that many numbers, since each number requires at least
two bytes (one digit and a separator).
[..]
What if the "file" is actually a serial connection that,
like the Energizer Bunny, just keeps going, and going,
and... Will the system also try to keep track of the
"current position" on a socket, for example? I know, I
know, the OP asked about a text file...
I am obviously too lazy to check what the standard says about
std::numeric_limits<std::streamsize>::max(), but I hope it's
just a case of unfortunate naming and that it has to do with
seekable streams only (like James hinted at elsewhere).
I'd be very disappointed if you couldn't use iostreams with
"infinite streams", which (on Unix) includes pipes, (TCP)
sockets, /dev/random, ... I expect to be able to use
std::cin/cerr constantly for years.
Given that the standard doesn't require support for such
things, it doubtlessly doesn't say anything. Disk file
access (even without seek) often does involve the "current
position", at least internally. To quote from the man page
of "read" (the lowest level system function which accesses
the data) on Solaris:
On files that support seeking (for example, a regular
file), the read() starts at a position in the file
given by the file offset associated with fildes. The
file offset is incremented by the number of bytes
actually read.
Files that do not support seeking (for example,
terminals) always read from the current position. The
value of a file offset associated with such a file is
undefined.
But also:
For regular files, no data transfer will occur past the
offset maximum established in the open file description
associated with fildes.
Internally, the system maintains the position as a 64 bit
value. When compiling in 32 bit mode, std::streamsize is 32
bits, and files are opened by default in a mode which only
allows 2^32 as the offset maximum, so the limitation holds.
(The C++ standard library could open the files in a way that
would allow 64 bit seeks and reads, even in 32 bit mode.
I'm pretty sure it doesn't, since we've had problems with
log data being lost when the log file size was greater than
2^32.)
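To see which case applies to a given build, one can simply ask the
implementation how wide std::streamsize is. The following is a minimal
sketch, not anything taken from a particular library: the helper names
(streamsize_bits, max_stream_bytes, max_number_count) are made up for
illustration, and the bound on the number of integers rests on the
assumption stated earlier in the thread, that each number takes at
least two bytes.

```cpp
#include <cassert>
#include <climits>
#include <ios>
#include <limits>

// Width of std::streamsize in bits on this implementation
// (hypothetical helper, for illustration only).
inline int streamsize_bits() {
    return static_cast<int>(sizeof(std::streamsize) * CHAR_BIT);
}

// Largest byte count that std::streamsize can represent here.
inline std::streamsize max_stream_bytes() {
    return std::numeric_limits<std::streamsize>::max();
}

// Upper bound on how many numbers fit in a file of `bytes` bytes,
// assuming each number needs at least one digit and one separator.
inline std::streamsize max_number_count(std::streamsize bytes) {
    assert(bytes >= 0);
    return bytes / 2;
}
```

Printing streamsize_bits() and max_number_count(max_stream_bytes())
from main() in both 32 bit and 64 bit builds shows whether a counter
of the same type can overflow before the file position does.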
Generally speaking, a lot of systems allow files larger than
2^32 bytes while still supporting compilation in 32 bit mode.
In such cases, several solutions are possible:
-- If the system has two modes for accessing the files,
like Solaris, the library code just uses the 32 bit
mode, and the system behaves as if files couldn't be
bigger than 2^32 bytes. I suspect that this is the most
frequently used solution. (It's certainly the easiest
to implement, if the system supports it, and I suspect
that most systems, or at least most Unix systems, do.)
-- If the system doesn't have such support, the library
could keep track of the position as well, and simulate
it.
-- Alternatively, the library could either use a 64 bit
type for std::streamsize (if one exists on the
implementation) or define it as a class type, using 2 or
more smaller integral types in the implementation, using
whatever system requests are necessary to support full
64 bit file positioning at the system level. In many
ways, this would be the best solution. But if it means
making std::streamsize a class type, it will probably
break code. Incorrect code, since the standard doesn't
require that std::streamsize be an integral type, or
even that it reasonably convert to one, but such code
exists, and is, I fear, widespread. (If the system
supports long long, as most do nowadays,
std::streamsize could be a typedef to this.)
-- Finally, I'm sure that some libraries just ignore the
issue. If the system defaults to limiting the file size
to 2^32 in 32 bit code, this is identical to the first
case above. If it doesn't, then the library isn't
conforming: std::istream::tellg can return an apparently
valid position, but seeking to it will not go to the
right place. Still, conforming or not, it wouldn't
surprise me to encounter such a system.
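The failure described in the last case can be probed, at least
roughly, by checking whether a position returned by tellg survives a
round trip through seekg. The sketch below is only a heuristic: the
function name position_round_trips is invented for this example, it
compares a single byte, and to exercise the problem discussed here it
would have to be run on a file larger than the 32 bit limit, which the
snippet itself does not create.

```cpp
#include <ios>
#include <istream>
#include <sstream>

// Returns true if seeking back to a position reported by tellg()
// lands on the same byte again. Hypothetical helper; a heuristic
// check, not a proof of conformance.
bool position_round_trips(std::istream& in, std::streamsize skip) {
    in.ignore(skip);                                  // advance `skip` bytes
    const std::istream::pos_type pos = in.tellg();    // remember where we are
    if (pos == std::istream::pos_type(-1)) {
        return false;                                 // not seekable, or stream failed
    }
    const int first = in.get();                       // byte at the remembered position
    if (!in) {
        return false;                                 // nothing left to read there
    }
    in.seekg(pos);                                    // try to go back
    if (!in || in.tellg() != pos) {
        return false;                                 // seek failed or landed elsewhere
    }
    return in.get() == first;                         // same byte as before?
}
```

The same function works on a std::ifstream opened in binary mode; a
std::istringstream is enough to check that the logic itself behaves.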
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34