Re: Reading LAST line from text file without iterating through the file?
On Sat, 26 Feb 2011 13:29:48 +0000, Martin Gregorie wrote:
On Sat, 26 Feb 2011 12:15:21 +0100, Ken Wesson wrote:
On Fri, 25 Feb 2011 14:58:27 +0000, Martin Gregorie wrote:
A text file contains records. They are variable-length records with a
'newline' encoding as the delimiter.
By that definition the concept of "record-based" vs. "not-record-based"
becomes completely meaningless.
It is pretty much meaningless unless you're referring to the way a
program handles data. Consider a file containing nothing but printable
characters:
- if a C or Java program reads the file byte by byte, or parses it
  by reading words separated by whitespace, then line delimiters are
  utterly meaningless and the program doesn't care whether the file
  contains records or not.
- OTOH, if a different program reads the same file a line at a time, e.g.
  C using fgets() or Java using BufferedReader.readLine(), then this is
  pure record-level access (a sketch contrasting the two follows this list).
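To make the contrast concrete, here's a minimal sketch of both access
styles over the same file (the file name comes from the command line;
nothing beyond stock java.io is assumed):

import java.io.*;

public class AccessStyles {
    public static void main(String[] args) throws IOException {
        // Byte-by-byte: a newline is just one more byte.
        InputStream in = new FileInputStream(args[0]);
        int b, bytes = 0;
        while ((b = in.read()) != -1) bytes++;
        in.close();

        // Line-at-a-time: pure record-level access.
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        int records = 0;
        while (br.readLine() != null) records++;
        br.close();

        System.out.println(bytes + " bytes, " + records + " records");
    }
}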
But the text file itself is not "record-based". You can implement a
record-based format *on top of* text -- CSV goes further in that
direction -- but the resulting file, crucially, can still be manipulated
correctly by tools designed for generic operations on arbitrary text
files. In particular, this should be lossless on it:
import java.io.*;

public class TextFileCopier {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Please specify source and " +
                "destination file.");
            return;
        }
        Reader rdr = new InputStreamReader(new FileInputStream(args[0]));
        Writer wtr = new OutputStreamWriter(new FileOutputStream(args[1]));
        int c;
        // Copy character by character; a true text format survives this.
        while ((c = rdr.read()) != -1) wtr.write(c);
        // Close (and thereby flush) the writer, or the tail of the
        // copy may be lost.
        wtr.close();
        rdr.close();
    }
}
But this won't be lossless on the strange file formats Arne has become
obsessed with. At the reading stage, the record boundaries in those
formats will be translated into some newline character, most likely
\u000A. When that happens, the distinction between those boundaries and
literal \u000A characters in the source file is lost and can never be
regained.
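To spell out the lossiness: once a record boundary and a literal newline
both decode to the same character, nothing downstream can tell them
apart. A tiny illustration, where the two strings stand in for
hypothetical decodings of two different on-disk records:

public class LossDemo {
    public static void main(String[] args) {
        // Hypothetical decodings: a record boundary mapped to \u000A
        // versus a literal \u000A that was in the record's data.
        String fromBoundary = "line1" + "\n" + "line2";
        String fromLiteral  = "line1\nline2";
        // Identical char streams: the distinction is unrecoverable.
        System.out.println(fromBoundary.equals(fromLiteral)); // true
    }
}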
Surely you agree that a file format cannot be regarded as a true text
file format unless the above TextFileCopier can copy all files in that
format faithfully?
But most of us use "records" to mean a structure that involves out-of-
band boundaries of some sort.
Not necessarily.
Yes necessarily, where "out of band" is taken with respect to whatever is
in the record fields.
However, fixed-length records made up of fixed-length fields contain no
out-of-band structure. You want an example? How about the two magnetic
stripe tracks on a credit card: 40 bytes, containing fields whose
content and meaning are defined by their position.
The boundaries are certainly out of band here -- they aren't represented
in the data itself at all, but rather in the reader and writer software!
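For illustration, here's what "boundaries in the software" look like in
practice; the offsets below are invented for the sketch, not the real
track layout:

// Field boundaries live entirely in the code; the record is just bytes.
// These offsets are made up for illustration (record assumed 40 chars).
class FixedRecord {
    static String field1(String rec) { return rec.substring(0, 16); }
    static String field2(String rec) { return rec.substring(16, 20); }
    static String field3(String rec) { return rec.substring(20, 40); }
}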
BTW, you can use C to handle iSeries text files through the usual
fgets() and fputs() functions despite the iSeries holding text in what
are effectively database rows. They have three fields per row: a line
number, a fixed-length text field, and an 8-byte ID.
That's text plus file metadata.
Indeed it is. Technically it is made up of fixed-length fields with no
delimiters. Apart from the record description that forms part of every
file, and the member separators, the only metadata is similar to a Unix
directory entry plus the i-node. OS/400 and z/OS text files are closer
to a tar or zip archive than to what a Unix or Windows user considers a
text file, because you can store many separate chunks of text in a
single text file.
So which is it -- a tar-like archive of multiple text files, each with
internal newlines, or a single text file with a funny representation of
newlines? Arne indicated the latter while you seem to be indicating the
former.
Of course, neither is a true text file -- both fail the TextFileCopier
test, in particular, and yours doesn't even pass the most elementary
sniff test -- calling that a text file would be like expecting
List<String> x = whatever;
String y = x;
to compile and work ... somehow.
It's a clear type error.
What makes it not *quite* a legitimate text file is that the file's
actual content contains a line break that is distinct from 0x0A and 0x0D.
No it doesn't. The editor won't let you put newlines into an OS/400
text file
If so, then that editor is broken. And if you edit the file with a
working editor (say, by mounting the file system over the network and
using vi on it from the comfort of your nice, sane Unix workstation) you
can certainly put newlines in it.
Database rows need an ID field so there's something you can uniquely
key on, and you said the system stores text in database rows, so
there's your explanation. The thing that makes no sense is it storing
text in database rows instead of as native text.
Nice guess, but that's not how it works.
Sure it is.
That role is taken by the line number (which can be a decimal value -
when you add lines between lines 0002 and 0003 they'll be numbered
0002.01, 0002.02, etc. until you ask the editor to renumber the member).
Unlike Unix and Windows systems, the line numbers in compilation errors
aren't screwed up by editing the source.
Well, that's a silly design, then. They already had a field perfectly
usable as the row ID field and added another separate one? What the hell
course in DB design did they take? Maybe one that dogmatically told them
never to make any meaningful data the key field even if it is guaranteed
unique?
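For what it's worth, here's my reading of that numbering scheme as a
sketch -- my reconstruction, not IBM's actual algorithm:

import java.math.BigDecimal;

public class SeuNumbering {
    // My reconstruction of the numbering described above: the nth line
    // inserted after line 0002 gets 0002.01, 0002.02, and so on.
    static BigDecimal inserted(BigDecimal prev, int n) {
        return prev.add(BigDecimal.valueOf(n).movePointLeft(2));
    }

    public static void main(String[] args) {
        System.out.println(inserted(new BigDecimal("2"), 1)); // 2.01
    }
}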
Actually C is already broken here even on "normal" systems, because C
strings can't properly represent text containing NUL characters.
By definition they can't be included in 'text files'
They belong to the Unicode (and indeed even the base ASCII) character
set, so by definition they *can* be included in text files.
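Java itself certainly has no such limitation; a quick sketch showing
that \u0000 survives a character-by-character Reader/Writer round trip:

import java.io.*;

public class NulRoundTrip {
    public static void main(String[] args) throws IOException {
        String s = "abc\u0000def"; // text containing a NUL character
        Reader rdr = new StringReader(s);
        StringWriter wtr = new StringWriter();
        int c;
        while ((c = rdr.read()) != -1) wtr.write(c);
        System.out.println(wtr.toString().equals(s)); // true
    }
}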
Nope; see above. If everything you've told me is accurate then it is
possible to write an OS/400 "text file" that encodes some information
that will be destroyed in a copy made by simply reading it character by
character through a java.io.Reader and outputting it character by
character, unaltered, through a java.io.Writer.
Incorrect assumption
No, it is not.
because you can't put non-printable characters in an OS/400 source file
member - the editor and other programs won't let you.
The programs that ship with the system won't? But then you can't properly
edit text files. No spaces? No tabs? No newlines? There goes making
anything long readable. "Missing separator. Stop." on all your
makefiles. Etc.
Unless of course you use a more normal editor. Say, mount the machine's
file system over the network and use vi. Or copy a file to it that was
written with normal editors. Or, of course, emit a file containing
newlines from Java with Writer.
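Emitting such characters from Java takes two lines; if the filesystem
really rejected them, the most basic I/O would break. (The file name
below is just an example.)

import java.io.*;

public class EmitControls {
    public static void main(String[] args) throws IOException {
        Writer wtr = new FileWriter("demo.txt"); // name is illustrative
        wtr.write("col1\tcol2\n");               // tab (0x09), LF (0x0A)
        wtr.write("line two\n");
        wtr.close();
    }
}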
The OS/400 is a database machine. There are no files that aren't
databases. Every file has defining metadata which is automatically
generated for standard file types, e.g. source files and compiled
binaries. The field types control what byte values can appear in every
field, so you might limit a text field to upper case. Violating these
rules generally causes an exception which, of course, can be caught and
acted on.
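As an analogy only -- plain Java, not the actual OS/400 API -- a field
type enforcing such a rule might look like this:

// Analogy in plain Java, not the OS/400 API: a field that only accepts
// upper-case content and throws when the rule is violated, so the
// caller can catch and act on it as described above.
class UpperCaseField {
    private final String value;
    UpperCaseField(String v) {
        if (!v.equals(v.toUpperCase()))
            throw new IllegalArgumentException("lower case not allowed");
        value = v;
    }
    public String toString() { return value; }
}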
If so, then you're indicating that the operating system *itself* will
throw an exception if you try to write a text file containing a newline
to the system.
So much for using it for text files, then. And so much for the claims
others made that you could read and write text files on such a system
normally over the network with vi and similar tools, store and compile C
sources, etc. So much also for the claim that you can use Java normally
on such a system -- an awful lot of Java programs will break if the
filesystem throws exceptions for writing perfectly normal things like
0x09 and 0x0A out to a text file. The OP's seek-back-to-start-of-last-line issue
is gonna be the *least* of his problems if he tries to run his Java code
on a wacky box like the one that you've just described!