Re: New utf8string design may make UTF-8 the superior encoding

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++,microsoft.public.vc.mfc
Date: Wed, 19 May 2010 03:39:00 -0700 (PDT)
Message-ID: <a2546bc9-43aa-4aa0-b3da-1dc3fae27b7b@m21g2000vbr.googlegroups.com>

On May 18, 8:17 pm, Peter Olcott <NoS...@OCR4Screen.com> wrote:

On 5/18/2010 9:34 AM, James Kanze wrote:

On 17 May, 14:08, Peter Olcott<NoS...@OCR4Screen.com> wrote:

On 5/17/2010 1:35 AM, Mihai N. wrote:

a regular expression implemented as a finite state machine
is the fastest and simplest possible way of every way that
can possibly exist to validate a UTF-8 sequence and divide
it into its constituent parts.

It all depends on the formal specification; one of the
characteristics of UTF-8 is that you don't have to look at
every character to find the length of a sequence. And
a regular expression generally will have to look at every
character.
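
To illustrate the point about sequence length, here is a minimal sketch of reading the length from the lead byte alone. The names utf8SequenceLength and countCodePoints are hypothetical and were not posted in this thread, and the counting loop assumes the text has already been validated elsewhere, since it never reads the continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence. (Hypothetical helper for illustration.)
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;   // 0xxxxxxx: ASCII
    if (lead < 0xC2) return 0;   // continuation byte, or overlong lead C0/C1
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF5) return 4;   // 11110xxx, limited to U+10FFFF
    return 0;                    // F5..FF never occur in UTF-8
}

// Counts code points by hopping from lead byte to lead byte. The
// continuation bytes are skipped without being read, so this assumes
// well-formed input. Returns std::string::npos on a bad lead byte or a
// sequence that runs past the end of the string.
inline std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8SequenceLength(static_cast<unsigned char>(s[i]));
        if (n == 0 || n > s.size() - i) return std::string::npos;
        i += n;
        ++count;
    }
    return count;
}
```

A recognizer that has to classify every byte cannot skip ahead this way; whether the skipping actually pays off depends on the data and the machine.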

Validation and translation to UTF-32 concurrently cannot be
done faster than a DFA recognizer; in fact, it must always be
slower.

UTF-8 was intentionally designed so that it doesn't require
a complete DFA to handle; it can be handled faster without one.
Complete DFAs are usually slower than calculations on modern
processors, since they require memory accesses, and memory is
often the limiting factor.
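
As a rough sketch of what "calculations" could look like, the function below validates one sequence and translates it to UTF-32 using only masks, shifts and comparisons, with no transition table in memory. The name decodeUtf8 and its signature are assumptions made for this illustration; it is not code from this thread, and it is not claimed to be faster than any particular DFA without measuring.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one code point from [p, end). On success stores it in `out`
// and returns the number of bytes consumed; returns 0 on malformed input.
// (Hypothetical function for illustration.)
inline std::size_t decodeUtf8(const unsigned char* p,
                              const unsigned char* end,
                              std::uint32_t& out)
{
    if (p == end) return 0;
    unsigned char b0 = p[0];
    if (b0 < 0x80) { out = b0; return 1; }   // ASCII fast path

    std::size_t len;     // total bytes in the sequence
    std::uint32_t cp;    // payload bits accumulated so far
    std::uint32_t min;   // smallest value this length may encode
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;       // stray continuation byte or invalid lead

    if (static_cast<std::size_t>(end - p) < len) return 0;  // truncated

    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    if (cp < min) return 0;                       // overlong encoding
    if (cp > 0x10FFFF) return 0;                  // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // surrogate
    out = cp;
    return len;
}
```

The trade-off is a handful of branches and ALU operations per sequence against one table lookup per byte; which wins depends on the input mix, the cache behavior and the processor.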

In fact, there is no "must always be slower". There are too
many variables involved to be able to make such statements.

--
James Kanze