String is ASCII or UTF-8?

by C. Benson Manica on 2010-03-09T15:59:20+00:00
Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.
--
http://mail.python.org/mailman/listinfo/python-list

Re: String is ASCII or UTF-8?

by Alf P. Steinbach on 2010-03-09T16:05:28+00:00
* C. Benson Manica:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.
However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative values), tells you
that it can't be ascii, hence, must be utf-8. And since utf-8 is an extension of
ascii nothing is lost by assuming ascii in the other case. So, problem solved.
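A minimal sketch of that byte-value test, written for Python 3 (where bytes plays the role of the Python 2 str); looks_ascii is a hypothetical helper name, not something from a library:

```python
def looks_ascii(data: bytes) -> bool:
    """Return True when every byte is in 0..127, i.e. the data is plain ASCII."""
    # Iterating over a bytes object yields integers in Python 3.
    return all(b < 128 for b in data)


plain = b"hello"
accented = "caf\u00e9".encode("utf-8")  # the e-acute becomes the bytes 0xC3 0xA9

print(looks_ascii(plain))
print(looks_ascii(accented))
```

This only separates "pure ASCII" from "something else"; it says nothing about whether the something else really is UTF-8.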
If the string represents the contents of a file then you may also look for a
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found, it almost surely indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
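The three-byte check could look roughly like this in Python 3; starts_with_utf8_bom is a hypothetical helper name, while codecs.BOM_UTF8 is the standard library constant for the bytes EF BB BF:

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True when data begins with the three-byte UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


with_bom = codecs.BOM_UTF8 + b"some text"
without_bom = b"some text"

print(starts_with_utf8_bom(with_bom))
print(starts_with_utf8_bom(without_bom))
```

Many UTF-8 files carry no BOM at all, so a missing BOM proves nothing either way.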
Cheers & hth.,
- Alf

Re: String is ASCII or UTF-8?

by Tim Golden on 2010-03-09T16:08:26+00:00
On 09/03/2010 16:54, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.
Obviously, you can test whether all the bytes are less than 128 which
suggests that the text is legal ASCII. But then it's also legal UTF8.
Or you can just attempt to decode and catch the exception:
try:
    # Python 2: raises UnicodeDecodeError if any byte is above 127
    unicode(text, "ascii")
except UnicodeDecodeError:
    print "Not ASCII"
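The same decode-and-catch idea can be extended to try UTF-8 as well. This is only a sketch for Python 3 (bytes instead of the Python 2 str), and classify is a hypothetical helper name:

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass  # at least one byte is above 127, so try UTF-8 next
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"  # some other 8-bit encoding, or binary data


print(classify(b"abc"))
print(classify("caf\u00e9".encode("utf-8")))
print(classify(b"caf\xe9"))  # a Latin-1 byte sequence that is not valid UTF-8
```

Anything reported as 'ascii' is also valid UTF-8, so the first branch is a refinement rather than a contradiction of the second.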
TJG

Re: String is ASCII or UTF-8?

by Stef Mientki on 2010-03-09T16:13:16+00:00
On 09-03-2010 18:02, Alf P. Steinbach wrote:
> * C. Benson Manica:
>> Hours of Googling has not helped me resolve a seemingly simple
>> question - Given a string s, how can I tell whether it's ascii (and
>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>> This is python 2.4.3, so I don't have getsizeof available to me.
>
> Generally, if you need 100% certainty then you can't tell the encoding
> from a sequence of byte values.
>
> However, if you know that it's EITHER ascii or utf-8 then the presence
> of any value above 127 (or, for signed byte values, any negative
> values), tells you that it can't be ascii,
AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.
cheers,
Stef
> hence, must be utf-8. And since utf-8 is an extension of ascii nothing
> is lost by assuming ascii in the other case. So, problem solved.
>
> If the string represents the contents of a file then you may also look
> for a UTF-8 representation of the Unicode BOM (Byte Order Mark) at the
> beginning. If found then it indicates utf-8 for almost-sure and more
> expensive searching can be avoided. It's just three bytes to check.
>
>
> Cheers & hth.,
>
> - Alf

Re: String is ASCII or UTF-8?

by C. Benson Manica on 2010-03-09T16:54:58+00:00
On Mar 9, 12:07 pm, Tim Golden wrote:
> You can't. You can apply one or more heuristics, depending on exactly
> what your requirement is. But any valid ASCII text is also valid
> UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
> number of bytes per char.
Hm, well that's very unfortunate. I'm using a database library which
seems to assume that all strings passed to it are ASCII, and I'm
attempting to use it on two different systems - one where all strings
are ASCII, and one where they seem to be UTF-8. The strings come from
the same place, i.e. they're exclusively normal ASCII characters.
What I would want is to check once whether the strings passed to
function foo() are ASCII or UTF-8, and if they are UTF-8, to assume that
all strings need to be decoded. So that's not possible?

Re: String is ASCII or UTF-8?

by Richard Brodie on 2010-03-09T17:05:00+00:00

"C. Benson Manica" wrote in message
news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>The strings come from the same place, i.e. they're exclusively
> normal ASCII characters.
In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.
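A quick way to convince yourself of the no-op, sketched in Python 3 (the sample text is just an illustration):

```python
text = "plain ASCII text"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For text that contains only ASCII characters, both encodings
# produce exactly the same bytes, one byte per character.
same = ascii_bytes == utf8_bytes
print(same, len(text), len(utf8_bytes))
```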

Re: String is ASCII or UTF-8?

by C. Benson Manica on 2010-03-09T17:10:17+00:00
On Mar 9, 12:24 pm, "Richard Brodie" wrote:
> "C. Benson Manica" wrote in message
> news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>
> >The strings come from the same place, i.e. they're exclusively
> > normal ASCII characters.
>
> In this case then converting them to/from UTF-8 is a no-op, so
> it makes no difference at all.
Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...

Re: String is ASCII or UTF-8?

by Robert Kern on 2010-03-09T17:18:43+00:00
On 2010-03-09 11:12 AM, Stef Mientki wrote:
> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>> * C. Benson Manica:
>>> Hours of Googling has not helped me resolve a seemingly simple
>>> question - Given a string s, how can I tell whether it's ascii (and
>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>
>> Generally, if you need 100% certainty then you can't tell the encoding
>> from a sequence of byte values.
>>
>> However, if you know that it's EITHER ascii or utf-8 then the presence
>> of any value above 127 (or, for signed byte values, any negative
>> values), tells you that it can't be ascii,
> AFAIK it's completely impossible.
> UTF-8 characters have 1 to 4 bytes / character.
> I can create ASCII strings containing byte values between 127 and 255.
No, you can't. ASCII strings only have characters in the range 0..127. You could
create Latin-1 (or any number of the 8-bit encodings out there) strings with
characters 0..255, yes, but not ASCII.
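To illustrate the difference, here is a small Python 3 sketch; decodes_as is a hypothetical helper name, and the byte 0xE9 is the Latin-1 code for e-acute:

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True when data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


raw = b"caf\xe9"  # 0xE9 is above 127, so it lies outside the ASCII range

print(decodes_as(raw, "ascii"))    # the ascii codec rejects bytes above 127
print(decodes_as(raw, "latin-1"))  # Latin-1 assigns a character to every byte
```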
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Re: String is ASCII or UTF-8?

by Terry Reedy on 2010-03-09T17:24:01+00:00
On 3/9/2010 11:54 AM, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
Utf-8 is an encoding that uses 1 to 4 bytes per character.
So it is not clear what you are asking. Alf answered one of the possible
questions.
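To make the 1-to-4-byte range concrete, a small Python 3 sketch with one sample character for each length (the sample characters are my choice):

```python
# One character from each UTF-8 length class:
# U+0041 LATIN CAPITAL LETTER A, U+00E9 e-acute,
# U+20AC EURO SIGN, U+1F600 (an emoji outside the Basic Multilingual Plane).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```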
> This is python 2.4.3, so I don't have getsizeof available to me.

Re: String is ASCII or UTF-8?

by Roel Schroeven on 2010-03-09T17:27:03+00:00
Op 2010-03-09 18:31, C. Benson Manica schreef:
> On Mar 9, 12:24 pm, "Richard Brodie" wrote:
>> "C. Benson Manica" wrote in message news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>>
>>> The strings come from the same place, i.e. they're exclusively
>>> normal ASCII characters.
>>
>> In this case then converting them to/from UTF-8 is a no-op, so
>> it makes no difference at all.
>
> Except to the database library, which seems perfectly happy to send an
> 8-character UTF-8 string to the database as 16 raw characters...
In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or more bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 or more bytes per character.
If your texts are in a Western language, the second byte will be zero in
most characters; you could check for that (but note that the second byte
might be the first one in the byte stream, depending on the byte ordering).
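A rough Python 3 sketch of both points: the 8-versus-16 byte count and the zero-byte check. The name looks_like_utf16_ascii is hypothetical, and the check is only a heuristic for text made of ASCII-range characters:

```python
def looks_like_utf16_ascii(data: bytes) -> bool:
    """Heuristic: even length and every other byte is zero, in either byte order."""
    if len(data) == 0 or len(data) % 2:
        return False
    # Little-endian puts the zero byte second, big-endian puts it first.
    return not any(data[1::2]) or not any(data[0::2])


text = "database"                 # 8 ASCII characters
utf8 = text.encode("utf-8")       # 8 bytes
utf16 = text.encode("utf-16-le")  # 16 bytes, little-endian, no BOM

print(len(utf8), len(utf16))
print(looks_like_utf16_ascii(utf8), looks_like_utf16_ascii(utf16))
```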
HTH,
Roel
--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov
Roel Schroeven

Re: String is ASCII or UTF-8?

by Stef Mientki on 2010-03-09T20:36:29+00:00
On 09-03-2010 18:36, Robert Kern wrote:
> On 2010-03-09 11:12 AM, Stef Mientki wrote:
>> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>>> * C. Benson Manica:
>>>> Hours of Googling has not helped me resolve a seemingly simple
>>>> question - Given a string s, how can I tell whether it's ascii (and
>>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>>
>>> Generally, if you need 100% certainty then you can't tell the encoding
>>> from a sequence of byte values.
>>>
>>> However, if you know that it's EITHER ascii or utf-8 then the presence
>>> of any value above 127 (or, for signed byte values, any negative
>>> values), tells you that it can't be ascii,
>> AFAIK it's completely impossible.
>> UTF-8 characters have 1 to 4 bytes / character.
>> I can create ASCII strings containing byte values between 127 and 255.
>
> No, you can't. ASCII strings only have characters in the range 0..127.
> You could create Latin-1 (or any number of the 8-bit encodings out
> there) strings with characters 0..255, yes, but not ASCII.
>
Probably, and according to Wikipedia you're right.
I think I have to get rid of my old books:
Borland Turbo Pascal 4 (1987) has an ASCII table of 256 characters,
while the small print says 7-bit ;-)
cheers,
Stef

Re: String is ASCII or UTF-8?

by Emile van Sebille on 2010-03-09T21:20:35+00:00
On 3/9/2010 1:36 PM Stef Mientki said...
> On 09-03-2010 18:36, Robert Kern wrote:

>> No, you can't. ASCII strings only have characters in the range 0..127.
>> You could create Latin-1 (or any number of the 8-bit encodings out
>> there) strings with characters 0..255, yes, but not ASCII.
>>
> Probably, and according to wikipedia you're right.
I too looked at Wikipedia, and it seems historically incomplete to me.
In particular, I looked for 'high order ascii', which is what Basic Four
used when I was working with it in the '70s. Essentially, the high-order
bit was set for all characters, making 8A a line feed, etc. Still the
same 0..127 characters, but not really an extended ASCII, which is where
Wikipedia forwards you.
I remember having to strap the eighth bit high when I reused the older
line printers to get them to work.
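In today's terms, that convention amounts to setting or clearing bit 7 of each byte. A small Python 3 sketch, with hypothetical helper names, just to show the arithmetic:

```python
def set_high_bit(data: bytes) -> bytes:
    """Turn plain 7-bit ASCII into the 'high order' form (bit 7 set)."""
    return bytes(b | 0x80 for b in data)


def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, recovering plain 7-bit ASCII."""
    return bytes(b & 0x7F for b in data)


line_feed = strip_high_bit(b"\x8a")  # 0x8A with bit 7 cleared is 0x0A
print(line_feed == b"\n")
print(set_high_bit(b"A") == b"\xc1")  # 0x41 | 0x80 == 0xC1
```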
Emile