Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.
--
http://mail.python.org/mailman/listinfo/python-list
* C. Benson Manica:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.
However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative values), tells you
that it can't be ascii, hence, must be utf-8. And since utf-8 is an extension of
ascii nothing is lost by assuming ascii in the other case. So, problem solved.
If the string represents the contents of a file then you may also look for a
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found then it almost surely indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
Cheers & hth.,
- Alf
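The two checks described above (any byte value above 127, and the three-byte UTF-8 BOM) could be sketched roughly as follows. This is only an illustration, written for Python 3 `bytes` objects rather than the Python 2.4.3 of the original question; the function names are made up for this example, and the guess is only meaningful under the stated assumption that the data is EITHER ascii or utf-8.

```python
import codecs

def has_utf8_bom(data):
    """Return True if the byte string starts with the 3-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

def is_pure_ascii(data):
    """Return True if every byte value is below 128."""
    return all(b < 128 for b in data)

def guess_ascii_or_utf8(data):
    """Guess the encoding, assuming the input is known to be ASCII or UTF-8."""
    if has_utf8_bom(data) or not is_pure_ascii(data):
        return "utf-8"
    return "ascii"
```

Checking the BOM first is cheap, but a missing BOM proves nothing, so the byte scan is still needed in that case.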
On 09/03/2010 16:54, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.
Obviously, you can test whether all the bytes are less than 128 which
suggests that the text is legal ASCII. But then it's also legal UTF8.
Or you can just attempt to decode and catch the exception:
try:
    unicode(text, "ascii")
except UnicodeDecodeError:
    print "Not ASCII"
TJG
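The decode-and-catch idea extends naturally to trying several encodings in order. The sketch below is written for Python 3 `bytes` (an assumption; the snippet above is Python 2), and `classify` is a hypothetical helper name. Note that a result of "ascii" also means the data is valid UTF-8, as pointed out above.

```python
def classify(data):
    """Classify a byte string as 'ascii', 'utf-8', or 'unknown'."""
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue               # try the next candidate encoding
        return encoding
    return "unknown"
```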
On 09-03-2010 18:02, Alf P. Steinbach wrote:
> * C. Benson Manica:
>> Hours of Googling has not helped me resolve a seemingly simple
>> question - Given a string s, how can I tell whether it's ascii (and
>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>> This is python 2.4.3, so I don't have getsizeof available to me.
>
> Generally, if you need 100% certainty then you can't tell the encoding
> from a sequence of byte values.
>
> However, if you know that it's EITHER ascii or utf-8 then the presence
> of any value above 127 (or, for signed byte values, any negative
> values), tells you that it can't be ascii,
AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.
cheers,
Stef
> hence, must be utf-8. And since utf-8 is an extension of ascii nothing
> is lost by assuming ascii in the other case. So, problem solved.
>
> If the string represents the contents of a file then you may also look
> for a UTF-8 representation of the Unicode BOM (Byte Order Mark) at the
> beginning. If found then it indicates utf-8 for almost-sure and more
> expensive searching can be avoided. It's just three bytes to check.
>
>
> Cheers & hth.,
>
> - Alf
On Mar 9, 12:07 pm, Tim Golden wrote:
> You can't. You can apply one or more heuristics, depending on exactly
> what your requirement is. But any valid ASCII text is also valid
> UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
> number of bytes per char.
Hm, well that's very unfortunate. I'm using a database library which
seems to assume that all strings passed to it are ASCII, and I'm
attempting to use it on two different systems - one where all strings
are ASCII, and one where they seem to be UTF-8. The strings come from
the same place, i.e. they're exclusively normal ASCII characters.
What I would want is to check once whether the strings passed to
function foo() are ASCII or UTF-8, and if they are UTF-8, to assume that
all strings need to be decoded. So that's not possible?
"C. Benson Manica" wrote in message
news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>The strings come from the same place, i.e. they're exclusively
> normal ASCII characters.
In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.
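A quick way to see the no-op is to encode the same ASCII-only text both ways and compare the bytes. This is a small Python 3 sketch, and the sample string is made up.

```python
text = "plain ASCII text"  # hypothetical sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes      # identical byte sequences
same_length = len(utf8_bytes) == len(text)  # one byte per character
```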
On Mar 9, 12:24 pm, "Richard Brodie" wrote:
> "C. Benson Manica" wrote in message
> news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>
> >The strings come from the same place, i.e. they're exclusively
> > normal ASCII characters.
>
> In this case then converting them to/from UTF-8 is a no-op, so
> it makes no difference at all.
Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...
On 2010-03-09 11:12 AM, Stef Mientki wrote:
> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>> * C. Benson Manica:
>>> Hours of Googling has not helped me resolve a seemingly simple
>>> question - Given a string s, how can I tell whether it's ascii (and
>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>
>> Generally, if you need 100% certainty then you can't tell the encoding
>> from a sequence of byte values.
>>
>> However, if you know that it's EITHER ascii or utf-8 then the presence
>> of any value above 127 (or, for signed byte values, any negative
>> values), tells you that it can't be ascii,
> AFAIK it's completely impossible.
> UTF-8 characters have 1 to 4 bytes / character.
> I can create ASCII strings containing byte values between 127 and 255.
No, you can't. ASCII strings only have characters in the range 0..127. You could
create Latin-1 (or any number of the 8-bit encodings out there) strings with
characters 0..255, yes, but not ASCII.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
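The difference between ASCII and an 8-bit encoding such as Latin-1 shows up as soon as a byte above 127 appears. This is a minimal Python 3 sketch with a made-up four-byte sample.

```python
data = bytes([0x63, 0x61, 0x66, 0xE9])  # "caf" followed by the byte 0xE9

latin1_text = data.decode("latin-1")    # every byte 0..255 maps to a character

try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False                    # 0xE9 is outside the 0..127 ASCII range
```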
On 3/9/2010 11:54 AM, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
UTF-8 is an encoding that uses 1 to 4 bytes per character.
So it is not clear what you are asking. Alf answered one of the possible
questions.
> This is python 2.4.3, so I don't have getsizeof available to me.
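To make the "1 to 4 bytes" point concrete, here is a small Python 3 sketch that measures the encoded length of a few single characters. The sample characters are arbitrary choices for illustration.

```python
# A, e-acute, euro sign, and an emoji outside the Basic Multilingual Plane
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each single character occupies once encoded as UTF-8
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Only the first of these is ASCII, which is why "two bytes per character" does not describe UTF-8 in general.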
Op 2010-03-09 18:31, C. Benson Manica schreef:
> On Mar 9, 12:24 pm, "Richard Brodie" wrote:
>> "C. Benson Manica" wrote in message news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>>
>>> The strings come from the same place, i.e. they're exclusively
>>> normal ASCII characters.
>>
>> In this case then converting them to/from UTF-8 is a no-op, so
>> it makes no difference at all.
>
> Except to the database library, which seems perfectly happy to send an
> 8-character UTF-8 string to the database as 16 raw characters...
In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or 4 bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 to 4 bytes per character.
If your texts are in a Western language, the second byte will be zero in
most characters; you could check for that (but note that the second byte
might be the first one in the byte stream, depending on the byte ordering).
HTH,
Roel
--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov
Roel Schroeven
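The "8 characters sent as 16 bytes" symptom and the zero-byte check suggested above can be illustrated like this. It is a Python 3 sketch only; the sample string and the helper name `looks_like_utf16_ascii` are made up, and the heuristic only makes sense for text whose characters all fall in the ASCII range.

```python
text = "password"                       # 8 ASCII characters (hypothetical sample)

utf8_bytes = text.encode("utf-8")       # 8 bytes
utf16_bytes = text.encode("utf-16-le")  # 16 bytes, little-endian, no BOM

def looks_like_utf16_ascii(data):
    """Heuristic: even length and every other byte is zero (either byte order)."""
    if len(data) == 0 or len(data) % 2:
        return False
    # Odd positions are zero for little-endian, even positions for big-endian.
    return not any(data[1::2]) or not any(data[0::2])
```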
On 09-03-2010 18:36, Robert Kern wrote:
> On 2010-03-09 11:12 AM, Stef Mientki wrote:
>> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>>> * C. Benson Manica:
>>>> Hours of Googling has not helped me resolve a seemingly simple
>>>> question - Given a string s, how can I tell whether it's ascii (and
>>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>>
>>> Generally, if you need 100% certainty then you can't tell the encoding
>>> from a sequence of byte values.
>>>
>>> However, if you know that it's EITHER ascii or utf-8 then the presence
>>> of any value above 127 (or, for signed byte values, any negative
>>> values), tells you that it can't be ascii,
>> AFAIK it's completely impossible.
>> UTF-8 characters have 1 to 4 bytes / character.
>> I can create ASCII strings containing byte values between 127 and 255.
>
> No, you can't. ASCII strings only have characters in the range 0..127.
> You could create Latin-1 (or any number of the 8-bit encodings out
> there) strings with characters 0..255, yes, but not ASCII.
>
Probably, and according to wikipedia you're right.
I think I've to get rid of my old books,
Borland turbo Pascal 4 (1987) has an ASCII table of 256 characters,
while the small letters say 7-bit ;-)
cheers,
Stef
On 3/9/2010 1:36 PM Stef Mientki said...
> On 09-03-2010 18:36, Robert Kern wrote:
>> No, you can't. ASCII strings only have characters in the range 0..127.
>> You could create Latin-1 (or any number of the 8-bit encodings out
>> there) strings with characters 0..255, yes, but not ASCII.
>>
> Probably, and according to wikipedia you're right.
I too looked at wikipedia, and it seems historically incomplete to me.
In particular, I looked for 'high order ascii', which, when I was
working with Basic Four in the '70's, is what they used. Essentially,
the high order bit was set for all characters to make 8A a line feed,
etc. Still the same 0..127 characters, but not really an extended ascii
which is where wikipedia forwards you to.
I remember having to strap the eighth bit high when I reused the older
line printers to get them to work.
Emile
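The "high order" convention described above amounts to setting bit 7 on each 7-bit ASCII code, so 0x0A (line feed) becomes 0x8A. Here is a small Python 3 sketch of the round trip; the helper names are invented for this example.

```python
def set_high_bit(data):
    """Set bit 7 of every byte, as in the 'high order' convention."""
    return bytes(b | 0x80 for b in data)

def strip_high_bit(data):
    """Clear bit 7 of every byte, mapping 0x80..0xFF back onto 0x00..0x7F."""
    return bytes(b & 0x7F for b in data)
```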