gsub not working to replace a 'Chinese' Character.

by Ryan Smith on 2010-01-28T16:06:39+00:00
gsub is not working for me when I try to replace a 'DBCS' (double-byte
character set) character, using the latest Ruby 1.8.6.
When I run "str汉字end".gsub(/汉字/,"hanzi"),
the output is still: str汉字end, not strhanziend, which is what I want
to get.
I searched the web for two whole nights and found no clue.
Any help is much appreciated; I need to get this working very urgently.
Thank you!
--
Posted via http://www.ruby-forum.com/.

Re: gsub not working to replace a 'Chinese' Character.

by Richard Conroy on 2010-01-28T16:59:32+00:00.
On Thu, Jan 28, 2010 at 5:05 PM, Ryan Smith wrote:
> gsub not works for me when replace 'DBCS'(double byte character set)
> character, using last version ruby 1.8.6
>
> when "str汉字end".gsub(/汉字/,"hanzi"),
> output still is: str汉字end , but not strhanziend which I want to
> get.
>
> Searched web two whole night with no clue found.
>
> Anyone can help are much appreciated, need got it work very urgent.
> thank you!
>
Mixing encoding schemes is hell in almost any context, and Ruby is no
exception.
Until you have complete control in your program over all encoding inputs,
you are going to fail.
If your input is coming from the shell environment or standard input, the
text can be in the system encoding, regardless of what encoding you specify
in Ruby.
It is preferable to use Unicode (UTF-8) in any operation where you are
processing multilingual text. Failing that, there is the Iconv library,
which you can use to convert between encoding schemes.
Note that 'double-byte encoding scheme' is an utterly useless term for
practical encoding purposes. It's a gross simplification of what is going
on, and especially so with Han character sets. To do any practical work
with non-Unicode, multi-byte character sets, you have to know the encoding
scheme.
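
To illustrate why the encoding has to be known, here is a minimal sketch. It assumes Ruby 1.9 or later (where strings carry an encoding) and uses the four bytes from the original post as sample data. The same bytes are well-formed under one encoding and malformed under another:

```ruby
# encoding: utf-8
# Illustration only: four raw bytes with no encoding attached yet.
bytes = "\xBA\xBA\xD7\xD6".b

# Attach two different encodings to copies of the same bytes.
as_gb2312 = bytes.dup.force_encoding("GB2312")
as_utf8   = bytes.dup.force_encoding("UTF-8")

gb2312_valid = as_gb2312.valid_encoding?  # well-formed as GB2312
utf8_valid   = as_utf8.valid_encoding?    # not well-formed as UTF-8

puts "bytes: #{bytes.bytesize}, GB2312 characters: #{as_gb2312.length}"
puts "valid as GB2312? #{gb2312_valid}, valid as UTF-8? #{utf8_valid}"
```

Note that force_encoding only relabels the bytes; it does not convert anything.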
--
http://richardconroy.blogspot.com

Re: gsub not working to replace a 'Chinese' Character.

by Roger Pack on 2010-01-28T17:01:27+00:00.
Ryan Smith wrote:
> gsub not works for me when replace 'DBCS'(double byte character set)
> character, using last version ruby 1.8.6
Maybe try 1.9?
-r

Re: gsub not working to replace a 'Chinese' Character.

by Benoit Daloze on 2010-01-28T17:14:25+00:00.
With ruby 1.9.2dev (2010-01-14 trunk 26319) [x86_64-darwin10.2.0], the
result is:
"strhanziend"
It works fine. You need to set the encoding with "# encoding: utf-8" at
the top of the file.
In fact, 1.9.2 will complain if you don't.
Ruby 1.8.6 is kind of outdated, but I think it at least works with "-Ku".
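
A minimal sketch of the 1.9 case described above, assuming the source file itself is saved as UTF-8 (the variable names are only for illustration):

```ruby
# encoding: utf-8
# The magic comment above tells Ruby 1.9+ how to read the literals below.
text = "str汉字end"

# Both the string and the regex literal are UTF-8, so the match succeeds.
replaced = text.gsub(/汉字/, "hanzi")
puts replaced
```

The important part is that the string and the pattern are in the same encoding. If the input arrives in another encoding, it has to be converted first.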

Re: gsub not working to replace a 'Chinese' Character.

by Richard Conroy on 2010-01-28T17:18:22+00:00.
On Thu, Jan 28, 2010 at 5:58 PM, Roger Pack wrote:
> Ryan Smith wrote:
> > gsub not works for me when replace 'DBCS'(double byte character set)
> > character, using last version ruby 1.8.6
>
> Maybe try 1.9?
> -r
It's an option, but a better strategy is to outline what you are trying to
achieve.
You should state where the Chinese text is coming from (database, text file,
binary file (e.g. Excel), shell, UI, stdin), what encoding it is in (or what
encoding you think it is in), and share the source of your Ruby code (just
the relevant bits, like your KCODE attribute, Iconv usage, etc.).

Re: gsub not working to replace a 'Chinese' Character.

by Ryan Smith on 2010-01-28T17:20:41+00:00.
I am parsing a web page that is encoded in gb2312, using Watir to get the
content of the page title, and I want to replace the Chinese characters
in the title with English words.
When I puts the title that Watir returns, the Chinese characters are
displayed as corrupt text (under Windows cmd with code page cp936; they
display normally when I change the code page to utf-8). But I think cmd's
code page is just a display setting and is not related to what I need
(replacing the Chinese characters). I do not know whether the string I get
from Watir is also in 'gb2312' encoding or something else. The fact is that
converting the string to utf-8 fails, with a message complaining that a
character is invalid.
I have totally no idea what I need to do.

Re: gsub not working to replace a 'Chinese' Character.

by Marnen Laibow-Koser on 2010-01-28T17:22:24+00:00.
Benoit Daloze wrote:
> With ruby 1.9.2dev (2010-01-14 trunk 26319) [x86_64-darwin10.2.0]
> "strhanziend"
>
> It works fine. You need to set encoding with "# encoding: utf-8" at
> the top of the file.
> In fact, it will complain if not in 1.9.2
>
> Ruby 1.8.6 is kind of outdated, but at least I think it works with
> "-Ku".
Ruby 1.8 isn't outdated. It just doesn't handle multibyte text that
well.
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org

Re: gsub not working to replace a 'Chinese' Character.

by Dido Sevilla on 2010-01-28T17:29:10+00:00.
On Thu, Jan 28, 2010 at 10:05 AM, Ryan Smith wrote:
> gsub not works for me when replace 'DBCS'(double byte character set)
> character, using last version ruby 1.8.6
>
The term 'DBCS' is not sufficient to determine what the encoding of
your input character set is. That could mean any of a large number of
mutually incompatible character encodings: UTF-16, Big5, Shift-JIS
(for Japanese), and EUC-CN are only a few of the possibilities. Until
you know its encoding, any text you get from anywhere is just a stream
of bytes with no meaning. You can't perform any kind of useful text
processing without knowing the encoding.
Once you do know the encoding, you should probably convert it to UTF-8
using the Iconv library, and you should probably be able to do
something useful with the text.
Also, make sure that $KCODE is 'u'. Otherwise Unicode text processing
will not work.
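
A rough sketch of the convert-then-process approach. On Ruby 1.8 this is a job for Iconv, as described above. The code below instead assumes Ruby 1.9 or later, where String#force_encoding and String#encode cover the same ground. The helper name to_utf8 is hypothetical, and the bytes are the ones from the original post, assumed to be GB2312:

```ruby
# encoding: utf-8
# Hypothetical helper: label raw bytes with their real encoding, then
# transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

raw  = "str\xBA\xBA\xD7\xD6end".b   # bytes from the original post, assumed GB2312
utf8 = to_utf8(raw, "GB2312")

# Once the text is UTF-8, a UTF-8 regex literal matches it.
replaced = utf8.gsub(/汉字/, "hanzi")
puts replaced
```

On 1.9 and later, $KCODE no longer has any effect; the encoding travels with each string instead.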
--
普通じゃないのが当然なら答える私は何ができる？
普通でも普通じゃなくて感じるまま感じることだけをするよ！
http://stormwyrm.blogspot.com

Re: gsub not working to replace a 'Chinese' Character.

by Dido Sevilla on 2010-01-28T17:37:39+00:00.
On Thu, Jan 28, 2010 at 11:13 AM, Ryan Smith wrote:
> I parse a webpage which encoded in gb2312, using Watir to get the
> context of the page title, and want to replace the 'chinese character'
> in title with english words.
>
> When puts title which watir get, the chinese character displaied as
> corrupt code there (under windows cmd,code page using cp936, display
> works normal when change code page to utf-8). But I think cmd's code
> page just display setting does not related with what I need (replace
> chinese char). I did not know if string I get by Watir is also in
> 'gb2312' encoding or something others, the fact is fail happen when
> convert the string to utf-8 encoding, message is complain the char is
> invalid.
GB2312? From a web page that probably means EUC-CN, so you have to
convert from that encoding into UTF-8.
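
One possible explanation for the "char is invalid" message, offered only as an assumption: the string may not actually be in the encoding the conversion was told to expect. A small sketch (Ruby 1.9 or later, hypothetical helper name try_convert) shows a conversion succeeding when the declared source encoding is right and failing when the bytes are really UTF-8 already:

```ruby
# encoding: utf-8
# Hypothetical helper: try a conversion and return the error instead of raising.
def try_convert(raw_bytes, assumed_encoding)
  raw_bytes.dup.force_encoding(assumed_encoding).encode("UTF-8")
rescue EncodingError => e
  e
end

gb2312_bytes = "\xBA\xBA\xD7\xD6".b  # "汉字" encoded as GB2312
utf8_bytes   = "汉字".b               # the same text already encoded as UTF-8

good = try_convert(gb2312_bytes, "GB2312")  # the assumption matches the bytes
bad  = try_convert(utf8_bytes, "GB2312")    # the assumption is wrong

puts good
puts bad.class
```

If the second case is what happens with the Watir title, the string needs no conversion at all, only a pattern in the same encoding.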

Re: gsub not working to replace a 'Chinese' Character.

by Richard Conroy on 2010-01-28T18:49:36+00:00.
On Thu, Jan 28, 2010 at 6:13 PM, Ryan Smith wrote:
> I parse a webpage which encoded in gb2312, using Watir to get the
> context of the page title, and want to replace the 'chinese character'
> in title with english words.
>
> When puts title which watir get, the chinese character displaied as
> corrupt code there (under windows cmd,code page using cp936, display
> works normal when change code page to utf-8). But I think cmd's code
> page just display setting does not related with what I need (replace
> chinese char). I did not know if string I get by Watir is also in
> 'gb2312' encoding or something others, the fact is fail happen when
> convert the string to utf-8 encoding, message is complain the char is
> invalid.
>
> totally no idea what need to do.
>
>
I had a sneaking suspicion that your task was Watir related.
First off, the solution to this is hard. I have done some Watir tasks that
involved processing international text (even in UTF-8) and there are some
nasty gotchas.
It is not Watir's fault or anything, but there are two areas which really
annoy:
Calling puts on international text in the Windows cmd shell is almost
useless as a way of debugging the problem. The Windows cmd shell takes
perverse satisfaction in ignoring any encoding you might have set your Ruby
code to work in. The platform codepage is what Windows does all its work in.
Watir itself is implemented on top of Win32OLE, which is yet another area
where the platform encoding can interfere against your wishes.
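
Since puts output depends on the console code page, one way to sidestep it is to print the bytes themselves. This is only a sketch with a hypothetical helper name (hex_dump). It relies on String#unpack, so it is independent of how the console renders characters:

```ruby
# encoding: utf-8
# Hypothetical helper: show the bytes of a string as hex, independent of
# whatever code page the console uses for display.
def hex_dump(str)
  str.unpack("C*").map { |byte| format("%02X", byte) }.join(" ")
end

utf8_title   = "汉字"                  # UTF-8, because of the magic comment
gb2312_title = "\xBA\xBA\xD7\xD6".b   # the same two characters as GB2312 bytes

puts hex_dump(utf8_title)    # E6 B1 89 E5 AD 97
puts hex_dump(gb2312_title)  # BA BA D7 D6
```

Comparing the dump of the string returned by Watir with dumps like these tells you which encoding you actually received.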
You might have better luck posting these questions on the Watir mailing
list. It's been a while for me, but useful workarounds to common and
uncommon gotchas like this are discussed and answered there.
Also check the Watir wiki http://wiki.openqa.org/display/WTR/Project+Home
(oddly, it's currently down for me).
Testing of encoded text did come up a lot when I last checked it, and they
heavily document their workarounds.
Lastly, check out the JRuby equivalent to Watir: Celerity. It is API
compatible with Watir, so your script will probably still work with only
minor modifications. It runs over a Java HTTP library, so you have good
encoding processing if you need it, and more importantly there are fewer
entry points for the Windows platform encoding to impose itself. Note:
Celerity has no visual component. FireWatir (FF) or ChromeWatir may also be
useful to you for similar reasons.

Re: gsub not working to replace a 'Chinese' Character.

by Ryan Smith on 2010-01-29T01:54:48+00:00.
Here is my code:
============File Main.rb ================
# -*- coding: utf-8 -*-
$KCODE = "U"
$LOAD_PATH en
"0"
"鐧惧害涓€涓嬶紝浣犲氨鐭ラ亾"
"Baidu about, you know"
true
"鐧惧害涓€涓嬶紝浣犲氨鐭ラ亾"
==== THANKS TO EVERYBODY WHO CARED ABOUT THIS POST ====
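
The garbled title in the output above looks like UTF-8 bytes rendered under code page 936. That is an interpretation, not something the post states. The sketch below assumes Ruby 1.9 or later and assumes the real title starts with 百度. It reproduces that kind of garbling and then reverses it:

```ruby
# encoding: utf-8
original = "百度"   # assumed start of the real page title

# Reinterpret the UTF-8 bytes as GBK, roughly what a cp936 console does.
misread = original.b.force_encoding("GBK").encode("UTF-8")

# Undo it: back to GBK bytes, then label them as the UTF-8 they always were.
restored = misread.encode("GBK").force_encoding("UTF-8")

puts misread
puts restored == original
```

If this reading is right, the string held by the script was valid UTF-8 all along and only its display was wrong.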

Re: gsub not working to replace a 'Chinese' Character.

by Ryan Smith on 2010-01-29T04:20:21+00:00.
I found that my mistake was using an incorrect regex match. It works after
changing it to
s=s.gsub(/#{cnarray[i]}/, x.to_s)
Thanks, everyone!
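
For anyone landing here later, a sketch of what such a replacement loop could look like. The thread does not show the real arrays, so the contents of cnarray and enarray and the sample title are hypothetical. Regexp.escape is an addition, not part of the fix above:

```ruby
# encoding: utf-8
# Hypothetical data: the thread does not show the real arrays.
cnarray = ["百度", "知道"]
enarray = ["Baidu", "know"]

s = "百度一下，你就知道"
cnarray.each_index do |i|
  # Interpolating the search text builds the regex at run time;
  # Regexp.escape guards against regex metacharacters in that text.
  s = s.gsub(/#{Regexp.escape(cnarray[i])}/, enarray[i].to_s)
end
puts s
```

When the search text is a plain string, s.gsub(cnarray[i], enarray[i]) does the same thing without building a regex.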