Understanding the Japanese problem with the Universal Character Set, and its implications for geomatics and for Latin 9

Typographie list archive
(Alain LaBonté) - Wednesday, 18 March 1998

Subject:	Understanding the Japanese problem with the Universal Character Set, and its implications for geomatics and for Latin 9
Date:	Tue, 17 Mar 1998 18:10:57 -0800
From:	Alain LaBonté <alb@xxxxxxxxxxxxxx>

Yves-Luc,

I had my colleague and old friend Takayuki Sato read the attached
document, which you scanned for me instead of faxing it and sent in a
processable format, as I had asked (you see, that was not wasted effort!).
He helped me understand exactly what the Japanese problem is, although he
agreed with me that the document in question contains major errors (but
never mind, that is not the point).

He agrees with me that, for the purposes of information interchange, the
UCS (Universal Character Set) is required... except on one very specific
point, on which I now understand the Japanese problem better.

He made it clear to me, with ideograms to back it up (fortunately they are
not entirely foreign to me, and I was very glad of what I knew, which
served me well), that the Japanese attach great importance to the
stylistic differences among their characters. For example, it takes two
characters to write each of the names Bei-Jing in Chinese and To-Kyo in
Japanese... the character Jing means "capital" (Beijing = Northern
capital, as opposed to Nanjing, Southern capital) and it is the same
Unicode character as the character Kyo, which means the same thing but is
not drawn in Japan the way it is in China (the dot on the Chinese
character is detached from the body of the character and is drawn somewhat
like a grave accent, whereas on the Japanese character the dot looks like
a small vertical bar that is fully joined to the rest of the character).
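
The unification described above can be checked directly. The short Python
sketch below assumes the usual spellings of the two city names (they are
not written out in the message) and only shows that the second character
of each name is one and the same code point, so the code alone cannot say
which national form to draw.

```python
import unicodedata

beijing = "北京"  # Bei-Jing, as written in Chinese (assumed spelling)
tokyo = "東京"    # To-Kyo, as written in Japanese (assumed spelling)

jing = beijing[1]  # the "capital" character in the Chinese name
kyo = tokyo[1]     # the "capital" character in the Japanese name

# Both names end in the same unified code point.
same_code_point = jing == kyo
code_point = f"U+{ord(jing):04X}"
name = unicodedata.name(jing)

print(same_code_point, code_point, name)
```

Whether the dot is drawn detached or joined is then decided by the font or
by some indication of language outside the character code.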

Personally, I never saw the difficulty, since it is enough to tag the
language and render the character according to that tag when the text is
bilingual (Chinese-Japanese). To Westerners this looks like a mere matter
of presentation, which it is for the Japanese too, but for them it carries
immense importance and is even a matter of national identity.

Where the shoe pinches, and this is precisely what affects maps, is that
the Japanese say that on an international map Bei-Jing must be shown in
the Chinese form and To-Kyo in the Japanese form. With the current state
of technology they can, using ISO/IEC 2022 (code extension techniques),
switch from the Chinese GB code to the Japanese JIS code, and this works
perfectly well, with no tags other than those supplied by the
conventional, standardized coding. They lose that information with Unicode
(or ISO/IEC 10646, which is the same thing, contrary to what your
geomatics expert colleague said, but that error is a detail, let us not
dwell on it!), and there is as yet no proven method of language tagging in
Asian computer applications.
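
As a rough illustration of that loss, the Python sketch below builds a
small ISO/IEC 2022 style byte stream by hand. It assumes the ISO-2022-JP-2
profile (one particular application of ISO/IEC 2022, chosen here only
because Python ships a codec for it) and the usual spellings of the two
city names; the variable names are illustrative.

```python
import re

ESC = b"\x1b"
TO_GB2312 = ESC + b"$A"    # designate GB 2312 (Chinese set)
TO_JISX0208 = ESC + b"$B"  # designate JIS X 0208 (Japanese set)
TO_ASCII = ESC + b"(B"     # return to ASCII at the end


def to_7bit(euc_bytes: bytes) -> bytes:
    """Clear the high bit of EUC bytes to get the 7-bit form used here."""
    return bytes(b & 0x7F for b in euc_bytes)


stream = (
    TO_GB2312 + to_7bit("北京".encode("gb2312"))
    + TO_JISX0208 + to_7bit("東京".encode("euc_jp"))
    + TO_ASCII
)

# The escape sequences record which national set each name came from.
designations = re.findall(rb"\x1b\$[AB]", stream)

# Decoding to Unicode keeps the text but drops the designations.
text = stream.decode("iso2022_jp_2")

print(designations, text, text[1] == text[3])
```

In the byte stream the two names are distinguishable by their
designations; after decoding, the shared character is a single code point
and nothing in the string says which set it came from.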

At the Seattle meeting, following a request from the Internet Society, a
project is under study to use UCS coding to tag the language (you will see
this in my mission report)... but it is only a project, not yet approved,
and so not within immediate reach of producers... and it will be much
easier to implement with the UCS in its 32-bit version than in its 16-bit
version (Unicode), since this coding (at least in the proposal) is placed
in plane 14 of the UCS, which requires either a fixed 32-bit coding or a
variable-length coding with UTF-16, a relatively complicated device under
Unicode.
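
The arithmetic behind that remark can be sketched in Python. The code
point U+E0001 is only an assumed example of a plane 14 position (the
message gives no code points from the proposal); the helper name is
illustrative.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the 16-bit range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


PLANE_14_EXAMPLE = 0xE0001  # assumed example of a plane 14 code point

plane = PLANE_14_EXAMPLE >> 16
high, low = utf16_surrogates(PLANE_14_EXAMPLE)

# Cross-check against the built-in codecs.
utf16 = chr(PLANE_14_EXAMPLE).encode("utf-16-be")
utf32 = chr(PLANE_14_EXAMPLE).encode("utf-32-be")

print(plane, hex(high), hex(low), utf16.hex(), utf32.hex())
```

A 32-bit code unit holds the value directly, whereas a 16-bit
implementation has to handle two code units for every such character.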

Furthermore, my Japanese colleague is of the opinion, like his compatriot
the geomatics expert, that placing ISO/IEC 646 and ISO/IEC 8859-1 at the
base of the coding pyramid is, for Asians and for the Japanese in
particular, an immense error of judgment, and I cannot say they are wrong,
particularly because we ourselves are pushing for Latin 9 for 8-bit
character sets, as I have often told you over the past year and even
before!

My preliminary conclusions, since I must conclude without leaving you
stranded:

-we should still push for international exchanges to use the UCS
(universal common denominator) in the medium term;
-on the other hand, my conversation convinces me even more that this
should be done with the 32-bit version of the UCS, which implies dropping
entirely the incremental notion of a coding pyramid (8 bits, 16 bits,
32 bits) as presented by our friend Doug O'Brien;
-in the meantime, we should allow the use of national codings when
international exchanges are not involved, given the immediate
language-tagging problems, which are a legitimate concern for the
Japanese, especially for geographic applications.

Does this give you the information you need? Would it not allow the
Canadian strategy to be simplified?

Alain LaBonté
Seattle
________________________________________________________________
Document you sent me and which I had Sato-san read:
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Comment on Character Coding

Junichi KOSEKI (Japan)

1. Can't we use character codes other than these three standards?

Chapter 7, Character coding, contains the following sentence.
"Today only three standards are widely implemented in different computer
systems in the world, i.e., ISO 646, ISO 8859-1 and ISO 10646 UCS-2 and
UCS-4 (UCS - Universal Character Set).....UCS-2 covers every living
language in the world, while UCS-4 also covers historical languages."

In Japan, Unicode 2.0 is implemented, but it is difficult to say that
ISO 10646 is generally used. On the other hand, JIS X0201 (almost the same
as ISO 646), JIS X0208 (the domestic character code set for Japanese
Kanji) and so on are used more universally than ISO 10646 (JIS X0221).
ISO 10646 UCS-4 is not implemented yet. We therefore think that "only
three standards are widely implemented" is not a suitable expression in
the working draft.

2. Comment on Unicode

ISO 10646-1 is essentially based on Unicode 1.0. As Unicode is revised to
Version 2.0, ISO 10646 will be revised too. In this case, the problem is
how to reconcile ISO 10646-1, which has a fixed length of 16 or 32 bits,
with Unicode 2.0, which is a variable-length code. The basic concept of
ISO 10646, which is to express the characters of the whole world with a
fixed length of 16 or 32 bits, may be lost.

On the other hand, all characters can be expressed by switching the
character code with escape sequences, and such an attempt is actually
under way too.

After this, we should examine sufficiently whether to follow ISO 10646 as
Unicode or to extend the character code on the basis of an established
technique such as ISO 2022. Where existing 2-byte character codes are in
wide use, as they are especially in Japan, careful consideration is needed
to decide whether they can move to the international standard without
causing confusion.

3. About the Character Connotation

A single character may have different typefaces, such as the new and old
typefaces of a Chinese character (Kanji), and it must be judged whether
these are the same character or not. If they are judged to be the same,
they are given the same Kanji code, and it is then no longer possible to
distinguish those different typefaces. We call this connotation.

Before establishing a standard for the connotation of Kanji, careful
discussion is needed, based on actual usage in each country, historical
background and so on. The result of such discussion is a domestic
Connotation Standard. In Japan, a Connotation Standard was first worked
out as JIS X0208 in 1978. Because the Connotation Standard needed to be
made clear, JIS X0208 was revised in 1997 with a renewed Connotation
Standard, after full deliberation over many years. It is important to
examine the connotation of character typefaces in each country, and it is
quite obvious that the "Connotation Standard" in the international
standard must be made clear. We estimate that an area for about 20,000
characters was secured at first and divided into three parts, for Japan,
China (including Taiwan) and Korea, and that the overflow was then made up
for by merging similar typefaces. We cannot verify whether the Connotation
Standard is a suitable unification or not. But we may doubt that the
clarity of the Connotation Standard was examined sufficiently for each
country, because the draft unification of ISO 10646 was completed in only
4 months.
If we are to unify characters such as Kanji, we have to deliberate
sufficiently on the actual usage of Kanji and the historical background of
each country in East Asia. And if each country has a domestic Connotation
Standard, the international Connotation Standard must not run counter to
these domestic standards, and must be able to follow future revisions of
each country's Connotation Standard.

4. Specifying the language in the character code

ISO 10646 (Unicode) has no mechanism by which a character indicates the
language of a given country. For mechanical interpretation, the language
must be inferred from the character code in advance. There is no problem
when a text is written in a single language, but texts in mixed languages
sometimes raise questions.
For example, if a character code based on ISO 2022 is adopted, the
language of a given country can be known whenever the character code is
switched.
The draft HTML (HyperText Markup Language) specification, version 4, based
on ISO 8879 (SGML: Standard Generalized Markup Language), was published in
July 1997 and includes a tag for specifying the language.
English
le français
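
The two sample lines above appear to have lost their markup in the
archive. As a hedged sketch only, the Python below assumes markup of the
form <span lang="...">, which is not confirmed by the text, and shows how
a program could recover the language of each run from such tags. The
class name LangRuns and the sample string are illustrative.

```python
from html.parser import HTMLParser


class LangRuns(HTMLParser):
    """Collect (language, text) pairs from elements with a lang attribute."""

    def __init__(self):
        super().__init__()
        self.stack = [None]  # language in effect; None when untagged
        self.runs = []

    def handle_starttag(self, tag, attrs):
        # An element inherits the enclosing language unless it sets its own.
        self.stack.append(dict(attrs).get("lang", self.stack[-1]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))


# Assumed markup; the original example tags are not preserved.
sample = '<p><span lang="en">English</span> <span lang="fr">le français</span></p>'

parser = LangRuns()
parser.feed(sample)
parser.close()
runs = parser.runs

print(runs)
```

This simple stack assumes every start tag has a matching end tag, which
holds for the sample but not for HTML in general.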

It is expected that the rendering of characters for different languages,
the processing of forbidden rules, mechanical interpretation and so on
will then work efficiently.

Character encoding in HTML 4.0 is described as follows.
"Commonly used character encodings on the Web include ISO-8859-1 (also
referred to as "Latin-1"; usable for most Western European languages),
ISO-8859-5 (which supports Cyrillic), SHIFT-JIS (a Japanese encoding),
EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646
using a different number of bytes for different characters). Names for
character encodings are case-insensitive, so that for example "SHIFT-JIS",
"Shift-JIS", and "shift-jis" are equivalent.

This specification does not mandate which character encoding a user agent
must support."
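
Two points in the quoted passage can be tried out with Python's codec
registry, used here only as a stand-in for a user agent: encoding names
are matched without regard to case, and UTF-8 uses a different number of
bytes for different characters. The sample characters are chosen for
illustration.

```python
import codecs

# The three spellings from the quoted passage resolve to one codec.
names = ["SHIFT-JIS", "Shift-JIS", "shift-jis"]
canonical = {codecs.lookup(n).name for n in names}

# An ASCII letter, an accented Latin letter and a Kanji in UTF-8.
samples = ["A", "\u00e9", "\u4eac"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(canonical, utf8_lengths)
```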
