    use utf8;
    use Text::Unidecode;
    print unidecode(
      "北亰\n"
      # Chinese characters for Beijing (U+5317 U+4EB0)
    );
    # That prints: Bei Jing
What Text::Unidecode provides is a function, "unidecode(...)", that takes Unicode data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration, i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system. (See the example in the synopsis.)
NOTE:
To check that your perldoc/Pod viewing setup is working: the six-letter word ``résumé'' should look like ``resume'' with an ``/'' accent on each ``e''.
For further tests, and help if that doesn't work, see below, ``A POD ENCODING TEST''.
So if you have Hebrew data that has no vowel points in it, then Unidecode cannot guess what vowels should appear in a pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th tpt. (This is a specific application of the general principle of ``Garbage In, Garbage Out''.)
Writing a real and clever transliteration algorithm for any single language usually requires a lot of time, and at least a passable knowledge of the language involved. But Unicode text can convey more languages than I could possibly learn (much less create a transliterator for) in the entire rest of my lifetime. So I put a cap on how intelligent Unidecode could be, by insisting that it support only context-insensitive transliteration. That means missing the finer details of any given writing system, while still hopefully being useful.
Unidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all: Russian and Greek seem to work passably; and while Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing system, setting up a mapping from it to Roman letters seems to work pretty well. But sometimes the output is very dirty: Unidecode does quite badly on Japanese and Thai.
If you want a smarter transliteration for a particular language than Unidecode provides, then you should look for (or write) a transliteration algorithm specific to that language, and apply it instead of (or at least before) applying Unidecode.
In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not being meticulous about any of them).
You should make a minimum of assumptions about the output of "unidecode(...)". For example, if you assume an all-alphabetic (Unicode) string passed to "unidecode(...)" will return an all-alphabetic string, you're wrong: some alphabetic Unicode characters are transliterated as strings containing punctuation (e.g., the Armenian letter ``Թ'' (U+0539) currently transliterates as ``T`'', capital-T then a backtick).
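If your code needs a purely alphanumeric result (for an identifier, say), it is safer to enforce that yourself than to assume it. The following is only a minimal sketch: "ascii_alnum_only" is a hypothetical helper, not part of Text::Unidecode, and it expects a string that has already been through "unidecode(...)".

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Text::Unidecode): keep only
# ASCII letters and digits from text that is already US-ASCII.
sub ascii_alnum_only {
    my ($ascii) = @_;
    (my $clean = $ascii) =~ s/[^A-Za-z0-9]+//g;
    return $clean;
}

# 'T`' is what the text above says U+0539 currently becomes.
my $transliterated = 'T`';
my $identifier     = ascii_alnum_only($transliterated);   # 'T'
```

In real use you would pass the return value of "unidecode(...)" to the helper instead of a hard-coded string.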
However, these are the assumptions you can make:
Unidecode transliterates context-insensitively; that is, a given character is replaced with the same US-ASCII (7-bit ASCII) character or characters, no matter what the surrounding characters are.
The main reason I'm making Text::Unidecode work with only context-insensitive substitution is that it's fast, dumb, and straightforward enough to be feasible. It doesn't tax my (quite limited) knowledge of world languages. It doesn't require me writing a hundred lines of code to get the Thai syllabification right (and never knowing whether I've gotten it wrong, because I don't know Thai), or spending a year trying to get Text::Unidecode to use the ChaSen algorithm for Japanese, or trying to write heuristics for telling the difference between Japanese, Chinese, or Korean, so it knows how to transliterate any given Uni-Han glyph. And moreover, context-insensitive substitution is still mostly useful, but still clearly couldn't be mistaken for authoritative.
Text::Unidecode is an example of the 80/20 rule in action: you get 80% of the usefulness using just 20% of a ``real'' solution.
A ``real'' approach to transliteration for any given language can involve such increasingly tricky contextual factors as these:
Out of a desire to avoid being mired in any of these kinds of contextual factors, I chose to exclude all of them and just stick with context-insensitive replacement.
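To make ``context-insensitive replacement'' concrete, here is a toy sketch of the idea. It is not Unidecode's actual code or data: the table "%TOY_TABLE", the function "toy_transliterate", and the choice of ``?'' for unmapped characters are all made up for illustration. The point is only that each character is looked up on its own, so the same input character always yields the same output.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical toy table; NOT the data that Text::Unidecode ships with.
my %TOY_TABLE = (
    'é' => 'e',
    'ü' => 'u',
    'д' => 'd',
    'а' => 'a',
);

sub toy_transliterate {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        if (ord($char) < 0x80) {
            $out .= $char;                 # already US-ASCII: keep as is
        }
        elsif (exists $TOY_TABLE{$char}) {
            $out .= $TOY_TABLE{$char};     # one lookup per character
        }
        else {
            $out .= '?';                   # toy choice for unmapped characters
        }
    }
    return $out;
}
```

Because neighbouring characters are never consulted, nothing in this loop could implement any of the contextual factors discussed above; that is exactly the trade-off being described.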
If all of those come out right, your Pod viewing setup is working fine; welcome to the 2010s! If those are full of garbage characters, consider viewing this page as HTML at <https://metacpan.org/pod/Text::Unidecode> or <http://search.cpan.org/perldoc?Text::Unidecode>
If things look mostly okay, but the Malayalam and/or the Chinese are just question-marks or empty boxes, it's probably just that your computer lacks the fonts for those.
* Rebuild the Unihan database. (Talk about hitting a moving target!)
* Add tone-numbers for Mandarin hanzi? Namely: In Unihan, when tone marks are present (like in ``kMandarin: dào''), should I continue to transliterate as just ``Dao'', or should I put in the tone number: ``Dao4''? It would be pretty jarring to have digits appear where previously there was just alphabetic stuff, but tone numbers make Chinese more readable.
* Start dealing with characters over U+FFFF.
* Fill in all the little characters that've crept into the Misc Symbols Etc blocks.
* More things that need tending to are detailed in the TODO.txt file, included in this distribution. Normal installs probably don't leave the TODO.txt lying around, but if nothing else, you can see it at <http://search.cpan.org/search?dist=Text::Unidecode>
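To illustrate the two options in the tone-number item above, here is a rough sketch of turning a single pinyin syllable into either form. It is not how Unidecode builds its tables; "pinyin_syllable_to_ascii" is a hypothetical helper, and it assumes its input is one syllable carrying at most one of the four standard pinyin tone marks (macron, acute, caron, grave for tones 1 to 4). It uses the core module Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Combining marks used for the four Mandarin tones in pinyin.
my %TONE_NUMBER = (
    "\x{0304}" => 1,   # macron
    "\x{0301}" => 2,   # acute
    "\x{030C}" => 3,   # caron
    "\x{0300}" => 4,   # grave
);

# Hypothetical helper, not part of Text::Unidecode.
sub pinyin_syllable_to_ascii {
    my ($syllable, $with_tone_number) = @_;
    my $decomposed = NFD($syllable);   # split letters from their accents
    my $tone = '';
    # Remove the first tone mark found, remembering which tone it denotes.
    if ($decomposed =~ s/([\x{0304}\x{0301}\x{030C}\x{0300}])//) {
        $tone = $TONE_NUMBER{$1};
    }
    # Drop any remaining combining marks (such as a diaeresis).
    $decomposed =~ s/\p{Mn}//g;
    my $ascii = ucfirst lc $decomposed;
    return $with_tone_number ? $ascii . $tone : $ascii;
}
```

Called with a false second argument, the helper gives the current style (``Dao''); with a true one, it appends the digit (``Dao4''). A syllable without a tone mark gets no digit in either case.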
It's better than nothing!
...in both meanings: 1) seeing the output of "unidecode(...)" is better than just having all font-unavailable Unicode characters replaced with ``?'''s, or rendered as gibberish; and 2) it's the worst, i.e., there's nothing that Text::Unidecode's algorithm is better than. All sensible transliteration algorithms (like for German, see below) are going to be smarter than Unidecode's.
Text::Unidecode is meant to be a transliterator of last resort, to be used once you've decided that you can't just display the Unicode data as is, and once you've decided you don't have a more clever, language-specific transliterator available, or once you've already applied a smarter algorithm and now just want Unidecode to do cleanup.
In other words, when you don't like what Unidecode does, do it yourself. Really, that's what the above says. Here's how you would do this for German, for example:
In German, there's the typographical convention that an umlaut (the double-dots on: ä ö ü) can be written as an ``-e'', like with ``Schön'' becoming ``Schoen''. But Unidecode doesn't do that: I have Unidecode simply drop the umlaut accent and give back ``Schon''.
(I chose this not because I'm a big meanie, but because generally changing ``ü'' to ``ue'' is disastrous for all text that's not in German. Finnish ``Hyvää päivää'' would turn into ``Hyvaeae paeivaeae''. And I discourage you from being yet another German who emails me, trying to impel me to consider a typographical nicety of German to be more important than all other languages.)
If you know that the text you're handling is probably in German, and you want to apply the ``umlaut becomes -e'' rule, here's how to do it for yourself (and then use Unidecode as the fallback afterwards):
    use utf8;  # needed: the source below contains literal non-ASCII characters

    our( %German_Characters ) = qw(
      Ä AE   ä ae
      Ö OE   ö oe
      Ü UE   ü ue
      ß ss
    );

    use Text::Unidecode qw(unidecode);

    sub german_to_ascii {
      my($german_text) = @_;

      $german_text =~
        s/([ÄäÖöÜüß])/$German_Characters{$1}/g;

      # And now, as a *fallthrough*:
      $german_text = unidecode( $german_text );
      return $german_text;
    }
To pick another example, here's something that's not about a specific language, but simply having a preference that may or may not agree with Unidecode's (i.e., mine). Consider the ``¥'' symbol. Unidecode changes that to ``Y=''. If you want ``¥'' as ``YEN'', then...
    use utf8;  # needed: the source below contains literal non-ASCII characters
    use Text::Unidecode qw(unidecode);

    sub my_favorite_unidecode {
      my($text) = @_;

      $text =~ s/¥/YEN/g;

      # ...and anything else you like, such as:
      $text =~ s/€/Euro/g;

      # And then, as a fallback,...
      $text = unidecode($text);
      return $text;
    }
Then if you do:
    print my_favorite_unidecode("You just won ¥250,000 and €40,000!!!");

...you'll get:

    You just won YEN250,000 and Euro40,000!!!
...just as you like it.
(By the way, the reason I don't have Unidecode just turn ``¥'' into ``YEN'' is that the same symbol also stands for yuan, the Chinese currency. A ``Y='' is nicely, safely neutral as to whether we're talking about yen or yuan, Japan or China.)
Another example: for hanzi/kanji/hanja, I have designed Unidecode to transliterate according to the value that that character has in Mandarin (otherwise Cantonese,...). Some users have complained that applying Unidecode to Japanese produces gibberish.
To make a long story short: transliterating from Japanese is difficult and requires a lot of context-sensitivity. If you have text that you're fairly sure is in Japanese, you're going to have to use a Japanese-specific algorithm to transliterate Japanese into ASCII. (And then you can call Unidecode on the output from that; it is useful for, for example, turning ｆｕｌｌｗｉｄｔｈ characters into their normal (ASCII) forms.)
Unicode Consortium: <http://www.unicode.org/>
Searchable Unihan database: <http://www.unicode.org/cgi-bin/GetUnihanData.pl>
Geoffrey Sampson. 1990. Writing Systems: A Linguistic Introduction. ISBN: 0804717567
Randall K. Barry (editor). 1997. ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts. ISBN: 0844409405 [ALA is the American Library Association; LC is the Library of Congress.]
Rupert Snell. 2000. Beginner's Hindi Script (Teach Yourself Books). ISBN: 0658009109
Unidecode is distributed under the Perl Artistic License ( perlartistic ), namely:
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
The views and conclusions contained in the software and documentation are those of the authors/contributors and should not be interpreted as representing official policies, either expressed or implied, of The Unicode Consortium.