TheJach.com

Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Hello Unicode!

I introduce you to a cross product of A×B! Or something.

Anyway, my blog now officially supports Unicode. That was fun! Or not. My blog is written in PHP, and as anyone who's done much PHP knows, nice Unicode handling is a far-off dream in the magical PHP 6. Python handles Unicode so much more nicely... To convert a PHP code base to Unicode, there are a bunch of functions you have to change and watch out for. Luckily this isn't too tedious: most of them are string functions, and with some clever scripting we can find all the spots in the code that need fixing up. (I also thought about scripting the replacements, but I didn't have that many and I like to verify things manually sometimes.)
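
To see why the byte-oriented string functions have to go, here's a small Python 3 sketch (not PHP, and the variable names are just for illustration). PHP's plain strlen and substr count bytes, which is roughly what working on the encoded bytes below looks like; the mb_ versions behave more like working on the decoded string.

```python
# Python 3 sketch: bytes vs. characters for the string "A×B".
text = "A×B"                     # three characters
raw = text.encode("utf-8")       # what a byte-oriented function sees

char_count = len(text)           # counts code points
byte_count = len(raw)            # counts bytes; '×' takes two in UTF-8

# Cutting after two *bytes* splits '×' in half, leaving invalid UTF-8.
broken = raw[:2].decode("utf-8", errors="replace")
# Cutting after two *characters* keeps '×' intact.
intact = text[:2]

print(char_count, byte_count)
print(repr(broken), repr(intact))
```

The broken slice is the sort of thing that shows up as a replacement character or a garbled glyph on a page.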

First up is a helpful guide: http://www.phpwact.org/php/i18n/utf-8 (definitely skim through it before starting any Unicode conversion project, as there are quite a few things you have to worry about). For now, let's get a list of all the string functions we need to change.

Going to the docs, there is a table of contents of all the mb_ multibyte string functions we have to use if we're doing unicode. (Interestingly, str_replace is not among them--it does the right thing automagically. explode is another such function that just works.) One of the things I commonly use a Python REPL for is to format some words I can copy-paste from a website into a programmable form, such as an array of the functions I need. This is really easy stuff (I omit the arrows since I have them off in my REPL):


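tab = '''highlight the table of contents functions, copy paste in here
'''
fns = []
for line in tab.split('\n'):
    if '—' not in line:
        continue  # skip blank lines and anything that isn't "name, dash, description"
    name = line[:line.find('—')].strip()  # is that '—' unicode? Go python!
    if name.startswith('mb_'):
        name = name[3:]  # drop the mb_ prefix so we search for the plain function
    fns.append(name)

'(' + '|'.join(fns) + ')'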


This gives us the big list in ready-to-regex format:


'(check_encoding|convert_case|convert_encoding|convert_kana|convert_variables|decode_mimeheader|decode_numericentity|detect_encoding|detect_order|encode_mimeheader|encode_numericentity|encoding_aliases|ereg_match|ereg_replace|ereg_search_getpos|ereg_search_getregs|ereg_search_init|ereg_search_pos|ereg_search_regs|ereg_search_setpos|ereg_search|ereg|eregi_replace|eregi|get_info|http_input|http_output|internal_encoding|language|list_encodings|output_handler|parse_str|preferred_mime_name|regex_encoding|regex_set_options|send_mail|split|strcut|strimwidth|stripos|stristr|strlen|strpos|strrchr|strrichr|strripos|strrpos|strstr|strtolower|strtoupper|strwidth|substitute_character|substr_count|substr)'


There are a few more functions, though. First, functions such as preg_replace need the u modifier added to the pattern (for example, '/foo/u'). So we add to our list:

preg_filter|preg_grep|preg_last_error|preg_match_all|preg_match|preg_quote|preg_replace_callback|preg_replace|preg_split

htmlentities is another gotcha: it really needs "UTF-8" passed as its third argument. Add that to the list. You'll also want to search for htmlspecialchars and html_entity_decode, so add those too.
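
Here's roughly what goes wrong when the charset argument is left off. This is a Python 3 stand-in, not PHP's actual implementation: the entities helper is made up for illustration, and it assumes the function falls back to ISO-8859-1 when no charset is given (as far as I know, that was the default in PHP 5 before 5.4).

```python
from html.entities import codepoint2name

def entities(utf8_bytes, assumed_charset):
    """Rough stand-in for htmlentities: decode the bytes with the assumed
    charset, then replace every character that has a named entity."""
    decoded = utf8_bytes.decode(assumed_charset)
    return "".join(
        "&" + codepoint2name[ord(ch)] + ";" if ord(ch) in codepoint2name else ch
        for ch in decoded
    )

data = "A×B".encode("utf-8")
wrong = entities(data, "latin-1")   # charset guessed wrong: '×' turns into junk
right = entities(data, "utf-8")     # charset stated explicitly
print(repr(wrong))
print(repr(right))
```

With the wrong charset, the two bytes of '×' are treated as two separate characters, so the first one gets its own (wrong) entity and the second is left as a stray control character.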

Now take your long string of functions to search for, and paste it onto the command line:

$ find . -name '*.php' | xargs grep --color -i -n -E '(fn1|fn2|...|fnN)'


This will print out any occurrences you need. If it looks right, redirect the output to a file for saving. Then you can open it up with vim! (I love vim.) vim will automatically take you to the proper files and lines and you can edit manually to make sure everything is done correctly. (Or you could spend some time coming up with a clever script to do the replacing for you.) The command for vim is:

$ vi -q foundfiles -c copen
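
If you don't have find and grep handy, the search step can be sketched in Python 3 too. This is only an approximation of the pipeline above: find_calls is a made-up helper name, and it is a bit stricter than the grep because it requires a word boundary before the name and an opening parenthesis after it.

```python
import os
import re

def find_calls(root, names):
    """Yield (path, line_number, line) for lines that appear to call one of `names`."""
    # \b keeps 'strlen' from matching inside 'mb_strlen' ('_' is a word character),
    # so calls that were already converted are skipped.
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, names)) + r")\s*\(",
        re.IGNORECASE,
    )
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.endswith(".php"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        yield path, number, line.rstrip("\n")

# Example use, with a hypothetical short list (substitute the full one):
# for path, number, line in find_calls(".", ["strlen", "substr", "htmlentities"]):
#     print(f"{path}:{number}:{line}")
```

Printing path:line:text keeps the output in the same shape as grep -n, so saving it to a file and opening that with vi -q should work the same way.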


After you're done, there is one more detail to take care of: you need to tell PHP to use UTF-8. If, like me, you have a single index.php or config file that handles everything, this is easy: place the following at the top of that file. Otherwise it needs to be at the top of every executed script.


mb_internal_encoding('UTF-8');


(You may also need to change your database to work with UTF-8. Many MySQL setups are already configured this way, but check yours: the server's stock default was latin1 for a long time.)

Edit: Woops, there was one other thing I forgot to mention. After creating the database connection, you should run:


mysqli_set_charset($dbc, 'utf8'); // MySQL spells it 'utf8' (or 'utf8mb4' for 4-byte characters), not 'UTF-8'


(If you're not using MySQLi...) A second page of good tips is http://abeautifulsite.net/blog/2010/07/tips-for-supporting-utf-8-in-your-php5-applications/. Something my blog has done forever, but which I also forgot to mention, is the HTML side: you need to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> to your page's <head>.

If you're looking for a base64 encoding function that handles unicode, there's one in my javascript code. (As well as one that doesn't work for unicode.)
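
The general trick for Unicode-safe base64 is to turn the text into UTF-8 bytes first and base64 those. Here's a minimal Python 3 sketch of the idea; this is not my JavaScript code, and the function names are just for illustration.

```python
import base64

def b64_encode_text(text):
    """Encode a str as base64 by going through UTF-8 bytes first."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def b64_decode_text(encoded):
    """Reverse of b64_encode_text: base64 to bytes, then UTF-8 to str."""
    return base64.b64decode(encoded).decode("utf-8")

token = b64_encode_text("A×B")
print(token, b64_decode_text(token))
```

A version that skips the UTF-8 step only works while every character fits in a single byte, which is the usual way the non-Unicode variants fall over.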


Posted on 2011-05-21 by Jach

Tags: personal, php, programming, python, scripting, tips, unicode

Permalink: https://www.thejach.com/view/id/174

Trackback URL: https://www.thejach.com/view/2011/5/hello_unicode


Jach May 21, 2011 01:32:28 AM Now you crazy commenters can keep writing your comments in Word or something and polluting my blog with your “evil unicode quoting characters”.