Modify OpenAccess to Handle Multilingual UTF-8 Strings Natively

I once had a job scrutinizing Virtuoso schematics that contained many useful annotations written in Italian. EDA tools generally support only ASCII strings, but the Italian designers were lucky–theirs is one of the few languages that can be written naturally using the same characters as English.

Wouldn’t it be nice to annotate designs in OpenAccess databases using any of the world’s languages? You can do it right now, using any OpenAccess release.

UTF-8 Encoding

Unicode can represent virtually every written language in history. UTF-8 (8-bit Unicode Transformation Format) is a clever encoding of Unicode into multibyte character strings built from 8-bit units, and today it enjoys immense popularity in the software industry, EDA being a notable exception. Unicode and UTF-8 are already described in other excellent articles, so here I will assume a general understanding of UTF-8 without repeating details available elsewhere.

With just a bit of care, software originally designed exclusively for ASCII can be enhanced to handle UTF-8, using what Markus Kuhn has dubbed the soft conversion approach. Throughout the application, all strings remain null-terminated arrays of 8-bit characters, which is how the OpenAccess oaString class is currently organized.
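
Soft conversion works because every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII byte can never appear inside another character. Here is a minimal sketch of what that buys you, using only byte-oriented C functions. The helper name splitAtAscii() is my own invention for illustration, not part of OpenAccess, and the source file is assumed to be saved as UTF-8.

```cpp
#include <cstring>
#include <string>
#include <utility>

// Split a null-terminated UTF-8 string at the first occurrence of an ASCII
// delimiter. strchr() compares single bytes, yet it cannot produce a false
// match inside a multibyte character, because those bytes are all >= 0x80.
// delim must be a non-null 7-bit ASCII character.
std::pair<std::string, std::string> splitAtAscii(const char *utf8, char delim)
{
    const char *hit = std::strchr(utf8, delim);
    if (hit == nullptr) {
        // Delimiter absent: everything goes into the first half.
        return {std::string(utf8), std::string()};
    }
    return {std::string(utf8, hit), std::string(hit + 1)};
}
```

Calling splitAtAscii("電流/5μA", '/') yields the two halves "電流" and "5μA" without the code knowing anything about UTF-8.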

In fact, if you are writing a new application, you can use UTF-8 with a standard OpenAccess release by simply avoiding the few oaString methods that are incompatible with UTF-8. This is so natural that you may already be writing your application in a UTF-8 compatible fashion. In the next article in this series, I will show you how to write a UTF-8 compatible application based on any standard OpenAccess release.

But first, it is instructive to look at what is needed to make the OpenAccess oaString class fully UTF-8 capable.

What Is Changed

Of course, the overarching goal is to change as little as possible. Ultimately, we want your user to set any UTF-8 locale,

export LC_CTYPE=ja_JP.UTF-8 # Japanese UTF-8

and start your application, and insofar as it makes sense, any string can contain any language (not just Japanese).

Of paramount importance is ASCII compatibility–whether or not you set the locale, OpenAccess must always work with ASCII.

The OpenAccess class oaString almost worked as-is. I made just a few changes:

Added new oaString methods to support multibyte strings. No backward compatibility problem here.
Enhanced substr() to use character indices. This is the only existing oaString method that required modification. Its behavior with non-ASCII characters changes with the locale, but with pure ASCII strings substr() works the same as always in a UTF-8 or default locale.
Completely rewrote the oaString unit test, expanding it by a factor of five. All tests run twice: once in the default C locale, and once in the Japanese UTF-8 locale. The existing test is also included to verify backward compatibility.
Fixed a memory error for non-printing ASCII characters in oaFont::calcBBox().
Implemented ctype functions in oaNameSpace to keep OpenAccess namespaces using 7-bit ASCII when the rest of the application switches to a non-default locale.

The above changes fix OpenAccess bugs 1280, 1283, 1284, 1285, and 1286, and address OpenAccess feature requests 700 and 1190.
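
To illustrate the last item in the list: the standard ctype functions such as isalpha() consult the current C locale, so their answers for bytes above 0x7F can change once the application calls setlocale(). The following is only a sketch of locale-independent replacements. The function names are hypothetical and this is not the code that went into oaNameSpace.

```cpp
#include <string>

// Classification that depends only on the 7-bit ASCII table,
// never on the locale selected with setlocale().
inline bool asciiIsAlpha(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

inline bool asciiIsDigit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

inline bool asciiIsAlnum(unsigned char c)
{
    return asciiIsAlpha(c) || asciiIsDigit(c);
}

// True when every byte of s is below 0x80, that is, when s is pure ASCII.
bool isSevenBitAscii(const std::string &s)
{
    for (char ch : s) {
        if (static_cast<unsigned char>(ch) >= 0x80) {
            return false;
        }
    }
    return true;
}
```

A namespace could use a check like isSevenBitAscii() to reject a name before mapping it, whatever locale the rest of the application has chosen.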

What Is Not Changed

This project is driven by practical application, not a maniacal pursuit of purity. The objective is merely to allow designers to annotate schematics and layouts in any language. Therefore:

No attempt will be made to extend OpenAccess namespaces to handle non-ASCII strings–most of them throw an exception when they encounter a non-ASCII character. This means that netlist identifiers, such as net, instance, or terminal names, remain ASCII. Non-ASCII strings are for humans to read, not machines. The OpenAccess native namespace, oaNativeNS, does work with UTF-8, but it will only be robust after OA bug 1286 is fixed.
The lexicographical rules used by the comparison operators <, >, <=, and >= are unaffected by the locale. This means that if you use these operators to sort strings they may not appear in the exact order customary in the locale. Don’t show the results to a librarian in the target country.
Wide characters (wchar_t) will not be added. Even a single character will be represented as a multibyte string in an ordinary oaString.
There are many more locales beyond the default C or UTF-8. No attempt will be made to address other locales like Japanese Shift-JIS or Cyrillic KOI8-R.
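
The point about the comparison operators can be seen with plain std::string, whose operator< compares bytes as unsigned values. I am assuming here that the oaString operators behave like strcmp(), which is how I read "unaffected by the locale". The function sortByBytes() is a hypothetical helper, and the source file is assumed to be UTF-8.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort strings by raw byte value, ignoring the locale entirely.
// For valid UTF-8 this is the same as sorting by Unicode code point,
// which is deterministic but not the order a dictionary would use.
std::vector<std::string> sortByBytes(std::vector<std::string> names)
{
    std::sort(names.begin(), names.end());
    return names;
}
```

Sorting "zebra", "école", "Zulu" and "apple" this way gives Zulu, apple, zebra, école: uppercase sorts before lowercase, and the accented word lands at the end because its first byte is 0xC3. If you need the order customary in the locale, strcoll() from the C library is the usual tool.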

The New oaString

Most of the oaString interface can be used without concern for whether the string contains ASCII or UTF-8. The differences occur when counting:

The number of bytes occupied
The number of characters
The display width

oaString currently assumes that all three of the above are equal, and when the string contains only ASCII characters, they are indeed equal. However, when a string contains multibyte characters, these three values can be different.

So let’s look at the existing and new oaString methods that involve a count.

Number of Bytes in a String

All existing oaString methods that accept or return a count refer to a byte count. For example, consider a string consisting of three Japanese characters stored in an oaString,

oaString str("文字列");
oaUInt4 len = str.getLength(); // len = 9

In UTF-8, each of these kanji characters is three bytes long. oaString::getLength() returns 9, the number of bytes consumed by the string, excluding the null terminator.
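
If you want to see where the 9 comes from, the lead byte of each UTF-8 sequence tells you how many bytes the character occupies. This is a small sketch using std::string instead of oaString so that it compiles without OpenAccess. The function names are mine, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that begins with lead byte b.
// Unexpected bytes count as 1 so that callers always make progress.
std::size_t sequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // stray continuation or invalid byte
}

// Byte length of each character in the string, in order.
std::vector<std::size_t> bytesPerCharacter(const std::string &utf8)
{
    std::vector<std::size_t> lengths;
    for (std::size_t i = 0; i < utf8.size();) {
        std::size_t n = sequenceLength(static_cast<unsigned char>(utf8[i]));
        if (n > utf8.size() - i) {
            n = utf8.size() - i;       // truncated final sequence
        }
        lengths.push_back(n);
        i += n;
    }
    return lengths;
}
```

For "文字列" this returns 3, 3, 3, which sums to the 9 bytes reported by getLength(). For "5μA" it returns 1, 2, 1.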

Most of the existing oaString methods that accept or return counts continue to make sense as long as you stop thinking of them as character counts:

oaString::oaString(oaUInt4 length), where length is the number of bytes to preallocate, not including the terminating null character.
oaString::oaString(const oaChar *initialValue, oaUInt4 length), where length is the maximum number of bytes the string is allowed to consume, excluding the null terminator. Should this method also be modified so that the length argument is in units of characters rather than bytes? Please leave a comment if you have a preference.
oaString::getLength() returns the number of bytes required to store the string, excluding the null terminator. This method remains just as important for multibyte character strings as it is for ASCII strings.
oaString::resize(oaUInt4 size) changes the number of bytes allocated to the oaString instance, potentially truncating the string. size includes the terminating null character.
oaString::getSize() returns the number of bytes currently allocated, including the terminating null character.

The counts in these methods are closely related to memory management, so it makes sense that they are in the same units used by the familiar malloc().
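
One consequence of counting in bytes is that a byte limit can fall in the middle of a multibyte character. I have not described what oaString does in that case, so treat the following purely as a sketch of how an application might cut a string to a byte budget without leaving half a character behind. truncateToBytes() is a hypothetical helper operating on std::string, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Return the longest prefix of utf8 that fits in maxBytes bytes and
// ends on a character boundary.
std::string truncateToBytes(const std::string &utf8, std::size_t maxBytes)
{
    if (utf8.size() <= maxBytes) {
        return utf8;
    }
    std::size_t cut = maxBytes;
    // If the first excluded byte is a continuation byte (10xxxxxx), the cut
    // would split a character, so back up to that character's lead byte.
    while (cut > 0 && (static_cast<unsigned char>(utf8[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return utf8.substr(0, cut);
}
```

With a budget of 7 bytes, "文字列" is cut to "文字" (6 bytes) instead of two characters plus a stray byte.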

Number of Characters in a String

Since a single UTF-8 character can be from one to four bytes long, the length of a string in characters is not necessarily the length in bytes.

Only one existing oaString method has been enhanced to support character counts:

oaString::substr(const oaString &sub, oaUInt4 offset) accepts a character offset and returns the character position of string sub. This also serves as the multibyte character replacement for oaString::index(). When substring sub is not found, substr() returns the string length in bytes, not characters. This maintains compatibility with existing code that detects the existence of the substring by comparing with getLength().
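
The behavior described above can be sketched with std::string. This is not the code in the enhanced oaString; it is a standalone approximation that assumes well-formed UTF-8 instead of consulting the locale, and all three function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Continuation bytes have the form 10xxxxxx; every other byte starts a character.
static bool isContinuation(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of characters that start within the first endByte bytes of s.
static std::size_t countChars(const std::string &s, std::size_t endByte)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < endByte && i < s.size(); ++i) {
        if (!isContinuation(s[i])) ++n;
    }
    return n;
}

// Byte offset at which character number charIndex starts, or s.size()
// when the string has fewer characters than that.
static std::size_t byteOffsetOfChar(const std::string &s, std::size_t charIndex)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isContinuation(s[i])) continue;
        if (seen == charIndex) return i;
        ++seen;
    }
    return s.size();
}

// Search for sub starting at character offset charOffset. Returns the
// character position of the match, or the length of s in BYTES when
// there is no match, mirroring the convention described in the text.
std::size_t findByCharOffset(const std::string &s, const std::string &sub,
                             std::size_t charOffset)
{
    const std::size_t hit = s.find(sub, byteOffsetOfChar(s, charOffset));
    if (hit == std::string::npos) {
        return s.size();
    }
    return countChars(s, hit);
}
```

In "monster化け物", searching for "け" from offset 0 returns 8, while searching for "化" from offset 8 returns 16, the byte length. The not-found value is unambiguous because a character position is always smaller than the byte length.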

I have added new methods that accept and return character counts:

oaString::getNumChars() counts the characters in the string based on the locale.
oaString::at(const oaUInt4 offset, oaString &charOut) returns in charOut the string representing the multibyte character at position offset. This provides the functionality of oaString::operator[] for multibyte characters.
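
For readers who want a feel for what these two methods do, here is a rough standalone equivalent on std::string. Note the difference from the real methods: they count based on the locale, whereas this sketch simply assumes the bytes are well-formed UTF-8. numChars() and charAt() are names I made up for the illustration.

```cpp
#include <cstddef>
#include <string>

static bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Count characters by counting the bytes that start a UTF-8 sequence.
std::size_t numChars(const std::string &utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        if (!isContinuationByte(c)) ++n;
    }
    return n;
}

// Copy the character at position index into charOut as a short string.
// Returns false and clears charOut when index is past the end.
bool charAt(const std::string &utf8, std::size_t index, std::string &charOut)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (isContinuationByte(utf8[i])) continue;
        if (seen == index) {
            std::size_t end = i + 1;
            while (end < utf8.size() && isContinuationByte(utf8[end])) ++end;
            charOut = utf8.substr(i, end - i);
            return true;
        }
        ++seen;
    }
    charOut.clear();
    return false;
}
```

numChars("文字列") is 3 even though the string occupies 9 bytes, and charAt("monster化け物", 8, c) places "け" in c.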

Display Width

When displaying monospaced characters, Chinese, Japanese and Korean (CJK) characters are rendered twice as wide as Latin characters. For example,

monster化け物

“monster” is rendered seven columns wide. Each of the three Japanese characters in “化け物” is rendered twice as wide, so “化け物” occupies six columns and the whole string occupies thirteen. I added the new method,

oaString::getNumColumns() counts the number of display columns based on the locale

This method can be used for aligning columns of monospaced characters while printing to the console. Things are more complex when determining the bounding box of an oaText or oaTextDisplay object. I will take up that topic in the next article in this series.
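
To make the column arithmetic concrete, here is a simplified, locale-independent sketch. It decodes each code point and treats a handful of East Asian ranges as double width. The range list is an abbreviated approximation that I chose for illustration; it ignores combining and control characters. A real implementation would more likely rely on the C library's wcwidth() or a full East Asian Width table. The function names are hypothetical and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at byte i and advance i past it.
static char32_t decodeNext(const std::string &s, std::size_t &i)
{
    const unsigned char b = static_cast<unsigned char>(s[i++]);
    char32_t cp;
    int extra;
    if (b < 0x80)                { cp = b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else                         { cp = 0xFFFD;   extra = 0; }  // invalid lead byte
    for (; extra > 0 && i < s.size(); --extra, ++i) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    return cp;
}

// Approximate test for characters that occupy two columns.
static bool isWide(char32_t cp)
{
    return (cp >= 0x1100 && cp <= 0x115F)     // Hangul Jamo
        || (cp >= 0x2E80 && cp <= 0x303E)     // CJK radicals, punctuation
        || (cp >= 0x3041 && cp <= 0x33FF)     // Hiragana, Katakana, etc.
        || (cp >= 0x3400 && cp <= 0x4DBF)     // CJK Extension A
        || (cp >= 0x4E00 && cp <= 0x9FFF)     // CJK Unified Ideographs
        || (cp >= 0xAC00 && cp <= 0xD7A3)     // Hangul syllables
        || (cp >= 0xF900 && cp <= 0xFAFF)     // CJK compatibility ideographs
        || (cp >= 0xFF00 && cp <= 0xFF60)     // Fullwidth forms
        || (cp >= 0x20000 && cp <= 0x3FFFD);  // Supplementary ideographs
}

// Number of monospaced display columns the string occupies.
std::size_t displayColumns(const std::string &utf8)
{
    std::size_t columns = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        columns += isWide(decodeNext(utf8, i)) ? 2 : 1;
    }
    return columns;
}
```

With this sketch, "monster" measures 7 columns, "化け物" measures 6, and "monster化け物" measures 13.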

8-bit Character Functions

Some oaString methods manipulate 8-bit characters one by one, and are therefore not so useful with multibyte characters:

oaString::operator[](oaUInt4 i) returns a byte. Use this with multibyte strings only when you really do want to access a string byte by byte. Use the new function oaString::at() to access the nth character of a multibyte string.
oaString::index(oaChar c, oaUInt4 offset=0) searches for the 8-bit character c. Use substr() to search for a multibyte character in a string.

What You Need to Change

There are two approaches to internationalizing your OpenAccess-based application.

You can perform all string manipulation in a Unicode compatible string class like std::string or Qt QString, thereby avoiding the few oaString methods that are incompatible with multibyte character strings. This allows you to use any OpenAccess release. The next article in this series details this approach.

If your existing application relies on oaString to correctly count characters, you can internationalize it with the following procedure:

Get the new UTF-8 oaString from the Si2 OpenAccess Contributed Reuse Library. Install and build it according to the instructions provided. Encourage the OpenAccess change team to incorporate these changes in future OpenAccess releases.
Upon startup, get the locale from the user environment with,
setlocale(LC_CTYPE, "");
If you are using an application framework, it may set the locale for you. For example, Qt QApplication sets the locale.
Inspect your use of oaString and remove the assumption that one byte contains one character that prints in one column:
- Replace oaString::getLength() with oaString::getNumChars() where it is your intention to count characters. Continue to check for the existence of a substring by comparing the return values of substr() and getLength().
- Check the length argument of oaString::oaString(const oaChar *initialValue, oaUInt4 length), to make sure it is in units of bytes, not characters
- Make sure the size argument to oaString::resize(oaUInt4 size) is the number of bytes to allocate, not the number of characters. Note that size includes the terminating null character.
- Eliminate oaString::index(), replacing it with oaString::substr()
- Examine your usage of oaString::operator[]. Replace it with oaString::at() where your intention is to extract the nth character. Continue using oaString::operator[] where you really do want to access a string byte by byte.
- Use oaString::getNumColumns() instead of oaString::getLength() when printing aligned columns of monospaced characters to the console
As a practical matter, OpenAccess client applications rarely do much string manipulation using oaString, so you will probably find that very few changes are required.
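
The startup step in the procedure above might look like the following sketch. It assumes a POSIX system, since nl_langinfo() and <langinfo.h> are not part of standard C++, and the helper name adoptUserLocale() is hypothetical. The comparison with "UTF-8" matches what glibc and macOS report; other platforms may spell the codeset differently.

```cpp
#include <clocale>
#include <cstring>
#include <langinfo.h>  // POSIX only

// Adopt LC_CTYPE from the user's environment. Returns true when the
// resulting character encoding is UTF-8.
bool adoptUserLocale()
{
    if (std::setlocale(LC_CTYPE, "") == nullptr) {
        // The environment names a locale that is not installed.
        // The program stays in the default "C" locale.
        return false;
    }
    return std::strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Call it once, early in main(), before any code that counts characters or columns. An application could warn the user when it returns false, since the character and column counts are described as depending on the locale.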

It Just Works

With the few exceptions detailed above, you can use multilingual UTF-8 strings in any OpenAccess release by merely… typing them. For example, you can add a Japanese property to an instance,

oaStringProp *prop = oaStringProp::create(inst, "current", "電流5μA");

and then print it out:

oaString propValue;
prop->getValue(propValue);
cout << "プロパティーの値は「" << propValue << "」です。" << endl;
// Prints プロパティーの値は「電流5μA」です。

You will note that this example uses none of the new multibyte character oaString methods–it works with any OpenAccess release, as will the vast majority of your existing code. This is the beauty of UTF-8. The next article in this series will go into the details of using multilingual UTF-8 with a standard OpenAccess release.

John McGehee
