Hacking TrueType Fonts For Character Information
Those of you who have ever been curious about making your own font should know that doing so on the computer isn't easy. Sure, there are several good programs out there that can help you take your design and digitize it, but a well-made font has been crafted with as much care and attention to detail by a computer scientist as by a designer. Some considerations that need to be made on the technical side include, for instance, how to "hint" rendering at very large or small sizes, accounting for grayscale devices in such hinting, making characters by compositing glyphs to save on file size (e.g. fi = f + i), and dealing with different platforms and character encodings so the font can be portable across Windows, Mac, and others.
Now, think back to one of my long-time projects that relates to displaying text and images. Yes, BriteBlox can certainly be capable of displaying messages set with TrueType fonts, and this has been supported in the development version for quite some time. However, to make it scale well for any message at all, it is important to know what the width of each character is. As such, the efforts described here were undertaken for the sake of improving BriteBlox.
Why?
The simplest way to render TTFs in Python is to use PIL (the Python Imaging Library). With this, you can establish an Image object and then instruct PIL to render text with the desired typeface onto the image. However, you need to know in advance what the width of each character is so you can make a correctly sized Image object before rendering text onto it. Otherwise, you only discover afterward that either the image is too short and the text is chopped, or it's so large that you're out of memory. In the BriteBlox PC Tools, this feature was disabled in releases for such a long time because I would have to manually guess and check the correct size of the bounding box for my text. Soon, that will no longer be required!
The High-Level Solution
[Important note] There may be, in fact, a better solution for those of you using Qt, an application framework. Unfortunately, my installation of the Qt 5 libraries in PyQt5 seg-faults (or tries to access a null pointer) when I try to run the appropriate commands, so I will have to write about that in the future once I upgrade Qt and hopefully get it working.
Along with PIL (or Pillow) in Python, you can use the fonttools and ttfquery libraries (which depend on numpy) in order to fetch the width of a particular character glyph. (The glyph is the artistic rendering of the character; the character is more of just a concept in the realm of typography.) To get the required width (and height) for the container image, begin by using this code:
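from ttfquery import describe, glyphquery

myfont = describe.openFont("C:\\Windows\\Fonts\\arial.ttf")
glyphquery.charHeight(myfont)  # overall character height, in font units
glyphquery.width(myfont, 'W')  # width of the "W" glyph, in font units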
Now you have the width of a character from your TTF file. If you actually run this, though, you may notice the values seem really odd -- in fact, very large. This is because the values being retrieved (I'll tell you exactly where these come from later) are expressed in "font units" or "EM units", which relate to the "EM square". Remember your em-dashes and en-dashes from English class? Well, it turns out they're incredibly important in typography too. The EM units are derived from the "EM square", which is a square the size of the em-dash. Back when fonts were cast into metal stamps and then pressed into paper, the em-dash was typically the widest character you could have. In digital media, though, characters are allowed to be wider than the em-dash, so you have to look at each character specifically to find out how wide each one is. Nothing can be taken for granted.
EM units are simply little divisions of the EM square such that now the EM square is divided up into a grid. There are several acceptable values for how many units exist along one single side -- in fact, any value from 16 to 16384 is acceptable, though the TrueType specification recommends a power of two. The typical "resolution" of the EM square, as defined by the "unitsPerEm" field in the TTF specification, is 2,048 units per side of the square. However, again, this value cannot be taken for granted; I will explain ways to fetch it later. Once you have the correct unitsPerEm value, put it into the following equations:
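The usual TrueType scaling rule is pixels = font units × point size × PPI / (72 × unitsPerEm). Here is a minimal sketch of it in Python. The function and parameter names are my own, for illustration only, and the PPI is something you have to supply for your platform.

```python
def font_units_to_pixels(units, units_per_em, point_size, ppi):
    """Scale a measurement in font units to device pixels."""
    # One point is 1/72 inch, so this is the size of the EM square in pixels.
    em_pixels = point_size * ppi / 72.0
    # The EM square is units_per_em font units on a side.
    return units * em_pixels / units_per_em


# A glyph half an EM wide (1024 of 2048 units), set at 12 pt on a 72 PPI device.
print(font_units_to_pixels(1024, 2048, 12, 72))
```

Round the result up before using it as an image dimension, so the last column of pixels is not cut off.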
Finding the EM Size Of Your Font
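One way to fetch unitsPerEm by hand, sketched under the assumption of a plain single-font .ttf file (not a .ttc collection): walk the table directory at the start of the file, find the "head" table, and read the unsigned 16-bit value 18 bytes into it. The helper names below are hypothetical, not part of any library.

```python
import struct


def find_table(font_bytes, tag):
    """Return (offset, length) of a table listed in the font's table directory."""
    # File header: sfnt version (4 bytes), numTables (2 bytes), 3 search fields (6 bytes).
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for i in range(num_tables):
        # Each 16-byte directory record holds: tag, checksum, offset, length.
        rec_tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        if rec_tag == tag:
            return offset, length
    raise KeyError(tag)


def read_units_per_em(font_bytes):
    """Read unitsPerEm from the 'head' table."""
    head_offset, _length = find_table(font_bytes, b"head")
    # unitsPerEm sits after version, fontRevision, checkSumAdjustment,
    # magicNumber (4 bytes each) and flags (2 bytes): 18 bytes in.
    return struct.unpack_from(">H", font_bytes, head_offset + 18)[0]


# Usage (the path is only an example):
# with open("C:\\Windows\\Fonts\\arial.ttf", "rb") as f:
#     print(read_units_per_em(f.read()))
```

All multi-byte values in a TrueType file are big-endian, which is why every format string starts with ">".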
Trust Me, This Is Correct
- The raw data seems to just launch right into the 1st glyph without any nice header info as to what glyph(s) belongs to what character, or how many bytes define each glyph in advance.
- The data I gleaned for the first glyph (which I don't even know what it is) seemed out of whack, with a total height of slightly over the EM size and a total width of almost 3 times the EM size!
I was leery of those results, and decided to take another route. The "OS/2" table (that is literally its tag in the font file data) contains properties such as sTypoAscender, sTypoDescender, and sTypoLineGap. Even though OS/2 is used by Microsoft devices only, the values it contains should be platform-agnostic. However, when I compared my Arial font file to the documentation I had, something seemed fishy. Maybe its OS/2 table is older and doesn't contain as much information, but because these three fields are so far down the table, I didn't want to take any chances with having counted incorrectly or misread one of the data types. I soon abandoned this idea too.
Yet another idea was to go to the CMAP table, which contains the mappings of characters to glyph indexes. (I would have to sit and parse this table to figure out what the very first glyph is in GLYF, and there's no need for me to work backwards like that now.) This table contains at least one sub-table (Arial has, in fact, three sub-tables here), so there is quite a lot of header data you need to go through before you get to the good stuff. However, you still need to go through it carefully, otherwise you will end up reading meaningless data. For Microsoft devices, you should look for the sub-table with the Platform ID of 3 and the Platform Encoding ID of 1. After finding the byte offset to this sub-table (which is relative to the start of CMAP, not to the start of the file), I had to solve some equations in order to find what character (as defined by ASCII or compatible Unicode codes) mapped to which glyph.
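Picking out the right sub-table can be sketched like this, assuming you have already sliced the CMAP table's bytes out of the file. The CMAP header is a 16-bit version and a 16-bit sub-table count, followed by one 8-byte record per sub-table (platform ID, encoding ID, 32-bit offset). The function name is hypothetical.

```python
import struct


def find_cmap_subtable(cmap_bytes, platform_id=3, encoding_id=1):
    """Return the sub-table offset, relative to the start of CMAP, or None."""
    _version, num_subtables = struct.unpack_from(">HH", cmap_bytes, 0)
    for i in range(num_subtables):
        # Encoding records start right after the 4-byte header, 8 bytes apiece.
        pid, eid, offset = struct.unpack_from(">HHI", cmap_bytes, 4 + 8 * i)
        if (pid, eid) == (platform_id, encoding_id):
            return offset
    return None
```

Remember to add the CMAP table's own file offset to the result before seeking in the font file.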
I'm not going to go into the math here since it's described in the documentation, but I found out that in Arial, most printable characters we normally care about (specifically, those with ASCII codes between 0x20 and 0x7E) all exist sequentially and contiguously with glyph IDs ranging from 3 to 0x61. The letters I cared about testing, the extreme-width cases of "W" and "i", happen to have glyph indices of 0x3A and 0x4C respectively, according to the algorithm.
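Because that run is contiguous, the lookup for printable ASCII (0x20 through 0x7E, which is the span that lines up with glyph IDs 3 to 0x61) collapses to a single subtraction. This is only a sketch of the shortcut for my copy of Arial, with a made-up function name; other fonts will need the full CMAP algorithm.

```python
def arial_glyph_index(char):
    """Map a printable ASCII character to its glyph index in this Arial file."""
    code = ord(char)
    if not 0x20 <= code <= 0x7E:
        raise ValueError("only printable ASCII is covered by this shortcut")
    # Space (0x20) is glyph 3, and every later character follows in order.
    return code - 0x20 + 3


print(hex(arial_glyph_index("W")), hex(arial_glyph_index("i")))
```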
With this information, it's time to scour the HMTX table for horizontal metrics. The first thing in this table is an array of values pertaining to the advance width and left side bearing of each glyph. These values take two bytes apiece, so each glyph's entry is four bytes long, and from the beginning of the HMTX table, the offset to the glyph you care about is (glyph index * 4). With the table at offset 0x268, the path to the letter W leads me down (0x3A * 4 = 0xE8) more bytes, to a total offset of 0x350. Here, I quickly learn the advance width for the letter W is:
1933
That's exactly what the Python program said with ttfquery & fonttools!
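The same lookup can be sketched in a few lines of Python with the struct module. The function name is hypothetical, and the (glyph index * 4) shortcut only holds for glyphs that have a full entry, meaning a glyph index below the numberOfHMetrics count stored in the HHEA table.

```python
import struct


def read_hmtx_entry(font_bytes, hmtx_offset, glyph_index):
    """Return (advance_width, left_side_bearing) in font units."""
    # Each full entry is an unsigned 16-bit advance width followed by a
    # signed 16-bit left side bearing, both big-endian.
    entry_offset = hmtx_offset + glyph_index * 4
    return struct.unpack_from(">Hh", font_bytes, entry_offset)


# Usage, with the offsets found above for Arial's "W":
# with open("C:\\Windows\\Fonts\\arial.ttf", "rb") as f:
#     advance, lsb = read_hmtx_entry(f.read(), 0x268, 0x3A)
```

The 0x268 table offset is specific to my copy of Arial; look it up in the table directory for any other file.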
By this time, I had (only just, by sheer coincidence, auspicious timing, serendipity, or whatever you want to call such good fortune) discovered that Microsoft scales its PPI to 96 rather than the 72 I had originally expected. After trying (and failing) to see if there was a particular DPI used with image objects generated by PIL, I simply stuck (96.0/72.0) into the equation and confirmed visually that the values seen here in the HMTX table are in fact the values you can use to calculate the width of text set in a TrueType font on a Microsoft Windows system.
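Putting the pieces together, here is a rough sketch of sizing the image for a whole message: add up the advance widths, scale by point size and PPI, and round up. It ignores kerning, the function name and the dictionary of widths are made up for illustration, and the 2048 units per EM is assumed rather than read from the file.

```python
import math


def message_width_pixels(text, advance_widths, units_per_em, point_size, ppi):
    """Estimate the pixel width needed to hold `text`, ignoring kerning."""
    # Total advance of every character, in font units.
    total_units = sum(advance_widths[ch] for ch in text)
    # Scale to pixels and round up so the final column is not chopped.
    return math.ceil(total_units * point_size * ppi / (72.0 * units_per_em))


# 1933 is the advance width of "W" found in the HMTX table above.
print(message_width_pixels("WW", {"W": 1933}, 2048, 12, 96))
```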
It remains to be seen how this'll perform on Macs. I anticipate the PPI will need to be something different; perhaps it will in fact be 72 on that platform. We'll see...
An Aside
The Evenflo logo from when I was little
Oh, how titillating.