tag:blogger.com,1999:blog-1986583754516060222024-02-18T23:27:03.616-08:00GrünfieldLigatures and umlautsIan Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.comBlogger3125tag:blogger.com,1999:blog-198658375451606022.post-34676804541864957982015-05-30T06:25:00.003-07:002015-05-30T06:25:37.901-07:00UTF-8/UTF-32 transcoding - Part 1Well, just short of two years from my previous post, so I thought I'd resurrect this blog.<br />
<br />
Back then, I was working on a hobby project for text processing with particular reference to UTF- encoding and decoding. I wrote a lot of exploratory code (in a C++ library called "tea") that was left hanging when I moved on to other things. I've decided to re-visit that library and publish what I discovered whilst working on it.<br />
<br />
Firstly, let's define some useful types. Note that all the code below is pseudo-code; I'll publish the full library source code when it's in a suitable state:<br />
<br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">namespace tea {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> namespace types {</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><span style="color: blue;"> // Unsigned integers</span></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef uint8_t u8;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef uint16_t u16;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef uint32_t u32;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef uint64_t u64;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Signed integers</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int8_t s8;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int16_t s16;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int32_t s32;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int64_t s64;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Signed characters</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef signed char c8;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int16_t c16;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef int32_t c32;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // IEEE floating-point</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef float f32;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef double f64;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Traits</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<typename T></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct Traits {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> inline static T minimum(void) {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return std::numeric_limits<T>::lowest();</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> inline static T maximum(void) {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return std::numeric_limits<T>::max();</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> inline static T extreme(void) {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return std::is_signed<T>::value ?</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> std::numeric_limits<T>::lowest() :</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> std::numeric_limits<T>::max();</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Bit-sizes</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<size_t BITS></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct BitSize;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct BitSize<8> {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef s8 Signed;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef u8 Unsigned;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct BitSize<16> {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef s16 Signed;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef u16 Unsigned;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct BitSize<32> {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef s32 Signed;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef u32 Unsigned;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> template<></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> struct BitSize<64> {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef s64 Signed;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> typedef u64 Unsigned;</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><span style="color: blue;"> }</span></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> using namespace types;</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="color: blue; font-size: x-small;">}</span></span><br />
<div>
<br /></div>
Next, let's be overly-optimistic and assume that our UTF-8 and UTF-32 representations are always valid. That is:<br />
<br />
<ul>
<li>UTF-8 codepoints are not truncated mid-sequence.</li>
<li>All codepoints are valid Unicode codepoints.</li>
<li>When writing UTF-8, we never over-run the buffer.</li>
</ul>
<br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">namespace tea {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> namespace utf8 {</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><span style="color: blue;"> const c32 codePointMinimum = 0x000000;</span></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> const c32 codePointMaximum = 0x10FFFF;</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><span style="color: blue;"> enum Status_UTF8 {</span></span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_OK = 0,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Overlong = -1,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Beyond = -2,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Extraneous = -3,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Continuation = -4,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Truncated = -5,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_Invalid = -6,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_ThresholdStrict = Status_UTF8_OK,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_ThresholdNormal = Status_UTF8_Overlong,</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> Status_UTF8_ThresholdLax = Status_UTF8_Invalid</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Positive for code point length, negative for error</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // (Continuation/Invalid)</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> extern const s8 leadByteTable[256];</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Positive for code point length, zero for error</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> extern const u8 msbEncodedSize[33];</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // The UTF-8 sequence must be valid</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> namespace optimistic {</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> size_t measureCodePoint(c32 codepoint);</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> void encodeCodePointMeasured(u8* ptr, c32 codepoint, size_t bytes);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> u8* encodeCodePoint(u8* ptr, c32 codepoint);</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> c32 decodeCodePoint(const u8*& ptr, const u8* end);</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> size_t countCodePoints(const u8* begin, const u8* end);</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">}</span><br />
<div>
<br /></div>
<div>
The "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">leadByteTable</span>" array is the classic length-of-UTF-8-sequence-based-on-the-first-byte:</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">#define SIXTEEN(x) x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">const tea::s8 tea::utf8::leadByteTable[256] = {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x00-0x0F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x10-0x1F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x20-0x2F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x30-0x3F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x40-0x4F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x50-0x5F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x60-0x6F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x70-0x7F */ SIXTEEN(1),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x80-0x8F */ SIXTEEN(tea::utf8::Status_UTF8_Continuation),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0x90-0x9F */ SIXTEEN(tea::utf8::Status_UTF8_Continuation),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xA0-0xAF */ SIXTEEN(tea::utf8::Status_UTF8_Continuation),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xB0-0xBF */ SIXTEEN(tea::utf8::Status_UTF8_Continuation),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xC0-0xCF */ SIXTEEN(2),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xD0-0xDF */ SIXTEEN(2),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xE0-0xEF */ SIXTEEN(3),</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> /* 0xF0-0xFF */ 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> tea::utf8::Status_UTF8_Invalid,</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> tea::utf8::Status_UTF8_Invalid</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">};</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">#undef SIXTEEN</span></div>
</div>
<div>
<br /></div>
<div>
Note that negative numbers indicate the type of error.</div>
<div>
<br /></div>
<div>
<div>
This gives us a trivial implementation of "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">countCodePoints</span><span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">()</span>" for UTF-8 strings assuming that unrecognised UTF-8 sequences are mapped to a single, replacement codepoint:</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">size_t countCodePoints(const u8* begin, const u8* end) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(begin != nullptr);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(end >= begin);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> size_t count = 0;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> while (begin < end) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> count++;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> s8 bytes = leadByteTable[*begin];</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> begin += (bytes <= 0) ? 1 : bytes;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return count;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
The implementation of "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">decodeCodePoint</span><span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">()</span>" to decode a single UTF-8 codepoint to UTF-32 is also relatively simple:</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">c32 decodeCodePoint(const u8*& ptr, const u8* end) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> static const s32 bias[] = {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 0, 0, -12416, -925824, -63447168, 100130688, 2113396608</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(ptr != nullptr);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(ptr < end);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> u8 u1 = *(ptr++);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> s8 expected = leadByteTable[u1];</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert((expected > 0) && (expected < TEA_NELEMS(bias)));</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> s32 codepoint = u1;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> while (--expected > 0) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> // Extract any continuation bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(ptr < end);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> u8 u2 = *(ptr++);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(leadByteTable[u2] == Status_UTF8_Continuation);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> codepoint = (codepoint << 6) + u2;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> codepoint += bias[leadByteTable[u1]];</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(codepoint >= 0);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return codepoint;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
</div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
Working out how many UTF-8 bytes are required to store a UTF-32 character is performed by "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">measureCodePoint()</span>":</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">size_t measureCodePoint(c32 codepoint) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> int msb = tea::bits::scanMostSetBit(u32(codepoint)) + 1;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert((msb >= 0) && (msb < TEA_NELEMS(msbEncodedSize)));</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> return msbEncodedSize[msb];</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
<div>
<br /></div>
<div>
<div>
This requires a lookup table "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">msbEncodedSize</span>" and a fast way of detecting the most-significant set bit of a 32-bit value:</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">const u8 msbEncodedSize[33] = {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 1,1,1,1,1,1,1,1, // 0000000000000000000000000XXXXXX1: 0-7 : One byte</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 2,2,2,2, // 000000000000000000000XXX1xxxxxxx: 8-11: Two bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 3,3,3,3,3, // 0000000000000000XXXX1xxxxxxxxxxx: 12-16: Three bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 4,4,4,4,4, // 00000000000XXXX1xxxxxxxxxxxxxxxx: 17-21: Four bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 5,5,5,5,5, // 000000XXXX1xxxxxxxxxxxxxxxxxxxxx: 22-26: Five bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 6,6,6,6,6, // 0XXXX1xxxxxxxxxxxxxxxxxxxxxxxxxx: 27-31: Six bytes</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 0 // 1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: 32 : Invalid</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">};</span></div>
<div>
<br /></div>
<div>
Obviously, the implementation for "<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">scanMostSetBit</span>" is platform-dependent. So for Microsoft compilers it is:</div>
<div>
<br /></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><div>
int tea::bits::scanMostSetBit(u32 value) {</div>
<div>
// Returns the index (31..0) of the most-significant non-zero bit</div>
<div>
// or -1 if input is zero</div>
<div>
unsigned long bit;</div>
<div>
return _BitScanReverse(&bit, value) ? int(bit) : -1;</div>
<div>
}</div>
<div>
<br /></div>
</span></div>
<div>
Finally, encoding a UTF-32 codepoint to a UTF-8 sequency is a case of measuring the size and storing the bytes:</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">void encodeCodePointMeasured(u8* ptr, c32 codepoint, size_t bytes) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> static const u64 bias[7] = {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> 0, 0, 12288, 917504, 62914560, 4160749568, 270582939648</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> };</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(codepoint >= 0);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(ptr != nullptr);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert((bytes > 0) && (bytes < TEA_NELEMS(bias)));;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> u8* p = ptr + bytes;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> u64 bits = bias[bytes] + codepoint;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> while (bits > 0xFF) {</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> *(--p) = u8((bits & 0x3F) + 0x80);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> bits >>= 6;</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> *(--p) = u8(bits);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"> assert(p == ptr);</span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><div>
u8* encodeCodePoint(u8* ptr, c32 codepoint) {</div>
<div>
assert(codepoint >= 0);</div>
<div>
assert(ptr != nullptr);</div>
<div>
size_t bytes = measureCodePoint(codepoint);</div>
<div>
assert(bytes > 0);</div>
<div>
encodeCodePointMeasured(ptr, codepoint, bytes);</div>
<div>
return ptr + bytes;</div>
<div>
}</div>
<div>
<br /></div>
</span></div>
<div>
Obviously, our assumptions about valid UTF-8 sequences and the lack of checks for buffer over-runs make these "optimistic" routines unsuitable for production code, <i>unless</i> we've pre-validated the sequences.</div>
<div>
<br /></div>
<div>
Next time, we'll discuss routines that handle those error conditions gracefully.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<span style="color: blue; font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-198658375451606022.post-32904951225910150322013-06-06T10:23:00.003-07:002013-06-06T10:23:51.843-07:00Cut-down XML EBNFSometimes, the full XML specification is just overkill. If all you want to do is extract plain data from an XML file, using a conformant parser sometimes gets in the way.<br />
<br />
I've been looking at writing a minimal parser for text processing and will no doubt write about my experiences here. In the mean-time, I've prepared a cut-down version of the formal EBNF <a href="http://www.chilliant.com/xml_ebnf.html">here</a>.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA7IIKHO93nz6LF-7-EQ0NpBMO91uMhQI9-r7WaQ9vuMSuXxqKaBrOhUgz9HfSOwgBaw3lY60VY-4pTK3AZ6pNXbZGd-Va6QrZB7qNzqlUuhDHVkut3ZuBVirix0bp35711rFZgrMz-J0/s1600/comment.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA7IIKHO93nz6LF-7-EQ0NpBMO91uMhQI9-r7WaQ9vuMSuXxqKaBrOhUgz9HfSOwgBaw3lY60VY-4pTK3AZ6pNXbZGd-Va6QrZB7qNzqlUuhDHVkut3ZuBVirix0bp35711rFZgrMz-J0/s1600/comment.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>comment</strong></span></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckZBUC4yOQW_AerMszEUTf6tsKpNJOe7HQ2lmS0xjx2Z_xJtL6F5Z8uXiiVZQRczbvA9CIwC22eMbaf3pByKiiQnNeqEtPo68iDC8znPE2L-yge3oZBBdvOynkSTd9JU4RICqvaK71s8/s1600/processing_instruction.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckZBUC4yOQW_AerMszEUTf6tsKpNJOe7HQ2lmS0xjx2Z_xJtL6F5Z8uXiiVZQRczbvA9CIwC22eMbaf3pByKiiQnNeqEtPo68iDC8znPE2L-yge3oZBBdvOynkSTd9JU4RICqvaK71s8/s640/processing_instruction.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>processing_instruction</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhfVHOAh_UTEfr8Kk_3EhMhnK4DHjN8VHQ-Y1EsPqL8c6Otx0a840QKZ7C7gAIgh0Fx68Rk9OAq5d7-F-DUzU160zQiduv6jsICz4AB69NkpmG5TLKy1LZsoiAEovFq9Jzt1DNjHVVJu4/s1600/character_entity.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhfVHOAh_UTEfr8Kk_3EhMhnK4DHjN8VHQ-Y1EsPqL8c6Otx0a840QKZ7C7gAIgh0Fx68Rk9OAq5d7-F-DUzU160zQiduv6jsICz4AB69NkpmG5TLKy1LZsoiAEovFq9Jzt1DNjHVVJu4/s1600/character_entity.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>character_entity</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWcWSKPmdROUqzd2DP21cLKYvhXI8HRJw6-D_C4Hpjsy9wk507NeRFKM_Ui2K7m45eZibXXpKv8FQ9TsSAEJEOKBVOksHbtLttxhmg2VevwmPPfrXyyLWt7kc14Msvm4s46muoWaUqzFY/s1600/text.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWcWSKPmdROUqzd2DP21cLKYvhXI8HRJw6-D_C4Hpjsy9wk507NeRFKM_Ui2K7m45eZibXXpKv8FQ9TsSAEJEOKBVOksHbtLttxhmg2VevwmPPfrXyyLWt7kc14Msvm4s46muoWaUqzFY/s1600/text.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>text</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm9bOrZIpJheBi8MRaF7ACx3iPabpWNYyMcVUohZbGNka406W_swDaN0vuIgrQKqE83G4iSL89KNLFDXNJwfqZKnt1x46a4ppld-1tZX-osyL-sPO864al9X6eVCjaTqMgOEaTNy9zTQU/s1600/attribute.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm9bOrZIpJheBi8MRaF7ACx3iPabpWNYyMcVUohZbGNka406W_swDaN0vuIgrQKqE83G4iSL89KNLFDXNJwfqZKnt1x46a4ppld-1tZX-osyL-sPO864al9X6eVCjaTqMgOEaTNy9zTQU/s640/attribute.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>attribute</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2A0VECtjSNmRhZwEG_9YFVsITBJgA2UvaLg1sbt_iza861te0dZwNTmNvorZXfOZq0bnJ3L7zBAzjKm8Tj3X3uoPSW5rE7vnK0NPsJaSNBfhNDeIxRVTVfMJqWY1QhFmP1yEElbWpbhY/s1600/element.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2A0VECtjSNmRhZwEG_9YFVsITBJgA2UvaLg1sbt_iza861te0dZwNTmNvorZXfOZq0bnJ3L7zBAzjKm8Tj3X3uoPSW5rE7vnK0NPsJaSNBfhNDeIxRVTVfMJqWY1QhFmP1yEElbWpbhY/s640/element.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>element</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh_etAMazAPWjY4z_9hx5idpDu-iccOJ60DSmHJgSxV9rjI315AKV2w23mvtkCjtoRr2Gezu_-UCNF1rKg5jMK5oa_0OphVy7muG6oG-Wn6LLHvFKU_PnyQH7jyxA-dWg2w4kQin4Xfpc/s1600/miscellaneous.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh_etAMazAPWjY4z_9hx5idpDu-iccOJ60DSmHJgSxV9rjI315AKV2w23mvtkCjtoRr2Gezu_-UCNF1rKg5jMK5oa_0OphVy7muG6oG-Wn6LLHvFKU_PnyQH7jyxA-dWg2w4kQin4Xfpc/s1600/miscellaneous.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>miscellaneous</strong></span></td></tr>
</tbody></table>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLFifrgEAGLZ-WPfeI5aP6WwyiduxJrktYXQ5r2vDLiJGBgYpqsPeNPWHibRydtODEtV9nxVIjX-enpeCGsYDPSyrDzS9FLDeKl3FefaMqGioX4wVGnxClMYfy531G6l-spq1fHf5pWiM/s1600/doctype_declaration.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLFifrgEAGLZ-WPfeI5aP6WwyiduxJrktYXQ5r2vDLiJGBgYpqsPeNPWHibRydtODEtV9nxVIjX-enpeCGsYDPSyrDzS9FLDeKl3FefaMqGioX4wVGnxClMYfy531G6l-spq1fHf5pWiM/s640/doctype_declaration.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>doctype_declaration</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw1qnPvEGSrQ1MYj4-d6gS1ecmA99AtMMlBYRyCNS6N4Rvd4MTIcdhlZzINYFiKv6K8_Y-sV8QkQpVAYGsl9xpPDNTz2iEAeDqDOW0H5GI7iBTONjXDc_f33d8S7zARZtLPG-Dz6YPfaI/s1600/doctype_parameter.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw1qnPvEGSrQ1MYj4-d6gS1ecmA99AtMMlBYRyCNS6N4Rvd4MTIcdhlZzINYFiKv6K8_Y-sV8QkQpVAYGsl9xpPDNTz2iEAeDqDOW0H5GI7iBTONjXDc_f33d8S7zARZtLPG-Dz6YPfaI/s1600/doctype_parameter.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>doctype_parameter</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj18qMgjRyH6LiZXa1hQwWSvCjbzrv6o4dnvqo62tzzvkZRZvUEVlz15uNi7a3rMmxSuHIUjtrOQAtDmDQelop56rp7jZMvMCJn5EsqoQGBNWX7L3-7ql2pUpQaHsRiSadKsm9bDmh4WHA/s1600/doctype.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="44" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj18qMgjRyH6LiZXa1hQwWSvCjbzrv6o4dnvqo62tzzvkZRZvUEVlz15uNi7a3rMmxSuHIUjtrOQAtDmDQelop56rp7jZMvMCJn5EsqoQGBNWX7L3-7ql2pUpQaHsRiSadKsm9bDmh4WHA/s640/doctype.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>doctype</strong></span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgErktHAHl-kMXVvEnER6W6JCCwgrqaTH-MuiLrFLoCQf7xyDQH55bSvS8iJxYYDxiMde0WJN6HWoSOLqDMmSxNsC1XnOSWWpCawsY_lCyrHgPoeTSt8kgxoh4g9MR9JV_Dp-hoiMA0rIk/s1600/document.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="49" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgErktHAHl-kMXVvEnER6W6JCCwgrqaTH-MuiLrFLoCQf7xyDQH55bSvS8iJxYYDxiMde0WJN6HWoSOLqDMmSxNsC1XnOSWWpCawsY_lCyrHgPoeTSt8kgxoh4g9MR9JV_Dp-hoiMA0rIk/s640/document.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;"><strong>document</strong></span></td></tr>
</tbody></table>
Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-198658375451606022.post-34936759069036799212013-04-12T05:16:00.000-07:002013-04-12T05:16:03.693-07:00Another day, another blogLigatures and umlauts! That's all I've got to say on the matter.Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0