The utf8.h File Reference

Various UTF8 related helper functions.

Included Headers

#include <cstdint>
#include <string>

Functions Index

std::string convertUTF8ToLower (const std::string &input)

Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.

std::string convertUTF8ToUpper (const std::string &input)

Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.

std::string getUTF8CharAt (const std::string &input, size_t pos)

Returns the UTF8 character found at byte position pos in the input string.

uint32_t getUnicodeForUTF8CharAt (const std::string &input, size_t pos)

Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.

uint8_t getUTF8CharNumBytes (char firstByte)

Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.

const char * writeUTF8Char (TextStream &t, const char *s)

Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.

bool lastUTF8CharIsMultibyte (const std::string &input)

Returns true iff the last character in input is a multibyte character.

bool isUTF8CharUpperCase (const std::string &input, size_t pos)

Returns true iff the input string at byte position pos holds an upper case character.

int isUTF8NonBreakableSpace (const char *input)

Check if the first character pointed at by input is a non-breakable whitespace character.

bool isUTF8PunctuationCharacter (uint32_t unicode)

Check if the given Unicode character represents a punctuation character.

Description

Various UTF8 related helper functions.

See https://en.wikipedia.org/wiki/UTF-8 for details on UTF8 encoding.

Functions

convertUTF8ToLower()

std::string convertUTF8ToLower (const std::string & input)

Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.

Declaration at line 34 of file utf8.h, definition at line 187 of file utf8.cpp.

187 std::string convertUTF8ToLower(const std::string &input)
188 {
190 }

References asciiToLower, caseConvert and convertUnicodeToLower.

Referenced by SearchIndexInfo::add, Index::addClassMemberNameToIndex, Index::addFileMemberNameToIndex, Index::addModuleMemberNameToIndex, Index::addNamespaceMemberNameToIndex, AnchorGenerator::generate, QCString::lower, FileNameFn::searchKey, SearchTerm::termEncoded and HtmlGenerator::writeLabel.
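The body of convertUTF8ToLower is not reproduced above, so the following is only a rough sketch of the general idea: an ASCII fast path plus a mapping for some non-ASCII characters. The function name toLowerLatin1Sketch is hypothetical, and the non-ASCII part covers just the Latin-1 capitals U+00C0 to U+00DE, whose UTF8 encoding is the byte 0xC3 followed by a byte in 0x80 to 0x9E. It is not Doxygen's caseConvert or convertUnicodeToLower.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch, not Doxygen's implementation: lower-cases ASCII
// letters and the Latin-1 capitals U+00C0..U+00DE, copies everything else.
std::string toLowerLatin1Sketch(const std::string &in)
{
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c >= 'A' && c <= 'Z')
    {
      out += static_cast<char>(c + 0x20);      // ASCII fast path
    }
    else if (c == 0xC3 && i + 1 < in.size())
    {
      unsigned char d = static_cast<unsigned char>(in[i + 1]);
      // U+00C0..U+00DE map to U+00E0..U+00FE, except U+00D7
      // (multiplication sign, second byte 0x97), which has no case.
      if (d >= 0x80 && d <= 0x9E && d != 0x97) d += 0x20;
      out += static_cast<char>(c);
      out += static_cast<char>(d);
      ++i;                                      // second byte consumed
    }
    else
    {
      out += static_cast<char>(c);              // all other bytes unchanged
    }
  }
  return out;
}
```

A full implementation needs a complete Unicode case mapping table, because a lower case variant can have a different encoded length than its upper case form.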

convertUTF8ToUpper()

std::string convertUTF8ToUpper (const std::string & input)

Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.

Declaration at line 39 of file utf8.h, definition at line 192 of file utf8.cpp.

192 std::string convertUTF8ToUpper(const std::string &input)
193 {
195 }

References asciiToUpper, caseConvert and convertUnicodeToUpper.

Referenced by Translator::createNoun, QCString::upper and writeAlphabeticalClassList.

getUnicodeForUTF8CharAt()

uint32_t getUnicodeForUTF8CharAt (const std::string & input, size_t pos)

Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.

Declaration at line 49 of file utf8.h, definition at line 135 of file utf8.cpp.

135 uint32_t getUnicodeForUTF8CharAt(const std::string &input,size_t pos)
136 {
137   std::string charS = getUTF8CharAt(input,pos);
138   int len=0;
139   return convertUTF8CharToUnicode(charS.c_str(),charS.length(),len);
140 }

References convertUTF8CharToUnicode and getUTF8CharAt.

Referenced by AnchorGenerator::generate.
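The listing delegates the actual decoding to convertUTF8CharToUnicode, which is not shown on this page. As an illustration of what turning a UTF8 sequence into a code point involves, here is a minimal sketch. The name decodeCodePointAt and the convention of returning 0 on error are assumptions made for this example, and the sketch handles 1 to 4 byte sequences only, without rejecting overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical decoder, not Doxygen's convertUTF8CharToUnicode.
// Returns 0 if pos is out of range or the sequence is truncated or malformed.
std::uint32_t decodeCodePointAt(const std::string &s, std::size_t pos)
{
  if (pos >= s.size()) return 0;
  unsigned char b0 = static_cast<unsigned char>(s[pos]);
  if ((b0 & 0x80u) == 0x00u) return b0;          // 0xxx.xxxx: ASCII

  std::size_t len = 0;
  std::uint32_t cp = 0;
  if      ((b0 & 0xE0u) == 0xC0u) { len = 2; cp = b0 & 0x1Fu; }
  else if ((b0 & 0xF0u) == 0xE0u) { len = 3; cp = b0 & 0x0Fu; }
  else if ((b0 & 0xF8u) == 0xF0u) { len = 4; cp = b0 & 0x07u; }
  else return 0;                                 // stray continuation or invalid lead byte

  if (pos + len > s.size()) return 0;            // truncated sequence
  for (std::size_t i = 1; i < len; ++i)
  {
    unsigned char b = static_cast<unsigned char>(s[pos + i]);
    if ((b & 0xC0u) != 0x80u) return 0;          // not a 10xx.xxxx continuation byte
    cp = (cp << 6) | (b & 0x3Fu);                // append the 6 payload bits
  }
  return cp;
}
```

Each continuation byte contributes six payload bits, so the two bytes 0xC3 0xA9 decode to U+00E9 and the three bytes 0xE2 0x82 0xAC decode to U+20AC.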

getUTF8CharAt()

std::string getUTF8CharAt (const std::string & input, size_t pos)

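Returns the UTF8 character found at byte position pos in the input string.

The resulting string can be a multibyte sequence. An empty string is returned if pos lies at or past the end of input, or if the sequence starting at pos is cut off by the end of input.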

Declaration at line 44 of file utf8.h, definition at line 127 of file utf8.cpp.

127 std::string getUTF8CharAt(const std::string &input,size_t pos)
128 {
129   if (input.length()<=pos) return std::string();
130   int numBytes=getUTF8CharNumBytes(input[pos]);
131   if (input.length()<pos+numBytes) return std::string();
132   return input.substr(pos,numBytes);
133 }

References getUTF8CharNumBytes.

Referenced by SearchIndexInfo::add, Index::addClassMemberNameToIndex, Index::addFileMemberNameToIndex, Index::addModuleMemberNameToIndex, Index::addNamespaceMemberNameToIndex, Translator::createNoun, AnchorGenerator::generate, getUnicodeForUTF8CharAt and writeAlphabeticalClassList.
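Because pos is a byte offset, a caller that wants to visit every character has to advance by the size of each returned string rather than by one. The sketch below shows that pattern with self-contained stand-ins. seqLength, charAt and splitUTF8 are hypothetical names, and seqLength only recognises 1 to 4 byte sequences.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sequence length derived from the lead byte.
std::size_t seqLength(unsigned char uc)
{
  if ((uc & 0xE0u) == 0xC0u) return 2;
  if ((uc & 0xF0u) == 0xE0u) return 3;
  if ((uc & 0xF8u) == 0xF0u) return 4;
  return 1;
}

// Stand-in with the same two guards as the listing: an out-of-range pos
// and a truncated sequence both yield an empty string.
std::string charAt(const std::string &input, std::size_t pos)
{
  if (input.length() <= pos) return std::string();
  std::size_t n = seqLength(static_cast<unsigned char>(input[pos]));
  if (input.length() < pos + n) return std::string();
  return input.substr(pos, n);
}

// Splits a string into its UTF8 characters, one std::string per character.
std::vector<std::string> splitUTF8(const std::string &input)
{
  std::vector<std::string> out;
  std::size_t pos = 0;
  while (pos < input.size())
  {
    std::string ch = charAt(input, pos);
    if (ch.empty()) break;   // truncated trailing sequence: stop
    out.push_back(ch);
    pos += ch.size();        // advance by bytes, not by one
  }
  return out;
}
```

The empty-string check matters: without it, a truncated trailing sequence would leave pos unchanged and the loop would never terminate.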

getUTF8CharNumBytes()

uint8_t getUTF8CharNumBytes (char firstByte)

Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.

Declaration at line 54 of file utf8.h, definition at line 23 of file utf8.cpp.

23 uint8_t getUTF8CharNumBytes(char c)
24 {
25   uint8_t num=1;
26   unsigned char uc = static_cast<unsigned char>(c);
27   if (uc>=0x80u) // multibyte character
28   {
29     if ((uc&0xE0u)==0xC0u)
30     {
31       num=2; // 110x.xxxx: 2 byte character
32     }
33     if ((uc&0xF0u)==0xE0u)
34     {
35       num=3; // 1110.xxxx: 3 byte character
36     }
37     if ((uc&0xF8u)==0xF0u)
38     {
39       num=4; // 1111.0xxx: 4 byte character
40     }
41     if ((uc&0xFCu)==0xF8u)
42     {
43       num=5; // 1111.10xx: 5 byte character
44     }
45     if ((uc&0xFEu)==0xFCu)
46     {
47       num=6; // 1111.110x: 6 byte character
48     }
49   }
50   return num;
51 }

Referenced by detab, escapeCharsInString, AnchorGenerator::generate, getUTF8CharAt, nextUTF8CharPosition, updateColumnCount and writeUTF8Char.
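Two properties of the listing are easy to miss. A byte that is not a valid lead byte, such as a continuation byte of the form 10xx.xxxx, yields 1, and the 5 and 6 byte forms come from the original UTF-8 definition, whereas RFC 3629 limits sequences to 4 bytes. The sketch below mirrors the same bit tests in a self-contained function and uses it to count characters. leadByteLength and countUTF8Chars are hypothetical names for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in that mirrors the bit tests in the listing above.
std::uint8_t leadByteLength(char c)
{
  unsigned char uc = static_cast<unsigned char>(c);
  if (uc < 0x80u)            return 1; // 0xxx.xxxx: ASCII
  if ((uc & 0xE0u) == 0xC0u) return 2; // 110x.xxxx
  if ((uc & 0xF0u) == 0xE0u) return 3; // 1110.xxxx
  if ((uc & 0xF8u) == 0xF0u) return 4; // 1111.0xxx
  if ((uc & 0xFCu) == 0xF8u) return 5; // 1111.10xx (legacy form)
  if ((uc & 0xFEu) == 0xFCu) return 6; // 1111.110x (legacy form)
  return 1;                            // continuation byte, 0xFE or 0xFF
}

// Counts characters by hopping from lead byte to lead byte.
std::size_t countUTF8Chars(const std::string &s)
{
  std::size_t count = 0;
  for (std::size_t pos = 0; pos < s.size(); pos += leadByteLength(s[pos]))
  {
    ++count;
  }
  return count;
}
```

The cast to unsigned char is needed because plain char may be signed, in which case bytes at or above 0x80 would compare as negative values.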

isUTF8CharUpperCase()

bool isUTF8CharUpperCase (const std::string & input, size_t pos)

Returns true iff the input string at byte position pos holds an upper case character.

Declaration at line 65 of file utf8.h, definition at line 218 of file utf8.cpp.

218 bool isUTF8CharUpperCase(const std::string &input,size_t pos)
219 {
220   if (input.length()<=pos) return false;
221   int len=0;
222   // turn the UTF8 character at position pos into a unicode value
223   uint32_t code = convertUTF8CharToUnicode(input.c_str()+pos,input.length()-pos,len);
224   // check if the character can be converted to lower case, if so it was an upper case character
225   return convertUnicodeToLower(code)!=nullptr;
226 }

References convertUnicodeToLower and convertUTF8CharToUnicode.

Referenced by DefinitionImpl::_setBriefDescription.

isUTF8NonBreakableSpace()

int isUTF8NonBreakableSpace (const char * input)

Check if the first character pointed at by input is a non-breakable whitespace character.

Returns the byte size of the character if there is a match, or 0 if not.

Declaration at line 70 of file utf8.h, definition at line 228 of file utf8.cpp.

228 int isUTF8NonBreakableSpace(const char *input)
229 {
230   return (static_cast<unsigned char>(input[0])==0xC2 &&
231           static_cast<unsigned char>(input[1])==0xA0) ? 2 : 0;
232 }

Referenced by detab.
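The byte pair 0xC2 0xA0 is the UTF8 encoding of U+00A0 (no-break space), and returning the matched length lets a caller skip the whole character in one step. The sketch below shows that calling pattern by replacing each non-breakable space with a regular space. nbspLength and replaceNbspWithSpace are hypothetical names, and the input is assumed to be NUL-terminated with no embedded NUL bytes.

```cpp
#include <string>

// Hypothetical stand-in with the same contract as the listing:
// 2 on a match, 0 otherwise. Reading p[1] is safe for a NUL-terminated
// string because p[1] is only evaluated when p[0] is 0xC2.
int nbspLength(const char *p)
{
  return (static_cast<unsigned char>(p[0]) == 0xC2 &&
          static_cast<unsigned char>(p[1]) == 0xA0) ? 2 : 0;
}

// Replaces every non-breakable space with a plain ASCII space.
std::string replaceNbspWithSpace(const std::string &input)
{
  std::string out;
  const char *p = input.c_str();
  while (*p)
  {
    int n = nbspLength(p);
    if (n > 0) { out += ' '; p += n; }  // skip both bytes of U+00A0
    else       { out += *p++; }         // copy any other byte as is
  }
  return out;
}
```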

isUTF8PunctuationCharacter()

bool isUTF8PunctuationCharacter (uint32_t unicode)

Check if the given Unicode character represents a punctuation character.

Declaration at line 73 of file utf8.h, definition at line 234 of file utf8.cpp.

234 bool isUTF8PunctuationCharacter(uint32_t unicode)
235 {
236   bool b = isPunctuationCharacter(unicode);
237   return b;
238 }

References isPunctuationCharacter.

Referenced by AnchorGenerator::generate.

lastUTF8CharIsMultibyte()

bool lastUTF8CharIsMultibyte (const std::string & input)

Returns true iff the last character in input is a multibyte character.

Declaration at line 62 of file utf8.h, definition at line 212 of file utf8.cpp.

212 bool lastUTF8CharIsMultibyte(const std::string &input)
213 {
214   // last byte is part of a multibyte UTF8 char if bit 8 is set and bit 7 is not
215   return !input.empty() && (static_cast<unsigned char>(input[input.length()-1])&0xC0)==0x80;
216 }

Referenced by DefinitionImpl::_setBriefDescription.
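The test relies on the fact that every byte after the first in a multibyte sequence has the form 10xx.xxxx, so masking with 0xC0 and comparing against 0x80 identifies a continuation byte. The sketch below repeats that check in a self-contained form and adds a companion that walks backwards to the start of the last character. endsWithContinuationByte and lastCharStart are hypothetical names for this example.

```cpp
#include <cstddef>
#include <string>

// Same mask test as the listing: true if the final byte is 10xx.xxxx.
bool endsWithContinuationByte(const std::string &input)
{
  return !input.empty() &&
         (static_cast<unsigned char>(input.back()) & 0xC0) == 0x80;
}

// Walks back over continuation bytes and returns the byte offset at
// which the last character starts (0 for an empty string).
std::size_t lastCharStart(const std::string &input)
{
  std::size_t pos = input.size();
  while (pos > 0)
  {
    --pos;
    if ((static_cast<unsigned char>(input[pos]) & 0xC0) != 0x80) break;
  }
  return pos;
}
```

Note that a string cut off directly after a lead byte, for example one ending in 0xC3, makes the check return false, because a lead byte is not a continuation byte.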

writeUTF8Char()

const char * writeUTF8Char (TextStream & t, const char * s)

Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.

Declaration at line 59 of file utf8.h, definition at line 197 of file utf8.cpp.

197 const char *writeUTF8Char(TextStream &t,const char *s)
198 {
199   if (s==nullptr) return nullptr;
200   uint8_t len = getUTF8CharNumBytes(*s);
201   for (uint8_t i=0;i<len;i++)
202   {
203     if (s[i]==0) // detect premature end of string (due to invalid UTF8 char)
204     {
205       len=i;
206     }
207   }
208   t.write(s,len);
209   return s+len;
210 }

References getUTF8CharNumBytes and TextStream::write.

Referenced by HtmlCodeGenerator::codify, ManCodeGenerator::codify, RTFCodeGenerator::codify, HtmlDocVisitor::operator(), HtmlDocVisitor::writeObfuscatedMailAddress and writeXMLCodeString.
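Returning the pointer to the next character lets a caller handle a few special ASCII characters itself and pass everything else through unchanged, one character at a time. The sketch below illustrates that loop. It uses std::ostream in place of Doxygen's TextStream, handles 1 to 4 byte sequences only, and writeOneChar and escapeLessThan are hypothetical names for this example.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for writeUTF8Char that targets std::ostream.
const char *writeOneChar(std::ostream &os, const char *s)
{
  if (s == nullptr) return nullptr;
  unsigned char uc = static_cast<unsigned char>(*s);
  std::size_t len = 1;
  if      ((uc & 0xE0u) == 0xC0u) len = 2;
  else if ((uc & 0xF0u) == 0xE0u) len = 3;
  else if ((uc & 0xF8u) == 0xF0u) len = 4;
  for (std::size_t i = 0; i < len; ++i)
  {
    if (s[i] == 0) { len = i; break; }  // premature end of string
  }
  os.write(s, static_cast<std::streamsize>(len));
  return s + len;
}

// Escapes '<' and copies every other character through unchanged.
std::string escapeLessThan(const std::string &text)
{
  std::ostringstream os;
  const char *p = text.c_str();
  while (*p)
  {
    if (*p == '<') { os << "&lt;"; ++p; }
    else           { p = writeOneChar(os, p); }
  }
  return os.str();
}
```

The premature-NUL check keeps the returned pointer from running past the terminator when the input ends in the middle of a sequence. The caller's loop condition then sees the NUL and stops.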


Generated via doxygen2docusaurus by Doxygen 1.14.0.