kibigo! :: Specifications

Kixt Charsets

Abstract

The following specification defines a mechanism for defining character sets (charsets) which may be used in a Kixt transmission or rendering system. Kixt charsets are, as their name implies, sets of Kixt characters, mapping each one to a codepoint and assigning it particular character properties. This document details the meaning of these properties within the Kixt Charset Model. Finally, this document introduces a plaintext document format, the Kixt Charset Definition, for describing such charsets.

1. Introduction

1.1 Purpose and Scope

The purpose of Kixt is to reduce the barriers to entry for developing unconventional, experimental encodings and text processing systems, particularly those of characters or scripts which are unlikely to gain mainstream acceptance. The Kixt Charset Model, defined in this document, is one piece in an interlocking set of specifications working towards this goal.

The Kixt Charset Model is not, and does not attempt to be, a replacement for Unicode. In fact, every assigned Kixt character is required to have a defined Unicode mapping. If you are building an application which needs to process characters in a wide variety of scripts, languages, and/or directionalities, Unicode is the correct solution for you.

1.2 Relationship to Other Specifications

This document is part of the Kixt family of specifications. It is also built upon the technologies of R·D·F and O·W·L. It makes minor use of the Ordered List Ontology for defining its ordered lists.

In this document, the following prefixes are used to represent the following strings:

Prefix Expansion
kixt: https://spec.go.kibi.family/ns/kixt/#
i18n: https://www.w3.org/ns/i18n#
olo: http://purl.org/ontology/olo/core#
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
xsd: http://www.w3.org/2001/XMLSchema#

2. Data Model

A Kixt character is an abstract representation of a unit of text, with associated properties and assignment to a single codepoint. In Kixt, all codepoints are integer values between 0 and 65535, inclusive.

A Kixt charset is a collection of Kixt characters.

Within a Kixt charset, characters may be collected into blocks. Blocks may have “gaps,” but must otherwise be contiguous across assigned characters; this is enforced by the Kixt Charset Definition syntax. Each character may belong to only one block.

Similarly, Kixt characters may be collected into scripts. Scripts are intended to allow for searching, accessing, and categorizing characters based on their histories or communities of usage. This specification makes no requirements on the structure of scripts, except to specify that each character may only belong to one.

The Kixt Charset Model is an R·D·F graph which associates charsets, characters, and their properties with one another according to the rules outlined in this document. The vocabulary for this model is the Kixt Ontology (O·W·L document), which normatively defines several classes and properties for use with the Model.

Although the Kixt Charset Model is defined using R·D·F and O·W·L, it is not expected that most or even many applications which make use of it will be fully‐fledged O·W·L reasoners. It is not presently recommended that applications interfacing with Kixt charsets load their information from anything other than a Kixt Charset Definition, whose resultant R·D·F graph is well‐defined by this specification and may be abstracted as required. (However, its basis in R·D·F and O·W·L means that extensions to the Kixt Charset Model, as well as alternate means of loading and processing Kixt charsets, may conceivably be designed in the future.)

2.1 Types of Resource

There are two major types of resource in the Kixt Charset Model: charsets and characters. These are represented by the classes kixt:Charset and kixt:Character. In addition to these, kixt:Block and kixt:Script represent blocks and scripts of characters, respectively. Only kixt:Charset need explicitly be declared as the rdf:type of a resource; the remaining classes are implied through an instance's assignment to various properties (kixt:character, kixt:block, kixt:script).

Many other classes of resource are defined in the Kixt Ontology. However, knowledge of their existence is not required for the processes described in this document.

2.2 Datatypes

As an R·D·F‐based model, the Kixt Charset Model naturally inherits the various datatypes introduced in R·D·F Concepts, including those inherited from X·S·D Datatypes and elsewhere. It also introduces five new datatypes:

As O·W·L datatypes, you should not use these to type R·D·F literals; use xsd:anyURI, xsd:string, xsd:integer or similar instead. However, they are useful for expressing constraints on relations.

3. Charset Definitions

CharsetDefinition =
	[%xFEFF] CharsetDeclaration *(
		BlockDeclaration
		/ ScriptDeclaration
		/ CharacterDefinition
		/ Comment
		/ Blank
	)

A Kixt Charset Definition is a U·T·F‐8 or U·T·F‐16–encoded file (determined by a leading B·O·M, and defaulting to U·T·F‐8 if no B·O·M is present) consisting of any number of block or script declarations, interspersed with any number of character definitions, comments, or lines of whitespace. This section describes the syntax of such documents, in A·B·N·F and prose. It also describes how processors can use Kixt Charset Definitions to generate R·D·F graphs.

The file extension .kichar or .kch is suggested for Kixt Charset Definitions.

3.1 Null Handling

Programs which process Kixt Charset Definitions must ignore any U+0000 NULL characters which appear in a document, behaving as though they were not present.

This effectively allows a U·T·F‐16‐encoded document to be processed without a B·O·M if it only contains codepoints in the A·S·C·I·I range.
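As a rough illustration of the encoding and null‐handling rules above, the following Python sketch turns raw bytes into text ready for parsing. The function name decode_definition is hypothetical and is not defined by this specification; the sketch assumes that only the two U·T·F‐16 byte order marks select U·T·F‐16, and that everything else is treated as U·T·F‐8.

```python
import codecs


def decode_definition(data: bytes) -> str:
    """Decode a Kixt Charset Definition and drop every U+0000 NULL.

    Hypothetical helper: UTF-16 is chosen only when a UTF-16 BOM leads the
    input; otherwise the bytes are decoded as UTF-8.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and
        # removes it from the decoded text.
        text = data.decode("utf-16")
    else:
        # A leading UTF-8 BOM, if any, survives as U+FEFF; the grammar
        # permits it as an optional first character.
        text = data.decode("utf-8")
    # Behave as though NULL characters were never present.
    return text.replace("\x00", "")
```

With this approach, a U·T·F‐16 document without a B·O·M that contains only A·S·C·I·I characters decodes as U·T·F‐8 text interleaved with U+0000, and the final replacement leaves just the A·S·C·I·I content.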

3.2 End‐of‐Line Handling

Break = %x2028
	; A Unicode line separator
	; This is effectively just a placeholder for whatever manner of linebreak a document happens to use; see spec description

U+2028 LINE SEPARATOR is the only formally recognized line separator in a Kixt Charset Definition. However, to ease in the use of Kixt Charset Definitions on platforms for which support for U+2028 LINE SEPARATOR is lacking, programs which process Kixt Charset Definitions must behave as though they normalized all of the following, on input and before parsing, to a single U+2028 LINE SEPARATOR:

3.3 Basic Syntaxes

Space = %x20
	; An ordinary Unicode breaking space
Blank = *Space Break
	; A line consisting only of whitespace

U+0020 SPACE is the only space character in a Kixt Charset Definition. Spaces and line separators are not interchangeable.

ECMA6Char = %x21-22 / %x25-3F / %x41-5A / %x5F / %x61-7A
	; The invariant printing characters from ECMA-6 (also ISO/IEC 646); documents written using only these characters (plus some manner of ASCII linebreak as described above) will be comprehensible in any ECMA-6 character set
	; The core Kixt Charset Definition syntax (with the exception of linebreaks) only uses characters from this set
ASCIIChar = %x21-7E
	; Non-control, non-space ASCII characters
UCSChar =
	%xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
		; Basic Multilingual Plane
		; (minus surrogates and specials)
	/ %x10000-1FFFD
		; Supplementary Multilingual Plane
	/ %x20000-2FFFD
		; Supplementary Ideographic Plane
	/ %x30000-3FFFD
		; Tertiary Ideographic Plane
	/ %x40000-4FFFD
		; Plane 4
	/ %x50000-5FFFD
		; Plane 5
	/ %x60000-6FFFD
		; Plane 6
	/ %x70000-7FFFD
		; Plane 7
	/ %x80000-8FFFD
		; Plane 8
	/ %x90000-9FFFD
		; Plane 9
	/ %xA0000-AFFFD
		; Plane 10
	/ %xB0000-BFFFD
		; Plane 11
	/ %xC0000-CFFFD
		; Plane 12
	/ %xD0000-DFFFD
		; Plane 13
	/ %xE1000-EFFFD
		; Supplementary Special-purpose Plane
		; Note the first 1000 characters are excluded
	; Non-special, non-privateuse, non-ASCII characters
	; This is the same definition as used for IRIs
PrivateUse = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
	; Unicode private-use characters

Specials, such as controls or noncharacters, are not allowed in a Kixt Charset Definition. (This does not prevent you from using their codepoints in the <UnicodeMapping> of <CharacterDefinition>s.)

NoSlash = %x20-2E / %x30-7E / UCSChar / PrivateUse
NoSpace = ASCIIChar / UCSChar / PrivateUse
NoBreak = NoSpace / Space
AnyChar = NoBreak / Break

You may, generally speaking, use non‐A·S·C·I·I and private‐use characters in a Kixt Charset Definition, except in restricted productions like <Name>.

Zero = %x30
One = %x31
Bit = Zero / One
NonZero = %x31-39
Decimal = %x30-39

Decimal numbers use the standard A·S·C·I·I digits.

NonZeroHex = %x31-39 / %x41-46
UpperHex = %x30-39 / %x41-46
Hex = %x30-39 / %x41-46 / %x61-66
	; Case-insensitive hexadecimal used by IRI syntax

Hexadecimal numbers (<UpperHex>) use only the digits 0–9 and the uppercase A·S·C·I·I letters A–F. The <Hex> production, which also allows lowercase a–f, is only used in <IRI>s.

UpperAlpha = %x41-5A
Alpha = %x41-5A / %x61-7A

A number of productions in Kixt Charset Definitions use only the capital A·S·C·I·I alphabetic characters (<UpperAlpha>). <IRI>s make use of the small A·S·C·I·I alphabetic characters as well (<Alpha>).

IRI =
	URI-scheme %x3A IRI-hier-part
	[%x3F IRI-query]
	[%x23 IRI-fragment]
IRI-hier-part =
	%x2F.2F IRI-authority IRI-path-abempty
	/ IRI-path-absolute
	/ IRI-path-rootless
	/ IRI-path-empty
IRI-authority =
	[IRI-userinfo %x40]
	IRI-host
	[%x3A URI-port]
IRI-userinfo =
	*(
		IRI-unreserved
		/ URI-pct-encoded
		/ URI-sub-delims
		/ %x3A
	)
IRI-host =
	URI-IP-literal
	/ URI-IPv4address
	/ IRI-reg-name
IRI-reg-name =
	*(
		IRI-unreserved
		/ URI-pct-encoded
		/ URI-sub-delims
	)
IRI-path-abempty = *(%x2F IRI-segment)
IRI-path-absolute =
	%x2F [
		IRI-segment-nz
		*(%x2F IRI-segment)
	]
IRI-path-rootless = IRI-segment-nz *(%x2F IRI-segment)
IRI-path-empty = 0IRI-pchar
IRI-segment = *IRI-pchar
IRI-segment-nz = 1*IRI-pchar
IRI-pchar =
	IRI-unreserved
	/ URI-pct-encoded
	/ URI-sub-delims
	/ %x3A / %x40
IRI-query = *(IRI-pchar / PrivateUse / %x2F / %x3F)
IRI-fragment = *(IRI-pchar / %x2F / %x3F)
IRI-unreserved =
	Alpha / Decimal / %x2D-2E / %x5F / %x7E
	/ UCSChar
URI-scheme = Alpha *(Alpha / Decimal / %x2B / %x2D-2E)
URI-port = *Decimal
URI-IP-literal =
	%x5B (
		URI-IPv6address
		/ URI-IPvFuture
	) %x5D
URI-IPvFuture =
	(%x56 / %x76) 1*Hex %x2E
	1*(URI-unreserved / URI-sub-delims / %x3A)
URI-IPv6address =
	6(URI-h16 %x3A) URI-ls32
	/ %x3A.3A
		5(URI-h16 %x3A) URI-ls32
	/ [URI-h16] %x3A.3A
		4(URI-h16 %x3A) URI-ls32
	/ [*1(URI-h16 %x3A) URI-h16] %x3A.3A
		3(URI-h16 %x3A) URI-ls32
	/ [*2(URI-h16 %x3A) URI-h16] %x3A.3A
		2(URI-h16 %x3A) URI-ls32
	/ [*3(URI-h16 %x3A) URI-h16] %x3A.3A
		URI-h16 %x3A URI-ls32
	/ [*4(URI-h16 %x3A) URI-h16] %x3A.3A
		URI-ls32
	/ [*5(URI-h16 %x3A) URI-h16] %x3A.3A
		URI-h16
	/ [*6(URI-h16 %x3A) URI-h16] %x3A.3A
URI-h16 = 1*4Hex
	; 16 bits of address represented in hexadecimal
URI-ls32 = (URI-h16 %x3A URI-h16) / URI-IPv4address
	; Least significant 32 bits of address
URI-IPv4address =
	URI-dec-octet
	%x2E URI-dec-octet
	%x2E URI-dec-octet
	%x2E URI-dec-octet
URI-dec-octet =
	%x32 %x35 %x30-35
		; 250--255
	/ %x32 %x30-34 Decimal
		; 200--249
	/ One 2Decimal
		; 100--199
	/ NonZero Decimal
		; 10--99
	/ Decimal
		; 0--9
URI-pct-encoded = %x25 Hex Hex
URI-unreserved = Alpha / Decimal / %x2D-2E / %x5F / %x7E
URI-sub-delims = %x21 / %x24 / %x26-2C / %x3B / %x3D

I·R·Is must be <IRI>s as defined by R·F·C 3987, from which the above productions were taken.

NonEmptyString = NoSpace *([Space] NoSpace)
String = [NonEmptyString]
	; May be empty
CommentString = [NoSpace [*NoBreak NoSpace]]
	; May be empty

Ordinary strings cannot begin or end with spaces, or have multiple spaces appear in sequence. Comment strings allow multiple spaces in sequence. Neither allows linebreaks.

NameChar = UpperAlpha / Decimal / %x20 / %x2D
	; Valid characters for use in names
NameSpaceSequence =
	Space (
		UpperAlpha
			; A letter
		/ (%x2D (UpperAlpha / Decimal))
			; A hyphen followed by a letter or digit
	)
		; Rules for including a space in a name
NameHyphenSequence =
	%x2D (
		UpperAlpha
			; A letter
		/ Decimal
			; A digit
		/ NameSpaceSequence
			; A space (and then some) as defined above
	)
		; Rules for including a hyphen
Name =
	UpperAlpha *(
		UpperAlpha
		/ Decimal
		/ NameSpaceSequence
		/ NameHyphenSequence
	)
		; Valid character, block, and script names

Names must start with an uppercase A·S·C·I·I letter, and may consist of uppercase A·S·C·I·I letters, A·S·C·I·I digits, U+002D HYPHEN-MINUS, or spaces. Hyphens and spaces must not end a name or appear in sequence. Hyphens must not be surrounded by spaces, and digits must not be preceded by a space.
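The <Name> rules above can be approximated with a regular expression. The following Python sketch is one possible transcription of the <Name>, <NameSpaceSequence>, and <NameHyphenSequence> productions; the identifiers SPACE_SEQ, HYPHEN_SEQ, NAME, and is_valid_name are hypothetical and exist only for illustration.

```python
import re

# NameSpaceSequence: a space followed by a letter, or by a hyphen and then
# a letter or digit.
SPACE_SEQ = r" (?:[A-Z]|-[A-Z0-9])"
# NameHyphenSequence: a hyphen followed by a letter, a digit, or a
# NameSpaceSequence.
HYPHEN_SEQ = rf"-(?:[A-Z0-9]|{SPACE_SEQ})"
# Name: an uppercase letter followed by any mix of the pieces above.
NAME = re.compile(rf"[A-Z](?:[A-Z0-9]|{SPACE_SEQ}|{HYPHEN_SEQ})*")


def is_valid_name(text: str) -> bool:
    """Return True if the whole of `text` matches the <Name> production."""
    return NAME.fullmatch(text) is not None
```

Under this sketch, LATIN CAPITAL LETTER A and A-1 are accepted, while A 1 (digit after a space), A - B (hyphen surrounded by spaces), and A- (trailing hyphen) are rejected.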

Codepoint =
	*Zero NonZeroHex *3UpperHex
		; Any nonzero value
	/ 1*Zero
		; Zero
UnicodeCodepoint =
	%x55 %x2B
	(
		*Zero (One Zero / NonZeroHex) *4UpperHex
			; Any nonzero value
		/ 1*Zero
			; Zero
	)
BinaryCodepoint =
	*(Zero [Space]) One *15([Space] Bit)
	/ Zero *([Space] Zero)
Integer =
	Zero
	/ NonZeroHex *3UpperHex

The rules defined above are designed to facilitate first‐match‐wins, greedy matching.

Ordinary codepoints must be hexadecimal numbers in the range 0000–FFFF, but may be preceded by any number of zeroes.

Unicode codepoints must be preceded with the string U+ and may be padded with any number of zeroes. Only codepoints in the range U+0000..U+10FFFF are valid Unicode codepoints.

Binary codepoints may have single spaces between their digits.

Integer values are expressed as hexadecimal numbers from 0–FFFF, with no leading zeroes.
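For illustration, the three codepoint notations can be recognized and converted to integers with regular expressions that mirror the productions above. This is a minimal Python sketch; the pattern and function names are hypothetical and not part of this specification.

```python
import re

# Direct transcriptions of <Codepoint>, <UnicodeCodepoint>, and
# <BinaryCodepoint>; only uppercase hexadecimal digits are accepted.
CODEPOINT = re.compile(r"0*[1-9A-F][0-9A-F]{0,3}|0+")
UNICODE_CODEPOINT = re.compile(r"U\+(?:0*(?:10|[1-9A-F])[0-9A-F]{0,4}|0+)")
BINARY_CODEPOINT = re.compile(r"(?:0 ?)*1(?: ?[01]){0,15}|0(?: ?0)*")


def parse_codepoint(text: str) -> int:
    """Parse a hexadecimal Kixt codepoint (0000 through FFFF)."""
    if CODEPOINT.fullmatch(text) is None:
        raise ValueError(f"not a <Codepoint>: {text!r}")
    return int(text, 16)


def parse_unicode_codepoint(text: str) -> int:
    """Parse a `U+`-prefixed Unicode codepoint (U+0000 through U+10FFFF)."""
    if UNICODE_CODEPOINT.fullmatch(text) is None:
        raise ValueError(f"not a <UnicodeCodepoint>: {text!r}")
    return int(text[2:], 16)


def parse_binary_codepoint(text: str) -> int:
    """Parse a binary Kixt codepoint, ignoring the optional single spaces."""
    if BINARY_CODEPOINT.fullmatch(text) is None:
        raise ValueError(f"not a <BinaryCodepoint>: {text!r}")
    return int(text.replace(" ", ""), 2)
```

For example, parse_codepoint("0041"), parse_unicode_codepoint("U+0041"), and parse_binary_codepoint("0100 0001") all yield 65, while "10000" and "U+110000" are rejected as out of range.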

3.4 Comment

SingleLineComment =
	*Space %x2F
	*Space CommentString
	*Space Break
InnerCommentLine =
	(
		NoSlash
		/ %x2F NoSlash
		/ %x2F.2F NoSlash
		/ %x2F.2F.2F NoBreak
	) *NoBreak Break
MultiLineComment =
	%x2E.2E.2E Break
	*InnerCommentLine
	%x2F.2F.2F Break
Comment =
	SingleLineComment
	/ MultiLineComment

A single‐line comment is a single line beginning with U+002F SOLIDUS and then followed by any number of other characters. Multiline comments begin with three U+002E FULL STOP characters and end with three U+002F SOLIDUS characters. Comments should be ignored during processing.

Note that multiline comments can only appear on the “top level” and not inside of character declarations or other productions.
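One way a processor might discard comments before further parsing is sketched below in Python. It assumes the input has already been split into lines at the normalized line separators; the function name strip_comments is hypothetical. Note that an inner line such as "/// text" does not close a multiline comment, because the closing line must consist of exactly three solidi.

```python
def strip_comments(lines: list[str]) -> list[str]:
    """Return `lines` without single-line and multiline comments.

    Hypothetical helper; it does not check that multiline comments only
    appear at the top level.
    """
    kept = []
    in_multiline = False
    for line in lines:
        if in_multiline:
            # Only a line of exactly `///` ends the comment.
            if line == "///":
                in_multiline = False
            continue
        if line == "...":
            in_multiline = True
            continue
        # A single-line comment is optional spaces followed by a solidus.
        if line.lstrip(" ").startswith("/"):
            continue
        kept.append(line)
    return kept
```

Since no other production begins with U+002F SOLIDUS, dropping every line whose first non‐space character is a solidus should be safe under the grammar given here.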

3.5 Common Constructs

The following productions are used in multiple types of declaration, with similar meanings.

3.5.1 Aliases

Alias =
	*Space %x3D
	*Space Name
	*Space Break
	*SingleLineComment
Aliases = 1*Alias

An <Aliases> gives alternate <Name>s by which a kixt:Block or kixt:Character might be known. It consists of one or more lines, each beginning with a U+003D EQUALS SIGN, and followed by a <Name>.

Upon reaching an <Aliases>, for each <Name>, create a new R·D·F triple with current parent as its subject, kixt:alias as its predicate, and the value of the <Name> as its object, as an xsd:string.

3.5.2 Other names

OtherName =
	*Space %x2D
	*Space NonEmptyString
	*Space Break
	*SingleLineComment
OtherNames = 1*OtherName

An <OtherNames> gives alternate names for a kixt:Charset, kixt:Block, kixt:Script, or kixt:Character, which may be more freeform than the <Name> production allows. It consists of one or more lines, each beginning with a U+002D HYPHEN-MINUS, and followed by a <NonEmptyString>.

Upon reaching an <OtherNames>, for each <NonEmptyString>, create a new R·D·F triple with current parent as its subject, kixt:alsoKnownAs as its predicate, and the value of the <NonEmptyString> as its object, as an xsd:string.

3.5.3 Notes

Note =
	*Space %x2A
	*Space NonEmptyString
	*Space Break
	*SingleLineComment
Notes = 1*Note

A <Notes> gives a freeform space for adding informative notes to a kixt:Charset, kixt:Block, kixt:Script, or kixt:Character. It consists of one or more lines, each beginning with a U+002A ASTERISK, and followed by a <NonEmptyString>.

Upon reaching a <Notes>, for each <NonEmptyString>, create a new R·D·F triple with current parent as its subject, kixt:note as its predicate, and the value of the <NonEmptyString> as its object, as an xsd:string.

3.6 Charset Declaration

CharsetDeclaration =
	CharsetIdentifier
	CharsetProperties
	[OtherNames]
	[Notes]

A <CharsetDeclaration> defines the Kixt Charset Definition's kixt:Charset. It must be the first thing in a Kixt Charset Definition, after an optional U+FEFF Byte Order Mark, with no leading spaces or breaks.

3.6.1 Charset identifier

CharsetIdentifier =
	%x3B.43.48.41.52.53.45.54.3C
		; `;CHARSET<`
	IRI %x3E [
		Integer
		[%x2E Integer]
	] *Space Break
	*SingleLineComment

A <CharsetIdentifier> defines the I·R·I and version for a Kixt Charset Definition's kixt:Charset. Upon reaching a <CharsetIdentifier>, set the current charset to the I·R·I specified by <IRI>. Set current parent to current charset. Create an R·D·F triple with the current charset as its subject, rdf:type as its predicate, and kixt:Charset as its object.

If a first <Integer> is present, create an R·D·F triple with the current charset as its subject, kixt:version as its predicate, and the value of the first <Integer> as its object, as an xsd:integer. If a second <Integer> is present, create an R·D·F triple with the current charset as its subject, kixt:revision as its predicate, and the value of the second <Integer> as its object, as an xsd:integer.

Finally, set current script to i18n:zzzz; this is the default script.
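The triples produced for a <CharsetIdentifier> might be computed along the following lines. This Python sketch represents triples as plain tuples and, as a simplifying assumption, accepts any run of characters other than angle brackets and spaces in place of the full <IRI> production. The names IDENTIFIER and charset_identifier_triples are hypothetical.

```python
import re

KIXT = "https://spec.go.kibi.family/ns/kixt/#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# <Integer>: hexadecimal, no leading zeroes.
INTEGER = r"(?:0|[1-9A-F][0-9A-F]{0,3})"
# Simplification: the IRI is not validated against the <IRI> production.
IDENTIFIER = re.compile(rf";CHARSET<([^<> ]+)>(?:({INTEGER})(?:\.({INTEGER}))?)? *")


def charset_identifier_triples(line: str) -> list[tuple]:
    """Return (subject, predicate, object) tuples for one identifier line."""
    match = IDENTIFIER.fullmatch(line)
    if match is None:
        raise ValueError(f"not a <CharsetIdentifier>: {line!r}")
    iri, version, revision = match.groups()
    triples = [(iri, RDF_TYPE, KIXT + "Charset")]
    if version is not None:
        # Integers are written in hexadecimal.
        triples.append((iri, KIXT + "version", int(version, 16)))
    if revision is not None:
        triples.append((iri, KIXT + "revision", int(revision, 16)))
    return triples
```

For a line such as ;CHARSET<https://example.org/charset>1.A (an invented example I·R·I), this yields a type triple, a version of 1, and a revision of 10.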

3.6.2 Charset properties

Variable = %x56.41.52.49.41.42.4C.45
	; `VARIABLE`
CharsetProperties =
	[
		*Space %x26
		*Space Variable
		*Space Break
		*SingleLineComment
	]

A <CharsetProperties> defines additional properties on a kixt:Charset. At the moment, the only additional property defined is a promise as to whether the character set is variable‐width‐compatible. This property must be present if the production is nonempty. <CharsetProperties> begins with a U+0026 AMPERSAND.

Upon reaching a <CharsetProperties>:

  1. Create a new R·D·F triple with current charset as its subject, kixt:supportsVariableEncoding as its predicate, and an object of true, as an xsd:boolean, if <Variable> is present, and false, as an xsd:boolean, otherwise.

3.7 Block Declaration

BlockDeclaration =
	BlockName
	[Aliases]
	[OtherNames]
	[Notes]

A <BlockDeclaration> defines a new kixt:Block.

3.7.1 Block name

BlockName =
	*Space %x25
	*Space Name
	*Space Break
	*SingleLineComment

A <BlockName> names a kixt:Block. It begins with a U+0025 PERCENT SIGN, which is followed by the block name.

The special name NO BLOCK signifies no block. A <BlockDeclaration> with a <Name> of NO BLOCK must not have an <Aliases>, <OtherNames>, or <Notes>.

Upon reaching a <BlockName>, if the value of <Name> is NO BLOCK, set in a block to false. Otherwise, set in a block to true, set current block to a new blank node, and set current parent to current block.

If in a block is true, create a new R·D·F triple with current block as its subject, kixt:name as its predicate, and the value of <Name> as its object, as an xsd:string.

3.8 Script Declaration

ScriptDeclaration =
	ScriptIdentifier
	[OtherNames]
	[Notes]

A <ScriptDeclaration> defines a new kixt:Script.

3.8.1 Script identifier

ScriptIdentifier =
	*Space %x27
	*Space %x3C IRI %x3E
	*Space Break
	*SingleLineComment

A <ScriptIdentifier> sets the I·R·I for the current kixt:Script. It begins with a U+0027 APOSTROPHE, which is followed by the script <IRI>.

Three special scripts are defined:

The above values are given prefixed, but the actual value of <IRI> must be a full (expanded) I·R·I.

Upon reaching a <ScriptIdentifier>, set current script to <IRI> and current parent to current script.

3.9 Character Definition

CharacterDefinition =
	UnicodeMapping
	CharacterInfo
	LinebreakProperties
	CompatibilityMapping
	DecompositionMapping
	AdditionalProperties
	[Aliases]
	[OtherNames]
	[Notes]
	[References]
	[Glyphs]

A <CharacterDefinition> defines a single kixt:Character. The <UnicodeMapping> and <CharacterInfo> productions are required; the <LinebreakProperties>, <CompatibilityMapping>, <DecompositionMapping>, and <AdditionalProperties> productions are required but may be empty; all other productions are optional but must be specified in the order above.

Upon reaching a <CharacterDefinition>, set current character to a new blank node. Set current parent to current character. Create a new R·D·F triple with current charset as its subject, kixt:character as its predicate, and current character as its object.

If in a block is true, create a new R·D·F triple with current character as its subject, kixt:block as its predicate, and current block as its object.

Create a new R·D·F triple with current character as its subject, kixt:script as its predicate, and current script as its object.

3.9.1 Unicode mapping

Unicode =
	*Space UnicodeCodepoint
	[1*Space CommentString]
	*Space Break
	*SingleLineComment
UnicodeMapping = 1*Unicode

A <UnicodeMapping> defines the Unicode scalar values to which a given kixt:Character maps. These are given as <UnicodeCodepoint>s, one per line, each optionally followed by a comment.

The Kixt Charset Definition format requires every <CharacterDefinition> to have a <UnicodeMapping>. The character U+FFFD REPLACEMENT CHARACTER can be used in situations where no mapping is desired. However, the use of private‐use mappings is generally preferable.

Upon reaching a <UnicodeMapping>, set current sequence to a new blank node. Create a new R·D·F triple with current sequence as its subject, olo:length as its predicate, and the total number of <UnicodeCodepoint>s within the <UnicodeMapping> as its object, as an xsd:integer.

For each <UnicodeCodepoint>:

  1. Set current slot to a new blank node. Create a new R·D·F triple with current slot as its subject, olo:index as its predicate, and the one-based index of the <UnicodeCodepoint> within the <UnicodeMapping> as its object, as an xsd:integer.

  2. Set current item to a new blank node. Create a new R·D·F triple with current item as its subject, rdf:value as its predicate, and the value of the <UnicodeCodepoint> as its object, as an xsd:integer.

  3. Create a new R·D·F triple with current slot as its subject, olo:item as its predicate, and current item as its object.

  4. Create a new R·D·F triple with current sequence as its subject, olo:slot as its predicate, and current slot as its object.

Finally, create a new R·D·F triple with current character as its subject, kixt:unicode as its predicate, and current sequence as its object.

In Turtle, the resulting R·D·F graph produced by the above steps will look something like the following:

[ ] kixt:unicode [
	olo:length 1;
	olo:slot [
		olo:index 1 ;
		olo:item [ rdf:value 65533 ] ] ]
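The <UnicodeMapping> steps above can be sketched in Python as follows. Triples are represented as plain tuples and blank nodes as generated labels; the names blank_nodes and unicode_mapping_triples are hypothetical, and the codepoints are assumed to have been parsed to integers already.

```python
from itertools import count

KIXT = "https://spec.go.kibi.family/ns/kixt/#"
OLO = "http://purl.org/ontology/olo/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def blank_nodes():
    """Yield an endless supply of fresh blank node labels."""
    return (f"_:b{n}" for n in count())


def unicode_mapping_triples(character, codepoints, fresh):
    """Build the ordered-list triples for one character's Unicode mapping."""
    sequence = next(fresh)
    triples = [(sequence, OLO + "length", len(codepoints))]
    for index, value in enumerate(codepoints, start=1):
        slot = next(fresh)
        triples.append((slot, OLO + "index", index))  # one-based index
        item = next(fresh)
        triples.append((item, RDF + "value", value))
        triples.append((slot, OLO + "item", item))
        triples.append((sequence, OLO + "slot", slot))
    # Finally, attach the sequence to the character.
    triples.append((character, KIXT + "unicode", sequence))
    return triples
```

Calling unicode_mapping_triples("_:char", [0xFFFD], blank_nodes()) produces six triples that correspond to the Turtle shown above, with one slot at index 1 whose item has an rdf:value of 65533.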

3.9.2 Character info

BasicType =
	%x43.4F.4E.54.52.4F.4C
		; `CONTROL`
	/ %x4D.45.53.53.41.47.49.4E.47
		; `MESSAGING`
	/ %x46.4F.52.4D.41.54
		; `FORMAT`
	/ %x44.41.54.41
		; `DATA`
	/ %x4E.4F.4E.53.50.41.43.49.4E.47
		; `NONSPACING`
	/ %x53.50.41.43.49.4E.47
		; `SPACING`
	/ %x50.52.49.56.41.54.45.55.53.45
		; `PRIVATEUSE`
	/ %x4E.4F.4E.43.48.41.52.41.43.54.45.52
		; `NONCHARACTER`
	; It is not possible to define an `UNASSIGNED` character
CharacterInfo =
	*Space %x3B
	*Space (BinaryCodepoint *Space %x2F / Codepoint Space)
	*Space Name
	*Space %x28 BasicType %x29
	*Space Break
	*SingleLineComment

A <CharacterInfo> defines the basic aspects of a kixt:Character. It begins with a U+003B SEMICOLON, followed by a codepoint in either hexadecimal or binary, followed by the name and basic type of the character.

Upon reaching a <CharacterInfo>, perform the following steps:

  1. If there is already some subject node which is an object of the predicate kixt:character on the subject current charset, for whom the object of the predicate kixt:codepoint is the value of the <Codepoint> or <BinaryCodepoint> (whichever is present), replace current character with the first such node in all R·D·F triples in which current character is a subject or object, and set current character to this new node.

    This handles the case where a kixt:Character with this codepoint has already been created as part of a decomposition mapping.

    Otherwise, create a new R·D·F triple with current character as its subject, kixt:codepoint as its predicate, and the value of either <Codepoint> or <BinaryCodepoint> as its object, as an xsd:integer.

  2. Create a new R·D·F triple with current character as its subject, kixt:name as its predicate, and the value of <Name> as its object, as an xsd:string.

  3. Create a new R·D·F triple with current character as its subject, kixt:basicType as its predicate, and the value of <BasicType>, appended to the end of the string https://spec.go.kibi.family/ns/kixt/#, as its object, as an xsd:anyURI.

    Note that this is a literal with a datatype I·R·I of xsd:anyURI, not an R·D·F I·R·I.
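As an illustration of how a <CharacterInfo> line breaks down, the following Python sketch extracts the codepoint, name, and basic type from a single line (without its line separator). It does not perform the node replacement described in step 1, and the names CHARACTER_INFO and parse_character_info are hypothetical.

```python
import re

KIXT = "https://spec.go.kibi.family/ns/kixt/#"
NAME = (
    r"[A-Z](?:[A-Z0-9]| (?:[A-Z]|-[A-Z0-9])"
    r"|-(?:[A-Z0-9]| (?:[A-Z]|-[A-Z0-9])))*"
)
BASIC_TYPE = (
    "CONTROL|MESSAGING|FORMAT|DATA|NONSPACING|SPACING|PRIVATEUSE|NONCHARACTER"
)
CHARACTER_INFO = re.compile(
    # Either a binary codepoint closed by a solidus, or a hexadecimal
    # codepoint followed by a space.
    r" *; *(?:(?P<bin>(?:0 ?)*1(?: ?[01]){0,15}|0(?: ?0)*) */"
    r"|(?P<hex>0*[1-9A-F][0-9A-F]{0,3}|0+) )"
    rf" *(?P<name>{NAME}) *\((?P<type>{BASIC_TYPE})\) *"
)


def parse_character_info(line: str) -> dict:
    """Return the codepoint, name, and basic type given on `line`."""
    match = CHARACTER_INFO.fullmatch(line)
    if match is None:
        raise ValueError(f"not a <CharacterInfo>: {line!r}")
    if match["bin"] is not None:
        codepoint = int(match["bin"].replace(" ", ""), 2)
    else:
        codepoint = int(match["hex"], 16)
    return {
        "codepoint": codepoint,
        "name": match["name"],
        # The basic type becomes an xsd:anyURI literal in the Kixt namespace.
        "basicType": KIXT + match["type"],
    }
```

Both ;0041 LATIN CAPITAL LETTER A (SPACING) and ;0100 0001/ LATIN CAPITAL LETTER A (SPACING) give a codepoint of 65 under this sketch.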

3.9.3 Linebreak properties

LinebreakClass =
	%x53.45.50.41.52.41.42.4C.45
		; `SEPARABLE`
	/ %x49.4E.53.45.50.41.52.41.42.4C.45
		; `INSEPARABLE`
	/ %x42.52.45.41.4B
		; `BREAK`
	/ %x4A.4F.49.4E
		; `JOIN`
LinebreakProperties =
	[
		*Space %x3A
		*Space Name
		[*Space %x28 LinebreakClass [1*Space LinebreakClass] %x29]
		*Space Break
		*SingleLineComment
	]

A <LinebreakProperties> defines the linebreaking behaviour of a kixt:Character. It begins with a U+003A COLON, followed by a <Name> specifying the linebreak category, and optionally followed by one or two <LinebreakClass>es, in parentheses, to override the default linebreaking behavior for the character.

Upon reaching a <LinebreakProperties>, perform the following steps:

  1. Create a new R·D·F triple with current character as its subject, kixt:linebreakCategory as its predicate, and the value of <Name> as its object, as an xsd:string.

  2. Create a new R·D·F triple with current character as its subject, kixt:breakBefore as its predicate, and the value of either the first <LinebreakClass>, if one is defined, or JOIN, in either case appended to the end of the string https://spec.go.kibi.family/ns/kixt/#, as its object, as an xsd:anyURI.

    Note that this is a literal with a datatype I·R·I of xsd:anyURI, not an R·D·F I·R·I.

  3. Create a new R·D·F triple with current character as its subject, kixt:breakAfter as its predicate, and the value of either the last <LinebreakClass>, if one is defined, or INSEPARABLE, in either case appended to the end of the string https://spec.go.kibi.family/ns/kixt/#, as its object, as an xsd:anyURI.

    Note that this is a literal with a datatype I·R·I of xsd:anyURI, not an R·D·F I·R·I.

    If only one <LinebreakClass> is defined, kixt:breakBefore and kixt:breakAfter will have the same value for their objects.
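The defaulting behaviour in steps 2 and 3 can be summarized with a short Python sketch. The function name break_values is hypothetical; it takes the zero, one, or two <LinebreakClass> keywords found in the parentheses and returns the literal values for kixt:breakBefore and kixt:breakAfter.

```python
KIXT = "https://spec.go.kibi.family/ns/kixt/#"
LINEBREAK_CLASSES = {"SEPARABLE", "INSEPARABLE", "BREAK", "JOIN"}


def break_values(classes: list[str]) -> tuple[str, str]:
    """Return the (breakBefore, breakAfter) xsd:anyURI literal values."""
    if len(classes) > 2 or any(c not in LINEBREAK_CLASSES for c in classes):
        raise ValueError(f"invalid linebreak classes: {classes!r}")
    # The first class governs breakBefore and the last governs breakAfter;
    # the defaults apply only when no class is given at all.
    before = classes[0] if classes else "JOIN"
    after = classes[-1] if classes else "INSEPARABLE"
    return KIXT + before, KIXT + after
```

With a single class, the first and last entries coincide, so both properties receive the same value, as noted above.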

3.9.4 Compatibility mapping

CompatibilityMapping =
	[
		*Space %x28
		[*Space %x3C IRI %x3E]
		*Space Codepoint
		*(1*Space Codepoint)
		*Space Break
		*SingleLineComment
	]

A <CompatibilityMapping> defines a compatibility decomposition for a kixt:Character. It begins with a U+0028 LEFT PARENTHESIS, followed by an optional <IRI> mode, followed by a sequence of <Codepoint>s giving the mapping. The entire production may be empty; if so, the character's compatibility decomposition is to itself.

The value kixt:GENERIC indicates a generic compatibility mode and is the default.

Upon reaching a <CompatibilityMapping>, set current sequence to a new blank node. Create a new R·D·F triple with current sequence as its subject, olo:length as its predicate, and the total number of <Codepoint>s within the <CompatibilityMapping>, or 1 if <CompatibilityMapping> is empty, as its object, as an xsd:integer.

For each <Codepoint>, or if the <CompatibilityMapping> is empty:

  1. Set current slot to a new blank node. Create a new R·D·F triple with current slot as its subject, olo:index as its predicate, and the one-based index of the <Codepoint> within the <CompatibilityMapping>, or 1 if <CompatibilityMapping> is empty, as its object, as an xsd:integer.

  2. If <CompatibilityMapping> is empty, set current item to current character. Otherwise, if there is already some subject node which is an object of the predicate kixt:character on the subject current charset, for whom the object of the predicate kixt:codepoint is the value of the current <Codepoint>, set current item to that node. Otherwise :—

    1. Set current item to a new blank node.

    2. Create a new R·D·F triple with current charset as its subject, kixt:character as its predicate, and current item as its object.

    3. Create a new R·D·F triple with current item as its subject, kixt:codepoint as its predicate, and the value of <Codepoint> as its object, as an xsd:integer.

  3. Create a new R·D·F triple with current slot as its subject, olo:item as its predicate, and current item as its object.

  4. Create a new R·D·F triple with current sequence as its subject, olo:slot as its predicate, and current slot as its object.

  5. Create a new R·D·F triple with current sequence as its subject, kixt:compatibilityMode as its predicate, and the value of <IRI>, or https://spec.go.kibi.family/ns/kixt/#GENERIC if <IRI> is not present, as its object, as an I·R·I.

Finally, create a new R·D·F triple with current character as its subject, kixt:compatibility as its predicate, and current sequence as its object.

In Turtle, the resulting R·D·F graph produced by the above steps will look something like the following:

[ ] kixt:compatibility [
	olo:length 1;
	kixt:compatibilityMode <example:mode> ;
	olo:slot [
		olo:index 1 ;
		olo:item [ kixt:codepoint 69 ] ] ]
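Step 2 above reuses an existing character node when one with the same codepoint is already known, and otherwise creates one. A processor might keep an index from codepoint to node for this purpose, as in the following Python sketch. The class CharsetGraph and its methods are hypothetical and stand in for whatever graph storage an implementation actually uses.

```python
from itertools import count

KIXT = "https://spec.go.kibi.family/ns/kixt/#"


class CharsetGraph:
    """A toy triple store with a codepoint-to-node index."""

    def __init__(self, charset: str):
        self.charset = charset
        self.triples = []
        self._by_codepoint = {}
        self._ids = count()

    def fresh(self) -> str:
        """Return a new blank node label."""
        return f"_:b{next(self._ids)}"

    def character_for(self, codepoint: int) -> str:
        """Find the node for `codepoint`, creating it if necessary."""
        node = self._by_codepoint.get(codepoint)
        if node is None:
            node = self.fresh()
            self._by_codepoint[codepoint] = node
            self.triples.append((self.charset, KIXT + "character", node))
            self.triples.append((node, KIXT + "codepoint", codepoint))
        return node
```

Looking up the same codepoint twice returns the same node and adds its two triples only once, which is what allows a mapping to refer to a character that is defined later in the document.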

3.9.5 Decomposition mapping

DecompositionMapping =
	[
		(
			*Space %x3C
			*Space Codepoint
			*(1*Space Codepoint)
			*Space Break
		) / (
			*Space %x3C.3C
			*Space Codepoint
			1*(1*Space Codepoint)
			*Space Break
		)
		*SingleLineComment
	]

A <DecompositionMapping> defines a canonical decomposition for a kixt:Character. It begins with either one or two U+003C LESS-THAN SIGNs (indicating whether the decomposed form is preferred), followed by a sequence of <Codepoint>s giving the mapping. The entire production may be empty; if so, the character's canonical decomposition is to itself.

A decomposition mapping to a single character is always preferred, so the two‐U+003C LESS-THAN SIGN form is only permitted when defining a mapping to two or more codepoints.

Upon reaching a <DecompositionMapping>, set current sequence to a new blank node. Create a new R·D·F triple with current sequence as its subject, olo:length as its predicate, and the total number of <Codepoint>s within the <DecompositionMapping>, or 1 if <DecompositionMapping> is empty, as its object, as an xsd:integer.

For each <Codepoint>, or if the <DecompositionMapping> is empty:

  1. Set current slot to a new blank node. Create a new R·D·F triple with current slot as its subject, olo:index as its predicate, and the one-based index of the <Codepoint> within the <DecompositionMapping>, or 1 if <DecompositionMapping> is empty, as its object, as an xsd:integer.

  2. If <DecompositionMapping> is empty, set current item to current character. Otherwise, if there is already some subject node which is an object of the predicate kixt:character on the subject current charset, for whom the object of the predicate kixt:codepoint is the value of the current <Codepoint>, set current item to that node. Otherwise :—

    1. Set current item to a new blank node.

    2. Create a new R·D·F triple with current charset as its subject, kixt:character as its predicate, and current item as its object.

    3. Create a new R·D·F triple with current item as its subject, kixt:codepoint as its predicate, and the value of <Codepoint>, converted from hexadecimal, as its object, as an xsd:integer.

  3. Create a new R·D·F triple with current slot as its subject, olo:item as its predicate, and current item as its object.

  4. Create a new R·D·F triple with current sequence as its subject, olo:slot as its predicate, and current slot as its object.

  5. Create a new R·D·F triple with current sequence as its subject, kixt:preferred as its predicate, and an object of true, as an xsd:boolean, if <DecompositionMapping> contains two U+003C LESS-THAN SIGNs or only one <Codepoint>, and false, as an xsd:boolean, otherwise.

Finally, create a new R·D·F triple with current character as its subject, kixt:decomposition as its predicate, and current sequence as its object.

In Turtle, the resulting R·D·F graph produced by the above steps will look something like the following:

[ ] kixt:decomposition [
	olo:length 1 ;
	kixt:preferred true ;
	olo:slot [
		olo:index 1 ;
		olo:item [ kixt:codepoint 105 ] ] ] .
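The decomposition steps can also be sketched procedurally. The following Python is a minimal illustration only, not a normative algorithm: it represents triples as plain tuples, uses a hypothetical `blank()` helper for blank nodes, and assumes the caller supplies a `known` dictionary from integer codepoints to existing character nodes together with a `preferred` flag saying whether the two‐LESS-THAN SIGN form was used. Because an R·D·F graph is a set of triples, the kixt:preferred triple is emitted once rather than once per slot.

```python
from itertools import count

_ids = count()

def blank():
    """Return a fresh blank-node label (hypothetical naming scheme)."""
    return f"_:b{next(_ids)}"

def decomposition_triples(charset, character, codepoints, known, preferred):
    """Build the triples for one <DecompositionMapping>.

    codepoints: list of hexadecimal strings (an empty list means an empty mapping)
    known:      dict of integer codepoint -> existing character node (assumed input)
    preferred:  True if the mapping used two U+003C LESS-THAN SIGNs
    """
    triples = []
    sequence = blank()
    triples.append((sequence, "olo:length", max(len(codepoints), 1)))
    # An empty mapping is processed once, with the character mapping to itself.
    for index, hexval in enumerate(codepoints or [None], start=1):
        slot = blank()
        triples.append((slot, "olo:index", index))
        if hexval is None:
            item = character
        else:
            value = int(hexval, 16)
            item = known.get(value)
            if item is None:
                # No node has this codepoint yet, so create a placeholder.
                item = blank()
                known[value] = item
                triples.append((charset, "kixt:character", item))
                triples.append((item, "kixt:codepoint", value))
        triples.append((slot, "olo:item", item))
        triples.append((sequence, "olo:slot", slot))
    triples.append((sequence, "kixt:preferred", preferred or len(codepoints) == 1))
    triples.append((character, "kixt:decomposition", sequence))
    return triples
```

Calling `decomposition_triples("_:cs", "_:ch", ["0069"], {}, False)` yields a sequence of length 1 whose single slot points at a node with kixt:codepoint 105, matching the shape of the Turtle sample.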

3.9.6 Additional properties

Deprecated = %x44.45.50.52.45.43.41.54.45.44
	; `DEPRECATED`
CharacterWidth =
	%x50.52.4F.50.4F.52.54.49.4F.4E.41.4C
		; `PROPORTIONAL`
	/ %x46.55.4C.4C.57.49.44.54.48
		; `FULLWIDTH`
SegmentationClass =
	%x45.58.54.45.4E.44.53
		; `EXTENDS`
ConjoiningMode =
	%x43.4F.4E.4A.4F.49.4E.53.3C
		; `CONJOINS<`
	IRI %x3E
Conjoins = ConjoiningMode [Integer]
Combines = SegmentationClass [%x2B Integer]
AdditionalProperties =
	[
		*Space %x26
		*Space (
			Deprecated [
				1*Space (
					CharacterWidth [
						1*Space ConjoiningMode
					]
					/ Conjoins
					/ Combines
				)
			]
			/ CharacterWidth [
				1*Space ConjoiningMode
			]
			/ Conjoins
			/ Combines
		)
		*Space Break
		*SingleLineComment
	]

An <AdditionalProperties> defines a number of additional properties on a kixt:Character; in order, these are whether the character is deprecated, whether the character is fullwidth or proportional, whether the character conjoins with previous characters of a similar type, and whether the character is a combining character. All of these elements are optional, but at least one must be present if the production is nonempty as a whole. <AdditionalProperties> begins with a U+0026 AMPERSAND.

Upon reaching an <AdditionalProperties>:

  1. Create a new R·D·F triple with current character as its subject, kixt:deprecated as its predicate, and an object of true, as an xsd:boolean, if <AdditionalProperties> contains <Deprecated>, and false, as an xsd:boolean, otherwise.

  2. Create a new R·D·F triple with current character as its subject, kixt:fullwidth as its predicate, and an object of YES, as an xsd:string, if <CharacterWidth> is FULLWIDTH; NO, as an xsd:string, if <CharacterWidth> is PROPORTIONAL; and an empty xsd:string if <CharacterWidth> is not present.

  3. Create a new R·D·F triple with current character as its subject, kixt:extends as its predicate, and an object of true, as an xsd:boolean, if <Combines> is present, and false, as an xsd:boolean, otherwise.

  4. Create a new R·D·F triple with current character as its subject, kixt:combiningClass as its predicate, and the value of <Integer> in <Combines>, if present, or 0, otherwise, as its object, as an xsd:integer.

  5. If <ConjoiningMode> is present (including within a <Conjoins>), create a new R·D·F triple with current character as its subject, kixt:conjoiningMode as its predicate, and the value of the <IRI> in <ConjoiningMode> as its object, as an I·R·I.

  6. Create a new R·D·F triple with current character as its subject, kixt:conjoiningClass as its predicate, and the value of <Integer> in <Conjoins>, if present, or 0, otherwise, as its object, as an xsd:integer.
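As a rough illustration of these six steps, the Python sketch below turns already-parsed <AdditionalProperties> components into triples. The function name, its keyword arguments, and the tuple representation of triples are all assumptions made for this example; parsing the line itself is out of scope here.

```python
def additional_property_triples(character, deprecated=False, width=None,
                                conjoining_mode=None, conjoining_class=None,
                                combines=False, combining_class=None):
    """Return the triples for one <AdditionalProperties> (illustrative only).

    width:            "FULLWIDTH", "PROPORTIONAL", or None when absent
    conjoining_mode:  the I.R.I from <ConjoiningMode>, or None when absent
    conjoining_class: the <Integer> from <Conjoins>, or None when absent
    combines:         True when <Combines> is present
    combining_class:  the <Integer> from <Combines>, or None when absent
    """
    # kixt:fullwidth is a three-valued string: YES, NO, or empty.
    fullwidth = {"FULLWIDTH": "YES", "PROPORTIONAL": "NO", None: ""}[width]
    triples = [
        (character, "kixt:deprecated", deprecated),
        (character, "kixt:fullwidth", fullwidth),
        (character, "kixt:extends", combines),
        (character, "kixt:combiningClass", combining_class or 0),
    ]
    # The conjoining mode triple is only created when a mode was given.
    if conjoining_mode is not None:
        triples.append((character, "kixt:conjoiningMode", conjoining_mode))
    triples.append((character, "kixt:conjoiningClass", conjoining_class or 0))
    return triples
```

Note that every property except kixt:conjoiningMode receives a default object even when its syntax is absent, so a bare `& DEPRECATED` line still produces five triples.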

3.9.7 References

Reference =
	*Space %x3E
	*Space Codepoint
	[1*Space CommentString]
	*Space Break
	*SingleLineComment
References = 1*Reference

A <References> allows a kixt:Character to be associated with other kixt:Characters for the purposes of comparison. It consists of one or more lines, each beginning with a U+003E GREATER-THAN SIGN, followed by a <Codepoint> and an optional comment.

Upon reaching a <References>, for each <Codepoint>:

  1. If there is already some subject node, which is an object of the predicate kixt:character on the subject current charset, for whom the object of the predicate kixt:codepoint is the value of the current <Codepoint>, set current item to that node. Otherwise :—

    1. Set current item to a new blank node.

    2. Create a new R·D·F triple with current charset as its subject, kixt:character as its predicate, and current item as its object.

    3. Create a new R·D·F triple with current item as its subject, kixt:codepoint as its predicate, and the value of <Codepoint>, converted from hexadecimal, as its object, as an xsd:integer.

  2. Create a new R·D·F triple with current character as its subject, kixt:compare as its predicate, and current item as its object.
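The lookup in step 1 searches the graph built so far rather than a separate index. A minimal Python sketch of that search follows, assuming the graph is held as a mutable list of tuples; the names `find_or_create` and `add_references`, and the `_:ref` blank-node labels, are hypothetical.

```python
from itertools import count

_fresh = count()

def find_or_create(triples, charset, value):
    """Return the charset's character node with this codepoint, creating it if absent."""
    members = {o for s, p, o in triples
               if s == charset and p == "kixt:character"}
    for s, p, o in triples:
        if p == "kixt:codepoint" and o == value and s in members:
            return s
    # No such character yet: add a placeholder node to the charset.
    node = f"_:ref{next(_fresh)}"
    triples.append((charset, "kixt:character", node))
    triples.append((node, "kixt:codepoint", value))
    return node

def add_references(triples, charset, character, hex_codepoints):
    """Append one kixt:compare triple per <Codepoint> in a <References>."""
    for hexval in hex_codepoints:
        item = find_or_create(triples, charset, int(hexval, 16))
        triples.append((character, "kixt:compare", item))
```

A placeholder created this way has only a kixt:codepoint; the document is expected to define the character fully elsewhere.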

3.9.8 Glyphs

HexGlyph = 8*UpperHex
Glyph =
	*Space %x29
	*Space HexGlyph
	*Space Break
	*SingleLineComment
Glyphs = 1*Glyph

A <Glyphs> provides lo-fi black & white representative glyphs for a kixt:Character. It consists of one or more <Glyph>s, each beginning with a U+0029 RIGHT PARENTHESIS and consisting of one or more lines of binary data represented in hexadecimal. Each bit of this data represents a pixel, with 0 indicating the background colour and 1 the foreground, beginning from the starting corner of the glyph (in both horizontal and vertical directions) and travelling in the direction of the writing mode.

Upon reaching a <Glyphs>, for each <Glyph>, create a new R·D·F triple with current character as its subject, kixt:representativeGlyph as its predicate, and the value of <HexGlyph> as its object, as an xsd:hexBinary, padded with additional terminal zeroes as necessary until the total length is even, and the total length times either four or eight is square.

This effectively limits representative glyphs to having a height divisible by 4.
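The padding rule can be read as a simple loop: keep appending zeroes until the hexadecimal string has an even number of digits and either four or eight times that number is a perfect square. The Python below is one way to express that reading; `pad_glyph` and `is_square` are names chosen for this sketch.

```python
from math import isqrt

def is_square(n):
    """True if n is a perfect square."""
    root = isqrt(n)
    return root * root == n

def pad_glyph(hex_glyph):
    """Pad a <HexGlyph> with terminal zeroes per the rule above (illustrative)."""
    padded = hex_glyph
    while len(padded) % 2 or not (
        is_square(len(padded) * 4) or is_square(len(padded) * 8)
    ):
        padded += "0"
    return padded
```

Under this reading, an 8-digit glyph is already acceptable (8 × 8 = 64 is square), while a 9-digit glyph is padded to 16 digits (16 × 4 = 64 is square).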

4. Conformance

A Kixt Charset Definition is well-formed if it matches the A·B·N·F syntax for <CharsetDefinition> defined by this specification. Processors of Kixt Charset Definitions must fail to process any Kixt Charset Definition which is not well-formed.

4.1 Validity

In addition to the constraints made by the A·B·N·F syntax, the following situations are all semantically invalid in a Kixt Charset Definition:

  1. Two or more <BlockDeclaration>s with identical <Name>s, when both are not NO BLOCK.

  2. An <Aliases>, <OtherNames>, or <Notes> in a <BlockDeclaration> whose <Name> is NO BLOCK.

  3. A <LinebreakProperties> in a <CharacterDefinition> which does not have a <BasicType> of SPACING.

  4. A <Combines>, <Conjoins>, or <CharacterWidth> in a <CharacterDefinition> which does not have a <BasicType> of SPACING or NONSPACING.

  5. Assigning an object other than 0 for the kixt:combiningClass predicate for a subject whose kixt:extends predicate is not true or whose kixt:basicType predicate is not kixt:NONSPACING.

  6. Assigning an object other than https://spec.go.kibi.family/ns/kixt/#GENERIC for the kixt:compatibilityMode predicate for a subject whose kixt:compatibility predicate has an object with one kixt:slot predicate whose object has one kixt:item predicate whose object is the subject itself.

    In other words, if a character has a compatibility decomposition of itself, then it must have the default compatibility mode of kixt:GENERIC.

  7. Assigning the same value as the object of a kixt:name or kixt:alias predicate for two different subjects of the same rdf:type (kixt:name and kixt:alias must be unique within a shared namespace).

  8. Assigning i18n:zinh as the object of a kixt:script predicate while processing a <CharacterDefinition> which does not contain a <Combines>.

  9. Assigning the same object for a kixt:codepoint predicate while processing two different <CharacterInfo>s.

  10. Assigning multiple objects with the same length for a kixt:representativeGlyph predicate for a single subject.

  11. Finishing processing the Kixt Charset Definition when not every kixt:character predicate with a subject of current charset has an object for which a kixt:basicType predicate has been assigned.

    Another way of expressing this constraint is that every <Codepoint> in a <CompatibilityMapping>, <DecompositionMapping>, or <Reference> must identify a kixt:Character defined in the same document.

  12. Creating a kixt:Charset which is not variable‐width‐compatible but for which kixt:variable is true.

A Kixt Charset Definition is valid if it is not invalid.

The processing behaviours for an invalid Kixt Charset Definition are undefined.
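Some of these constraints can be checked mechanically once the graph has been built. As a non-normative sketch, the Python below checks constraints 10 and 11 over a list of triples held as tuples; the function names and the graph representation are assumptions of this example, and the remaining constraints are not covered.

```python
def undefined_characters(triples, charset):
    """Constraint 11: characters of the charset with no kixt:basicType assigned."""
    typed = {s for s, p, o in triples if p == "kixt:basicType"}
    return sorted({o for s, p, o in triples
                   if s == charset and p == "kixt:character" and o not in typed})

def duplicate_glyph_lengths(triples):
    """Constraint 10: subjects having two representative glyphs of equal length."""
    seen = set()
    offenders = set()
    for s, p, o in triples:
        if p == "kixt:representativeGlyph":
            key = (s, len(o))
            if key in seen:
                offenders.add(s)
            seen.add(key)
    return sorted(offenders)
```

A graph for which both functions return empty lists satisfies these two constraints; it may still be invalid under the others.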

4.2 Compatibility

The following predicates are compatibility properties:

A Kixt Charset Definition is U·T·F‐8‐compatible if it is valid and does not assign any of the following codepoints:

A Kixt Charset Definition is null‐compatible if it is valid and the objects of the compatibility properties are equal to those defined in the following Kixt Charset Definition for all characters so defined:

;CHARSET<https://spec.go.kibi.family/-/kixt-charset/null>1.0

% ASCII CONTROLS AND BASIC LATIN
' <https://www.w3.org/ns/i18n#zyyy>

U+0000
; 00 NULL (FORMAT)
= NUL
* This is a meaningless format character which can be used for byte-padding when encoding texts

A Kixt Charset Definition is A·S·C·I·I‐compatible if it is valid and the objects of the compatibility properties are equal to those defined in https://spec.go.kibi.family/-/kixt-charset/ascii for all characters so defined.

All A·S·C·I·I‐compatible charsets are null‐compatible.

5. Changelog

Added <LinebreakProperties>.

Removed DIVIDER as a <SegmentationClass> and replaced kixt:segments with kixt:extends. Required characters with a nonzero kixt:combiningClass to be nonspacing.

New U·R·L’s and minor revisions.

kixt:TRANSMISSION has been replaced by kixt:MESSAGING as a kixt:basicType.

Allowed the specification of other names and notes on charsets, blocks, and scripts, and aliases on blocks.

Added a variable‐width promise (kixt:supportsVariableEncoding) to charset declarations.

The syntaxes for <Integer> and <InnerCommentLine> were improved.

Redefined a number of syntax components to make it possible to write a Kixt Charset Definition in any ECMA‐6‐compatible character set. By extension, you can now write a Kixt Charset Definition in any X·M·L‐compatible character set. (This is a breaking change.)

Allowed U·T·F‐16–encoded documents with the addition of a B·O·M.

Required processors to ignore U+0000 NULL characters which appear in Kixt Charset Definition documents.

Removed compatibility definitions that are better‐served in other specifications.

Added <MultiLineComment>s.

Allowed <SingleLineComment>s to appear inside of <CharacterDefinition>s.

Initial specification.