#!!!!!! AND THE KLINGON OPTION HAS THE OPTION OF EITHER ENGLISH TEXT OR KLINGON TEXT I AM IN LOVE
goofyjelly · 3 months ago
!!!! i went to a local store that sells used games for good prices, look what I found!!!
[two photos of the game]
arigatogatos · 6 years ago
Making the Draenei Language - Part 1
So recently I've been inspired by the works of many other language creators, such as David J. Peterson (of Dothraki and High Valyrian fame), Paul Frommer (who made Na’vi), and some conlangers on YouTube such as Biblaridion and Artifexian, and decided to look around to see whether any effort had been made to flesh out the languages of Azeroth.
(Much more under the cut!)
As a Draenei and linguistics enthusiast, I of course checked whether anything had been done with the language of my favourite fantasy race, and after being alerted by @nessacity, I found the Te'Amun Word Compendium made by the Te’Amun guild on Moonguard. Sadly this hasn't been updated since around May 2016 and is largely a naming language, though there is a start to some internal grammar (something I will start dissecting later in the post). Apart from the Te’Amun Word Compendium I couldn't find any other efforts to expand the Draenei vocabulary, so I decided to do some research for myself into what we already have in canon.
Thankfully, people on wikis and forums have already compiled all the known Draenei text and names currently in the game, along with translations where there are any, and so I did what any normal person would do: made a spreadsheet of the data and looked for clues.
[screenshot of the spreadsheet of known Draenei words, translations, and pronunciations]
I also went through and added probable pronunciations (more on that later). So, first things first, let's have a look at what can be extrapolated from what we have so far in canon.
drae | a prefix/adjective that means ‘exiled’. This can be seen in draenei - ‘Exiled Ones‘ and Draenor - ‘Exile’s Refuge.‘
sha | This fairly clearly means ‘Light’. This can be seen in Aar-don'sha, kai kahl'dos - A line spoken by Yrel which she translates as ‘In the Light, we triumph’, Ore Atal’sha - ‘For the Light!’ (Yet again spoken by Yrel), and in other names such as Sha'tar - ‘Born of Light’ and Shattrath - ‘Dwelling of Light.’
-nei | A derivational suffix that means ‘A person who is/has...’ Seen in Draenei
-nai | Something similar to the above, as it is mostly used in words that refer to groups of people. Seen in Kurenai and Auchenai 
-ari | Derivational suffix probably meaning ‘One who does...’ Seen in Man’ari, Rangari
-trath | Another derivational suffix meaning ‘place of...’ Seen in Shattrath.
-dor | A suffix which typically marks settlements. Seen in Talador, Telredor, Karabor.
-nor | A suffix probably meaning ‘refuge’. Seen in Draenor ‘Exile’s Refuge’
-naar | Another suffix for places. Seen in Sha’naar
This is all stuff that many others have clued into, and it's really nice to see that, at least to some extent, the writing staff put some thought into the structure of the language in the game. From here I decided to see if there was any more underlying structure.
The phrase that really caught my eye was ‘Shanai ortar’. This is spoken by a Draenei NPC called Apprentice Miall in the quest ‘In Ared’s Memory’ in WoD Draenor, after her mentor has died. Here’s the text in context:
My master has gone on to the Light, but I will continue his work. <Miall looks to Ared > Shanai ortar, Ared. I will have his body sent to Auchindoun....
I suspect this phrase is some sort of blessing, akin to something like ‘be at peace.’ Noting also that shanai contains the root sha, meaning light, I can only imagine a direct translation of this phrase being something along the lines of ‘be with the Light’. We have also seen the -nai suffix before, but I think this really solidifies its meaning: if we take -nai to be an adpositional affix (a postposition in this case) meaning ‘be with’ or just ‘with’, we get shanai meaning ‘with the Light’.
Extrapolating this onto Auchenai, through context I think a literal translation would be ‘with the dead’. This sounds weird in English but could definitely be a productive part of Draenei grammar! I think Kurenai, which on the wiki is translated as ‘Redeemed’, could in this case be literally translated as something like ‘with honour’, as they are a group of escaped slaves who have regained their honour.
Okay, back to the ‘Shanai ortar’ sentence. Practically any clause or phrase has to have a verb in it. As Gretchen McCulloch put it on the wonderful podcast Lingthusiasm - “The verb is the coat rack that the rest of the sentence hangs on”. As Shanai is pretty definitely a noun or adpositional phrase (depending on whether you want to analyse -nai as a postposition or as a comitative noun case, but that’s for later :P), that must mean ortar is a verb of some sort. I'm going to analyse -nai as a comitative noun case, which I feel is more interesting (and allows Shanai to be the object of the sentence). Taking ortar to mean ‘go/pass (in the metaphysical sense)’ and Ared (the dead dude) to be an omitted or implied subject, we get a translation of ‘(Ared/You) go with the Light’.
What’s really important here is the word order. We have the verb last in the sentence with the object preceding it. This pins Draenei as having either SOV word order (like Korean and Turkish; it's also, coincidentally, the most common word order cross-linguistically, at 41% of the world's languages), OSV word order (like Xavante and Warao; this is also the least common word order cross-linguistically, at 0.3% of the world's languages!) or OVS word order (like Hixkaryana; another very uncommon order, at 0.8% of the world's languages).
Using the English sentence ‘I hit him’ as an example, the three orders look like this:
SOV: I him hit
OSV: him I hit
OVS: him hit I
It's only at this point that I really decided I actually want to try to make this into a full language, so I have to pick a default word order (that is, unless I want to do case marking on the nouns or subject marking on the verbs, which is an option!). Draenei being a race of aliens, I’d imagine their language being quite dissimilar to what we have on Earth, so I decided to go with one of the rarer word orders. Out of the two rarer options I chose OSV, because Klingon is OVS and, ya know... I just don’t want to copy Klingon when making an alien language!
Sadly this doesn't mesh with what the Te’Amun Word Compendium has :C. I really like the system of verbal agreement they’ve got going (sorta like Spanish, though I believe in their system it's the actual pronoun rather than an agreement marker), so I might incorporate that into what I'm doing, but in a different way.
[screenshot from the Te’Amun Word Compendium showing its verbal agreement system]
Anyway this post is getting real long so I'm going to do a part 2 soon on the sounds/pronunciation of the language (the phonology) so get excited for that! Thanks so much for reading if you made it all the way down here, and let me know if you have any suggestions for what to add to the language!
EDIT: @sielic has actually made a Draenei conlang called Tendral and made a really great video about it here.
beesandwasps · 6 years ago
Everything You Will Ever Need To Know About Unicode (And Many Things You Will Not Need To Know As Well)
(This post is specifically being written so I have a single URL which covers all the points involved, and therefore does not necessarily cover all technical/historical points thoroughly even though it is extremely long. For your convenience in skipping it, the whole thing is below the fold.)
A Little History
Once upon a time, computers with keyboards instead of punch cards and switches were new and magical things. Since computers deal with complex data by converting it into numbers, a system was needed to map text to numbers. This kind of system, where numbers in some range (usually beginning with 0 and ending at some binary-significant value) are mapped to printing characters, is called a character mapping.
After a certain number of false starts, American computer companies and their English-speaking foreign counterparts settled on ASCII (the American Standard Code for Information Interchange). Most people, even now, know a little bit about ASCII — the main points are:
Uses values from 0 to 127 (that is, values which can be represented in seven of the eight bits of one byte)
Values from 0 to 31 and the lone value 127 are non-printing characters (including a “backspace” character) intended to transmit control instructions
Contains the digits from 0 to 9, the latin alphabet without accents in both upper- and lower-case, and 32 symbols; the digits are in sequence, as is the alphabet.
Does not contain “curly quotes”, accented characters like “é”, or currency symbols other than the dollar sign “$”
Does not specify which non-printing character signifies a line break/paragraph ending
The line break issue is of some importance, so it’s worth explaining: ASCII as originally formulated treats text more or less as a stream of characters passing through the print head (the “carriage”) of an electric typewriter or old-fashioned line printer. As such, it has two characters which deal with moving the carriage to a new paragraph: “line feed” (LF), which moves the head down one line, and “carriage return” (CR), which moves the head to the extreme left edge of the line. ASCII says that LF is 10 (hexadecimal 0x0A), and CR is 13 (hexadecimal 0x0D). On an electric typewriter or line printer, where there is a physical moving part, a new paragraph is a carriage return and a line feed, in either order, but a computer generally doesn’t need two characters. With typical contrariness, the major families of operating systems adopted three standards:
POSIX-y systems (Unix, and eventually Linux) decided that LF meant a new paragraph, and CR meant nothing.
DOS (and eventually Windows) decided to follow the electric typewriter model, and used a carriage return followed by a line feed, CRLF. Either character by itself meant nothing, and LFCR meant nothing.
“Classic” Mac OS decided that CR meant a new paragraph, and LF meant nothing. (Mac OS X — which is a reworking of an older OS originally known as NeXTSTEP — is actually a Unix variant; any GUI program using the built-in text APIs will auto-translate all three options when reading a text file, and write LF.)
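To make the difference concrete, here is a small Python sketch (not tied to any particular tool) showing the three conventions at the byte level, along with the blunt normalization that many programs apply:

```python
# The same two-line text under the three historical line-break conventions.
unix_text    = b"line one\nline two"      # POSIX/Linux: LF (0x0A)
windows_text = b"line one\r\nline two"    # DOS/Windows: CR LF (0x0D 0x0A)
classic_mac  = b"line one\rline two"      # Classic Mac OS: CR (0x0D)

# A common (if blunt) way to normalize everything to LF:
def normalize_newlines(data: bytes) -> bytes:
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

assert normalize_newlines(windows_text) == unix_text
assert normalize_newlines(classic_mac) == unix_text
```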
Although ASCII is good enough for programming in most computer languages, which tend to be English-like and designed for a certain amount of backwards compatibility, it was obviously not good enough for other forms of text, which demand better punctuation and support for languages other than English.
There was a longish period during which a number of other character mappings were used in different contexts. Many of them were “extended ASCII” character sets, which used one byte per character, and filled the extra 128 values left by ASCII with extended punctuation and accented characters as support for other languages. By and large, these were encoded into a family of standards by the International Organization for Standardization (ISO), known collectively as ISO 8859. The most common of these is ISO 8859-1, “ISO Latin-1”, but there are others as well. (Most versions of Windows use a modified version of ISO Latin-1 known as Windows-1252, and the “Classic” Mac OS had an equivalent mostly-overlapping character mapping for European languages, Mac Roman.)
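As a quick illustration of how these mappings overlap but disagree, the following Python sketch decodes the same bytes as Windows-1252 and as ISO Latin-1 (the byte values used are the standard ones for those encodings):

```python
# 0x93 and 0x94 are printable "smart quotes" in Windows-1252, but in
# ISO 8859-1 (Latin-1) they map to unused C1 control codes instead.
print(b"\x93quoted\x94".decode("cp1252"))    # “quoted”
print(b"\x93quoted\x94".decode("latin-1"))   # control characters, not quotes
```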
(As a purely historical note: IBM had its own family of character mappings entirely distinct from ASCII, known collectively as EBCDIC: Extended Binary Coded Decimal Interchange Code. For many reasons, EBCDIC did not catch on — among other things, the latin alphabet was never encoded as a single set of consecutive values — and it is included here primarily so that you can win trivia contests.)
Other languages, however, could not base their character mappings on ASCII. Real support for Chinese, Japanese, and Korean requires at least some large subset of the multi-thousand-character system called “漢字” (Pronounced “Hanzi” in Chinese, “Kanji” in Japanese, and “Hanja” in Korean), among other non-ASCII things. For this reason, there were also many pre-Unicode character mappings which used more than one byte per character, and even a few which used a variable number of bytes per character. All of these systems tended to include all the characters of ASCII, but were obviously not directly mappable to any ASCII-based single-byte character mapping.
This was the situation in the late 1980s, when Xerox and Apple began working on what would eventually become Unicode. The first version of Unicode — a fixed-size two-byte encoding intended purely for modern languages in active use — was published in 1991, and an expanded version (essentially the current system from a technical standpoint) followed in 1996.
What Unicode Is, And What It Isn’t
Unicode is a character mapping (and a few other things as well which we won’t go into). It attempts to fulfill (sometimes with more success than others) certain specific goals:
Every character set used in “real” human communication should be representable. (There is a certain amount of fussiness over what “real” means — Klingon is not included, for instance, because it was deliberately invented to be “alien”, but Shavian phonetic script is included.)
For every character set included in Unicode, the order of the characters should ideally follow some popular pre-existing character mapping, if there is one, or at least have some technical justification even if it is only “this is the order of the alphabet in this language”.
Every included character has a numeric value — its “code point” — and a unique name in English, usually represented in all capitals. A code point is properly given as “U+” followed by the value in hexadecimal, padded to at least 4 digits. (That is, “U+0043”, not “U+43”.)
Characters which modify other characters, such as accent marks, should be included. If a modified character is common enough, it should appear as its own character. (For example, a capital “A” with a grave accent, “À” can be represented by U+0041, “LATIN CAPITAL LETTER A”, followed by U+0300, “COMBINING GRAVE ACCENT” — but it can also be represented as U+00C0, “LATIN CAPITAL LETTER A WITH GRAVE”.) Algorithms are provided to “compose” these combinations into single characters where appropriate, and to compare strings properly with consideration for composition.
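Python’s standard unicodedata module exposes the composition machinery just described, so a short sketch can show both forms of “À” as well as the U+XXXX naming convention:

```python
import unicodedata

decomposed = "A\u0300"   # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
composed   = "\u00C0"    # LATIN CAPITAL LETTER A WITH GRAVE

print(len(decomposed), len(composed))                          # 2 1
print(decomposed == composed)                                  # False: different code points...
print(unicodedata.normalize("NFC", decomposed) == composed)    # ...but NFC composes them
print(unicodedata.normalize("NFD", composed) == decomposed)    # ...and NFD decomposes them

# Code points are conventionally written U+XXXX (at least four hex digits):
print(f"U+{ord(composed):04X}", unicodedata.name(composed))
# U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
```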
The Unicode standard, which is at version 12 as of this writing, is maintained by the Unicode Consortium, which is an international organization not under the control of any company, but including representatives from some companies as well as academics specializing in language. Note that Unicode is not a piece of software, nor is it a font. It is a character mapping, which must be implemented by software companies and font designers who want to display text. Although the application to include new characters in Unicode (or to modify existing ones) is not particularly complex, the process involves deliberation and tends to move very slowly.
It is important to remember that a Unicode code point is distinct from the bytes which will be recorded in a file. Unicode by itself does not specify how a code point should be recorded in bytes, merely which number values correspond to which characters (and how they interact). Unicode code points range between 0 and 0x10FFFF, which means that it would be theoretically possible to encode Unicode directly to disk as a fixed-size 3-byte encoding. In practice, this does not happen; more about this topic below. Certain specific code points are deliberately unused, most notably the 2048-code-point range from U+D800 to U+DFFF and the specific value U+FFFE.
The code points are usually (with a few exceptions) arranged into groups of related characters by language or purpose. These sections are known as “code tables” (the standard itself calls them “blocks”), and always contain a multiple of 16 characters, some of which may be unused, so that any two code points whose hexadecimal values differ only in the last digit are always in the same code table. (For instance, Mongolian is U+1800 through U+18AF.)
Some general notes on the basic structure of Unicode:
The values from 0 to 255 are identical to those in ISO 8859-1, so that ASCII and the most common “European” encoding have a direct conversion of values.
Most (though not all) character sets which would have a single-byte encoding by themselves, but are not based on the latin alphabet, appear fairly early in the list. (Greek, Cyrillic, Hebrew, Cherokee, and so on.)
The majority (though again not all) of characters used commonly in modern languages appear in the Basic Multilingual Plane, or BMP, which is the range of values from U+0000 to U+FFFF. (In fact, Unicode 1.0 was the BMP.) There are other “planes” as well, and certain ranges of values are designated as “Private Use Areas”, so that programmers can use characters which are explicitly not part of ordinary text without having to switch between Unicode and other systems. (Apple, for example, stores their logo in the system fonts which come with Mac OS X as U+F8FF, so that it can be used in the menu bar and in text.)
In many cases, the Unicode Consortium later added additional characters for a specific language or character set, and so there are many small code tables containing “supplementary” or “extended” characters to previously-defined code tables. (The latin alphabet has 8 code tables so far, starting with “Basic Latin” for ASCII and “Latin-1 Supplement” for the rest of ISO Latin-1, but extending into extraordinarily rare characters up through “Latin Extended-E”, with a brief side trip to “Latin Extended Additional”.) Although the latin alphabet and the CJK character sets have more code tables than usual, there are “supplementary” and “extended” code tables for many others as well.
The 漢字 characters for Chinese, Japanese, and Korean appear only once, mostly in the massive code table “CJK Unified Ideographs”, which stretches from U+4E00 to U+9FFF. (The characters often have different appearances in the different languages, which means that there is a useful distinction between, say, the Korean version of a character and the Chinese version, which Unicode by itself does not preserve, making the single inclusion a contentious decision.) The characters generally appear in order of complexity, with the most common/simplest ones appearing at lower code points — but, as anybody who has perused a reference work involving 漢字 will tell you, “simpler” is not only a relative term but one on which different authorities disagree.
In addition to characters used in spoken languages, Unicode also contains a very wide range of shapes and symbols. Usually, these characters are added in response to their inclusion in some communications system outside of the Unicode Consortium’s control. (For example, emoji were originally taken from Japanese cell phone texting systems, which is why so many of the early emoji in Unicode are Japanese cultural items, like U+1F359 “🍡”, dango. That’s also why all the “face” emoji started off without alternate forms for skin color.) These non-language characters tend to appear in two places in the Unicode table: just before the 漢字 characters, and towards the end of the currently-defined characters.
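Here is a small Python sketch (standard library only) that touches the BMP, a supplementary plane, and the private use area mentioned above:

```python
import unicodedata

for ch in ["A", "漢", "🍡", "\uF8FF"]:
    cp = ord(ch)
    plane = cp >> 16                     # plane 0 is the BMP
    try:
        name = unicodedata.name(ch)
    except ValueError:                   # private-use and unassigned code points have no name
        name = "<no name: private use or unassigned>"
    print(f"U+{cp:04X}  plane {plane}  {name}")

# U+0041   plane 0  LATIN CAPITAL LETTER A
# U+6F22   plane 0  CJK UNIFIED IDEOGRAPH-6F22
# U+1F359  plane 1  DANGO
# U+F8FF   plane 0  <no name: private use or unassigned>
```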
Recording Unicode On Disk: Byte Sequences and UTF
In theory, you could record all Unicode characters using a 3-byte encoding; more commonly, people think of Unicode as having 4-byte values. While this is not technically incorrect, it is worth remembering that Unicode itself never uses values above 0x10FFFF. (On the other hand, there is a “larger” standard known as the Universal Character Set, UCS, which is currently defined as being identical to Unicode for all defined values, but which originally allowed much larger values (up to 31 bits), and most software is written with an eye on UCS.)
A method of converting Unicode code points into values on disk is a Unicode Transformation Format (or, if you like, a UCS Transformation Format), abbreviated UTF. The obvious, simple, and almost unused method is to record text as a stream of 4-byte values. This is known as UTF-32, or UCS-4, and is extremely rare in practice.
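A two-line Python illustration of why the fixed four-byte form is considered wasteful (the zero bytes are the “waste”):

```python
print("A".encode("utf-32-be").hex())    # 00000041 - three zero bytes per ASCII letter
print("🍡".encode("utf-32-be").hex())   # 0001f359 - even the largest code points waste a byte
```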
There are two common encodings (one of which comes in two variants) which reduce the amount of wasted space (bytes of value 0) recorded to disk:
UTF-8
UTF-8 is by far the most common encoding found on the web, and also in text files used by programmers. The premise is simple: characters which appear in strict ASCII take up one byte with the high-order bit set to 0 (that is, with a value from 0 to 127), other characters use 2, 3, or 4 bytes with the high-order bit set to 1 (a value from 128 to 255). A valid strict ASCII text file is automatically a valid UTF-8 text file.
The Wikipedia article explains the algorithm in detail, but in short: multi-byte characters in UTF-8 are formatted so that the first byte indicates how many bytes will be used, and the remaining bytes cannot be mistaken for the first byte of a multi-byte character. That means UTF-8 is useful across unreliable transmission methods — if a single byte is lost, the character corresponding to that byte will be lost, but the rest of the text is not jumbled.
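The variable length is easy to see by encoding a few characters and printing their bits; in this Python sketch the pattern of the first byte (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) announces the sequence length, and every continuation byte starts with 10:

```python
for ch in ["A", "é", "漢", "🍡"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s):",
          " ".join(f"{b:08b}" for b in encoded))

# U+0041 -> 1 byte(s): 01000001
# U+00E9 -> 2 byte(s): 11000011 10101001
# U+6F22 -> 3 byte(s): 11100110 10111100 10100010
# U+1F359 -> 4 byte(s): 11110000 10011111 10001101 10011001
```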
In addition, any system which was written with the assumption that all text was ASCII (which includes most command-line tools) will probably handle UTF-8 text without any problems, provided that they do not strip high-order bits or split up any adjacent multi-byte characters. (That is, they will be able to do things like search through the text or wrap it to a maximum line width for display without making it unreadable.) This makes it particularly convenient for programmers and system administrators, who frequently use tools which were written long ago and “think” in terms of pure ASCII text (or, at least, ASCII-based single-byte character mappings).
It is possibly important to remember that, contrary to popular belief, not every sequence of byte values is valid UTF-8 text. There are byte sequences which encode values that have no character mapping, byte sequences which encode values too large for the Unicode range, and “overlong” byte sequences which encode ASCII characters in multi-byte form — which would be readable but are still invalid.
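For example, a standards-conforming decoder (Python’s, in this sketch) rejects both an overlong encoding of an ASCII character and a sequence that would decode to a surrogate:

```python
# 0xC0 0xAF is an over-long two-byte encoding of "/" (U+002F); the standard
# forbids it precisely because it would be readable.
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e)

# Likewise, bytes that would decode to a surrogate code point are invalid:
try:
    b"\xed\xa0\x80".decode("utf-8")   # would be U+D800
except UnicodeDecodeError as e:
    print("rejected:", e)
```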
The algorithm for UTF-8 can actually handle the entire UCS character set, using up to 6 bytes per character for values outside the official Unicode mapping, and when the UTF-8 encoding was introduced, those additional values were part of the algorithm. By the official standard eventually set by the IETF in RFC 3629 (2003), UTF-8 is restricted to valid Unicode values, and therefore to no more than 4 bytes per character.
UTF-16 (and UTF-16BE and UTF-16LE)
Much though “tl;dr” synopses are horrible, this section can be summarized as: “unless you are defining your own file type (so that only your program will ever need to read it) and will be storing a lot of text which does not use the latin character set, do not use UTF-16 because it is a mess and invites trouble”.
UTF-16 (or, strictly, its fixed-width ancestor UCS-2) was — in Unicode 1.0 — the equivalent of UTF-32 in later versions, because all Unicode values used 2 bytes, meaning that recording two bytes per character was the simple, obvious thing to do. It was somewhat wasteful of space for most European languages, since it “wastes” a byte for any character in the ISO Latin-1 character mapping, but even in 1991 when Unicode 1.0 came out this was not an unreasonable strain on existing storage technology. The problems which eventually made UTF-16 unpopular were more subtle and varied.
First was the fact that there are two varieties of UTF-16. Some computers are “big-endian” (that is, the first byte in a multi-byte value is the byte which stores the larger portion of the value, the “big end”) and some are “little-endian”. (The terms are references to Gulliver’s Travels, and are traditional for describing this problem.) So by default, some computers “want” to write U+1234 as 0x1234, with the “big end” first (UTF-16BE) while others “want” to write it as 0x3412 with the “little end” first (UTF-16LE), and since U+3412 is just as valid a code point as U+1234, there is no automatic way to be sure whether a UTF-16 file from an unknown source was UTF-16BE or UTF-16LE.
In order to solve this problem, Unicode defined U+FEFF as a “Byte Order Mark” (BOM) which could be placed at the beginning of a UTF-16 text stream to show which one was in use. (Its character is defined as a “zero-width non-breaking space” — in other words, a character which is technically a space for the purposes of word-count or spell-checking, but takes up no space on the screen, and at which programs are not supposed to break up text for line-wrapping or layout purposes. Since this has no visible or analytical effect on the text at all, a correctly-interpreted BOM in a Unicode-supporting program does not alter the text and can even be safely included in UTF-8 text files where it is unnecessary.) Which solved the problem… except that the overwhelming majority of programs which used UTF-16 did not record a BOM in their files, regarding it as an unnecessary waste of space. As a result, most pure text editors which support UTF-16 have an option to force a file to open as either BE or LE.
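A short Python sketch of the two byte orders and the BOM (the output of the plain “utf-16” codec shown here assumes a little-endian machine):

```python
text = "hi"

print(text.encode("utf-16-be").hex())   # 00680069 - the "big end" of each value first
print(text.encode("utf-16-le").hex())   # 68006900 - the "little end" first

# The plain "utf-16" codec writes a BOM (U+FEFF) in the platform's native order:
print(text.encode("utf-16").hex())      # fffe68006900 on a little-endian machine

# A decoder that sees the BOM knows which order to use and consumes it:
print(b"\xff\xfe\x68\x00".decode("utf-16"))   # "h"
```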
When Unicode 2.0 increased the range of valid code points all the way to U+10FFFF, two bytes were no longer sufficient to cover all possible characters. UTF-16 was amended to use approximately the same trick as UTF-8 to extend its range. As it happened, no characters had yet been assigned to the range U+D800 to U+DFFF, and so this range was declared to be permanently unused; characters with code points above U+FFFF are recorded in UTF-16 as a pair of 2-byte values in this range. (More specifically, the first value will be in the range 0xD800 to 0xDBFF, and the second value will be in the range 0xDC00 to 0xDFFF; this gives 1024×1024 combinations, exactly the number of values from 0x010000 to 0x10FFFF — by design, not coincidence.) As a result, “modern” UTF-16 is a variable-length encoding — most of the time, it uses two bytes per character, but sometimes uses four.
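The surrogate-pair arithmetic is simple enough to do by hand; this sketch reproduces it for the dango emoji and checks the result against a real encoder:

```python
cp = 0x1F359                      # 🍡, outside the BMP
v = cp - 0x10000                  # 20-bit value to split across the pair
high = 0xD800 + (v >> 10)         # top 10 bits    -> 0xD83C
low  = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> 0xDF59
print(hex(high), hex(low))        # 0xd83c 0xdf59

# The standard library agrees:
print("🍡".encode("utf-16-be").hex())   # d83cdf59
```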
If a single byte is missed or deleted in transmission (or any odd number of bytes), the rest of the text is garbled, and unless an invalid combination like an unmatched U+D800 occurs, there is no way to detect this algorithmically. If multiple bytes are lost at random, the text will switch back and forth to and from gibberish.
Finally, and fatally in many cases, UTF-16 often chokes old programs which “think” in ASCII, which are often used by programmers. Unfortunately, a very large number of the individual bytes in valid UTF-16 text are also valid ASCII values, so programs which expect to be able to alter ASCII text for convenience can mangle UTF-16 text irreparably.
In particular, line break characters can be a problem with UTF-16 text. Tools such as Git often default to some sort of “auto-translation” mode when retrieving code files, but are not smart enough to catch on to multi-byte character encodings. A file which is converted to the Windows paragraph ending, CRLF, from (usually) the POSIX paragraph ending, LF, will have an extra byte inserted at every line break, garbling all the characters in even-numbered paragraphs. Although it is possible to shut off a given tool’s auto-translation mode, it is much easier simply to avoid UTF-16 encoding in favor of UTF-8 from the start.
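The damage is easy to reproduce; this sketch simply performs, at the byte level, the LF-to-CRLF substitution that such auto-translation amounts to:

```python
original = "line one\nline two".encode("utf-16-le")

# What a byte-oriented LF -> CRLF converter does to that file:
mangled = original.replace(b"\n", b"\r\n")

print(original.decode("utf-16-le"))            # fine
print(mangled.decode("utf-16-le", "replace"))  # everything after the break shifts into gibberish
```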
UTF-7
UTF-7 is primarily used in e-mail. You are unlikely to use it, and it is listed here purely to head off questions. It records unaccented latin letters, numbers, and a few symbols directly as their ASCII equivalents, “+” as the sequence “+-”, and runs of all other characters as a “+” followed by the base64-encoded UTF-16BE values of those characters, terminated by “-” where needed. This means that the entire text will consist of ASCII characters, thus conforming to MIME requirements. Since this is extremely wasteful of space without any particular benefit in terms of readability or formatting, it is generally unused as a file format.
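For completeness, a tiny Python sketch of the escaping rules (Python ships a utf-7 codec, so no third-party code is assumed):

```python
print("1 + 1".encode("utf-7"))        # b'1 +- 1'    - "+" escaped as "+-"
print("Héllo".encode("utf-7"))        # b'H+AOk-llo' - "é" as base64 of its UTF-16BE value
print(b"H+AOk-llo".decode("utf-7"))   # 'Héllo'
```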