Unicode® 12.0.0
Version 12.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 12.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
A. Summary
Unicode 12.0 adds 554 characters, for a total of 137,928 characters. These additions include 4 new scripts, for a total of 150 scripts, as well as 61 new emoji characters.
The new scripts and characters in Version 12.0 add support for lesser-used languages and unique written requirements worldwide. Funds from the Adopt-a-Character program provided support for some of these additions. The new scripts and characters include:
- Elymaic, historically used to write Achaemenid Aramaic in the southwestern portion of modern-day Iran
- Nandinagari, historically used to write Sanskrit and Kannada in southern India
- Nyiakeng Puachue Hmong, used to write modern White Hmong and Green Hmong languages in Laos, Thailand, Vietnam, France, Australia, Canada, and the United States
- Wancho, used to write the modern Wancho language in India, Myanmar, and Bhutan
- Miao script additions to write several Miao and Yi dialects in China
Popular symbol additions:
- 61 emoji characters, including several new emoji for accessibility. For complete statistics regarding all emoji as of Unicode 12.0, see Emoji Counts. For more information about emoji additions for Unicode 12.0, including new emoji ZWJ sequences and emoji modifier sequences, see Emoji Recently Added, v12.0.
- Marca registrada sign
- Heterodox and fairy chess symbols
Additional support for lesser-used languages and scholarly work was extended worldwide, including:
- Hiragana and Katakana small letters, used to write archaic Japanese
- Tamil historic fractions and symbols, used in South India
- Lao letters, used to write Pali
- Latin letters used in Egyptological and Ugaritic transliteration
- Hieroglyph format controls, enabling full formatting of quadrats for Egyptian Hieroglyphs
Important glyph corrections, including:
- Bopomofo, with significant improvements
- Won currency sign, changed to align with modern Korean fonts
Synchronization
Several other important Unicode specifications have been updated for Version 12.0.
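The following four Unicode Technical Standards are versioned in synchrony with the Unicode Standard, because their data files cover the same repertoire. All have been updated to Version 12.0:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing
- UTS #51, Unicode Emoji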
Some of the changes in Version 12.0 and associated Unicode Technical Standards may require modifications to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.
This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2017, fifth edition,
plus Amendments 1 and 2 to the fifth edition,
plus the following additions from the CD for the sixth edition:
- 61 emoji characters
- U+1E94B ADLAM NASALIZATION MARK
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
B. Technical Overview
Version 12.0 of the Unicode Standard consists of:
- The core specification
- The code charts (delta and archival) for this version
- The Unicode Standard Annexes
- The Unicode Character Database (UCD)
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
The core specification is available as a single PDF for viewing (14 MB). Links to individual chapters and appendices of the core specification are also available in the navigation bar on the left of this page.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 12.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the new blocks and any blocks in which characters were added for Unicode 12.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents the entire set of characters, names, and representative glyphs at the time of publication of Unicode 12.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Links to the individual
Unicode Standard Annexes are available in
the navigation bar on the left of this page. The list of significant changes
in the content of the Unicode Standard Annexes for Version 12.0 can be found
in Section G below.
Data files for Version 12.0 of
the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap
to the functions of the various subdirectories.
Zipped versions of the UCD
for bulk download are available, as well.
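Many UCD data files share a simple line layout: semicolon-separated fields, an optional ".." code point range in the first field, and "#" comments. As a minimal sketch (not part of the standard), the Python function below parses one such line. The name parse_ucd_line and the sample line are illustrative assumptions; the ReadMe.txt and UAX #44 remain the authoritative description of each file's format, and some files (UnicodeData.txt, for example) follow different conventions.

```python
def parse_ucd_line(line):
    """Parse one line of a range-style UCD data file.

    Returns (start, end, fields) or None for blank and comment-only lines.
    Assumes the common layout: 'XXXX[..YYYY] ; field ; ... # comment'.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start  # single code point: end == start
    return start, end, fields[1:]


# Illustrative sample line in the style of Scripts.txt (written for this sketch).
sample = "10FE0..10FF6  ; Elymaic # sample comment"
start, end, fields = parse_ucd_line(sample)
print(f"U+{start:04X}..U+{end:04X} {fields} count={end - start + 1}")
```

Counting the code points in each parsed range gives a quick cross-check against the per-script totals reported for a release.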
Version 12.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 12.0.0, (Mountain View, CA: The Unicode Consortium,
2019. ISBN 978-1-936213-22-1)
http://www.unicode.org/versions/Unicode12.0.0/
The terms “Version 12.0” or “Unicode 12.0” are abbreviations for the full version reference, Version 12.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
A complete specification of the contributory files for Unicode
12.0 is found on the page Components for 12.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Errata incorporated into Unicode 12.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 12.0, see the list of current
Updates and Errata.
C. Stability Policy Update
There were no significant changes to the Stability Policy of the core specification between Unicode 11.0 and Unicode 12.0.
D. Textual Changes and Character Additions
Four new scripts were added with accompanying new block descriptions:
Script | Number of Characters
Elymaic | 23
Nandinagari | 65
Nyiakeng Puachue Hmong | 71
Wancho | 59
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
554 characters have been added.
Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see the delta code charts.
E. Conformance Changes
There are no significant new conformance requirements in Unicode 12.0.
F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 12.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.
G. Changes in the Unicode Standard Annexes
In Version 12.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | Text was added in BD2 to guarantee that max_depth can be treated as a constant (with value 125).
UAX #11 East Asian Width | No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm | The behavior of NNBSP was clarified for Mongolian. References to CLDR and UTS #35 as a source for tailoring were added.
UAX #15 Unicode Normalization Forms | No significant changes in this version.
UAX #24 Unicode Script Property | No significant changes in this version.
UAX #29 Unicode Text Segmentation | The derivation of Lower and Upper for Sentence_Break was updated for Georgian, to account for the difference in how casing in Georgian interacts with sentence boundaries. Surrogate code points were moved from Control to XX for the Grapheme_Cluster_Break property, to eliminate the need to have isolated surrogate code points in the test cases. Fullwidth digits were moved to Numeric for Word_Break and Sentence_Break, to address an inconsistency in handling of boundaries for digits.
UAX #31 Unicode Identifier and Pattern Syntax | The context specified for A2 was tightened up, by requiring $Letter at the end of the sequence. The new scripts for Unicode 12.0 were added to Tables 4 and 7.
UAX #34 Unicode Named Character Sequences | The occurrence of initial hyphen-minus in Unicode character names was clarified.
UAX #38 Unicode Han Database (Unihan) | The syntax and/or descriptions for several Unihan data fields were significantly updated: kIRG_GSource, kIRG_JSource, kIRG_KSource, and kIRG_TSource. The discussion of kDefaultSortKey was replaced with a description of the actual sorting algorithm used to generate the radical-stroke charts.
UAX #41 Common References for Unicode Standard Annexes | All references were updated for Unicode 12.0.
UAX #42 Unicode Character Database in XML | New code point attributes, values, and patterns were added.
UAX #44 Unicode Character Database | Clarification was added about the meaning of “abbreviated” property aliases. The note on the derivation of Default_Ignorable_Code_Point was updated to account for the Egyptian Hieroglyph format controls. The note about Grapheme_Extend was updated to explain the current relationship to GCB=Extend. Documentation was added for the new file USourceRSChart.pdf in Table 5.
UAX #45 U-Source Ideographs | New values, A and B, were added to the status field, to account for CJK ideographs encoded in Extensions A or B. Documentation was added regarding the addition of a new comments field to the data file, USourceData.txt. Numerous entries have been added to that data file, and many entries were corrected to indicate their correct extension and code point, if encoded.
UAX #50 Unicode Vertical Text Layout | No significant changes in this version.
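The UAX #9 entry above notes that max_depth can now be treated as a constant with value 125. As a rough sketch of what that allows, the Python fragment below hard-codes the limit and computes the next embedding level for an explicit embedding or override: the least odd level above the current one for right-to-left, the least even level above it for left-to-right. The names MAX_DEPTH and next_embedding_level are assumptions made for this illustration, and the overflow counters that the full algorithm maintains are deliberately left out.

```python
MAX_DEPTH = 125  # value stated for max_depth in UAX #9, BD2


def next_embedding_level(current, direction):
    """Return the next explicit embedding level, or None if it would exceed MAX_DEPTH.

    direction is "R" (right-to-left: least odd level greater than current)
    or "L" (left-to-right: least even level greater than current).
    Simplified sketch: overflow bookkeeping from the full algorithm is omitted.
    """
    if direction == "R":
        new_level = (current + 1) | 1
    elif direction == "L":
        new_level = (current + 2) & ~1
    else:
        raise ValueError("direction must be 'L' or 'R'")
    return new_level if new_level <= MAX_DEPTH else None


print(next_embedding_level(0, "R"))
print(next_embedding_level(124, "L"))
```

Because the limit is a compile-time constant, an implementation can size its directional status stack once instead of allocating it dynamically.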
H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes
UTS #10 Unicode Collation Algorithm | No significant changes in this version.
UTS #39 Unicode Security Mechanisms | The discussion of simplified versus traditional CJK characters as part of the enhancements for spoof detection was removed, because any effective approach for that would need to be more sophisticated. The criteria for exclusions for the listing of Not_XID in the data files were clarified.
UTS #46 Unicode IDNA Compatibility Processing | Table 4, IDNA Comparisons was frozen at the Unicode 11.0 level, with appropriate recaptioning and explanation added. Additional tweaks to the stats in the table for each subsequent release have proven to be of little additional benefit.
UTS #51 Unicode Emoji | Several definitions were updated, and a new definition for “RGI Set” was added. A new section about marking gender in emoji input has been added, as well as numerous clarifications about multi-person groupings, emoji and text presentation selectors, and the significance of the word “FACE” in emoji names. The mechanisms for support of skin tone distinctions when using multi-person emoji are now more fully described.
M. Implications for Migration
A significant number of changes in Unicode 12.0 may affect implementations upgrading from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Four new scripts have been added in Unicode 12.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Nandinagari is a complex script of the Indic type.
- Ottoman Siyaq numerals, when combined to
represent large numbers, have complex formatting requirements.
- A set of Egyptian format controls has been added in a new block in
the range U+13430..U+13438. While these are intended for use with the
existing Egyptian Hieroglyphs script, their use involves a complicated
extension to the rendering model for hieroglyphs to enable quadrat formation. Implementers who
wish to support these format controls will need to study the specification in the
supporting proposal documents. See, in particular,
L2/17-112.
- U+1E94B ADLAM NASALIZATION MARK has been added for the Adlam script.
Although the Adlam script was encoded earlier, implementations have run
into trouble attempting to implement the Adlam nasalization mark with
characters such as U+0027 APOSTROPHE. The new character is intended to
eliminate those problems, but Adlam implementations will need to be
updated to add the character and its correct rendering to Adlam fonts.
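An implementation that does not yet support quadrat formation may at least want to detect the new Egyptian format controls in incoming text. The following Python sketch checks for code points in the range U+13430..U+13438 given above. The function name and the idea of reporting positions are assumptions for illustration only; nothing here models the rendering behavior described in the proposal documents.

```python
# Range of the Egyptian Hieroglyph format controls added in Unicode 12.0.
EGYPTIAN_FORMAT_CONTROLS = range(0x13430, 0x13438 + 1)


def find_egyptian_format_controls(text):
    """Return (index, 'U+XXXXX') for each Egyptian format control in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ord(ch) in EGYPTIAN_FORMAT_CONTROLS
    ]


# Two hieroglyphs joined by one format control (sample string for this sketch).
sample = "\U00013000\U00013430\U00013001"
print(find_egyptian_format_controls(sample))
```

A caller could use the result to route such text to a renderer that understands the controls, or to flag it when no such renderer is available.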
Casing Issues
- A few new uppercase Latin letters have been added, which form case pairs
with existing lowercase Latin letters. Casing tables should be checked
carefully.
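One way to check casing tables is to confirm that each expected case pair round-trips in the runtime being used. The Python sketch below does this with the built-in str.upper and str.lower. The helper name check_case_pairs is hypothetical, and the three pairs listed are believed to be the new uppercase letters paired with existing lowercase letters; verify them against UnicodeData.txt before relying on them. Results depend on the Unicode version bundled with the interpreter, which unicodedata.unidata_version reports.

```python
import unicodedata


def check_case_pairs(pairs):
    """Return the (upper, lower) pairs that do not round-trip in this runtime."""
    failures = []
    for upper, lower in pairs:
        if upper.lower() != lower or lower.upper() != upper:
            failures.append((upper, lower))
    return failures


# Candidate pairs (assumption: confirm against UnicodeData.txt for 12.0).
CANDIDATE_PAIRS = [
    ("\uA7C4", "\uA794"),      # C WITH PALATAL HOOK
    ("\uA7C5", "\u0282"),      # S WITH HOOK
    ("\uA7C6", "\u1D8E"),      # Z WITH PALATAL HOOK
]

print("Runtime Unicode version:", unicodedata.unidata_version)
for upper, lower in check_case_pairs(CANDIDATE_PAIRS):
    print(f"Mismatch: U+{ord(upper):04X} / U+{ord(lower):04X}")
```

A runtime built on data older than Unicode 12.0 would be expected to report mismatches, which is the condition this check is meant to surface.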
General Character Property Changes
- Numerous updates have been made to the Alphabetic and Diacritic property
values, to help keep the DUCET table for collation stable when initial
weights are assigned based on character property values. Most of the
affected characters are tone marks for lesser-known scripts.
- A Script_Extensions property value of {Latn Mong} has been added for
U+202F NARROW NO-BREAK SPACE. Implementations that support Script_Extensions
should check that they are handling this character appropriately,
and that its identification in both Latin and Mongolian script runs is correct.
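To illustrate the Script_Extensions point, the Python sketch below shows how a script-run check might consult Script_Extensions before falling back to the Script property. Python's standard library does not expose either property, so the one-entry table, the function names, and the toy script lookup are all assumptions for this example; a real implementation would load the values from ScriptExtensions.txt and Scripts.txt.

```python
# Hypothetical one-entry table; real data comes from ScriptExtensions.txt.
SCRIPT_EXTENSIONS = {0x202F: frozenset({"Latn", "Mong"})}


def toy_script_of(ch):
    """Toy stand-in for the Script property (illustration only)."""
    cp = ord(ch)
    if 0x1800 <= cp <= 0x18AF:  # Mongolian block, treated wholesale as Mong here
        return "Mong"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"


def compatible_with_run(ch, run_script, script_of=toy_script_of):
    """Return True if ch may be placed in a run of run_script."""
    extensions = SCRIPT_EXTENSIONS.get(ord(ch))
    if extensions is not None:
        return run_script in extensions  # Script_Extensions takes precedence
    return script_of(ch) == run_script


print(compatible_with_run("\u202F", "Mong"))
print(compatible_with_run("\u202F", "Cyrl"))
```

With this shape, U+202F NARROW NO-BREAK SPACE stays inside either a Latin or a Mongolian run instead of splitting it.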
Numeric Property Changes
Unicode 12.0 adds a large number of Tamil characters used for fractional values
in traditional accounting practices. Some of these fraction characters
introduce fractional values distinct from those noted for fraction characters
in prior versions of the UCD. Implementations which handle numeric values of
Unicode characters and which have special assumptions about how to deal with
fractional values should take note of the following new fractional values
among the Tamil fractions:
- 1/320, 1/80, 1/64, 1/32, 3/64
- Note that these Tamil fractions share structural similarities (and many
values) with Malayalam fractions. See DerivedNumericValues.txt for details.
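As a small sketch of how an implementation might audit its assumptions about fractional values, the Python code below parses the new values listed above as exact rationals and reports any denominators outside a set the implementation already expects. The helper names and the example set of expected denominators are hypothetical; the actual values should be read from DerivedNumericValues.txt.

```python
from fractions import Fraction

# New fractional values listed above for the Tamil fractions.
NEW_TAMIL_FRACTION_VALUES = ("1/320", "1/80", "1/64", "1/32", "3/64")


def parse_numeric_value(field):
    """Parse a rational such as '3/64' into an exact Fraction."""
    return Fraction(field.strip())


def unexpected_denominators(values, expected_denominators):
    """Return sorted denominators in values that are not in expected_denominators."""
    seen = {parse_numeric_value(value).denominator for value in values}
    return sorted(seen - set(expected_denominators))


# Hypothetical set of denominators an older implementation might assume.
assumed = {2, 3, 4, 5, 6, 8, 10, 16}
print(unexpected_denominators(NEW_TAMIL_FRACTION_VALUES, assumed))
```

Keeping the values as exact fractions avoids rounding questions that arise when small values such as 1/320 are stored as binary floating point.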
CJK/Unihan Changes
- The regular expressions for the kIRG_GSource and kIRG_JSource properties were completely rewritten to be comprehensible. See UAX #38 for details.
Standardized Variation Sequences
Many new standardized variation sequences have been added to represent distinctions between variants of some common East Asian punctuation characters.
New Data Files Added to the UCD
A new radical/stroke index, USourceRSChart.pdf, was added for easier lookup of U-Source ideographs.
Code Charts
- The old Phags-pa font has been replaced with a better design.
- The old Bopomofo font has also been replaced with a better design. This
impacts the Bopomofo and Bopomofo Extended blocks, as well as two Bopomofo
tone mark characters (U+02EA and U+02EB) in the Spacing Modifier Letters block.
- The won currency sign glyph (U+20A9) was changed to align with modern Korean fonts.