Unicode® 14.0.0
Version 14.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 14.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
A. Summary
Unicode 14.0 adds 838 characters,
for a total of 144,697 characters.
These additions include 5 new scripts,
for a total of 159 scripts, as well as 37 new emoji characters.
The new scripts and characters in Version 14.0 add support for lesser-used languages
and unique written requirements worldwide, including numerous symbol additions.
Funds from the
Adopt-a-Character
program provided support for some of these additions.
The new scripts and characters include:
- Toto, used to write the Toto language in northeast India
- Cypro-Minoan, an undeciphered historical script primarily used on the island of Cyprus
- Vithkuqi, an historic script used to write Albanian, and undergoing a modern revival
- Old Uyghur, an historic script used in Central Asia and elsewhere to write Turkic, Chinese, Mongolian, Tibetan, and Arabic languages
- Tangsa, a modern script used to write the Tangsa language, which is spoken in India and Myanmar
- Many Latin additions for extended IPA
- Arabic script additions used to write languages across Africa and in Iran,
Pakistan, Malaysia, Indonesia, Java, and Bosnia, and to write honorifics,
and additions for Quranic use
- Other character additions support languages of North America and of the Philippines, India, and Mongolia
Popular symbol additions:
- 37 emoji characters. For complete statistics regarding all emoji as of
Unicode 14.0, see
Emoji Counts.
For more information about emoji additions in Version 14.0, including new
emoji ZWJ sequences and emoji modifier sequences, see
Emoji Recently Added, v14.0.
Other symbol and notational additions include:
- The som currency sign used in the Kyrgyz Republic
- Znamenny musical notation used to write Znamenny Chant, a form of liturgical singing that developed in Russia in the 11th century CE. It is derived from early Byzantine musical notation and is mainly of scholarly interest.
Support for CJK unified ideographs was enhanced in Version 14.0
by significant corrections and improvements to the Unihan database.
Changes to the Unihan database include updated source lists, regular expressions,
and new and updated fields.
See UAX #38,
Unicode Han Database (Unihan) for more information on the updates.
Additional support for lesser-used languages and scholarly work was extended, including:
- Ahom, Balinese, Brahmi, Canadian aboriginal languages (UCAS), Glagolitic, Kaithi, Kannada, Mongolian, Tagalog, Takri, and Telugu
- Arabic support for Hausa, Wolof, Hindko, and Punjabi, and Ethiopic support for Gurage
Important chart font updates, including:
- CJK auxiliary blocks and enclosed alphanumerics. See the delta charts for detailed information on significant chart font changes.
Synchronization
Several other important Unicode specifications have been updated for Version 14.0.
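The following four Unicode Technical Standards are versioned in
synchrony with the Unicode Standard, because their data files cover the same repertoire.
All have been updated to Version 14.0:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing
- UTS #51, Unicode Emoji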
Some of the changes in Version 14.0 and associated Unicode Technical Standards
may require modifications
to implementations. For more information, see the migration and modification sections of
UTS #10, UTS #39, UTS #46, and UTS #51.
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
B. Technical Overview
Version 14.0 of the Unicode Standard consists of:
- The core specification
- The code charts (delta and archival) for this version
- The Unicode Standard Annexes
- The Unicode Character Database (UCD)
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
The core specification is available as
a single PDF for viewing (14 MB).
Links are also available
in the navigation bar on the left of this page to access
individual chapters and appendices
of the core specification. It is also available as Print-on-Demand (POD) for purchase: Volume 1 and Volume 2.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for
the Unicode Standard is available online. Those charts are always the most current
code charts available, and may be updated at any time. The charts are organized by
scripts and blocks for easy reference.
An online index by character name
is also provided. The Tableaux des caractères
provides a French translation of these latest code charts.
For Unicode 14.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
new blocks and any blocks in which characters were added for Unicode 14.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents
the entire set of characters, names and representative glyphs at the time of publication of Unicode 14.0.0.
A French translation of the archival code charts is also available for this version.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
The old, frozen UCS2003 source column has been removed from the multi-column
display for CJK Unified Ideographs Extension B for Version 14.0.0. For permanent reference, a
single source
display of UCS2003 (8.7 MB) for the CJK Unified Ideographs Extension B has been
provided as part of the Version 13.0.0 archival charts.
Links to the individual
Unicode Standard Annexes are available in
the navigation bar on the left of this page. The list of significant changes
in the content of the Unicode Standard Annexes for Version 14.0 can be found
in Section G below.
Data files for Version 14.0 of
the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap
to the functions of the various subdirectories.
Zipped versions of the UCD
for bulk download are available, as well.
Version 14.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 14.0.0, (Mountain View, CA: The Unicode Consortium,
2021. ISBN 978-1-936213-29-0)
http://www.unicode.org/versions/Unicode14.0.0/
The terms “Version 14.0” or “Unicode 14.0” are abbreviations for the full version reference, Version 14.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
A complete specification of the contributory files for Unicode
14.0 is found on the page Components for 14.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Errata incorporated into Unicode 14.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 14.0, see the list of current
Updates and Errata.
C. Stability Policy Update
There were no significant changes to the Stability Policy of the core specification between Unicode 13.0 and Unicode 14.0.
D. Textual Changes and Character Additions
Five new
scripts were added with accompanying new block descriptions:
| Script | Number of Characters |
| --- | --- |
| Vithkuqi | 70 |
| Old Uyghur | 26 |
| Cypro-Minoan | 99 |
| Tangsa | 89 |
| Toto | 31 |
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
838 characters have been added.
Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see delta code charts.
E. Conformance Changes
There are no significant new conformance requirements in Unicode 14.0.
F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 14.0 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
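As a rough illustration of how an implementation might pick up such additions from a UCD data file, the following sketch parses lines in the `range ; value # comment` layout that many UCD files use and collects the ranges tagged with a given value. The sample lines are illustrative stand-ins in the style of DerivedAge.txt, built from ranges mentioned on this page. They are not a copy of the real file, and the function name `parse_ucd_ranges` is hypothetical.

```python
def parse_ucd_ranges(lines):
    """Yield (first, last, value) from UCD-style 'range ; value # comment' lines."""
    for line in lines:
        # Drop the trailing comment, then skip lines with no data left.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_points, value = fields[0], fields[1]
        if ".." in code_points:
            first, last = code_points.split("..")
        else:
            # A single code point is treated as a one-element range.
            first = last = code_points
        yield int(first, 16), int(last, 16), value


# Illustrative sample only (assumption): real files should be taken from the UCD.
SAMPLE = """
# sample data
9FFD..9FFF    ; 14.0 # ideographs appended to the URO
16AC0..16AC9  ; 14.0 # Tangsa digits
0041..005A    ; 1.1  # Basic Latin capital letters
"""

added_in_14 = [
    (first, last)
    for first, last, value in parse_ucd_ranges(SAMPLE.splitlines())
    if value == "14.0"
]
# Number of code points covered by the matching ranges.
count = sum(last - first + 1 for first, last in added_in_14)
```

Comparing the result of such a scan against the tables an implementation already ships is one way to spot additions that still need handling.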
G. Changes in the Unicode Standard Annexes
In Version 14.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Note that for Unicode 14.0, all pertinent links to URLs on the Unicode website
in these Unicode Standard Annexes were updated to use the https protocol.
| Unicode Standard Annex | Changes |
| --- | --- |
| UAX #9 Unicode Bidirectional Algorithm | Section 6.2, Vertical Text was clarified to indicate how the Bidirectional Algorithm is (or is not) used when text is laid out in vertical orientation. |
| UAX #11 East Asian Width | No significant changes in this version. |
| UAX #14 Unicode Line Breaking Algorithm | One redundant rule part was removed from LB27 in Section 6.1, Non-tailorable Line Breaking Rules. Also, LB30b was updated to include potential emoji. |
| UAX #15 Unicode Normalization Forms | No significant changes in this version. |
| UAX #24 Unicode Script Property | No significant changes in this version. |
| UAX #29 Unicode Text Segmentation | A Swedish "AIK:are" example was added to the word boundary discussion. The description of the charts in the auxiliary data files was updated, to make it more accurate. Other small editorial fixes were applied to the text. |
| UAX #31 Unicode Identifier and Pattern Syntax | Scripts new to Unicode 14.0 were added to the appropriate tables. A new Section 1.5, Notation, was added, referring to the LDML for the UnicodeSet notation used in this annex. |
| UAX #34 Unicode Named Character Sequences | No significant changes in this version. |
| UAX #38 Unicode Han Database (Unihan) | The kCantonese field was redefined, and its description was updated accordingly. The new kStrange field was added. Regular expressions, source lists, and descriptions were updated for many other fields. |
| UAX #41 Common References for Unicode Standard Annexes | All references were updated for Unicode 14.0. |
| UAX #42 Unicode Character Database in XML | New code point attributes, values, and patterns were added for Unicode 14.0. |
| UAX #44 Unicode Character Database | The documentation was updated to describe the changes to the UCD for Version 14.0. The distinction between properties of strings and string-valued properties was clarified. A note was added clarifying that Vertical_Orientation defaults to U in some blocks associated with notational systems. An erroneous statement about which General_Category values can be associated with ccc≠0 was corrected. |
| UAX #45 U-Source Ideographs | Descriptions were added for new data fields (total strokes and first residual stroke) in the data file associated with UAX #45. The KangXi dictionary index field was obsoleted. New information was added about the submission process. |
| UAX #50 Unicode Vertical Text Layout | No significant changes in this version. |
H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
| Unicode Technical Standard | Changes |
| --- | --- |
| UTS #10 Unicode Collation Algorithm | No significant changes in this version. |
| UTS #39 Unicode Security Mechanisms | Section 3, Identifier Characters was adjusted to better introduce the topic of identifiers. The text in Section 3.1, General Security Profile for Identifiers was clarified regarding the rationales for restricting a character. The descriptions of identifier types in Table 1 were clarified. |
| UTS #46 Unicode IDNA Compatibility Processing | No significant changes in this version. |
| UTS #51 Unicode Emoji | The introduction was reworded. The definition of Basic_Emoji was clarified, and it was noted that emoji sets are binary properties of strings. In Section 2.6.2, Multi-Person Skin Tones, the handshake was added to the list of emoji with RGI skin tones. |
M. Implications for Migration
There are a significant number of changes in Unicode 14.0 which may impact implementations upgrading
to Version 14.0 from earlier versions of the standard. The most important of these are listed
and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Five new scripts have been added in Unicode 14.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Old Uyghur is an abjad, historically related to Sogdian. Representation
of Old Uyghur text poses many significant issues. See the original proposal
documentation in L2/20-191
for an extensive discussion.
Casing Issues
- Four new Latin case pairs and one new Glagolitic case pair have been added
in Version 14.0.0. In addition, one of the newly added scripts, Vithkuqi,
is a bicameral script with casing. Implementations
of case mapping and case folding should be checked to ensure they account
correctly for the new case pairs.
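One way to perform such a check is sketched below, assuming a Python runtime whose `unicodedata` tables may or may not be at Unicode 14.0 yet. The code points listed for the four Latin pairs and the Glagolitic pair are not given on this page. They are an assumption based on the Version 14.0 UCD and should be verified against UnicodeData.txt. Vithkuqi is not covered here, and the helper names are hypothetical.

```python
import unicodedata

# Assumed code points for the new case pairs (verify against UnicodeData.txt).
NEW_CASE_PAIRS = [
    ("\uA7C0", "\uA7C1"),  # Latin letter Old Polish O
    ("\uA7D0", "\uA7D1"),  # Latin letter closed insular G
    ("\uA7D6", "\uA7D7"),  # Latin letter Middle Scots S
    ("\uA7D8", "\uA7D9"),  # Latin letter sigmoid S
    ("\u2C2F", "\u2C5F"),  # Glagolitic letter caudate chu
]


def runtime_unicode_major():
    """Major version of the Unicode data bundled with this Python runtime."""
    return int(unicodedata.unidata_version.split(".")[0])


def unmapped_case_pairs(pairs):
    """Return the (upper, lower) pairs that do not round-trip through lower()/upper()."""
    return [(u, l) for u, l in pairs if u.lower() != l or l.upper() != u]


if runtime_unicode_major() >= 14:
    problems = unmapped_case_pairs(NEW_CASE_PAIRS)
else:
    # The runtime data predates Unicode 14.0, so the check is not meaningful.
    problems = None
```

The same round-trip test can be pointed at an implementation's own case mapping and case folding tables instead of the runtime's built-in ones.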
Numeric Property Issues
- A new set of decimal digits has been added for the Tangsa script.
See U+16AC0..U+16AC9. Implementations of digits will need to take those
into account.
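A minimal sketch of handling the new digits without relying on the runtime's own Unicode tables is shown below. It assumes that the ten code points run from zero to nine in ascending order, as is usual for decimal digit sets in the standard. The function names are hypothetical.

```python
TANGSA_DIGIT_ZERO = 0x16AC0
TANGSA_DIGIT_NINE = 0x16AC9


def tangsa_digit_value(ch):
    """Return 0-9 for a Tangsa decimal digit, or None for any other character."""
    cp = ord(ch)
    if TANGSA_DIGIT_ZERO <= cp <= TANGSA_DIGIT_NINE:
        # Assumes the digits are encoded contiguously in ascending order.
        return cp - TANGSA_DIGIT_ZERO
    return None


def parse_tangsa_number(text):
    """Convert a non-empty string consisting only of Tangsa digits to an int."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        digit = tangsa_digit_value(ch)
        if digit is None:
            raise ValueError(f"not a Tangsa digit: U+{ord(ch):04X}")
        value = value * 10 + digit
    return value
```

Implementations that derive digit values from the UCD at build time should pick up the new range automatically once their data files are updated. Hand-written digit tables are the ones that need this kind of addition.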
CJK/Unihan Changes
- A new provisional property, kStrange, has been added to Unihan.
This property is documented in detail in a new Unicode Technical
Note, UTN #43.
- The provisional kCantonese property was extensively refined.
This work included 6,000 additional property values, as well as changing the property
values for nearly 5,000 existing ideographs to reflect only one reading.
- Over 1,000 kIRG_VSource property values with the "VU-" prefix were changed
to use the "VN-" prefix.
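For implementations that cache Unihan data, the prefix change can be checked with something like the sketch below. It only flags cached kIRG_VSource values that still carry the old prefix, so that they can be refreshed from the Version 14.0 Unihan files, and it makes no assumption about what the replacement values look like. The function name and the sample values are hypothetical placeholders, not real Unihan data.

```python
def find_stale_vsource(cached):
    """Return sorted code points whose cached kIRG_VSource value starts with 'VU-'."""
    return sorted(cp for cp, value in cached.items() if value.startswith("VU-"))


# Hypothetical cache: code point -> kIRG_VSource value (placeholder identifiers).
cache = {
    0x4E00: "VN-0001",
    0x4E01: "VU-0002",
    0x4E02: "V1-0003",
}

stale = find_stale_vsource(cache)
for cp in stale:
    print(f"U+{cp:04X} should be reloaded from the 14.0 Unihan data")
```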
WARNING: There are changes to the ends of three existing
CJK unified ideograph ranges in Unicode 14.0.0. Because implementations often hard-code
ideographic ranges to short-cut lookups and reduce table sizes, it is
especially important that implementers pay close attention to the
implications of range changes for Version 14.0.0. These extensions bump up the end
ranges of the encoded ideographs by a few code points within each block:
- 3 code points for the URO:
ending at U+9FFF [fills the block]
- 2 code points for Extension B: ending at U+2A6DF [fills the block]
- 4 code points for Extension C: ending at U+2B738
See Section 4.4,
Listing of Characters Covered by the Unihan Database
in UAX #38
for the version history of all these small CJK unified ideograph additions
inside existing blocks.
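The effect of these range changes on a hard-coded table can be sketched as follows. The Version 14.0 end points come from the list above, and the Version 13.0 end points are derived by subtracting the stated number of added code points. The range start points (U+4E00, U+20000, U+2A700) are not stated on this page and are included as an assumption. The table covers only these three ranges, not every CJK unified ideograph block, and the names are hypothetical.

```python
# name: (start, end in Unicode 13.0, end in Unicode 14.0)
CJK_RANGE_ENDS = {
    "URO": (0x4E00, 0x9FFC, 0x9FFF),
    "Extension B": (0x20000, 0x2A6DD, 0x2A6DF),
    "Extension C": (0x2A700, 0x2B734, 0x2B738),
}


def in_updated_cjk_ranges(cp, version=14):
    """True if cp lies in one of the three ranges, using the ends for that major version."""
    for start, end_13, end_14 in CJK_RANGE_ENDS.values():
        end = end_14 if version >= 14 else end_13
        if start <= cp <= end:
            return True
    return False


def newly_covered():
    """Code points that a table hard-coded for Version 13.0 would miss."""
    return [
        cp
        for _start, end_13, end_14 in CJK_RANGE_ENDS.values()
        for cp in range(end_13 + 1, end_14 + 1)
    ]
```

Running `newly_covered()` lists the 3 + 2 + 4 code points involved, which can serve as test inputs for any range-based shortcut in an implementation.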
See UAX
#38, Unicode Han Database (Unihan) for further details on these changes,
especially Section 4.2, Listing
by Date of Addition to the Unicode Standard, and Section 4.3, Listing by
Location within Unihan.zip.
UAX #38 also has updated regex values for numerous
Unihan properties.
Emoji Changes
- 37 new emoji characters have been added. However, in addition
to those individual characters, many new emoji sequences have been
recognized, as well. Implementations supporting emoji should be
checked to reflect changes in
UTS #51, Unicode Emoji
and all of its associated data files.
Code Charts
- There was a significant update in the fonts used for many CJK auxiliary blocks,
to improve the design and consistency of glyphs. Details of the affected ranges
of glyphs can be found in the Glyph and Variation Sequence Changes table
on the
single block delta charts page.
- There have also been systematic updates to many glyphs in the
Egyptian Hieroglyphs
block, to more accurately reflect current practice.
The old, frozen UCS2003 source column has been removed from the multi-column
display for CJK Unified Ideographs Extension B for Version 14.0.0. For permanent reference, a
single source
display of UCS2003 (8.7 MB) for the CJK Unified Ideographs Extension B has been
provided as part of the Version 13.0.0 archival charts. The rationale for this change
is that the UCS2003 source was the source corresponding to the single
column chart first printed in Unicode 4.0 in 2003. The glyphs for that single source
had not tracked the extensive updates for characters in Extension B over the intervening
years, and so in some cases were becoming misleading about the identity of some of
the corrected characters in Extension B.