Saturday, December 6, 2008

Java and UTF-8 Shortest Form

Java 6 update 11 contained an interesting change to UTF-8 handling that I think is worth noting.

Here is the original JRE bug:
4486841 UTF-8 decoder should adhere to corrigendum to Unicode 3.0.1

Here is the impact of the problem
The UTF-8 (Unicode Transformation Format-8) decoder in the Java Runtime Environment (JRE) accepts encodings that are longer than the "shortest" form. This behavior is not a vulnerability in Java SE. However, it may be leveraged to exploit systems running software that relies on the JRE UTF-8 decoder to reject non-shortest form sequences. For example, non-shortest form sequences may be decoded into illegal URIs, which may then allow files that are not otherwise accessible to be read, if the URIs are not checked following UTF-8 decoding.

The solution is to flat out reject anything other than shortest-form UTF-8 per http://unicode.org/versions/corrigendum1.html which has been around since - March 2001??
  • In UTF-8, <004d> is serialized as <4d>.
  • The problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are illegal...

No comments: