It took me two minutes to understand what was going on. What?! The text is not the same! Yes! I rendered some text in an HTML form, and when the text came back to the server, without me changing it at all, the text HAD BEEN CHANGED!
Huh? How can this happen?
This is the HTML page, with a form tag containing some Hebrew text, that I used to send the text back to the server:
<!doctype html>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<form method="POST" action="/api/snippets">
<input name="Content" type="hidden" value="לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ">
<input type="submit">
This is what is sent from the client:
לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
And this is what the server got back:
לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
Huh? You must be furrowing your brow… These two snippets look exactly the same to me!
No, they are not. Apparently.
The way I found out that they are different is that I edited a snippet in the database from a web page, and then tried to query for the original snippet’s text. What surprised me was that nothing matched the query. I knew that RavenDB supports Unicode at its core, but in order to be sure I wrote a test that proved it. (And committed it to the RavenDB source code, so I’ll be sure that this won’t break in the future either.)
So the text must not be the same. But what is different? At that point I decided to print out the code point of each character using the following gist test helper. Here is the result:
Position Expected Actual
0 ל (\u05DC) ל (\u05DC)
1 ְ (\u05B0) ְ (\u05B0)
2 ש (\u05E9) ש (\u05E9)
3 * ׁ (\u05C1) ֵ (\u05B5)
4 * ֵ (\u05B5) ׁ (\u05C1)
5 ם (\u05DD) ם (\u05DD)
6 \s (\u0020) \s (\u0020)
7 י (\u05D9) י (\u05D9)
8 ִ (\u05B4) ִ (\u05B4)
9 ח (\u05D7) ח (\u05D7)
10 ו (\u05D5) ו (\u05D5)
11 ּ (\u05BC) ּ (\u05BC)
12 ד (\u05D3) ד (\u05D3)
13 \s (\u0020) \s (\u0020)
14 ק (\u05E7) ק (\u05E7)
15 ֻ (\u05BB) ֻ (\u05BB)
16 ד (\u05D3) ד (\u05D3)
17 ְ (\u05B0) ְ (\u05B0)
18 ש (\u05E9) ש (\u05E9)
19 * ׁ (\u05C1) ָ (\u05B8)
20 * ָ (\u05B8) ׁ (\u05C1)
21 א (\u05D0) א (\u05D0)
22 \s (\u0020) \s (\u0020)
23 ב (\u05D1) ב (\u05D1)
24 * ּ (\u05BC) ְ (\u05B0)
25 * ְ (\u05B0) ּ (\u05BC)
26 ר (\u05E8) ר (\u05E8)
27 ִ (\u05B4) ִ (\u05B4)
28 י (\u05D9) י (\u05D9)
29 ך (\u05DA) ך (\u05DA)
30 ְ (\u05B0) ְ (\u05B0)
31 \s (\u0020) \s (\u0020)
32 ה (\u05D4) ה (\u05D4)
33 ו (\u05D5) ו (\u05D5)
34 ּ (\u05BC) ּ (\u05BC)
35 א (\u05D0) א (\u05D0)
36 \s (\u0020) \s (\u0020)
37 ו (\u05D5) ו (\u05D5)
38 ּ (\u05BC) ּ (\u05BC)
39 ש (\u05E9) ש (\u05E9)
40 * ׁ (\u05C1) ְ (\u05B0)
41 * ְ (\u05B0) ׁ (\u05C1)
42 כ (\u05DB) כ (\u05DB)
43 ִ (\u05B4) ִ (\u05B4)
44 י (\u05D9) י (\u05D9)
45 נ (\u05E0) נ (\u05E0)
46 ְ (\u05B0) ְ (\u05B0)
47 ת (\u05EA) ת (\u05EA)
48 * ּ (\u05BC) ֵ (\u05B5)
49 * ֵ (\u05B5) ּ (\u05BC)
50 ה (\u05D4) ה (\u05D4)
51 ּ (\u05BC) ּ (\u05BC)
Position: First difference is at position 3
Expected: לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
Actual: לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
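If you want to reproduce this kind of per-character comparison yourself, here is a minimal sketch in Python. It is not the gist helper I used; the function names (`describe`, `diff_chars`, `first_difference`) are made up for illustration, and the output format only roughly imitates the table above.

```python
from itertools import zip_longest


def describe(ch):
    """Render one character as 'char (\\uXXXX)', showing a space as \\s."""
    if ch is None:
        return "(missing)"
    shown = r"\s" if ch == " " else ch
    return f"{shown} (\\u{ord(ch):04X})"


def diff_chars(expected, actual):
    """Return one row per position; rows that differ are marked with '*'."""
    rows = []
    for i, (e, a) in enumerate(zip_longest(expected, actual)):
        marker = "" if e == a else "* "
        rows.append(f"{i} {marker}{describe(e)} {describe(a)}")
    return rows


def first_difference(expected, actual):
    """Return the first position where the strings differ, or None."""
    for i, (e, a) in enumerate(zip_longest(expected, actual)):
        if e != a:
            return i
    return None


# The first word of the text, written with explicit escapes so that no
# editor or browser can silently reorder the marks.
expected = "\u05DC\u05B0\u05E9\u05C1\u05B5\u05DD"  # as rendered in the form
actual = "\u05DC\u05B0\u05E9\u05B5\u05C1\u05DD"    # as received by the server

print("\n".join(diff_chars(expected, actual)))
print("First difference is at position", first_difference(expected, actual))
```

Writing the test strings as `\uXXXX` escapes matters here: if you paste the Hebrew text directly, you cannot be sure which order of marks your editor actually saved.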
What we can see from this test is that when the browser sends back the above text, it modifies it. To be more exact, when a Hebrew letter has two Niqqud characters (which are like the vowel characters in English: they let you know how to pronounce a specific sign), the browser swaps their order.
In the above text we had Shin (ש) followed by a ShinDot (\u05C1) followed by a Zeire (\u05B5), but the browser swapped the order of the two Niqqud characters, so when the text was posted back to the server, the server got Shin (ש) followed by a Zeire (\u05B5) followed by a ShinDot (\u05C1).
After realizing that this really happens, I then wanted to know why. What is the reason for this behavior?
The answer I got is that browsers apply some sort of Unicode normalization, which in this case is probably Normalization Form C (NFC).
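The NFC explanation is easy to check outside the browser. The sketch below uses Python's standard `unicodedata` module (my choice for illustration, not something the browser or RavenDB uses) to normalize the Shin sequence from the table and print the code points before and after.

```python
import unicodedata


def code_points(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Shin, Shin Dot, Zeire: the order the text had when it was rendered.
original = "\u05E9\u05C1\u05B5"

# Apply Normalization Form C, the form the browser is suspected of using.
normalized = unicodedata.normalize("NFC", original)

print(code_points(original))    # ['U+05E9', 'U+05C1', 'U+05B5']
print(code_points(normalized))  # ['U+05E9', 'U+05B5', 'U+05C1']
print(normalized == original)   # False
```

The normalized order is the same one the server received, which is consistent with the NFC explanation, although it does not prove what the browser does internally.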
So I dug deeper into Unicode normalization, and found out that this can be a serious problem when editing Hebrew text with Niqqud. While the cases where this affects how the text is displayed are rare, they do exist, as outlined in the following document (page 9). Besides, this overrides the Hebrew font convention for the Niqqud order, as mentioned in the same document:
… most users familiar with Hebrew would agree that the dagesh should, logically and linguistically, precede the vowel and the cantillation mark, and most would also agree that the vowel should precede the cantillation mark
Searching the web didn’t yield any solution. I tried to see if I could come up with a custom normalization that would de-normalize the characters to the original order, but I quickly concluded that this is not an easy task. Based on the recommended mark ordering here (page 12), I can see that Hiriq precedes Patah, and this will lead to an error in the word ירושלים, changing its pronunciation from yerušālayim to yerušālaim, as described on page 9 here.
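The reason the order changes is that normalization sorts consecutive combining marks by their canonical combining class. The following Python sketch prints those classes for the marks discussed here and shows why reverting is ambiguous. It assumes, following the cited document's example, that the Lamed in question carries a Patah followed by a Hiriq.

```python
import unicodedata

# Marks that appear in this post, keyed by an informal name.
MARKS = {
    "Sheva": "\u05B0",
    "Hiriq": "\u05B4",
    "Zeire": "\u05B5",
    "Patah": "\u05B7",
    "Qamats": "\u05B8",
    "Dagesh": "\u05BC",
    "ShinDot": "\u05C1",
}

# Canonical combining class of each mark; lower classes sort first.
classes = {name: unicodedata.combining(ch) for name, ch in MARKS.items()}
for name, value in sorted(classes.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")

# Lamed + Patah + Hiriq (assumed original) versus Lamed + Hiriq + Patah.
patah_first = "\u05DC\u05B7\u05B4"
hiriq_first = "\u05DC\u05B4\u05B7"

# Both inputs collapse to the same normalized string, so after the round
# trip there is no way to tell which one the author typed.
print(unicodedata.normalize("NFC", patah_first) == hiriq_first)
print(unicodedata.normalize("NFC", hiriq_first) == hiriq_first)
```

Because two different inputs end up identical, any de-normalization step has to guess, and it will guess wrong for one of them.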
So, does this mean that I cannot use a web page in order to edit such Hebrew text? Really?
Or can you come up with some solution?
A solution, as I see it, can be either:
A way to avoid the normalization performed by the browser (Google Chrome).
A de-normalization algorithm to revert the Niqqud to the original order – as quoted above from this document, page 8.
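To make the second option concrete, here is a rough Python sketch of what such a de-normalization could look like. It is not a real solution: the priority table is my own assumption (Shin/Sin dot first, then Dagesh, then everything else in the order it arrived), and `denormalize_niqqud` is a made-up name. It only handles the swaps seen in this post.

```python
import unicodedata

# Assumed priorities: lower values are moved closer to the base letter.
# Marks not listed here keep their relative (normalized) order.
MARK_PRIORITY = {
    "\u05C1": 0,  # Shin Dot
    "\u05C2": 0,  # Sin Dot
    "\u05BC": 1,  # Dagesh
}
DEFAULT_PRIORITY = 2


def denormalize_niqqud(text):
    """Reorder each run of combining marks by the assumed priorities."""
    out = []
    run = []  # combining marks collected after the current base letter

    def flush():
        # list.sort is stable, so marks with equal priority stay in place.
        run.sort(key=lambda ch: MARK_PRIORITY.get(ch, DEFAULT_PRIORITY))
        out.extend(run)
        run.clear()

    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


# Round trip for the first word of the text, written with escapes.
original = "\u05DC\u05B0\u05E9\u05C1\u05B5\u05DD"
from_browser = unicodedata.normalize("NFC", original)
print(denormalize_niqqud(from_browser) == original)  # True
```

This restores the Shin Dot and Dagesh swaps from the table, but it cannot restore two vowels on the same letter (the ירושלים case), since both possible originals look identical after normalization.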
I also posted this question to Stack Overflow; you can answer it there if it is more convenient for you.