It took me two minutes to understand what was going on. What?! The text is not the same! Yes! I rendered some text in an HTML form, and when I got the text back to the server, without changing it myself, the text HAD BEEN CHANGED!
Huh? How can this happen?
This is an HTML page with a form tag containing some Hebrew text, and this is what I used to send the text back to the server:
```html
<!doctype html>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<form method="POST" action="/api/snippets">
  <input name="Content" type="hidden" value="לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ">
  <input type="submit">
</form>
```
This is what is sent from the client:
לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
And this is what the server got back:
לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
Huh? You must be furrowing your brow… These two snippets look exactly the same to me!
No, they do not. Apparently.
The way I found out they were different is that I edited a snippet in the database from a web page, and then tried to query for the original snippet's text. What surprised me was that nothing matched the query. I knew that RavenDB supports Unicode at its core, but in order to be sure I wrote a test that proved it. (And committed it to the RavenDB source code, so I'll be sure that this won't break in the future either.)
So the text must not be the same. But what is different? At that point I decided to print out the code point of each character using the following gist test helper. Here is the result:
Position Expected Actual
0 ל (\u05DC) ל (\u05DC)
1 ְ (\u05B0) ְ (\u05B0)
2 ש (\u05E9) ש (\u05E9)
3 * ׁ (\u05C1) ֵ (\u05B5)
4 * ֵ (\u05B5) ׁ (\u05C1)
5 ם (\u05DD) ם (\u05DD)
6 \s (\u0020) \s (\u0020)
7 י (\u05D9) י (\u05D9)
8 ִ (\u05B4) ִ (\u05B4)
9 ח (\u05D7) ח (\u05D7)
10 ו (\u05D5) ו (\u05D5)
11 ּ (\u05BC) ּ (\u05BC)
12 ד (\u05D3) ד (\u05D3)
13 \s (\u0020) \s (\u0020)
14 ק (\u05E7) ק (\u05E7)
15 ֻ (\u05BB) ֻ (\u05BB)
16 ד (\u05D3) ד (\u05D3)
17 ְ (\u05B0) ְ (\u05B0)
18 ש (\u05E9) ש (\u05E9)
19 * ׁ (\u05C1) ָ (\u05B8)
20 * ָ (\u05B8) ׁ (\u05C1)
21 א (\u05D0) א (\u05D0)
22 \s (\u0020) \s (\u0020)
23 ב (\u05D1) ב (\u05D1)
24 * ּ (\u05BC) ְ (\u05B0)
25 * ְ (\u05B0) ּ (\u05BC)
26 ר (\u05E8) ר (\u05E8)
27 ִ (\u05B4) ִ (\u05B4)
28 י (\u05D9) י (\u05D9)
29 ך (\u05DA) ך (\u05DA)
30 ְ (\u05B0) ְ (\u05B0)
31 \s (\u0020) \s (\u0020)
32 ה (\u05D4) ה (\u05D4)
33 ו (\u05D5) ו (\u05D5)
34 ּ (\u05BC) ּ (\u05BC)
35 א (\u05D0) א (\u05D0)
36 \s (\u0020) \s (\u0020)
37 ו (\u05D5) ו (\u05D5)
38 ּ (\u05BC) ּ (\u05BC)
39 ש (\u05E9) ש (\u05E9)
40 * ׁ (\u05C1) ְ (\u05B0)
41 * ְ (\u05B0) ׁ (\u05C1)
42 כ (\u05DB) כ (\u05DB)
43 ִ (\u05B4) ִ (\u05B4)
44 י (\u05D9) י (\u05D9)
45 נ (\u05E0) נ (\u05E0)
46 ְ (\u05B0) ְ (\u05B0)
47 ת (\u05EA) ת (\u05EA)
48 * ּ (\u05BC) ֵ (\u05B5)
49 * ֵ (\u05B5) ּ (\u05BC)
50 ה (\u05D4) ה (\u05D4)
51 ּ (\u05BC) ּ (\u05BC)
Position: First difference is at position 3
Expected: לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
Actual: לְשֵׁם יִחוּד קֻדְשָׁא בְּרִיךְ הוּא וּשְׁכִינְתֵּהּ
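The gist helper itself is not reproduced above; as an illustration only, here is a minimal Python stand-in (the function names are mine, not from the original gist):

```python
def char_codes(s: str) -> list:
    """Render each character with its code point, like the table above."""
    return [f"{ch} (\\u{ord(ch):04X})" for ch in s]

def first_difference(expected: str, actual: str):
    """Index of the first differing character, or None if the strings match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return None

# The first word of the snippet, in both orders seen above:
expected = "\u05DC\u05B0\u05E9\u05C1\u05B5\u05DD"  # ShinDot before Zeire (as typed)
actual = "\u05DC\u05B0\u05E9\u05B5\u05C1\u05DD"    # Zeire before ShinDot (as received)
print(first_difference(expected, actual))  # 3, matching the table
```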
What we can see from this test is that when the browser sends back the above text, it modifies it. To be more exact, when a Hebrew letter carries two Niqqud characters (which are like vowel characters in English: they tell you how to pronounce a specific sign), the browser swaps their order.
In the above text we had a Shin (ש) followed by a ShinDot (\u05C1) followed by a Zeire (\u05B5), but the browser swapped the order of the two Niqqud characters, so when the text was posted back, the server got a Shin (ש) followed by a Zeire (\u05B5) followed by a ShinDot (\u05C1).
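This matches what NFC's canonical ordering step does: combining marks after a base letter are sorted by their canonical combining class, and Zeire has class 15 while ShinDot has class 24. A quick sketch in Python (using the standard unicodedata module rather than a browser) reproduces the swap:

```python
import unicodedata

# Shin + ShinDot + Zeire, in the order the form originally contained.
sent = "\u05E9\u05C1\u05B5"

# NFC sorts combining marks by canonical combining class:
# Zeire (class 15) is moved before ShinDot (class 24).
received = unicodedata.normalize("NFC", sent)

print(received == "\u05E9\u05B5\u05C1")  # True: the marks were reordered
```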
After realizing that this really happens, I then wanted to know why. What is the reason for this behavior?
The answer I got was that browsers perform some sort of Unicode normalization, which in this case is probably Normalization Form C (NFC).
So I dug deeper into Unicode normalization, and I found out that this can be a serious problem when editing Hebrew text with Niqqud. While cases where this actually changes how the text is displayed are rare, they do exist, as outlined in the following document (page 9). Besides, this overrides the Hebrew font convention for Niqqud ordering, as mentioned in the same document:
… most users familiar with Hebrew would agree that the dagesh should, logically and linguistically, precede the vowel and the cantillation mark, and most would also agree that the vowel should precede the cantillation mark
Searching the web didn't yield any solution. I tried to see whether I could come up with a custom normalization that would de-normalize the characters back to their original order, but I quickly concluded that this is not an easy task. Based on the recommended mark ordering here (page 12), I can see that Hiriq precedes Patah, and this will leave the error in the word ירושלים, which changes its pronunciation from yerušālayim to yerušālaim, as described on page 9 here.
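The ordering problem comes straight from the canonical combining classes assigned to the Hebrew points; you can inspect them with Python's unicodedata module (a sketch; the class values come from the Unicode Character Database):

```python
import unicodedata

# Each Hebrew point has a fixed canonical combining class, which is the
# only thing NFC looks at when reordering marks after a base letter.
points = {
    "Sheva":   "\u05B0",  # class 10
    "Hiriq":   "\u05B4",  # class 14
    "Patah":   "\u05B7",  # class 17
    "Dagesh":  "\u05BC",  # class 21
    "ShinDot": "\u05C1",  # class 24
}
for name, ch in points.items():
    print(name, unicodedata.combining(ch))

# Because Hiriq (14) < Patah (17), NFC always emits Hiriq first, so the
# original typing order of the marks cannot be recovered after the fact.
```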
So, does this mean that I cannot use a web page to edit such Hebrew text? Really?
Or can you come up with some solution?
A solution, as I see it, can be either:
A way to avoid the normalization performed by the browser (Google Chrome).
A de-normalization algorithm to revert the Niqqud to its original order, as quoted above from this document, page 8.
I also posted this question to Stack Overflow; you can answer it there if that is more convenient for you.
Cross posted from http://www.code972.com/blog/2012/06/single-point-of-failure/
September 1st, 1983. Korean Airlines flight 007 from New York City to Seoul disappeared a couple of hours after take-off. Only later was it discovered that the plane deviated from its original route; instead of flying through air corridor R-20, it entered Soviet airspace and was shot down by a Soviet interceptor. All 269 people on board were killed.
During the investigation conducted by the National Transportation Safety Board (NTSB), it was made clear the plane had been cruising far further north than it should have been. Instead of flying above international waters, the plane somehow entered Soviet airspace, leading the Soviets to shoot it down in the belief that it was on a spying mission. How did the plane deviate that much from its assigned route? The NTSB came up with two possible options, both pointing at human error.
The first option was typing the aerial waypoints incorrectly. These are latitude/longitude pairs that the co-pilot enters and the captain validates, and they form the flight's route. Mistyping one digit may take the plane way off its planned route, possibly making it enter hostile territory. The NTSB also mentioned another possibility: not turning the coordinates-based auto-pilot (INS) on, and instead flying in the Magnetic Heading auto-pilot mode. The Magnetic Heading option is always on during take-off, so it would require the pilot to remember to change the auto-pilot mode. If he failed to do so, the INS system would not use the coordinates they typed to guide the plane, since it would be off.
The captain of KAL flight 007 had years of flying experience: more than ten years in KAL, and many years before that in the air force. Therefore, the NTSB deemed the second option "less likely". They thought it was much more likely that a number was typed incorrectly and never verified than that a very experienced pilot forgot to flip a switch right after take-off. It is a switch you flip on every flight, after all.
Years later, after the Soviet Union fell apart and the investigation could be concluded using the original black box from the plane, the real reason for the flight's deviation was discovered. It turns out the captain forgot to switch the INS system on, so the plane was cruising using the Magnetic Heading. Had he remembered to switch the INS system on at any point during the flight, he would have caught the error and redirected the plane to its assigned route, probably avoiding the disaster.
In the software world we have a lot of slogans, methodologies and names for patterns. Single point of failure is not just a slogan. In this case, the system had many single points of failure, and it was only a matter of time before one of them would have fatal consequences. I'm pretty sure this is not the only time a pilot forgot to switch to INS mode; it is the only time (that I know of) it caused deaths. Of an entire 747.
The single point of failure in this case is not a system crash or a bottleneck. It is the assumption that the operator will always remember to do the right thing at the right time. And that assumption is wrong, even if your user has 10+ years of flawless experience. I'm consciously avoiding the discussion of the poor UX of the auto-pilot system, which is why I left out some details relating to it. Yes, you can mitigate this with some UX tricks, like checklists or blinking signs or whatever, but then in the best scenario you are just making the failure less likely to happen, which is not good enough.
If the common practice is to always take off in Magnetic Heading mode and then switch to something else (not necessarily INS), then having it as a dedicated mode the pilot must remember to leave is a flawed assumption. But here I'm talking UX again, so we'll stop here.
When designing any software, not to mention complex systems, don't ever allow for a single point of failure, and don't ever assume it is only about preventing bottlenecks or crashes. In some systems you might save lives, but in most systems you'll just save yourself a lot of support calls.
You can read the full story, with all the details, in the Wikipedia page. National Geographic had a chapter on it in the excellent "Air Crash Investigation" series, which you can watch here. The image above is from that show.