dir.xiph.org/yp.xml applies UTF-8 encoding too many times
I am working on an application (Streamtastic) that parses dir.xiph.org/yp.xml. When I look at the parsed result, all characters whose UTF-8 encoding takes two or more bytes are broken.
The cause seems to be that UTF-8 encoding is applied multiple times to the source strings. Simply open dir.xiph.org/yp.xml in a browser to see the over-encoding.
For instance, für becomes fÃÂ¼r.
I was able to work around this issue with a filter stream, but this only works for some 2-byte encodings (for instance, recovering ä and ü). Characters with 3- or 4-byte encodings can't be recovered, because some characters, such as those encoded in UTF-8 as c2 82 or c2 83, are filtered out.
Furthermore, the over-encoding cannot be observed on the website version of the directory, so something is going wrong while the yp.xml directory dump is created.
Example of over-encoded ü:
ü in UTF-8 = c3 bc
c3 bc in UTF-8 = c3 83 c2 bc
c3 83 c2 bc in UTF-8 = c3 83 c2 83 c3 82 c2 bc
ü found in yp.xml = c3 83 c3 82 c2 bc
As you can see, c2 83 is missing from the final output, which makes it impossible to solve this problem by simply decoding UTF-8 multiple times.
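The byte sequences above can be reproduced with a small Python sketch. It assumes that each extra round misreads the UTF-8 bytes as ISO-8859-1 (Latin-1) before encoding them again; that is my guess based on the byte values, not something I have confirmed on the server side. The function names are made up for illustration.

```python
def overencode(text: str, rounds: int) -> bytes:
    """Apply UTF-8 encoding `rounds` times in total.

    Assumption: between rounds the bytes are misread as Latin-1.
    """
    data = text.encode("utf-8")
    for _ in range(rounds - 1):
        data = data.decode("latin-1").encode("utf-8")
    return data


def undo_overencoding(data: bytes, rounds: int) -> str:
    """Reverse overencode(); raises UnicodeError if the bytes are damaged."""
    for _ in range(rounds - 1):
        data = data.decode("utf-8").encode("latin-1")
    return data.decode("utf-8")


def can_recover(data: bytes, rounds: int) -> bool:
    """Return True if repeated decoding succeeds for the given bytes."""
    try:
        undo_overencoding(data, rounds)
        return True
    except UnicodeError:
        return False


# Intact triple encoding of "ü": c3 83 c2 83 c3 82 c2 bc
triple = overencode("ü", 3)
print(triple.hex(" "), "->", undo_overencoding(triple, 3))

# Bytes actually found in yp.xml, with c2 83 missing
found = bytes.fromhex("c3 83 c3 82 c2 bc")
print(found.hex(" "), "recoverable:", can_recover(found, 3))
```

With the intact 8-byte sequence the repeated decoding gets back to ü, but with the 6 bytes found in yp.xml the second decoding step fails, because c3 is followed by c2 instead of a continuation byte.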
It would be nice if the UTF-8 encoding of dir.xiph.org/yp.xml could be fixed.