Cannot parse special characters (symbols) like Trademark, Copyright and Registered

Submitted by bayuobie on Tue, 2011-06-28 21:46

I have used Magic Parser to read XML files, but it failed to output anything because one of my files contained special characters. Below is a sample XML item containing one of them (the ® after "Energy Star").

<product>
  <id>1591</id>
    <description><![CDATA[
MODELLO PRODOTTO: MONITOR ASUS LCD 23` MS236H FULL HD
                          Power cord
                          Power adapter
                          Quick start guide
                          HDMI-to-DVI cable
                          warranty card
Regulation Approval Energy Star®, UL/cUL, CB, CE, FCC, CCC, BSMI, Gost-R, C-Tick, MEPS, VCCI, PSE, J-MOSS,
                          PSB,China Energy Label Level 1, RoHS, WEEE, Windows Vista WHQL]]></description>
  </product>

Submitted by support on Wed, 2011-06-29 08:13

Hello bayuobie,

I notice that the data is correctly delimited using CDATA tags, so this is almost certainly a character encoding error or, less likely, a mismatch between the declared encoding of the document and the actual data itself.
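To illustrate why a character such as ® can trip up a parser, here is a minimal Python sketch (not Magic Parser code; the helper name `decodes_as_utf8` is made up for this example). It shows that the same character is stored as different bytes in ISO-8859-1 and UTF-8, and that the ISO-8859-1 form is not valid UTF-8.

```python
# The registered-trademark sign as it appears in the sample description.
text = "Energy Star®"

latin1_bytes = text.encode("iso-8859-1")  # ® is stored as the single byte 0xAE
utf8_bytes = text.encode("utf-8")         # ® is stored as the two bytes 0xC2 0xAE


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes[-1:], decodes_as_utf8(latin1_bytes))
print(utf8_bytes[-2:], decodes_as_utf8(utf8_bytes))
```

A lone 0xAE byte is a UTF-8 continuation byte with no lead byte before it, so a reader that expects UTF-8 will reject data that was actually written as ISO-8859-1.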

To work around this, I have cleansing versions of the script, which I will send you to use in place of the standard version. Please check your email at the address you registered on the forum with.

Cheers,
David
--
MagicParser.com

Submitted by bayuobie on Wed, 2011-07-06 14:02

Hey David,

The cleansing version worked well for me. I wanted to add that the UTF-8 version did not work for me because the XML source file had its header content type set to ISO-8859-1, so I switched to the cleansing version for ISO-8859-1.
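For anyone who wants to check which encoding a feed declares before choosing a version, the following Python sketch may help. It is not the cleansing script and makes no claim about how that script works; `declared_encoding` and `to_utf8` are hypothetical helpers, and the sketch assumes the file starts with an XML declaration (no leading whitespace or byte order mark) whose label matches the actual bytes.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
DECL_RE = re.compile(
    rb"""^(<\?xml[^>]*?encoding\s*=\s*["'])([A-Za-z0-9._-]+)(["'])"""
)


def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = DECL_RE.match(raw)
    return match.group(2).decode("ascii") if match else default


def to_utf8(raw: bytes) -> bytes:
    """Transcode raw XML to UTF-8 and relabel its declaration accordingly."""
    source = declared_encoding(raw)
    # The declaration is plain ASCII, so it can be relabelled before decoding.
    relabelled = DECL_RE.sub(rb"\g<1>UTF-8\g<3>", raw, count=1)
    return relabelled.decode(source).encode("utf-8")


sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<description>Energy Star\xae</description>"
)
print(declared_encoding(sample))
print(to_utf8(sample).decode("utf-8"))
```

If the declared label is wrong (for example, the file says UTF-8 but contains ISO-8859-1 bytes), the decode step raises `UnicodeDecodeError`, which is itself a useful sign of the mismatch David describes.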

Thanks.