Switching between cleansing versions of Magic parser

Submitted by bayuobie on Mon, 2011-08-01 15:25

Hi David,

I'm parsing feeds from URLs and I have no control over the character encoding of the feed files. As a result, some arrive in UTF-8 while others arrive in ISO-8859-1. Magic Parser works correctly on a UTF-8 feed when I include MagicParserUTF8.php, but fails on an ISO-8859-1 feed with that same include, and vice versa; the result depends on which cleansing version I include.

I'm wondering if there is a way to switch between MagicParserISO88591.php and MagicParserUTF8.php, so that if I include MagicParserUTF8.php and it fails to parse the URL, the include file falls back to MagicParserISO88591.php. Otherwise, I'm counting on luck that most of my feeds are UTF-8 encoded; if they aren't, I get no results.

Is there any way to do this? Any help will be appreciated.

Regards,

Blaise

Submitted by support on Mon, 2011-08-01 15:37

Hello Blaise,

If it has been a while since you last tried the generic MagicParser.php, I would first recommend giving that a go, as your host may have upgraded PHP or the server OS in the meantime, so that you no longer have the encoding error intolerance problem.

As a second solution, and assuming that each of your feeds can individually be loaded into memory (subject to PHP's memory_limit setting, which defaults to 128M in PHP 5.3 and later), the best approach would be to load each feed via file_get_contents() and then, based on the source URL, apply utf8_encode() or utf8_decode() as required. Consider the following example, where the example.com feed works fine, but the example.net feed must be cleansed to UTF-8 (utf8_encode) and the example.org feed must be cleansed to ISO-8859-1 (utf8_decode)...

  $feeds = array();
  $feeds[] = "http://www.example.com/data.xml";
  $feeds[] = "http://www.example.net/data.xml";
  $feeds[] = "http://www.example.org/data.xml";
  // target encoding for feeds that need cleansing; unlisted feeds are parsed as-is
  $cleanse = array();
  $cleanse["http://www.example.net/data.xml"] = "utf-8";
  $cleanse["http://www.example.org/data.xml"] = "iso-8859-1";
  foreach($feeds as $feed)
  {
    $xml = file_get_contents($feed);
    // skip this feed if it could not be fetched
    if ($xml === FALSE) continue;
    if (isset($cleanse[$feed]))
    {
      switch($cleanse[$feed])
      {
        case "utf-8":
          $xml = utf8_encode($xml);
          break;
        case "iso-8859-1":
          $xml = utf8_decode($xml);
          break;
      }
    }
    MagicParser_parse("string://".$xml,"myRecordHandler","xml|FORMAT/STRING/");
  }
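
If maintaining the per-URL list becomes a chore, one possible alternative is to detect the encoding of each feed at run time. The following is only a rough sketch and not part of Magic Parser: toUtf8() is a hypothetical helper name, it requires PHP's mbstring extension, and it assumes that any feed which is not valid UTF-8 is ISO-8859-1, which may not hold for all of your sources. It also does not touch any encoding named in the feed's XML declaration.

```php
<?php
// Hypothetical helper (not part of Magic Parser): return $xml as UTF-8.
// Assumption: input that is not valid UTF-8 is encoded as ISO-8859-1.
function toUtf8($xml)
{
  if (mb_check_encoding($xml, "UTF-8"))
  {
    // already valid UTF-8, so leave it untouched
    return $xml;
  }
  // otherwise treat the bytes as ISO-8859-1 and convert them
  return mb_convert_encoding($xml, "UTF-8", "ISO-8859-1");
}
```

Inside the foreach loop above, the isset() / switch block could then be replaced by a single call such as $xml = toUtf8($xml); before passing the string to MagicParser_parse().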

Hope this helps,
Cheers,
David