Character encoding question

Submitted by idealists on Fri, 2008-12-05 13:32

I'm pulling a feed, and the feed has fancy ' (curly version) characters, etc.

When output to the screen, this shows up like:
"I�m pretty good" (even using the demo on this site).
When output to the DB (which is what I need to do), it gets truncated at the � mark.

What can I do to get the encoding correct?
I have no control over the source files of the feeds.

The DB is formatted as utf8-general or something like that.

Please help. Other than this issue this parser rocks.

Submitted by support on Fri, 2008-12-05 13:34

Hi,

This happens because the HTML page you are generating is not in the same character set as the data. All you need to do is send an appropriate HTML header, and the characters should appear on the page correctly. You can do this with the PHP header() command, but it must be issued before any output is generated; otherwise the headers cannot be altered. For utf-8, add this code to the top of your script:

  header("Content-Type: text/html; charset=utf-8");

That should fix it!

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 13:49

Thanks, I was outputting to the screen to debug only.

It usually outputs ONLY to the DB. And when it outputs to the DB table it gets truncated at the � mark.

The table it is outputting is utf8_general_ci, and the "text" field it is outputting into is also utf8_general_ci.

Submitted by support on Fri, 2008-12-05 14:00

Hi,

That would indicate that your feed is not in the utf-8 character set. The second most common character set is iso-8859-1, so the most appropriate thing to do would be to alter your database (and any subsequent code) to use this character set.

If that's not an option, using PHP's utf8_encode() function might work. You could use this within your myRecordHandler function as follows (incomplete example):

<?php
  function myRecordHandler($record)
  {
    // convert the field from iso-8859-1 to utf-8
    $someVar = utf8_encode($record["SOMEVAR"]);
    // $someVar now ready for utf-8 database
  }
?>

Hope this helps!

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 14:09

Hmmm, that's an improvement, but now instead of truncating before the � mark in the table field, it is actually putting in all the text, including the � mark.

Submitted by support on Fri, 2008-12-05 14:12

Hi,

Have you confirmed that how you are viewing the data in the table is in the same character set as the data? (I think phpMyAdmin makes sure it is the same.) If you're viewing the data using a PHP script, you would need to be sending the appropriate character set header as above...

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 14:20

Well, I think so. I'm using phpMyAdmin to view the data, which I think, like you said, views it with the correct character set.

Submitted by support on Fri, 2008-12-05 14:24

Hi,

That might indicate character encoding errors within the data. Some versions of PHP's underlying XML library (upon which Magic Parser is based) are better at handling encoding errors than others. Normally, a parse would be aborted when an error is encountered; but this is not always the case.

Could you perhaps email me your XML source (or send a link) and I'll check the data for you?

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 15:07

Email has been sent with URL.

Submitted by support on Fri, 2008-12-05 19:08

Hi,

Thanks - I'll check the feed out...

Cheers,
David.

Submitted by idealists on Sat, 2008-12-06 00:08

Hi

Please let me know what you found.

Submitted by support on Sat, 2008-12-06 06:08

Hi,

I've checked your source, and the data is valid iso-8859-1. However, it only contains one extended character - the iso-8859-1 version of ', so rather than using utf8_encode() (I'm not sure why that doesn't work), I've done a quick test and been able to make the text valid utf-8 simply by replacing this character with its ASCII counterpart. For example (incomplete):

  function myRecordHandler($record)
  {
    $record["FIELD4"] = str_replace(chr(146),"'",$record["FIELD4"]);
  }

($record["FIELD4"] is the description field from your text - which you should then be able to insert into your utf-8 table)
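
A possible explanation for why utf8_encode() did not help, offered as an assumption rather than something confirmed from the feed itself: byte 146 (0x92) is the curly apostrophe in Windows-1252, whereas in strict ISO-8859-1 it is a control character, so an ISO-8859-1 conversion would produce U+0092 rather than U+2019. A minimal sketch of converting from Windows-1252 instead, assuming the mbstring extension is installed (the function name feedToUtf8 is hypothetical):

```php
<?php
// Hypothetical helper: treat the input as Windows-1252 (an assumption about
// the feed) and convert it to UTF-8. Requires the mbstring extension.
function feedToUtf8($text)
{
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Byte 146 (0x92) becomes U+2019, the right single quotation mark,
// which is the three bytes E2 80 99 in UTF-8.
$fixed = feedToUtf8("I" . chr(146) . "m pretty good");
echo bin2hex($fixed), "\n";
```

If this applies to your feed, the curly apostrophe would be kept rather than replaced with a plain one.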

Hope this helps,
Cheers,
David.

Submitted by idealists on Sat, 2008-12-06 07:11

Thanks for this, David.

The problem for me is that my script pulls feeds from many different providers.
Is there any way to convert any feed text (which may vary between different encodings) to one encoding?

Submitted by support on Sat, 2008-12-06 09:52

Hi,

If all your sources are in English, one option is to cleanse all data to ASCII before inserting into your database. The only drawback is that you may lose the occasional apostrophe. Do this at the top of your myRecordHandler function to make all values UTF-8 compatible, before using $record as normal...

  function myRecordHandler($record)
  {
    foreach($record as $k => $v) {
      $l = strlen($v);
      $m = "";
      for($i=0;$i<$l;$i++) {
        $c = substr($v,$i,1);
        if ((ord($c) >= 32) && (ord($c) <= 127)) $m .= $c;
      }
      $record[$k] = $m;
    }
    // *************************
    // now use $record as normal
  }
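
If losing characters is a concern, another approach might be to normalise every value to UTF-8 rather than strip it down to ASCII. The sketch below leaves alone anything that is already valid UTF-8 and assumes everything else is Windows-1252. That fallback is a guess, not something the feeds guarantee. It needs the mbstring extension, and toUtf8 and normaliseRecord are hypothetical names:

```php
<?php
// Hypothetical helper: return $text as UTF-8. Input that is already valid
// UTF-8 is returned unchanged; anything else is ASSUMED to be Windows-1252.
function toUtf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: apply toUtf8() to every value in a record array.
function normaliseRecord($record)
{
    foreach ($record as $k => $v) {
        $record[$k] = toUtf8($v);
    }
    return $record;
}
```

You could call $record = normaliseRecord($record); at the top of myRecordHandler and then use $record as normal. Feeds in some other legacy encoding would still come out wrong, since the check can only tell "valid UTF-8" from "not valid UTF-8".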

Hope this helps!
Cheers,
David.

Submitted by idealists on Sat, 2008-12-06 10:28

Thanks,

How about something like this:

function cleanse_data($in) {
  $in = str_replace(chr(146),"'",$in);
  $in = str_replace(chr(147),'"',$in);
  $in = str_replace(chr(148),'"',$in);
  $in = str_replace(chr(150),"-",$in);
  $in = str_replace(chr(151),"-",$in);
  $l = strlen($in);
  $out = "";
  for($i=0;$i<$l;$i++) {
    $c = substr($in,$i,1);
    if ((ord($c) >= 32) && (ord($c) <= 127)) $out .= $c;
  }
  return $out;
}

Would that cover all bases?

Submitted by support on Sat, 2008-12-06 11:52

Hi,

Good idea - it should work fine.
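
One thing that may be worth adding: in Windows-1252, byte 145 is the opening curly single quote and byte 133 is the ellipsis. Neither is mapped in the function above, so both would simply be stripped. Here is a table-driven sketch of the same idea, where the name cleanseData and the choice of characters in the map are assumptions for illustration:

```php
<?php
// Hypothetical variant of cleanse_data(): map common Windows-1252
// punctuation bytes to ASCII, then drop everything else that is not
// printable ASCII.
function cleanseData($in)
{
    $map = array(
        chr(133) => "...", // ellipsis
        chr(145) => "'",   // opening curly single quote
        chr(146) => "'",   // closing curly single quote / apostrophe
        chr(147) => '"',   // opening curly double quote
        chr(148) => '"',   // closing curly double quote
        chr(150) => "-",   // en dash
        chr(151) => "-",   // em dash
    );
    $out = strtr($in, $map);
    // Remove every byte outside the range 32 to 126.
    return preg_replace('/[^\x20-\x7E]/', '', $out);
}
```

Unlike the loop version, this also drops byte 127 (DEL), which is not a printable character. Like the loop version, it removes tabs and line breaks, so it only suits single-line fields.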

Cheers,
David.

Submitted by idealists on Sat, 2008-12-06 12:33

Thanks for your help, David.