Character encoding question

Submitted by idealists on Fri, 2008-12-05 13:32 in Magic Parser

I'm pulling a feed that contains fancy (curly) apostrophe characters and the like.

When output to the screen, this shows up like:
"I�m pretty good" (even using the demo on this site).
When output to the DB (which is what I need to do), the text gets truncated at the � mark.

What can I do to get the encoding correct?
I have no control over the source files of the feeds.

The DB is formatted as utf8-general or something like that.

Please help. Other than this issue this parser rocks.

Submitted by support on Fri, 2008-12-05 13:34

Hi,

This happens because the HTML page you are generating is not in the same character set as the data. All you need to do is send an appropriate HTML header, and the characters should appear on the page correctly. You can do this with the PHP header() command, but it must be issued before any output is generated otherwise the headers cannot be altered. For utf-8, add this code to the top of your script:

header("Content-Type: text/html; charset=utf-8");

That should fix it!

Cheers,
David.
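
Since header() only works before any output has been generated, a small guard can make that failure visible instead of silent. The following is a minimal sketch; send_utf8_header() is a hypothetical helper name, not part of Magic Parser:

```php
<?php
// Hypothetical helper: send the UTF-8 Content-Type header only while that is
// still possible, and report whether it was sent.
function send_utf8_header(): bool
{
    if (headers_sent()) {
        // Output has already started, so the header can no longer be changed.
        return false;
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

if (!send_utf8_header()) {
    error_log('Content-Type header not sent: output had already started');
}
```

Calling it as the very first statement of the script keeps it ahead of any echo or stray whitespace outside the PHP tags.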

Submitted by idealists on Fri, 2008-12-05 13:49

Thanks, I was outputting to the screen to debug only.

It usually outputs ONLY to the DB. And when it outputs to the DB table it gets truncated at the � mark.

The table it is outputting is utf8_general_ci, and the "text" field it is outputting into is also utf8_general_ci.

Submitted by support on Fri, 2008-12-05 14:00

Hi,

That would indicate that your feed is not in the utf-8 character set. The second most common character set is iso-8859-1, so the most appropriate thing to do would be to alter your database (and any subsequent code) to use this character set.

If that's not an option, using PHP's utf8_encode() function might work. You could use this within your myRecordHandler function as follows (incomplete example):

<?php
  function myRecordHandler($record)
  {
    $someVar = utf8_encode($record["SOMEVAR"]);
    // $someVar now ready for utf-8 database
  }
?>

Hope this helps!

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 14:09

Hmmm, that's an improvement, but now instead of truncating before the � mark in the table field, it stores all of the text, including the � mark.

Submitted by support on Fri, 2008-12-05 14:12

Hi,

Have you confirmed that the tool you are using to view the data in the table uses the same character set as the data? (I think phpMyAdmin makes sure it is the same.) If you're viewing the data using a PHP script, you would need to send the appropriate character set header as above...

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 14:20

Well, I think so. I'm using phpMyAdmin to view the data, which I think, like you said, views it with the correct character set.

Submitted by support on Fri, 2008-12-05 14:24

Hi,

That might indicate character encoding errors within the data. Some versions of PHP's underlying XML library (upon which Magic Parser is based) are better at handling encoding errors than others. Normally, a parse would be aborted when an error is encountered; but this is not always the case.

Could you perhaps email me your XML source (or send a link) and I'll check the data for you?

Cheers,
David.

Submitted by idealists on Fri, 2008-12-05 15:07

Email has been sent with URL.

Submitted by support on Fri, 2008-12-05 19:08

Hi,

Thanks - I'll check the feed out...

Cheers,
David.

Submitted by idealists on Sat, 2008-12-06 00:08

Please let me know what you found.

Submitted by support on Sat, 2008-12-06 06:08

Hi,

I've checked your source, and the data is in a single-byte, iso-8859-1 style encoding. However, it only contains one extended character, byte 146, which is the curly apostrophe in Windows-1252. In strict iso-8859-1 that byte is a control character, which would explain why utf8_encode() does not give a readable result. I've done a quick test and been able to make the text valid utf-8 simply by replacing this character with its ASCII counterpart. For example (incomplete):

  function myRecordHandler($record)
  {
    $record["FIELD4"] = str_replace(chr(146),"'",$record["FIELD4"]);
  }

($record["FIELD4"] is the description field from your text - which you should then be able to insert into your utf-8 table)

Hope this helps,
Cheers,
David.
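
If the curly apostrophe should be kept rather than flattened to ASCII, one option is to treat the feed as Windows-1252 and convert it to utf-8. This is a minimal sketch under two assumptions: that the feed really is Windows-1252 (suggested by byte 146, but not confirmed in this thread) and that PHP's mbstring extension is available. cp1252_to_utf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: interpret the input as Windows-1252 and return UTF-8.
// Assumes the mbstring extension is loaded.
function cp1252_to_utf8(string $text): string
{
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Byte 146 (0x92) is the curly apostrophe in Windows-1252.
$raw  = "I" . chr(146) . "m pretty good";
$utf8 = cp1252_to_utf8($raw);

// The result should now pass a UTF-8 validity check.
var_dump(mb_check_encoding($utf8, 'UTF-8'));
```

By contrast, utf8_encode() treats its input as iso-8859-1, so it turns byte 146 into the control character U+0092 instead of the apostrophe U+2019.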

Submitted by idealists on Sat, 2008-12-06 07:11

Thanks for this David

The problem for me is that my script pulls feeds from many different providers.
Is there any way to convert any feed text (which may arrive in different encodings) to one single encoding?

Submitted by support on Sat, 2008-12-06 09:52

Hi,

If all your sources are in English, one option is to cleanse all data to ASCII before inserting into your database. The only drawback is that you may lose the occasional apostrophe. Do this at the top of your myRecordHandler function to make all values UTF-8 compatible, before using $record as normal...

  function myRecordHandler($record)
  {
    // keep only bytes 32 to 127 (ASCII) in every field; all other bytes are dropped
    foreach($record as $k => $v) {
      $l = strlen($v);
      $m = "";
      for($i=0;$i<$l;$i++) {
        $c = substr($v,$i,1);
        if ((ord($c) >= 32) && (ord($c) <= 127)) $m .= $c;
      }
      $record[$k] = $m;
    }
    // *************************
    // now use $record as normal
  }

Hope this helps!
Cheers,
David.
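
For feeds from many providers, an alternative to stripping everything down to ASCII is to normalise each value to utf-8. The sketch below rests on a loud assumption: any value that is not already valid utf-8 is treated as Windows-1252, which may be wrong for some providers. It also assumes the mbstring extension; to_utf8() and normalise_record() are hypothetical helper names.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not already valid UTF-8 is Windows-1252.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: apply to_utf8() to every field of a record array.
function normalise_record(array $record): array
{
    foreach ($record as $k => $v) {
        $record[$k] = to_utf8((string) $v);
    }
    return $record;
}

$record = normalise_record([
    "FIELD1" => "caf\xC3\xA9",   // already UTF-8, left untouched
    "FIELD4" => "I\x92m here",   // Windows-1252 byte 146, converted
]);
```

Inside myRecordHandler this could be called as the first statement, for example $record = normalise_record($record);, before using $record as normal.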

Submitted by idealists on Sat, 2008-12-06 10:28

Thanks,

How about something like this:

function cleanse_data($in) {
  // replace the curly quote and dash bytes with their ASCII counterparts
  $in = str_replace(chr(146),"'",$in);
  $in = str_replace(chr(147),'"',$in);
  $in = str_replace(chr(148),'"',$in);
  $in = str_replace(chr(150),"-",$in);
  $in = str_replace(chr(151),"-",$in);
  // then keep only bytes 32 to 127 (ASCII)
  $l = strlen($in);
  $out = "";
  for($i=0;$i<$l;$i++) {
    $c = substr($in,$i,1);
    if ((ord($c) >= 32) && (ord($c) <= 127)) $out .= $c;
  }
  return $out;
}

Would that cover all bases?

Submitted by support on Sat, 2008-12-06 11:52

Hi,

Good idea - it should work fine.

Cheers,
David.
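
The same cleansing idea can be written more compactly with strtr() and preg_replace(). This is only a sketch of an equivalent approach; cleanse_to_ascii() is a hypothetical name, and the byte meanings in the comments assume the feed is Windows-1252.

```php
<?php
// Hypothetical compact variant of the cleansing approach from this thread.
function cleanse_to_ascii(string $in): string
{
    // Map the Windows-1252 punctuation bytes to ASCII look-alikes first.
    $in = strtr($in, [
        "\x92" => "'",   // chr(146), right single quote
        "\x93" => '"',   // chr(147), left double quote
        "\x94" => '"',   // chr(148), right double quote
        "\x96" => '-',   // chr(150), en dash
        "\x97" => '-',   // chr(151), em dash
    ]);
    // Then drop every byte outside the range 32 to 127.
    return preg_replace('/[^\x20-\x7F]/', '', $in);
}

echo cleanse_to_ascii("I\x92m \x93pretty\x94 good \x96 really"), "\n";
```

As with the loop version, bytes below 32 are removed too, so tabs and line breaks inside a field are lost.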

Submitted by idealists on Sat, 2008-12-06 12:33

Thanks for your help David