I'm pulling a feed, and the feed has fancy ' (curly version) characters, etc.
When output to the screen, this shows up like:
"I�m pretty good" (even using the demo on this site).
When output to the DB (which is what I need to do), the text gets truncated at the � mark.
What can I do to get the encoding correct?
I have no control over the source files of the feeds.
The DB is formatted as utf8-general or something like that.
Please help. Other than this issue this parser rocks.
Thanks, I was outputting to the screen to debug only.
It usually outputs ONLY to the DB. And when it outputs to the DB table it gets truncated at the � mark.
The table it is outputting to is utf8_general_ci, and the "text" field it is outputting into is also utf8_general_ci.
Hi,
That would indicate that your feed is not in the utf-8 character set. The second most common character set is iso-8859-1, so the most appropriate thing to do would be to alter your database (and any subsequent code) to use this character set.
If that's not an option, using PHP's utf8_encode() function might work. You could use this within your myRecordHandler function as follows (incomplete example):
<?php
function myRecordHandler($record)
{
  // utf8_encode() converts an ISO-8859-1 string to UTF-8
  $someVar = utf8_encode($record["SOMEVAR"]);
  // $someVar now ready for utf-8 database
}
?>
Hope this helps!
Cheers,
David.
Hmmm, that's an improvement, but now, instead of truncating before the � mark in the table field, it is actually storing all the text, including the � mark.
Hi,
Have you confirmed that you are viewing the data in the table in the same character set as the data (I think phpMyAdmin makes sure it is the same)? If you're viewing the data using a PHP script you would need to be sending the appropriate character set header as above...
Cheers,
David.
Well, I think so. I'm using phpMyAdmin to view the data, which I think, like you said, displays it with the correct character set.
Hi,
That might indicate character encoding errors within the data. Some versions of PHP's underlying XML library (upon which Magic Parser is based) are better at handling encoding errors than others. Normally, a parse would be aborted when an error is encountered; but this is not always the case.
Could you perhaps email me your XML source (or send a link) and I'll check the data for you?
Cheers,
David.
Hi,
Thanks - I'll check the feed out...
Cheers,
David.
Hi,
I've checked your source, and the data is valid iso-8859-1. However, it only contains one extended character - the iso-8859-1 version of ', so rather than using utf8_encode() (I'm not sure why that doesn't work) I've done a quick test and been able to make the text valid utf-8 simply by replacing this character with its ASCII counterpart. For example (incomplete):
function myRecordHandler($record)
{
  // replace byte 146 (the curly apostrophe) with a plain ASCII apostrophe
  $record["FIELD4"] = str_replace(chr(146),"'",$record["FIELD4"]);
}
($record["FIELD4"] is the description field from your text - which you should then be able to insert into your utf-8 table)
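One possible reason utf8_encode() leaves a stray character, which has not been confirmed against this feed: byte 146 is the curly apostrophe in Windows-1252, whereas in strict ISO-8859-1 it is a control character, and utf8_encode() assumes ISO-8859-1. A minimal sketch that treats the input as Windows-1252 instead, using PHP's iconv() function. The helper name cp1252_to_utf8 is made up for illustration and is not part of Magic Parser:

```php
<?php
// Hypothetical helper (not part of Magic Parser).
// Assumption: the feed text is really Windows-1252, not strict ISO-8859-1.
function cp1252_to_utf8($text)
{
    // iconv() returns false if it meets a byte it cannot convert
    $converted = iconv("Windows-1252", "UTF-8", $text);
    // On failure, hand back the original text unchanged
    return ($converted === false) ? $text : $converted;
}

function myRecordHandler($record)
{
    // FIELD4 is the description field mentioned in this thread
    $record["FIELD4"] = cp1252_to_utf8($record["FIELD4"]);
    // now use $record as normal
}
```

With this approach the curly apostrophe should survive as a real UTF-8 character rather than being swapped for a plain one.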
Hope this helps,
Cheers,
David.
Thanks for this David
The problem for me is that my script pulls feeds from many different providers.
Is there any way to convert any feed text (which may be in various encodings) to a single encoding?
Hi,
If all your sources are in English, one option is to cleanse all data to ASCII before inserting into your database. The only drawback is that you may lose the occasional apostrophe. Do this at the top of your myRecordHandler function to make all values UTF-8 compatible, before using $record as normal...
function myRecordHandler($record)
{
  foreach($record as $k => $v) {
    $l = strlen($v);
    $m = "";
    // keep only bytes in the ASCII range 32 to 127; drop everything else
    for($i=0;$i<$l;$i++) {
      $c = substr($v,$i,1);
      if ((ord($c) >= 32) && (ord($c) <= 127)) $m .= $c;
    }
    $record[$k] = $m;
  }
  // *************************
  // now use $record as normal
}
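If losing characters is not acceptable, another option is to convert each value to UTF-8 instead of stripping it. The sketch below assumes every feed is either already UTF-8 or in Windows-1252 / ISO-8859-1 (it cannot detect other encodings), and the helper name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper, not part of Magic Parser.
// Assumption: input is either valid UTF-8 or Windows-1252 / ISO-8859-1.
function to_utf8($text)
{
    // With the /u modifier, preg_match() fails on invalid UTF-8,
    // so a result of 1 means the string is already valid UTF-8
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    // Otherwise treat the bytes as Windows-1252; @ hides the notice
    // iconv() raises for the few bytes that encoding leaves undefined
    $converted = @iconv("Windows-1252", "UTF-8", $text);
    if ($converted !== false) {
        return $converted;
    }
    // Last resort: strip everything outside printable ASCII
    return preg_replace('/[^\x20-\x7E]/', '', $text);
}

function myRecordHandler($record)
{
    // array_map() keeps the field names (keys) of $record intact
    $record = array_map('to_utf8', $record);
    // now use $record as normal
}
```

Checking for valid UTF-8 first matters because converting text that is already UTF-8 a second time would corrupt it.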
Hope this helps!
Cheers,
David.
Thanks,
How about something like this:
function cleanse_data($in) {
  $in = str_replace(chr(146),"'",$in);
  $in = str_replace(chr(147),'"',$in);
  $in = str_replace(chr(148),'"',$in);
  $in = str_replace(chr(150),"-",$in);
  $in = str_replace(chr(151),"-",$in);
  $l = strlen($in);
  $out = "";
  for($i=0;$i<$l;$i++) {
    $c = substr($in,$i,1);
    if ((ord($c) >= 32) && (ord($c) <= 127)) $out .= $c;
  }
  return $out;
}
Would that cover all bases?
Hi,
Good idea - it should work fine.
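For reference, the same cleansing can be written more compactly with strtr() and preg_replace() and applied to every field of a record at once. This is only a sketch: cleanse_data_compact and cleanse_record are hypothetical names, and the mappings are the ones from the function posted above.

```php
<?php
// Compact variant of cleanse_data() (a sketch, not the code posted above)
function cleanse_data_compact($in)
{
    // Map the "smart" punctuation bytes to their ASCII counterparts first
    $in = strtr($in, array(
        chr(146) => "'",
        chr(147) => '"',
        chr(148) => '"',
        chr(150) => "-",
        chr(151) => "-",
    ));
    // Then drop every byte outside the range 32 to 127
    return preg_replace('/[^\x20-\x7F]/', '', $in);
}

// Hypothetical helper: cleanse every field, keeping the field names
function cleanse_record($record)
{
    return array_map('cleanse_data_compact', $record);
}

function myRecordHandler($record)
{
    $record = cleanse_record($record);
    // now use $record as normal
}
```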
Cheers,
David.
Hi,
This happens because the HTML page you are generating is not in the same character set as the data. All you need to do is send an appropriate HTTP header, and the characters should appear on the page correctly. You can do this with the PHP header() function, but it must be called before any output is generated; otherwise the headers cannot be altered. For utf-8, add this code to the top of your script:
header("Content-Type: text/html; charset=utf-8");
That should fix it!
Cheers,
David.