Hi David,
I am trying to display the feed at http://www.dcothai.com/rss-news.php (encoding=ISO-8859-1) on a webpage (charset=ISO-8859-1), but would like some ideas on how to clean up the weird HTML entities and strange characters in the title and description fields.
Best regards,
Bangkok Bob
Hi David,
Thanks for your prompt response, as always.
Maybe it's a browser issue, then. I go to http://www.dcothai.com/rss-news.php and view source in Firefox so I can save the feed (for subsequent upload to PT), but the ampersand is being altered, which causes problems later. I tried IE, but it shows only a blank page for the XML and offers no view source.
<item>
<title>Les m&#233;moires d&#39;un d&#233;tective priv&#233; &#224; Bang</title>
<link>http://www.dcothai.com/product_info.php?products_id=816</link>
<guid>http://www.dcothai.com/product_info.php?products_id=816</guid>
<description><P class=producthead>Les m&#233;moires d&#39;un d&#233;tective priv&#233; &#224; Bangkok</p><P class=productheadtwo>Histoires vraies tir&#233;es des dossiers de Warren Olson</p><p>Quand vous n&#39&#234;tes pas l&#224;, que fait votre copine ? www. Thaiprivateeye .com</p><p>Les filles de bar avec un double emploi, les &#233;pouses suspicieuses, l&#39amante lesbienne, voil&#224; ce qu&#39&#233;taient les journ&#233;es du d&#233;tective priv&#233; de Bangkok Warren Olson.</p><p>Pendant plus de dix ans Olson a parcouru les petites rues de la &#39Grosse Mangue&#39. Parlant parfaitement le Tha&#239; et le Khmer, il &#233;tait capable d&#39aller l&#224; o&#249; les autres d&#233;tectives n&#39osaient pas tra&#238;ner.</p><p>Ces clients &#233;taient des occidentaux qui avaient perdu leur coeur -- et leurs &#233;conomies -- avec des filles de bar affam&#233;es. Mais il eut aussi pas mal de clients tha&#239;s dont une charmante vieille dame qui avait &#233;t&#233; d&#233;trouss&#233;e par un arnaqueur chr&#233;tien et aussi une fille tha&#239;e &#224; laquelle un ancien amant faisait du chantage.</p><p>Personne ne sait mieux que Warren Olson les combines que les filles de bar peuvent employer pour que les clients occidentaux se d&#233;partissent de leur argent durement gagn&#233;. Maintenant qu&#39il a arr&#234;t&#233;, il a envie de partager ses exp&#233;riences avec le reste du monde. Ses grands moments de sagesse sont :<br>- D&#233;finition d&#39une fille de bar : <br>rus&#233;e comme un renard mais avec la cervelle d&#39un poisson rouge. <br>- Quand vous avez comme copine une fille de bar, vous ne la perdez jamais. Vous perdez seulement quelquefois votre place dans la queue.</p><p>Les histoires sont bas&#233;es sur les dossiers de Warren Olson. Pour prot&#233;ger les innocents et les coupables, elles sont arrang&#233;es par l&#39auteur de best-sellers Stephen Leather. 
Olson est maintenant reparti dans son pays natal, la Nouvelle Z&#233;lande, avec sa femme tha&#239;e et leur fille mais l&#39agence qu&#39il a cr&#233;e est toujours ouverte. On peut avoir des infos sur : www. thaiprivateeye .com</p><p>Paperback<br>Bangkok 2006<br>Bamboo Sinfonia<br>ISBN 9789748284033<br>305 pages</p></description>
<ecommerce:listPrice>$16.45</ecommerce:listPrice>
<ecommerce:SKU>20004186</ecommerce:SKU>
<category>Books Thailand</category>
</item>
When I look at the example you parsed at http://www.magicparser.com/examples/dcothai.php, FireFox shows a validation error.
Warning: entity "&#39" doesn't end in ';'
Best regards,
Bangkok Bob
Hi Bob,
The &#39 entity seems to be a strange one: although it is not encoded correctly in the feed (the trailing ";" is missing), Firefox still displays it correctly as the ' character.
If you want to fix it up, I think the best way to deal with this is to do a str_replace on this string when reading the feed, for example:
$record["DESCRIPTION"] = str_replace("&#39","'",$record["DESCRIPTION"]);
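One thing to watch: the feed mixes the broken form with the correct "&#39;" form (the title uses the semicolon), so a literal replace on the short string would leave a stray ";" behind in those places. Here is a minimal sketch that copes with both forms. The helper name fixApostropheEntity is made up for illustration and is not part of Magic Parser or Price Tapestry:

```php
<?php
// Hypothetical helper (name is an assumption): normalise the apostrophe
// entity whether or not the feed included the trailing semicolon.
function fixApostropheEntity($text)
{
    // Match "&#39" with an optional ";" but not longer numbers such as "&#390;".
    return preg_replace('/&#39(?!\d);?/', "'", $text);
}

// Entities other than the apostrophe are left untouched.
print fixApostropheEntity("qu&#39il a cr&#233;e, d&#39;un");
?>
```

You would call it the same way as the str_replace line, for example on $record["DESCRIPTION"] and $record["TITLE"].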
To save the feed for loading into PT, view it in a web browser and go to View > Source; you can normally then save it to disk using File > Save As...
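If the browser keeps altering the ampersands when you save, another option is to skip the browser and let PHP fetch the feed byte-for-byte. This is only a sketch: the function name saveFeed is an assumption, and fetching from a URL relies on allow_url_fopen being enabled on your server:

```php
<?php
// Hypothetical helper (name is an assumption): copy a feed to a local
// file without any browser re-encoding the entities along the way.
function saveFeed($source, $destination)
{
    $xml = file_get_contents($source);
    if ($xml === false) {
        return false; // could not read the source
    }
    // file_put_contents() returns the byte count, or false on failure.
    return file_put_contents($destination, $xml) !== false;
}

// Example call (needs allow_url_fopen for a URL source):
// saveFeed("http://www.dcothai.com/rss-news.php", "dcothai.xml");
?>
```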
If the problem you are having is later on and you are importing the feed using Price Tapestry, the entities will be broken by default because the "&", "#" and ";" characters are all stripped from fields during import. This behavior is easily changed - see the following thread for the code changes...
http://www.pricetapestry.com/node/1524
Cheers,
David.
Hi David,
Thanks for your information.
I have added a str_replace in PT includes/admin.php in * apply standard filters * before the strip_tags line and this solves the problem with the ' apostrophe.
I am editing the text entries in phpMyAdmin to make foreign characters and other html stuff appear correctly after strip_tags. Maybe this is a slow way to do it (?) but I guess I can avoid duplicate content filters this way, too?
Best regards,
Bangkok Bob
Hi Bob,
You're right about the filters. Remember that you can add as many str_replace() lines as you want to make other changes during import if that will save you editing the database directly...
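As a rough sketch of what that could look like: str_replace() also accepts arrays, so several fixes can go in one call, and html_entity_decode() can turn well-formed entities into real ISO-8859-1 characters so there is less to edit by hand. The function name cleanField and the replacement pairs below are assumptions for illustration, not actual Price Tapestry code:

```php
<?php
// Hypothetical import filter; the name and replacement pairs are
// illustrative only.
function cleanField($text)
{
    // Add the missing ";" to bare "&#39" entities so they can be decoded
    // (leaves "&#39;" and longer numbers such as "&#390;" alone).
    $text = preg_replace('/&#39(?![\d;])/', "&#39;", $text);
    // Turn line and paragraph breaks into spaces so that words do not
    // run together once the tags are stripped.
    $text = str_replace(array("<br>", "</p>"), " ", $text);
    $text = strip_tags($text);
    // Convert the entities to ISO-8859-1 characters.
    return trim(html_entity_decode($text, ENT_QUOTES, "ISO-8859-1"));
}
?>
```

Note that html_entity_decode() ignores entities without the trailing ";", which is why the apostrophe is repaired first.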
Cheers,
David.
Hi Bob,
I've just had a look at the feed. I wrote the following test script which simply displays the content "as-is", and the entities are all displayed correctly (at least in my browser):
http://www.magicparser.com/examples/dcothai.php
Here is the source code:
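<?php
// Send the same character set as the feed so accented characters render correctly.
header("Content-Type: text/html;charset=iso-8859-1");
require("MagicParser.php");
// Called once for each item in the feed; prints the fields as-is.
function myRecordHandler($item)
{
  print "<h2>".$item["TITLE"]."</h2>";
  print "<p>".$item["DESCRIPTION"]."</p>";
}
// Parse the saved feed, passing each RSS/CHANNEL/ITEM record to the handler.
MagicParser_parse("dcothai.xml","myRecordHandler","xml|RSS/CHANNEL/ITEM/");
?>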
(I saved the response from your URL as dcothai.xml in the same directory as the script)
If you're not seeing the same results, make sure that the correct character set header is being sent by your script (see the header() line in the above code)...
Cheers,
David.