I would like to be able to parse a bunch of different blogs easily. Can this be done?
Thanks
Hi Dustin,
Because most feeds from blogs are either going to be RSS or Atom format, you can create a script with multiple record handlers (one for each format) and then make sure you call MagicParser_parse() for each feed with the correct record handler and format string. I've just written an example to demonstrate this using the BBC News (RSS) and Mozillazine (Atom) feeds....
Source:
<?php
require("MagicParser.php");
function myRSSRecordHandler($item)
{
print "<h2><a href='".$item["LINK"]."'>".$item["TITLE"]."</a></h2>";
print "<p>".substr($item["DESCRIPTION"],0,100)."</p>";
}
function myAtomRecordHandler($item)
{
print "<h2><a href='".$item["LINK-HREF"]."'>".$item["TITLE"]."</a></h2>";
print "<p>".substr($item["CONTENT"],0,100)."</p>";
}
$url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
$url = "http://www.mozillazine.org/atom.xml";
MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
?>
Hope this helps!
Cheers,
David.
Thanks David... that does the basics of what I'm looking for, but I don't want it to show them in that order; instead, I'd like the feeds mixed together. Can that be done without making a database, etc.?
Thanks
Also, if possible, I would want it ordered by date added.
I'm not sure the best way to do this, but one idea that comes to mind is ...
Make MagicParser parse the feeds into one feed, so as it adds the new records they would be on top... now that I type that, it sounds harder than it should be, and it seems like there should be a better way.
Another thing I want to do is put the results into pages. I saw this done in a past post with a loop, so I should be able to figure that out, but I want to throw that in the mix.
Hi Dustin,
Here's a way to display items in time order (earliest first) without creating your own feed. This loads the items into a master array indexed by the Unix timestamp of each item's date. You can see how this is obtained using the strtotime() function on the value of PUBDATE (for RSS) and CREATED (for Atom). Finally, the master array is sorted using the arsort() function.
Source:
<?php
header("Content-Type: text/html;charset=utf-8");
require("MagicParser.php");
$items = array();
function myRSSRecordHandler($item)
{
global $items;
$temp["url"] = $item["LINK"];
$temp["title"] = $item["TITLE"];
$temp["description"] = $item["DESCRIPTION"];
$time = strtotime($item["PUBDATE"]);
// if two items share a timestamp, nudge forward so neither is overwritten
while(isset($items[$time])) $time++;
$items[$time] = $temp;
}
function myAtomRecordHandler($item)
{
global $items;
$temp["url"] = $item["LINK-HREF"];
$temp["title"] = $item["TITLE"];
$temp["description"] = $item["CONTENT"];
$time = strtotime($item["CREATED"]);
while(isset($items[$time])) $time++;
$items[$time] = $temp;
}
$url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
$url = "http://www.mozillazine.org/atom.xml";
MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
arsort($items);
foreach($items as $time => $item)
{
print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
print "<p>".$item["description"]."</p>";
}
?>
Hope this helps!
Cheers,
David.
Thanks very much David that works great!
I'm having trouble now with the HTML from the Atom feeds. Some of the links are messed up, and partway down the page there is an indentation in the output caused by HTML from the feeds.
Is there something I can do to strip the HTML from the description/content variable?
I am trying to break them up into different pages now, but I am getting some crazy results.
<?php
require("MagicParser.php");
$page = $_GET["page"];
$counter = 0;
$itemsOnPage = "10";
// default to the first page
if (!$page) $page = 1;
$items = array();
function myRSSRecordHandler($item)
{
global $items;
$temp["url"] = $item["LINK"];
$temp["title"] = $item["TITLE"];
$temp["description"] = $item["DESCRIPTION"];
$time = strtotime($item["PUBDATE"]);
while(isset($items[$time])) $time++;
$items[$time] = $temp;
}
function myAtomRecordHandler($item)
{
global $page;
global $counter;
global $itemsOnPage;
$counter++;
// return false whilst parsing items on previous pages
if ($counter < (($page-1)*$itemsOnPage)) return false;
// display the item
global $items;
$temp["url"] = $item["LINK-HREF"];
$temp["title"] = $item["TITLE"];
$temp["description"] = $item["CONTENT"];
$time = strtotime($item["CREATED"]);
while(isset($items[$time])) $time++;
$items[$time] = $temp;
// return true if counter reaches current page * items on each page
return ($counter == ($page*$itemsOnPage));
}
$url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
$url = "http://www.mozillazine.org/atom.xml";
MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
arsort($items);
foreach($items as $time => $item)
{
print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
print "<p>".substr($item["description"],0,0)."</p>";
}
print "<p><a href='?page=".($page+1)."'>Next</a></p>";
?>
Do you see anything offhand that is wrong with how I added your code for breaking the results up into pages? I'm thinking I put it in the wrong spot, but I tried every spot I thought would be logical. Thanks.
David, I just noticed that the output is actually not merged together... you will notice in the example that Mozillazine has all its items, then the BBC has all of its items. Can it be set up so that they are mixed together by date? For example, if Mozillazine had a posting today and the BBC had one yesterday, they should be interleaved, rather than all of Mozillazine's postings appearing first with yesterday's BBC item at the end.
Thanks
Hi Dustin,
HTML can be stripped using PHP's strip_tags() function. I've incorporated this into the code below. The paging code needs to go in the main display loop - not the Atom item handler as you currently have it positioned. I think the ordering should work OK - it's just that with these two example feeds all the BBC items are current (today) whereas the Mozillazine items are all older, which is why they are not appearing to be mixed up.
Have a go with this version:
<?php
require("MagicParser.php");
$page = $_GET["page"];
$counter = 0;
$itemsOnPage = 10;
// default to the first page
if (!$page) $page = 1;
$items = array();
function myRSSRecordHandler($item)
{
global $items;
$temp["url"] = $item["LINK"];
$temp["title"] = $item["TITLE"];
$temp["description"] = strip_tags($item["DESCRIPTION"]);
$time = strtotime($item["PUBDATE"]);
while(isset($items[$time])) $time++;
$items[$time] = $temp;
}
function myAtomRecordHandler($item)
{
global $items;
$temp["url"] = $item["LINK-HREF"];
$temp["title"] = $item["TITLE"];
$temp["description"] = strip_tags($item["CONTENT"]);
$time = strtotime($item["CREATED"]);
while(isset($items[$time])) $time++;
$items[$time] = $temp;
}
$url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
$url = "http://www.mozillazine.org/atom.xml";
MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
arsort($items);
foreach($items as $time => $item)
{
$counter++;
// skip items that belong on previous pages
if ($counter <= (($page-1)*$itemsOnPage)) continue;
// stop once the maximum number of items for this page has been displayed
if ($counter > ($page*$itemsOnPage)) break;
print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
print "<p>".$item["description"]."</p>";
}
// display next link only if there are more items beyond this page
if (count($items) > $page*$itemsOnPage)
{
print "<p><a href='?page=".($page+1)."'>Next</a></p>";
}
?>
Note that I've removed the substr() function from the code that displays the body, so it currently displays the entire (HTML-stripped) text - you might want to put a restriction back in if you do not want to display the full text.
Cheers,
David.
David I have it working to a degree...it seems to have a bug or two.
1) The feeds don't seem to change once they are updated. One of my test blogs posted a new entry yesterday and it was not added. I'm guessing this might have something to do with the array.
2) I'm still not sure what is going on with the order of the posts. I am using two blogs as a test and their dates are around the same, but it still shows one blog first and the other second. Not sure if it is a problem with the blog feeds or not.
Thanks for all your help... I don't think I could have got this far without it.
Hi Dustin,
On second thoughts I think the script should really be using krsort() on the array so that it is ordered by the timestamp value. This will display items in reverse order (newest first). In the latest version, look for:
arsort($items);
..and change this to:
krsort($items);
This should display the latest entry from your Blog first, assuming that the timestamps are being set correctly in the source feed. If you are still not seeing it update it could indicate a caching issue between your server and the server hosting the RSS feed. The first thing to do is verify the feed by browsing to it directly, and then comparing each item with the items displayed by your script (count them to check they are all being displayed). If you are still not seeing it update, let me know and I'll suggest how to get around the caching problem (if that is what is causing it)....
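If it helps to see the difference in isolation, here's a tiny standalone sketch (the timestamps and titles are made up, nothing to do with the feed code itself):

```php
<?php
// Standalone sketch - the timestamps and titles are made up.
// Keys are Unix timestamps; values are the item arrays.
$items = array(
    1210000000 => array("title" => "Older post"),
    1210086400 => array("title" => "Newer post"),
);

// krsort() sorts by KEY in reverse (descending) order, so the
// largest timestamp - the newest item - comes first. arsort()
// would compare the VALUES (arrays here), which is not what we want.
krsort($items);

$titles = array();
foreach ($items as $time => $item) $titles[] = $item["title"];
print implode(", ", $titles); // Newer post, Older post
```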
Cheers,
David.
Hi David,
I'm using MagicParser to parse multiple blogs.
I'm caching with:
<?php
function cacheFetch($url,$age)
{
// directory in which to store cached files
$cacheDir = "cache/";
// cache filename constructed from MD5 hash of URL
$filename = $cacheDir.md5($url);
// default to fetch the file
$fetch = true;
// but if the file exists, don't fetch if it is recent enough
if (file_exists($filename))
{
$fetch = (filemtime($filename) < (time()-$age));
}
// fetch the file if required
if ($fetch)
{
// shell out to wget to fetch the file (escape arguments for the shell)
exec("wget -O ".escapeshellarg($filename)." ".escapeshellarg($url));
// update the file's timestamp to now
exec("touch ".escapeshellarg($filename));
}
// return the cache filename
return $filename;
}
?>
and I'm using your sort-by-time script:
<?php
// header("Content-Type: text/html;charset=utf-8");
require("MagicParser.php");
$records = array();
function myRSSRecordHandler($record)
{
global $records;
$temp["url"] = $record["GUID"];
$temp["title"] = $record["TITLE"];
$temp["description"] = $record["DESCRIPTION"];
$time = strtotime($record["PUBDATE"]);
while(isset($records[$time])) $time++;
$records[$time] = $temp;
}
function myAtomRecordHandler($record)
{
global $records;
$temp["url"] = $record["LINK@4-HREF"];
$temp["title"] = $record["TITLE"];
$temp["description"] = $record["CONTENT"];
$time = strtotime($record["PUBLISHED"]);
while(isset($records[$time])) $time++;
$records[$time] = $temp;
}
//feed => feedburner
$feeds = array('http://feeds.xxx.it/xxx',
'http://www.xxxxx.it/rss/xxxxx.xml',
'http://feeds2.feedburner.com/xxxxxxxxx',
'http://xxxxx.com/xxxx/extra_/feed.xml',
'http://feeds2.feedburner.com/xxxxxxxxxxxxx',
'http://feeds.feedburner.com/xxxxxxxxxxxxxx',
'http://feeds.xxxxxxxxxxxx.it/xxxxxxxxxxx',
'http://feeds2.feedburner.com/xxxxxxxxxx',
'http://rss.feedsportal.com/c/xxxxx/f/xxxx/index.rss',
'http://xxxxxxx.xxx.com/feed/',
'http://www.xxxxx.net/feed/',
'http://feeds.feedburner.com/xxxxxxxx',
'http://feeds.feedburner.com/blogup/xxxxxx',
'http://xxxxxxx.myblog.it/index.rss',
'http://feeds2.feedburner.com/xxxxxxxxxx',
'http://feeds.feedburner.com/xxxxxxxxxxx',
'http://rss.feedsportal.com/c/xx/f/xx/index.rss',
'http://feeds.feedburner.com/xxxxxxxx/',
'http://feeds.blogo.it/xxxxxxx/it',
'http://www.xxxxx.com/feed/',
);
// cache and parse
foreach($feeds as $url)
{ // fetch (if required)
$filename = cacheFetch($url,21600);
// parse
MagicParser_parse($filename,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
}
$feeds = array(
'http://blog1.blogspot.com/feeds/posts/default',
'http://blog2.blogspot.com/feeds/posts/default',
'http://blog3.blogspot.com/feeds/posts/default'
);
// cache and parse
foreach($feeds as $url)
{ // fetch (if required)
$filename = cacheFetch($url,21600);
// parse
MagicParser_parse($filename,"myAtomRecordHandler","xml|FEED/ENTRY/");
}
arsort($records);
foreach($records as $time => $record)
{
print "<h2><a href='".$record["url"]."'>".$record["title"]."</a></h2>";
print "<p>".$record["description"]."</p>";
}
?>
The problem is that the script doesn't execute: there's just a blank page for a very long time.
(There was a 30-second timeout error, but I added set_time_limit().)
Thanks
Hello Stepius,
What you need to do is add some debug code to generate some output so that you can find out where it stops working. This is easily done with a print statement at strategic points throughout the code.
The first thing I would do is to print_r the first record of the first feed to be parsed and then to exit the script; for example:
function myRSSRecordHandler($record)
{
print_r($record);exit();
global $records;
$temp["url"] = $record["GUID"];
$temp["title"] = $record["TITLE"];
$temp["description"] = $record["DESCRIPTION"];
$time = strtotime($record["PUBDATE"]);
while(isset($records[$time])) $time++;
$records[$time] = $temp;
}
function myAtomRecordHandler($record)
{
print_r($record);exit();
global $records;
$temp["url"] = $record["LINK@4-HREF"];
$temp["title"] = $record["TITLE"];
$temp["description"] = $record["CONTENT"];
$time = strtotime($record["PUBLISHED"]);
while(isset($records[$time])) $time++;
$records[$time] = $temp;
}
This will prove that something is actually being read. If nothing appears, the next stage is to eliminate any problem with the caching mechanism, so in each case of:
$filename = cacheFetch($url,21600);
...REPLACE with:
$filename = $url;
...and then run the script again (with the previous debug code still in place)
Let me know how these tests turn out and that should point to the problem and I'll work out the best solution...
Cheers,
David.
Hi David,
I ran the tests.
When I print_r() the first record of the first feed to be parsed and then exit the script,
it works.
When I replace the caching with
$filename = $url;
it works too.
I'm not using a database: do you suggest using one?
(I'm not a PHP expert and I don't know how to avoid duplicate insertions in MySQL...)
Thanks,
Ste
Hi Ste,
Do you mean that the first record _is_ printed out and then the script stops?
(and presumably therefore removing the cache makes no difference?)
It might be worth using SQL, but with the caching it should be very fast after the first request of the period...
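For what it's worth, if you do decide to try MySQL later, the usual way to avoid duplicate insertions is to declare a UNIQUE key on something that identifies the item (the URL or GUID) and then use INSERT IGNORE, which silently skips rows that would violate the key. A rough sketch - the table and column names here are illustrative only, not from any code above:

```sql
-- Rough sketch; table and column names are illustrative only.
CREATE TABLE items (
  url         VARCHAR(255) NOT NULL,
  title       VARCHAR(255) NOT NULL,
  description TEXT,
  published   DATETIME,
  UNIQUE KEY uniq_url (url)
);

-- INSERT IGNORE skips the row if the URL is already present,
-- so re-parsing the same feed cannot create duplicates.
INSERT IGNORE INTO items (url, title, description, published)
VALUES ('http://example.com/post-1', 'Example title', 'Example text', NOW());
```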
Cheers,
David.
Hi Dave,
exactly:
the first record _is_ printed out and then the script stops.
Removing the cache makes no difference.
How can we solve the problem?
If I don't use the sorting system, the script and the caching are OK,
but the post abstracts are not ordered by time.
thanks,
Ste
Hi Ste,
OK, first remove all the debug code added so far; the next step is to study the $records array that is generated. To do this, replace the following line:
arsort($records);
with:
print_r($records);exit();
arsort($records);
..this should display the content of all records and then exit.
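One thing to keep an eye on while you study the output: if any feed item has a missing or unparseable date, strtotime() returns false (on PHP 5.1 and later), and every such item would pile up around the same bogus key. A small guard you could add - just a sketch, the helper name is my own - looks like this:

```php
<?php
// Sketch: fall back when an item has a missing or unparseable date.
// strtotime() returns false on failure (PHP 5.1+), which would make
// all undated items collide on the same key in the $records array.
function itemTimestamp($dateString)
{
    $time = strtotime($dateString);
    // fall back to "now" so the item still appears (near the top);
    // alternatively, return false here and skip the item entirely
    if ($time === false) $time = time();
    return $time;
}
```

Then in each handler you would call $time = itemTimestamp($record["PUBDATE"]); (or PUBLISHED for Atom) instead of calling strtotime() directly.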
Cheers,
David.
Hi Dave,
following your instructions,
I found a feed with no date/time set:
I deleted that feed and now the parser is OK.
Thanks for your invaluable help,
Ste
Okay. I'm going to add some more details since I have played with it a bit.
I would like to be able to parse several different feeds from blogs (mainly), forums, or other resources and display a list with a title, link, and description for each item. I want all the feeds to be combined into one.