

Parsing Blogs

Submitted by dustin on Thu, 2006-11-02 23:50

I would like to be able to parse a bunch of different blogs easily. Can this be done?

Thanks

Submitted by dustin on Fri, 2006-11-03 00:34

Okay. I'm going to add some more details since I have played with it a bit.

I would like to be able to parse several different feeds from blogs (mainly), forums, or other resources and display a list with a linked title and a description. I want all the feeds to be merged into one.

Submitted by support on Fri, 2006-11-03 08:27

Hi Dustin,

Because most feeds from blogs are either going to be RSS or Atom format, you can create a script with multiple record handlers (one for each format) and then make sure you call MagicParser_parse() for each feed with the correct record handler and format string. I've just written an example to demonstrate this using the BBC News (RSS) and Mozillazine (Atom) feeds....


Source:

<?php
  require("MagicParser.php");
  function myRSSRecordHandler($item)
  {
    print "<h2><a href='".$item["LINK"]."'>".$item["TITLE"]."</a></h2>";
    print "<p>".substr($item["DESCRIPTION"],0,100)."</p>";
  }
  function myAtomRecordHandler($item)
  {
    print "<h2><a href='".$item["LINK-HREF"]."'>".$item["TITLE"]."</a></h2>";
    print "<p>".substr($item["CONTENT"],0,100)."</p>";
  }
  $url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
  MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
  $url = "http://www.mozillazine.org/atom.xml";
  MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
?>

Hope this helps!
Cheers,
David.

Submitted by dustin on Sat, 2006-11-04 14:25

Thanks David... that does the basics of what I'm looking for, but I don't want it to show the feeds in that order; instead, I'd like them to be mixed together. Can that be done without creating a database, etc.?

Thanks

Submitted by dustin on Sat, 2006-11-04 14:32

Also, if possible, I would want it ordered by the date added.

I'm not sure of the best way to do this, but one idea that comes to mind is...

Have MagicParser parse the feeds into one feed, so that as it adds new records they would be on top... now that I type that, it sounds harder than it should be, and it seems like there should be a better way.

Submitted by dustin on Sat, 2006-11-04 14:37

Another thing I want to do is have it put the results in pages. I saw this done in a past post with a loop, so I should be able to figure that out, but I want to throw that into the mix.

Submitted by support on Sun, 2006-11-05 10:29

Hi Dustin,

Here's a way to display items in time order (earliest first) without creating your own feed. This loads items into a master array indexed by the time() value of the item date. You can see how this value is derived with the strtotime() function from PUBDATE (for RSS) and CREATED (for Atom). Finally, the master array is sorted using the arsort() function.


Source:

<?php
  header("Content-Type: text/html;charset=utf-8");
  require("MagicParser.php");
  $items = array();
  function myRSSRecordHandler($item)
  {
    global $items;
    $temp["url"] = $item["LINK"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = $item["DESCRIPTION"];
    $time = strtotime($item["PUBDATE"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
  }
  function myAtomRecordHandler($item)
  {
    global $items;
    $temp["url"] = $item["LINK-HREF"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = $item["CONTENT"];
    $time = strtotime($item["CREATED"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
  }
  $url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
  MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
  $url = "http://www.mozillazine.org/atom.xml";
  MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
  arsort($items);
  foreach($items as $time => $item)
  {
    print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
    print "<p>".$item["description"]."</p>";
  }
?>

Hope this helps!
Cheers,
David.

Submitted by dustin on Mon, 2006-11-06 00:54

Thanks very much David that works great!

I am having trouble now with the HTML from the Atom feeds. Some of the links are messed up, and part way down the page there is an indentation in the output caused by HTML from the feeds.

Is there something I can do to strip the HTML from the description/content variable?

Submitted by dustin on Mon, 2006-11-06 02:09

I am trying to break them up into different pages now, but I am getting some crazy results.

<?php
  require("MagicParser.php");
   $page = $_GET["page"];
  $counter = 0;
  $itemsOnPage = "10";
  // default to the first page
  if (!$page) $page = 1;
  $items = array();
  function myRSSRecordHandler($item)
  {
    global $items;
    $temp["url"] = $item["LINK"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = $item["DESCRIPTION"];
    $time = strtotime($item["PUBDATE"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
  }
  function myAtomRecordHandler($item)
  {
  global $page;
    global $counter;
    global $itemsOnPage;
    $counter++;
    // return false whilst parsing items on previous pages
    if ($counter < (($page-1)*$itemsOnPage)) return false;
    // display the item
    global $items;
    $temp["url"] = $item["LINK-HREF"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = $item["CONTENT"];
    $time = strtotime($item["CREATED"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
// return true if counter reaches current page * items on each page
    return ($counter == ($page*$itemsOnPage));
  }
$url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
  MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
$url = "http://www.mozillazine.org/atom.xml";
MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
  arsort($items);
  foreach($items as $time => $item)
  {
    print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
    print "<p>".substr($item["description"],0,0)."</p>";
  }
 print "<p><a href='?page=".($page+1)."'>Next</a></p>";
?>

Do you see anything offhand that is wrong with how I added your code for breaking the results up into pages? I am thinking I put it in the wrong spot, but I tried every spot I thought would be logical. Thanks.

Submitted by dustin on Mon, 2006-11-06 02:27

David, I just noticed that the output is actually not merged together... you will notice in the example that Mozillazine has all of its items listed and then the BBC has all of its items. Can it be set up so that they are mixed together? For example, if Mozillazine had a posting today and the BBC had one yesterday, the two would appear together, rather than all of the Mozillazine postings first with yesterday's BBC posting at the end.

Thanks

Submitted by support on Mon, 2006-11-06 07:31

Hi Dustin,

HTML can be stripped using PHP's strip_tags() function. I've incorporated this into the code below. The paging code needs to go in the main display loop - not the Atom item handler as you currently have it positioned. I think the ordering should work OK - it's just that with these two example feeds all the BBC items are current (today) whereas the Mozillazine items are all older, which is why they are not appearing to be mixed up.

Have a go with this version:

<?php
  require("MagicParser.php");
  $page = $_GET["page"];
  $counter = 0;
  $itemsOnPage = 10;
  // default to the first page
  if (!$page) $page = 1;
  $items = array();
  function myRSSRecordHandler($item)
  {
    global $items;
    $temp["url"] = $item["LINK"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = strip_tags($item["DESCRIPTION"]);
    $time = strtotime($item["PUBDATE"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
  }
  function myAtomRecordHandler($item)
  {
    global $items;
    $temp["url"] = $item["LINK-HREF"];
    $temp["title"] = $item["TITLE"];
    $temp["description"] = strip_tags($item["CONTENT"]);
    $time = strtotime($item["CREATED"]);
    while(isset($items[$time])) $time++;
    $items[$time] = $temp;
  }
  $url = "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml";
  MagicParser_parse($url,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
  $url = "http://www.mozillazine.org/atom.xml";
  MagicParser_parse($url,"myAtomRecordHandler","xml|FEED/ENTRY/");
  arsort($items);
  foreach($items as $time => $item)
  {
    $counter++;
    // don't display while looping through items on previous pages
    if ($counter < (($page-1)*$itemsOnPage)) continue;
    // break if already displayed maximum number of items per page
    if ($counter == ($page*$itemsOnPage)) break;
    print "<h2><a href='".$item["url"]."'>".$item["title"]."</a></h2>";
    print "<p>".$item["description"]."</p>";
  }
  // display next link only if there are more items
  if ($counter < count($items))
  {
    print "<p><a href='?page=".($page+1)."'>Next</a></p>";
  }
?>

Note that I've removed the substr() function from the code that displays the body, so it is currently displaying the entire (HTML-stripped) text - you might want to put a restriction back in if you do not want to display the entire text.
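For example, a minimal way to put such a limit back would be to wrap the strip_tags() call in substr() inside each record handler - the 100-character cut-off below is just an arbitrary example value, not something required by Magic Parser:

    // e.g. in myRSSRecordHandler(): keep only the first 100 characters
    // of the stripped text (the length is an arbitrary example value)
    $temp["description"] = substr(strip_tags($item["DESCRIPTION"]),0,100);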

Cheers,
David.

Submitted by dustin on Wed, 2006-11-08 11:39

David, I have it working to a degree... it seems to have a bug or two.

1) The feeds don't seem to change once a blog is updated. One of my test blogs posted a new entry yesterday and it was not added. I'm guessing this might have something to do with the array.

2) I'm still not sure what is going on with the order of the posts. I am using two blogs as a test and their dates are around the same, but it still lists one blog first and the other second. Not sure if it is a problem with the blog feeds or not.

Thanks for all your help... I don't think I could have got this far without your help.

Submitted by support on Wed, 2006-11-08 12:03

Hi Dustin,

On second thoughts I think the script should really be using krsort() on the array so that it is ordered by the timestamp value. This will display items in reverse order (newest first). In the latest version, look for:

arsort($items);

..and change this to:

krsort($items);

This should display the latest entry from your blog first, assuming that the timestamps are being set correctly in the source feed. If you are still not seeing it update, it could indicate a caching issue between your server and the server hosting the RSS feed. The first thing to do is verify the feed by browsing to it directly, and then comparing each item with the items displayed by your script (count them to check they are all being displayed). If you are still not seeing it update, let me know and I'll suggest how to get around the caching problem (if that is what is causing it)....
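As an aside, here's a minimal standalone sketch (separate from the feed script, with made-up dates) of what the change does: krsort() sorts by the array keys in reverse, so with timestamps as keys the newest item comes out first, whereas arsort() sorts by the values:

<?php
  // standalone example (not part of the feed script above)
  $items = array(
    strtotime("2006-11-06 09:00") => "newest entry",
    strtotime("2006-11-04 17:30") => "oldest entry",
    strtotime("2006-11-05 12:00") => "middle entry"
    );
  // sort by key (the timestamp) in reverse, i.e. newest first
  krsort($items);
  foreach($items as $time => $title)
  {
    print date("D, d M Y H:i",$time)." - ".$title."<br>";
  }
?>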

Cheers,
David.

Submitted by stepius on Wed, 2009-08-26 09:42

Hi David,
I'm using Magic Parser to parse multiple blogs.

I'm caching:

<?php
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
?>

and I'm using your sort-by-time script:

<?php
//  header("Content-Type: text/html;charset=utf-8");
  require("MagicParser.php");
  $records = array();
  function myRSSRecordHandler($record)
  {
    global $records;
    $temp["url"] = $record["GUID"];
    $temp["title"] = $record["TITLE"];
    $temp["description"] = $record["DESCRIPTION"];
    $time = strtotime($record["PUBDATE"]);
    while(isset($records[$time])) $time++;
    $records[$time] = $temp;
  }
  function myAtomRecordHandler($record)
  {
    global $records;
    $temp["url"] = $record["LINK@4-HREF"];
    $temp["title"] = $record["TITLE"];
    $temp["description"] = $record["CONTENT"];
    $time = strtotime($record["PUBLISHED"]);
    while(isset($records[$time])) $time++;
    $records[$time] = $temp;
  }
  // feeds => feedburner
  $feeds = array(
    'http://feeds.xxx.it/xxx',
    'http://www.xxxxx.it/rss/xxxxx.xml',
    'http://feeds2.feedburner.com/xxxxxxxxx',
    'http://xxxxx.com/xxxx/extra_/feed.xml',
    'http://feeds2.feedburner.com/xxxxxxxxxxxxx',
    'http://feeds.feedburner.com/xxxxxxxxxxxxxx',
    'http://feeds.xxxxxxxxxxxx.it/xxxxxxxxxxx',
    'http://feeds2.feedburner.com/xxxxxxxxxx',
    'http://rss.feedsportal.com/c/xxxxx/f/xxxx/index.rss',
    'http://xxxxxxx.xxx.com/feed/',
    'http://www.xxxxx.net/feed/',
    'http://feeds.feedburner.com/xxxxxxxx',
    'http://feeds.feedburner.com/blogup/xxxxxx',
    'http://xxxxxxx.myblog.it/index.rss',
    'http://feeds2.feedburner.com/xxxxxxxxxx',
    'http://feeds.feedburner.com/xxxxxxxxxxx',
    'http://rss.feedsportal.com/c/xx/f/xx/index.rss',
    'http://feeds.feedburner.com/xxxxxxxx/',
    'http://feeds.blogo.it/xxxxxxx/it',
    'http://www.xxxxx.com/feed/',
    );
  // cache and parse
  foreach($feeds as $url)
  {
    // fetch (if required)
    $filename = cacheFetch($url,21600);
    // parse
    MagicParser_parse($filename,"myRSSRecordHandler","xml|RSS/CHANNEL/ITEM/");
  }
  $feeds = array(
    'http://blog1.blogspot.com/feeds/posts/default',
    'http://blog2.blogspot.com/feeds/posts/default',
    'http://blog3.blogspot.com/feeds/posts/default'
    );
  // cache and parse
  foreach($feeds as $url)
  {
    // fetch (if required)
    $filename = cacheFetch($url,21600);
    // parse
    MagicParser_parse($filename,"myAtomRecordHandler","xml|FEED/ENTRY/");
  }
  arsort($records);
  foreach($records as $time => $record)
  {
    print "<h2><a href='".$record["url"]."'>".$record["title"]."</a></h2>";
    print "<p>".$record["description"]."</p>";
  }
?>

The problem is that the script doesn't finish executing: there's just a blank page for a very long time
(there was a 30-second timeout error, but I added set_time_limit).

Thanks

Submitted by support on Wed, 2009-08-26 10:32

Hello Stepius,

What you need to do is add some debug code to generate some output so that you can find out where it stops working. This is easily done with a print statement at strategic points throughout the code.

The first thing I would do is to print_r the first record of the first feed to be parsed and then to exit the script; for example:

  function myRSSRecordHandler($record)
  {
    print_r($record);exit();
    global $records;
    $temp["url"] = $record["GUID"];
    $temp["title"] = $record["TITLE"];
    $temp["description"] = $record["DESCRIPTION"];
    $time = strtotime($record["PUBDATE"]);
    while(isset($records[$time])) $time++;
    $records[$time] = $temp;
  }
  function myAtomRecordHandler($record)
  {
    print_r($record);exit();
    global $records;
    $temp["url"] = $record["LINK@4-HREF"];
    $temp["title"] = $record["TITLE"];
    $temp["description"] = $record["CONTENT"];
    $time = strtotime($record["PUBLISHED"]);
    while(isset($records[$time])) $time++;
    $records[$time] = $temp;
  }

This will prove that something is actually being read. If nothing appears, the next stage would be to eliminate any problem to do with the caching mechanism, so in each case of:

  $filename = cacheFetch($url,21600);

...REPLACE with:

  $filename = $url;

...and then run the script again (with the previous debug code still in place)

Let me know how these tests turn out and that should point to the problem and I'll work out the best solution...

Cheers,
David.

Submitted by stepius on Wed, 2009-08-26 12:18

Hi David,
I ran the test.
When I print_r the first record of the first feed to be parsed and then exit the script,
it works.
When I replace the caching with
$filename = $url;
it works too.

I'm not using a database: do you suggest using one?
(I'm not a PHP expert and I don't know how to avoid duplicate insertions in MySQL...)

Thanks,
Ste

Submitted by support on Wed, 2009-08-26 12:21

Hi Ste,

Do you mean that the first record _is_ printed out and then the script stops?

(and presumably therefore removing the cache makes no difference?)

It might be worth using SQL, but with the caching it should be very fast after the first request of the period...
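If you do try SQL at some point, avoiding duplicate insertions is normally just a matter of a UNIQUE (or primary) key on the item URL together with MySQL's INSERT IGNORE. A rough sketch only - the feed_items table, its columns and the storeRecord() helper are placeholder names rather than anything built into Magic Parser, and it assumes a mysql_connect() connection is already open:

<?php
  // sketch only: assumes a table such as
  //   CREATE TABLE feed_items (
  //     url VARCHAR(255) NOT NULL PRIMARY KEY,
  //     title VARCHAR(255),
  //     description TEXT,
  //     published INT
  //   );
  // INSERT IGNORE silently skips rows whose url already exists
  function storeRecord($temp,$time)
  {
    $sql = "INSERT IGNORE INTO feed_items (url,title,description,published)
            VALUES ('".mysql_real_escape_string($temp["url"])."',
                    '".mysql_real_escape_string($temp["title"])."',
                    '".mysql_real_escape_string($temp["description"])."',
                    ".intval($time).")";
    mysql_query($sql);
  }
?>

Each record handler could then call storeRecord($temp,$time) instead of (or as well as) adding to the $records array.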

Cheers,
David.

Submitted by stepius on Wed, 2009-08-26 13:16

Hi Dave,
exactly:
the first record _is_ printed out and then the script stops.
Removing the cache makes no difference.
How can we solve the problem?
If I don't use the sort system, the script and the caching are OK,
but the post abstracts are not ordered by time.

thanks,
Ste

Submitted by support on Wed, 2009-08-26 13:22

Hi Ste,

OK, first remove all the debug code added so far; the next step is to study the $records array that is generated. To do this, replace the following line:

arsort($records);

with:

print_r($records);exit();
arsort($records);

..this should display the content of all records and then exit.

Cheers,
David.

Submitted by stepius on Wed, 2009-08-26 15:07

Hi Dave,
following your instructions,
I found a feed with no date/time set on its items.
I deleted that feed and now the parser is OK.
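(That would also explain the hang: if strtotime() returns false because there is no parseable date, the while(isset($records[$time])) $time++; line can loop forever, since incrementing false has no effect in PHP. A small guard in each record handler - just a sketch, not part of the scripts above - would skip such items instead:)

    $time = strtotime($record["PUBDATE"]);
    // skip records with a missing or unparseable date rather than looping
    if ($time === false) return;
    while(isset($records[$time])) $time++;
    $records[$time] = $temp;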

Thanks for your precious help,
Ste