Cache xml files

Submitted by womble on Sun, 2006-05-07 16:27

Just wondered if you had any tips on caching XML files that I'm reading from eBay, to reduce the load at their end. I've been using JPcache but it seems a bit temperamental: if the XML file doesn't load, it caches a blank page. There's too much information to put in a database. When JPcache works it's great, as it's really fast, but it's just not reliable.

Submitted by support on Sun, 2006-05-07 17:27

Hi,

This is the sort of technique I've used in the past on applications that needed to cache remote URLs. My preference has always been to shell out to wget for URL retrieval rather than use PHP methods.

Firstly, you need to create a cache directory that the PHP process can write to. The easiest way is to give world write access to a subdirectory of the directory in which the script will be executed, so log in to your server, change (cd) to the same directory as your script, and run the following ("$" is the command prompt):

$ mkdir cache
$ chmod a+w cache

If you do not have shell access to your server you should be able to do this using your FTP client - create a new remote directory, and then choose properties to set the access bits for the new directory.

Then, create a function that returns a cache filename (constructed from the MD5 hash of the URL), re-fetching the file only if the current cached copy is older than an age value (given in seconds):

<?php
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
?>

To combine that with a Magic Parser script, you would then do something like this. Using a value of 86400 for the $age parameter means that the URL is only retrieved once a day.

<?php
  // fetch (if required)
  $filename = cacheFetch("http://www.example.com/feed.xml",86400);
  // parse
  MagicParser_parse($filename,"myRecordHandler");
?>

Hope this helps!

Cheers,
David.

Submitted by womble on Mon, 2006-05-08 21:11

Thanks David, that's really quite effective. I was messing around with PEAR Cache_Lite all yesterday, tried the above today, and much prefer your way.

I seem to have a problem with the feeds intermittently going down, and when that happens it caches a blank page or a 502 error. Any ideas how I could modify the above to retry the fetch if this happens? Maybe by treating a very small file as an error, since a failure would return a very small page. By the way, a second request immediately after the failed one is usually OK.

Thanks.

Submitted by womble on Tue, 2006-05-09 00:05

Do you think this condition at the end would work?

I can't get the XML file to go wrong, so I can't see if it works. Hopefully it echoes "file too small" if the fetch fails, and if that works I should be able to just place the wget bit in there again.

<?php
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // check if size less then 1024
    $size = filesize("$filename");
    if($size < '1024')
    // If less then 1024 run wget again
    echo 'file too small';
    else
    // return the cache filename
    return $filename;
  }
?>

Submitted by support on Tue, 2006-05-09 07:33

Hi,

There's nothing wrong with the logic to test for a small file, but I've just done some tests and something else that may be of use is that wget exits with an exit code of "1" if any kind of HTTP error occurred and the file could not be fetched.

Therefore, you could test for the return value (which can be made available to your script via the exec() command), and retry a set number of times. The following should do the trick, and you might want to combine this with your previous mod to return a text string if there is still no response after the retries:

<?php
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files, must be writable by PHP
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // no error unless a fetch is attempted and wget fails
    $error = 0;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      $maxRetry = 2;
      do
      {
        // shell to wget to fetch the file, storing the exit code in $error
        exec("wget -N -O ".$filename." \"".$url."\"",$output,$error);
        // update timestamp to now
        exec("touch ".$filename);
        // keep trying if wget failed and not reached $maxRetry
      } while( $error && $maxRetry--);
    }
    // return the filename only if wget did not fail
    if (!$error)
    {
      return $filename;
    }
    else
    {
      // as an error occurred, delete the empty file so it is retried next time
      unlink($filename);
      // return false
      return false;
    }
  }
?>

You could then test for the return value of false in the main code before trying to parse the response:

<?php
  if ($filename = cacheFetch("http://www.example.com/feed.xml",86400))
  {
    MagicParser_parse($filename,"myRecordHandler");
  }
  else
  {
    echo 'Temporarily unavailable, please try later.';
  }
?>
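
Another way to avoid ever caching a blank or error page is to download to a temporary file and only swap it into place once the fetch has succeeded. The following is a minimal sketch rather than a drop-in replacement: cacheFetchSafe(), wgetFetcher(), the $fetcher parameter and the ".tmp" suffix are names made up for this example, and it assumes that exec() and wget are available and that serving a stale copy is preferable to serving nothing.

```php
<?php
// Hypothetical default fetcher: shells out to wget, quoting both arguments
// so that characters such as & or spaces in the URL cannot break the command.
function wgetFetcher($url, $dest)
{
    exec("wget -q -O ".escapeshellarg($dest)." ".escapeshellarg($url), $output, $exitCode);
    return $exitCode === 0;
}

// Sketch: only replace the cached copy when a fetch succeeds and is non-empty.
function cacheFetchSafe($url, $age, $cacheDir = "cache/", $fetcher = "wgetFetcher", $maxTries = 3)
{
    $filename = $cacheDir.md5($url);
    // use the cached copy if it is recent enough
    if (file_exists($filename) && filemtime($filename) >= (time() - $age)) {
        return $filename;
    }
    // download to a temporary name so a failed fetch never overwrites good data
    $tmp = $filename.".tmp";
    for ($try = 0; $try < $maxTries; $try++) {
        $ok = call_user_func($fetcher, $url, $tmp);
        clearstatcache(); // make sure filesize() sees the new file
        if ($ok && file_exists($tmp) && filesize($tmp) > 0) {
            rename($tmp, $filename); // swap the new copy into place
            return $filename;
        }
    }
    // every attempt failed: remove any partial download
    if (file_exists($tmp)) {
        unlink($tmp);
    }
    // fall back to a stale copy if one exists, otherwise report failure
    return file_exists($filename) ? $filename : false;
}
```

The calling code stays the same as above: test the return value for false before passing the filename to MagicParser_parse(). Because a stale copy keeps its old timestamp, the next request will try the fetch again.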

Submitted by womble on Tue, 2006-05-09 16:55

Thanks, top stuff!

Submitted by crounauer on Wed, 2007-03-21 13:31

Hi David,

I am trying to cache this page. There are two elements to it, the first to parse info for title tags etc. and the second for the actual info displayed. Is the code set up correctly?

Thanks in advance,
Simon.

<?php
// Page Cache
if (ENABLE_CACHE == 'true') {
function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }

function lrcountry($country)
  {
    global $title_continent;
    $title_continent = $country["REGION_DESCRIPTION"];
    global $keywords_continent;
    $keywords_continent = $country["REGION_DESCRIPTION"];
    global $description_continent;
    $description_continent = $country["REGION_DESCRIPTION"];
  }
if (ENABLE_CACHE == 'true') {
$url = cacheFetch("http://xmlfeexxxxxxxxxxxxxxrtype=1&rid=".trim($_GET["continent_code"])."",".$cache_value.");
} else {
$url = "http://xmlfeed.lxxxxxxxxxxxxype=1&rid=".trim($_GET["continent_code"])."";
}
MagicParser_parse($url,"lrcountry","xml|REGION_LIST/REGION/");
code here...
function myRecordHandler1($country)
  {
    print "<img src='".$base_url."/images/arrow.gif' alt='' border='0' />&nbsp;<a href='".$base_url."/country/".trim($country["REGION_CODE"])."_".clean_url($country["REGION_DESCRIPTION"]).".html' title='".$country["REGION_DESCRIPTION"]."' target='_parent'>".trim($country["REGION_DESCRIPTION"])."</a><br />";
  }
if (ENABLE_CACHE == 'true') {
$url = cacheFetch("http://xmlfxxxxxxxxxxxxx&rid=".trim($_GET["continent_code"])."",".$cache_value.");
} else {
$url = "http://xmlxxxxxxxxxxxid=".trim($_GET["continent_code"])."";
}
MagicParser_parse($url,"myRecordHandler1","xml|REGION_LIST/REGION/");
?>

Submitted by support on Wed, 2007-03-21 13:49

Hi Simon,

A couple of points:

- I would take the cacheFetch function out of the if statement and just include it normally; it will only be called if required.

- I think the test should be:

if (ENABLE_CACHE == TRUE)

At the moment you are using

if (ENABLE_CACHE == 'true')

...which is testing for the string literal "true", so I'm not sure that's going to work. Other than this, it looks fine!
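
Which comparison is safe depends on how the constant was defined, because PHP's loose == treats any non-empty string other than "0" as true, including the string "false". A minimal sketch of one way to sidestep that, assuming the setting might arrive either as a boolean or as a string (cacheEnabled() is a hypothetical helper, not part of Magic Parser):

```php
<?php
// Hypothetical helper: interpret a config value as a strict on/off flag.
function cacheEnabled($value)
{
    // real booleans pass straight through
    if (is_bool($value)) {
        return $value;
    }
    // otherwise accept only an explicit list of "on" spellings
    return in_array(strtolower(trim((string)$value)), array("true", "1", "on", "yes"), true);
}

var_dump("false" == TRUE);       // bool(true): the loose comparison is misleading
var_dump(cacheEnabled("false")); // bool(false)
var_dump(cacheEnabled(true));    // bool(true)
```

With that in place the test becomes if (cacheEnabled(ENABLE_CACHE)), which behaves the same whether the constant was defined as true or as 'true'.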

Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 11:45

I have tried this and it didn't work; it works without the cache function though.
Can you check what I have done wrong? Thanks

<?php
  require("php/MagicParser.php");
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "/cache";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
  function myRecordHandler($record)
  {
    // This is where you write your code to process each record, such as loading a database
    // You can display the record contents using PHP's internal print_r() function:
    print_r($record); ?>
<table width="100%" border="1">
  <tr>
    <td style="color:#FF0000"><?php print $record["TAG"];?></td>
  </tr>
  <tr>
    <td style="color:#00FF00"><?php print $record["NAME"];?></td>
  </tr>
  <tr>
    <td style="color:#0000FF"><?php print $record["COUNT"];?></td>
  </tr>
  <tr>
    <td style="color:#FFFF00"><?php print $record["URL"];?></td>
  </tr>
</table>
<?php }
// fetch (if required)
  $filename = cacheFetch("http://ws.audioscrobbler.com/1.0/artist/Metallica/toptags.xml",86400);
  // parse
  MagicParser_parse($filename,"myRecordHandler","xml|TOPTAGS/TAG/");
?>
</body>
</html>

Submitted by support on Tue, 2007-10-30 12:08

Hi,

It is almost certainly this line that is incorrect:

    $cacheDir = "/cache";

This would imply a cache directory in the top level directory of your server, which is unlikely. It should normally be:

    $cacheDir = "cache/";

...which implies a cache directory in the same folder as the script. However, it is important that this folder is writable by PHP. The easiest way to do this is to make the directory "world" writable. You should be able to do this with your FTP program. Try right-clicking on the cache folder in the remote window, and then looking for either "Permissions..." or "Properties...", and see if there is an option to set the attributes for Owner/Group/World. Make sure that "World" has write access and it should work fine, provided that wget is available for use on your server.

I've been looking into your discogs.com problem, and the reason this is not working via fopen() is that the server requires the client to send the gzip accept header (Accept-Encoding: gzip), which PHP does not do when the zlib library has not been built in.

wget is able to send this header, so once you have the cache mechanism working you should be able to access the discogs.com feed in this way rather than trying to open the feed directly. If it still doesn't work (and I suspect there may be another issue with double-gzipped data) let me know and we'll look into other ways to perform the gzip deflate...
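
If the cached file does turn out to be gzipped (possibly twice), one option is to unwrap it in PHP before parsing. This is only a sketch under assumptions: it needs the zlib extension and gzdecode() (PHP 5.4 or later), and gunzipLayers() is a name invented for this example. It detects gzip by its two magic bytes rather than trusting any header.

```php
<?php
// Sketch: strip up to $maxLayers layers of gzip, detected by the
// gzip magic bytes 0x1f 0x8b at the start of the data.
function gunzipLayers($data, $maxLayers = 3)
{
    for ($i = 0; $i < $maxLayers; $i++) {
        if (strlen($data) < 2 || substr($data, 0, 2) !== "\x1f\x8b") {
            break; // not gzip (any more)
        }
        $decoded = @gzdecode($data);
        if ($decoded === false) {
            break; // corrupt or truncated stream: keep what we have
        }
        $data = $decoded;
    }
    return $data;
}
```

The result could be written back over the cache file with file_put_contents() before the filename is handed to MagicParser_parse().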

Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 13:19

Hi,
I previously had the cache directory in the root folder.
I have changed it to the same directory as the script:

/php/cache/
/php/MagicParser.php

It is still not working though. According to the examples that I've sent you,
do I have to use the cache script for discogs.com, and NOT use the cache script for http://ws.audioscrobbler.com?

Here's the entire code for discogs.com (with the cache). I really appreciate your help.

<?php
  require("php/MagicParser.php");
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
  function myRecordHandler($record)
  {
    // This is where you write your code to process each record, such as loading a database
    // You can display the record contents using PHP's internal print_r() function:
    print_r($record); ?>
    <?php print $record["RELEASE"];?>
    <?php print $record["RELEASE-ID"];?>
    <?php print $record["RELEASE-STATUS"];?>
    <?php print $record["RELEASE-TYPE"];?>
    <?php print $record["TITLE"];?>
    <?php print $record["FORMAT"];?>
    <?php print $record["LABEL"];?>
<?php }
// fetch (if required)
  $filename = cacheFetch("http://www.discogs.com/artist/radiohead?f=xml&api_key=XXXXXXXXXX",86400);
  // parse
  MagicParser_parse($filename,"myRecordHandler","xml|RESP/ARTIST/RELEASES/RELEASE/");
?>

Submitted by support on Tue, 2007-10-30 13:27

Hi,

If these are the directories you have setup:

php/cache/
php/MagicParser.php

Then the following line is still not correct:

    $cacheDir = "cache/";

Based on the above, I think this should be:

    $cacheDir = "php/cache/";

You can check that it is writable with the following script:

testcache.php:

<?php
  $cacheDir = "php/cache/";
  if (is_writable($cacheDir))
  {
    print "Yes!";
  }
  else
  {
    print "No - Check Permissions";
  }
?>

Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 14:23

Gosh! How could I get the cache path wrong?
Sorry, my PHP skills are a bit limited. I'm more of a designer.

I have amended this, and the testcache script verified that the cache directory is writable. It is still not working. :(

Submitted by support on Tue, 2007-10-30 14:49

Hi,

Can you firstly check in the cache/ directory and see whether it is empty or not?

Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 15:38

By the way, I have noticed that there are 2 generated files: one in the first cache directory that I created in the root folder (in the first attempt)

and another in /php/cache/944fda2a54533d0e17858ff30bd44db6

Probably the cache directory doesn't have to be in the same directory as the script.

Submitted by support on Tue, 2007-10-30 15:54

Hi,

The cache directory doesn't have to be in the same directory as the script - all that matters is that the value of $cacheDir is correct. If you change back to that value (which looks like it is working), test again and see if you can open a file from the cache...
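
A relative value such as "cache/" is resolved against PHP's current working directory, which is usually the directory of the script that was requested but can differ when a file is included from elsewhere or run from the command line. A minimal sketch that anchors the path to a known file instead (cacheDirFor() is a hypothetical helper and "php/cache" is just the layout from this thread):

```php
<?php
// Hypothetical helper: build an absolute cache path from a script's own directory.
function cacheDirFor($scriptFile, $subdir = "cache")
{
    return dirname($scriptFile)."/".trim($subdir, "/")."/";
}

// absolute path to php/cache/ next to this script, with a trailing slash
$cacheDir = cacheDirFor(__FILE__, "php/cache");
if (!is_dir($cacheDir) || !is_writable($cacheDir)) {
    print "Cache directory not usable: ".$cacheDir;
}
```

The value of $cacheDir could then be used inside cacheFetch() in place of the hard-coded string.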

Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 17:21

I have tried again, and no luck. :(
Does this script normally work with any feed?

My testing file is exactly like this. I think my PHP version is 4.4.x (if that helps).

<?php
  require("php/MagicParser.php");
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "php/cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
  function myRecordHandler($record)
  {
    // This is where you write your code to process each record, such as loading a database
    // You can display the record contents using PHP's internal print_r() function:
    print_r($record); ?>
    <?php print $record["RELEASE"];?>
    <?php print $record["RELEASE-ID"];?>
    <?php print $record["RELEASE-STATUS"];?>
    <?php print $record["RELEASE-TYPE"];?>
    <?php print $record["TITLE"];?>
    <?php print $record["FORMAT"];?>
    <?php print $record["LABEL"];?>
<?php }
// fetch (if required)
  $filename = cacheFetch("http://www.discogs.com/artist/radiohead?f=xml&api_key=XXXXXXXXXX",86400);
  // parse
  MagicParser_parse($filename,"myRecordHandler","xml|RESP/ARTIST/RELEASES/RELEASE/");
?>

Submitted by support on Tue, 2007-10-30 17:55

Hi,

In order to work this out, we need to break the problem down, as there are 3 elements to what you are trying to do here.

Firstly, there is the call to MagicParser_parse(), which works fine with the XML ultimately returned by discogs.com.

Secondly, there is making the caching mechanism work, which depends on wget being installed on your server. I think this is the case, because you have seen cache files downloaded into the cache directory.

Thirdly, there is the issue with the way discogs.com returns the data: I think it is "double-gzipped", that is, it is serving gzipped data that is gzipped again, and you will need to overcome that somehow.

You can see Magic Parser parsing the end result at the following demo URL:

http://www.magicparser.com/demo?fileID=47276EF370CE7&record=1

What we need to do first is add some simple debug code to see how far the script is getting. The first thing I would suggest is simply printing out the cache filename to see if that part of your script is working. To do this, try the following code in your script:

<?php
  require("php/MagicParser.php");
  function cacheFetch($url,$age)
  {
    // directory in which to store cached files
    $cacheDir = "php/cache/";
    // cache filename constructed from MD5 hash of URL
    $filename = $cacheDir.md5($url);
    // default to fetch the file
    $fetch = true;
    // but if the file exists, don't fetch if it is recent enough
    if (file_exists($filename))
    {
      $fetch = (filemtime($filename) < (time()-$age));
    }
    // fetch the file if required
    if ($fetch)
    {
      // shell to wget to fetch the file
      exec("wget -N -O ".$filename." \"".$url."\"");
      // update timestamp to now
      exec("touch ".$filename);
    }
    // return the cache filename
    return $filename;
  }
  function myRecordHandler($record)
  {
    // This is where you write your code to process each record, such as loading a database
    // You can display the record contents using PHP's internal print_r() function:
    print_r($record); ?>
    <?php print $record["RELEASE"];?>
    <?php print $record["RELEASE-ID"];?>
    <?php print $record["RELEASE-STATUS"];?>
    <?php print $record["RELEASE-TYPE"];?>
    <?php print $record["TITLE"];?>
    <?php print $record["FORMAT"];?>
    <?php print $record["LABEL"];?>
<?php }
    // fetch (if required)
  $filename = cacheFetch("http://www.discogs.com/artist/radiohead?f=xml&api_key=XXXXXXXXXX",86400);
  // parse
  print "Cache Filename: ".$filename;exit();
  MagicParser_parse($filename,"myRecordHandler","xml|RESP/ARTIST/RELEASES/RELEASE/");
?>

This will print out the cache filename and then exit. The next stage in debugging is to have a look for that file in the cache directory, and view the contents to verify that it contains the XML you are expecting to receive from discogs.com...
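
If the cache file is there but empty, it can also help to see what wget itself reported. A small sketch, assuming exec() is permitted on the server (runCommand() is a name made up for this example); it merges stderr into the captured output, since that is where wget writes its messages:

```php
<?php
// Hypothetical debugging helper: run a shell command and return its exit code
// together with everything it printed (stderr is merged into stdout).
function runCommand($command)
{
    $output = array();
    exec($command." 2>&1", $output, $exitCode);
    return array($exitCode, implode("\n", $output));
}

// Example use (commented out so that nothing is fetched just by loading this file):
// list($code, $text) = runCommand("wget -O ".escapeshellarg("php/cache/test.xml")
//                                 ." ".escapeshellarg("http://www.example.com/feed.xml"));
// print "Exit code: ".$code."\n".$text;
```

A non-zero exit code, or a message such as a permissions or "command not found" error in the text, would point at the server setup rather than at the feed.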

Hope this helps,
Cheers,
David.

Submitted by tinotriste on Tue, 2007-10-30 23:46

Hi David,

It is printing out the cache file name with no problems:

Cache Filename: php/cache/944fda2a54533d0e17858ff30bd44db6

what shall we do next?

Many thanks
Tino

Submitted by support on Wed, 2007-10-31 09:29

Hi Tino,

The next step is to look at the file manually. Firstly, confirm that it exists; more importantly, check the contents: does it look like the XML that you are trying to obtain? You may be able to view the file through your web browser, but it might be easier to download it through your FTP client and look at it on your local computer...

Cheers,
David.

Submitted by tinotriste on Wed, 2007-10-31 12:40

Morning David,

I had actually checked the file before, there's nothing in it, it's just a blank file.
Should the file have a file extension?

Thanks
Tino

Submitted by support on Wed, 2007-10-31 12:54

Hi Tino,

No - there are no extensions on the cache filenames (they are just an MD5 hash value).

I think this may indicate that wget is not installed on your server, which means that this particular method of caching is not going to work. You can test this with the following script:

<?php
  if (file_exists("/usr/bin/wget"))
  {
    print "WGET Exists";
  }
  else
  {
    print "WGET Not Found";
  }
?>
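
The script above only looks in /usr/bin, and wget may be installed elsewhere. As an alternative sketch, assuming a Unix-like server where exec() is not disabled, you could ask the shell whether the command is on the PATH that PHP sees (commandExists() is a hypothetical helper):

```php
<?php
// Sketch: ask the shell whether a command is on the PATH (Unix-like servers only).
function commandExists($name)
{
    $output = array();
    // "command -v" prints the resolved path and exits with 0 when the command is found
    exec("command -v ".escapeshellarg($name)." 2>/dev/null", $output, $exitCode);
    return $exitCode === 0 && count($output) > 0;
}

print commandExists("wget") ? "wget found" : "wget not found";
```

If this reports that wget is missing even though it is installed, the PATH available to the web server's PHP process may differ from the one in your login shell.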

Cheers,
David.

Submitted by tinotriste on Wed, 2007-10-31 14:31

wget is installed on my server.
If there's any other way to make this work, even without caching, that is fine by me.

Thanks
Tino

Submitted by support on Wed, 2007-10-31 14:46

Hi,

There are a couple of things to try.

First, could you try the cache code with another feed, for example the BBC News RSS feed at:

http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml

As before, look at the cache filename, and then study the file to view the XML.

Next, if you have command line access (via Telnet or SSH to your server), it would be worth logging in and trying wget manually with the discogs feed, for example:

$ wget -O discogs.xml "http://www.discogs.com/artist/radiohead?f=xml&api_key=XXXXXXXXXX"

(where $ is your command prompt)

This will help figure out what's going on on your server...

Cheers,
David.

Submitted by tinotriste on Thu, 2007-11-01 13:15

It didn't work with the BBC feed either, mate.
It may be something to do with my server configuration.
Never mind, I have a few feeds that work without the cache, so I don't think that I'll be using the cache feature for now.

Anyway, thanks for your help!
I really, really appreciate your effort and will definitely refer your script to other people.

Tino

Submitted by tinotriste on Thu, 2007-11-15 15:49

I have found a php cache system that seems to work ok:

http://quickcache.codeworxtech.com

I hope this helps someone. :-)