

Looping Through A 33MB XML file

Submitted by tekneck on Thu, 2012-02-09 20:48

I am having trouble with a large XML file. I am using your refresh loop, iterating through 25 records at a time. By the time it gets to around record 600, the delay is unbearable: it starts off quickly and then slowly grinds to a halt. I pulled all my logic and SQL out of the script, just called the parser function, and let it keep refreshing... sure enough, it still slows down even though I am not doing anything with the data (funny, since it is usually my own scripts that are ultimately to blame when things go wrong).

Anyway, I can give you access to the XML file to see if you can explain why it bogs down as it loops and walks through the XML. I have tried both your UTF-8 and normal scripts...

It will eventually finish, but it can take hours even on a robust web server... I use the script to parse dozens of other feed sources without issue (none of those files are 33MB though; the average is 5MB). It is only this large file from this one source. Below is the bare-bones script I used to test whether the lag was my code and DB manipulation, or just the parsing in general.

Thank you.

<?php
  require("MagicParser.php");

  error_reporting(E_ALL);
  set_time_limit(3000);
  ini_set('html_errors', 0);
  ini_set('implicit_flush', 0);

  $slowUpdateBlock = 25; // records to process per page load
  $slowUpdateSleep = 1;  // seconds to wait between refreshes
  $slowStart = (isset($_GET["slowStart"]) ? $_GET["slowStart"] : 1);
  $slowCount = 0;
  $finished = false;

  function myRecordHandler($record)
  {
    global $slowUpdateBlock;
    global $slowCount;
    global $slowStart;
    $slowCount++;
    // skip records already handled by previous iterations
    if ($slowCount < $slowStart) return;
    if (!$record) { print MagicParser_getErrorMessage(); }
    echo '<pre>Record Number From Feed: '.$record['LISTING-NUM'].'</pre>';
    // returning TRUE tells the parser to stop at the end of this block
    if ($slowCount == (($slowStart + $slowUpdateBlock) - 1)) return TRUE;
  }

  /* Contact me for this URL, which points to a +/-33MB file */
  $xml = file_get_contents("http://www.example.com/feedfile.xml");
  MagicParser_parse("string://".$xml, "myRecordHandler", "xml|IDF/DEALERS/DEALER/LISTINGS/LISTING/");
  print MagicParser_getErrorMessage();

  // fewer records than a full block means we reached the end of the feed
  if ($slowCount < (($slowStart + $slowUpdateBlock) - 1)) $finished = TRUE;
  if (!$finished)
  {
    $refresh = "test.php?slowStart=".($slowStart + $slowUpdateBlock);
    print "<meta http-equiv='refresh' content='".$slowUpdateSleep.";url=".$refresh."' />";
  }
?>

Submitted by support on Fri, 2012-02-10 09:21

Hi tekneck,

Please could you drop me an email with the actual feed URL and I'll check it out on my test server for you...

Thanks,
David
--
MagicParser.com

Submitted by bayuobie on Mon, 2012-02-20 14:54

Interesting... I have the same problem. Many of my clients have feeds of around 50MB, and it gets extremely difficult doing this. Anyway, I never knew you had this 'refresh loop' code. Can you give me a link to it? Also, what do I do for these clients with large XML files? Thanks

Submitted by support on Mon, 2012-02-20 15:20

Hi bayuobie,

The above method of overcoming server time-outs (where a script cannot run indefinitely due to hosting platform limitations, e.g. a maximum execution time that you have no control over) can become impractical with very large feeds, because on each iteration the parser must still loop through all records prior to the point of continuation.
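To see why this grinds to a halt: every refresh has to re-parse all of the records it skips before reaching its new block, so the total work grows quadratically with the number of records. A rough sketch of the cost (Python for illustration; `records_touched` is a made-up helper, and the 15,000-record / 25-per-block figures are assumed numbers, not taken from the actual feed):

```python
def records_touched(total_records, block):
    """Total records the parser must visit across all refresh iterations,
    when each iteration re-parses from the start up to the end of its block."""
    touched = 0
    start = 1
    while start <= total_records:
        # this pass re-parses every record up to the end of its block
        touched += min(start + block - 1, total_records)
        start += block
    return touched

print(records_touched(15000, 25))  # -> 4507500, vs only 15000 records in the feed
```

So a hypothetical 15,000-record feed in blocks of 25 means roughly 4.5 million record visits in total, which matches the "starts fast, then slows to a crawl" behaviour described above.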

However, I have in development a new method of achieving this that you would be welcome to try out. It lets you read the parser's current offset and then jump directly to that point in the XML source on the next iteration, so it has none of the overhead of the above technique. I'll email it to you to try out, together with a template calling script to demonstrate usage...
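For anyone curious, the offset-resume idea can be sketched as follows (Python for illustration; `parse_block` and the regex record scan are hypothetical stand-ins and not MagicParser's API — a regex is not a safe general XML parser, it just keeps the sketch short):

```python
import re

def parse_block(path, offset, block_size):
    """Read up to block_size <LISTING> records starting at a byte offset.

    Returns (records, next_offset). The next run seeks straight to
    next_offset instead of re-parsing everything before it, so every
    iteration costs the same no matter how deep into the file it is.
    """
    with open(path, "rb") as f:
        f.seek(offset)  # jump directly to the continuation point
        data = f.read()
    records = []
    for m in re.finditer(rb"<LISTING>.*?</LISTING>", data, re.S):
        records.append(m.group(0))
        if len(records) == block_size:
            # byte offset just past the last record we handled
            return records, offset + m.end()
    return records, offset + len(data)
```

Each page load would call this with the offset carried in the query string, exactly as `$slowStart` is carried in the script above, and stop once it gets back fewer records than a full block.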

Cheers,
David.