

Big File Performance

Submitted by cogden on Thu, 2006-10-05 02:17

Well, I'm starting to get magicparser to hum!

By specifying the format string "xml|f:full/u:user/", the routine now flies through the 767 users, parsing and inserting/updating the database.

However, once the users are done, the system hits mud. It leaves myRecordHandler($record) for the last time, never to return to my script, and pegs the twin processors at 99% each before they settle down to 75% each. The server is a dual 2.5GHz machine running OS X 10.4.8 with 3.5GB of RAM, and there is plenty of free RAM (2.6GB).

I tried it without Zend Studio Debugger in case it was getting in the way, but same results.

The users are the first records of the file. Is it then parsing the rest of the file line by line?

(I've now read through almost every post in the forums and saw where you commented "With a file the size that you are talking about it would really help to use a format string in the optional 3rd parameter to the parse function.". I'm already doing that, so I'd have thought it would skip the other data.)

Is there a practical limit on the size of the XML file that it can parse (this file, for instance, is 3.7MB)?

thanks for your incredibly speedy responses!

Submitted by cogden on Thu, 2006-10-05 02:40

FWIW, I added a lone user record (copy/pasted/tweaked from one of the 767 above) to the very end of the file. It doesn't appear that it is ever read/processed either. Do XML sibling records have to be grouped together?

Submitted by cogden on Thu, 2006-10-05 02:55

For grins, I set the httpd process to the highest priority and it runs at 98%.

As well, the resource limits when running the above were:

max_execution_time = 30 ; Maximum execution time of each script, in seconds
max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 80M ; Maximum amount of memory a script may consume

I've tried it with 10 times the values and it doesn't seem to have made a difference.

max_execution_time = 3000 ; Maximum execution time of each script, in seconds
max_input_time = 6000 ; Maximum amount of time each script may spend parsing request data
memory_limit = 150M ; Maximum amount of memory a script may consume

In any case, I never got any PHP time-out error messages.
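For what it's worth, the same limits can also be raised from inside the script itself rather than in php.ini, which avoids restarting the web server. This is a minimal sketch using the standard set_time_limit() and ini_get()/ini_set() functions; the values simply mirror the increased php.ini settings above:

```php
<?php
// Raise the execution-time and memory limits for this run only,
// mirroring the increased php.ini values above.
set_time_limit(3000);            // max_execution_time, in seconds
ini_set("memory_limit", "150M"); // memory_limit

// Confirm the new values took effect.
echo ini_get("max_execution_time"), "\n";
echo ini_get("memory_limit"), "\n";
?>
```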

Submitted by support on Thu, 2006-10-05 06:20

Hi,

There shouldn't be a problem with a file that size - in fact, Magic Parser is most often used in affiliate marketing applications against feeds running into several tens of megabytes.

The "lone" record pasted onto the end of the file should have been read if it was in the same place within the hierarchy as the other records of its type. If it was pasted directly onto the end of the file (i.e. outside of the closing document element tag) then it would be ignored.

The first thing I would ask you to try is to run the file through a null record handler - in other words just comment out all your code within myRecordHandler() and see if that makes a difference to the end of script lock-up.

Secondly, can you confirm that there is no code after the call to MagicParser_parse() that could cause the lock-up? If there is, I would add a debug print statement on the line after the call to MagicParser_parse() in order to confirm that it is indeed MagicParser_parse() that is not returning.
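Something along these lines - a minimal sketch that combines both tests, reusing the filename and format string from your earlier post; the empty handler body is the point of the first test:

```php
<?php
require("MagicParser.php");

// 1. Null record handler: does nothing per record, so if the
//    lock-up still happens it cannot be your own handler code.
function myRecordHandler($record) {
    // intentionally empty
}

MagicParser_parse("jan2002_dec2007.xml", "myRecordHandler", "xml|f:full/u:user/");

// 2. Debug print on the very next line: if this never appears in
//    the output, MagicParser_parse() itself is not returning.
print "MagicParser_parse() returned!";
?>
```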

Hope this helps!
Cheers,
David.

Submitted by cogden on Thu, 2006-10-05 14:21

I stripped the code completely back down to:

<?php
require("MagicParser.php");

function myRecordHandler($record) {
    global $colhdrs;
    if ($colhdrs) {
        echo "<tr>";
        foreach ($record as $key => $value) {
            echo "<th>".$key."</th>";
        }
        echo "</tr>";
        $colhdrs = false;
    }
    echo "<tr>";
    foreach ($record as $key => $value) {
        echo "<td>".htmlentities($value)."&nbsp;</td>";
    }
    echo "</tr>";
}

$filename = "jan2002_dec2007.xml";
$format_string = "xml|f:full/u:user/"; // to get just the user info
// Watch out not to specify an empty string as the filename. PHP seems to
// try to read from stdin, which may end up in a script timeout that is
// not trivial to track down.
if (!$format_string) {
    echo "<p>".MagicParser_getErrorMessage()."</p>";
    exit;
} else {
    echo "<p><strong>Format String:</strong> ".$format_string."</p>";
}
$colhdrs = true;
echo "<table border='1'>";
$parseerror = MagicParser_parse($filename, "myRecordHandler", $format_string);
echo "</table>";
echo 'done!';
?>

and it exhibits the same behaviors. The majority of the lines are displayed almost instantly, then the processors max out at 100%.

I thought maybe the assignment
$parseerror = MagicParser_parse($filename,"myRecordHandler",$format_string );

could be the problem. Removing it did nothing.

I tried a superset of the XML file and it displayed 2,225 users (vs 750 of the previous) instantly and then stopped.

I then made it as barebones as I can, to no avail:

<?php
require("MagicParser.php");

function myRecordHandler($record) {
    echo "<tr>";
    foreach ($record as $key => $value) {
        echo "<td>".htmlentities($value)."&nbsp;</td>";
    }
    echo "</tr>";
}

$filename = "jan2002_dec2007.xml";
$format_string = "xml|f:full/u:user/"; // to get just the user info
echo "<table border='1'>";
MagicParser_parse($filename, "myRecordHandler", $format_string);
echo "</table>";
echo 'done!';
?>

Submitted by cogden on Thu, 2006-10-05 14:22

Would you like me to email the XML file? (I emailed you yesterday - did it come through or was it trapped in a spam filter?)

I have a client meeting in 4 hours and am starting to despair!