

memory exhausted to continuously parse multiple xml files

Submitted by lang2000 on Tue, 2008-04-29 17:46

Hi David:

I am trying to parse 30 XML files (each between 10KB and 4MB) from the same folder in one go on the server. Most of the files in the folder parse fine, but a few always fail with the following error message:

Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 6260901 bytes) in
/home/xxxx/public_html/feed/ldfparser.php on line 71

any idea how to solve the problem please?

Thanks a lot.

Regards
Lin

Submitted by support on Tue, 2008-04-29 21:31

Hi Lin,

The error indicates that it is triggered on line 71 of your script - can you post that line so that I can see what might be causing it? (Most text editors show the line number in the status bar.) If you're not sure, feel free to email me the exact script that is causing this error and I'll check it out for you (replying to your reg code or forum registration email is the easiest way to reach me)...

Cheers,
David.

Submitted by lang2000 on Wed, 2008-04-30 11:13

Hi David:

You can see the error on :

http://www.jowjow.co.uk/feed/events_ldf.php

I think I know why this error occurred. When I looked at the XML file to be parsed:

http://www.jowjow.co.uk/feed/uploads/Dennis_Publishing_Feed_02042008/Dennis_MUSEUM_EVENT_20080402.xml

...it actually indicates there is an error in this XML file, although I am not clear what the error is (any chance you can figure out what's wrong with this XML file?).

This XML file should have the same structure as the following file, which Magic Parser handles without any problem:

http://www.jowjow.co.uk/feed/uploads/Dennis_Publishing_Feed_02042008/Dennis_ART_EVENT_20080402.xml

You can see the results of parsing that correctly structured XML file at:

http://www.jowjow.co.uk/feed/events_ldf_art.php

Because I am trying to parse several XML files in one go, is there any way to detect a badly formatted XML file, print an error message saying that file could not be parsed, and then skip on to the next file?

The php file that I use to parse the XML file is:

<?php
  require_once('Connections/ldfxml.php');
  require("MagicParser.php");
  mysql_select_db("linj1601_ldfxml") or die(mysql_error());

  // global array to hold the titles
  $titles = array();

  // convert a dd/mm/yyyy date into yyyy-mm-dd for MySQL
  function changeDate($title_date_origin) {
    $title_date_array = explode("/", $title_date_origin);
    krsort($title_date_array);
    return implode("-", $title_date_array);
  }

  // record handler to build the above array from the XML
  function myTitleRecordHandler($title)
  {
    global $titles;
    $titles[] = $title;
    //print_r($title);
  }

  // global array to hold venues and a mapping array to associate
  // titles with venues
  $venues = array();
  $title2venue = array();

  // record handler to build the above arrays from the XML
  function myVenueRecordHandler($venue)
  {
    // create array of venues
    global $venues;
    $venues[$venue["VENUE-VENUE_ID"]] = $venue["VENUE-VENUE_NAME"];
    // now create a mapping array to match venues with titles
    // to do this, we go through the entire record looking for
    // TITLE_ID fields (they will be differentiated with @1, @2..
    // but this can be ignored for now
    global $title2venue;
    foreach($venue as $k => $v)
    {
      if (strpos($k,"TITLE_ID"))
      {
        $title2venue[$venue[$k]] = $venue["VENUE-VENUE_ID"];
      }
    }
  }

  // load the XML into a variable so that we don't hit the remote server twice!
  $xml = "";
  $filename = "uploads/Dennis_Publishing_Feed_02042008/Dennis_ART_EVENT_20080402.xml";
  $fp = fopen($filename,"r");
  if ($fp)
  {
    while(!feof($fp)) $xml .= fread($fp,1024);
    fclose($fp);
  }
  else
  {
    print "Error opening ".$filename;
    exit();
  }
  print "Bytes Received: ".strlen($xml);

  // first parse to load all titles into the global array $titles
  MagicParser_parse("string://".$xml,"myTitleRecordHandler","xml|LISTINGS/POI/VENUE/TITLES/TITLE/");

  // second parse to generate title > venue mapping array
  MagicParser_parse("string://".$xml,"myVenueRecordHandler","xml|LISTINGS/POI/VENUE/");

  // finally we can handle the $titles array using foreach() exactly as the array would have been
  // handled within myRecordHandler, using $title to access the XML elements.
  // the code below shows how to extract the multiple events by using a counter
  // and looking for the way Magic Parser has resolved the duplicate names using @1, @2, etc..
  foreach($titles as $title)
  {
    $title_start_date = changeDate($title["PERFORMANCE/START_DATE"]);
    $title_end_date = changeDate($title["PERFORMANCE/END_DATE"]);

    /* $title_start_date_origin = $title["PERFORMANCE/START_DATE"];
    $title_start_date_array = explode("/", $title_start_date_origin);
    krsort($title_start_date_array);
    $title_start_date = implode("-",$title_start_date_array); */

    $sql =
    "REPLACE INTO event
    (
    event_id,
    event_title,
    event_venue_id,
    event_description,
    event_start_date,
    event_end_date
    )
    VALUES
    (
    '".mysql_real_escape_string($title["TITLE-TITLE_ID"])."',
    '".mysql_real_escape_string($title["TITLE-TITLE_NAME"])."',
    '".mysql_real_escape_string($title2venue[$title["TITLE-TITLE_ID"]])."',
    '".mysql_real_escape_string($title["PERFORMANCE/PERFORMANCE_DESCRIPTION"])."',
    '".mysql_real_escape_string($title_start_date)."',
    '".mysql_real_escape_string($title_end_date)."'
    )
    ";
    if (!mysql_query($sql))
    {
      // SQL failed, print error message and abort
      print mysql_error();
      exit();
    }
    print "<br/>".$sql;
    print "<h2>".$title["TITLE-TITLE_NAME"]."<br/>".$title["TITLE-TITLE_ID"]."</h2>";
    print "<h3>Venue:".$venues[$title2venue[$title["TITLE-TITLE_ID"]]]."<br/>Venue ID: ".$title2venue[$title["TITLE-TITLE_ID"]]."</h3>";
    print "<blockquote>";
    print "<h4>Performances: ".$title["PERFORMANCE/PERFORMANCE_DESCRIPTION"]."</h4>";
    print "<h4>Start Date: ".$title["PERFORMANCE/START_DATE"]."</h4>";
    print "<h4>End Date: ".$title["PERFORMANCE/END_DATE"]."</h4>";
    print "<ul>";
    $postfix = "";
    $i = 0;
    while(1) {
      if ($i) $postfix = "@".$i;
      if (!$title["EVENTS/EVENT".$postfix."-EVENT_ID"]) break;
      $event_id = $title["EVENTS/EVENT".$postfix."-EVENT_ID"];
      $event_start_date = $title["EVENTS/EVENT".$postfix."-EVENT_START_DATE"];
      $event_end_date = $title["EVENTS/EVENT".$postfix."-EVENT_END_DATE"];
      $event_start_time = $title["EVENTS/EVENT".$postfix."-EVENT_START_TIME"];
      print "<li>".$event_start_date." at ".$event_start_time."</li>";
      $i++;
    }
    print "</ul>";
    print "</blockquote>";
  }
?>

Thanks
Lin

Submitted by support on Wed, 2008-04-30 12:20

Hello Lin,

The tricky part is that it is not really possible to tell that XML is badly formatted until it has been parsed to the point at which the error occurs, by which time the memory limit will already have been exceeded. What is probably happening is that the corrupted XML is causing the parser to build up a very long string (effectively an extremely long value in one of the fields) because a closing tag is missing; so I'll look at the script and consider options for putting a "stop" in for you that would abandon the parse if the size of a single field exceeds a certain amount.

However, looking at the error message, the memory allocation actually fails on the following line (71) of the main script, not within MagicParser.php, so it may be that the entire XML has been read without causing a memory error, and it is this code that then takes the script over the memory limit when it comes to sorting:

  krsort($title_start_date_array);

...but I notice that this is now commented out. Did commenting out this section remove the error?

One option I think would be to check $title for validity before attempting to process / sort...
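Another option, just as a rough sketch (this uses PHP's built-in expat functions rather than Magic Parser itself, and the folder name is only an example), would be to pre-check each file for well-formedness and simply skip any file that fails, which would also cover your requirement of reporting the bad file and moving on to the next one:

<?php
  // sketch only: check a file for XML well-formedness with PHP's built-in
  // expat functions before handing it to Magic Parser; print a message and
  // skip the file if the check fails
  function isWellFormedXML($filename)
  {
    $parser = xml_parser_create();
    $ok = xml_parse($parser,file_get_contents($filename),true);
    if (!$ok)
    {
      print "Skipping ".$filename.": ".
        xml_error_string(xml_get_error_code($parser)).
        " at line ".xml_get_current_line_number($parser)."<br/>";
    }
    xml_parser_free($parser);
    return $ok;
  }

  // example usage over a folder of feeds
  foreach(glob("uploads/*.xml") as $filename)
  {
    if (!isWellFormedXML($filename)) continue;
    // ...parse $filename with MagicParser_parse() as normal...
  }
?>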

Cheers,
David.

Submitted by lang2000 on Wed, 2008-04-30 14:29

Hi David:

I have removed all the code I wrote (sorting the array, inserting into the database, etc.) and left only the code you previously suggested in:

http://www.magicparser.com/node/745
(I should have posted this into http://www.magicparser.com/node/745, as I think the two threads are related.)

The XML to parse:

http://www.jowjow.co.uk/feed/uploads/Dennis_Publishing_Feed_02042008/Dennis_MUSEUM_EVENT_20080402.xml
It can be downloaded at http://www.jowjow.co.uk/feed/Dennis_MUSEUM_EVENT_20080402.xml.zip

The result of parsing:

http://www.jowjow.co.uk/feed/events_ldf_jowjow.php

It still shows the memory problem.

The code for events_ldf_jowjow.php is:

<?php
  require("MagicParser.php");

  // global array to hold the titles
  $titles = array();

  // record handler to build the above array from the XML
  function myTitleRecordHandler($title)
  {
    global $titles;
    $titles[] = $title;
  }

  // global array to hold venues and a mapping array to associate
  // titles with venues
  $venues = array();
  $title2venue = array();

  // record handler to build the above arrays from the XML
  function myVenueRecordHandler($venue)
  {
    // create array of venues
    global $venues;
    $venues[$venue["VENUE-VENUE_ID"]] = $venue["VENUE-VENUE_NAME"];
    // now create a mapping array to match venues with titles
    // to do this, we go through the entire record looking for
    // TITLE_ID fields (they will be differentiated with @1, @2..
    // but this can be ignored for now
    global $title2venue;
    foreach($venue as $k => $v)
    {
      if (strpos($k,"TITLE_ID"))
      {
        $title2venue[$venue[$k]] = $venue["VENUE-VENUE_ID"];
      }
    }
  }

  // load the XML into a variable so that we don't hit the remote server twice!
  $xml = "";
  $url = "http://www.jowjow.co.uk/feed/uploads/Dennis_Publishing_Feed_02042008/Dennis_MUSEUM_EVENT_20080402.xml";
  $fp = fopen($url,"r");
  while(!feof($fp)) $xml .= fread($fp,1024);
  fclose($fp);
  print "Bytes Received: ".strlen($xml);

  // first parse to load all titles into the global array $titles
  MagicParser_parse("string://".$xml,"myTitleRecordHandler","xml|LISTINGS/POI/VENUE/TITLES/TITLE/");
  print_r($venues);

  // second parse to generate title > venue mapping array
  MagicParser_parse("string://".$xml,"myVenueRecordHandler","xml|LISTINGS/POI/VENUE/");

  // finally we can handle the $titles array using foreach() exactly as the array would have been
  // handled within myRecordHandler, using $title to access the XML elements.
  // the code below shows how to extract the multiple events by using a counter
  // and looking for the way Magic Parser has resolved the duplicate names using @1, @2, etc..
  foreach($titles as $title)
  {
    print "<h2>".$title["TITLE-TITLE_NAME"]."</h2>";
    print "<h3>Venue:".$venues[$title2venue[$title["TITLE-TITLE_ID"]]]."</h3>";
    print "<blockquote>";
    print "<h4>Performances</h4>";
    print "<ul>";
    $postfix = "";
    $i = 0;
    while(1) {
      if ($i) $postfix = "@".$i;
      if (!$title["EVENTS/EVENT".$postfix."-EVENT_ID"]) break;
      $event_id = $title["EVENTS/EVENT".$postfix."-EVENT_ID"];
      $event_start_date = $title["EVENTS/EVENT".$postfix."-EVENT_START_DATE"];
      $event_end_date = $title["EVENTS/EVENT".$postfix."-EVENT_END_DATE"];
      $event_start_time = $title["EVENTS/EVENT".$postfix."-EVENT_START_TIME"];
      print "<li>".$event_start_date." at ".$event_start_time."</li>";
      $i++;
    }
    print "</ul>";
    print "</blockquote>";
  }
?>

Thanks

Lin

Submitted by support on Wed, 2008-04-30 15:46

Hello Lin,

The XML actually looks fine - I think the problem is simply the sheer number of EVENT records within that particular feed; as you are not (currently) parsing at the EVENT level, they are all being added to the result array, which in turn exceeds the maximum allowed memory limit on your server.

The first thing to try is to see if you are allowed to increase the memory limit for your scripts using the following code at the very top:

  ini_set("memory_limit","128M");
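
A quick way to confirm whether the override has actually taken effect (some hosts prevent scripts from changing it) is to print the value back, together with the current usage, immediately after the ini_set() call, for example:

  print "memory_limit now: ".ini_get("memory_limit");
  print ", current usage: ".memory_get_usage()." bytes";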

If that doesn't make any difference, it is always worth a quick word with your hosting company to see if they are happy to increase the memory limit on your server; although ultimately I think a different approach would be required.

I'll study the XML and see what alternative strategy would work on this size feed with your 32M memory limitation...
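
One such alternative, as a sketch only (untested against your feed), would be to handle each TITLE record directly inside the record handler instead of collecting them all into the global $titles array first, so that only one record is held in memory at a time; the venue parse would simply need to run first so the mapping already exists:

<?php
  // sketch only: process each TITLE as it is parsed rather than building up
  // a global $titles array; assumes $xml, $venues, $title2venue and
  // myVenueRecordHandler() exist exactly as in the script above
  function myTitleRecordHandler($title)
  {
    global $venues, $title2venue;
    print "<h2>".$title["TITLE-TITLE_NAME"]."</h2>";
    print "<h3>Venue:".$venues[$title2venue[$title["TITLE-TITLE_ID"]]]."</h3>";
    // ...any database INSERT / REPLACE for this title would also go here...
  }

  // run the venue parse first so the title > venue mapping already exists
  MagicParser_parse("string://".$xml,"myVenueRecordHandler","xml|LISTINGS/POI/VENUE/");
  MagicParser_parse("string://".$xml,"myTitleRecordHandler","xml|LISTINGS/POI/VENUE/TITLES/TITLE/");
?>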

Cheers,
David.

Submitted by lang2000 on Wed, 2008-04-30 16:01

Hi David:

What if I give up parsing the EVENT level and just keep the VENUE and TITLES levels? Would that help?

I have tried commenting out this code:

<?php
/*  $postfix = "";
    $i = 0;
    while(1) {
      if ($i) $postfix = "@".$i;
      if (!$title["EVENTS/EVENT".$postfix."-EVENT_ID"]) break;
      $event_id = $title["EVENTS/EVENT".$postfix."-EVENT_ID"];
      $event_start_date = $title["EVENTS/EVENT".$postfix."-EVENT_START_DATE"];
      $event_end_date = $title["EVENTS/EVENT".$postfix."-EVENT_END_DATE"];
      $event_start_time = $title["EVENTS/EVENT".$postfix."-EVENT_START_TIME"];
      print "<li>".$event_start_date." at ".$event_start_time."</li>";
      $i++;
    }*/
?>

And I have inserted the following code at the top of the PHP file:

<?php
  ini_set("memory_limit","128M");
?>

Still no luck as you can see:

http://www.jowjow.co.uk/feed/events_ldf_jowjow.php

Thanks

Lin

Submitted by support on Wed, 2008-04-30 16:44

Hello Lin,

Unfortunately, when you try to parse at a higher level, everything below that level is included in the parse; hence memory is exhausted when every event is loaded into the record. I'll investigate whether it would be feasible to modify Magic Parser to return only elements at the current level (ignoring duplicate child records).
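
If you do still need the event data itself, one possible workaround (a sketch only; the parse path and field names below are my assumption based on the EVENTS/EVENT... keys shown in your script) would be a third parse directly at the EVENT level, so that each event arrives as its own small record rather than being embedded inside its parent TITLE record:

<?php
  // sketch only: parse directly at the EVENT level so each event is passed
  // to the handler individually; path and field names are assumptions based
  // on the keys seen in the script above
  function myEventRecordHandler($event)
  {
    print "<li>".$event["EVENT-EVENT_START_DATE"]." at ".$event["EVENT-EVENT_START_TIME"]."</li>";
  }
  MagicParser_parse("string://".$xml,"myEventRecordHandler","xml|LISTINGS/POI/VENUE/TITLES/TITLE/EVENTS/EVENT/");
?>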

In the meantime, if as a result of this you decide that Magic Parser is not suitable for your application, please let me know and I will of course refund your purchase...

Cheers,
David.