Processing different format of XML feeds

Submitted by Amy on Mon, 2011-09-05 07:44

Hi,

I'm looking to parse XML files that come in several different formats. These files can be very large.

My code needs to:
1. Go through the list of files in the directory
2. Look at the Message/category node to see what type of XML file it is
3. Process the file and insert entries into the DB
4. Move processed files to specific directories based on the category and sport (values in the XML)

Below is a snippet of what I've started to try. I'm trying to minimize the number of times I parse each file while still using values from it, and would like to get the most out of the MagicParser_parse function.

Thanks in advance for your help. I have submitted 2 of the files for your service so you can see examples there.
Amy

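<?php
  require("MagicParser.php");

  $statsFolder = "files/";

  // Collect the names of the feed files, skipping sub-directories
  $files = array();
  if ($handle = opendir($statsFolder)) {
    while (false !== ($file = readdir($handle))) {
      // is_dir() needs the folder prefix, otherwise it tests relative to the current directory
      if ("." != $file && ".." != $file && "old" != $file && "stats_used" != $file && !is_dir($statsFolder.$file))
        $files[] = $file;
    }
    closedir($handle);
  }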
  
  // Process at most $fileMax files per run
  $fileMax = 100;
  foreach($files as $file)
  {
    if (0 == $fileMax)
      break;
    $fileMax--;

    $file = preg_split('/\s+/', trim($file));
    $filename = end($file);
    if ("" == $filename || "old" == $filename)
      continue;

    //$result = MagicParser_parse("14194510.xml","myRecordHandler");
    $message = MagicParser_parse($statsFolder.$filename,"MainHandler","xml|MESSAGE/");

    // if there is an error, print to log and move file to stats_failed dir
    if (!$message)
    {
      print $filename." failed parse.  Error msg: ".MagicParser_getErrorMessage();
      $dest_path = $statsFolder."stats_failed/";
      if (!file_exists($dest_path)) {
        mkdir($dest_path, 0775, true);
      }
      rename($statsFolder.$filename, $dest_path.$filename);
    }
    else
    {
      // TODO: figure out how to get $sport and $category from parse
      $sport = '';
      $category = '';
      $dest_path = $statsFolder."stats_used/".$sport."/".$category."/";
      if (!file_exists($dest_path)) {
        mkdir($dest_path, 0775, true);
      }
      rename($statsFolder.$filename, $dest_path.$filename);
    }
  }
  function MainHandler($record)
  {
    //print_r($record);
    $file_id = $record['XML_FILE_ID'];
    echo $file_id.'|||';
    $heading = $record['HEADING'];
    echo $heading.'|||';
    $sport = $record['SPORT'];
    echo $sport.'|||';
    $category = $record['CATEGORY'];
    echo $category.'|||';

    // only care about specific categories
    // (handle_stats, handle_scores and handle_miscellaneous are not part of this snippet)
    switch ($category) {
      case "Statistics":
        handle_stats($record, $sport);
        break;
      case "Poll":
      case "News":
      case "Injuries":
      case "Weather":
      case "Odds":
      case "Minor Scores":
        break;
      case "Scores":
        handle_scores($record, $sport);
        break;
      case "Miscellaneous":
        handle_miscellaneous($record, $sport);
        break;
      default:
        break;
    }
  }
?>

Submitted by support on Mon, 2011-09-05 08:58

Hi Amy,

Regarding your comment:

// TODO: figure out how to get $sport and $category from parse

All you would need to do is make $sport and $category global within your MainHandler function, e.g.

  function MainHandler($record)
  {
    global $category;
    global $sport;

...and then of course there is no need for these lines, as the variables will have been populated by the parse:

      $sport = '';
      $category = '';
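
To illustrate why the global declarations matter, here is a minimal sketch that runs without Magic Parser. The function demoParse() is a hypothetical stand-in for MagicParser_parse() that simply calls the handler once per record, and the record values ('NFL', 'Scores') are made-up sample data, not taken from your feeds.

```php
<?php
// Hypothetical stand-in for MagicParser_parse(): calls the named handler once per record
function demoParse(array $records, $handler)
{
  foreach ($records as $record) {
    call_user_func($handler, $record);
  }
  return true;
}

function MainHandler($record)
{
  // Without these declarations the assignments below would only set local variables
  global $category;
  global $sport;
  $sport = $record['SPORT'];
  $category = $record['CATEGORY'];
}

// Made-up sample record standing in for a parsed <MESSAGE> element
demoParse(array(array('SPORT' => 'NFL', 'CATEGORY' => 'Scores')), 'MainHandler');

// The values set inside the handler are now visible in the global scope
echo $GLOBALS['sport']."/".$GLOBALS['category']."\n";
```

In your script the calling code runs at the top level, so after the parse returns you can use $sport and $category directly when building $dest_path.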

However, as you're parsing at the top level (MESSAGE/), the whole document is treated as a single record, so there may be a memory issue if, as you say, your files are very large. It may therefore be more appropriate to simply read the first 1K of the file into a string, use a regular expression to extract $category and $sport, and then parse using the appropriate lower-level format string as normal. Consider the following example, based on a single file "data.xml"; this could be incorporated into your original code that scans a directory to process each file in turn:

<?php
  require("MagicParser.php");

  function myPlayerListingRecordHandler($record)
  {
    // Handle <PlayerListing> elements here
  }

  function myGameRecordHandler($record)
  {
    // Handle <game> elements here
  }

  $filename = "data.xml";

  // Read only the first 1K of the file; <category> and <sport> are expected near the top
  $fp = fopen($filename,"r");
  $header = fread($fp,1024);
  fclose($fp);

  // Fall back to an empty string if a tag is not found in the header
  $category = preg_match('/<category>(.*)<\/category>/',$header,$matches) ? $matches[1] : "";
  $sport = preg_match('/<sport>(.*)<\/sport>/',$header,$matches) ? $matches[1] : "";

  // Parse at the record level appropriate to the category
  switch($category)
  {
    case "Scores":
      MagicParser_parse($filename,"myGameRecordHandler","xml|MESSAGE/GAME/");
      break;
    case "Statistics":
      MagicParser_parse($filename,"myPlayerListingRecordHandler","xml|MESSAGE/LISTING/PLAYERLISTING/");
      break;
  }
?>
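
If it helps when folding this into the directory loop, the header check can be wrapped in a small function and combined with the destination path from the original script. This is only a sketch under a few assumptions: sniffFeedHeader() and destinationFor() are hypothetical helper names, the sample feed written to the temporary file is invented for the demonstration, and the regular expressions use a non-greedy (.*?) so that a match stops at the first closing tag.

```php
<?php
// Read the first $bytes of a file and pull out <category> and <sport>.
// Returns array(category, sport); either value is null if its tag is not found.
function sniffFeedHeader($filename, $bytes = 1024)
{
  $fp = fopen($filename, "r");
  if (!$fp) {
    return array(null, null);
  }
  $header = fread($fp, $bytes);
  fclose($fp);
  $category = preg_match('/<category>(.*?)<\/category>/', $header, $m) ? $m[1] : null;
  $sport = preg_match('/<sport>(.*?)<\/sport>/', $header, $m) ? $m[1] : null;
  return array($category, $sport);
}

// Build the stats_used/<sport>/<category>/ destination used in the original script
function destinationFor($statsFolder, $sport, $category)
{
  return $statsFolder."stats_used/".$sport."/".$category."/";
}

// Demonstration with a made-up feed written to a temporary file
$tmp = tempnam(sys_get_temp_dir(), "feed");
file_put_contents($tmp, "<message><category>Scores</category><sport>NFL</sport><game/></message>");
list($category, $sport) = sniffFeedHeader($tmp);
$dest_path = destinationFor("files/", $sport, $category);
unlink($tmp);
echo $dest_path."\n";
```

Inside the foreach loop you would call sniffFeedHeader($statsFolder.$filename) before the switch, and treat a null category or sport as a failed file to be moved to stats_failed.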

Hope this helps!
Cheers,
David.