XPath granular parsing (eg, specify the depth of which to return)

Submitted by cogden on Wed, 2006-10-04 20:53

I have to parse relatively large XML files into a variety of MySQL database tables. Thanks to the great examples on this site, I have categories of data working well from a passed-in node downward. However, I can't figure out how to extract just a given layer of data (i.e., start at level x and stop at level y). For example, given:

<?xml version="1.0" encoding="UTF-8"?>
<f:full xmlns:f="xxxxxx/Full" site-code="foo" created-date="2006-08-08T16:03:04.538-07:00">
<f:date-range start="2006-07-31T00:00:00.000-07:00" end="2006-07-31T23:59:59.999-07:00" />
<u:user xmlns:u="xxxxxx/User" membership-no="CME523" email-address="tsamaloukas@onkologe.me.uunet.de">
<u:user-name system-name="antonis" display-name="Antonis Tsamaloukas,MD,Ph.D." />
<u:address foo="chi" />
<u:demographics />
</u:user>
<crdcat:credit-category xmlns:crdcat="xxxxxx/CreditCategory" id="3" agency="yyyyyyy" category-name="blah blah." />
<crdcat:credit-category xmlns:crdcat="xxxxxx/CreditCategory" id="99" agency="Not Currently Accredited" category-name="N/A" />
<crs:course xmlns:crs="xxxxxx/Course" id="zzzz;ONC-11-7-718" title="xxxxxxxxxx" expiration-date="9999-12-31T00:00:00.000-08:00" publication-date="2006-07-31T00:00:00.000-07:00" in-production="true" is-linkable="true" is-expired="false">
<crs:course-credit credit-category-id="3" max-credits="1.0" expiration-date="9999-12-31T00:00:00.000-08:00" />
<crs:parent-node id="foo_node;3" title="Breast Cancer" canonical="true" />
<crs:article resource-id="fee;11/7/718" title="xxxxxxxx" doi="10.1634/fee.11-7-718" sort-num="1" />
<crs:quiz id="foo_quiz;AMP-11-7-718" title="xxxxxxx" scoring-style="some_correct_basic" threshold="70" sort-num="2" num-questions="6">
<crs:question id="1" text="q1 text here" sort-num="1">
<crs:question-type id="1" name="single_select" />
<crs:answer id="a" text="yyyyyyy1" is-correct="false" sort-num="1">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
<crs:answer id="b" text="yyyyyy2" is-correct="true" sort-num="2">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
<crs:answer id="c" text="yyyyyy3" is-correct="false" sort-num="3">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
<crs:answer id="d" text="yyyyyy4" is-correct="false" sort-num="4">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
</crs:question>
<crs:question id="2" text="q2 text here" sort-num="2">
<crs:question-type id="1" name="single_select" />
<crs:answer id="a" text="wwwwww1" is-correct="false" sort-num="1">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
<crs:answer id="b" text="wwwwww2" is-correct="true" sort-num="2">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
<crs:answer id="c" text="wwwwww3" is-correct="false" sort-num="3">
<crs:answer-type id="1" name="multiple_choice" />
</crs:answer>
</crs:question>

I just want to return the data down to the question text:
...q1 text here
...q2 text here
(ie, CRS:QUIZ CRS:QUIZ-ID CRS:QUIZ-TITLE CRS:QUIZ-SCORING-STYLE CRS:QUIZ-THRESHOLD CRS:QUIZ-SORT-NUM CRS:QUIZ-NUM-QUESTIONS CRS:QUESTION CRS:QUESTION-ID CRS:QUESTION-TEXT)

However, Magic Parser keeps going and brings back all of the answers too.

(ie, CRS:QUESTION-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE CRS:QUESTION/CRS:QUESTION-TYPE-ID CRS:QUESTION/CRS:QUESTION-TYPE-NAME CRS:QUESTION/CRS:ANSWER CRS:QUESTION/CRS:ANSWER-ID CRS:QUESTION/CRS:ANSWER-TEXT CRS:QUESTION/CRS:ANSWER-IS-CORRECT CRS:QUESTION/CRS:ANSWER-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE-NAME CRS:QUESTION/CRS:ANSWER@1 CRS:QUESTION/CRS:ANSWER@1-ID CRS:QUESTION/CRS:ANSWER@1-TEXT CRS:QUESTION/CRS:ANSWER@1-IS-CORRECT CRS:QUESTION/CRS:ANSWER@1-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@1 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@1-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@1-NAME CRS:QUESTION/CRS:ANSWER@2 CRS:QUESTION/CRS:ANSWER@2-ID CRS:QUESTION/CRS:ANSWER@2-TEXT CRS:QUESTION/CRS:ANSWER@2-IS-CORRECT CRS:QUESTION/CRS:ANSWER@2-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@2 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@2-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@2-NAME CRS:QUESTION/CRS:ANSWER@3 CRS:QUESTION/CRS:ANSWER@3-ID CRS:QUESTION/CRS:ANSWER@3-TEXT CRS:QUESTION/CRS:ANSWER@3-IS-CORRECT CRS:QUESTION/CRS:ANSWER@3-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@3 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@3-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@3-NAME CRS:QUESTION@1 CRS:QUESTION@1-ID CRS:QUESTION@1-TEXT CRS:QUESTION@1-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE@1 CRS:QUESTION/CRS:QUESTION-TYPE@1-ID CRS:QUESTION/CRS:QUESTION-TYPE@1-NAME CRS:QUESTION/CRS:ANSWER@4 CRS:QUESTION/CRS:ANSWER@4-ID CRS:QUESTION/CRS:ANSWER@4-TEXT CRS:QUESTION/CRS:ANSWER@4-IS-CORRECT CRS:QUESTION/CRS:ANSWER@4-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@4 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@4-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@4-NAME CRS:QUESTION/CRS:ANSWER@5 CRS:QUESTION/CRS:ANSWER@5-ID CRS:QUESTION/CRS:ANSWER@5-TEXT CRS:QUESTION/CRS:ANSWER@5-IS-CORRECT CRS:QUESTION/CRS:ANSWER@5-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@5 
CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@5-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@5-NAME CRS:QUESTION/CRS:ANSWER@6 CRS:QUESTION/CRS:ANSWER@6-ID CRS:QUESTION/CRS:ANSWER@6-TEXT CRS:QUESTION/CRS:ANSWER@6-IS-CORRECT CRS:QUESTION/CRS:ANSWER@6-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@6 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@6-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@6-NAME CRS:QUESTION@2 CRS:QUESTION@2-ID CRS:QUESTION@2-TEXT CRS:QUESTION@2-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE@2 CRS:QUESTION/CRS:QUESTION-TYPE@2-ID CRS:QUESTION/CRS:QUESTION-TYPE@2-NAME CRS:QUESTION/CRS:ANSWER@7 CRS:QUESTION/CRS:ANSWER@7-ID CRS:QUESTION/CRS:ANSWER@7-TEXT CRS:QUESTION/CRS:ANSWER@7-IS-CORRECT CRS:QUESTION/CRS:ANSWER@7-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@7 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@7-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@7-NAME CRS:QUESTION/CRS:ANSWER@8 CRS:QUESTION/CRS:ANSWER@8-ID CRS:QUESTION/CRS:ANSWER@8-TEXT CRS:QUESTION/CRS:ANSWER@8-IS-CORRECT CRS:QUESTION/CRS:ANSWER@8-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@8 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@8-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@8-NAME CRS:QUESTION/CRS:ANSWER@9 CRS:QUESTION/CRS:ANSWER@9-ID CRS:QUESTION/CRS:ANSWER@9-TEXT CRS:QUESTION/CRS:ANSWER@9-IS-CORRECT CRS:QUESTION/CRS:ANSWER@9-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@9 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@9-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@9-NAME CRS:QUESTION/CRS:ANSWER@10 CRS:QUESTION/CRS:ANSWER@10-ID CRS:QUESTION/CRS:ANSWER@10-TEXT CRS:QUESTION/CRS:ANSWER@10-IS-CORRECT CRS:QUESTION/CRS:ANSWER@10-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@10 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@10-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@10-NAME CRS:QUESTION@3 CRS:QUESTION@3-ID CRS:QUESTION@3-TEXT CRS:QUESTION@3-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE@3 CRS:QUESTION/CRS:QUESTION-TYPE@3-ID CRS:QUESTION/CRS:QUESTION-TYPE@3-NAME CRS:QUESTION/CRS:ANSWER@11 CRS:QUESTION/CRS:ANSWER@11-ID 
CRS:QUESTION/CRS:ANSWER@11-TEXT CRS:QUESTION/CRS:ANSWER@11-IS-CORRECT CRS:QUESTION/CRS:ANSWER@11-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@11 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@11-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@11-NAME CRS:QUESTION/CRS:ANSWER@12 CRS:QUESTION/CRS:ANSWER@12-ID CRS:QUESTION/CRS:ANSWER@12-TEXT CRS:QUESTION/CRS:ANSWER@12-IS-CORRECT CRS:QUESTION/CRS:ANSWER@12-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@12 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@12-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@12-NAME CRS:QUESTION/CRS:ANSWER@13 CRS:QUESTION/CRS:ANSWER@13-ID CRS:QUESTION/CRS:ANSWER@13-TEXT CRS:QUESTION/CRS:ANSWER@13-IS-CORRECT CRS:QUESTION/CRS:ANSWER@13-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@13 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@13-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@13-NAME CRS:QUESTION@4 CRS:QUESTION@4-ID CRS:QUESTION@4-TEXT CRS:QUESTION@4-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE@4 CRS:QUESTION/CRS:QUESTION-TYPE@4-ID CRS:QUESTION/CRS:QUESTION-TYPE@4-NAME CRS:QUESTION/CRS:ANSWER@14 CRS:QUESTION/CRS:ANSWER@14-ID CRS:QUESTION/CRS:ANSWER@14-TEXT CRS:QUESTION/CRS:ANSWER@14-IS-CORRECT CRS:QUESTION/CRS:ANSWER@14-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@14 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@14-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@14-NAME CRS:QUESTION/CRS:ANSWER@15 CRS:QUESTION/CRS:ANSWER@15-ID CRS:QUESTION/CRS:ANSWER@15-TEXT CRS:QUESTION/CRS:ANSWER@15-IS-CORRECT CRS:QUESTION/CRS:ANSWER@15-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@15 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@15-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@15-NAME CRS:QUESTION/CRS:ANSWER@16 CRS:QUESTION/CRS:ANSWER@16-ID CRS:QUESTION/CRS:ANSWER@16-TEXT CRS:QUESTION/CRS:ANSWER@16-IS-CORRECT CRS:QUESTION/CRS:ANSWER@16-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@16 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@16-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@16-NAME CRS:QUESTION/CRS:ANSWER@17 CRS:QUESTION/CRS:ANSWER@17-ID 
CRS:QUESTION/CRS:ANSWER@17-TEXT CRS:QUESTION/CRS:ANSWER@17-IS-CORRECT CRS:QUESTION/CRS:ANSWER@17-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@17 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@17-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@17-NAME CRS:QUESTION@5 CRS:QUESTION@5-ID CRS:QUESTION@5-TEXT CRS:QUESTION@5-SORT-NUM CRS:QUESTION/CRS:QUESTION-TYPE@5 CRS:QUESTION/CRS:QUESTION-TYPE@5-ID CRS:QUESTION/CRS:QUESTION-TYPE@5-NAME CRS:QUESTION/CRS:ANSWER@18 CRS:QUESTION/CRS:ANSWER@18-ID CRS:QUESTION/CRS:ANSWER@18-TEXT CRS:QUESTION/CRS:ANSWER@18-IS-CORRECT CRS:QUESTION/CRS:ANSWER@18-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@18 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@18-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@18-NAME CRS:QUESTION/CRS:ANSWER@19 CRS:QUESTION/CRS:ANSWER@19-ID CRS:QUESTION/CRS:ANSWER@19-TEXT CRS:QUESTION/CRS:ANSWER@19-IS-CORRECT CRS:QUESTION/CRS:ANSWER@19-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@19 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@19-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@19-NAME CRS:QUESTION/CRS:ANSWER@20 CRS:QUESTION/CRS:ANSWER@20-ID CRS:QUESTION/CRS:ANSWER@20-TEXT CRS:QUESTION/CRS:ANSWER@20-IS-CORRECT CRS:QUESTION/CRS:ANSWER@20-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@20 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@20-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@20-NAME CRS:QUESTION/CRS:ANSWER@21 CRS:QUESTION/CRS:ANSWER@21-ID CRS:QUESTION/CRS:ANSWER@21-TEXT CRS:QUESTION/CRS:ANSWER@21-IS-CORRECT CRS:QUESTION/CRS:ANSWER@21-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@21 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@21-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@21-NAME CRS:QUESTION/CRS:ANSWER@22 CRS:QUESTION/CRS:ANSWER@22-ID CRS:QUESTION/CRS:ANSWER@22-TEXT CRS:QUESTION/CRS:ANSWER@22-IS-CORRECT CRS:QUESTION/CRS:ANSWER@22-SORT-NUM CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@22 CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@22-ID CRS:QUESTION/CRS:ANSWER/CRS:ANSWER-TYPE@22-NAME)

Any ideas would be greatly appreciated!

Submitted by support on Wed, 2006-10-04 22:15

Hi,

That happened because the "Answer" element is the most frequently repeating element within the document, and was therefore selected by the autodetection logic as the record enclosure.

To access the questions individually, you need to specify an appropriate format string, which in the case of the XML sample you provided is:

xml|F:FULL/CRS:COURSE/CRS:QUIZ/CRS:QUESTION/
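
Reading that format string against the sample XML, it looks like the "xml|" prefix followed by the upper-cased element names on the way down to the record element, joined by "/" and ending with a trailing "/". The sketch below only illustrates that reading; buildFormatString is a hypothetical helper, not part of Magic Parser, and the rule is inferred from the single example in this thread rather than from documented behaviour.

```php
<?php
// Hypothetical helper (not part of Magic Parser): builds an XML format string
// from the element names leading down to the record element.
// Assumption: the rule is inferred from the one format string shown above.
function buildFormatString(array $elements)
{
    // Upper-case each name, join with "/", then add the "xml|" prefix
    // and the trailing "/".
    return "xml|" . implode("/", array_map("strtoupper", $elements)) . "/";
}

$format = buildFormatString(array("f:full", "crs:course", "crs:quiz", "crs:question"));
print $format . "\n"; // prints xml|F:FULL/CRS:COURSE/CRS:QUIZ/CRS:QUESTION/
```

Stopping the list one element earlier (at "crs:quiz") would, by the same reading, make each quiz the record instead of each question.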

To help you get started, I've saved your example XML as "questions.xml", and written an example script to display each question.


Here's the code:

<?php
  require("MagicParser.php");

  // Called once for each CRS:QUESTION record found by the parser.
  function myRecordHandler($record)
  {
    print "<h2>Question ".$record["CRS:QUESTION-ID"]."</h2>";
    print "<p>".$record["CRS:QUESTION-TEXT"]."</p>";
  }

  // The format string makes each CRS:QUESTION element a record.
  MagicParser_parse("questions.xml","myRecordHandler","xml|F:FULL/CRS:COURSE/CRS:QUIZ/CRS:QUESTION/");
?>
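
If the record passed to the handler still carries the nested answer fields, one way to stop at the question level is simply to pick out the keys you want and ignore the rest. The following is a minimal sketch of that idea in plain PHP; pickFields is a hypothetical helper, and the sample record is made up for illustration (the CRS:QUESTION-* keys follow the names used above, while the CRS:ANSWER-* key names are an assumption about how nested fields would appear).

```php
<?php
// Hypothetical helper: keep only the named fields of a record array.
function pickFields(array $record, array $wanted)
{
    // array_flip turns the list of names into keys so that
    // array_intersect_key can match them against the record.
    return array_intersect_key($record, array_flip($wanted));
}

// Made-up sample record for illustration only.
$record = array(
    "CRS:QUESTION-ID"       => "1",
    "CRS:QUESTION-TEXT"     => "q1 text here",
    "CRS:QUESTION-SORT-NUM" => "1",
    "CRS:ANSWER-ID"         => "a",        // assumed key name
    "CRS:ANSWER-TEXT"       => "yyyyyyy1", // assumed key name
);

$questionOnly = pickFields($record, array("CRS:QUESTION-ID", "CRS:QUESTION-TEXT"));
print_r($questionOnly);
```

Inside myRecordHandler you could call pickFields($record, ...) before building the database insert, so that only the question-level columns are written.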

Hope this helps!
Cheers,
David.