Python — Looking “Sax”-ey

So I got to toy around with the SAX parser in Python a bit, and I’ve got to say, I rather liked it. I was writing a script that needed to compare two XML documents to see if all of the text nodes in the first file appeared in the second file. My first thought was to use minidom (since it was the library my Python book used). Unfortunately, I was out of town and using my mother’s computer (no jokes, please). I didn’t really want to install a new library on her laptop, so I just decided to use the SAX library.

Following is the code for the program:

[sourcecode language="python"]

import sys
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

# Lines of the second file, stripped of surrounding whitespace
with open(sys.argv[2]) as second_file:
    file2 = [item.strip() for item in second_file]

class EventHandling(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.xpath = ''
        self.item_number = 0

    def startElement(self, name, attrs):
        # One level deeper: push the tag and record its full xpath
        self.item_number += 1
        file1_array.append("/" + name)
        self.xpath = ''.join(file1_array)
        all_input_xpath_array.append(self.xpath)

    def endElement(self, name):
        # One level back up: drop the tag that just closed
        self.item_number = self.item_number - 1
        del file1_array[self.item_number:]

    def characters(self, string):
        # Record the xpath, text and line number of each non-blank text node
        string = string.strip()
        if string != '':
            self.list_of_xpaths = list_of_xpaths
            self.list_of_xpaths.append(self.xpath)
            self.list_of_text_nodes = list_of_text_nodes
            self.list_of_text_nodes.append(string)

            self.line_number = self._locator.getLineNumber()
            self.line_number_array = line_number_array
            self.line_number_array.append(self.line_number)

my_dict = {}
line_number = ''
file1_array = []             # start tags that are currently open
all_input_xpath_array = []   # xpath of every start tag, in document order
line_number_array = []
list_of_xpaths = []
backward_line_number_array = []
list_of_text_nodes = []

parser = make_parser()
parser.setContentHandler(EventHandling())
parser.parse(sys.argv[1])

# Reverse the line numbers only after parsing has filled the list
for i in reversed(line_number_array):
    backward_line_number_array.append(i)

[/sourcecode]

I’m going to stick with discussing the SAX stuff, to keep this relatively short.

The class EventHandling inherits from the ContentHandler class in the SAX library. What this does is (get ready for it) create the event handlers. The class defines three events to act on (rather, it overrides the default actions for the three events): it does something for each start and end tag, as well as any text nodes it finds. The class starts out by initializing a few variables we’ll use in a bit, which I’ll go over when we get there.
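
Here’s a minimal sketch of that override pattern on its own, separate from my script. CountingHandler and its counts dictionary are names I made up for illustration; only ContentHandler, its three methods, and xml.sax.parseString come from the library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Hypothetical handler: counts each of the three event types."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {"start": 0, "end": 0, "text": 0}

    def startElement(self, name, attrs):
        self.counts["start"] += 1

    def endElement(self, name):
        self.counts["end"] += 1

    def characters(self, content):
        # Skip whitespace-only chunks, the same way the script above does.
        if content.strip():
            self.counts["text"] += 1


handler = CountingHandler()
xml.sax.parseString(
    b"<body><author>John</author><author>Jim</author></body>", handler
)
print(handler.counts)
```

Any method you don’t override keeps the do-nothing default from ContentHandler, which is why three short methods are enough.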

Under the startElement method, you’ll see the item_number variable with a “+=”. Every time the parser comes across a start tag in the XML file, it increments item_number by one, so the variable always holds the current nesting depth. This helps to build whole xpaths. Under the endElement method, you’ll see it decrements the variable by 1. Again, I’ll go over that a bit more in a minute.

The SAX parser reads the XML file in chunks, firing an event whenever it encounters an opening tag, text content, or a closing tag.
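
To see the chunk-by-chunk behaviour for yourself, here’s a rough sketch that hands the parser a document in two pieces through the incremental feed() and close() methods, which the default expat-based parser returned by make_parser() provides. EventLogger and the way I split the document are my own assumptions for the demo, not something from my script.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler: records events in the order they fire."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():
            self.events.append(("text", content.strip()))


logger = EventLogger()
parser = make_parser()
parser.setContentHandler(logger)

# Feed the document in two arbitrary pieces instead of all at once.
parser.feed(b"<body><author>John</author>")
seen_after_first_chunk = list(logger.events)  # snapshot before the rest arrives
parser.feed(b"<author>Jim</author></body>")
parser.close()

print(seen_after_first_chunk)
print(logger.events)
```

The first print shows that events were already delivered before the parser had seen the end of the document, which is the big difference from a DOM library like minidom.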

In startElement, the “name” argument is the actual tag name, so if the parser sees

[sourcecode]<body>[/sourcecode]

then “name” holds the string “body”. file1_array.append("/"+name) begins the xpath value, in this case “/body”. So each time the parser fires startElement, file1_array gets a new entry for the start tag it has just encountered. Then “self.xpath” joins all the items in file1_array (which should be all the start tags that are still open at that point) to give you a single, whole xpath for that start element. Finally, the script adds the entire xpath for that start element to the list “all_input_xpath_array”.
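
Here’s the startElement half of that idea as a small self-contained sketch. It keeps the lists on the handler instead of in module-level variables, and PathRecorder, open_tags, and all_paths are hypothetical names standing in for my EventHandling class, file1_array, and all_input_xpath_array. The sample document only nests and never has siblings, so no cleanup is needed yet.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PathRecorder(ContentHandler):
    """Hypothetical handler: records the xpath of every start tag."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.open_tags = []  # plays the role of file1_array
        self.all_paths = []  # plays the role of all_input_xpath_array

    def startElement(self, name, attrs):
        # Add one path segment, then join every segment into a full xpath.
        self.open_tags.append("/" + name)
        self.all_paths.append("".join(self.open_tags))


recorder = PathRecorder()
xml.sax.parseString(
    b"<body><section><title>Notes</title></section></body>", recorder
)
print(recorder.all_paths)  # ['/body', '/body/section', '/body/section/title']
```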

But what if a start tag has already been closed with an end tag? You’ve got to keep your SAX parser from adding those closed tags to the next xpath, right? So what you need is a little deleting. Here’s how to account for that: under endElement, you’ll see that it subtracts 1 from the variable “self.item_number” when it encounters an end tag. So if you have

[sourcecode]<body><author>John</author><author>Jim</author></body>[/sourcecode]

then the value is 1 when the parser reaches

[sourcecode]<body>[/sourcecode]

and 2 at both occurrences of

[sourcecode]<author>[/sourcecode]

because the parser increments the variable by 1 at each start tag and decrements it by 1 at each

[sourcecode]</author>[/sourcecode]

end tag.

Essentially, that gives me a numeric value for the nesting level of each start tag, counted from the document’s root element (which is level 1). Think of it this way: if the XML document were indented with one [tab] per nesting level, each extra [tab] in front of a start tag would add 1 to its “item_number” value.
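
The tab picture can be sketched directly. DepthTracker and its attributes are hypothetical names for this demo; the counter itself works the same way as item_number in my script.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class DepthTracker(ContentHandler):
    """Hypothetical handler: remembers the nesting depth of each start tag."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.depth = 0
        self.depths = []  # (tag name, depth) pairs in document order

    def startElement(self, name, attrs):
        self.depth += 1
        self.depths.append((name, self.depth))

    def endElement(self, name):
        self.depth -= 1


tracker = DepthTracker()
xml.sax.parseString(
    b"<body><author>John</author><author>Jim</author></body>", tracker
)
print(tracker.depths)  # [('body', 1), ('author', 2), ('author', 2)]

# One tab per level below the root, so depth 1 gets no tab at all.
outline = ["\t" * (depth - 1) + name for name, depth in tracker.depths]
print("\n".join(outline))
```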

This allows me to delete the entry for every element the parser has just closed from file1_array (that’s what “del file1_array[self.item_number:]” does). End tags themselves are never added to the array; the deletion removes the matching start tag once its element is finished. That way the most recent start tag is gone before the next start tag’s xpath is recorded (the deleted start tag is already in the xpath for its own element, and we don’t want it duplicated in the next start tag’s xpath). Therefore

[sourcecode]<body><author>John</author><author>Jim</author></body>[/sourcecode]

will create an array of xpaths (all_input_xpath_array) that will be

[sourcecode]['/body', '/body/author', '/body/author'][/sourcecode]
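
To show why the deletion matters, here’s a parser-free sketch that replays the same start and end events with and without it. record_paths, its prune flag, and the hand-written events list are all hypothetical helpers for this comparison, not part of my script.

```python
def record_paths(events, prune=True):
    """Replay ("start" | "end", tag) pairs and return each start tag's xpath."""
    stack, depth, paths = [], 0, []
    for kind, tag in events:
        if kind == "start":
            depth += 1
            stack.append("/" + tag)
            paths.append("".join(stack))
        else:
            depth -= 1
            if prune:
                # Same slice deletion as del file1_array[self.item_number:]
                del stack[depth:]
    return paths


# Events for <body><author>John</author><author>Jim</author></body>
events = [
    ("start", "body"),
    ("start", "author"),
    ("end", "author"),
    ("start", "author"),
    ("end", "author"),
    ("end", "body"),
]

print(record_paths(events))               # ['/body', '/body/author', '/body/author']
print(record_paths(events, prune=False))  # ['/body', '/body/author', '/body/author/author']
```

Without the deletion, the first author tag hangs around and the second one ends up looking like its child.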

The “characters” event is pretty simple, really. All the real logic is in the startElement and endElement methods. Under characters, whenever the parser encounters a non-blank text node, it just adds the xpath, the line number, and the text to the appropriate arrays, using the xpath built up through the logic in startElement and endElement. The hard part was realizing how to get the parser to know what xpath it was on when it encountered text nodes, and that is handled by the startElement and endElement methods.
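
Putting the three handlers together, a compact sketch might look like the following. TextCollector and its attribute names are hypothetical, and unlike my script it keeps everything on the handler and stores the locator by overriding setDocumentLocator rather than reading the private _locator attribute. One caveat: a SAX parser is allowed to deliver a single text node in several characters calls, so long text could show up as more than one entry here.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: collects (xpath, line number, text) triples."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.open_tags = []
        self.locator = None
        self.found = []

    def setDocumentLocator(self, locator):
        # The parser hands over a locator before it starts firing events.
        self.locator = locator

    def startElement(self, name, attrs):
        self.open_tags.append("/" + name)

    def endElement(self, name):
        self.open_tags.pop()

    def characters(self, content):
        text = content.strip()
        if text:
            xpath = "".join(self.open_tags)
            self.found.append((xpath, self.locator.getLineNumber(), text))


doc = b"<body>\n<author>John</author>\n<author>Jim</author>\n</body>"
collector = TextCollector()
xml.sax.parseString(doc, collector)
for xpath, line, text in collector.found:
    print(xpath, line, text)
```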

Hopefully now you have an idea of what my SAX parser does and how it does it!