Thursday, April 20, 2017

Generating and Parsing Markup in Python [1]

The first thing I'm going to do in building out the markup module's class structure is figure out where the various members of those classes originate, and at what point they become concrete. One of my priorities, as mentioned before, is to keep the classes and their members in the markup module as similar as possible to the equivalent DOM objects in typical client-side JavaScript implementations.

Conforming to DOM Conventions

I can't really meet that goal of conforming to the interfaces of DOM nodes (tags, text-nodes, comments and CDATA sections) until I know what members they expose in a browser context. To determine that, I wrote a chunk of JavaScript, living in a bare-bones HTML page (download below), that iterates over the properties and methods listed on the w3schools.com site, checks an instance of each node-type (except CDATA sections, more on that in a bit) for each member, and reports what comes back. If a given node-type does not report a member, the equivalent class-member in the markup module can be skipped. If the check returns an expected type, like a function for a method, that member should be kept. Anything else that comes back will require some additional discovery.

I'd originally included CDATA sections in my collection of objects to examine, but the browser that I ran the page against (Chromium) wouldn't actually allow the creation of a CDATA section, even though it has a document.createCDATASection method. According to the error-message I got back, creation of CDATA sections is not supported for HTML documents. The closest I could get to an actual CDATA section was a comment that contained all of the CDATA's original content, plus the [CDATA[ start and ]] end text. As a result, I don't really know what a CDATA's members look like without doing more digging around. I'm willing to leave that be for now, since the Comment, Tag and Text classes will likely suffice for my needs for the time being.

The breakdown I got back from that analysis-script was:

  markup Module Equivalent Class
Member Name Comment Tag Text
Property Members
accessKey n/a string n/a
attributes n/a object n/a
childElementCount n/a number n/a
childNodes object object object
children n/a object n/a
classList n/a object n/a
className n/a string n/a
clientHeight n/a number n/a
clientLeft n/a number n/a
clientTop n/a number n/a
clientWidth n/a number n/a
contentEditable n/a string n/a
dir n/a string n/a
firstChild null object null
firstElementChild n/a null n/a
id n/a string n/a
innerHTML n/a string n/a
isContentEditable n/a boolean n/a
lang n/a string n/a
lastChild null object null
lastElementChild n/a null n/a
namespaceURI n/a string n/a
nextElementSibling null null null
nextSibling null null null
nodeName string string string
nodeType number number number
nodeValue string null string
offsetHeight n/a number n/a
offsetLeft n/a number n/a
offsetParent n/a null n/a
offsetTop n/a number n/a
offsetWidth n/a number n/a
ownerDocument object object object
parentElement null null object
parentNode null null object
previousElementSibling null null null
previousSibling null null null
scrollHeight n/a number n/a
scrollLeft n/a number n/a
scrollTop n/a number n/a
scrollWidth n/a number n/a
style n/a object n/a
tabIndex n/a number n/a
tagName n/a string n/a
textContent string string string
title n/a string n/a
Method Members
addEventListener function function function
appendChild function function function
blur n/a function n/a
click n/a function n/a
cloneNode function function function
compareDocumentPosition function function function
contains function function function
focus n/a function n/a
getAttribute n/a function n/a
getAttributeNode n/a function n/a
getElementsByClassName n/a function n/a
getElementsByTagName n/a function n/a
getFeature n/a n/a n/a
hasAttribute n/a function n/a
hasAttributes n/a function n/a
hasChildNodes function function function
insertBefore function function function
isDefaultNamespace function function function
isEqualNode function function function
isSameNode function function function
isSupported n/a n/a n/a
nodelist.item n/a n/a n/a
normalize function function function
querySelector n/a function n/a
querySelectorAll n/a function n/a
removeAttribute n/a function n/a
removeAttributeNode n/a function n/a
removeChild function function function
removeEventListener function function function
replaceChild function function function
scrollIntoView n/a function n/a
setAttribute n/a function n/a
setAttributeNode n/a function n/a
toString function function function
This gives me enough information to at least start making decisions about where the various properties and methods need to be defined, and how. Given the class-relationships already defined, any keeper item from the table above that exists in all of the class-types should be required by the IsNode interface, at least as a default consideration. The same consideration should be given to any items that return the same values across all of the class-types, even if they haven't been flagged as keepers. The reasoning is that, while I checked each node-type in the original JavaScript, I did not populate a large enough sample of nodes and elements in that script to feel confident that I captured every valid low-level member. If possible, those same items should also have a concrete implementation in the BaseNode abstract class. There will probably be a few items that fall into that category but just don't make sense in those locations; I'll note those as I go along.
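As a rough illustration of that first-pass selection, the sketch below runs the rule over a handful of rows copied from the table above. The SURVEY dictionary and the classify function are hypothetical names, not part of the markup module, and the output is only a starting list that still needs the manual review described here.

```python
# Hypothetical subset of the survey table: member -> (Comment, Tag, Text)
SURVEY = {
    'accessKey':   ('n/a', 'string', 'n/a'),
    'childNodes':  ('object', 'object', 'object'),
    'nextSibling': ('null', 'null', 'null'),
    'nodeValue':   ('string', 'null', 'string'),
    'parentNode':  ('null', 'null', 'object'),
    'cloneNode':   ('function', 'function', 'function'),
    'getFeature':  ('n/a', 'n/a', 'n/a'),
}


def classify(survey):
    """Split members into IsNode candidates and members needing review.

    A member is a candidate when all three node-types reported the same
    result; it needs review when every node-type has it but the results
    differ. Anything reported as 'n/a' by some node-type is left out.
    """
    candidates, review = [], []
    for name, results in sorted(survey.items()):
        if 'n/a' in results:
            continue
        if len(set(results)) == 1:
            candidates.append(name)
        else:
            review.append(name)
    return candidates, review


candidates, review = classify(SURVEY)
print(candidates)  # ['childNodes', 'cloneNode', 'nextSibling']
print(review)      # ['nodeValue', 'parentNode']
```

Members that land in the review list are the ones where a judgment call is needed, which is why the list cannot be generated purely mechanically.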

Defining the IsNode interface

Starting, then, with the items in the table that are keepers, or that returned identical values across all the different node-types, the following are either directly valid or need to be looked at in more detail for requirement in IsNode:

  markup Module Equivalent Class
Member Name Comment Tag Text
Property Members
childNodes object object object
nextElementSibling null null null
nextSibling null null null
nodeName string string string
nodeType number number number
nodeValue string null string
parentElement null null object
parentNode null null object
previousElementSibling null null null
previousSibling null null null
textContent string string string
Method Members
addEventListener function function function
appendChild function function function
cloneNode function function function
compareDocumentPosition function function function
contains function function function
hasChildNodes function function function
insertBefore function function function
isDefaultNamespace function function function
isEqualNode function function function
isSameNode function function function
normalize function function function
removeChild function function function
removeEventListener function function function
replaceChild function function function
toString function function function

While I was stripping down the list, I noticed that parentElement and parentNode didn't get flagged in such a way as to be considered for inclusion in IsNode, but it's a basic fact of markup languages that all nodes should have those properties. If they aren't populated, that simply means that the node doesn't currently have a parent, though it might well have one later after some manipulation. The nodeValue property

Looking over that list of remaining members, there are already a few that don't make any sense to include in IsNode:

  • Any members that involve child nodes — Those are aspects of a Tag, certainly, but since Comment and Text will also derive from IsNode and they don't have child nodes (and can't?), those should go away. That removes:
    • The childNodes property;
    • The appendChild method;
    • The hasChildNodes method;
    • The insertBefore method;
    • The removeChild method; and
    • The replaceChild method;
  • Any members relating to manipulation of event-listeners — On the server side, where all of the markup module's functionality is actually running, there is no browser context available, so no event-handling processes, and so none of these members are useful. That removes:
    • The addEventListener method; and
    • The removeEventListener method.
The rest will need to be examined in more detail, one by one, so let me just jump into that now...

Implementing and Testing the Abstract Properties

Since IsNode is only an interface, there are no concrete implementations of properties to define, only abstract property requirements that will be picked up by derived classes. That makes the definition of those property requirements very simple, and the testing of them pretty straightforward. The real trick is determining where the concrete implementations of them are going to live. Going through the list of properties:

nextElementSibling
Returns the next element at the same node tree level — w3schools
Abstract property in IsNode
Implement in BaseNode
nextSibling
Returns the next node at the same node tree level — w3schools
Abstract property in IsNode
Implement in BaseNode
nodeName
Returns the name of a node — w3schools
Returns the tag-name for Tags, and magic-string constants for other node-types (#comment for a Comment, #document for a document, #text for a Text object, and #cdata-section for a CDATA section, per the DOM specification).
Abstract property in IsNode
Implement in CDATA, Comment, Tag and Text classes
nodeType
Returns the node type of a node — w3schools
Returns 8 for Comments, 4 for CDATAs, 1 for Tags and 3 for Texts
Abstract property in IsNode
Implement in CDATA, Comment, Tag and Text classes
nodeValue
Sets or returns the value of a node w3schools
It appears that this property returns the first text-node child of an element, rather than the entire set of text-node values, at least in Chromium. At any rate, it's dependent on the presence of child nodes, so...
Skip
Implement in Tag
parentElement
Returns the parent element node of an element — w3schools
Abstract property in IsNode
Implement in BaseNode
parentNode
Returns the parent node of an element — w3schools
Abstract property in IsNode
Implement in BaseNode
previousElementSibling
Returns the previous element at the same node tree level — w3schools
Abstract property in IsNode
Implement in BaseNode
previousSibling
Returns the previous node at the same node tree level — w3schools
Abstract property in IsNode
Implement in BaseNode
textContent
Sets or returns the textual content of a node and its descendants — w3schools
The return value is, essentially, a concatenation of all child Text nodes in a Tag, or the data value (the content) of a Comment or Text instance. If CDATA is assumed to behave like a Comment, then it would also return the inner content of the instance.
Abstract property in IsNode
Implement in HasTextData and Tag
The abstraction of these properties in IsNode is just a few lines of code:
#-----------------------------------#
# Abstract Properties               #
#-----------------------------------#

nextElementSibling = abc.abstractproperty()
nextSibling = abc.abstractproperty()
nodeName = abc.abstractproperty()
nodeType = abc.abstractproperty()
parentElement = abc.abstractproperty()
parentNode = abc.abstractproperty()
previousElementSibling = abc.abstractproperty()
previousSibling = abc.abstractproperty()
textContent = abc.abstractproperty()
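To show how those requirements get picked up by derived classes, here is a minimal sketch with only nodeName and nodeType, using the values listed above. It is written in Python 3 syntax so that it can be run as-is (the module in this post targets Python 2), and the class bodies are illustrative assumptions, not the markup module's actual code.

```python
import abc


class IsNode(abc.ABC):
    """Cut-down stand-in for the interface: two abstract properties."""

    @property
    @abc.abstractmethod
    def nodeName(self):
        """The name of the node."""

    @property
    @abc.abstractmethod
    def nodeType(self):
        """The numeric node-type."""


class Comment(IsNode):
    """Concrete node-type: overrides both abstract properties."""

    @property
    def nodeName(self):
        return '#comment'

    @property
    def nodeType(self):
        return 8


class Text(IsNode):
    """Concrete node-type: overrides both abstract properties."""

    @property
    def nodeName(self):
        return '#text'

    @property
    def nodeType(self):
        return 3


print(Comment().nodeName, Comment().nodeType)  # #comment 8
print(Text().nodeName, Text().nodeType)        # #text 3
```

A class that overrides only one of the two properties would still be abstract and could not be instantiated.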
The test-methods for each property will follow this pattern:
def testPROPERTYNAME(self):
    """Unit-tests the PROPERTYNAME property of an IsNode instance."""
    try:
        testInstance = markup.IsNode()
    except TypeError as error:
        # The TypeError message lists the abstract members by name
        actual = 'PROPERTYNAME' in str( error )
        self.assertTrue( actual, 'The TypeError raised by trying to '
            'instantiate IsNode should include the "PROPERTYNAME" '
            'abstract method-name' )
    except Exception as error:
        self.fail( 'testPROPERTYNAME expected a TypeError, '
            'but %s was raised instead:\n  - %s' % ( 
                error.__class__.__name__, error
            )
        )
    else:
        # No exception at all means IsNode is not abstract
        self.fail( 'testPROPERTYNAME expected a TypeError, '
            'but IsNode was instantiated without error' )
In a nutshell, this checks that the property being tested appears in the TypeError raised by trying to instantiate IsNode, which confirms that the property is abstract.
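The behavior those tests lean on can be checked in isolation. This sketch uses a stand-in IsNode with a single abstract property (Python 3 syntax; the stand-in and the test-class name are assumptions, not the real interface) and an alternative shape for the same check built on assertRaises, which also fails the test when no exception is raised at all. The exact wording of the TypeError message varies between Python versions, so the check only looks for the member name.

```python
import abc
import unittest


class IsNode(abc.ABC):
    """Stand-in interface with one abstract property."""

    @property
    @abc.abstractmethod
    def nodeName(self):
        """The name of the node."""


class testIsNode(unittest.TestCase):

    def testnodeName(self):
        """Instantiating IsNode must raise a TypeError naming nodeName."""
        with self.assertRaises(TypeError) as context:
            IsNode()
        self.assertIn('nodeName', str(context.exception))


# Run the single test-case and keep the result for inspection
suite = unittest.defaultTestLoader.loadTestsFromTestCase(testIsNode)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```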

Implementing and Testing the Abstract Methods

The same basic rule, that member-definitions need only exist in the IsNode interface, applies to the method members as well. The main decision that needs to be made is also similar: where do a given method's requirement and definition belong? Working through that yields a list much like the one for the properties above:

cloneNode
Clones an element — w3schools
Since this is capable of making shallow or deep copies, and the mechanism for making those copies will vary, it'll have to be implemented in the concrete classes.
Abstract method in IsNode
Implement in CDATA, Comment, Tag and Text
compareDocumentPosition
Compares the document position of two elements — w3schools
The description of the method on the w3schools site, frankly, has me wondering if there's even any point to implementing this on the server side. I've never seen this method used in the wild, though that doesn't mean that it isn't used. I can't think of a use-case for it that isn't better served (at least on the server side) by local Python code, particularly since all the method really returns is a bit-mask number that indicates the relative position of the owner element and the element provided.
Skip
contains
Returns true if a node is a descendant of a node, otherwise false — w3schools
The contains method really applies only to objects that have children. That hasn't stopped it from being callable on DOM nodes where it doesn't really make sense, though. For example, executing this JavaScript:
ook = document.createTextNode( 'ook' );
eek = document.createTextNode( 'eek' );
ook.contains( eek );
in several browsers yields
false
That result kind of makes sense — neither of the created text-nodes is a parent of the other, nor can either be appended to the other (calling ook.appendChild( eek ) throws an error).
I'm going to skip this method for now, but there's some discussion around that decision that I'll dig into shortly.
isDefaultNamespace
Returns true if a specified namespaceURI is the default, otherwise false — w3schools
Text-nodes don't have a namespace — it's not a defined member of that node-type at all. Nor do comments, and I presume that the same would hold true for CDATA sections.
Skip
Implement in Tag
isEqualNode
Checks if two elements are equal — w3schools
The complete criteria for testing equality on the client side are listed at the w3schools link above, but since those criteria depend on properties that won't exist across all IsNode instances, the usefulness of that list is, perhaps, questionable. Still, being able to perform a comparison is useful. The real question, then, is how that is going to be done. I'll work out more details on that later, but for now:
Abstract method in IsNode
Implement in BaseNode
isSameNode
Checks if two elements are the same node — w3schools
Abstract method in IsNode
Implement in BaseNode
normalize
Joins adjacent text nodes and removes empty text nodes in an element — w3schools
This feels like something that shouldn't exist outside of a Tag, and that seems to be borne out by the fact that it's not possible to usefully call normalize on a text- or comment-node in the browser.
Skip
Implement in Tag
toString
Converts an element to a string — w3schools
Abstract method in IsNode
Implement in CDATA, Comment, Tag and Text
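One way the split between isSameNode and isEqualNode could look in Python is identity versus member-by-member comparison. The sketch below is an assumption about a possible BaseNode, not the implementation that will end up in the module: the constructor arguments and the choice of nodeType, nodeName and textContent as the compared members are illustrative only.

```python
class BaseNode(object):
    """Hypothetical base class holding just enough state to compare."""

    def __init__(self, nodeType, nodeName, textContent):
        self.nodeType = nodeType
        self.nodeName = nodeName
        self.textContent = textContent

    def isSameNode(self, other):
        """True only when other is this very object."""
        return self is other

    def isEqualNode(self, other):
        """True when the compared members all match."""
        return (
            isinstance(other, BaseNode)
            and self.nodeType == other.nodeType
            and self.nodeName == other.nodeName
            and self.textContent == other.textContent
        )


ook = BaseNode(3, '#text', 'ook')
ook2 = BaseNode(3, '#text', 'ook')
eek = BaseNode(3, '#text', 'eek')

print(ook.isSameNode(ook))    # True
print(ook.isSameNode(ook2))   # False
print(ook.isEqualNode(ook2))  # True
print(ook.isEqualNode(eek))   # False
```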

The contains discussion

Also like the abstract-property definitions, abstract methods don't need much in IsNode:

    @abc.abstractmethod
    def METHODNAME( self, arg1, arg2=None, *args, **kwargs ):
        raise NotImplementedError( '%s.METHODNAME is not implemented as '
            'required by IsNode' % self.__class__.__name__ )
And the unit-tests, since they're really just checking the same sort of relationship between methods and the IsNode interface-class as the property-tests did, are almost identical:
def testMETHODNAME(self):
    """Unit-tests the METHODNAME method of an IsNode instance."""
    try:
        testInstance = markup.IsNode()
    except TypeError as error:
        # The TypeError message lists the abstract members by name
        actual = 'METHODNAME' in str( error )
        self.assertTrue( actual, 'The TypeError raised by trying to '
            'instantiate IsNode should include the "METHODNAME" '
            'abstract method-name' )
    except Exception as error:
        self.fail( 'testMETHODNAME expected a TypeError, '
            'but %s was raised instead:\n  - %s' % ( 
                error.__class__.__name__, error
            )
        )
    else:
        # No exception at all means IsNode is not abstract
        self.fail( 'testMETHODNAME expected a TypeError, '
            'but IsNode was instantiated without error' )
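Because every one of these tests has the same shape, the required names can also be checked in one pass against the __abstractmethods__ attribute that the abc machinery maintains on abstract classes. The stand-in IsNode below (Python 3 syntax, four methods only), the deep argument on cloneNode and the REQUIRED set are all assumptions made for illustration.

```python
import abc


class IsNode(abc.ABC):
    """Stand-in interface carrying only the abstract methods kept above."""

    @abc.abstractmethod
    def cloneNode(self, deep=False):
        """Return a shallow or deep copy of the node."""

    @abc.abstractmethod
    def isEqualNode(self, other):
        """Return True if other is equal to this node."""

    @abc.abstractmethod
    def isSameNode(self, other):
        """Return True if other is this node."""

    @abc.abstractmethod
    def toString(self):
        """Return the node rendered as a string."""


# Names the interface is expected to require (hypothetical list)
REQUIRED = {'cloneNode', 'isEqualNode', 'isSameNode', 'toString'}

# Anything left over is required but not declared abstract
missing = REQUIRED - IsNode.__abstractmethods__
print(sorted(missing))  # []
```

The per-member tests give more specific failure messages, so this is a complement to them, not necessarily a replacement.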

With those tests in place for IsNode in the test_markup.py unit-test module, the test-results come back clean:

########################################
Unit-test Results: idic.markup
#--------------------------------------#
Tests were SUCCESSFUL
Number of tests run ... 18
Tests ran in .......... 0.001 seconds
########################################
IsNode, then, is done — written and tested.

Dealing with Enumerations in Python 

Python 2 doesn't have a formal enumeration type in its standard library the way several other languages do (the enum module only arrived in Python 3.4), but there are a number of ways to work around that. My personal favorite uses namedtuple from the collections module, based on some observations I've made about how an enumeration behaves:

  • An enumeration is a constant;
  • An enumeration is immutable — its values cannot be changed at run-time;
  • An enumeration's members are individually accessible by name; and
  • An enumeration is a container, with members that can be used for comparison purposes. That is, given an enumeration of nodeTypes, with presumably-distinct CDATA, Comment, Tag and Text values:
    nodeTypes.Tag in nodeTypes          # == True
    nodeTypes.Text in nodeTypes         # == True
    nodeTypes.CDATASection in nodeTypes # == True
    nodeTypes.Comment in nodeTypes      # == True
    
There are probably a few more aspects to the behavior of an enumeration, but those four are the main ones, at least that I can think of at this point.

Using a namedtuple, it's actually pretty easy to generate a constant value that exhibits all of those behaviors. The basic code required looks something like this, using the nodeType values from the w3schools site and generating an enumeration-equivalent named nodeTypes that could be added to the markup module:

from collections import namedtuple

nodeTypes = namedtuple(
    'enumNodeTypes', 
    [ 'Tag', 'Text', 'CDATASection', 'Comment' ],
    )(
        Tag=1,
        Text=3,
        CDATASection=4,
        Comment=8,
    )

__all__.append( 'nodeTypes' )
  • nodeTypes is a constant because namedtuple returns a class, and the code then creates an instance of that class;
  • It's immutable because it's not possible to add values to, remove values from, or alter existing values of the named items except by altering the definition of those members in the code;
  • Its members are individually accessible by name because that's a basic capability of a namedtuple-generated class; and
  • It's a container that allows the use of someValue in nodeTypes.
That last item is best explained with a quick demonstration: Printing nodeTypes and all of the nodeTypes.NAME in nodeTypes examples in the container criteria above yields:
All entries in nodeTypes
  + enumNodeTypes( Tag=1, Text=3, CDATASection=4, Comment=8 )
nodeTypes.Tag in nodeTypes ............ True
nodeTypes.Text in nodeTypes ........... True
nodeTypes.CDATASection in nodeTypes ... True
nodeTypes.Comment in nodeTypes ........ True
12 in nodeTypes ....................... False
"ook" in nodeTypes .................... False
So, while this approach may not be a real enumeration, it provides all of the functionality of one that I think I'll need.
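Two of those behaviors are worth poking at directly. The sketch below rebuilds the same nodeTypes value and shows that assigning to a member raises AttributeError, and that the in test compares against the member values, so any integer that happens to equal a member value also counts as contained. The reverse lookup on the last line is an extra that relies on the namedtuple _fields attribute.

```python
from collections import namedtuple

nodeTypes = namedtuple(
    'enumNodeTypes',
    ['Tag', 'Text', 'CDATASection', 'Comment'],
)(Tag=1, Text=3, CDATASection=4, Comment=8)

# Members cannot be reassigned on the instance
try:
    nodeTypes.Tag = 2
    reassigned = True
except AttributeError:
    reassigned = False

print(reassigned)         # False
print(nodeTypes.Comment)  # 8
print(3 in nodeTypes)     # True: 3 is the value of Text
print(12 in nodeTypes)    # False

# Reverse lookup: find the member name for a given value
print(nodeTypes._fields[nodeTypes.index(8)])  # Comment
```

The value-based membership test is the main way this differs from a stricter enumeration type, so it is worth keeping in mind when validating input against nodeTypes.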

It occurs to me that I don't really have a unit-testing strategy or policy for module-level constants, but frankly I'm not sure that one is really needed, at least not yet. I say "not yet" because, at some level, there simply has to be some trust in the underlying language. Even with nodeTypes being a non-simple value, it's still a value that is tightly tied to core language structures and functionality, and it shouldn't be possible to break that without altering the code itself.

There's been a fair chunk of analysis in this post, but some code too, and the next logical piece to work out would probably push the length of this post past where I'd like, so I'm going to stop here for now. The next few items that I'm going to tackle include the BaseNode and HasTextContent abstract classes, I think. After that, I'll have enough of the foundational abstraction written to be able to take a swing at the CDATA, Comment and Text concrete classes. I promised to include the analysis JavaScript-page, though, so here's that
