I Dream In Code: Generating and Parsing Markup in Python [1]

The first thing that I'm going to do in building out the markup module's class-structure is to figure out where all of the various members of those classes originate, and at what point they are concrete. One of my priorities, as mentioned before, is to keep the classes and their members in the markup module as similar as possible to the equivalent DOM objects in typical JavaScript implementations on the client side.

Conforming to DOM Conventions

I can't really meet that goal of conforming to the interfaces of DOM elements (tags, text-nodes, comments and CDATA sections) until I know what members they expose in a browser context. To determine that, I wrote a chunk of JavaScript living in a bare-bones HTML page (download below) that iterates over the list of properties and methods listed on the w3schools.com site, checks an instance of each node-type (except CDATA sections, more on that in a bit) for each property- and method-member that might be available, and reports what came back from that check. If a given node-type did not report that it had the member, then the equivalent class-member in the markup module can be skipped. If the check returned an expected type, like a function for a method, that member should be kept. Anything else that came back will require some additional discovery.
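The three-way decision described above can be sketched in Python. This is only an illustration of the rule, not the analysis-script itself (which is JavaScript); the function name classify_member and the 'skip' / 'keep' / 'investigate' labels are my own placeholders, and 'n/a' is assumed to be what the scan reports for a missing member, as in the table that follows.

```python
def classify_member(reported_type, expected_type):
    """Decide what to do with one member on one node-type.

    reported_type is what the scan returned for the member ('n/a' when the
    node-type does not have it); expected_type is what a member of that kind
    should report, e.g. 'function' for a method.
    """
    if reported_type == 'n/a':
        # The node-type does not expose the member at all
        return 'skip'
    if reported_type == expected_type:
        # The member exists and looks the way it should
        return 'keep'
    # Present, but not what was expected (a null property, for example)
    return 'investigate'


print(classify_member('function', 'function'))
print(classify_member('n/a', 'string'))
print(classify_member('null', 'object'))
```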

I'd originally included CDATA sections in my collection of objects to examine, but the browser that I ran the page against (Chromium) wouldn't actually allow the creation of a CDATA section, even though it has a document.createCDATASection method. According to the error-message I got back, creation of CDATA sections is not supported for HTML documents. The closest to an actual CDATA that I could get was a comment that contained all of the CDATA's original content, plus the [CDATA[ start and ]] end text. As a result, I don't really know what a CDATA's members look like without doing more digging around. I'm willing to leave that be for now, though: the Comment, Tag and Text classes will likely suffice for my needs for the time being.

The breakdown I got back from that analysis-script was:

	`markup` Module Equivalent Class
Member Name	Comment	Tag	Text
Property Members
accessKey	n/a	string	n/a
attributes	n/a	object	n/a
childElementCount	n/a	number	n/a
childNodes	object	object	object
children	n/a	object	n/a
classList	n/a	object	n/a
className	n/a	string	n/a
clientHeight	n/a	number	n/a
clientLeft	n/a	number	n/a
clientTop	n/a	number	n/a
clientWidth	n/a	number	n/a
contentEditable	n/a	string	n/a
dir	n/a	string	n/a
firstChild	null	object	null
firstElementChild	n/a	null	n/a
id	n/a	string	n/a
innerHTML	n/a	string	n/a
isContentEditable	n/a	boolean	n/a
lang	n/a	string	n/a
lastChild	null	object	null
lastElementChild	n/a	null	n/a
namespaceURI	n/a	string	n/a
nextElementSibling	null	null	null
nextSibling	null	null	null
nodeName	string	string	string
nodeType	number	number	number
nodeValue	string	null	string
offsetHeight	n/a	number	n/a
offsetLeft	n/a	number	n/a
offsetParent	n/a	null	n/a
offsetTop	n/a	number	n/a
offsetWidth	n/a	number	n/a
ownerDocument	object	object	object
parentElement	null	null	object
parentNode	null	null	object
previousElementSibling	null	null	null
previousSibling	null	null	null
scrollHeight	n/a	number	n/a
scrollLeft	n/a	number	n/a
scrollTop	n/a	number	n/a
scrollWidth	n/a	number	n/a
style	n/a	object	n/a
tabIndex	n/a	number	n/a
tagName	n/a	string	n/a
textContent	string	string	string
title	n/a	string	n/a
Method Members
addEventListener	function	function	function
appendChild	function	function	function
blur	n/a	function	n/a
click	n/a	function	n/a
cloneNode	function	function	function
compareDocumentPosition	function	function	function
contains	function	function	function
focus	n/a	function	n/a
getAttribute	n/a	function	n/a
getAttributeNode	n/a	function	n/a
getElementsByClassName	n/a	function	n/a
getElementsByTagName	n/a	function	n/a
getFeature	n/a	n/a	n/a
hasAttribute	n/a	function	n/a
hasAttributes	n/a	function	n/a
hasChildNodes	function	function	function
insertBefore	function	function	function
isDefaultNamespace	function	function	function
isEqualNode	function	function	function
isSameNode	function	function	function
isSupported	n/a	n/a	n/a
nodelist.item	n/a	n/a	n/a
normalize	function	function	function
querySelector	n/a	function	n/a
querySelectorAll	n/a	function	n/a
removeAttribute	n/a	function	n/a
removeAttributeNode	n/a	function	n/a
removeChild	function	function	function
removeEventListener	function	function	function
replaceChild	function	function	function
scrollIntoView	n/a	function	n/a
setAttribute	n/a	function	n/a
setAttributeNode	n/a	function	n/a
toString	function	function	function

This gives me enough information to at least start making decisions about where various member-properties and -methods need to be defined, and how. Given the class-relationships already defined:

Any keeper item from the table above that exists in all the class-types should be required by the IsNode interface, at least as a default consideration. The same consideration should also be given to any items that return the same values across all the class-types, even if they haven't been flagged as a keeper. The logic behind that statement boils down to the fact that while I checked each node-type in the original JavaScript script, I did not populate a large-enough node- and element-sample in that script to feel confident that I captured every valid low-level member. If possible, those same items should also have a concrete implementation in the BaseNode abstract class. There will probably be a few items that, even though they fall into that category, just don't make sense in those locations, but I'll note those as I go along.
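As a rough sketch of that triage, here is one way the scan results could be sorted into buckets. The SCAN_SAMPLE dictionary is a hand-copied subset of the table above, and the triage function and its three bucket names are hypothetical. This is one reading of the rule, so it is not guaranteed to reproduce the shortlist exactly.

```python
# (Comment, Tag, Text) results for a few members, copied from the table
SCAN_SAMPLE = {
    'accessKey': ('n/a', 'string', 'n/a'),
    'childNodes': ('object', 'object', 'object'),
    'nodeName': ('string', 'string', 'string'),
    'nodeValue': ('string', 'null', 'string'),
    'cloneNode': ('function', 'function', 'function'),
    'getAttribute': ('n/a', 'function', 'n/a'),
}


def triage(scan):
    """Split member-names into IsNode candidates, items needing review,
    and items that are not shared by every node-type."""
    candidates, review, not_shared = [], [], []
    for name in sorted(scan):
        results = scan[name]
        if 'n/a' in results:
            # Missing from at least one node-type
            not_shared.append(name)
        elif len(set(results)) == 1:
            # Same result for Comment, Tag and Text
            candidates.append(name)
        else:
            # Present everywhere, but with differing results
            review.append(name)
    return candidates, review, not_shared


candidates, review, not_shared = triage(SCAN_SAMPLE)
print(candidates)
print(review)
print(not_shared)
```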

Defining the `IsNode` interface

Starting, then, with the items in the table that are keepers, or that returned identical values across all the different node-types, the following are either directly valid or need to be looked at in more detail for requirement in IsNode:

	`markup` Module Equivalent Class
Member Name	Comment	Tag	Text
Property Members
childNodes	object	object	object
nextElementSibling	null	null	null
nextSibling	null	null	null
nodeName	string	string	string
nodeType	number	number	number
nodeValue	string	null	string
parentElement	null	null	object
parentNode	null	null	object
previousElementSibling	null	null	null
previousSibling	null	null	null
textContent	string	string	string
Method Members
addEventListener	function	function	function
appendChild	function	function	function
cloneNode	function	function	function
compareDocumentPosition	function	function	function
contains	function	function	function
hasChildNodes	function	function	function
insertBefore	function	function	function
isDefaultNamespace	function	function	function
isEqualNode	function	function	function
isSameNode	function	function	function
normalize	function	function	function
removeChild	function	function	function
removeEventListener	function	function	function
replaceChild	function	function	function
toString	function	function	function

While I was stripping down the list, I noticed that parentElement and parentNode didn't get flagged in such a way to be considered for inclusion in IsNode, but it's a basic fact of markup-languages that all nodes should have those properties — if they aren't populated, that simply means that the node doesn't have a parent currently, but they might well later after some manipulation. The nodeValue property

Looking over that list of remaining members, there are a few that already don't make any sense to include in IsNode:

- Any members that involve child nodes: Those are aspects of a Tag, certainly, but since Comment and Text will also derive from IsNode and they don't have child nodes (and can't?), those should go away. That removes:
  - The childNodes property;
  - The appendChild method;
  - The hasChildNodes method;
  - The insertBefore method;
  - The removeChild method; and
  - The replaceChild method.
- Any members relating to manipulation of event-listeners: On the server side, where all of the markup module's functionality is actually running, there is no browser context available, so no event-handling processes, and so none of these members are useful. That removes:
  - The addEventListener method; and
  - The removeEventListener method.
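The same pruning can be expressed as a small filter. The member-names below are copied from the shortlist table; the variable names (SHORTLIST, CHILD_MEMBERS, EVENT_MEMBERS, remaining) are just placeholders for this sketch.

```python
# Members from the IsNode shortlist table, properties first, then methods
SHORTLIST = [
    'childNodes', 'nextElementSibling', 'nextSibling', 'nodeName',
    'nodeType', 'nodeValue', 'parentElement', 'parentNode',
    'previousElementSibling', 'previousSibling', 'textContent',
    'addEventListener', 'appendChild', 'cloneNode',
    'compareDocumentPosition', 'contains', 'hasChildNodes', 'insertBefore',
    'isDefaultNamespace', 'isEqualNode', 'isSameNode', 'normalize',
    'removeChild', 'removeEventListener', 'replaceChild', 'toString',
]

# Members that only make sense for nodes that can have children
CHILD_MEMBERS = {
    'childNodes', 'appendChild', 'hasChildNodes', 'insertBefore',
    'removeChild', 'replaceChild',
}

# Members that need a browser's event-handling machinery
EVENT_MEMBERS = {'addEventListener', 'removeEventListener'}

excluded = CHILD_MEMBERS | EVENT_MEMBERS
remaining = [name for name in SHORTLIST if name not in excluded]

for name in remaining:
    print(name)
```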

The rest will need to be examined in more detail, one by one, so let me just jump into that now...

Implementing and Testing the Abstract Properties

Since IsNode is only an interface, there are no concrete implementations of properties to define, only abstract property requirements that will be picked up by derived classes. That makes the definition of those property requirements very simple, and the testing of them pretty straightforward. The real trick is determining where the concrete implementations of them are going to occur. Going through the list of properties:

nextElementSibling: Returns the next element at the same node tree level — w3schools; Abstract property in IsNode; Implement in BaseNode
nextSibling: Returns the next node at the same node tree level — w3schools; Abstract property in IsNode; Implement in BaseNode
nodeName: Returns the name of a node — w3schools; Returns the tag-name for Tags, and magic-string constants for other node-types (#comment for a Comment, #document for a document, #text for a Text object, and #cdata-section for a CDATA section); Abstract property in IsNode; Implement in CDATA, Comment, Tag and Text classes
nodeType: Returns the node type of a node — w3schools; Returns 8 for Comments, 4 for CDATAs, 1 for Tags and 3 for Texts; Abstract property in IsNode; Implement in CDATA, Comment, Tag and Text classes
nodeValue: Sets or returns the value of a node — w3schools; It appears that this property returns the first text-node child of an element, rather than the entire set of text-node values, at least in Chromium. At any rate, it's dependent on the presence of child nodes, so...; Skip; Implement in Tag
parentElement: Returns the parent element node of an element — w3schools; Abstract property in IsNode; Implement in BaseNode
parentNode: Returns the parent node of an element — w3schools; Abstract property in IsNode; Implement in BaseNode
previousElementSibling: Returns the previous element at the same node tree level — w3schools; Abstract property in IsNode; Implement in BaseNode
previousSibling: Returns the previous node at the same node tree level — w3schools; Abstract property in IsNode; Implement in BaseNode
textContent: Sets or returns the textual content of a node and its descendants — w3schools; The return value is, essentially, a concatenation of all child Text nodes in a Tag, or the data value (the content) of a Comment or Text instance. If CDATA is assumed to behave like a Comment, then it would also return the inner content of the instance.; Abstract property in IsNode; Implement in HasTextData and Tag
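To make the "implement in BaseNode" versus "implement in the concrete classes" split a little more tangible, here is a minimal sketch of how that layering might look. It is not the markup module's code: the _parent attribute, the _sibling helper and the stripped-down Tag and Text classes are assumptions made purely for illustration, and IsNode is left out to keep the example short.

```python
class BaseNode(object):
    """Hypothetical stand-in: parent- and sibling-navigation live here."""

    def __init__(self):
        self._parent = None  # assumption: set by the parent on append

    @property
    def parentNode(self):
        return self._parent

    def _sibling(self, offset):
        # Siblings are found by position in the parent's child list
        if self._parent is None:
            return None
        siblings = self._parent.childNodes
        index = siblings.index(self) + offset
        if 0 <= index < len(siblings):
            return siblings[index]
        return None

    @property
    def nextSibling(self):
        return self._sibling(1)

    @property
    def previousSibling(self):
        return self._sibling(-1)


class Text(BaseNode):
    """nodeName, nodeType and textContent are node-type specific."""
    nodeName = '#text'
    nodeType = 3

    def __init__(self, data):
        BaseNode.__init__(self)
        self.data = data

    @property
    def textContent(self):
        return self.data


class Tag(BaseNode):
    nodeType = 1

    def __init__(self, tagName):
        BaseNode.__init__(self)
        self.nodeName = tagName
        self.childNodes = []

    def appendChild(self, child):
        child._parent = self
        self.childNodes.append(child)
        return child

    @property
    def textContent(self):
        # Concatenate the text of all children, in document order
        return ''.join(child.textContent for child in self.childNodes)


p = Tag('p')
first = p.appendChild(Text('ook'))
second = p.appendChild(Text('eek'))
print(first.nextSibling is second)
print(second.nextSibling)
print(p.textContent)
```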

The abstraction of these properties in IsNode is just a few lines of code:

#-----------------------------------#
# Abstract Properties               #
#-----------------------------------#

nextElementSibling = abc.abstractproperty()
nextSibling = abc.abstractproperty()
nodeName = abc.abstractproperty()
nodeType = abc.abstractproperty()
parentElement = abc.abstractproperty()
parentNode = abc.abstractproperty()
previousElementSibling = abc.abstractproperty()
previousSibling = abc.abstractproperty()
textContent = abc.abstractproperty()

The test-methods for each property will follow this pattern:

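def testPROPERTYNAME(self):
    """Unit-tests the PROPERTYNAME property of an IsNode instance."""
    try:
        testInstance = markup.IsNode()
    except TypeError as error:
        # The abstract member's name should appear in the error-message
        actual = 'PROPERTYNAME' in str( error )
        self.assertTrue( actual, 'The TypeError raised by trying to '
            'instantiate IsNode should include the "PROPERTYNAME" '
            'abstract property-name' )
    except Exception as error:
        self.fail( 'testPROPERTYNAME expected a TypeError, '
            'but %s was raised instead:\n  - %s' % ( 
                error.__class__.__name__, error
            )
        )
    else:
        # No exception at all means IsNode is not abstract any more
        self.fail( 'testPROPERTYNAME expected a TypeError, but IsNode '
            'was instantiated without raising an error' )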

In a nutshell, this checks that the name of the abstract property appears in the TypeError that is raised by trying to instantiate IsNode, which confirms that the property being tested is abstract.
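For comparison, here is a self-contained version of the same check that can be run on its own. It is written for Python 3 (abc.ABC and stacked @property / @abc.abstractmethod decorators), uses a two-property stand-in for IsNode rather than the real markup.IsNode, and leans on assertRaises, which also fails the test if no error is raised at all. The IsNodeTests class and its assertAbstract helper are hypothetical names.

```python
import abc
import unittest


class IsNode(abc.ABC):
    """Stand-in interface with just two of the abstract properties."""

    @property
    @abc.abstractmethod
    def nodeName(self):
        """The name of the node."""

    @property
    @abc.abstractmethod
    def nodeType(self):
        """The numeric type of the node."""


class IsNodeTests(unittest.TestCase):

    def assertAbstract(self, member_name):
        # Instantiating an abstract class raises a TypeError whose
        # message lists the abstract members that are still missing
        with self.assertRaises(TypeError) as context:
            IsNode()
        self.assertIn(member_name, str(context.exception))

    def testnodeName(self):
        self.assertAbstract('nodeName')

    def testnodeType(self):
        self.assertAbstract('nodeType')


suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsNodeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```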

Implementing and Testing the Abstract Methods

The same basic rule, that member-definitions need only exist in the IsNode interface, applies to the method members as well. The main decision that needs to be made is also similar: where does a given method-requirement and -definition belong? Working through that yields a list similar to the one for the properties above:

cloneNode

Clones an element — w3schools

Since this is capable of making shallow or deep copies, and the mechanism for making those copies will vary, it'll have to be implemented in the concrete classes.

Abstract method in IsNode

Implement in CDATA, Comment, Tag and Text

compareDocumentPosition

Compares the document position of two elements — w3schools

The description of the method on the w3schools site, frankly, has me wondering if there's even any point to implementing this on the server side. I've never seen this method used in the wild, though that doesn't mean that it isn't used. I can't think of a use-case for it that isn't better served (at least on the server side) by local Python code, particularly since all the real method returns is a bit-mask number-value that indicates relative position between the owner element and the element provided.

Skip

contains

Returns true if a node is a descendant of a node, otherwise false — w3schools

The contains method applies only to objects that have children, really. That hasn't stopped it from being callable on DOM nodes where it doesn't really make sense, though. For example, executing this JavaScript:

ook = document.createTextNode( 'ook' );
eek = document.createTextNode( 'eek' );
ook.contains( eek );

in several browsers yields

false

That result kind of makes sense — neither of the created text-nodes is a parent of the other, nor can either be appended to the other (calling ook.appendChild( eek ) throws an error).

I'm going to skip this method for now, but there's some discussion around that decision that I'll dig into shortly.
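For reference, the browser behavior shown above can be mimicked without any child-node bookkeeping by walking up the chain of parents. The Node class below is a hypothetical, minimal stand-in that only tracks its parent. Note that under this rule a node contains itself, which matches the DOM's definition of contains as an inclusive-descendant check.

```python
class Node(object):
    """Hypothetical minimal node: it only knows its parent."""

    def __init__(self, parentNode=None):
        self.parentNode = parentNode

    def contains(self, other):
        # Walk upward from the other node; if this node turns up along
        # the way, the other node is this node or one of its descendants
        current = other
        while current is not None:
            if current is self:
                return True
            current = current.parentNode
        return False


div = Node()
ook = Node(parentNode=div)
eek = Node()

print(ook.contains(eek))  # False: unrelated nodes
print(div.contains(ook))  # True: ook is a child of div
print(ook.contains(ook))  # True: a node contains itself
```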

isDefaultNamespace

Returns true if a specified namespaceURI is the default, otherwise false — w3schools

Text-nodes don't have a namespace — it's not a defined member of that node-type at all. Nor do comments, and I presume that the same would hold true for CDATA sections.

Skip

Implement in Tag

isEqualNode

Checks if two elements are equal — w3schools

The complete criteria for testing equality on the client side are listed at the w3schools link above, but since those criteria are dependent on properties that won't exist across all IsNode instances, the usefulness of that list is, perhaps, questionable. Still, being able to perform a comparison is useful. The real question, then, is how that is going to be done. I'll work out more details on that later, but for now:

Abstract method in IsNode

Implement in BaseNode

isSameNode

Checks if two elements are the same node — w3schools

Abstract method in IsNode

Implement in BaseNode
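The difference between the two comparisons is easy to show in Python: isSameNode is an identity check, while isEqualNode compares content. The sketch below is only a guess at the shape such methods might take, since the real comparison criteria are still to be worked out; the Text class here is a stand-in, and comparing nodeType plus a data attribute is an assumption.

```python
class Text(object):
    """Stand-in text-node with just enough state to compare."""
    nodeType = 3

    def __init__(self, data):
        self.data = data

    def isSameNode(self, other):
        # Same node means the very same object
        return self is other

    def isEqualNode(self, other):
        # Equal nodes are of the same type and carry the same content
        return (
            other is not None
            and self.nodeType == getattr(other, 'nodeType', None)
            and self.data == getattr(other, 'data', None)
        )


ook = Text('ook')
twin = Text('ook')
eek = Text('eek')

print(ook.isSameNode(ook))    # True
print(ook.isSameNode(twin))   # False: equal content, different objects
print(ook.isEqualNode(twin))  # True
print(ook.isEqualNode(eek))   # False
```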

normalize

Joins adjacent text nodes and removes empty text nodes in an element — w3schools

This feels like it's something that shouldn't exist outside of a Tag, and that seems to be borne out by the fact that it's not possible to usefully call normalize on a text- or comment-node in the browser.

Skip

Implement in Tag
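What normalize does to a tag's children can be sketched with a plain list. In this illustration, Python strings stand in for Text nodes and anything else stands in for a non-text child; the normalize_children function is a hypothetical helper, not part of the markup module.

```python
def normalize_children(children):
    """Return a copy of children with adjacent text joined and empty
    text removed. Strings stand in for Text nodes in this sketch."""
    normalized = []
    for child in children:
        if isinstance(child, str):
            if not child:
                # Drop empty text
                continue
            if normalized and isinstance(normalized[-1], str):
                # Merge with the text immediately before it
                normalized[-1] += child
                continue
        normalized.append(child)
    return normalized


br = ('br',)  # placeholder for a non-text child
print(normalize_children(['ook', '', 'eek', br, 'a', 'b']))
```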

toString

Converts an element to a string — w3schools

Abstract method in IsNode

Implement in CDATA, Comment, Tag and Text

The contains discussion

Also like the abstract-property definitions, abstract methods don't need much in IsNode:

    @abc.abstractmethod
    def METHODNAME( self, arg1, arg2=None, *args, **kwargs ):
        raise NotImplementedError( '%s.METHODNAME is not implemented as '
            'required by IsNode' % self.__class__.__name__ )
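It may not be obvious when that NotImplementedError could ever fire, since a class with an abstract method can't be instantiated in the first place. It is reachable when a derived class overrides the method but simply delegates back to the interface. A small Python 3 sketch, with a stand-in IsNode and a hypothetical Lazy subclass:

```python
import abc


class IsNode(abc.ABC):
    """Stand-in interface with a single abstract method."""

    @abc.abstractmethod
    def cloneNode(self, deep=False):
        raise NotImplementedError('%s.cloneNode is not implemented as '
            'required by IsNode' % self.__class__.__name__)


class Lazy(IsNode):
    """Overrides cloneNode, so it can be instantiated, but does no work."""

    def cloneNode(self, deep=False):
        return super().cloneNode(deep)


try:
    Lazy().cloneNode()
except NotImplementedError as error:
    message = str(error)

print(message)  # Lazy.cloneNode is not implemented as required by IsNode
```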

And the unit-tests, since they're really just checking the same sort of relationship between methods and the IsNode interface-class as the property-tests did, are almost identical:

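def testMETHODNAME(self):
    """Unit-tests the METHODNAME method of an IsNode instance."""
    try:
        testInstance = markup.IsNode()
    except TypeError as error:
        # The abstract member's name should appear in the error-message
        actual = 'METHODNAME' in str( error )
        self.assertTrue( actual, 'The TypeError raised by trying to '
            'instantiate IsNode should include the "METHODNAME" '
            'abstract method-name' )
    except Exception as error:
        self.fail( 'testMETHODNAME expected a TypeError, '
            'but %s was raised instead:\n  - %s' % ( 
                error.__class__.__name__, error
            )
        )
    else:
        # No exception at all means IsNode is not abstract any more
        self.fail( 'testMETHODNAME expected a TypeError, but IsNode '
            'was instantiated without raising an error' )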

With those tests in place for IsNode in the test_markup.py unit-test module, the test-results come back clean:

########################################
Unit-test Results: idic.markup
#--------------------------------------#
Tests were SUCCESSFUL
Number of tests run ... 18
Tests ran in .......... 0.001 seconds
########################################

IsNode, then, is done — written and tested.

Dealing with Enumerations in Python

Python 2 doesn't really have a formal enumeration-type in its standard library like several other languages do (the enum module only arrived in Python 3.4), but there are a number of ways to work around that. My personal favorite uses namedtuple from the collections module, based on some observations I've made about how an enumeration behaves:

An enumeration is a constant;
An enumeration is immutable — its values cannot be changed at run-time;
An enumeration's members are individually accessible by name; and
An enumeration is a container, with members that can be used for comparison purposes. That is, given an enumeration of nodeTypes, with presumably-distinct CDATA, Comment, Tag and Text values:
```
nodeTypes.Tag in nodeTypes          # == True
nodeTypes.Text in nodeTypes         # == True
nodeTypes.CDATASection in nodeTypes # == True
nodeTypes.Comment in nodeTypes      # == True
```

There are probably a few more aspects to the behavior of an enumeration, but those four are the main ones, at least that I can think of at this point.

Using a namedtuple, it's actually pretty easy to generate a constant value that exhibits all of those behaviors. The basic code required looks something like this, using the nodeType values from the w3schools site and generating an enumeration-equivalent named nodeTypes that could be added to the markup module:

from collections import namedtuple

nodeTypes = namedtuple(
    'enumNodeTypes', 
    [ 'Tag', 'Text', 'CDATASection', 'Comment' ],
    )(
        Tag=1,
        Text=3,
        CDATASection=4,
        Comment=8,
    )

__all__.append( 'nodeTypes' )

nodeTypes is a constant because namedtuple returns a class, and the code then creates an instance of that class;
It's immutable because it's not possible to add values to, remove values from, or alter existing values of the named items except by altering the definition of those members in the code;
Its members are individually accessible by name because that's a basic capability of a namedtuple-generated class; and
It's a container that allows the use of someValue in nodeTypes.

That last item is best explained with a quick demonstration: Printing nodeTypes and all of the nodeTypes.NAME in nodeTypes examples in the container criteria above yields:

All entries in nodeTypes
  + enumNodeTypes( Tag=1, Text=3, CDATASection=4, Comment=8 )
nodeTypes.Tag in nodeTypes ............ True
nodeTypes.Text in nodeTypes ........... True
nodeTypes.CDATASection in nodeTypes ... True
nodeTypes.Comment in nodeTypes ........ True
nodeTypes.CDATASection in nodeTypes ... True
nodeTypes.Comment in nodeTypes ........ True
12 in nodeTypes ....................... False
"ook" in nodeTypes .................... False

So, while this approach may not be a real enumeration, it provides all of the functionality of one that I think I'll need.
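A few of those behaviors are easy to poke at directly. The sketch below rebuilds the same nodeTypes value and then tries to reassign a member, looks a name up by its value, and runs a couple of membership checks. The names_by_value dictionary is my own addition for illustration. One caveat worth knowing: because the members are plain integers, membership is by value, so True in nodeTypes is also true (True == 1 in Python).

```python
from collections import namedtuple

nodeTypes = namedtuple(
    'enumNodeTypes',
    ['Tag', 'Text', 'CDATASection', 'Comment'],
)(
    Tag=1,
    Text=3,
    CDATASection=4,
    Comment=8,
)

# Members cannot be reassigned at run-time
try:
    nodeTypes.Tag = 2
    reassigned = True
except AttributeError:
    reassigned = False
print(reassigned)  # False

# Reverse lookup: from a nodeType value back to its name
names_by_value = dict(zip(nodeTypes, nodeTypes._fields))
print(names_by_value[8])  # Comment

# Membership is by value, not by name
print(3 in nodeTypes)     # True
print(True in nodeTypes)  # True, because True == 1
```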

It occurs to me that I don't really have a unit-testing strategy or policy for module-level constants, but frankly I'm not sure that one is really needed, at least not yet. I say "not yet" because at some level, there simply has to be some trust in the underlying language. Even with nodeTypes being a non-simple value, it's still a value that is tightly tied to core language structures and functionality, and it shouldn't be possible to break that without altering the code itself.

There's been a fair chunk of analysis in this post, but some code too, and the next logical piece to work out would probably push the length of this post past where I'd like, so I'm going to stop here for now. The next few items that I'm going to tackle include the BaseNode and HasTextContent abstract classes, I think, then I'll have enough of the foundational abstraction written to be able to take a swing at the CDATA, Comment and Text concrete classes. I promised to include the analysis JavaScript-page, though, so here's that:

dom-scans.html

6.9kB

Thursday, April 20, 2017
