Tuesday, May 16, 2017

Generating and Parsing Markup in Python [7]

With the Tag class finally implemented (minus the couple of deferred items waiting on MarkupParser), it's time, I think, to work out the conventions for various types of markup documents. The approach that I plan on taking is to define a BaseDocument abstract class that derives from Tag in order to carry the capabilities of Tag through to all documents.

From that point on, it's just a matter of defining concrete document-classes for each of the document-types that I'm expecting to be using:
  • An HTML5Document;
  • An XHTMLDocument; and (probably)
  • An XMLDocument;
I'm not sure how well this strategy would carry over to other languages, though a quick check with PHP suggests that it works there. The following code, at any rate, doesn't raise any errors when executed from the command-line:
<?php

abstract class BaseNode
{
}

class Tag extends BaseNode
{
}

abstract class BaseDocument extends Tag
{
}

class HTML5Document extends BaseDocument
{
}

$doc = new HTML5Document();
?>
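
For comparison, here is a minimal Python sketch of the same hierarchy. The class bodies are placeholders rather than the real implementations from this series, and the class-identity check in __init__ is just one assumed way of keeping BaseDocument from being instantiated directly:

```python
class BaseNode(object):
    """Placeholder for the real BaseNode; members omitted."""


class Tag(BaseNode):
    """Placeholder for the real Tag; only tagName is kept here."""

    def __init__(self, tagName):
        self.tagName = tagName


class BaseDocument(Tag):
    """Carries everything Tag provides through to all documents."""

    def __init__(self, tagName):
        # Treat BaseDocument as abstract: refuse direct instantiation,
        # but allow subclasses to call through to this initializer.
        if self.__class__ == BaseDocument:
            raise NotImplementedError(
                'BaseDocument is intended to be an abstract class'
            )
        Tag.__init__(self, tagName)


class HTML5Document(BaseDocument):
    """A concrete document-class; the root tag name is fixed."""

    def __init__(self):
        BaseDocument.__init__(self, 'html')


doc = HTML5Document()
```

The document-instance is still a Tag, so anything that accepts a Tag should accept a document as well.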

Before I can actually define those concrete document-classes, though, I need to determine what their members are, and how their behavior differs from Tag...

What Are the Differences Between a Document and a Tag?

There are two main areas where documents differ from tag-elements: their object-members, and how they render. Since the three document-types that I'm concerned with have significantly different members (XML won't have head or body properties like an HTML document does, for example, and there are several other properties that fall into a similar classification), the list of members that need to be implemented is actually very short:

The all property
Basically just a call to Tag.getElementsByTagName( '*' ), returning all of the Tag children of the document;
The contentType property
Returns the MIME-type of the document
The doctype property
I haven't been able to pin down exactly what this does on the browser side, but it definitely relates to the <!DOCTYPE html> declaration in an HTML 5 document. My expectation as I'm writing this is that it will return the DOCTYPE for the instance as it would render.
While there are 200-odd more members in an HTML document, these are the only ones that I think might be relevant on the server side that aren't already going to be members of a BaseDocument just because it derives from Tag. A lot of the remainder are properties that have no meaning or use outside a browser context (e.g., the readyState property and createEvent method, as well as all the on...* event-methods). Several of the properties are essentially just wrappers around some variation of Tag.getElementsByTagName for specific tags (images and scripts) or some similar mechanism with different criteria (links, possibly), and I'd probably not implement them (at least not in BaseDocument) even if they weren't document-type specific. I may well add those to specific concrete document-classes down the line, just so they're available, but they don't belong in BaseDocument.

On the rendering side, each of the document-types I'm planning to build out has slightly different output. Optional items are in brackets []:

HTML 5
<!DOCTYPE html>
<html>
    <!-- the document head, body -->
</html>
XHTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html[ xmlns="http://www.w3.org/1999/xhtml"]>
    <!-- the document head, body -->
</html>
(This is XHTML transitional, but there are also official strict and frameset variants, per the w3.org site)
Since it's technically XML, XHTML (of any flavor) can have processing-instructions and any other pre-document-root elements that XML allows.
XML
<?xml[ version="#.#"][ encoding="XXXX"]?>
[<?xml-processing-instruction(s) attributes="allowed"?>]
[<!DOCTYPE root-element[ PUBLIC "PUBLIC identifier"][ "SYSTEM identifier"]>]
<root-element[ attributes]>
    <!-- child elements and content -->
</root-element>
All of these, I think, can live in BaseDocument — the rules for them are of varying complexity, but they all feel like they're achievable at this level.

Looking at DOCTYPE

All three document-types have (or are allowed) some variation of a <!DOCTYPE>. The rules for a DOCTYPE declaration and how it gets rendered are pretty simple:

  • It starts with <!DOCTYPE
  • That's followed by the tagName of the root tag of the document
  • It may have a public identifier (a single-line text-value), in which case that should be rendered and prefixed with PUBLIC
  • It may also have a system identifier (also a single-line text-value):
    • If there is no public identifier, the system identifier should be rendered with SYSTEM as a prefix
    • Otherwise, it can just be rendered, with no prefix
  • It ends with >
  • It's the last thing rendered in output before the start of the Tag-derived structure and output
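
Those rules can be sketched as a small stand-alone function. The name render_doctype and its arguments are made up for illustration and are not part of the markup module; the public and system identifiers are assumed to be already-validated single-line strings, or None:

```python
def render_doctype(root_tag_name, public_id=None, system_id=None):
    """Render a DOCTYPE declaration for the given root tag name."""
    parts = ['<!DOCTYPE', root_tag_name]
    if public_id:
        # A public identifier is always prefixed with PUBLIC...
        parts.append('PUBLIC "%s"' % public_id)
        if system_id:
            # ...and a system identifier after it needs no prefix.
            parts.append('"%s"' % system_id)
    elif system_id:
        # With no public identifier, SYSTEM is the prefix.
        parts.append('SYSTEM "%s"' % system_id)
    return ' '.join(parts) + '>'


html5 = render_doctype('html')
xhtml = render_doctype(
    'html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
)
custom = render_doctype('note', system_id='note.dtd')
```

Here html5 is the bare HTML 5 declaration, xhtml carries both identifiers, and custom (a hypothetical XML document-type) has only a system identifier.
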
There are at least three different ways that a DOCTYPE representation could be implemented that I can think of.

The simplest way is, I think, to just store the applicable public and system identifiers as class-level constants, and render them accordingly in the __str__ and __unicode__ methods of the document-instance. That would work fine for HTML 5 and XHTML document-types, since the public and system identifiers for those document-types shouldn't ever change, really. That also has the advantage (I think) of keeping everything in one class-definition, so there'd be less code to manage. Unfortunately, that starts to fall apart as soon as XML documents enter the picture, unless a distinct document-class is built out for each and every XML document-type. That prospect feels ugly.

A more complicated approach is defining a class to represent a DOCTYPE. So long as an instance of that class has a reference to the document it's associated with, that'd allow it to grab the tagName that it needs at render-time. That would leave only the public- and system-identifier values to add to the __init__, so that they could be passed to the DOCTYPE-representative object during the construction of a document-instance. That doesn't feel horrible, but since it'd require additional arguments, with cryptic values, it feels clumsy. Even if those values were set up as module-level constants (which would help, I think), that's still more stuff that has to be remembered every time a document has to be created.

Still another possibility: A DOCTYPE for any given document-type is almost certainly as distinct as its namespace. If the public- and system-identifier values were attached to each Namespace instance, even if it were done outside the object-construction process, and a document's namespace were required during its construction, then the storage of those identifier-values is in a single place (a Namespace instance), and could be accessed in the __str__ and __unicode__ methods much like they could if they were class constants in the first alternative.

That feels pretty reasonable to me, but I think I'd also want to set up some sort of mechanism that would allow namespaces to be defined outside the actual Python code, possibly by setting up one or many configuration-files that would define names and other relevant properties for any number of namespaces. The trade-off there is that any namespaces defined by that sort of configurable set-up probably couldn't be referred to as module-level constants. As things stand right now, I've defined Namespace-instance constants in the markup module for both HTML 5 and XHTML documents:

# HTML 5 namespace
HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'HTML5Namespace' )

# XHTML namespace
XHTMLNamespace = Namespace(
    'xhtml',
    'http://www.w3.org/1999/xhtml', 
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'XHTMLNamespace' )
Those constants would, I think, have to go away in order to keep access to all available namespaces reasonably consistent.

Or, perhaps not, now that I think on it more. If each application that needs to have one or more namespaces defined actually defines them as constants within that application, then they are accessible as constants within that application's codebase. That might, down the line, require some movement of more general-purpose Namespace definitions into a common location — maybe in markup, maybe elsewhere — but they'd still be accessible the same way that HTML5Namespace and XHTMLNamespace are now.

With all of those options considered, I'm going to take this last approach, I think. It feels reasonable, keeps things relatively well-contained, and doesn't require a huge amount of refactoring of Namespace or a lot of additional code in BaseDocument. As it turns out, a similar approach/solution can be applied to the contentType of BaseDocument — referring to the document's namespace, which in turn has a ContentType property whose value is set during the construction of the instance. The complete changes to Namespace are:

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
def _GetContentType( self ):
    """
Gets the MIME-type associated with documents of the namespace the instance 
represents"""
    return self._contentType

# ...

@describe.AttachDocumentation()
def _GetPublicIdentifier( self ):
    """
Gets the public identifier of the namespace."""
    return self._publicIdentifier

@describe.AttachDocumentation()
def _GetSystemIdentifier( self ):
    """
Gets the system identifier of the namespace."""
    return self._systemIdentifier

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the MIME-Type to set for documents of the namespace the instance '
    'represents',
    str, unicode, None
)
@describe.raises( TypeError, 
    'if passed a value that is not a str, a unicode, or None'
)
@describe.raises( ValueError, 
    'if passed a value that is not a member of %s' % ( 
        sorted( KnownMIMETypes )
    )
)
def _SetContentType( self, value ):
    """
Sets the MIME-Type of the instance"""
    if value is not None and type( value ) not in ( str, unicode ):
        raise TypeError( '%s.ContentType expects a str or unicode value '
            'that is one of the known MIME-types on the system, or None, '
            'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    # None is an allowed value, so only check membership for actual values
    if value is not None and value not in KnownMIMETypes:
        raise ValueError( '%s.ContentType expects a str or unicode value '
            'that is one of the known MIME-types on the system, or None, '
            'but was passed "%s" which could not be found' % ( 
                self.__class__.__name__, value )
            )
    self._contentType = value

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the public identifier of the namespace to set for the instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
def _SetPublicIdentifier( self, value ):
    """
Sets the public-identifier value for the instance"""
    if type( value ) not in ( str, unicode ) and value != None:
        raise TypeError( '%s.PublicIdentifier expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.PublicIdentifier expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
    self._publicIdentifier = value

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the system identifier of the namespace to set for the instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
def _SetSystemIdentifier( self, value ):
    """
Sets the System-identifier value for the instance"""
    if type( value ) not in ( str, unicode ) and value != None:
        raise TypeError( '%s.SystemIdentifier expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.SystemIdentifier expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    self._systemIdentifier = value

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

@describe.AttachDocumentation()
def _DelContentType( self ):
    """
"Deletes" the MIME-type associated with documents of the namespace the instance 
represents by setting it to None"""
    self._contentType = None

# ...

@describe.AttachDocumentation()
def _DelPublicIdentifier( self ):
    """
"Deletes" the public identifier of the namespace by setting it to None."""
    self._publicIdentifier = None

@describe.AttachDocumentation()
def _DelSystemIdentifier( self ):
    """
"Deletes" the system identifier of the namespace by setting it to None."""
    self._systemIdentifier = None

#-----------------------------------#
# Instance Properties               #
#-----------------------------------#

ContentType = describe.makeProperty(
    _GetContentType, None, None, 
    'the MIME-type of the content expected for a document of the '
    'namespace the instance represents',
    str, unicode, None
)

# ...

PublicIdentifier = describe.makeProperty(
    _GetPublicIdentifier, None, None, 
    'the public identifier of the namespace',
    str, unicode, None
)
SystemIdentifier = describe.makeProperty(
    _GetSystemIdentifier, None, None, 
    'the system identifier of the namespace',
    str, unicode, None
)

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()

# ...

@describe.argument( 'contentType', 
    'the MIME-type of the content associated with the instance',
    str, unicode, None
)
@describe.argument( 'publicId', 
    'the public-identifier of the namespace',
    str, unicode, None
)
@describe.argument( 'systemId', 
    'the system-identifier of the namespace',
    str, unicode, None
)

# ...

def __init__( self, name, namespaceURI, contentType, systemId=None, 
    publicId=None, defaultRenderingModel=renderingModels.Mixed, 
    **tagRenderingModels ):
    """
Instance initializer"""

    # ...

    # Set default instance property-values with _Del... methods as needed.
    self._DelContentType()

    # ...

    self._DelPublicIdentifier()
    self._DelSystemIdentifier()
    # Set instance property values from arguments if applicable.
    self._SetContentType( contentType )

    # ...

    self._SetPublicIdentifier( publicId )
    self._SetSystemIdentifier( systemId )
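
The KnownMIMETypes collection that _SetContentType checks against isn't shown in this post. As a rough sketch only, and assuming it is simply a set of MIME-type strings, one way to build something like it is from the standard library's mimetypes module. The check_content_type function is a hypothetical stand-alone version of the same validation, written so that it also runs on Python 3 (where there is no separate unicode type):

```python
import mimetypes

# Assumption: "known" means any type the standard library maps a
# file-extension to. The real KnownMIMETypes may be built differently.
KnownMIMETypes = frozenset(mimetypes.types_map.values())


def check_content_type(value):
    """Return value if it is None or a known MIME-type, else raise."""
    if value is not None and not isinstance(value, str):
        raise TypeError(
            'expected a str or None, but was passed "%s" (%s)' % (
                value, type(value).__name__)
        )
    if value is not None and value not in KnownMIMETypes:
        raise ValueError(
            '"%s" could not be found in the known MIME-types' % value
        )
    return value
```

With that in place, check_content_type( 'text/html' ) and check_content_type( None ) both pass, while an unrecognized string raises a ValueError and a non-string raises a TypeError.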
The HTML5Namespace- and XHTMLNamespace-constants change slightly, to:
#-----------------------------------#
# Default Namespace constants       #
# provided by the module.           #
#-----------------------------------#

# HTML 5 namespace
HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    'text/html',
    None,
    None,
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'HTML5Namespace' )

# XHTML namespace
XHTMLNamespace = Namespace(
    'xhtml',
    'http://www.w3.org/1999/xhtml', 
    'text/html',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'XHTMLNamespace' )
Finally, unit-tests get updated and run:
########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 265
 + Tests ran in ........... 0.17 seconds
Number of errors .......... 0
Number of failures ........ 1
Number of tests skipped ... 107
########################################
FAILURES
#--------------------------------------#
testCodeCoverage (__main__.testmarkupCodeCoverage)
AssertionError: 
    Unit-testing policies require test-cases for all classes 
    and functions in the idic.markup module, but the following 
    have not been defined:
        (testBaseDocument)
And that, I believe, provides everything needed to implement the doctype property in BaseDocument, which could then also be used in its __str__ and __unicode__ methods to render it if/as needed.
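
As a rough illustration of how that could hang together, here is a stand-alone sketch. FakeNamespace and SketchDocument are stand-ins invented for this example (the real Namespace and BaseDocument have many more members); the only things assumed are that a namespace exposes ContentType, PublicIdentifier and SystemIdentifier values, and that a document knows its tagName and namespace:

```python
from collections import namedtuple

# Stand-in for Namespace: just the three values the document needs.
FakeNamespace = namedtuple(
    'FakeNamespace', 'ContentType PublicIdentifier SystemIdentifier'
)


class SketchDocument(object):
    """Stand-in document that defers to its namespace."""

    def __init__(self, tagName, namespace):
        self.tagName = tagName
        self.namespace = namespace

    @property
    def contentType(self):
        # The MIME-type lives on the namespace, not on the document
        return self.namespace.ContentType

    @property
    def doctype(self):
        # Render the DOCTYPE from the namespace's identifiers
        ns = self.namespace
        result = '<!DOCTYPE %s' % self.tagName
        if ns.PublicIdentifier:
            result += ' PUBLIC "%s"' % ns.PublicIdentifier
            if ns.SystemIdentifier:
                result += ' "%s"' % ns.SystemIdentifier
        elif ns.SystemIdentifier:
            result += ' SYSTEM "%s"' % ns.SystemIdentifier
        return result + '>'


html5 = SketchDocument('html', FakeNamespace('text/html', None, None))
xhtml = SketchDocument('html', FakeNamespace(
    'text/html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
))
```

The point of the design is that nothing about a particular document-type has to be stored on the document itself: swapping the namespace changes both contentType and doctype.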

Looking at the XML Headers

Apart from their final output, both the initial XML declaration and any XML processing-instruction that I've run across look, structurally, like they could be represented by a Tag-variant: They have a name, and can have attributes. That does not include any inline DTD specifications (see the An Internal DTD Declaration section here for an example of this), but I don't honestly expect that providing an inline DTD is something that will be needed, so I'm not going to worry too much about that, at least for the time being.

As a result, my first thought with regards to implementing those is to generate a Tag subclass, possibly as an inline/nested class in BaseDocument itself, that overrides the __str__ and __unicode__ methods to generate the right output. That would then allow a document-level property (call it XMLDeclaration) to provide the initial XML declaration, and a collection of those tag-types (XMLProcessingInstructions, as an ElementList) to represent any of the XML processing-instructions for a document-instance. That implementation looks like this:


@describe.InitClass()
class BaseDocument( Tag, object ):

    # ...

    #-----------------------------------#
    # Inline class definitions          #
    #-----------------------------------#

    @describe.InitClass()
    class XMLTag( Tag, object ):

        # ...

        #-----------------------------------#
        # Instance Initializer              #
        #-----------------------------------#
        @describe.AttachDocumentation()
        @describe.argument( 'tagName',
            'the tag-name to set in this created instance',
            str, unicode
        )
        @describe.keywordargs( 
            'the attribute names/values to set in the created instance'
        )
        def __init__( self, tagName, **attributes ):
            """
Instance initializer"""
            Tag.__init__( self, tagName, **attributes )

        # ...

        #-----------------------------------#
        # Instance Methods                  #
        #-----------------------------------#

        @describe.AttachDocumentation()
        def __str__( self ):
            """
Returns a string representation of the instance"""
            try:
                result = '<?%s' % ( self.tagName )
                for name in self.attributes:
                    result += ' %s="%s"' % ( name, self.attributes[ name ] )
                result += '?>'
                return result
            except ( UnicodeDecodeError, UnicodeEncodeError, UnicodeError ):
                return self.__unicode__()

        @describe.AttachDocumentation()
        def __unicode__( self ):
            """
Returns a unicode representation of the instance"""
            result = u'<?%s' % ( self.tagName )
            for name in self.attributes:
                result += u' %s="%s"' % ( name, self.attributes[ name ] )
            result += u'?>'
            return result

        # ...

    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    # ...
With that class available, the two BaseDocument properties noted above can be implemented, making them available for use in the rendering processes of BaseDocument.__str__, and BaseDocument.__unicode__. Some helper-methods defined in BaseDocument, to set or add items to those properties, will also need to be created, but they feel pretty simple:
SetXMLVersion( version ):
Sets the "version" attribute of the instance's XMLDeclaration, creating it in the process if necessary
SetXMLEncoding( encoding ):
Sets the "encoding" attribute of the instance's XMLDeclaration, creating it in the process if necessary
CreateXMLInstruction( name, **attributes ):
Creates and adds an XML processing-instruction to the instance's XMLProcessingInstructions collection
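
To make the intended output concrete, here is a small stand-alone sketch of the rendering those helpers are meant to feed. PITag is a stand-in invented for this example, not the real XMLTag; it keeps only a tag name and an ordered set of attributes, and the stylesheet file name is made up:

```python
from collections import OrderedDict


class PITag(object):
    """Stand-in for a tag that renders as <?name attr="value"?>."""

    def __init__(self, tagName):
        self.tagName = tagName
        # Ordered, so attributes render in the order they were set
        self.attributes = OrderedDict()

    def setAttribute(self, name, value):
        self.attributes[name] = value

    def __str__(self):
        result = '<?%s' % self.tagName
        for name in self.attributes:
            result += ' %s="%s"' % (name, self.attributes[name])
        return result + '?>'


# What SetXMLVersion and SetXMLEncoding would build up:
declaration = PITag('xml')
declaration.setAttribute('version', '1.0')
declaration.setAttribute('encoding', 'UTF-8')

# What CreateXMLInstruction would append to the collection:
stylesheet = PITag('xml-stylesheet')
stylesheet.setAttribute('type', 'text/xsl')
stylesheet.setAttribute('href', 'style.xsl')

rendered = [str(declaration), str(stylesheet)]
```

The first item renders as an XML declaration and the second as a processing-instruction, both from the same class.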

The Final Implementation of BaseDocument

There's not a whole lot present in BaseDocument, but there are some significant chunks over and above the properties noted earlier. The implementation of the __init__ method of BaseDocument and the three XML-structure-related methods are pretty simple:

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to set in this created instance',
    str, unicode
)
@describe.argument( 'namespace',
    'the namespace that the instance belongs to',
    Namespace
)
@describe.keywordargs( 
    'the attribute names/values to set in the created instance'
)
def __init__( self, tagName, namespace, **attributes ):
    """
Instance initializer"""
    # BaseDocument is intended to be an abstract class,
    # and is NOT intended to be instantiated. Alter at your own risk!
    if self.__class__ == BaseDocument:
        raise NotImplementedError( 'BaseDocument is '
            'intended to be an abstract class, NOT to be instantiated.' )
    # Call parent initializers, if applicable.
    Tag.__init__( self, tagName, namespace, **attributes )
    # Set default instance property-values with _Del... methods as needed.
    self._DelXMLDeclaration()
    self._DelXMLProcessingInstructions()
    # Set instance property values from arguments if applicable.
    # Other set-up

#-----------------------------------#
# Instance Garbage Collection       #
#-----------------------------------#

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to set in this created instance',
    str, unicode
)
@describe.keywordargs( 
    'the attribute names/values to set in the created instance'
)
def CreateXMLInstruction( self, tagName, **attributes ):
    """
Creates an XMLTag instance with the supplied tag-name and attributes and appends 
it to the instance's XML processing-instructions"""
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLProcessingInstructions.append( 
        BaseDocument.XMLTag( tagName, **attributes )
    )

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the encoding value to set in the instance\'s xml declaration', 
    str, unicode
)
@describe.raises( TypeError, 
    'if passed an encoding value that is not a str or unicode'
)
@describe.raises( ValueError, 
    'if passed an encoding value that is not a single word'
)
def SetXMLEncoding( self, value ):
    """
Sets the "encoding" attribute-value in the instance's XMLDeclaration"""
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.SetXMLEncoding expects a single-word str or '
            'unicode value for its encoding, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if ' ' in value or '\n' in value or '\t' in value or '\r' in value:
        raise ValueError( '%s.SetXMLEncoding expects a single-word str or '
            'unicode value for its encoding, but was passed "%s" which is '
            'invalid' % ( self.__class__.__name__, value )
        )
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLDeclaration.setAttribute( 'encoding', value )

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the version value to set in the instance\'s xml declaration', 
    str, unicode, float, int, long
)
@describe.raises( ValueError, 
    'if passed a version value that is not a float and cannot be converted '
    'to one'
)
def SetXMLVersion( self, value ):
    """
Sets the "version" attribute-value in the instance's XMLDeclaration"""
    if type( value ) != float:
        try:
            checkValue = float( value )
            if checkValue < 1.0:
                raise ValueError
            value = str( checkValue )
        except ( TypeError, ValueError ):
            raise ValueError( '%s.SetXMLVersion expects a float value '
                'greater than or equal to one, or a text or numeric value '
                'that can be converted to one, but was passed '
                '"%s" (%s)' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    else:
        if value < 1.0:
            raise ValueError( '%s.SetXMLVersion expects a float value '
                'greater than or equal to one, or a text or numeric value '
                'that can be converted to one, but was passed '
                '"%s" (%s)' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLDeclaration.setAttribute( 'version', str( value ) )
The __str__ and __unicode__ methods aren't complex either, though they might seem so at first glance. I'm pretty confident that the comments in the code tell the entire story of how they work:
@describe.AttachDocumentation()
def __str__( self ):
    """
Returns a string representation of the instance"""
    # Try rendering the instance as a string:
    try:
        result = ''
        # TODO: Add XML declaration, if applicable 
        if self.XMLDeclaration:
            result += '%s' % self.XMLDeclaration
        # TODO: Add XML processing-instructions, if applicable 
        for instruction in self.XMLProcessingInstructions:
            result += '%s' % instruction
        # TODO: Add DOCTYPE, if applicable
        result += '%s' % self.doctype
        result += '<%s' % ( self.tagName )
        # If the instance has a namespace, render that too
        if self.namespace:
            result += ' xmlns="%s"' % ( self.namespace.namespaceURI )
        # If there are child namespaces that aren't the same as the local 
        # namespace, they need to be included, each under its own prefix:
        for ns in self.childNamespaces:
            if ns != self.namespace:
                result += ' xmlns:%s="%s"' % ( 
                    ns.Name, ns.namespaceURI
                )
        # Since a document is also a tag, it can have attributes, so render 
        # any present, each preceded by a separating space:
        for attr in self.attributes:
            result += ' %s="%s"' % ( 
                attr, self.attributes[ attr ]
            )
        # Close the starting tag
        result += '>'
        # Add Tag.childNodes.__str__ to results
        for child in self.childNodes:
            result += '%s' % child
        # Strip the current results just to keep things clean
        result = result.strip()
        # Add the closing tag
        result += '</%s>' % ( self.tagName )
        # And return it
        return result
    # If string-rendering fails because it needs unicode, return the 
    # unicode representation instead.
    except ( UnicodeDecodeError, UnicodeEncodeError, UnicodeError ):
        return self.__unicode__()

@describe.AttachDocumentation()
def __unicode__( self ):
    """
Returns a unicode representation of the instance"""
    result = u''
    # TODO: Add XML declaration, if applicable 
    if self.XMLDeclaration:
        result += u'%s' % self.XMLDeclaration
    # TODO: Add XML processing-instructions, if applicable 
    for instruction in self.XMLProcessingInstructions:
        result += u'%s' % instruction
    # TODO: Add DOCTYPE, if applicable
    result += u'%s' % self.doctype
    result += u'<%s' % ( self.tagName )
    # If the instance has a namespace, render that too
    if self.namespace:
        result += u' xmlns="%s"' % ( self.namespace.namespaceURI )
    # If there are child namespaces that aren't the same as the local 
    # namespace, they need to be included, each under its own prefix:
    for ns in self.childNamespaces:
        if ns != self.namespace:
            result += u' xmlns:%s="%s"' % ( 
                ns.Name, ns.namespaceURI
            )
    # Since a document is also a tag, it can have attributes, so render 
    # any present, each preceded by a separating space:
    for attr in self.attributes:
        result += u' %s="%s"' % ( 
            attr, self.attributes[ attr ]
        )
    # Close the starting tag
    result += u'>'
    # Add Tag.childNodes.__unicode__ to results
    for child in self.childNodes:
        result += u'%s' % child
    # Strip the current results just to keep things clean
    result = result.strip()
    # Add the closing tag
    result += u'</%s>' % ( self.tagName )
    # And return it
    return result
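
The order in which the pieces are rendered is easier to see in a stripped-down form. The function below is a sketch only: render_document and its arguments are invented for illustration, every piece is passed in as an already-rendered string, and child-namespace handling is left out entirely:

```python
def render_document(tag_name, namespace_uri=None, declaration='',
                    instructions=(), doctype='', attributes=(),
                    children=()):
    """Assemble document output in the same order as __str__."""
    # 1. XML declaration, 2. processing-instructions, 3. DOCTYPE
    result = declaration
    for instruction in instructions:
        result += instruction
    result += doctype
    # 4. The root tag, with its namespace and attributes
    result += '<%s' % tag_name
    if namespace_uri:
        result += ' xmlns="%s"' % namespace_uri
    for name, value in attributes:
        result += ' %s="%s"' % (name, value)
    result += '>'
    # 5. Children, then the closing tag
    for child in children:
        result += '%s' % child
    result += '</%s>' % tag_name
    return result


output = render_document(
    'html',
    'http://www.w3.org/1999/xhtml',
    declaration='<?xml version="1.0"?>',
    doctype='<!DOCTYPE html>',
    attributes=[('lang', 'en')],
    children=['<body></body>'],
)
```

Like the methods above, this adds no line-breaks between the pieces, so the output is one continuous string.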

How BaseDocument Will Be Used

The next logical step, I think, is to define document-type classes for HTML 5 and XHTML document-types: one document-type for each Namespace constant available in the markup module. The implementation of those is where differentiation between the two HTML dialects starts to take shape, as do the differences between the two of them and any generic XML-derived markup. The HTML-variant implementations will be very simple: Neither will override much (if any) of the functionality of BaseDocument, though both may well have some common structures added (like head and body properties that provide direct access to the Tag-instance representing them, for example). Down the line, they'll both likely have support for script- and stylesheet-management attached in some fashion, but that's a topic for a later post.

There's one other difference that I can think of, offhand, between the two HTML variants, maybe: An XHTML document's __init__ might set XML-declaration values (version and encoding) in order to conform to the XML requirements that underlie it. There's also the possibility that XML processing-instructions might be added, though that's not part of a baseline XHTML document. Given that this is the only difference I can identify between the two dialects that isn't already accounted for through the relevant Namespace, I'm going to give some thought to how best to proceed on defining concrete HTML-document classes while I work my way through the MarkupParser in my next post.

And that's all for today, I think.

Thursday, May 11, 2017

Unit-testing vs. Development Disruptions

So, the short story about the disruption in implementing Tag that I alluded to is that I got sidetracked with other things, and couldn't spare the attention to Tag for the best part of two weeks. The specific details why aren't important, though — in a real world, dev-shop environment, similar derailments happen for any number of reasons: Temporary changes in priorities, critical bugs that need attention right now, the list of possible reasons is probably huge. The important part is what the effects of the derailment were:

  • Development was paused mid-stream;
  • It was set aside for a relatively long period, and despite making some efforts to at least try to keep track of where I'd left off, I ended up being away from it for too long to remember what I was thinking at the time;
  • I hadn't gotten through completely documenting (or in a few cases, even loosely defining) many of Tag's members;
  • I still had a fair number of unit-tests that I'd had to defer because of dependencies with other classes;
Note that second point: Normally, if I can, I like to try and at least take some time to scribble down a few simple notes about where I had to leave off, what I was doing, what the immediate concerns were when I had to shelve my efforts, that sort of thing. In this case, though I made an effort to do so, it simply wasn't enough. I'd completely lost track of my thoughts after getting as far as stubbing out all of the members of Tag after finishing the unit-testing of Namespace, minus some dependencies on Tag.

Some of this disruption could have been avoided, maybe, if I'd chosen to build out the concrete classes in a different order. As a result of my decision to take them in the order I did, I was left in the position of waiting to finish the unit-testing of Namespace until I was done with implementing Tag because of some dependencies between the two:

I'd rather get all of the unit-testing resolved at once after Tag is done, and the markup module is more complete.
In retrospect, that was maybe not the best decision I could've made. I haven't gone back yet to look at the interconnections of the markup classes to see if there was a better, less troublesome path I could have taken, so it may well be that there isn't one. I'm going to plan to do that after I finish the module, and if there is anything worthwhile that I discover, I'll be sure to post it.

Where That Left Me

As a result, I spent a day or so thrashing about trying to figure out where I'd left off and what my next steps were. Not unlike coming into a project that some other developer has started in a real-world development position, really.

And that is where having solid unit-testing policies and practices came to the rescue, I think. With what had been completed, it was a matter of a few minutes to set up the testTag test-case class and fire off a test-run. That gave me a couple of lists of missing test-methods, one for properties of Tag, and one for its methods. It took maybe another half hour or so to stub out all of those test-methods, so that all of them were returning failures, and then another couple of days to work my way through all of the failing tests and get them to pass. All told, there were three types of tasks I had to undertake, guided by those tests:

  • Implement unit-tests on properties and methods that had been completed;
  • Implement unit-tests on properties and methods that were implemented, but also broken in some fashion; and
  • Implement unit-tests that revealed that I hadn't even defined the Tag-member that they were supposed to test, then implement those missing members;
All in all, this process was similar to the sort of thing that is done regularly in TDD shops, if much less formally:
  • I had member-requirements enforced by the test-methods;
  • I had, at least in some cases, functional requirements for those members that were easily converted into usable test-methods;
  • In the cases where I didn't have solid functional requirements, I could refer to the JavaScript API that I was trying to maintain consistency with for most members; and
  • In the (few) remaining cases where I was creating new functionality, I had documented what those members were supposed to do, or had a very good idea what I wanted them to be capable of.
It was still very chaotic (and more than a little frustrating), but it was workable.

Because I was feeling pressed for time, I didn't think to capture a lot of the results of those initial test-runs. In fact, it wasn't until I'd gotten a fair way through them that it occurred to me that what I was going through might be worth posting about, and by that time my results looked like this:

########################################
Unit-test results
########################################
Tests were successful ... False
Number of tests run ..... 219
 + Tests ran in ......... 0.02 seconds
Number of errors ........ 0
Number of failures ...... 92
########################################

By the time Tag was complete, that had grown a bit:

########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 225
 + Tests ran in ........... 0.12 seconds
Number of errors .......... 0
Number of failures ........ 16
Number of tests skipped ... 77
########################################
I'd also decided that I wanted to be able to see both a summary of the number of test-methods that I'd explicitly skipped, and some details about those skipped test-methods. A lot of them were skipped because they were the various getter-, setter- or deleter-methods for properties that were completely tested in the test-method for the property. I'd also force-skipped the items that were dependent on an implementation of some other class that I hadn't gotten to yet (MarkupParser):
########################################
SKIPPED
#--------------------------------------#
test_DelParent (__main__.testBaseNode)
 - _DelParent is tested in testparent
test_GetParent (__main__.testBaseNode)
 - _GetParent is tested in testparent

...

test_SetParent (__main__.testBaseNode)
 - _SetParent is tested in testparent

...

test_DelinnerHTML (__main__.testTag)
 - ## DEFERRING until MarkupParser is implemented

...

test_GetinnerHTML (__main__.testTag)
 - ## DEFERRING until MarkupParser is implemented

...

test_SetinnerHTML (__main__.testTag)
 - ## DEFERRING until MarkupParser is implemented

...

testcloneNode (__main__.testTag)
 - ## DEFERRING until MarkupParser is implemented
testinnerHTML (__main__.testTag)
 - ## DEFERRING until MarkupParser is implemented

...

########################################
FAILURES
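
For reference, the skipped-test summary and details shown above can be produced with nothing more than the standard unittest module. What follows is only a sketch of that idea, not the code in my unit_testing module: the testExample class and the summarizeSkipped function are hypothetical names, and it's written for Python 3.

```python
import io
import unittest


class testExample(unittest.TestCase):
    """Hypothetical test-case, used only to produce some skipped tests."""

    def testparent(self):
        self.assertTrue(True)

    @unittest.skip('_GetParent is tested in testparent')
    def test_GetParent(self):
        pass

    @unittest.skip('## DEFERRING until MarkupParser is implemented')
    def testinnerHTML(self):
        pass


def summarizeSkipped(result):
    """Returns report-lines for the skipped tests in a TestResult."""
    lines = ['Number of tests skipped ... %d' % len(result.skipped)]
    # result.skipped is a list of ( test, reason ) pairs
    for test, reason in result.skipped:
        lines.append('%s' % test)
        lines.append(' - %s' % reason)
    return lines


suite = unittest.defaultTestLoader.loadTestsFromTestCase(testExample)
# Send the runner's own output to a buffer so only the summary prints
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print('\n'.join(summarizeSkipped(result)))
```

The reason passed to unittest.skip is carried through to the result object unchanged, which is what makes the per-test detail lines possible.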

The take-away from this entire story, for me, boiled down to

Having a unit-testing policy, and processes that implement that policy, can help a developer resume their efforts after a disruption as well as ensuring that changes to code didn't break anything.

Even with that, though, Tag still ended up taking longer to finish than I'd expected — an argument, perhaps, for picking the sequence of classes for development more carefully...

Other Things that I Encountered or Thought Of

Here's a potentially-useful trick, with some back-story. BaseNode implements some concrete functionality that's inherited by CDATA, Comment, Tag and Text. I'll use the nextElementSibling property as the example, but there are seven other properties that have the same relationship to the same derived classes. In order to really test those properties, there needs to be a class that has a complete, concrete implementation that derives from BaseNode. Normally, in building out unit-tests for an abstract class like BaseNode, I'd also define a derived class as part of the unit-testing module (BaseNodeDerived, for example), and would use instances of that class as the test-objects in the various test-methods for those concrete items. At some point while I was working through the pile of unit-tests for Tag, it occurred to me that it would be possible (though maybe not desirable in this case) to have one of the test-methods for BaseNode require test-methods in the test-case classes for its derived classes instead. That didn't turn out to be very useful in this case, since BaseNode ended up being an abstract class with no abstract members (something that I'll have to think on later), but the concept seemed, for a while, to be sound enough that I had code that would do just that:

def testnextElementSibling(self):
    """Unit-tests the nextElementSibling property of a BaseNode instance."""
    # It makes little sense to test here, since that would require 
    # spinning up derived (and potentially broken) classes and there 
    # are *actual* classes where the tests can be run, so require tests
    # in those test-case classes here instead.
    testCases = [ testCDATA, testComment, testTag, testText ]
    testName = 'testnextElementSibling'
    missingCases = [ 
        c.__name__ for c in testCases if not hasattr( c, testName )
    ]
    self.assertEqual( missingCases, [], 
        'testBaseNode requires "%s" test-cases in %s' % 
            ( testName, ', '.join( missingCases ) )
    )
Ultimately, since Tag (and CDATA, Comment and Text) derive from BaseNode, it was possible to build useful test-methods for all of the incomplete BaseNode test-methods using instances of those classes, so this approach wasn't actually needed. Probably just as well: Though I could show that it worked to my satisfaction, it still ended up relying a bit too much on a human making a decision (which items to require tests for, in this case) to maintain some certainty of the code-coverage I'm striving for. That said, I may well come back to that idea, perhaps implementing it as a decorator-method that can be applied to test-case classes, the way AddMethodTesting and AddPropertyTesting work right now.

A Missing Property Example: ownerDocument

One of the missing members I discovered as a result of the big list of members of Tag was the ownerDocument property. In all honesty, I simply missed implementing, or even requiring it, so the unit-testing approach mentioned above didn't catch it — it was purely human observation and effort that revealed it. Since it was also a common property for all of the BaseNode-derived classes, and something that should be common to all node-types, even if they aren't derived from BaseNode, I required it as an abstract property in IsNode:

#-----------------------------------#
# Abstract Properties               #
#-----------------------------------#

nextElementSibling = abc.abstractproperty()
nextSibling = abc.abstractproperty()
nodeName = abc.abstractproperty()
nodeType = abc.abstractproperty()
ownerDocument = abc.abstractproperty()
parentElement = abc.abstractproperty()
parentNode = abc.abstractproperty()
previousElementSibling = abc.abstractproperty()
previousSibling = abc.abstractproperty()
textContent = abc.abstractproperty()
Once that was in place, though, the unit-testing policies picked up that it was missing immediately, raising failures resembling this one, from testCDATA:
########################################
ERRORS
#--------------------------------------#
Traceback (most recent call last):
  File "test_markup.py", line 361, in test__init__
    testObject = CDATA( testValue )
TypeError: 
    Can't instantiate abstract class CDATA with 
      abstract methods ownerDocument
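
The mechanism behind that error can be reproduced in a few lines. This is a sketch only, written for Python 3, where stacking property on abc.abstractmethod is the current spelling of an abstract property; the classes are cut-down, hypothetical stand-ins for the real IsNode, BaseNode and CDATA.

```python
import abc


class IsNode(abc.ABC):
    """Hypothetical, cut-down stand-in for the real IsNode interface."""

    @property
    @abc.abstractmethod
    def ownerDocument(self):
        """Abstract: every node-type must provide this property."""


class BaseNode(IsNode):
    """Stand-in that does NOT implement ownerDocument yet."""


class CDATA(BaseNode):
    def __init__(self, data):
        self.data = data


try:
    CDATA('test-instance')
except TypeError as error:
    # Instantiation fails until ownerDocument is implemented somewhere
    # in CDATA's inheritance chain
    print('TypeError: %s' % error)
```

Because the check happens at instantiation, any test that so much as creates a CDATA object surfaces the missing member.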

From there, implementing it in BaseNode was simple:

@describe.AttachDocumentation()
def _GetownerDocument( self ):
    """
Returns the top-most IsElement object in the instance's DOM tree"""
    if not self._parent:
        return self
    else:
        return self._parent.ownerDocument
This deviates from the JavaScript ownerDocument, though, and I'm not sure if I'll keep it as is, or alter it later when I get to implementing BaseDocument: In JavaScript, as far as I've been able to determine, there is no way to create an element that is not a member of a document, even if that element hasn't been attached to the DOM of the document. All parsed tags are automatically document-members, and the createElement method is only available to the document. Taken together, these effectively prevent an element from not being a member of the document it was created in. In the idic framework (so far), it's possible to create Tags even if no document has been defined. I'll have to ponder on that, but for now, I'll let it stand.
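
One possible alteration, sketched here purely as an assumption about where this could go once BaseDocument exists, is to walk up the parent chain and only report the top-most node if it really is a document. The Node and Document classes below are hypothetical stand-ins, not idic classes.

```python
class Node(object):
    """Hypothetical stand-in for a BaseNode-derived class."""
    def __init__(self, name, parent=None):
        self.name = name
        self._parent = parent

    @property
    def ownerDocument(self):
        # Walk up the parent chain iteratively rather than recursively
        node = self
        while node._parent is not None:
            node = node._parent
        # Only report the top-most node if it is actually a document
        if isinstance(node, Document):
            return node
        return None


class Document(Node):
    """Hypothetical stand-in for a BaseDocument-derived class."""


document = Document('html')
body = Node('body', parent=document)
paragraph = Node('p', parent=body)
orphan = Node('div')

print(paragraph.ownerDocument is document)  # True
print(orphan.ownerDocument)                 # None
```

That would keep the ability to create Tags without a document, while making it obvious (through the None) when a node isn't part of one.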

Once again, the unit-testing policies picked up that there was still something missing:

#--------------------------------------#
Traceback (most recent call last):
  File "unit_testing.py", line 332, in testMethodCoverage
    target.__name__, missingMethods
AssertionError: 
    Unit-testing policy requires test-methods to be created 
    for all public and protected methods, but testBaseNode 
    is missing the following test-methods: 
        ['test_GetownerDocument']
#--------------------------------------#
Traceback (most recent call last):
  File "unit_testing.py", line 373, in testPropertyCoverage
    'methods: %s' % ( target.__name__, missingMethods )
AssertionError: 
    Unit-testing policy requires test-methods to be created 
    for all public properties, but testBaseNode is missing 
    the following test-methods: 
        ['testownerDocument']
#--------------------------------------#
As annoying as this might seem, it really was a good thing: The unit-testing processes and policies set up back at the beginning of last month were catching that changes had been made, that unit-testing for those changes wasn't complete, and that work needed to be done because of those changes.

That feels like a validation of my unit-testing policies to me.

Two other properties came to my attention while organizing the big Tag members-table: While I'd set up tests for the nodeName and nodeType properties in the testHasTextData test-case class, there were no corresponding tests in testCDATA, testComment or testText.

Again, fixing that didn't take much effort. The example structure for the test-methods for all of the concrete classes looked almost the same as the test-methods for CDATA:

def testnodeName(self):
    """Unit-tests the nodeName property of a CDATA instance."""
    testObject = CDATA( 'test-instance' )
    self.assertEqual( testObject.nodeName, CDATA._nodeName,
        'CDATA does not have a defined _nodeName attribute, or is '
        'inheriting the default None value from HasTextData.'
    )

def testnodeType(self):
    """Unit-tests the nodeType property of a CDATA instance."""
    testObject = CDATA( 'test-instance' )
    self.assertEqual( testObject.nodeType, nodeTypes.CDATASection,
        'CDATA does not have a defined _nodeType attribute, or is '
        'inheriting the default None value from HasTextData.'
    )
Once the underlying class-attributes had been defined for all three concrete classes, and a few other minor things that I noticed that were buried in the 20+ failures from Namespace and BaseNode tests were cleaned up, things were back to a reasonable/expected number of failures and errors.

The moral of this story? Perhaps the idea of embedding the node-types and -names as class properties, then retrieving them with common getter-methods in HasTextData was... too clever, maybe? At the time it felt fairly elegant — Store the actual values in the classes themselves, keeping them nicely encapsulated, etc., etc. But, when push came to shove, the combination of that storage-approach and the coverage-testing routines left a hole that a bug slipped through. It was pure, dumb luck that I happened to notice it when I did.
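
A stripped-down illustration of that hole, as I understand it: the property itself lives in the parent class, so it exists on every derived class whether or not the derived class ever supplied a value. The classes below are hypothetical stand-ins, and the '#cdata-section' value is the DOM's name for CDATA nodes, used here only as an example.

```python
class HasTextData(object):
    """Hypothetical, cut-down stand-in for the real HasTextData class."""
    _nodeName = None  # Default that derived classes are meant to override

    @property
    def nodeName(self):
        # Common getter: reads whatever the instance's class defines
        return self.__class__._nodeName


class CDATA(HasTextData):
    _nodeName = '#cdata-section'


class Comment(HasTextData):
    pass  # Forgot to define _nodeName


print(CDATA().nodeName)    # #cdata-section
print(Comment().nodeName)  # None
```

Nothing fails when Comment is defined or instantiated, so only a test that checks the value on each concrete class will notice the omission.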

Getting the Remaining Tests Running

Eventually, after I got done with Tag's implementation and had reconciled all of the expected missing tests, I got to a point where the test-run yielded only a handful of failures:

########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 226
 + Tests ran in ........... 0.14 seconds
Number of errors .......... 0
Number of failures ........ 8
Number of tests skipped ... 77
########################################
Those failures included a variety of items, including:
#--------------------------------------#
testCodeCoverage (__main__.testmarkupCodeCoverage)
AssertionError: 
    Unit-testing policies require test-cases for all classes 
    and functions in the idic.markup module, but the following 
    have not been defined: 
        (testAttributesDict)
#--------------------------------------#
testMethodCoverage (__main__.testNamespace)
AssertionError: 
    Unit-testing policy requires test-methods to be 
    created for all public and protected methods, but 
    testNamespace is missing the following test-methods: 
        ['testGetNamespaceByName', 'testGetNamespaceByURI', 
        'testRegisterNamespace', 
        'test_DelDefaultRenderingModel', 'test_DelName', 
        'test_DelTagRenderingModels', 'test_DelnamespaceURI', 
        'test_GetDefaultRenderingModel', 'test_GetName', 
        'test_GetTagRenderingModels', 'test_GetnamespaceURI', 
        'test_SetDefaultRenderingModel', 'test_SetName', 
        'test_SetTagRenderingModels', 'test_SetnamespaceURI']
#--------------------------------------#
testPropertyCoverage (__main__.testNamespace)
AssertionError: 
    Unit-testing policy requires test-methods to be created 
    for all public properties, but testNamespace is missing 
    the following test-methods: 
        ['testDefaultRenderingModel', 'testName', 
        'testTagRenderingModels', 'testnamespaceURI']
#--------------------------------------#
I handled all of these failures with the normal unit-test-definition process.

Since I was already in a unit-testing frame of mind, I went ahead and dealt with all of the remaining outstanding test-failures that weren't part of Tag's test-case as well. That leaves me with a clean slate, more or less, for the next post, with the following results:

########################################
Unit-test Results: idic.markup
#--------------------------------------#
Tests were SUCCESSFUL
Number of tests run ....... 253
Number of tests skipped ... 95
Tests ran in .......... 0.140 seconds
#--------------------------------------#
########################################
Unit-test Results: idic
#--------------------------------------#
Tests were SUCCESSFUL
Number of tests run ....... 276
Number of tests skipped ... 95
Tests ran in .......... 0.318 seconds
#--------------------------------------#

The unit_testing module had some minor modifications, not much more than some restructuring of the test-results reporting, really, with the addition of the code that counted and displayed details on skipped tests. The current test-results for test_markup are long enough that I didn't want to just dump them into the post, but I still want to share them, so they're downloadable as well.

Tuesday, May 9, 2017

Generating and Parsing Markup in Python [6]

Though there are fewer methods in the Tag class than properties, there are still a good number of them that are, in my opinion, worth taking a detailed look at, so I'll just dive right in.

Method Implementations

There are three groups of methods that have a common theme between them, and that share a fair amount of logic as a result:

  • Methods relating to manipulating a Tag's childNodes (variations of adding children to and removing children from a Tag, ultimately);
  • Methods for getting various logical groupings or sets of child Tags from a Tag instance; and
  • Methods for working with the attributes of a Tag
The majority of the other methods of Tag are either already implemented in BaseNode, are very simple implementations (in my opinion), or need to be deferred until some other class in the markup module is implemented.

Methods Relating to Manipulating Child Nodes

There are five different methods that add child nodes to a Tag instance in some fashion, three that involve removing a child, and two that relate to replacing a child (methods that I'm adding are indicated like this):

appendChild:
Adds a child to the instance's childNodes at the end of the collection
insertAfter:
Adds a child to the instance's childNodes after the position of a specified existing child
insertBefore:
Adds a child to the instance's childNodes before the position of a specified existing child
insertChildAt:
Adds a child to the instance's childNodes at a specified index/position in the collection
prependChild:
Adds a child to the instance's childNodes at the beginning of the collection
removeChild:
Removes the specified child from the childNodes collection
removeChildAt:
Removes the child at a specified index/position from the childNodes collection
removeSelf:
Removes the instance from its parent's childNodes collection
replaceChild:
Replaces the specified child in the instance's childNodes collection with a new child
replaceChildAt:
Replaces the child at a specified index/position in the instance's childNodes collection with a new child
Experience with a previous incarnation of this module (the one I am re-creating from the ground up) led me to the conclusion that any method that alters a Tag's childNodes should return a relevant node-value. That is, for example:
# Creating a <table> to display values in a list of dicts:
listOfDicts = [
    { 'name':'row1 - name', 'value':'row1 - value' },
    { 'name':'row2 - name', 'value':'row2 - value' },
    { 'name':'row3 - name', 'value':'row3 - value' },
    { 'name':'row4 - name', 'value':'row4 - value' },
    ]

table = Tag( 'table', border='1' )
thead = table.appendChild( Tag( 'thead' ) )
tbody = table.appendChild( Tag( 'tbody' ) )
tr = thead.appendChild( Tag( 'tr' ) )
for key in sorted( listOfDicts[ 0 ] ):
    th = tr.appendChild( Tag( 'th' ) )
    th.appendChild( Text( key ) )
for row in listOfDicts:
    tr = tbody.appendChild( Tag( 'tr' ) )
    for key in sorted( row ):
        td = tr.appendChild( Tag( 'td' ) )
        td.appendChild( Text( str( row[ key ] ) ) )
generates the following <table> quickly and easily because, in part, each appendChild returns the Tag appended, which can then be used to append other nodes to the new element:
name         | value
row1 - name  | row1 - value
row2 - name  | row2 - value
row3 - name  | row3 - value
row4 - name  | row4 - value

It didn't occur to me to check what the JavaScript methods I'm copying did until I wrote this post, but a cursory check indicates that they do the same thing (returning an appended child), so in that respect, at least, I feel this decision is solid. Similarly, anything that removes a child should return the child being removed, which is also what the JavaScript equivalents do. The lone JavaScript replace* method (replaceChild) returns the node being replaced, so I followed that convention in all of the replace* methods of Tag in the interests of consistency.

All of these methods are also responsible for making certain that the nodes being manipulated have their parent updated as part of the process. That is:

  • Any method that adds a child to a Tag ensures that the child node doesn't already have a parent, and sets its parent to the Tag that it is being added to; and
  • Any method that removes a child also clears that child's parent (which makes it available to be added to a different Tag if needed);
Since the replace* methods are, essentially, just a removal of an existing child and the addition of a new one, they are responsible for performing both of those tasks if they aren't handled by calling some other methods.

The carry-through of nodes added to a Tag is shown in the implementation of appendChild, as is the checking of a child's parent before executing the addition to the Tag:

@describe.AttachDocumentation()
@describe.argument( 'child', 
    'the child to append to the instance\'s childNodes', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the specified child is not an instance of BaseNode'
)
@describe.raises( MarkupError, 
    'if the child to be appended is already a child of another element'
)
@describe.returns( 'the child that was appended to the instance\'s '
    'childNodes' )
def appendChild( self, child ):
    if not isinstance( child, BaseNode ):
        raise TypeError( '%s.appendChild expects an instance of BaseNode, '
            'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, child, type( child ).__name__ )
            )
    # Checking for the child's current parent
    if child.parent:
        raise MarkupError( '%s.appendChild cannot append "%s" to "%s" '
            'because the node to be appended is already a child of '
            'another element (%s)' % ( 
                self.__class__.__name__, child, self, child.parent )
            )
    # Append the child
    self._childNodes.append( child )
    # Set the child's new parent
    child._SetParent( self )
    # Return the child
    return child
The return of a removed item is shown in removeChild:
@describe.AttachDocumentation()
@describe.argument( 'child', 
    'the child node to remove', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the specified child is not a BaseNode instance'
)
@describe.raises( MarkupError, 
    'if the specified child is not a child of the instance'
)
@describe.returns( 'the child node removed' )
def removeChild( self, child ):
    """
Removes the specified child from the instance's childNodes"""
    if not isinstance( child, BaseNode ):
        raise TypeError( '%s.removeChild expects a BaseNode-derived object '
            'for its child argument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, child, 
                type( child ).__name__
            )
        )
    try:
        del self._childNodes[ self._childNodes.index( child ) ]
    except ValueError:
        raise MarkupError( '%s.removeChild could not remove %s because it '
            'is not a childNode of %s' % ( self.__class__.__name__, 
                child, self )
            )
    # Clear the removed child's parent, so that it can be added elsewhere
    child._DelParent()
    return child
Finally, the return of the old child is shown in replaceChildAt (which is called by replaceChild):
@describe.AttachDocumentation()
@describe.argument( 'index', 
    'the position of the child to be replaced in the instance\'s childNodes', 
    int, long
)
@describe.argument( 'newChild', 
    'the new child to replace the specified child with', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the newChild specified is not a BaseNode-derived object'
)
@describe.returns( 'the child being replaced in the instance\'s '
    'childNodes' )
def replaceChildAt( self, index, newChild ):
    """
Replaces the child at the specified index/position in the instance's childNodes 
with a new child"""
    if not isinstance( newChild, BaseNode ):
        raise TypeError( '%s.replaceChildAt expects a BaseNode-derived object '
            'for its newChild argument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, newChild, 
                type( newChild ).__name__
            )
        )
    oldChild = self._childNodes[ index ]
    self._childNodes[ index ] = newChild
    oldChild._DelParent()
    newChild._SetParent( self )
    return oldChild
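
The remaining add and remove methods in the list aren't shown here, but most of them can plausibly be reduced to an index calculation plus a call to one of the index-based methods. The sketch below shows that delegation on a hypothetical Node stand-in; the argument order of insertBefore and insertAfter and the use of ValueError in place of MarkupError are assumptions made to keep the example self-contained.

```python
class Node(object):
    """Hypothetical stand-in for Tag, with just enough to show delegation."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self._childNodes = []

    def insertChildAt(self, index, child):
        # Refuse children that already belong to some other node
        if child.parent is not None:
            raise ValueError('%s already has a parent' % child.name)
        self._childNodes.insert(index, child)
        child.parent = self
        return child

    def appendChild(self, child):
        return self.insertChildAt(len(self._childNodes), child)

    def prependChild(self, child):
        return self.insertChildAt(0, child)

    def insertBefore(self, child, existing):
        return self.insertChildAt(self._childNodes.index(existing), child)

    def insertAfter(self, child, existing):
        return self.insertChildAt(
            self._childNodes.index(existing) + 1, child)

    def removeChildAt(self, index):
        child = self._childNodes.pop(index)
        child.parent = None  # Frees the child to be added elsewhere
        return child

    def removeChild(self, child):
        return self.removeChildAt(self._childNodes.index(child))


ul = Node('ul')
second = ul.appendChild(Node('li-2'))
ul.prependChild(Node('li-1'))
ul.insertAfter(Node('li-3'), second)
removed = ul.removeChild(second)
print([c.name for c in ul._childNodes])  # ['li-1', 'li-3']
print(removed.parent)                    # None
```

Keeping the parent bookkeeping in the two index-based methods means the convenience methods can't forget to do it.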

The Collected getElement* Methods

JavaScript provides a number of methods that can be used to retrieve zero-to-many tag-elements based on various criteria, and I've added three more to the mix in Tag:

getElementById:
Returns the first Tag child with an id-attribute whose value matches the id supplied
getElementsByAttributeValue:
Returns a list of Tag children that have a specific attribute whose value exactly matches the value supplied
getElementsByClassName:
Returns a list of Tag children that have a class-attribute (using the classList property) containing the value supplied
getElementsByNamespace:
Returns a list of Tag children whose namespaces match the one provided, including children of children that inherit their parent's namespace
getElementsByPath:
Returns a list of Tag children whose DOM-paths relative to the instance match the path specified
getElementsByTagName:
Returns a list of Tag children whose tag-names match the tag-name specified
Many of these methods rely on being able to start with a list of all of an instance's children, so the implementation of getElementsByTagName (which can fulfil that need) is important enough to show in some detail before discussing the remainder.

getElementsByTagName in JavaScript allows * to be provided as a wild-card tag-name. If that wild-card is provided, then the method returns all child tags. That behavior is mirrored in Tag.getElementsByTagName. One difference to note: where the JavaScript method returns an empty collection if there are no matches, Tag.getElementsByTagName returns a null (None) value.

@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to search for in child elements. Using "*" will return '
    'all children',
    str, unicode
)
@describe.returns( 'list of child IsElement objects whose tag-name matches '
    'the tagName provided, or None' )
@describe.raises( TypeError, 
    'if the tagName provided is not a str or unicode value'
)
def getElementsByTagName( self, tagName ):
    """
Gets all child tags whose name matches the tag-name provided"""
    if type( tagName ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByTagName expects a string or '
            'unicode value for the tag-name it\'s to search for but was '
            'passed "%s" (%s)' % ( 
                self.__class__.__name__, tagName, type( tagName ).__name__ )
            )
    results = []
    for child in self.children:
        if child.tagName == tagName or tagName == '*':
            results.append( child )
        subResults = child.getElementsByTagName( tagName )
        if subResults:
            results += subResults
    if results:
        return results
    return None
getElementsByTagName makes use of recursion, by calling itself again for each child found in the current execution and appending the results of that recursive call to the results at the current level of execution. I suspect that there may be a better way of implementing this method, perhaps using a generator in some fashion, but as of this writing, I simply haven't dug into the idea enough to see if it would be worth pursuing.
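
For what it's worth, a generator-based variation might look something like the sketch below. It is an untested-against-idic illustration only: Node is a hypothetical stand-in for Tag with just a tagName and a children list, and iterElementsByTagName is a name made up for the example.

```python
class Node(object):
    """Hypothetical stand-in for Tag: a tag-name and a list of children."""
    def __init__(self, tagName, *children):
        self.tagName = tagName
        self.children = list(children)

    def iterElementsByTagName(self, tagName):
        """Yields matching descendants lazily, in document order."""
        for child in self.children:
            if tagName in ('*', child.tagName):
                yield child
            # Recurse, passing each descendant match up to the caller
            for match in child.iterElementsByTagName(tagName):
                yield match

    def getElementsByTagName(self, tagName):
        # Same return convention as the original: a list, or None
        results = list(self.iterElementsByTagName(tagName))
        return results or None


body = Node('body',
    Node('div', Node('p'), Node('div', Node('p'))),
    Node('p'),
)
print([n.tagName for n in body.getElementsByTagName('*')])
# ['div', 'p', 'div', 'p', 'p']
print(len(body.getElementsByTagName('p')))  # 3
print(body.getElementsByTagName('table'))   # None
```

The main gain would be for callers that only need the first match or want to stop early, since the list-returning method still walks the whole tree.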

With getElementsByTagName available, several of the remaining getElement* methods become fairly simple candidate-filtering problems in their implementation:

getElementsByAttributeValue:
Each result is a candidate that has the specified attribute with the specified value
getElementsByClassName:
Each result is a candidate that has the specified value in its classList
getElementsByNamespace:
Each result is a candidate whose namespace matches the one provided
Each, then, follows the same pattern as getElementsByAttributeValue:
@describe.AttachDocumentation()
@describe.argument( 'name',
    'the name of the attribute whose value is to be checked',
    str, unicode
)
@describe.argument( 'value',
    'the value in the specified attribute that must be matched',
    str, unicode
)
@describe.raises( TypeError, 
    'if the specified name is not a str or unicode value' )
@describe.raises( TypeError, 
    'if the specified value is not a str or unicode value' )
@describe.returns( 'a list of Tag instances matching the '
    'attribute-name/-value criteria' )
def getElementsByAttributeValue( self, name, value ):
    """
Gets the child elements that have the attribute specified containing the value 
specified"""
    if type( name ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByAttributeValue expects a str or '
            'unicode value for its name, but was passed "%s" (%s)' % 
                ( self.__class__.__name__, name, 
                    type( name ).__name__
                )
            )
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByAttributeValue expects a str or '
            'unicode value for its value, but was passed "%s" (%s)' % 
                ( self.__class__.__name__, value, 
                    type( value ).__name__
                )
            )
    results = [ 
            c for c in self.getElementsByTagName( '*' ) 
            if c.attributes.get( name ) == value
        ]
    if results: 
        return results
    return None
The variations of the other two are, ultimately, just in the generation of the results being returned:
# getElementsByClassName
results = [ 
        c for c in self.getElementsByTagName( '*' ) 
        if className in c.classList 
    ]
# getElementsByNamespace
results = [ 
        c for c in self.getElementsByTagName( '*' ) 
        if c.namespace == namespace
    ]

getElementById uses getElementsByAttributeValue as a helper-method, but also provides an optional strict argument (defaulting to False) that allows it to raise a MarkupError if more than one result is found:

# ...
results = self.getElementsByAttributeValue( 'id', value )
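# getElementsByAttributeValue returns None when nothing matches, so 
# make sure there are results before counting them
if strict and results and len( results ) > 1:
    raise MarkupError( '%s.getElementById, with strict enforcement, '
        'found more than one child with the specified id ("%s")' % ( 
            self.__class__.__name__, value )
        )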
if results:
    return results[ 0 ]
return None

The last remaining method in this group, getElementsByPath, may take some explanation. Consider a web-page that has a fair amount of content, including a lot contained in <div> tags. The page also has two <form>s in it, and within one form are a number of rows, constructed with <div>s. As part of the application's requirements, there is a need to apply some CSS classes to every form row <div>, without altering any of the others, and it has to be done, for whatever reason, server-side in the code. The form to be altered can be identified by a specific DOM path relative to the document, as can the <div>s that need to be altered. That path, to each of those <div>s, might look something like /div/form/fieldset/div from the <body> of the page.

That is what getElementsByPath is built to do. Like getElementsByTagName it uses recursion, but in this case it uses it to drill down through the DOM tree, matching tag-names (and allowing the same wild-card capabilities) in order to find all of the children in the right position relative to the Tag that the method was called from. Its implementation is simpler than might be expected, given the complexity of what it's doing:

@describe.AttachDocumentation()
@describe.argument( 'path', 
    'the path to find matching elements for, delimited by "/", and allowing '
    'wild-cards ("*")',
    str, unicode
)
@describe.raises( TypeError, 'if the supplied path is not a str or '
    'unicode value' )
@describe.returns( 'list of elements whose dom-path from the instance '
    'matched the one specified, or None' )
def getElementsByPath( self, path ):
    """
Gets all child tags that can be identified by following matching tag-names 
down the tree"""
    if type( path ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByPath expects a string or '
            'unicode value for the path it\'s to search for but was '
            'passed "%s" (%s)' % ( 
                self.__class__.__name__, path, type( path ).__name__ )
            )
    results = []
    try:
        tagName, subPath = path.split( '/', 1 )
    except ValueError:
        # No "/" in the path: it is a single tag-name with no sub-path
        tagName = path
        subPath = None
    for child in self.children:
        if tagName == '*' or child.tagName == tagName:
            if subPath != None:
                subPathResults = child.getElementsByPath( subPath )
                if subPathResults:
                    results += subPathResults
            else:
                results.append( child )
    if results:
        return results
    return None
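
As a rough illustration of the form-row scenario described earlier, here is a minimal sketch using a hypothetical SketchTag stand-in (it has only tagName, children and className, and is not the real Tag class; it is written to run on Python 3 and skips the type-checking). One thing the sketch makes visible: with the splitting logic as shown, the path is relative to the calling tag and carries no leading slash, because a leading "/" produces an empty first tag-name that matches nothing.

```python
class SketchTag(object):
    """Hypothetical, stripped-down stand-in for Tag (not the real class)."""

    def __init__(self, tagName, *children):
        self.tagName = tagName
        self.children = list(children)
        self.className = ''

    def getElementsByPath(self, path):
        # Split off the first path-segment; the rest (if any) is passed down
        if '/' in path:
            tagName, subPath = path.split('/', 1)
        else:
            tagName, subPath = path, None
        results = []
        for child in self.children:
            if tagName == '*' or child.tagName == tagName:
                if subPath is not None:
                    results += child.getElementsByPath(subPath) or []
                else:
                    results.append(child)
        return results or None


# <body> with one form (two row-<div>s) and one unrelated <div>
row1, row2 = SketchTag('div'), SketchTag('div')
form = SketchTag('form', SketchTag('fieldset', row1, row2))
body = SketchTag(
    'body', SketchTag('div', form), SketchTag('div', SketchTag('p'))
)

# Only the two form-row <div>s are found and altered
rows = body.getElementsByPath('div/form/fieldset/div')
for row in rows:
    row.className = 'form-row'
```

The wild-card form, body.getElementsByPath( '*/form/*/div' ), finds the same two rows in this tree.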

Attribute-Related Methods

getAttribute:
Returns the value of the named attribute of the instance, or None if it doesn't exist
hasAttribute:
Checks for the existence of a specific attribute in the instance's attributes collection
hasAttributes:
Checks for the existence of any attributes in the instance's attributes collection
removeAttribute:
Removes the specified attribute from the instance's attributes collection
setAttribute:
Sets the value of an attribute in the instance's attributes collection

Minus the type-checking of the name argument, getAttribute is really nothing more than:

return self.attributes.get( name )

Similarly, hasAttribute is:

# Test for the key itself, so that an attribute with an empty 
# value ("") still counts as present
if name in self.attributes:
    return True
return False
and hasAttributes is:
if self.attributes:
    return True
return False
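
A small aside on the existence check, sketched with a plain dict standing in for the attributes collection (the helper names here are hypothetical): testing the truthiness of .get( name ) and testing for the key itself give different answers when an attribute exists but has an empty value.

```python
# A plain dict standing in for a Tag's attributes collection
attributes = {'id': 'main', 'hidden': ''}


def has_attribute_by_value(attrs, name):
    # Truthiness of the stored value: '' counts as "not there"
    return bool(attrs.get(name))


def has_attribute_by_key(attrs, name):
    # Membership test: only the presence of the key matters
    return name in attrs


by_value = has_attribute_by_value(attributes, 'hidden')
by_key = has_attribute_by_key(attributes, 'hidden')
```

Here by_value is False while by_key is True; for 'id' both are True, and for a name that was never set both are False.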

Setting attributes is a bit more detailed, but only because the name and value of the inbound attribute are both checked before the set actually occurs, and because of the special handling for data-* attributes:

@describe.AttachDocumentation()
@describe.argument( 'name', 
    'the name of the attribute to set', 
    str, unicode
)
@describe.argument( 'value', 
    'the value of the attribute to set', 
    str, unicode, None
)
@describe.raises( TypeError, 
    'if the supplied name is not a str or unicode value'
)
@describe.raises( TypeError, 
    'if the supplied value is not a str or unicode value or None'
)
@describe.raises( ValueError, 
    'if the supplied name is not a valid attribute-name'
)
@describe.raises( ValueError, 
    'if the supplied value is not a valid attribute-value'
)
def setAttribute( self, name, value ):
    if type( name ) not in ( str, unicode ):
        raise TypeError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute-name for the name of the attribute '
            'to be set, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, name, type( name ).__name__
            )
        )
    if not self.attributes.IsValidName( name ):
        raise ValueError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute-name for the name of the attribute '
            'to be set, but was passed "%s" which is not valid' % ( 
                self.__class__.__name__, name
            )
        )
    if value == None:
        try:
            del self._attributes[ name ]
        except KeyError:
            pass
        return
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute value for the value of the attribute '
            'to be set, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if not self.attributes.IsValidValue( value ):
        raise ValueError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute value for the value of the attribute '
            'to be set, but was passed "%s" which is not valid' % ( 
                self.__class__.__name__, value
            )
        )
    if name[0:5] == 'data_':
        # Only the leading "data_" becomes "data-"
        name = name.replace( 'data_', 'data-', 1 )
    self._attributes[ name ] = value
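
The data_ handling presumably exists because Python keyword-argument names cannot contain a hyphen; that motive is an assumption, not something stated above. A minimal sketch, with hypothetical helper names and written for Python 3, of how such a rename lets data-* attributes be passed as keywords:

```python
def normalize_attribute_name(name):
    """Hypothetical helper: turns a leading 'data_' into 'data-'."""
    if name.startswith('data_'):
        return 'data-' + name[len('data_'):]
    return name


def collect(**attributes):
    # Keyword names must be valid identifiers, so 'data-role' cannot be
    # passed directly; 'data_role' can, and is renamed on the way in
    return dict(
        (normalize_attribute_name(key), value)
        for key, value in attributes.items()
    )


collected = collect(data_role='page', title='Example')
```

In this sketch only the leading prefix is rewritten, so a name such as metadata_x is left untouched.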
Attribute removal, though, is also very simple — barring the type-checking of the name, it's basically just a fail-safe deletion of an item from the attributes collection:
try:
    del self.attributes[ name ]
except KeyError:
    pass

The balance of Tag's methods are pretty straightforward, and I won't go into any depth on them:

cloneNode:
Returns a copy of the Tag with the option of returning copies of all of its childNodes as well
contains:
Determines if a Tag contains another specified node
hasChildNodes:
Determines if a Tag has any childNodes members
The cloneNode method, like the innerHTML property, has to wait until I've got MarkupParser implemented. My plan for implementing it centers around either creating a new Tag instance for shallow copies, or using the MarkupParser class to generate complete copies of a markup-tree from the __str__ and/or __unicode__ methods of Tag, since nodes in general, and tags in particular, cannot have multiple parents (as noted earlier).

In light of how much code there actually is behind the implementation of Tag, and how much of it I didn't do any sort of deep dive into in this post or the last one, I figured I'd share the Tag class code, as well as its unit-test code before I signed off for the day. These are not the complete markup or test_markup modules, so they won't actually execute for lack of various dependencies, but all the code for both (as of this post) is there:

92.9kB
120.1kB

Thursday, May 4, 2017

Generating and Parsing Markup in Python [5]

The Tag class turned out to be something of a beast, partly because of the sheer scope of it, and partly for reasons that had nothing to do with the code involved. The non-code reasons I'll discuss in my next post after Tag is complete, because I think some interesting points surfaced that bear some discussion, but today, I'm going to stick to telling the story of how Tag's implementation unfolded.

Long Post and More to Come

I had really hoped that I'd be able to get all of the implementation of Tag covered in a single post, but by the time I got to the end of the properties (this post), this was already the longest post I've written to date, so I'll pick up next time with the method implementations.

Tag is the Workhorse of the Module

It should hopefully come as no great surprise that Tag is a pretty large class — the markup-construct that it represents is the foundation for the structure of web-pages and other document-types in other languages. Given the relationships it has with other classes in the markup module:

there were 33 properties and 28 methods that I originally expected to have to implement, some of which were required by IsElement, BaseNode or IsNode. I took some time to gather all of these together into one coherent list in an effort to make sure that I could just progress down that list, implementing as I went, without missing anything. It's a pretty substantial list, despite the occasional items I decided to remove (usually because they served no real purpose in a server-side context). There were also a few members that I decided I wanted to add to the class, and a few relatively minor concerns about name-conflicts that required some thought about altering the member-names. Here's where the final member-list landed, with the additions and alterations noted:
Tag Members

Member Name                  Impl.     Req. By    Notes
---------------------------  --------  ---------  ----------------------------------
Property Members
accessKey                    Tag                  Is attribute (accesskey)
attributes                   Tag
childElementCount            Tag       IsElement
childNamespaces              Tag
childNodes                   Tag       IsElement
children                     Tag       IsElement
classList                    Tag                  Relates to attribute (class)
className                    Tag                  Relates to attribute (class)
dir                          Tag                  Is attribute (dir), Name-conflict
firstChild                   Tag       IsElement
firstElementChild            Tag       IsElement
id                           Tag                  Is attribute (id), Name-conflict
innerHTML                    Tag
lang                         Tag                  Is attribute (lang)
lastChild                    Tag       IsElement
lastElementChild             Tag       IsElement
namespace                    Tag                  Relates to Namespace class
namespaceURI                 Tag                  Relates to Namespace class
nextElementSibling           BaseNode
nextSibling                  BaseNode
nodeName                     Tag       IsNode
nodeType                     Tag       IsNode
ownerDocument                BaseNode  IsNode
parent                       BaseNode
parentElement                BaseNode
parentNode                   BaseNode
previousElementSibling       BaseNode
previousSibling              BaseNode
style                        Tag                  Is attribute (style)
styleList                    Tag                  Relates to attribute (style)
tabIndex                     Tag                  Is attribute (tabindex)
tagName                      Tag
title                        Tag                  Is attribute (title)
Method Members
appendChild                  Tag       IsElement
cloneNode                    Tag       IsElement
contains                     Tag       IsElement
getAttribute                 Tag
getElementById               Tag
getElementsByAttributeValue  Tag
getElementsByClassName       Tag
getElementsByNamespace       Tag
getElementsByPath            Tag
getElementsByTagName         Tag
hasAttribute                 Tag
hasAttributes                Tag
hasChildNodes                Tag       IsElement
insertBefore                 Tag       IsElement
insertChildAt                Tag       IsElement
isDefaultNamespace           Tag
isEqualNode                  Tag       IsNode
isSameNode                   BaseNode  IsNode
normalize                    Tag
prependChild                 Tag
removeAttribute              Tag
removeChild                  Tag       IsElement
removeChildAt                Tag       IsElement
removeSelf                   Tag       IsElement
replaceChild                 Tag       IsElement
replaceChildAt               Tag       IsElement
setAttribute                 Tag
toString                     Tag       IsNode

As before, these members are derived from the w3schools' HTML DOM Element Object page, with some additions from their list of HTML Global Attributes.

Property Implementations

There were a total of five basic patterns that cropped up while I was working through the implementation of Tag's properties, each with their own particular aspects that I found interesting. I've grouped them accordingly in the discussion below.

Storing Attribute Values

When I realized that several of the properties of Tag also had to be expressed as attributes in the rendered markup, I had to give some serious thought to how I wanted to implement the storage of attributes in general, as well as how I was going to link those properties to the attributes they were related to. There are seven properties that are, in a typical HTML/JavaScript environment, both DOM-object properties and attributes that can be set in the text of the markup:

  • accesskey
  • dir
  • id
  • lang
  • style
  • tabindex
  • title
Those do not include the other eight that were added in HTML 5 (see the list noted earlier for details on those).

To further complicate matters, two of them, dir and id, are also the names of built-in functions in Python. Setting the naming-conflict aside for the moment, these properties were a potential concern because as attributes, changes to their values as properties should also be reflected in the markup generated and rendered for Tag-instances that use them. That is, given a Tag instance myTag:

# myTag is a Tag instance
myTag.accessKey = 'X'
myTag.style += 'padding:6px;'
myTag.setAttribute( 'name', 'tagname' )
# or myTag.attributes[ 'name' ] = 'tagname'
should eventually render markup that looks something like this:
<myTag accesskey="X" name="tagname" style="padding:6px;">
Given that I expected to implement at least two different ways to set attribute-values, using the setAttribute method and setting the values directly in an attributes dict, my first thought was to simply use an internal dict as the underlying data-storage mechanism for a Tag's attributes. The next potential concern was that as a dict, the attributes property would be both mutable and unconstrained. Specifically, because a dict would be mutable, it'd be possible to alter an attribute's value to something that isn't legitimate (a non-text value). It'd also be possible to set an attribute with a non-text name (key), because while the keys of a dict can't be just any type of value (they have to be hashable), they can be any of a lot of types that don't make sense as an attribute-name. For example:
tagInstance.attributes[ True ] = 'value'
shouldn't be valid as an attribute-name, but wouldn't raise an error when it was attempted, which I was concerned about on a longer-term basis. The mutability concern felt like it would become moot if the underlying dict that stores the attributes were able to perform type- and/or value-checking when setting keys and values.
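
To make that concern concrete, here is a short sketch (Python 3 syntax, with a plain dict standing in for the attributes collection). A plain dict accepts all of the assignments below without complaint, and only rejects a key that is unhashable:

```python
attributes = {}

attributes[True] = 'value'        # bool key: accepted
attributes[('a', 1)] = 'value'    # tuple key: accepted
attributes['title'] = 42          # non-text value: accepted

try:
    attributes[['a', 'list']] = 'value'
    rejected = None
except TypeError as error:
    # Lists are unhashable, so this is the one assignment that fails
    rejected = str(error)
```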

I've not seen any way to accomplish that sort of key- or value-constraint on a standard dict, which more or less required that I create a custom dict-equivalent or -subclass to handle that. I called it AttributesDict.

AttributesDict is a pretty sparse class — it's got an __init__, mostly to assure that the parent dict.__init__ is called, a couple of instance methods for checking the validity of attribute names and values, and an override of the __setitem__ method of the base dict that performs the type- and value-checking:

#-----------------------------------#
# Class attributes (and instance-   #
# attribute default values)         #
#-----------------------------------#

validNameRE = re.compile( '[_A-Za-z][-_A-Za-z0-9]*' )

# ...

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'name', 
    'the name to check as being valid as an attribute name'
)
@describe.returns( 'True if valid, False otherwise' )
@describe.raises( TypeError, 
    'if the supplied value is not a str or unicode value'
)
def IsValidName( self, name ):
    """
Determines whether the supplied name is valid as an attribute name"""
    if type( name ) not in ( str, unicode ):
        raise TypeError( '%s.IsValidName expects a str or '
            'unicode value, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, name, type( name ).__name__ ) )
    if '\n' in name or '\r' in name:
        return False
    if self.validNameRE.sub( '', name ) != '':
        return False
    # TODO: Other checks for validity of the name?
    return True

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the value to check as being valid in an attribute', 
    str, unicode
)
@describe.returns( 'True if valid, False otherwise' )
@describe.raises( TypeError, 
    'if the supplied value is not a str or unicode value'
)
def IsValidValue( self, value ):
    """
Determines whether the supplied value is valid as an attribute value"""
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.IsValidValue expects a str or '
            'unicode value, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ ) )
    if '\n' in value or '\r' in value:
        return False
    # TODO: Other checks for validity of the value?
    return True

@describe.AttachDocumentation()
@describe.argument( 'key', 
    'the key-name to set the value to',
    str, unicode
)
@describe.argument( 'value', 
    'the value to set in the key-name',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a key-name value that is not a str or unicode type'
)
@describe.raises( TypeError, 
    'if passed a member-value that is not a str or unicode type'
)
@describe.raises( AttributeError, 
    'if passed an invalid key-name'
)
@describe.raises( AttributeError, 
    'if passed an invalid member-value'
)
def __setitem__( self, key, value ):
    """
Override of standard dict.__setitem__ that checks the types and values of key 
and value arguments both before allowing the item to be set"""
    if not isinstance( key, ( str, unicode ) ):
        raise TypeError( '%s cannot accept key-names that are not str or '
            'unicode values, or a type derived from one of them. "%s" (%s) '
            'is not allowed' % ( 
                self.__class__.__name__, key, type( key ).__name__
            )
        )
    if not isinstance( value, ( str, unicode ) ):
        raise TypeError( '%s cannot accept member values that are not str '
            'or unicode values, or a type derived from one of them. '
            '"%s" (%s) is not allowed' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if not self.IsValidName( key ):
        raise AttributeError( '%s is not a valid attribute-name in a %s' 
            % ( key, self.__class__.__name__ )
        )
    if not self.IsValidValue( value ):
        raise AttributeError( '%s is not a valid attribute-value in a %s' 
            % ( value, self.__class__.__name__ )
        )
    dict.__setitem__( self, key, value )
The _Delattributes method of Tag then uses an instance of AttributesDict instead of a normal dict:
@describe.AttachDocumentation()
def _Delattributes( self ):
    """
"Deletes" the attributes of the instance by setting it to a new, empty 
AttributesDict instance"""
    self._attributes = AttributesDict()
and the constraint-concern is taken care of. Whether that's enough to resolve the mutability concern remains to be seen.
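
On the open mutability question, one thing that can be checked directly is which dict methods route through an overridden __setitem__. The sketch below uses a hypothetical, much-reduced SketchAttributesDict (Python 3, so str stands in for str/unicode, and the name check is simplified). In CPython, update() and setdefault() are inherited from dict and do not call the override:

```python
import re


class SketchAttributesDict(dict):
    """Hypothetical, reduced stand-in for AttributesDict (Python 3)."""

    validNameRE = re.compile(r'[_A-Za-z][-_A-Za-z0-9]*\Z')

    def __setitem__(self, key, value):
        # Type-checks first, then the name check, then the real assignment
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError('keys and values must both be str')
        if not self.validNameRE.match(key):
            raise AttributeError('"%s" is not a valid attribute-name' % key)
        dict.__setitem__(self, key, value)


checked = SketchAttributesDict()
checked['title'] = 'ok'            # goes through __setitem__

try:
    checked[True] = 'value'        # rejected by the override
    blocked = False
except TypeError:
    blocked = True

# Neither of these calls the overridden __setitem__ (CPython behavior),
# so both succeed even though the keys are not text
checked.update({True: 'value'})
checked.setdefault(42, 'value')
```

If those routes matter, they would need overrides of their own; whether the real AttributesDict does that is not shown here.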

Implementing Attribute-Properties

Five of the seven Tag-properties that were also attributes all followed a very similar implementation-pattern. Those five properties were:

  • accesskey
  • lang
  • style
  • tabindex
  • title
Here's what accesskey's methods look like, in detail:
#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.returns( 'str or unicode character, or None' )
def _GetaccessKey( self ):
    """
Returns the value of the instance's "accesskey" attribute"""
    return self._attributes.get( 'accesskey' )

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the value to set the instance\'s "accesskey" attribute to',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode value'
)
@describe.raises( ValueError, 
    'if passed a value that is more than one character in length'
)
@describe.raises( MarkupError, 
    'if passed a value that is not valid as an attribute value'
)
def _SetaccessKey( self, value ):
    """
Sets the value of the instance's "accesskey" attribute"""
    if not value:
        self._DelaccessKey()
    else:
        if not self._IsValidAttributeValue( value ):
            raise MarkupError( '%s.accessKey could not be set to "%s" '
                '- That value is not a valid attribute-value' % ( 
                    self.__class__.__name__, value
                )
            )
        if type( value ) not in ( str, unicode ):
            raise TypeError( '%s.accessKey expects a single-character '
                'str or unicode value, but was passed "%s" (%s)' % ( 
                    self.__class__.__name__, value, 
                    type( value ).__name__
                )
            )
        if len( value ) > 1:
            raise ValueError( '%s.accessKey expects a single-character '
                'str or unicode value, but was passed "%s" (%s)' % ( 
                    self.__class__.__name__, value, 
                    type( value ).__name__
                )
            )
        self._attributes[ 'accesskey' ] = value

# ...

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

@describe.AttachDocumentation()
def _DelaccessKey( self ):
    """
"Deletes" the value of the instance's "accesskey" attribute by removing 
it from the instance's attributes"""
    try:
        del self._attributes[ 'accesskey' ]
    except KeyError:
        # No such attribute available to delete; ignore
        pass
There was a bit more underneath the getter-, setter- and deleter-methods for lang and tabIndex and less for title's methods. The variations across those three properties were:
lang
Could potentially be validated against any of the various standard ISO language-codes, but I decided to leave that for later, if I even go that far. At present it only raises a TypeError.
tabIndex
The tab-index of a tag is a string representation of a non-negative integer value, so the setter checks to see if the value can be converted into an int, and raises a ValueError if it can't.
title
Has no value-checking, since it should be able to accept anything that is a valid attribute-value.
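
The tabIndex check described above could be sketched roughly as follows. This is a hypothetical stand-alone function for Python 3, not the actual setter; the rejection of negative numbers is an assumption drawn from the "non-negative" wording, and the error messages are invented for the example.

```python
def check_tab_index(value):
    """Hypothetical validator; returns the value unchanged if acceptable."""
    if not isinstance(value, str):
        raise TypeError(
            'tabIndex expects a str value, but was passed "%s" (%s)'
            % (value, type(value).__name__)
        )
    try:
        number = int(value)
    except ValueError:
        raise ValueError(
            'tabIndex expects the text of an integer, but was passed "%s"'
            % value
        )
    # Assumption: negative values are rejected, per "non-negative" above
    if number < 0:
        raise ValueError(
            'tabIndex expects a non-negative integer, but was passed "%s"'
            % value
        )
    return value


accepted = check_tab_index('3')
```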

Implementing Name-Conflicted Attribute-Properties

The last two properties that are also attributes are dir and id. The concern that I had with using those as names as-is was that my plan for creating a Tag instance was to build out an __init__ that looked like this:

def __init__( self, tagName, namespace, **attributes):
    # ...
that allowed attributes to be specified in the code using the attributes keyword-argument. That felt like it made the prospect of instantiating new Tags pretty simple and straightforward. However, since dir and id are defined as Python built-in functions, allowing those to be used as keywords felt... sketchy. By way of example, consider:
class Example( object ):
    def __init__( self, **kwargs ):
        print kwargs

testExample = Example( id='id-value', dir='dir-value' )
actually executes successfully (for now):
{'id': 'id-value', 'dir': 'dir-value'}
There's no guarantee that this would always be the case, though, and even if it never raises any errors in the future because of using a potentially-reserved word, declaring dir and id as explicitly-named arguments would make the built-in dir and id functions unavailable in the body of the function (passed through **kwargs as above, they are only keys in a dict, so the built-ins remain reachable). While I couldn't think of any use-case where that would've been a concern, I also couldn't say with any certainty that it wouldn't be a problem down the line either.
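
To narrow down where the shadowing actually happens, here is a small sketch (Python 3, hypothetical function names) contrasting the two ways of accepting id and dir. In this check the built-ins are hidden only when the names are declared as parameters; passed through **kwargs they are just dict keys.

```python
def with_kwargs(**kwargs):
    # 'id' and 'dir' are only keys in kwargs here, so the bare names
    # still refer to the built-in functions
    return callable(id) and callable(dir), sorted(kwargs)


def with_named_arguments(id=None, dir=None):
    # Here the parameters are local names that hide the built-ins
    return callable(id) and callable(dir)


kwargs_result = with_kwargs(id='id-value', dir='dir-value')
named_result = with_named_arguments(id='id-value', dir='dir-value')
```

kwargs_result comes back as ( True, [ 'dir', 'id' ] ), while named_result is False because the local id is now a string.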

I gave some serious consideration to the idea of establishing a pattern where any attributes specified that began with html would have the html stripped, and the rest reduced to lower-case before being stored as attributes. That would've allowed, for example, htmlId to be used as a keyword for the id attribute, which felt pretty good. Then I thought through what that would mean for an htmlClass attribute-specification. htmlClass would set and read the className property, and would tie to a class attribute. That felt awkward to me. Very awkward. I spent a lot of time going back and forth on various ways of implementing that before deciding that I couldn't really decide how I wanted things to work. Ultimately, in order to keep development moving, I ended up settling on a more brute-force approach, but one that I felt would be easier to refactor later if I could ever escape the analysis paralysis I was encountering about the different approaches. I ended up with a Tag.__init__ looking like this:

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()
def __init__( self, tagName, namespace=None, **attributes ):
    """
Instance initializer"""
    # Call parent initializers, if applicable.
    BaseNode.__init__( self )
    IsElement.__init__( self )
    # Set default instance property-values with _Del... methods as needed.
    # - Attributes first, since many of the rest use that
    self._Delattributes()
    # - Then the rest
    self._DelaccessKey()
    self._DelchildNodes()
    self._Delclass()
    self._DelhtmlDir()
    self._DelhtmlId()
    self._DelinnerHTML()
    self._Dellang()
    self._Delnamespace()
    self._Delstyle()
    # Set instance property values from arguments if applicable.
    self._SettagName( tagName )
    # Various attribute-setters that collide with "reserved" words 
    # in Python
    if 'className' in attributes:
        self._SetclassName( attributes[ 'className' ] )
        del attributes[ 'className' ]
    if 'htmlDir' in attributes:
        self._SethtmlDir( attributes[ 'htmlDir' ] )
        del attributes[ 'htmlDir' ]
    if 'htmlId' in attributes:
        self._SethtmlId( attributes[ 'htmlId' ] )
        del attributes[ 'htmlId' ]
    if 'htmlFor' in attributes:
        self.setAttribute( 'for', attributes[ 'htmlFor' ] )
        del attributes[ 'htmlFor' ]
    # The remaining (normal) attributes:
    if attributes:
        self._Setattributes( attributes )
    # The namespace
    if namespace:
        self._Setnamespace( namespace )
    # Other set-up
and htmlDir and htmlId properties (className has special considerations that I'll go into in a bit, and htmlFor is really only a convenience item for creating <label> tags, so I didn't feel the need to set up an htmlFor property).
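
For comparison only, the keyword translation could also be table-driven. The sketch below is a hypothetical alternative, not the approach taken in Tag.__init__: the mapping and function names are invented, it runs on Python 3, and it skips the property setters (and so the value-checks) that the explicit branches above go through.

```python
# Hypothetical mapping of "safe" keyword names to real attribute names
KEYWORD_TO_ATTRIBUTE = {
    'className': 'class',
    'htmlDir': 'dir',
    'htmlId': 'id',
    'htmlFor': 'for',
}


def translate_keywords(**attributes):
    """Returns a new dict keyed by the real attribute names."""
    return dict(
        (KEYWORD_TO_ATTRIBUTE.get(key, key), value)
        for key, value in attributes.items()
    )


translated = translate_keywords(
    htmlId='main', className='row', htmlFor='name', title='Example'
)
```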

The implementations of the getter-/setter-/deleter-methods for htmlId and htmlDir are very similar, though it seemed prudent to put some value-checks in htmlDir, since the attribute was not supposed to allow completely free-form values. htmlDir's related methods ended up looking like this:

    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#

    # ...

    @describe.AttachDocumentation()
    def _GethtmlDir( self ):
        """
Returns the value of the instance's "dir" attribute"""
        return self._attributes.get( 'dir' )

    # ...

    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#

    # ...

    @describe.AttachDocumentation()
    @describe.argument( 'value', 
        'the value to set the instance\'s "dir" attribute to',
        str, unicode
    )
    @describe.raises( TypeError, 
        'if passed a value that is not a str or unicode value'
    )
    @describe.raises( MarkupError, 
        'if passed a value that is not valid as an attribute value'
    )
    def _SethtmlDir( self, value ):
        """
Sets the value of the instance's "dir" attribute"""
        if value == None or value == '':
            self._DelhtmlDir()
        else:
            validValues = ( 'auto', 'ltr', 'rtl' )
            if not self.attributes.IsValidValue( value ):
                raise MarkupError( '%s.htmlDir could not be set to "%s" '
                    '- That value is not a valid attribute-value' % ( 
                        self.__class__.__name__, value
                    )
                )
            if type( value ) not in ( str, unicode ):
                raise TypeError( '%s.htmlDir expects a str or unicode '
                    'value, one of %s, but was passed "%s" (%s)' % ( 
                        self.__class__.__name__, str( validValues ), 
                        value, type( value ).__name__
                    )
                )
            if value.lower() not in validValues:
                raise ValueError( '%s.htmlDir expects a str or unicode '
                    'value, one of %s, but was passed "%s" (%s)' % ( 
                        self.__class__.__name__, str( validValues ), 
                        value, type( value ).__name__
                    )
                )
            self._attributes[ 'dir' ] = value

    # ...

    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#

    # ...

    @describe.AttachDocumentation()
    def _DelhtmlDir( self ):
        """
Deletes the value of the instance's "dir" attribute by removing 
it from the instance's attributes"""
        try:
            del self._attributes[ 'dir' ]
        except KeyError:
            # No such attribute available to delete; ignore
            pass

    # ...

    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#

    # ...

    htmlDir = describe.makeProperty(
        _GethtmlDir, _SethtmlDir, _DelhtmlDir, 
        'the value of the instance\'s dir attribute',
        str, unicode
    )

    # ...

Properties that Relate to Attributes

The className and classList properties have an interesting relationship on the browser side: Altering one affects the other, which means that it's possible to use array-based operations on classList, and those changes will carry through to className. Consider:

<div id="example" class="class1 class2 class3">Example div</div>
<script>
    example = document.getElementById( 'example' );
    console.log( 'example.className ... ' + example.className );
    console.log( 'example.classList ... ' + example.classList );
    console.dir( example.classList );
    console.log( 'Removing class2' );
    example.classList.remove( 'class2' );
    console.log( 'example.classList ... ' + example.classList );
    console.dir( example.classList );
    console.log( 'Adding class4 to className' );
    example.className += ' class4'
    console.log( 'example.className ... ' + example.className );
    console.log( 'example.classList ... ' + example.classList );
    console.dir( example.classList );
</script>
If this is executed in a browser (Chromium in my case), the console shows:
example.className ... class1 class2 class3
example.classList ... class1 class2 class3
   [DOMTokenList] ... [ 'class1', 'class2', 'class3' ]
Removing class2
example.classList ... class1 class3
   [DOMTokenList] ... [ 'class1', 'class3' ]
Adding class4 to className
example.className ... class1 class3 class4
   [DOMTokenList] ... [ 'class1', 'class3', 'class4' ]
That's actually kind of neat, I think.

I first discovered that when I was building out the first big list of properties and methods at the beginning of the markup module posts, and it got me thinking about doing the same sort of thing with the style attributes and a styleList attribute — having the ability to use list-operations against individual class-names and inline style specifications seemed like a potentially powerful tool to add. The real challenge felt like it would be in how to actually implement that sort of functionality, because of all the varied interactions needed:

  • The underlying storage still needs to live in an attribute-value, as a flat text-value if possible, so that special considerations don't have to be made for rendering the class and style attributes;
  • The getter-methods for the *List properties need to return a list-structure from the flat-text attribute-value;
  • The setter-methods for the *List properties need to accept a list, and generate the flat text-value in the applicable attribute;
  • The deleter-methods for the *List properties need to not destroy the interaction between the *List and non-*List properties.
I determined that all of this could be managed either by creating a class that derived from the built-in list and overrode the functions that allow the mutation of members (the full list of properties and methods is published on the Python site), or by creating a completely custom class that does all the list-emulation needed. In either case, any change to the members of the object would have to be able to call the appropriate setter-method of the instance, and the rest would take care of itself. The only other consideration is that CSS classes and inline styles have different member-separators: Classes use a space, and style-declarations use a semicolon.
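To make the separator point concrete, here is a minimal sketch of the two conversions that any such class would need, written in Python 3 syntax. The function names split_flat_value and join_flat_value are made up for this illustration and are not part of the markup module:

```python
def split_flat_value(value, separator):
    """Breaks a flat attribute-value into a list of trimmed, non-empty members."""
    if not value:
        return []
    return [part.strip() for part in value.split(separator) if part.strip()]


def join_flat_value(members, separator):
    """Collapses a list of members back into a flat attribute-value."""
    return separator.join(members)


# Classes are separated by spaces...
class_names = split_flat_value('class1 class2 class3', ' ')
# ...while inline style-declarations are separated by semicolons
styles = split_flat_value('color: red; margin: 0;', ';')

print(class_names)                        # ['class1', 'class2', 'class3']
print(styles)                             # ['color: red', 'margin: 0']
print(join_flat_value(class_names, ' '))  # class1 class2 class3
print(join_flat_value(styles, ';'))       # color: red;margin: 0
```

The same pair of operations covers both attributes; only the separator changes.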

At first, I wasn't sure if that would be too complex for what I needed — Since I'd just encountered the classList property in the last couple of weeks, I'd obviously never used it, so it wasn't a big concern for me to not include it. At the same time, it definitely felt like it could be of a lot of use, so I preferred to implement it if I could. As it turned out, though, it wasn't as complex as I'd feared. The implementation proof-of-concept code is too long for me to cover in great detail if I want to keep this post to anything close to a reasonable length, but I'll make it downloadable at the end of the post. Here's a quick summary of what I ended up with:

  • I defined a class (AttributeValueList) that derives from list;
  • I added pointers to the getter-, setter- and deleter-methods for the flat-text attributes to the __init__ of the new class, as well as a separator value that would be used elsewhere to fetch a new list from the flat-text value, or to join the instance's list as a new flat-text value:
    def __init__( self, getter, setter, deleter, separator, iterable=[] ):
        self._getter = getter
        self._setter = setter
        self._deleter = deleter
        self._separator = separator
        list.__init__( self, iterable )
  • I defined two helper-methods (_pullFromGetter() and _pushToSetter()) that would refresh the instance's list-values from the flat-text attribute and re-set the flat-text value from the current list-values, respectively:
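    def _pullFromGetter( self ):
        print '### Calling %s._pullFromGetter' % self.__class__.__name__
        # remove all current members, using the base list-methods 
        # so that the setter isn't triggered again
        while len( self ):
            list.pop( self )
        # get the new flat-text value and split it into members
        values = ( self._getter() or '' ).split( self._separator )
        # append each non-empty member to self
        for value in values:
            if value.strip():
                list.append( self, value.strip() )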
    
    def _pushToSetter( self ):
        print '### Calling %s._pushToSetter' % self.__class__.__name__
        self._setter( self._separator.join( self ) )
  • I overrode all of the methods of list that could affect the members of the base list, following a pattern like:
    def __some_list_method__( self, [args] ):
        # Call the original list-method against the instance, 
        # with the arguments passed to the method
        list.__some_list_method__( self, [args] )
        # Call a protected helper-method to update the 
        # "flat-text" value in the object that the 
        # instance relates to
        self._pushToSetter()
  • Finally, in a very stripped-down copy of Tag, I created basic property getter-, setter- and deleter-methods and the corresponding properties, wiring things up so that:
    • The deleter-methods for the *List properties created a new, empty instance of AttributeValueList;
    • The setter-methods for the *List properties removed all members from the current AttributeValueList storage-object, then added in the new values;
    • The setter-methods for the flat-text attributes would call the _pullFromGetter method of their AttributeValueList equivalent;
    The bare-bones implementation for the className/classList property-set in the POC shows all of that:
    def _GetclassName( self ):
        return self._attributes.get( 'class' )
    
    def _GetclassList( self ):
        return self._classList
    
    def _SetclassName( self, value ):
        self._attributes[ 'class' ] = value
        self._classList._pullFromGetter()
    
    def _SetclassList( self, value ):
        self._classList = AttributeValueList( 
            self._GetclassName, 
            self._SetclassName, 
            self._DelclassName, 
            ' ', value
        )
    
    def _DelclassName( self ):
        try:
            del self._attributes[ 'class' ]
        except KeyError:
            pass
    
    def _DelclassList( self ):
        self._classList = AttributeValueList( 
            self._GetclassName, 
            self._SetclassName, 
            self._DelclassName, 
            ' '
        )
    
    className = property( _GetclassName, _SetclassName, _DelclassName )
    classList = property( _GetclassList, _SetclassList, _DelclassList )
That proved out enough of the concept that I could run with it behind the scenes. The quick-and-nasty testing from the POC script performed a few typical/basic manipulations:
def printItem( item ):
    print '+- className .............. %s (%s)' % ( 
        item.className, type( item.className ).__name__ )
    print '+- classList .............. %s (%s)' % ( 
        item.classList, type( item.classList ).__name__ )

example = Tag()
print 'example Tag: %s' % example
print '+- classList._getter ...... %s' % ( example.classList._getter.__name__ )
print '+- classList._setter ...... %s' % ( example.classList._setter.__name__ )
print '+- classList._deleter ..... %s' % ( example.classList._deleter.__name__ )
print '+- classList._separator ... "%s"' % ( example.classList._separator )
print '#' + '-'*38 + '#'

print 'example'
printItem( example )

print '| == example.classList += [ \'addedClass\' ]'
example.classList += [ 'addedClass' ]
printItem( example )

print '| == example.className = \'class1 class2\''
example.className = 'class1 class2'
printItem( example )

print '| == example.classList.remove( \'class1\' )'
example.classList.remove( 'class1' )
printItem( example )

print '| == example.classList.append( \'class4\' )'
example.classList.append( 'class4' )
printItem( example )

print '| == example.classList.insert( 1, \'class1\' )'
example.classList.insert( 1, 'class1' )
printItem( example )

print '| == example.classList += [ \'addedClass\' ]'
example.classList += [ 'addedClass' ]
printItem( example )
and yielded expected results for those actions/operations:
example Tag: <__main__.Tag object at 0x7f996ac85e10>
+- classList._getter ...... _GetclassName
+- classList._setter ...... _SetclassName
+- classList._deleter ..... _DelclassName
+- classList._separator ... " "
#--------------------------------------#
example
+- className .............. None (NoneType)
+- classList .............. []
| == example.classList += [ 'addedClass' ]
+- className .............. addedClass (str)
+- classList .............. ['addedClass']
| == example.className = 'class1 class2'
+- className .............. class1 class2 (str)
+- classList .............. ['class1', 'class2']
| == example.classList.remove( 'class1' )
+- className .............. class2 (str)
+- classList .............. ['class2']
| == example.classList.append( 'class4' )
+- className .............. class2 class4 (str)
+- classList .............. ['class2', 'class4']
| == example.classList.insert( 1, 'class1' )
+- className .............. class2 class1 class4 (str)
+- classList .............. ['class2', 'class1', 'class4']
| == example.classList += [ 'addedClass' ]
+- className .............. class2 class1 class4 addedClass (str)
+- classList .............. ['class2', 'class1', 'class4', 'addedClass']
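
For anyone following along in Python 3, the pieces described above can be condensed into a sketch like the following. It is not the POC code: only a handful of the mutating list-methods are overridden, the deleter is left out, and the snake_case helper names are assumptions made for this example.

```python
class AttributeValueList(list):
    """A list whose mutations are written back to a flat text-value."""

    def __init__(self, getter, setter, separator, iterable=()):
        self._getter = getter
        self._setter = setter
        self._separator = separator
        super().__init__(iterable)

    def pull_from_getter(self):
        # Rebuild the members from the flat value, using the base list
        # methods directly so that no write-back is triggered
        flat = self._getter() or ''
        parts = [p.strip() for p in flat.split(self._separator)]
        list.clear(self)
        list.extend(self, [p for p in parts if p])

    def _push_to_setter(self):
        self._setter(self._separator.join(self))

    def append(self, item):
        list.append(self, item)
        self._push_to_setter()

    def remove(self, item):
        list.remove(self, item)
        self._push_to_setter()

    def insert(self, index, item):
        list.insert(self, index, item)
        self._push_to_setter()

    def __iadd__(self, other):
        list.extend(self, other)
        self._push_to_setter()
        return self


class Tag:
    """A very stripped-down stand-in for the real Tag class."""

    def __init__(self):
        self._attributes = {}
        self._classList = AttributeValueList(
            self._get_class_name, self._set_class_name, ' ')

    def _get_class_name(self):
        return self._attributes.get('class')

    def _set_class_name(self, value):
        self._attributes['class'] = value
        self._classList.pull_from_getter()

    def _get_class_list(self):
        return self._classList

    def _set_class_list(self, value):
        # After "+=", value is the stored list itself, so copy it first
        members = list(value)
        list.clear(self._classList)
        list.extend(self._classList, members)
        self._attributes['class'] = ' '.join(members)

    className = property(_get_class_name, _set_class_name)
    classList = property(_get_class_list, _set_class_list)


example = Tag()
example.classList += ['addedClass']
example.className = 'class1 class2'
example.classList.remove('class1')
example.classList.append('class4')
example.classList.insert(1, 'class1')
print(example.className)  # class2 class1 class4
print(example.classList)  # ['class2', 'class1', 'class4']
```

The key detail is that pull_from_getter goes through list.clear and list.extend rather than the overridden methods; otherwise every refresh would push back to the setter, which would refresh again, and so on.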

As a side-note: There were other ways to accommodate the list- and non-list versions of both attributes. One that I contemplated for a while was to simply store the actual list-of-string values, and just collapse those down during the rendering process. The problem that I ended up having with that approach was a combination of the discrepancy between the attribute- and property-names for class/className and a strong desire to be able to just dump the attribute keys and values during rendering. For style that wasn't a concern, since the property- and attribute-names are identical. Trying to come up with a process that would handle rendering the class from a className that was, in turn, calculated from classList started giving me a headache pretty quickly, though I believe I found a way to make it workable. The trade-off, though, was the addition of what might be called special handling for just that one attribute. I'm not a big fan of hard-coding exceptions into code if there's any viable way around it. Ultimately, that preference on my part was why I took the path I did.
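
To illustrate the kind of special handling that the store-the-list alternative would have needed, here is a rough Python 3 sketch. Everything in it (the SPECIAL_CASES table, the render_attributes function, storing values under property-names) is hypothetical and only shows the shape of the trade-off, not anything that exists in the module:

```python
# Hypothetical lookup of the properties that can't simply be dumped
# as-is during rendering
SPECIAL_CASES = {
    # property-name: ( attribute-name, separator )
    'classList': ('class', ' '),
    'styleList': ('style', ';'),
}


def render_attributes(properties):
    """Renders a dict of property-values as a markup attribute-string."""
    rendered = []
    for name, value in properties.items():
        if name in SPECIAL_CASES:
            # The hard-coded exception: rename the property and
            # collapse its list down to a flat text-value
            name, separator = SPECIAL_CASES[name]
            value = separator.join(value)
        rendered.append('%s="%s"' % (name, value))
    return ' '.join(rendered)


print(render_attributes({'id': 'example', 'classList': ['class1', 'class3']}))
# id="example" class="class1 class3"
```

Keeping the flat text-value as the real storage avoids that lookup table entirely, since rendering can then iterate over the attribute keys and values without knowing anything about them.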

Parents and Children, Nodes and Elements, and Their Related Properties

In the JavaScript world that I'm modeling the properties and methods of Tag after, there is a distinction between nodes and elements. An element is a type of node, as are text-nodes, comments, and (presumably) CDATA sections. What sets an element apart from other nodes, as far as my analysis seemed to indicate, is that elements can have children, which are also nodes, and may be elements.

It seems likely that's why JavaScript elements have both children and childNodes properties, and why there are members like lastChild and lastElementChild — to allow retrieval of either all children, or only children that are elements.

While I didn't see a whole lot of use for that distinction while working on the baseline notes and ideas for the markup module, providing as close a parallel functionality-set as I could more or less required that I implement all of those members as well. The foundation of all of them was the childNodes property, and the storage of child BaseNode objects therein.

When push comes to shove, BaseNode children of a Tag are a sequence of objects, so I started with a basic Python list to store them as a proof of concept, but it shared a lot of the concerns that led to the creation of AttributesDict: The underlying list was mutable and unconstrained, so it would be possible to directly insert a member that wasn't valid as a child. Fundamentally, the ElementList class that I built to handle the constraint wasn't all that different from the AttributeValueList class I mentioned earlier. The main difference was that there was no reason to care when items were removed, so the method-overrides that related to that could be stripped out, leaving:

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

@describe.AttachDocumentation()
def __add__( self, y ):
    """
Override of the base method from list that performs type-checking on the item 
to be added"""
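    # y is the list being added, so check each of its members
    badItems = [ item for item in y if not isinstance( item, BaseNode ) ]
    if badItems:
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members, but was passed %s which are not allowed.' % ( 
                self.__class__.__name__, badItems )
            )
    return list.__add__( self, y )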

def __iadd__( self, y ):
    """
Override of the base method from list that performs type-checking on the item 
to be added"""
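    # y is an iterable of items to add, so check each of its members
    y = list( y )
    badItems = [ item for item in y if not isinstance( item, BaseNode ) ]
    if badItems:
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members, but was passed %s which are not allowed.' % ( 
                self.__class__.__name__, badItems )
            )
    for item in y:
        list.append( self, item )
    return self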

def __imul__( self, y ):
    """
Override of the base method from list that performs type-checking on the 
multiplier (the members being repeated have already been type-checked)"""
    if not isinstance( y, ( int, long ) ):
        raise TypeError( '%s can only be multiplied by an int or long '
            'value: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, y, type( y ).__name__ )
            )
    list.__imul__( self, y )
    return self

def __mul__( self, y ):
    """
Override of the base method from list that performs type-checking on the 
multiplier (the members being repeated have already been type-checked)"""
    if not isinstance( y, ( int, long ) ):
        raise TypeError( '%s can only be multiplied by an int or long '
            'value: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, y, type( y ).__name__ )
            )
    return list.__mul__( self, y )

def __rmul__( self, y ):
    """
Override of the base method from list that performs type-checking on the 
multiplier (the members being repeated have already been type-checked)"""
    if not isinstance( y, ( int, long ) ):
        raise TypeError( '%s can only be multiplied by an int or long '
            'value: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, y, type( y ).__name__ )
            )
    return list.__rmul__( self, y )

def __setitem__( self, i, y ):
    """
Override of the base method from list that performs type-checking on the item 
to be set"""
    if not isinstance( y, BaseNode ):
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, y, type( y ).__name__ )
            )
    list.__setitem__( self, i, y )

def __setslice__( self, i, j, y ):
    """
Override of the base method from list that performs type-checking on the item 
to be set"""
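    # y is an iterable of replacement items, so check each of its 
    # members (using a name other than i, which is the slice-start)
    y = list( y )
    badItems = [ item for item in y if not isinstance( item, BaseNode ) ]
    if badItems:
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members, but was passed %s which are not allowed.' % ( 
                self.__class__.__name__, badItems )
            )
    list.__setslice__( self, i, j, y )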

def append( self, y ):
    """
Override of the base method from list that performs type-checking on the item 
to be appended"""
    if not isinstance( y, BaseNode ):
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, y, type( y ).__name__ )
            )
    list.append( self, y )

def extend( self, iterable ):
    """
Override of the base method from list that performs type-checking on the items 
to be extended"""
    badItems = [ i for i in iterable if not isinstance( i, BaseNode ) ]
    if badItems:
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members, but included %s which are not allowed.' 
            % ( self.__class__.__name__, badItems )
        )
    list.extend( self, iterable )

def insert( self, index, obj ):
    """
Override of the base method from list that performs type-checking on the item 
to be inserted"""
    if not isinstance( obj, BaseNode ):
        raise TypeError( '%s is only allowed to have BaseNode-'
            'derived members: "%s" (%s) is not allowed.' % ( 
                self.__class__.__name__, obj, type( obj ).__name__ )
            )
    list.insert( self, index, obj )

The childNodes property was set up to use an instance of ElementList for its storage, and is read-only. The children property, also read-only, was built using a list comprehension that filtered childNodes down to only those members that were instances of IsElement:

@describe.AttachDocumentation()
def _Getchildren( self ):
    """
Gets the sequence of all children of the instance that are elements"""
    return [ 
        c for c in self._childNodes 
        if isinstance( c, IsElement )
    ]
With those two properties in place, a lot of the remaining properties were easily implemented:
childElementCount
The number of members of children
firstChild
The first member of childNodes
firstElementChild
The first member of children
lastChild
The last member of childNodes
lastElementChild
The last member of children
nextElementSibling
The first member of the parent's childNodes after the index of the instance itself that is an IsElement instance
nextSibling
The first member of the parent's childNodes after the index of the instance itself
previousElementSibling
The nearest member of the parent's childNodes before the index of the instance itself that is an IsElement instance
previousSibling
The member of the parent's childNodes immediately before the index of the instance itself
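As a rough illustration of how little code those definitions require, here is a Python 3 sketch using stand-in Node and Element classes. The class names, the appendChild helper and the _following/_preceding helpers are hypothetical, the sibling lookups are assumed to search the parent's childNodes, and only some of the properties are shown:

```python
class Node:
    """Stand-in for a non-element node (text, comment, etc.)."""

    def __init__(self):
        self.parent = None

    def _following(self):
        # Members of the parent's childNodes after this node
        if self.parent is None:
            return []
        siblings = self.parent.childNodes
        return siblings[siblings.index(self) + 1:]

    def _preceding(self):
        # Members of the parent's childNodes before this node,
        # nearest first
        if self.parent is None:
            return []
        siblings = self.parent.childNodes
        return siblings[:siblings.index(self)][::-1]

    @property
    def nextSibling(self):
        return next(iter(self._following()), None)

    @property
    def previousSibling(self):
        return next(iter(self._preceding()), None)

    @property
    def nextElementSibling(self):
        return next(
            (n for n in self._following() if isinstance(n, Element)), None)

    @property
    def previousElementSibling(self):
        return next(
            (n for n in self._preceding() if isinstance(n, Element)), None)


class Element(Node):
    """Stand-in for Tag: a node that can have children."""

    def __init__(self):
        super().__init__()
        self.childNodes = []

    def appendChild(self, node):
        node.parent = self
        self.childNodes.append(node)
        return node

    @property
    def children(self):
        return [c for c in self.childNodes if isinstance(c, Element)]

    @property
    def childElementCount(self):
        return len(self.children)

    @property
    def firstChild(self):
        return self.childNodes[0] if self.childNodes else None

    @property
    def lastElementChild(self):
        children = self.children
        return children[-1] if children else None


parent = Element()
first = parent.appendChild(Element())
text = parent.appendChild(Node())
last = parent.appendChild(Element())
print(parent.childElementCount)              # 2
print(first.nextSibling is text)             # True
print(first.nextElementSibling is last)      # True
print(last.previousElementSibling is first)  # True
```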
The parentElement and parentNode properties seemed to me to be needlessly confusing — As far as I've been able to tell, there is no way for a non-element node to be a parent to another node. I looked around for a while to see if I was missing anything, but couldn't find anything that led me to think otherwise. Ultimately I decided to collapse the two of them down into the parent property.

The Remaining Properties

The majority of the remaining properties' implementations either follow some simple variation of my normal property-getter, -setter and -deleter structure, storing the value in an underlying protected local attribute, or are calculated in some fashion from another property or one of the underlying local attributes. The namespace and namespaceURI properties are a perhaps-typical example of that structure as it applies to a non-simple underlying-attribute type/value.

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _Getnamespace( self ):
    """
Gets the Namespace associated with the instance"""
    if self._namespace:
        return self._namespace
    else:
        if self.parent:
            return self.parent.namespace
        else:
            return None

@describe.AttachDocumentation()
def _GetnamespaceURI( self ):
    """
Gets the URI of the Namespace associated with the instance"""
    if self.namespace:
        return self.namespace.namespaceURI
    return None

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'The Namespace, or the URI/unique identifier of the Namespace to '
    'associate with the instance', 
    Namespace, str, unicode
)
@describe.raises( TypeError, 
    ''
)
def _Setnamespace( self, value ):
    """
Sets the Namespace association for the instance"""
    if not value:
        self._Delnamespace()
        return
    if type( value ) in ( str, unicode ):
        try:
            value = Namespace.GetNamespace( value )
        except MarkupError:
            value = None
    if not isinstance( value, Namespace ):
        raise TypeError( '%s.namespace expects a Namespace instance, '
            'or a str or unicode URI value of a registered Namespace, '
            'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, 
                type( value ).__name__
            )
        )
    self._namespace = value

# ...

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

@describe.AttachDocumentation()
def _Delnamespace( self ):
    """
"Deletes" the namespace-association of the instance by setting it to None"""
    self._namespace = None
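
The practical effect of that getter is that a namespace only needs to be set once, on the outermost tag it applies to, and every descendant picks it up through its parent. A small Python 3 sketch of that behavior, using bare-bones stand-ins for Namespace and Tag whose constructor arguments are assumptions made for this example:

```python
class Namespace:
    """Stand-in for the module's Namespace class."""

    def __init__(self, namespaceURI):
        self.namespaceURI = namespaceURI


class Tag:
    """Stand-in for Tag, with only the namespace-related members."""

    def __init__(self, parent=None, namespace=None):
        self.parent = parent
        self._namespace = namespace

    @property
    def namespace(self):
        # A locally-set namespace wins; otherwise defer to the parent
        if self._namespace:
            return self._namespace
        if self.parent:
            return self.parent.namespace
        return None

    @property
    def namespaceURI(self):
        namespace = self.namespace
        return namespace.namespaceURI if namespace else None


svg = Namespace('http://www.w3.org/2000/svg')
root = Tag(namespace=svg)
child = Tag(parent=root)
grandchild = Tag(parent=child)
orphan = Tag()

print(grandchild.namespaceURI)  # http://www.w3.org/2000/svg
print(orphan.namespaceURI)      # None
```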

There is one remaining property that I can't implement just yet: innerHTML. The getter-method side of it is pretty straightforward, since it really just boils down to rendering the children of the instance. While that's in the method-members that I haven't touched on yet (__str__ and/or __unicode__), I don't expect the getter functionality of innerHTML to be much more than a call to one or the other. On the setter side of the property, though, I need the ability to parse markup to be functional before I can work that out. That means that innerHTML is waiting on the implementation of the MarkupParser class.
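
As a guess at what the getter side could look like once rendering exists, here is a Python 3 sketch. The Text and Tag stand-ins below are hypothetical and deliberately naive: there are no attributes, no escaping and no void elements, just enough to show innerHTML as a join over the rendered children:

```python
class Text:
    """Stand-in for a text-node."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


class Tag:
    """Stand-in for Tag, rendering only a tag-name and children."""

    def __init__(self, tagName, *childNodes):
        self.tagName = tagName
        self.childNodes = list(childNodes)

    def __str__(self):
        return '<%s>%s</%s>' % (self.tagName, self.innerHTML, self.tagName)

    @property
    def innerHTML(self):
        # Render each child in order and concatenate the results
        return ''.join(str(child) for child in self.childNodes)


div = Tag('div', Text('Example '), Tag('em', Text('div')))
print(div.innerHTML)  # Example <em>div</em>
print(div)            # <div>Example <em>div</em></div>
```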

That wraps up the properties of Tag. I promised earlier to make the proof-of-concept code for AttributeValueList available for download, so before I stop, here it is: