I Dream In Code: Generating and Parsing Markup in Python [9]

The last missing piece in parsing markup is figuring out how to reliably identify documents in a chunk of markup-text. The basic parsing-process already identifies them as tags, and even assigns the appropriate namespace if one is provided in the original markup (in an xmlns attribute). Different document-types, though, have different output rules once they're being rendered to a client browser. I've noted this before, but put off actually defining concrete document-classes until the parsing-process was complete. Now it's time to take a more complete look at those.

Where Things Stand Right Now

In the original planning for the markup module, I'd specified an abstract base class (BaseDocument) that would be used as the starting-point for concrete classes for specific document-types:

I anticipate two patterns to show up in application-code that would result in a document being created:

Creating a new document as an object in an application's code, which might look something like this:
```
newDocument = HTML5Document( ... )
```
Parsing a document from markup-source (obviously, perhaps, since the MarkupParserclass was just completed in the last post); and

The first of those scenarios is pretty straightforward, at least in the context of developer-side interaction with the framework. I can't think of anything meaningful to discuss about that use-case that won't, hopefully, be obvious from the signature of the applicable __init__ method.

Documents from a `MarkupParser`

The use-case of the last of those scenarios is the one that I was most concerned with when I started this series of posts. The relevant goal was to provide designer-side control over a document-type, simply by changing the document-specification in some fashion in an XML template-file that they have ready access to, without requiring access to any application-code. That XML template-file could (and should, for HTML5 and XHTML documents) still use HTML tag-names, in order to keep things reasonably familiar.

Truth be told, I spent weeks working through this aspect of document-creation, pondering a number of variants, trying to determine how they would fit in to the markupParser class, etc., etc. At times, though some of the solutions I arrived at were reasonably elegant (and others turned out to be horrible kudges), I either lost sight of that designer-side control goal, or the results, while probably functional, just felt too complicated. It wasn't until I took a break from the code, took a step back, and started asking myself what does that designer-side-control goal actually mean that I hit on the approach that I'm going to implement.

So, what does the designer-side-control goal really mean? It's actually not all that complicated. If the development efforts of a web-application are broken into developer-side and designer-side tasks and needs, the designer side of that equation needs:

The ability to modify a page-template without having to worry about breaking existing application functionality

So, for example, if a page's layout needs to be altered by moving content (and, down the line, page-components) around, all that the designer should have to modify is the page-template file.

The ability to add existing functionality into a page-template that it doesn't currently exist in

This essentially boils down to being able to add ready-for-use page-components to an existing page-template, and have it just work. I haven't started exploring the idea of page-components just yet, but it's something that I will touch on (at least conceptually) for a while before I work out actual implementations. At present, what I have in mind for page-components looks something like this:

Page components will be represented in page templates as tags;
Those tags will map back to server-side objects in some fashion;
The component-tags will allow in-markup templating of their content;
Components may act as a View in a perhaps-formal server-side MVC pattern (it's too early to determine that for certain just yet, but that's a should have goal for them at any rate)

The ability to specify (and thus change) output document-types without having to deal with the application code itself

This specification needs to be as simple as possible, in order to cater to the possibilities of:

People in a designer role who don't have much experience with whatever markup language(s) are in use (which also helps, at some level, to future-proof the page-templates and their corresponding processes);
and/or
To accommodate for the possibility that the tools being used to modify page-templates might mangle markup that is too complex, or doesn't conform to some known standard (a too-corrective editor, one that fixes markup it determines to be invalid, breaking the template's markup-requirements in the process).

I don't expect either of these to be a major concern, but I do feel that they need to be addresses, and the resulting ease-of-use should be beneficial even for experienced designers.

This is where my main hang-up with various implementations came into play, prompting my re-think of the process that led to the implementation here.

Not having a bunch of functional code living in page-templates

This ties in with the "existing functionality" item above, and also relates to the page-component concept. One of the things that I don't like about many of the Python frameworks I've looked at is the inclusion of logical control structures in the template markup. Django is one of the more popular of those frameworks, and is pretty mature, so I'll use its built-in template tags and filters as an example. It provides dozens of logical/functional items, a simple example of which is:

<ul>
{% for athlete in athlete_list %}
    <li>{{ athlete.name }}</li>
{% endfor %}
</ul>

Now, I'll grant that it's a well-thought-out collection of capabilities, hands down. I'll also grant that it provides designer-level presentation-control. The aspects that really bother me about this structure are:

That's (potentially) a lot of additional pseudo-markup to potentially throw at a (potentially) non-technical user, particularly in more complex function-and-design scenarios.
A (perhaps) typical designer developer is more likely to be using one of the various WYSIWYG editors (Dreamweaver, for example), and the additional template-code may make it difficult, or even impossible to really work in that WYSIWYG mode. Granted, my recent experience with that kind of scenario has been pretty solidly shaped by the environment I was in — an advertising agency — where the standard tool-chain for designers and creatives included Dreamweaver, so I may be more concerned about this than is really warranted, but even so...
At a more functional level, perhaps, it just feels wrong to me to have decision-making happening in the presentation layer. Simple iteration makes some sense, though I'd hope that something on the server side would take care of that, flat out. Barring that, if there's functionality that must happen in the presentation layer, I'd rather it be done with something that already exists, like JavaScript (though I might have issue with that too).

That also completely ignores the potential for conflicts between server-side templating and some of the common/popular client-side templating engines, like Angular JS. To be fair, I don't know that any specific client-side engine's syntax actually collides with any server-side templating structure/syntax, but I definitely see the potential for it happening (without, admittedly, having actually taken the time to try and implement something like a Django/Angular application).

Where I Ultimately Landed

The ultimate question, on the parsing side of things at least, that drove the implementation of document-parsing boiled down to

How can a designer specify a document type in the markup in order for it to be recognized and used accordingly?

I examined several possibilities in my initial approaches:

Using a custom XML processing instruction
Using the DOCTYPE
Using the xmlns attribute
Using a custom attribute on the root tag (initially, I was thinking of a data-* attribute)

The first three approaches all had various issues that I felt over-complicated the framework-code, added requirements to template-markup that felt like they'd either confuse newer designers needlessly, or risk breaking templates in too-corrective editors. To be fair, I'm not aware of any current versions of markup-editors that fall into this category, but I distinctly remember an earlier version of Adobe's Dreamweaver causing this sort of problem in the past, and other editors may well cause similar issues even today.

So, the final rule-set and process for parsing page templates ends up as:

Page templates will follow XML structure- and tag-closure rules — for all practical purposes, they are XML documents, though they can (and will) use and support HTML tag-names and attributes, even those introduced in HTML 5.
The BaseDocument abstract class will be used to collect all of the functionality common to all the concrete document-type classes.
There will be discrete concrete classes for both the current and most-recent non-current HTML versions (i.e., HTML 5 and XHTML).
There will also be a concrete class for generic XML documents — not so much because I expect to need one as because I can't rule out the potential need for one.
Each of those document-classes, derived from BaseDocument, will be responsible for defining several things:
- Any items/values that should be rendered for an instance of the document-type that occur before the standard Tag-rendering process takes over:
  
  HTML 5
  
  The <!DOCTYPE> declaration
  
  XHTML
  
  The <!DOCTYPE> declaration, and the xmlns attribute of the root <html> tag
  
  XML
  
  The XML declaration (<?xml version="1.0" encoding="UTF-8"?>), any XML processing-instructions (<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">, for example), and a <!DOCTYPE> if one is specified
- A common class-method (FromTag) that can be passed a Tag (and its children) that will return the same DOM-tree structure as an instance of the class.
- A root-tag name that can be used to verify the document root during the parsing process.
- The Namespace instance that relates to the document, if one is applicable.
Each concrete class will be registered with the MarkupParser class, using a unique and designer-meaningful name (html5 or xhtml, for example), allowing MarkupParser to look up a concrete document-class by that name and either create an instance of the document-class as the top of the DOM-results, or call the document-class' FromTag method when the root tag is closed to convert those results to the appropriate document-type.
The root-tag of a page template will include a document-type attribute whose value will be one of the names registered with MarkupParser, tying the designer-specified document-type to the registered BaseDocument-derived document-class during parsing.

Changes to `BaseDocument`

Some basic definition for BaseDocument was done a couple posts back, but with the final decisions above now made, there are some potentially-significant alterations that need to be made.

Hooks for Concrete Class Properties

The root-tag-name, namespace and document-type name that will be registered with MarkupParser are values that should be constant for each concrete subclass of BaseDocument, so I can't think of any reason not to make them properties of those classes, and set up default values for them in BaseDocument:

#-----------------------------------#
# Class attributes (and instance-   #
# attribute default values)         #
#-----------------------------------#

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = None
# The name that will be used to register the class with MarkupParser
_markupParserName = None
# The root tag for an instance of the document-type, if applicable
_rootTagName = None

The document-type's associated Namespace feels like it should be required for instances of HTML5Document and XHTMLDocument, but not for XMLDocument (since any given XML document could, by definition, have its own distinct namespace). That's something that I'll have to cover in more detail further on, once I start implementing XMLDocument, but for now, it suffices to know that the execution of any checks of <class>._documentNamespace will have to reside in the applicable derived classes. It wouldn't hurt to have a common helper-method that BaseDocument does provide, though, to perform that check:

@describe.AttachDocumentation()
@describe.raises( AttributeError, 
    'if the instance\'s class does not have a Namespace defined'
)
@describe.raises( TypeError, 
    'if the instance\'s class defines a namespace that is not an '
    'instance of Namespace'
)
def _CheckNamespace( self ):
    """
Checks the instance's class to see if it has a Namespace association defined."""
    if self.__class__._documentNamespace == None:
        raise AttributeError( '%s does not have a defined Namespace '
            'relationship. Please set its _documentNamespace class-'
            'attribute to an instance of a Namespace' % ( 
                self.__class__.__name__
            )
        )
    if not isinstance( self.__class__._documentNamespace, Namespace ):
        raise TypeError( 
            '%s defines a document-namespace relationship in its '
            '_documentNamespace class-attribute, but it is a %s, '
            'not a Namespace instance' % ( 
                self.__class__.__name__, 
                type( self.__class__._documentNamespace ).__name__
            )
        )

That method, _CheckNamespace, is then available to any derived class' instances, and can be used to enforce the namespace requirement during the __init__ process of document-types that must have a namespace.

A similar helper-method, _CheckTagName, which can be used to check for class-attributes that correctly define a root tag-name in document-types where one is relevant would also be useful:

@describe.AttachDocumentation()
@describe.raises( AttributeError, 
    'if the instance\'s class does not have a root tag-name defined'
)
@describe.raises( TypeError, 
    'if the instance\'s class defines a root tag-name that is not '
    'valid'
)
def _CheckTagName( self ):
    """
Checks the instance's class to see if it has a valid root-tag-name association 
defined."""
    try:
        rootTagName = self.__class__._rootTagName
    except:
        raise AttributeError( '%s does not have a root-tag-name defined '
            'as a class attribute' % ( self.__class__.__name__ ) )
    if not rootTagName:
        raise AttributeError( '%s does not have a root-tag-name defined '
            'as a class attribute' % ( self.__class__.__name__ ) )
    if not self._IsValidTagName( rootTagName ):
        raise ValueError( '%s has a root-tag-name defined ("%s"), but it'
            'is not a valid tag-name' % ( 
                self.__class__.__name__, rootTagName
            )
    )

This would be used in a manner similar to _CheckNamespace during initialization of a document-type instance where it's relevant.

Both of these methods will be called in the __init__ of HTML5Document and XHTMLDocument, and I'll show that in a while.

The FromTag class-method can be defined as a member of BaseDocument, so long as it can handle document-type classes that don't specify a root tag-name or namespace:

@classmethod
@describe.AttachDocumentation()
@describe.argument( 'sourceTag', 
    'the Tag instance that contains the markup to be copied into the '
    'resulting document instance',
    Tag
)
@describe.returns( 'An instance of the class' )
@describe.raises( TypeError, 
    'if passed a sourceTag value that is not an instance of Tag'
)
def FromTag( cls, sourceTag ):
    """
Creates an instance of the class and populates it with the attributes and 
children of the provided source-tag"""
    if not isinstance( sourceTag, Tag ):
        raise TypeError( '%s.FromTag expects a Tag instance for its '
            'sourceTag argument, but was passed "%s" (%s) instead' % 
            ( cls.__name__, type( sourceTag ).__name__ )
    )
    # Use the tag-name from the document class, or from the sourceTag's 
    # top-level node if no root tag-name is specified for the class
    if cls._rootTagName:
        result = cls( cls._rootTagName )
    else:
        result = cls( sourceTag.tagName )
    # Add the default namespace, if applicable, or the source-tag's 
    # namespace if *that* is applicable
    if cls._documentNamespace:
        result._SetNamespace( cls._documentNamespace )
    elif sourceTag.Namespace:
        result._SetNamespace( sourceTag.Namespace )
    # Copy the attributes over first
    for attr in sourceTag.attributes:
        result.setAttribute( attr, sourceTag.attributes[ attr ] )
    # Then the children
    while sourceTag.childNodes:
        result.appendChild( 
            sourceTag.removeChild( 
                sourceTag.childNodes[ 0 ]
            )
        )
    # Then return the result
    return result

Even if MarkupParser ends up creating a BaseDocument instance for its main results, the ability to convert an arbitrary Tag into one of the provided document-types feels like it might be useful, so having this defined doesn't feel bad.

But... That's not really any closer...

While the FromTag method is, I think, pretty neat in its own right, that doesn't really get me any closer to parsing whole documents from template-files. There are a few things that need to happen, and some of the hooks put in play for that method will help. Ultimately, what needs to happen during parsing ends up being a pretty simple sub-section of the current _NodesFromTokens method:

When a node is created from a token, some sort of check needs to be made to see if should be a document-root node for the namespace of the node.
That means that the individual Namespace globals needs to be able to keep track of what root tag-names it needs to be concerned with;
That also means that a given Namespace global needs to know what BaseDocument-derived class should be instantiated when a root-node creation is detected.

In my first pass in solving this challenge, I started down the path of adding more and more stuff to the Namespace instances (HTML5Namespace, etc.). That might've been workable, but after putting the class-level document-namespace hook-attribute in place for FromTag, that led to a circular reference issue: The document-class needed to know aqbout the namespace-class, and vice versa. Taking a step back, and thinking about the issue a bit more, I eventually came to the conclusion that while the Namespace globals did need to know about their related documents, there was no reason that it had to happen during object-initialization. As part of that initial pursuit, I'd already created the Namespace-instance properties needed to store the document-class value, as well as the root tag:

# ...

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _GetDocumentClass( self ):
    """
Gets the class to be used as root-document nodes"""
    return self._documentClass

# ...

@describe.AttachDocumentation()
def _GetRootTagName( self ):
    """
Gets the root tag-name of documents of the namespace."""
    return self._rootTagName

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the document-class to be used to create root-document nodes when '
    'parsing documents of the namespace',
    TypeType, None
)
@describe.raises( TypeError, 
    'if passed a value that is not a class'
)
@describe.raises( ValueError, 
    'if passed a value that is not a subclass of BaseDocument'
)
def _SetDocumentClass( self, value ):
    """
Sets the document-class to be used to create root-document nodes when parsing 
documents of the namespace"""
    # Since BaseDocument is an ABCMeta-based class, it's NOT going 
    # to be testable with type(value) == types.TypeType...
    if type( value ) != abc.ABCMeta:
        raise TypeError( '%s.DocumentClass expects a class derived '
            'from BaseDocument. but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if not issubclass(value, BaseDocument):
        raise ValueError( '%s.DocumentClass expects a class derived '
            'from BaseDocument. but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    self._documentClass = value

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the root tag-name of documents of the namespace to set for the '
    'instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
@describe.raises( ValueError, 
    'if passed a value that is not valid as a tag-name'
)
def _SetRootTagName( self, value ):
    """
Sets the root tag-name of documents of the namespace"""
    if type( value ) not in ( str, unicode ) and value != None:
        raise TypeError( '%s.RootTagName expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.RootTagName expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
    _validTagNameRE = re.compile( '[_A-Za-z][-_a-zA-Z0-9]*' )
    if _validTagNameRE.sub( '', value ) != '':
        raise ValueError( '%s.RootTagName expects a valid tag-name '
            'value, but was passed "%s" (%s) which is not valid' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    self._rootTagName = value

# ...

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _DelDocumentClass( self ):
    """
"Deletes" the the document-class to be used to create root-document nodes 
when parsing documents of the namespace by setting it to None"""
    self._documentClass = None

# ...

@describe.AttachDocumentation()
def _DelRootTagName( self ):
    """
"Deletes" the root tag-name of a document in the namespace by setting it 
to None."""
    self._rootTagName = None

# ...

#-----------------------------------#
# Instance Properties               #
#-----------------------------------#

# ...

DocumentClass = describe.makeProperty(
    _GetDocumentClass, None, None, 
    'the document-class to be used to create root-document nodes '
    'when parsing documents of the namespace',
    str, unicode, None
)

# ...

RootTagName = describe.makeProperty(
    _GetRootTagName, None, None, 
    'the tag-name of a root tag of a document in the namespace',
    str, unicode, None
)

I'd also already integrated those new properties into Namespace.__init:


#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()

# ...

def __init__( self, name, namespaceURI, contentType, systemId=None, 
    publicId=None, defaultRenderingModel=renderingModels.Mixed, 
    **tagRenderingModels ):
    """
Instance initializer"""

    # ...

    # Set default instance property-values with _Del... methods as needed.
    # ...
    self._DelDocumentClass()
    # ...
    self._DelRootTagName()
    # ...
    # Set instance property values from arguments if applicable.
    # ...
    # Other set-up
    self.__class__.RegisterNamespace( self )

I'd also done some exploration in the current MarkupParser.NodesFromTokens method to assure that the related namespace could be identified. That was not much more than dropping a print into the code and running a test-parse with a short but still relatively complex document:

def _NodesFromTokens( self, tokens ):
    """
Generates a list of nodes from the supplied tokens"""

    #...

    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        # TODO: Check to see if tagNamespace is identified, there's 
        #       an appropriate/related class, and if so, create an 
        #       instance of that class instead of Tag, maybe?
        if tagNamespace:
            print '### tagNamespace (from namespace URI)'
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )

which demonstrated pretty succinctly that it was at least recognizing the associated namespace:

###### tagNamespace.Name: xhtml

All that left, then, was figuring out how to register the document-classes with their associated Namespace-constants...

Modifying `Namespace`, and the Implications

The only other item, then, that needs to be handled in Namespace is providing a mechanism to actually make the document-class and root-tag-name associations. Since there are already property-setters for both, that's a very simple method:

@describe.AttachDocumentation()
@describe.argument( 'documentClass',
    'the BaseDocument-derived class to register with the Namespace instance '
    'as the document-type to create when the root tag is encountered '
    'during parsing',
    TypeType
)
def SetDocumentClass( self, documentClass ):
    """
Registers a BaseDocument-derived class with the Namespace instance as its 
document-type of record."""
    self._SetDocumentClass( documentClass )
    self._SetRootTagName( documentClass._rootTagName)

Actually making that association is then nothing more than calling SetDocumentClass on each Namespace instance after both it and its associated document class have been defined:


HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    # ...
    )

@describe.InitClass()
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""

# ...

__all__.append('HTML5Document')

HTML5Namespace.SetDocumentClass( HTML5Document )

Implementing and Organizing the Document Classes

It occurred to me while I was doing some light-weight testing of the XMLDocument class that it might be a better idea to define the markup namespace as a package, in order to allow new specific document-types and -namespaces to be added to the framework more easily. As things stand right now, with markup being a module, the process for adding a new, defined document-type (say something like DocBook XML) to the available document-types involved changes to markup.py. The inevitable progression, as more and more new document-types get added, will result in a potentially unmanageable growth of the code in markup.py, raising the risk of it getting more and more brittle as time goes on. That ignores the possible run-time speed/performance drop that would also be introduced (though, to be fair, I don't expect that would be all that significant... Still...).

Ideally, as time goes on, it might be better (it'd certainly be neat) if the process of adding a new document-type to the idic framework involved nothing more than creating the relevant Namespace and BaseDocument-derived types in a free-standing module-file, dropping that new file in the appropriate location, and moving on. While I didn't go quite so far as to make the current document-types auto-detected, I did break them out into their own modules and pull them in at the end of the core markup.py file, after converting that to a package-structure.

With that structure in play, implementation of the two external document-types got a minor facelift. In order to import the relevant classes from the sub-modules, when those maodules and the main package-header could get called repeatedly, the Namespace creation and configuration had to ensure that the individual Namespace instances weren't being overwritten. That looked like this (using HTML5Namespace as an example):

try:
    checkNamespace = Namespace.GetNamespaceByName( 'html5' )
except MarkupError:
    Namespace(
        'html5',
        'http://www.w3.org/2015/html', 
        'text/html',
        None,
        None,
        renderingModels.RequireEndTag,
        br=renderingModels.NoChildren,
        img=renderingModels.NoChildren,
        link=renderingModels.NoChildren,
        script=renderingModels.RequireEndTag,
        )
HTML5Namespace = Namespace.GetNamespaceByName( 'html5' )
__all__.append( 'HTML5Namespace' )

The individual document-classes are (for now) very simple.

`HTML5Document`

@describe.InitClass()
@describe.attribute( '_documentNamespace', 
    'the Namespace-instance that documents of this type should be '
    'associated with'
)
@describe.attribute( '_rootTagName', 
    'root tag for an instance of the document-type, if applicable'
)
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = HTML5Namespace
    # The root tag for an instance of the document-type, if applicable
    _rootTagName = 'html'

    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#

    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#

    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.keywordargs( 
        'the attribute names/values to set in the created instance'
    )
    def __init__( self, **attributes ):
        """
Instance initializer"""
        # HTML5Document is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # At least in theory, NO concrete document-class should ever NEED to  #
        # be extended. There's just no use-case that I can come up with that  #
        # wouldn't be better served (in my opinion) by creating a new         #
        # concrete document-class instead...                                  #
        #---------------------------------------------------------------------#
        if self.__class__ != HTML5Document:
            raise NotImplementedError('HTML5Document is '
                'intended to be a nominally-final class, NOT to be extended.')
        # Call BaseDocument check-helper methods:
        BaseDocument._CheckNamespace( self )
        BaseDocument._CheckTagName( self )
        # Call parent initializers, if applicable.
        BaseDocument.__init__( self, **attributes )
        # Set default instance property-values with _Del... methods as needed.
        # Set instance property values from arguments if applicable.
        # Other set-up

    #-----------------------------------#
    # Instance Garbage Collection       #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Methods                  #
    #-----------------------------------#

    #-----------------------------------#
    # Class Methods                     #
    #-----------------------------------#

    #-----------------------------------#
    # Static Class Methods              #
    #-----------------------------------#

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append('HTML5Document')

HTML5Namespace.SetDocumentClass( HTML5Document )

`XHTMLDocument`

At this point, the only difference between the HTML5Document and XHTMLDocument classes, apart from their names, are the class attributes relating to the document-namespace:

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = XHTMLNamespace

`XMLDocument`

The XMLDocument document-class doesn't even have definitions for those class-attributes, inheriting None values from BaseDocument:

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = None
# The root tag for an instance of the document-type, if applicable
_rootTagName = None

Finally: Parsing Documents

The balance of the efforts in ths post are probably going to seem anticlimactic... The sum total of the relevant changes needed in MarkupParser._NodesFromTokens is actually pretty trivial (changed lines have a green left border):

def _NodesFromTokens( self, tokens ):
    """
Generates a list of nodes from the supplied tokens"""

# ...
    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        # - If the tag-namespace points to the tag-name being a 
        #   document-root tag, then create a document instead 
        #   of a tag...
        if tagNamespace and tagName == tagNamespace.RootTagName:
            try:
                del attrDict['xmlns']
            except KeyError:
                pass
            newNode = tagNamespace.DocumentClass( **attrDict )
        else:
            newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )

With all of these changes in place, I whipped up a quick and dirty test-script (parse-test.py, available for download at the end of the post) just to make sure that it was doing what I expected/needed. It yielded the following results with the XHTML namespace:

mparsed, <idic.markup.xhtml.XHTMLDocument object at 0x7f9ed66f0450>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

[truncated for brevity's sake]

</html>
Source length 1052
Parsed length 889

and nearly-identical results using the HTML5 namespace:

mparsed, <idic.markup.html5.HTML5Document object at 0x7f2143f24450>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/2015/html">

[truncated for brevity's sake]

</html>
Source length 1051
Parsed length 813

Both runs verified that the root node of the parsed document was an instance of the appropriate BaseDocument-derived class: XHTMLDocument in the first case, and HTML5Document in the second.

This post has gone on for quite a while now, so I won't spend any time writing about the unit-testing that went into it as well. I hope, by now, that the process I go through has been covered in sufficient detail that it's not really necessary (but for those who might've missed it, see my earlier post, walking through it from the ground up). I will mention, though, that even with all of these changes, there weren't all that many new tests that had to be generated:

########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 321
 + Tests ran in ........... 0.53 seconds
Number of errors .......... 0
Number of failures ........ 13
Number of tests skipped ... 131
########################################

It's also (maybe) worth mentioning that the import-structure for HTML5Document and XHTMLDocument did not allow either of those classes to be discovered as items in need of tests — a fact that I will look into at a later date. For now, that simply means that I must also explicitly generate and include unit-test modules for the new html5.py and xhtml.py modules that contain those classes.

parse-test.py

1.4kB

idic.zip

43.4kB

I Dream In Code

Tuesday, May 23, 2017

Generating and Parsing Markup in Python [9]

Where Things Stand Right Now

Documents from a `MarkupParser`

Where I Ultimately Landed

Changes to `BaseDocument`

Hooks for Concrete Class Properties

But... That's not really any closer...

Modifying `Namespace`, and the Implications

Implementing and Organizing the Document Classes

`HTML5Document`

`XHTMLDocument`

`XMLDocument`

Finally: Parsing Documents

No comments:

Post a Comment

Tuesday, May 23, 2017

Generating and Parsing Markup in Python [9]

Where Things Stand Right Now

Documents from a MarkupParser

Where I Ultimately Landed

Changes to BaseDocument

Hooks for Concrete Class Properties

But... That's not really any closer...

Modifying Namespace, and the Implications

Implementing and Organizing the Document Classes

HTML5Document

XHTMLDocument

XMLDocument

Finally: Parsing Documents

No comments:

Post a Comment

Documents from a `MarkupParser`

Changes to `BaseDocument`

Modifying `Namespace`, and the Implications

`HTML5Document`

`XHTMLDocument`

`XMLDocument`