Wednesday, May 31, 2017

You Just Forgot Rule 19…

Nothing to worry about, lad. You just forgot Rule Nineteen. Submit?
Rule Nineteen? What the hell is Rule Nineteen? Yes, yes, submit, submit!
Remember To Never Forget Rule One. And always ask yourself how come it was created in the first place, eh?
Terry Pratchett, from Thief of Time (1)

If you were to ask several developers which critical principles have the most impact on their coding practices, style, etc., I suspect that you'd get almost as many different answers as people you asked. The variations may be mostly in priorities, and those may be shaped or set by other people (management), or by various other external forces or events (deadlines), but even then, I'd almost be willing to bet that there'd be significant differences, if only because they were shaped by different experiences.

My current priorities, shaped by my experiences, started as a set of half-serious comments to my boss many, many moons ago. Over the subsequent years, and across the various positions I've held and across the handful of companies where I held them, I invariably came back to them over and over again. These days, though I probably don't pay as much attention to them as I should, they still end up shaping a lot of my development practices. For whatever they might be worth, and in the hopes that they'll help someone in some (probably small) way, here they are...

In retrospect:
These probably sound pretty cynical... Nevertheless, I'd stand by the principles they relate to, even if the phrasing is... overly harsh...

Rule #1: Never Trust User Input

There are any number of examples of where this principle got ignored, or at least not followed through adequately. In combination with Rule #2, below, insufficient attention to this principle can (eventually will?) result in some potentially significant longer-term issues, like:

At a minimum, there's potentially-significant risk to basic application-data integrity and/or validity. It's not surprising to me that three of the OWASP Top Ten are directly attributable to insufficient validation of user input, and a handful of others could be, depending on the specifics of implementation.

Solution: If data is created directly by a user, validate it, validate all of it, and raise errors if anything looks at all fishy.
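A minimal sketch of what "validate all of it" might look like in practice. The field names, the limits and the validate_signup function are all hypothetical, made up purely for illustration; the point is only that every field gets checked, unexpected fields are rejected, and anything fishy raises an error instead of being quietly passed along:

```python
import re

# Hypothetical rule: user-names are 3-20 letters, digits or underscores
_USERNAME_RE = re.compile(r'[A-Za-z0-9_]{3,20}')


def validate_signup(data):
    """Returns a cleaned copy of data, or raises if anything looks fishy."""
    if not isinstance(data, dict):
        raise TypeError('expected a dict, got %s' % type(data).__name__)
    # Reject fields that were never asked for, rather than ignoring them
    unexpected = set(data) - {'username', 'age'}
    if unexpected:
        raise ValueError(
            'unexpected fields: %s' % ', '.join(sorted(unexpected))
        )
    username = data.get('username')
    # fullmatch requires the *entire* value to fit the pattern
    if not isinstance(username, str) or not _USERNAME_RE.fullmatch(username):
        raise ValueError('invalid username: %r' % (username,))
    age = data.get('age')
    # bool is a subclass of int, so it has to be rejected explicitly
    if isinstance(age, bool) or not isinstance(age, int) or not 0 < age < 150:
        raise ValueError('invalid age: %r' % (age,))
    return {'username': username, 'age': age}


cleaned = validate_signup({'username': 'rincewind', 'age': 33})
```

Returning a new dict, rather than handing the original back, means that code further down the line only ever sees values that have actually been checked.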

Rule #2: All Input Is Generated By Users

A bit of a semantics quibble, maybe, but absolutely true from at least one perspective: Even data generated by some completely autonomous process (sensors attached to very simple processors, running very simple programs to report that data out) still has significant human involvement. It's just buried a few layers deep, under whatever code is running to perform those basic tasks. Even if you know who wrote the code (even if you wrote the code), there's always a possibility that something got missed, that a bug will throw bad data at your code. That's at the benign end of the spectrum. At the other end, you'll have people and/or programs intentionally trying to break stuff.

Somewhere on that spectrum, or at some line drawn in whatever processes and logic a codebase is concerned with, there has to be some measure of trust, though. Otherwise, there's no point in even making the effort, yes? The alternative is writing all the code needed for a project. All of it, down to the OS. And who has time for that?

Solution: In addition to making sure that Rule #1 has been applied, make conscious decisions about where your boundaries of trust are, know where those boundaries are, and don't be afraid to revisit or review them as circumstances warrant.

Rule #3: If You're Gonna Use a Framework, Use the Framework

This one is a relatively recent addition, and was formed after experience with two separate projects. In both cases, development was undertaken after selection of an application framework, one with an associated ORM. In both cases, a substantial portion of the final codebase was built out without leveraging the facilities provided by their frameworks.

To be fair, there may have been good reasons for the design decisions involved. Or, at a minimum, the reasons may have seemed like they were valid when the decision was made. I suspect, though, that any time this sort of decision is implemented, it will raise complications. I know that I've seen variations of the following between those two projects:

  • Processes and logic that were well outside the functional expectations that would be normal for the framework (e.g., not having models or databases that conform to the ORM);
  • Difficulty in implementing automated/unit testing of components that don't follow the framework's standards, probably because of the additional time required;
  • Difficulty in adequately documenting components that don't follow the framework's standards, also probably because of the additional time required.
There are probably other potential issues that can/will arise.

How Well Do I Adhere To These...?

I'd like to think that I keep these in mind as often as they apply, but in the real world, with real deadlines and other real-world constraints, actually abiding by them is sometimes just not practical. That said, even if they aren't actually enforced, keeping them in mind, and being able to point out the risks that will arise as a result of setting them aside in favor of making that deadline, or meeting whatever other factors are shaping the time available is, I think, beneficial.


(1) If you haven't already surmised from before, yes, I'm an avid Terry Pratchett fan...

+++Divide By Cucumber Error. Please Reinstall Universe And Reboot+++

Thursday, May 25, 2017

Interlude and Schedule-change

So, most of the reason I started this blog was because I wanted to start writing about writing code again. As motivations go, I think that's pretty straightforward. Within a few days of my starting the current efforts, though, I was notified that my position was going to be closed at the end of Q1, 2017. The resulting job-hunt processes took up a fair chunk of my time, but after the first few weeks, it became a pretty standard routine — something I could complete the daily task-set on, most days, before I even left for work.

I also picked up a freelance project with a friend and former co-worker, with an eye towards getting some income-buffer started if my job ended before I found a new one. As might be expected, that also ate into my available writing-time, though not very much or for very long.

While I had five years' worth of development efforts to hand off to others before I left, the rigorous application of some of the practices I've written about here so far made most of those hand-offs quick and relatively painless:

  • I'd built an earlier version of the documentation decorators, so most of the codebase had pretty thorough and very consistent documentation. Honestly, there's more stuff that I'd built out there than I've had the need or time to build equivalents for here (so far), but those will percolate up here too, eventually.
  • Everything I'd written that was actually in use anywhere also had a fully-developed build-process in place, so as changes are made, there's a simple, consistent way to build and deploy them.
  • That build-process also leveraged a unit-testing structure that, while probably not as thorough as the one I've created here, was still very thorough, and was attached to the live build-processes so that a build targeted for a production deployment couldn't be completed without all tests passing.
  • All of my efforts there either started with a uniform project-structure or, in the couple of cases where I hadn't thought through that structure until after the project was underway, or even deployed, had been re-organized into one.
With all of these in play, the hand-offs for all my work there actually became amazingly simple, from my perspective, at least. Spending some time walking through those common structures first allowed me to point the recipients at the documentation, the project-structure, etc., etc., for all of the simple questions. That kept a lot of my time free.

Enough so that I was able to write almost every post that's shown up here so far before the beginning of March, 2017.

I managed, in fact, to write and schedule publication of every post here through the one before last before I found and started another job.

Now, however, I'm actually at that new job, and have been for a couple of months. Between that and the occasional vagaries of the side-project noted, my writing-time has been pretty scarce, and very scattered. It took the better part of ten weeks for me to complete the previous post — mostly because I'd set it aside to work on something else, pick it up with an idea for a new and better approach to some task or challenge within it, then do the same thing over and over again.

This isn't the first time I've been in this sort of situation. Unlike the last time, though, I'm pretty darn certain that I won't have to worry about writing code for this blog that has conflicts of interest with what I'm doing at work. On the other hand, the new job is very different from the previous one, and makes more (and unpredictable) demands on my time that have interfered with my writing-efforts in the last couple of months. The side-project is also gaining traction again, after a month or more of near-total inactivity.

Putting all of these together, I'm not completely sure what that means for my writing- and publication-schedule here yet, at least on a long-term basis, but I am planning to continue my efforts here, as much as I can manage. I've still got a fair chunk of stuff thought out that I can write about; it's just a matter of finding the time to do so.

Tuesday, May 23, 2017

Generating and Parsing Markup in Python [9]

The last missing piece in parsing markup is figuring out how to reliably identify documents in a chunk of markup-text. The basic parsing-process already identifies them as tags, and even assigns the appropriate namespace if one is provided in the original markup (in an xmlns attribute). Different document-types, though, have different output rules once they're being rendered to a client browser. I've noted this before, but put off actually defining concrete document-classes until the parsing-process was complete. Now it's time to take a more complete look at those.

Where Things Stand Right Now

In the original planning for the markup module, I'd specified an abstract base class (BaseDocument) that would be used as the starting-point for concrete classes for specific document-types:

I anticipate two patterns to show up in application-code that would result in a document being created:
  • Creating a new document as an object in an application's code, which might look something like this:
    newDocument = HTML5Document( ... )
  • Parsing a document from markup-source (obviously, perhaps, since the MarkupParser class was just completed in the last post).
The first of those scenarios is pretty straightforward, at least in the context of developer-side interaction with the framework. I can't think of anything meaningful to discuss about that use-case that won't, hopefully, be obvious from the signature of the applicable __init__ method.

Documents from a MarkupParser

The use-case of the last of those scenarios is the one that I was most concerned with when I started this series of posts. The relevant goal was to provide designer-side control over a document-type, simply by changing the document-specification in some fashion in an XML template-file that they have ready access to, without requiring access to any application-code. That XML template-file could (and should, for HTML5 and XHTML documents) still use HTML tag-names, in order to keep things reasonably familiar.

Truth be told, I spent weeks working through this aspect of document-creation, pondering a number of variants, trying to determine how they would fit in to the MarkupParser class, etc., etc. At times, though some of the solutions I arrived at were reasonably elegant (and others turned out to be horrible kludges), I either lost sight of that designer-side control goal, or the results, while probably functional, just felt too complicated. It wasn't until I took a break from the code, took a step back, and started asking myself "what does that designer-side-control goal actually mean?" that I hit on the approach that I'm going to implement.

So, what does the designer-side-control goal really mean? It's actually not all that complicated. If the development efforts of a web-application are broken into developer-side and designer-side tasks and needs, the designer side of that equation needs:

The ability to modify a page-template without having to worry about breaking existing application functionality
So, for example, if a page's layout needs to be altered by moving content (and, down the line, page-components) around, all that the designer should have to modify is the page-template file.
The ability to add existing functionality into a page-template that it doesn't currently exist in
This essentially boils down to being able to add ready-for-use page-components to an existing page-template, and have it just work. I haven't started exploring the idea of page-components just yet, but it's something that I will touch on (at least conceptually) for a while before I work out actual implementations. At present, what I have in mind for page-components looks something like this:
  • Page components will be represented in page templates as tags;
  • Those tags will map back to server-side objects in some fashion;
  • The component-tags will allow in-markup templating of their content;
  • Components may act as a View in a perhaps-formal server-side MVC pattern (it's too early to determine that for certain just yet, but that's a should have goal for them at any rate)
The ability to specify (and thus change) output document-types without having to deal with the application code itself
This specification needs to be as simple as possible, in order to cater to the possibilities of:
  • People in a designer role who don't have much experience with whatever markup language(s) are in use (which also helps, at some level, to future-proof the page-templates and their corresponding processes);
    and/or
  • To accommodate for the possibility that the tools being used to modify page-templates might mangle markup that is too complex, or doesn't conform to some known standard (a too-corrective editor, one that fixes markup it determines to be invalid, breaking the template's markup-requirements in the process).
I don't expect either of these to be a major concern, but I do feel that they need to be addressed, and the resulting ease-of-use should be beneficial even for experienced designers.
This is where my main hang-up with various implementations came into play, prompting my re-think of the process that led to the implementation here.
Not having a bunch of functional code living in page-templates
This ties in with the "existing functionality" item above, and also relates to the page-component concept. One of the things that I don't like about many of the Python frameworks I've looked at is the inclusion of logical control structures in the template markup. Django is one of the more popular of those frameworks, and is pretty mature, so I'll use its built-in template tags and filters as an example. It provides dozens of logical/functional items, a simple example of which is:
<ul>
{% for athlete in athlete_list %}
    <li>{{ athlete.name }}</li>
{% endfor %}
</ul>
Now, I'll grant that it's a well-thought-out collection of capabilities, hands down. I'll also grant that it provides designer-level presentation-control. The aspects that really bother me about this structure are:
  • That's (potentially) a lot of additional pseudo-markup to potentially throw at a (potentially) non-technical user, particularly in more complex function-and-design scenarios.
  • A (perhaps) typical designer developer is more likely to be using one of the various WYSIWYG editors (Dreamweaver, for example), and the additional template-code may make it difficult, or even impossible to really work in that WYSIWYG mode. Granted, my recent experience with that kind of scenario has been pretty solidly shaped by the environment I was in — an advertising agency — where the standard tool-chain for designers and creatives included Dreamweaver, so I may be more concerned about this than is really warranted, but even so...
  • At a more functional level, perhaps, it just feels wrong to me to have decision-making happening in the presentation layer. Simple iteration makes some sense, though I'd hope that something on the server side would take care of that, flat out. Barring that, if there's functionality that must happen in the presentation layer, I'd rather it be done with something that already exists, like JavaScript (though I might have issue with that too).
That also completely ignores the potential for conflicts between server-side templating and some of the common/popular client-side templating engines, like Angular JS. To be fair, I don't know that any specific client-side engine's syntax actually collides with any server-side templating structure/syntax, but I definitely see the potential for it happening (without, admittedly, having actually taken the time to try and implement something like a Django/Angular application).

Where I Ultimately Landed

The ultimate question, on the parsing side of things at least, that drove the implementation of document-parsing boiled down to

How can a designer specify a document type in the markup in order for it to be recognized and used accordingly?
I examined several possibilities in my initial approaches:
  • Using a custom XML processing instruction
  • Using the DOCTYPE
  • Using the xmlns attribute
  • Using a custom attribute on the root tag (initially, I was thinking of a data-* attribute)
The first three approaches all had various issues: I felt that they over-complicated the framework-code, or added requirements to template-markup that would either confuse newer designers needlessly or risk breaking templates in too-corrective editors. To be fair, I'm not aware of any current versions of markup-editors that fall into this category, but I distinctly remember an earlier version of Adobe's Dreamweaver causing this sort of problem in the past, and other editors may well cause similar issues even today.

So, the final rule-set and process for parsing page templates ends up as:

  • Page templates will follow XML structure- and tag-closure rules — for all practical purposes, they are XML documents, though they can (and will) use and support HTML tag-names and attributes, even those introduced in HTML 5.
  • The BaseDocument abstract class will be used to collect all of the functionality common to all the concrete document-type classes.
  • There will be discrete concrete classes for both the current and most-recent non-current HTML versions (i.e., HTML 5 and XHTML).
  • There will also be a concrete class for generic XML documents — not so much because I expect to need one as because I can't rule out the potential need for one.
  • Each of those document-classes, derived from BaseDocument, will be responsible for defining several things:
    • Any items/values that should be rendered for an instance of the document-type that occur before the standard Tag-rendering process takes over:
      HTML 5
      The <!DOCTYPE> declaration
      XHTML
      The <!DOCTYPE> declaration, and the xmlns attribute of the root <html> tag
      XML
      The XML declaration (<?xml version="1.0" encoding="UTF-8"?>), any XML processing-instructions (<?xml-stylesheet type="text/xsl" href="style.xsl"?>, for example), and a <!DOCTYPE> if one is specified
    • A common class-method (FromTag) that can be passed a Tag (and its children) that will return the same DOM-tree structure as an instance of the class.
    • A root-tag name that can be used to verify the document root during the parsing process.
    • The Namespace instance that relates to the document, if one is applicable.
  • Each concrete class will be registered with the MarkupParser class, using a unique and designer-meaningful name (html5 or xhtml, for example), allowing MarkupParser to look up a concrete document-class by that name and either create an instance of the document-class as the top of the DOM-results, or call the document-class' FromTag method when the root tag is closed to convert those results to the appropriate document-type.
  • The root-tag of a page template will include a document-type attribute whose value will be one of the names registered with MarkupParser, tying the designer-specified document-type to the registered BaseDocument-derived document-class during parsing.
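To make the last two items a bit more concrete, here's a rough sketch of how the registration and lookup might hang together. The RegisterDocumentType and DocumentClassFor method-names, the _documentTypes dictionary, and the stripped-down stand-in classes are all assumptions made for illustration, not the final implementation:

```python
class BaseDocument(object):
    """Stand-in for the real abstract class, with only what this sketch needs."""
    _markupParserName = None


class HTML5Document(BaseDocument):
    _markupParserName = 'html5'


class XHTMLDocument(BaseDocument):
    _markupParserName = 'xhtml'


class MarkupParser(object):
    # Designer-facing names mapped to document-classes (hypothetical storage)
    _documentTypes = {}

    @classmethod
    def RegisterDocumentType(cls, documentClass):
        """Registers a BaseDocument-derived class under its own name."""
        if not (isinstance(documentClass, type)
                and issubclass(documentClass, BaseDocument)):
            raise TypeError('expected a class derived from BaseDocument')
        name = documentClass._markupParserName
        if not name:
            raise ValueError(
                '%s does not define _markupParserName'
                % documentClass.__name__
            )
        cls._documentTypes[name] = documentClass

    @classmethod
    def DocumentClassFor(cls, rootAttributes):
        """Looks up the document-class named by a root tag's attributes."""
        name = rootAttributes.get('document-type')
        if name not in cls._documentTypes:
            raise ValueError('unknown document-type: %r' % (name,))
        return cls._documentTypes[name]


MarkupParser.RegisterDocumentType(HTML5Document)
MarkupParser.RegisterDocumentType(XHTMLDocument)

# The attributes a parser might collect from <html document-type="html5">
documentClass = MarkupParser.DocumentClassFor({'document-type': 'html5'})
```

From the designer's side, switching a template from HTML 5 to XHTML output would then be a one-attribute change on the root tag, with no application code involved.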

Changes to BaseDocument

Some basic definition for BaseDocument was done a couple posts back, but with the final decisions above now made, there are some potentially-significant alterations that need to be made.

Hooks for Concrete Class Properties

The root-tag-name, namespace and document-type name that will be registered with MarkupParser are values that should be constant for each concrete subclass of BaseDocument, so I can't think of any reason not to make them properties of those classes, and set up default values for them in BaseDocument:

#-----------------------------------#
# Class attributes (and instance-   #
# attribute default values)         #
#-----------------------------------#

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = None
# The name that will be used to register the class with MarkupParser
_markupParserName = None
# The root tag for an instance of the document-type, if applicable
_rootTagName = None
The document-type's associated Namespace feels like it should be required for instances of HTML5Document and XHTMLDocument, but not for XMLDocument (since any given XML document could, by definition, have its own distinct namespace). That's something that I'll have to cover in more detail further on, once I start implementing XMLDocument, but for now, it suffices to know that the execution of any checks of <class>._documentNamespace will have to reside in the applicable derived classes. It wouldn't hurt to have a common helper-method that BaseDocument does provide, though, to perform that check:
@describe.AttachDocumentation()
@describe.raises( AttributeError, 
    'if the instance\'s class does not have a Namespace defined'
)
@describe.raises( TypeError, 
    'if the instance\'s class defines a namespace that is not an '
    'instance of Namespace'
)
def _CheckNamespace( self ):
    """
Checks the instance's class to see if it has a Namespace association defined."""
    if self.__class__._documentNamespace is None:
        raise AttributeError( '%s does not have a defined Namespace '
            'relationship. Please set its _documentNamespace class-'
            'attribute to an instance of a Namespace' % ( 
                self.__class__.__name__
            )
        )
    if not isinstance( self.__class__._documentNamespace, Namespace ):
        raise TypeError( 
            '%s defines a document-namespace relationship in its '
            '_documentNamespace class-attribute, but it is a %s, '
            'not a Namespace instance' % ( 
                self.__class__.__name__, 
                type( self.__class__._documentNamespace ).__name__
            )
        )
That method, _CheckNamespace, is then available to any derived class' instances, and can be used to enforce the namespace requirement during the __init__ process of document-types that must have a namespace.

A similar helper-method, _CheckTagName, which can be used to check for class-attributes that correctly define a root tag-name in document-types where one is relevant would also be useful:

@describe.AttachDocumentation()
@describe.raises( AttributeError, 
    'if the instance\'s class does not have a root tag-name defined'
)
@describe.raises( ValueError, 
    'if the instance\'s class defines a root tag-name that is not '
    'valid'
)
def _CheckTagName( self ):
    """
Checks the instance's class to see if it has a valid root-tag-name association 
defined."""
    # A missing class-attribute and an empty (or None) one are both 
    # reported the same way
    rootTagName = getattr( self.__class__, '_rootTagName', None )
    if not rootTagName:
        raise AttributeError( '%s does not have a root-tag-name defined '
            'as a class attribute' % ( self.__class__.__name__ ) )
    if not self._IsValidTagName( rootTagName ):
        raise ValueError( '%s has a root-tag-name defined ("%s"), but it '
            'is not a valid tag-name' % ( 
                self.__class__.__name__, rootTagName
            )
        )
This would be used in a manner similar to _CheckNamespace during initialization of a document-type instance where it's relevant.

Both of these methods will be called in the __init__ of HTML5Document and XHTMLDocument, and I'll show that in a while.
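In the meantime, a rough sketch of the idea, using heavily simplified stand-ins for Namespace, BaseDocument and the two helper-methods (none of this is the real framework code, the BrokenDocument class exists only to show the failure case, and the real __init__ signatures will differ):

```python
class Namespace(object):
    """Simplified stand-in: the real class takes several more arguments."""
    def __init__(self, name, namespaceURI):
        self.name = name
        self.namespaceURI = namespaceURI


class BaseDocument(object):
    _documentNamespace = None
    _rootTagName = None

    def _CheckNamespace(self):
        namespace = self.__class__._documentNamespace
        if namespace is None:
            raise AttributeError(
                '%s does not have a defined Namespace relationship'
                % self.__class__.__name__
            )
        if not isinstance(namespace, Namespace):
            raise TypeError(
                '%s._documentNamespace is not a Namespace instance'
                % self.__class__.__name__
            )

    def _CheckTagName(self):
        if not self.__class__._rootTagName:
            raise AttributeError(
                '%s does not have a root-tag-name defined'
                % self.__class__.__name__
            )


class HTML5Document(BaseDocument):
    _documentNamespace = Namespace('html5', 'http://www.w3.org/1999/xhtml')
    _rootTagName = 'html'

    def __init__(self):
        # Fail fast if the class itself is mis-configured
        self._CheckNamespace()
        self._CheckTagName()


class BrokenDocument(BaseDocument):
    # Deliberately leaves _documentNamespace and _rootTagName unset
    def __init__(self):
        self._CheckNamespace()
        self._CheckTagName()


document = HTML5Document()      # passes both checks
try:
    BrokenDocument()
    error = None
except AttributeError as raised:
    error = raised              # mis-configured class is caught right away
```

Since the checks look at class attributes, a mis-configured document-class fails the first time anything tries to instantiate it, rather than at some later point during rendering.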

The FromTag class-method can be defined as a member of BaseDocument, so long as it can handle document-type classes that don't specify a root tag-name or namespace:

@classmethod
@describe.AttachDocumentation()
@describe.argument( 'sourceTag', 
    'the Tag instance that contains the markup to be copied into the '
    'resulting document instance',
    Tag
)
@describe.returns( 'An instance of the class' )
@describe.raises( TypeError, 
    'if passed a sourceTag value that is not an instance of Tag'
)
def FromTag( cls, sourceTag ):
    """
Creates an instance of the class and populates it with the attributes and 
children of the provided source-tag"""
    if not isinstance( sourceTag, Tag ):
        # Three placeholders in the message, so three values are needed
        raise TypeError( '%s.FromTag expects a Tag instance for its '
            'sourceTag argument, but was passed "%s" (%s) instead' % 
            ( cls.__name__, sourceTag, type( sourceTag ).__name__ )
        )
    # Use the tag-name from the document class, or from the sourceTag's 
    # top-level node if no root tag-name is specified for the class
    if cls._rootTagName:
        result = cls( cls._rootTagName )
    else:
        result = cls( sourceTag.tagName )
    # Add the default namespace, if applicable, or the source-tag's 
    # namespace if *that* is applicable
    if cls._documentNamespace:
        result._SetNamespace( cls._documentNamespace )
    elif sourceTag.Namespace:
        result._SetNamespace( sourceTag.Namespace )
    # Copy the attributes over first
    for attr in sourceTag.attributes:
        result.setAttribute( attr, sourceTag.attributes[ attr ] )
    # Then the children
    while sourceTag.childNodes:
        result.appendChild( 
            sourceTag.removeChild( 
                sourceTag.childNodes[ 0 ]
            )
        )
    # Then return the result
    return result
Even if MarkupParser ends up creating a BaseDocument instance for its main results, the ability to convert an arbitrary Tag into one of the provided document-types feels like it might be useful, so having this defined doesn't feel bad.
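One behavior of FromTag worth calling out is that it moves the children out of the source tag rather than copying them. Here's a small sketch that shows that, using a bare-bones stand-in for Tag (the real class does a great deal more, and the stand-in's attribute storage is an assumption) and a condensed version of the method above:

```python
class Tag(object):
    """Bare-bones stand-in for the real Tag class."""
    def __init__(self, tagName):
        self.tagName = tagName
        self.Namespace = None
        self.attributes = {}
        self.childNodes = []

    def setAttribute(self, name, value):
        self.attributes[name] = value

    def appendChild(self, child):
        self.childNodes.append(child)
        return child

    def removeChild(self, child):
        self.childNodes.remove(child)
        return child

    def _SetNamespace(self, namespace):
        self.Namespace = namespace


class HTML5Document(Tag):
    _rootTagName = 'html'
    _documentNamespace = None

    @classmethod
    def FromTag(cls, sourceTag):
        if not isinstance(sourceTag, Tag):
            raise TypeError('FromTag expects a Tag instance')
        # The class' root tag-name wins over the source tag's name
        result = cls(cls._rootTagName or sourceTag.tagName)
        namespace = cls._documentNamespace or sourceTag.Namespace
        if namespace:
            result._SetNamespace(namespace)
        for attr in sourceTag.attributes:
            result.setAttribute(attr, sourceTag.attributes[attr])
        # Children are *moved*: each is removed from the source first
        while sourceTag.childNodes:
            result.appendChild(
                sourceTag.removeChild(sourceTag.childNodes[0])
            )
        return result


source = Tag('div')
source.setAttribute('lang', 'en')
body = source.appendChild(Tag('body'))

document = HTML5Document.FromTag(source)
# document is now an <html lang="en"> holding the body tag,
# and source has no children left
```

Moving the children avoids having the same node live in two trees at once, but it does mean the source tag shouldn't be counted on for anything after the conversion.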

But... That's not really any closer...

While the FromTag method is, I think, pretty neat in its own right, that doesn't really get me any closer to parsing whole documents from template-files. There are a few things that need to happen, and some of the hooks put in play for that method will help. Ultimately, what needs to happen during parsing ends up being a pretty simple sub-section of the current _NodesFromTokens method:

  • When a node is created from a token, some sort of check needs to be made to see if it should be a document-root node for the namespace of the node.
  • That means that the individual Namespace globals need to be able to keep track of what root tag-names they need to be concerned with;
  • That also means that a given Namespace global needs to know what BaseDocument-derived class should be instantiated when a root-node creation is detected.
In my first pass in solving this challenge, I started down the path of adding more and more stuff to the Namespace instances (HTML5Namespace, etc.). That might've been workable, but after putting the class-level document-namespace hook-attribute in place for FromTag, that led to a circular reference issue: The document-class needed to know about the namespace-class, and vice versa. Taking a step back, and thinking about the issue a bit more, I eventually came to the conclusion that while the Namespace globals did need to know about their related documents, there was no reason that it had to happen during object-initialization. As part of that initial pursuit, I'd already created the Namespace-instance properties needed to store the document-class value, as well as the root tag:
# ...

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _GetDocumentClass( self ):
    """
Gets the class to be used as root-document nodes"""
    return self._documentClass

# ...

@describe.AttachDocumentation()
def _GetRootTagName( self ):
    """
Gets the root tag-name of documents of the namespace."""
    return self._rootTagName

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the document-class to be used to create root-document nodes when '
    'parsing documents of the namespace',
    TypeType, None
)
@describe.raises( TypeError, 
    'if passed a value that is not a class'
)
@describe.raises( ValueError, 
    'if passed a value that is not a subclass of BaseDocument'
)
def _SetDocumentClass( self, value ):
    """
Sets the document-class to be used to create root-document nodes when parsing 
documents of the namespace"""
    # Since BaseDocument is an ABCMeta-based class, it's NOT going 
    # to be testable with type(value) == types.TypeType...
    if type( value ) != abc.ABCMeta:
        raise TypeError( '%s.DocumentClass expects a class derived '
            'from BaseDocument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if not issubclass(value, BaseDocument):
        raise ValueError( '%s.DocumentClass expects a class derived '
            'from BaseDocument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    self._documentClass = value

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the root tag-name of documents of the namespace to set for the '
    'instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
@describe.raises( ValueError, 
    'if passed a value that is not valid as a tag-name'
)
def _SetRootTagName( self, value ):
    """
Sets the root tag-name of documents of the namespace"""
    if type( value ) not in ( str, unicode ) and value is not None:
        raise TypeError( '%s.RootTagName expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.RootTagName expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
        # Only check tag-name validity when there is a value to check: 
        # passing None to the regular expression would raise a TypeError
        _validTagNameRE = re.compile( '[_A-Za-z][-_a-zA-Z0-9]*' )
        if _validTagNameRE.sub( '', value ) != '':
            raise ValueError( '%s.RootTagName expects a valid tag-name '
                'value, but was passed "%s" (%s) which is not valid' % ( 
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
    self._rootTagName = value

# ...

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _DelDocumentClass( self ):
    """
"Deletes" the document-class to be used to create root-document nodes 
when parsing documents of the namespace by setting it to None"""
    self._documentClass = None

# ...

@describe.AttachDocumentation()
def _DelRootTagName( self ):
    """
"Deletes" the root tag-name of a document in the namespace by setting it 
to None."""
    self._rootTagName = None

# ...

#-----------------------------------#
# Instance Properties               #
#-----------------------------------#

# ...

DocumentClass = describe.makeProperty(
    _GetDocumentClass, None, None, 
    'the document-class to be used to create root-document nodes '
    'when parsing documents of the namespace',
    str, unicode, None
)

# ...

RootTagName = describe.makeProperty(
    _GetRootTagName, None, None, 
    'the tag-name of a root tag of a document in the namespace',
    str, unicode, None
)
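The tag-name check in _SetRootTagName works by stripping every match of the pattern and seeing whether anything is left over. The same rule can also be written as a single anchored match, which some readers may find easier to follow. This is only a sketch: is_valid_tag_name is a hypothetical helper that is not part of idic, and it is written for Python 3, where str covers what str and unicode cover in the listings above.

```python
import re

# Same character classes as the setter above, anchored with \Z so that
# the ENTIRE value has to match (hypothetical helper, illustration only)
_VALID_TAG_NAME = re.compile( r'[_A-Za-z][-_a-zA-Z0-9]*\Z' )

def is_valid_tag_name( value ):
    """Returns True if value is a non-empty, single-line tag-name"""
    if not isinstance( value, str ):
        return False
    if '\n' in value or '\r' in value:
        return False
    # re.match anchors at the start; \Z anchors at the very end
    return _VALID_TAG_NAME.match( value ) is not None
```

Unlike the setter, this helper treats None and the empty string as invalid rather than as "no root tag", so it would sit behind the existing None check rather than replace it.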
I'd also already integrated those new properties into Namespace.__init__:

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()

# ...

def __init__( self, name, namespaceURI, contentType, systemId=None, 
    publicId=None, defaultRenderingModel=renderingModels.Mixed, 
    **tagRenderingModels ):
    """
Instance initializer"""

    # ...

    # Set default instance property-values with _Del... methods as needed.
    # ...
    self._DelDocumentClass()
    # ...
    self._DelRootTagName()
    # ...
    # Set instance property values from arguments if applicable.
    # ...
    # Other set-up
    self.__class__.RegisterNamespace( self )

I'd also done some exploration in the current MarkupParser._NodesFromTokens method to ensure that the related namespace could be identified. That was not much more than dropping a print into the code and running a test-parse with a short but still relatively complex document:

def _NodesFromTokens( self, tokens ):
    """
Generates a list of nodes from the supplied tokens"""

    #...

    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        # TODO: Check to see if tagNamespace is identified, there's 
        #       an appropriate/related class, and if so, create an 
        #       instance of that class instead of Tag, maybe?
        if tagNamespace:
            print '###### tagNamespace.Name: %s' % ( tagNamespace.Name )
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )
which demonstrated pretty succinctly that it was at least recognizing the associated namespace:
###### tagNamespace.Name: xhtml
All that was left, then, was figuring out how to register the document-classes with their associated Namespace-constants...

Modifying Namespace, and the Implications

The only other item, then, that needs to be handled in Namespace is providing a mechanism to actually make the document-class and root-tag-name associations. Since there are already property-setters for both, that's a very simple method:

@describe.AttachDocumentation()
@describe.argument( 'documentClass',
    'the BaseDocument-derived class to register with the Namespace instance '
    'as the document-type to create when the root tag is encountered '
    'during parsing',
    TypeType
)
def SetDocumentClass( self, documentClass ):
    """
Registers a BaseDocument-derived class with the Namespace instance as its 
document-type of record."""
    self._SetDocumentClass( documentClass )
    self._SetRootTagName( documentClass._rootTagName)
Actually making that association is then nothing more than calling SetDocumentClass on each Namespace instance after both it and its associated document class have been defined:

HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    # ...
    )

@describe.InitClass()
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""

# ...

__all__.append('HTML5Document')

HTML5Namespace.SetDocumentClass( HTML5Document )

Implementing and Organizing the Document Classes

It occurred to me while I was doing some light-weight testing of the XMLDocument class that it might be a better idea to define the markup namespace as a package, in order to allow new specific document-types and -namespaces to be added to the framework more easily. As things stand right now, with markup being a module, the process for adding a new, defined document-type (say something like DocBook XML) to the available document-types involves changes to markup.py. The inevitable progression, as more and more new document-types get added, will result in a potentially unmanageable growth of the code in markup.py, raising the risk of it getting more and more brittle as time goes on. That ignores the possible run-time speed/performance drop that would also be introduced (though, to be fair, I don't expect that would be all that significant... Still...).

Ideally, as time goes on, it might be better (it'd certainly be neat) if the process of adding a new document-type to the idic framework involved nothing more than creating the relevant Namespace and BaseDocument-derived types in a free-standing module-file, dropping that new file in the appropriate location, and moving on. While I didn't go quite so far as to make the current document-types auto-detected, I did break them out into their own modules and pull them in at the end of the core markup.py file, after converting that to a package-structure.

With that structure in play, implementation of the two external document-types got a minor facelift. In order to import the relevant classes from the sub-modules, when those modules and the main package-header could get called repeatedly, the Namespace creation and configuration had to ensure that the individual Namespace instances weren't being overwritten. That looked like this (using HTML5Namespace as an example):

try:
    checkNamespace = Namespace.GetNamespaceByName( 'html5' )
except MarkupError:
    Namespace(
        'html5',
        'http://www.w3.org/2015/html', 
        'text/html',
        None,
        None,
        renderingModels.RequireEndTag,
        br=renderingModels.NoChildren,
        img=renderingModels.NoChildren,
        link=renderingModels.NoChildren,
        script=renderingModels.RequireEndTag,
        )
HTML5Namespace = Namespace.GetNamespaceByName( 'html5' )
__all__.append( 'HTML5Namespace' )
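The try/except above amounts to a get-or-create lookup against the namespace registry. A stripped-down sketch of that idea follows, under some loudly stated assumptions: the Namespace and MarkupError classes here are stand-ins rather than the idic implementations, the registry is assumed to be keyed by name, and get_or_create_namespace is a hypothetical helper.

```python
class MarkupError( Exception ):
    """Stand-in for the framework's MarkupError"""

class Namespace( object ):
    """Stand-in for idic's Namespace: only the name-registry is modeled"""
    _registry = {}

    def __init__( self, name, namespaceURI ):
        self.name = name
        self.namespaceURI = namespaceURI
        # Assumed equivalent of the RegisterNamespace call in __init__
        self.__class__._registry[ name ] = self

    @classmethod
    def GetNamespaceByName( cls, name ):
        try:
            return cls._registry[ name ]
        except KeyError:
            raise MarkupError( 'No namespace named "%s"' % name )

def get_or_create_namespace( name, namespaceURI ):
    """Creates the namespace only if it isn't already registered"""
    try:
        return Namespace.GetNamespaceByName( name )
    except MarkupError:
        return Namespace( name, namespaceURI )

# Calling it twice, as repeated imports would, yields the same instance
first = get_or_create_namespace( 'html5', 'http://www.w3.org/2015/html' )
second = get_or_create_namespace( 'html5', 'http://www.w3.org/2015/html' )
```

Because the second call finds the registered instance, anything already attached to it (a document-class, for example) survives a repeated import.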
The individual document-classes are (for now) very simple.

HTML5Document

@describe.InitClass()
@describe.attribute( '_documentNamespace', 
    'the Namespace-instance that documents of this type should be '
    'associated with'
)
@describe.attribute( '_rootTagName', 
    'root tag for an instance of the document-type, if applicable'
)
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = HTML5Namespace
    # The root tag for an instance of the document-type, if applicable
    _rootTagName = 'html'

    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#

    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#

    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.keywordargs( 
        'the attribute names/values to set in the created instance'
    )
    def __init__( self, **attributes ):
        """
Instance initializer"""
        # HTML5Document is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # At least in theory, NO concrete document-class should ever NEED to  #
        # be extended. There's just no use-case that I can come up with that  #
        # wouldn't be better served (in my opinion) by creating a new         #
        # concrete document-class instead...                                  #
        #---------------------------------------------------------------------#
        if self.__class__ != HTML5Document:
            raise NotImplementedError('HTML5Document is '
                'intended to be a nominally-final class, NOT to be extended.')
        # Call BaseDocument check-helper methods:
        BaseDocument._CheckNamespace( self )
        BaseDocument._CheckTagName( self )
        # Call parent initializers, if applicable.
        BaseDocument.__init__( self, **attributes )
        # Set default instance property-values with _Del... methods as needed.
        # Set instance property values from arguments if applicable.
        # Other set-up

    #-----------------------------------#
    # Instance Garbage Collection       #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Methods                  #
    #-----------------------------------#

    #-----------------------------------#
    # Class Methods                     #
    #-----------------------------------#

    #-----------------------------------#
    # Static Class Methods              #
    #-----------------------------------#

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append('HTML5Document')

HTML5Namespace.SetDocumentClass( HTML5Document )
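The "nominally-final" guard in the initializer can be seen on its own in a much smaller class. This is a sketch only; FinalDocument and Extended are hypothetical names used for illustration.

```python
class FinalDocument( object ):
    """Hypothetical stand-in for a concrete document-class"""
    def __init__( self ):
        # Refuse to initialize instances of any derived class
        if self.__class__ != FinalDocument:
            raise NotImplementedError( 'FinalDocument is intended to be '
                'a nominally-final class, NOT to be extended.' )

class Extended( FinalDocument ):
    """Inherits FinalDocument.__init__, so the guard still runs"""
    pass

document = FinalDocument()
try:
    Extended()
    blocked = False
except NotImplementedError:
    blocked = True
```

The guard is only nominal: a subclass that overrides __init__ without calling the parent initializer would never reach the check.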

XHTMLDocument

At this point, the only difference between the HTML5Document and XHTMLDocument classes, apart from their names, is the class attribute relating to the document-namespace:

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = XHTMLNamespace

XMLDocument

The XMLDocument document-class doesn't even have definitions for those class-attributes, inheriting None values from BaseDocument:

# The Namespace-instance that documents of this type should be associated with
_documentNamespace = None
# The root tag for an instance of the document-type, if applicable
_rootTagName = None

Finally: Parsing Documents

The balance of the efforts in this post are probably going to seem anticlimactic... The sum total of the relevant changes needed in MarkupParser._NodesFromTokens is actually pretty trivial (the changes are confined to the final else branch):

def _NodesFromTokens( self, tokens ):
    """
Generates a list of nodes from the supplied tokens"""

# ...
    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        # - If the tag-namespace points to the tag-name being a 
        #   document-root tag, then create a document instead 
        #   of a tag...
        if tagNamespace and tagName == tagNamespace.RootTagName:
            try:
                del attrDict['xmlns']
            except KeyError:
                pass
            newNode = tagNamespace.DocumentClass( **attrDict )
        else:
            newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )
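The decision made in that else branch can be isolated into a few lines. The sketch below uses stand-in Tag, XHTMLDocument and Namespace classes that model only the attributes the decision needs, and create_root_node is a hypothetical helper rather than anything in MarkupParser.

```python
class Tag( object ):
    """Stand-in for idic's Tag"""
    def __init__( self, tagName, namespace=None, **attributes ):
        self.tagName = tagName
        self.namespace = namespace
        self.attributes = attributes

class XHTMLDocument( Tag ):
    """Stand-in document-class whose root tag is html"""
    def __init__( self, **attributes ):
        Tag.__init__( self, 'html', None, **attributes )

class Namespace( object ):
    """Stand-in exposing only RootTagName and DocumentClass"""
    def __init__( self, rootTagName, documentClass ):
        self.RootTagName = rootTagName
        self.DocumentClass = documentClass

def create_root_node( tagName, tagNamespace, attrDict ):
    """Returns a document for a namespace's root tag, else a plain Tag"""
    if tagNamespace and tagName == tagNamespace.RootTagName:
        # As in the listing above, xmlns is dropped before the
        # remaining attributes are handed to the document-class
        attributes = dict( attrDict )
        attributes.pop( 'xmlns', None )
        return tagNamespace.DocumentClass( **attributes )
    return Tag( tagName, tagNamespace, **attrDict )

xhtml = Namespace( 'html', XHTMLDocument )
document = create_root_node(
    'html', xhtml, { 'xmlns': 'http://www.w3.org/1999/xhtml', 'lang': 'en' }
)
plainTag = create_root_node( 'div', xhtml, { 'id': 'main' } )
```

Copying attrDict before removing xmlns is a choice made for this sketch; the listing above deletes the key in place.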
With all of these changes in place, I whipped up a quick and dirty test-script (parse-test.py, available for download at the end of the post) just to make sure that it was doing what I expected/needed. It yielded the following results with the XHTML namespace:
mparsed, <idic.markup.xhtml.XHTMLDocument object at 0x7f9ed66f0450>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

[truncated for brevity's sake]

</html>
Source length 1052
Parsed length 889
and nearly-identical results using the HTML5 namespace:
mparsed, <idic.markup.html5.HTML5Document object at 0x7f2143f24450>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/2015/html">

[truncated for brevity's sake]

</html>
Source length 1051
Parsed length 813
Both runs verified that the root node of the parsed document was an instance of the appropriate BaseDocument-derived class: XHTMLDocument in the first case, and HTML5Document in the second.

This post has gone on for quite a while now, so I won't spend any time writing about the unit-testing that went into it as well. I hope, by now, that the process I go through has been covered in sufficient detail that it's not really necessary (but for those who might've missed it, see my earlier post, walking through it from the ground up). I will mention, though, that even with all of these changes, there weren't all that many new tests that had to be generated:

########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 321
 + Tests ran in ........... 0.53 seconds
Number of errors .......... 0
Number of failures ........ 13
Number of tests skipped ... 131
########################################
It's also (maybe) worth mentioning that the import-structure for HTML5Document and XHTMLDocument did not allow either of those classes to be discovered as items in need of tests — a fact that I will look into at a later date. For now, that simply means that I must also explicitly generate and include unit-test modules for the new html5.py and xhtml.py modules that contain those classes.

Thursday, May 18, 2017

Generating and Parsing Markup in Python [8]

Almost the entire process and structure for generating markup is complete now — minus a handful of items that are waiting on the ability to parse markup-text, and some decisions about how to handle document-creation. Those two topics are loosely related, at least, so today I'm going to get the parsing capabilities worked out in the hopes that it will provide a basis for deciding how to implement concrete document-classes. So: MarkupParser:

Parsing, from 10,000 feet

Parsing markup is an interesting problem (perhaps in the sense of the old Chinese curse: may you live in interesting times):

  • There are rules for how tags can be expressed, but those rules may not be the same for any two tags, at least in certain markup dialects (HTML 5). Those rules help to define what constitutes valid markup, but may not be enough, by themselves, to determine validity.
  • Within the boundaries of those rules, markup can be very broad (a single tag can have any number of children), and can also be very deep (it's possible to have a tag within a tag within a tag... down to any depth).
The first of those two considerations is why I've chosen to use page-templates that follow XML rules: the tag-rules are simpler, and don't require keeping track of which tags do and do not require closing tags, or which cannot have closing-tags. Keeping the template-files in XML allows the rendering process (using data from the document's namespace) to determine whether a closing-tag is needed (or allowed). I haven't yet thought about how I'm going to deal with determining how to render unary tags, like <img>, which render differently in HTML 5 and XHTML:
<!-- HTML 5 -->
<img src="">
versus
<!-- XHTML -->
<img src="" />
As things stand right now, both HTML 5 and XHTML renderings of an <img> look like the XHTML output above. That won't cause any issues in any browsers I'm aware of right now, but there's no guarantee that it won't in the future. I'm contemplating setting up an IsXMLDialect property in Namespace, defaulting to False, that would be re-set to True when an XML declaration is created on a document-instance, but that's a discussion for later, perhaps. For the time being, I'm content to leave it alone, rather than thrash through Tag.__str__ and Tag.__unicode__ to fix what isn't currently a problem.
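If an IsXMLDialect flag along those lines were added, the rendering decision for a unary tag might look something like the sketch below. Everything here is an assumption for illustration: render_unary_tag is a hypothetical function, not part of Tag, and it does no attribute-value escaping.

```python
def render_unary_tag( tagName, attributes, isXMLDialect=False ):
    """Renders a childless tag as <x ... /> for XML dialects, or as
<x ...> otherwise"""
    # Sorted so that the output is stable from run to run
    attrText = ''.join(
        ' %s="%s"' % ( name, value )
        for name, value in sorted( attributes.items() )
    )
    if isXMLDialect:
        return '<%s%s />' % ( tagName, attrText )
    return '<%s%s>' % ( tagName, attrText )

html5Result = render_unary_tag( 'img', { 'src': '' } )
xhtmlResult = render_unary_tag( 'img', { 'src': '' }, isXMLDialect=True )
```

With those inputs, html5Result is the HTML 5 form shown above and xhtmlResult is the XHTML form.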

But I digress...

The second item is, arguably, the more significant of the two, given that I'm not expecting to be parsing markup that isn't XML-compliant for a while. It's significant because it makes the scope of the parsing problem theoretically infinite: even with some fairly realistic expectations (no more than, say, 100 children for any given tag, and no deeper nesting than, say, 30 levels), it's still a huge and dynamic set of possibilities to try and come up with a solution around. And that's just in a document's scope: in order to fully implement Tag.innerHTML, a parsing process has to be able to contend with mixed node-types from the supplied text. That is,

myTag.innerHTML = ( 'This is my markup. <strong>' + 
    'There are many like it, but this one is ' + 
    '<em>mine</em>!</strong>' )
should yield markup along the lines of:
<myTag>
    This is my markup. <strong>There are many like it, 
    but this one is <em>mine</em>!</strong>
</myTag>
The results of parsing that initial string aren't a single element with some number of children (as might be expected when parsing a document or a template-file that's used to generate a document). Instead, it's a sequence of nodes that would then have to be appended to myTag's child-nodes after removing any existing child-nodes.

How I'm Going To Approach This

There are two significant processes that MarkupParser is going to undertake. In summary, they are:

  1. Breaking the provided markup-text down into tokens, where each token is one of:
    • An XML declaration;
    • An XML processing-instruction;
    • A DOCTYPE declaration (maybe — I'm still thinking on this);
    • A document;
    • A tag;
    • A CDATA section;
    • A comment; or
    • A text-node.
    I expect the tokens to be a sequence of text-values, probably a list.
  2. Iterating over the sequence of tokens, while keeping track of both a root element-node (which might be a document, but would certainly be a Tag-derived instance) and a current element-node (a Tag also), and setting aside (for now) any DOCTYPE handling or document-identification:
    • If an XML declaration is encountered, and hasn't already been defined for the root element-node, store it for later use;
    • If an XML processing-instruction is encountered, store it for later use;
    • If a start-tag is encountered, create a corresponding Tag-instance, and:
      • If there is no root element defined, set it to the newly-created Tag-instance;
      • Otherwise, append it to the current element;
      • Then set the current element to the newly-created Tag-instance.
    • If an end-tag is encountered:
      • Check its tag-name (and namespace, if applicable) against the tag-name of the current element, raising a MarkupError if they don't match;
      • Otherwise, set the current element to the parent element of the current element, effectively closing the tag by preventing further child-appending to the Tag instance representing it.
    • If a CDATA-, comment- or text-node is encountered, create an instance of CDATA, Comment or Text, respectively, populate it with the applicable text from the token, and append it to the current node.
    • If the root node is being closed/completed, attach any XML declarations and processing-instructions to the (presumed) document.
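The second process can be sketched in miniature as follows. This is a rough illustration of the open-element bookkeeping only, not the MarkupParser implementation: Node is a hypothetical stand-in for the Tag, Comment, CDATA and Text classes, attributes and DOCTYPE handling are left out, and XML declarations and processing-instructions are simply skipped.

```python
class MarkupError( Exception ):
    """Stand-in for the framework's MarkupError"""

class Node( object ):
    """Hypothetical minimal node: a kind, a value and child nodes"""
    def __init__( self, kind, value ):
        self.kind = kind
        self.value = value
        self.parent = None
        self.children = []

    def appendChild( self, child ):
        child.parent = self
        self.children.append( child )
        return child

def nodes_from_tokens( tokens ):
    """Builds a list of top-level nodes from a token sequence"""
    results = []
    current = None
    for token in tokens:
        if token.startswith( '</' ):
            # End-tag: must match the element that is currently open
            name = token[ 2:-1 ].strip()
            if current is None or current.value != name:
                raise MarkupError( 'Unexpected end-tag %s' % token )
            current = current.parent
            continue
        if token.startswith( '<?' ):
            # Declarations and processing-instructions would be stored
            # for later use; this sketch just skips them
            continue
        if token.startswith( '<!--' ):
            node = Node( 'comment', token[ 4:-3 ] )
        elif token.startswith( '<![CDATA[' ):
            node = Node( 'cdata', token[ 9:-3 ] )
        elif token.startswith( '<' ):
            # Tag-name only: attributes are ignored here
            node = Node( 'tag', token[ 1:-1 ].rstrip( '/' ).split()[ 0 ] )
        else:
            node = Node( 'text', token )
        # Attach to the open element, or to the top-level results
        if current is not None:
            current.appendChild( node )
        else:
            results.append( node )
        # A start-tag that isn't self-closing becomes the open element
        if node.kind == 'tag' and not token.endswith( '/>' ):
            current = node
    return results

fragment = nodes_from_tokens( [
    'This is my markup. ', '<strong>',
    'There are many like it, but this one is ',
    '<em>', 'mine', '</em>', '!', '</strong>',
] )
```

For the innerHTML example from earlier, that yields two top-level nodes (a text-node and the strong element) rather than a single root, which is why the return value is a list.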

Creating the Token Sequence

Creating the sequence of tokens is surprisingly easy. My previous efforts used a regular-expression-based process (using re.findall) to extract token-items, and it was mostly functional, but it had a couple of drawbacks:

  • It was occasionally prone to errors, requiring revision of the regular-expression definition, which grew increasingly complex and hard to manage;
  • It ended up causing an odd requirement that there could be no empty tags in the source markup, which was at least occasionally frustrating for the less-technical designer-types who were working on page-templates. By way of example (using Glyphicons that are available as part of Bootstrap), the following markup-structure was required in order to prevent throwing ParsingErrors:
    <div>
      <i class="glyphicon glyphicon-user"><!-- . --></i>
      A user icon
    </div>
    Compare with what should have been allowed:
    <div>
      <i class="glyphicon glyphicon-user"></i>
      A user icon
    </div>
    While that wasn't too awkward to deal with, it definitely took some getting used to, and was disruptive until the habit had been formed, so it's something to avoid in the current implementation.

After some experimentation, where I landed was the following function:

def Tokenize( markupText ):
    tokenChunks = [ 
        tc for tc in markupText.split( '<' ) if tc
    ]
    result = []
    for token in tokenChunks:
        if '>' in token:
            try:
                tag, text = token.split( '>' )
                result.append( '<%s>' % tag )
                if not text.strip():
                    text = ' '
                if text:
                    result.append( text )
            except ValueError:
                result.append( '<%s' % token )
        else:
            if token:
                result.append( token )
    return result
This gets pretty close to (maybe exactly) what's needed for both an XML document-template and a markup-fragment with a mixture of tags and text, in a mere 20 lines of code. The resulting token-sequences have some potentially odd-looking items in them, but nothing that seems likely to prevent them from being used in the node-generation iteration to come.

Running Tokenize Against an XML Document Template

I started with a stripped-down version of the same XML Document Template that I shared at the end of last month — basically, I just stripped out any application or framework items, both namespaces and tags, leaving:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <!--
    Example comment for parsing
  -->
  <![CDATA[
    Example CDATA for parsing
  ]]>
  <head>
    <title>Page Title</title>
    <script src="//some-domain.com/file.js"></script>
    <link rel="stylesheet" href="//some-domain.com/file.css" />
  </head>
  <body>
    <div id="main" class="container">
      <h1>Page Title</h1>
    </div>
    <div id="main" class="container">
      <h1>Page Title: Log-in Required:</h1>
      <form action="" method="post">
        <div class="form-group">
          <label for="username">Name:</label>
          <input type="text" id="username" name="username" class="form-control" />
        </div>
        <div class="form-group">
          <label for="userpass">Password:</label>
          <input type="password" id="userpass" name="userpass" class="form-control" />
        </div>
        <div>
          <button type="submit" class="btn btn-default">Log In</button>
        </div>
      </form>
    </div>
  </body>
</html>
Executing the Tokenize prototype-function against that markup (as a string-constant in a throwaway file) returned the following sequence of tokens:
  1. <?xml version="1.0" encoding="UTF-8"?>
  2. " " (a single space character)
  3. <html xmlns="http://www.w3.org/1999/xhtml">
  4. " "
  5. <!--
    Example comment for parsing
    -->
  6. " "
  7. <![CDATA[
    Example CDATA for parsing
    ]]>
  8. " "
  9. <head>
  10. " "
  11. <title>
  12. "Page Title"
  13. </title>
  14. " "
  15. <script src="//some-domain.com/file.js">
  16. " "
  17. </script>
  18. " "
  19. <link rel="stylesheet" href="//some-domain.com/file.css" />
  20. " "
  21. </head>
  22. " "
  23. <body>
  24. " "
  25. <div id="main" class="container">
  26. " "
  27. <h1>
  28. "Page Title"
  29. </h1>
  30. " "
  31. </div>
  32. " "
  33. <div id="main" class="container">
  34. " "
  35. <h1>
  36. "Page Title: Log-in Required:"
  37. </h1>
  38. " "
  39. <form action="" method="post">
  40. " "
  41. <div class="form-group">
  42. " "
  43. <label for="username">
  44. "Name:"
  45. </label>
  46. " "
  47. <input type="text" id="username" name="username" class="form-control" />
  48. " "
  49. </div>
  50. " "
  51. <div class="form-group">
  52. " "
  53. <label for="userpass">
  54. "Password:"
  55. </label>
  56. " "
  57. <input type="password" id="userpass" name="userpass" class="form-control" />
  58. " "
  59. </div>
  60. " "
  61. <div>
  62. " "
  63. <button type="submit" class="btn btn-default">
  64. "Log In"
  65. </button>
  66. " "
  67. </div>
  68. " "
  69. </form>
  70. " "
  71. </div>
  72. " "
  73. </body>
  74. " "
  75. </html>
  76. " "

The recurring " " (single-space) items might seem a bit odd, at first, or like they might be problematic when it comes to the node-generation process, but they aren't all that surprising, given how Tokenize works. Looking at the first form-group <div>:

<!-- just inside the form -->
<div class="form-group">
  <label for="username">Name:</label>
  <input type="text" id="username" name="username" class="form-control" />
</div><!-- and so on... -->
The initial tokenChunks in Tokenize would acquire the following segments of the original markup-text:
  1. div class="form-group"> (plus the whitespace following until the next tag is encountered)
  2. label for="username">Name:
  3. /label> (plus whitespace after)
  4. input type="text" id="username" name="username" class="form-control" /> (plus whitespace after)
  5. /div>
As the token iteration runs through those token-chunks:
  1. The tag and text values will be successfully acquired by
    tag, text = token.split( '>' )
    with tag containing everything inside the original < and > from the markup-text, and text containing the line-break and indentation white-space that followed the tag.
    From that point on, the rest is just clean-up and formatting, with the bulk of text being reduced down to a single space, before both are appended to the final results.
  2. The same steps will capture label for="username" and Name: for the tag and text values in the <label ...> line.
  3. Follows the same steps as #1, above, with very similar results: a tag-token and a single-space text-token
  4. The same again
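  5. With or without the trailing comment, /div> splits into a tag value of /div and an empty text value, which is replaced with a single space just as in #3 and #4. The ValueError branch is only reached when a chunk contains more than one > character (a text-node with a literal > in it, for example): the unpacking fails, execution drops into the except block, and the chunk is re-assembled with its leading < as a single token.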
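The tuple-unpacking that Tokenize relies on is easy to check in isolation. The sketch below wraps the same split-and-unpack step and returns None where Tokenize would fall into its except block; split_chunk is a hypothetical helper, not part of the prototype.

```python
def split_chunk( chunk ):
    """Mirrors the split-and-unpack step Tokenize applies to one chunk"""
    try:
        tag, text = chunk.split( '>' )
        return ( tag, text )
    except ValueError:
        # Raised when the split doesn't yield exactly two parts
        return None

# A chunk that ends with > still splits into two parts
endTag = split_chunk( '/div>' )
# Whitespace after the tag ends up in the text value
startTag = split_chunk( 'div class="form-group">\n  ' )
# More than one > means more than two parts, so unpacking fails
tooMany = split_chunk( 'p>1 > 0' )
```

A chunk that ends with > unpacks cleanly, with an empty string as its text value; the unpacking only fails when a chunk holds more than one > character.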
The first time I saw this token-output, I was somewhat concerned about the sheer number of the single space items that came back in it. It seemed to me that adding in all of those tokens during the iteration over them would lead to a potentially-significant amount of useless additional information in the final rendering, since each of them would eventually be converted into a Text item storing a single space-character as its content. At the same time, it's perfectly valid in markup to have single spaces between tags that also need to be preserved. I suspect that such cases would mostly show up in content, in situations like:
...this is <code>strong</code> <strong>content</strong> ...
The space between the </code> and <strong> tags is legitimate.

I cannot think of a good (and simple) way to strip out the single-space tokens that should be removed — Doing so would require a way to identify which of them were of legitimate use, and which were just rendered down from (probably) indentation in the markup-source. I set that aside for a bit, to look at the results from a markup-fragment (below), and after looking over that token-result, I'm pretty sure that my concerns were a non-issue:

  • The whitespace-reduction that happens because of indentation or other contiguous whitespace in the markup-source would be similarly handled by the client browser, so the size and content of those long/contiguous white-space chunks turns out to be irrelevant, so long as their existence is preserved;
  • So long as any trailing- or leading-white-space in a text-token isn't stripped down to nothing, any content-resident white-space reduction is fine. Running a quick test against
    ...this is <code>strong</code> <strong>content</strong> ...
    yielded the following tokens:
    1. "...this is " (space after the text is preserved)
    2. <code>
    3. "strong"
    4. </code>
    5. " " (space between the tags is preserved)
    6. <strong>
    7. "content"
    8. </strong>
    9. " ..." (space before the text is preserved)

Running Tokenize Against a Markup Fragment

Between the fragment I checked above and this one:

myTag.innerHTML = ( "This is for a tag's " +
    '<code>innerHTML</code> property<br />' +
    'It contains mixed and unary tags, as well as text!' )
which yielded:
  1. "This is for a tag's " (note the inclusion of the space at the end of the text-value)
  2. <code>
  3. "innerHTML"
  4. </code>
  5. " property" (note the inclusion of the space at the beginning of the text-value)
  6. <br />
  7. "It contains mixed and unary tags, as well as text!"
there's nothing surprising to me, so it feels like that prototype Tokenize function will work just fine.
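Both fragment results can be reproduced by running the prototype directly. The function below is a copy of Tokenize from earlier in the post, repeated only so that the block runs on its own; the two variable names at the bottom are introduced here for illustration.

```python
def Tokenize( markupText ):
    # Everything between one < and the next is a "chunk"
    tokenChunks = [ 
        tc for tc in markupText.split( '<' ) if tc
    ]
    result = []
    for token in tokenChunks:
        if '>' in token:
            try:
                tag, text = token.split( '>' )
                result.append( '<%s>' % tag )
                # Whitespace-only (or empty) text becomes one space
                if not text.strip():
                    text = ' '
                if text:
                    result.append( text )
            except ValueError:
                result.append( '<%s' % token )
        else:
            if token:
                result.append( token )
    return result

contentTokens = Tokenize(
    '...this is <code>strong</code> <strong>content</strong> ...'
)
innerHTMLTokens = Tokenize(
    "This is for a tag's <code>innerHTML</code> property<br />"
    'It contains mixed and unary tags, as well as text!'
)
```

In both results the leading and trailing spaces of the text-tokens survive, and the space between the code and strong elements comes back as its own token.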

Converting the Token Sequence into Objects

The basic outline I provided earlier is pretty close to the final process that I came up with. I didn't specify what was actually being returned, now that I look at it again, but my original thought/expectation was returning a list of Tag instances. That won't work, though: since a parsing process needs to be able to parse markup-fragments, and those fragments may not start with a tag, it needs to return a list of IsNode instances, including CDATA, Comment, Tag and Text object-types, rather than just returning a top-level IsElement object — It's quite possible, as shown, that the markup being parsed will include text-nodes that aren't children of any element. Not a major change, as far as what it returns, but that involved some pretty significant differences in how items were handled inside the process.

It also needs the ability to split tag-names from attributes, for regular tags as well as for the various XML items that could be provided. Once the attributes-text has been isolated, it also needs to be able to identify attributes' name/value pairs. Both of these tasks are, I think, best implemented with a couple of regular expressions:

attributeFinder = re.compile( 
    r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
)
whitespaceSplitter = re.compile( r'\s' )
I suspect I'll need to tweak the attributeFinder, to ensure that it'll grab onto any valid attribute-name. The pattern specified is from memory, and may not be accurate, but I'll review that before I finalize the markup module, and tweak it if my recollection is incorrect.
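
One thing worth checking while reviewing that pattern: the (.*) group is greedy, so if it is ever handed text that holds more than one attribute, the value will run through to the last quote-mark. The sketch below shows that behavior with the pattern as written, next to a hypothetical alternative that stops each value at the same quote-character that opened it. The names strictAttributeFinder and attributesFromText are made up for this example and are not part of the markup module:

```python
import re

# The pattern from the post, unchanged
attributeFinder = re.compile(
    r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
)

# Hypothetical alternative: group 2 remembers the opening quote, and the
# non-greedy value stops at the next occurrence of that same quote
strictAttributeFinder = re.compile(
    r"""([_a-zA-Z][-_0-9A-Za-z]*)\s*=\s*(["'])(.*?)\2"""
)

def attributesFromText( attributes ):
    """Returns a dict of the name/value pairs found in attributes-text"""
    return dict(
        ( name, value )
        for name, quote, value in strictAttributeFinder.findall( attributes )
    )

sample = 'id="main" class="container wide"'

# Greedy: the value swallows everything up to the LAST quote-mark
greedyKey, greedyValue = attributeFinder.match( sample ).groups()

# Non-greedy with a back-reference: one pair per attribute
parsed = attributesFromText( sample )
```

Because findall runs over the whole attributes-text, a value with a space in it (class="container wide") stays in one piece. If the text is split with whitespaceSplitter first, that value gets cut in two before any pattern sees it.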

The prototype function for the iteration (NodesFromTokens) is fairly lengthy, so I'll break it down in a couple of chunks:

def NodesFromTokens( tokens ):
    currentElement = None
    rootElement = None
    xmlDeclaration = None
    xmlInstructions = []
    currentIndex = 0
    errorTokenOffset = 3
    results = []
    for token in tokens:
        if token[ 0 ] == '<':
            # Some sort of tag, or a document
Within the function, there are a fair few variables that keep track of various items as the iteration progresses:
currentElement
None or an IsElement instance: The current element that new nodes will be added to with appendChild.
rootElement
None or an IsElement instance: The top-level element of the DOM-tree that currentElement belongs to.
xmlDeclaration
None or BaseDocument.XMLTag: The XML declaration of the rootElement.
xmlInstructions
list of BaseDocument.XMLTag: The XML processing-instructions of the rootElement, if any.
currentIndex
int: keeps track of the current position in the token-sequence of the current token, used mostly for generating reasonably-detailed error-messages if the parsing-process goes awry.
errorTokenOffset
int: The number of tokens before and after the current token to display when an error is raised.
results
list of BaseNode: The results to be returned by the function.
The balance of the function starts by determining what type of token is being examined. The branches for comments and CDATA sections are almost identical, save for what kinds of objects get created, and what the identification-criteria are:
if token[ 0:4 ] == '<!--':
    # Comment: <!-- content -->
    newNode = Comment( token[ 4:-3 ].strip() )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
elif token[ 0:9 ] == '<![CDATA[':
    # CDATA: <![CDATA[ content ]]>
    newNode = CDATA( token[ 9:-3 ].strip() )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
The XML-items are pretty straightforward, though they are stored for addition to a document later, since odds are good that the document's root tag hasn't been created yet. I have yet to work out how documents' root-tags will be identified at this point, but I'll address that later in this post.
elif token[ 0:2 ] == '<?':
    # XML item: Could be an XML declaration or a 
    # processing-instruction
    innerTag = token[ 2:-2 ]
    try:
        tagName, attributes = innerTag.split( ' ', 1 )
    except ValueError:
        tagName = innerTag
        attributes = ''
    # Generate attributes, if applicable
    attrDict = {}
    if attributes:
        for keyValuePair in attributes.split( ' ' ):
            try:
                key, value = attributeFinder.match( keyValuePair ).groups()
                attrDict[ key ] = value
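            except AttributeError:
                # Raised if the pattern doesn't match (match returns 
                # None, which has no groups()), so do nothing
                pass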
    if tagName == 'xml':
        xmlDeclaration = BaseDocument.XMLTag( tagName )
    else:
        xmlInstructions.append( BaseDocument.XMLTag( tagName ) )
Processing for tags is more complex, though it starts simply enough:
else:
    # Tag: <[xmlns:]tagName[ attribute=""]*[/]?> | </[xmlns:]tagName>
    innerTag = token[ 1:-1 ].strip()
    # Determine if the tag is unary/self-closing as it appears in 
    # the markup-source. This will determine whether or not it gets 
    # used as a currentElement value later on.
    unary = ( token[ -2 ] == '/' )
    # Split out the tag-name and its attribute-text
    try:
        tagName, attributes = whitespaceSplitter.split( innerTag, 1 )
    except ValueError:
        tagName = innerTag
        attributes = ''
    # Clean up attributes text
    if attributes and attributes[ -1 ] == '/':
        attributes = attributes[ 0:-1 ].strip()
    # Clean up tag-name
    if tagName[ -1 ] == '/':
        tagName = tagName[ 0:-1 ].strip()
    # Determine if there's a namespace attached to the tag, and 
    # handle accordingly
    try:
        xmlns, tagName = tagName.split( ':', 1 )
    except ValueError:
        xmlns = None
    if xmlns:
        tagNamespace = Namespace.GetNamespaceByName( xmlns )
    else:
        tagNamespace = None
By this point, the tag's tagName and namespace have been identified. The first thing to do is to see if it's an end-tag, though: If it is, the rest of the processing can be skipped, since it won't have attributes, children, or any other items that need to be processed. All that really needs to happen is verifying that the end-tag matches the item stored in currentElement — if it doesn't, that's an error, and I'll raise a fairly detailed ParsingError accordingly:
    # If the first character of tagName is "/", then it's an end-tag.
    # Check the namespace and the modified tag-name against the current 
    # element's equivalents: If they match the current element can be 
    # closed, and the loop can continue.
    if tagName[ 0 ] == '/':
        endName = tagName[1:]
        if currentElement:
            if endName == currentElement.tagName and \
                (
                    # If there is no tagNamespace, it should be OK?
                    not tagNamespace or 
                    # Otherwise it needs to match!
                    tagNamespace == currentElement.namespace
                ):
                # The end-tag is legit, so we can set currentElement 
                # back to its parentElement. First, though:
                # - Remove any whitespace-only Text at the 
                #   beginning of the childNodes
                while currentElement.childNodes and \
                    isinstance( 
                        currentElement.childNodes[ 0 ], Text
                    ) and not currentElement.childNodes[ 0 ].data.strip():
                    currentElement.childNodes[ 0 ].removeSelf()
                # - Remove any whitespace-only Text at the end of 
                #   the childNodes
                while currentElement.childNodes and \
                    isinstance( 
                        currentElement.childNodes[ -1 ], Text
                    ) and not currentElement.childNodes[ -1 ].data.strip():
                    currentElement.childNodes[ -1 ].removeSelf()
                currentElement = currentElement.parentElement
            else:
                # Generate a fairly detailed error-message, 
                # including some before-and-after tag-tokens to 
                # make debugging bad markup easier
                tcStart = max( currentIndex - errorTokenOffset, 0 )
                tcEnd = min( currentIndex + errorTokenOffset, 
                    len( tokens )
                )
                tokenContext = tokens[ tcStart:tcEnd ]
                # Include WHY the failure happened: tag-name or 
                # -namespace issue:
                errorCauses = []
                if endName != currentElement.tagName:
                    errorCauses.append( 
                        'tag-name mismatch: %s != %s'  % ( 
                            endName, currentElement.tagName
                        )
                    )
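                # Same namespace-check as the match above: only a 
                # namespace that was specified and differs is an error
                if tagNamespace and \
                    tagNamespace != currentElement.namespace:
                    errorCauses.append( 
                        'namespace mismatch: %s != %s'  % ( 
                            xmlns, getattr( 
                                currentElement.namespace, 'Name', None
                            )
                        )
                    )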
                raise ParsingError( 
                    'Mismatched end-tag (%s) while parsing '
                    '%s: %s' % ( 
                        token, tokenContext, 
                        ' and '.join( errorCauses )
                    )
                )
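        # If everything worked, the rest can be skipped for this 
        # tag-token, since end-tags don't have attributes, etc. The 
        # index still has to advance, though, since continue skips 
        # the increment at the end of the loop.
        currentIndex += 1
        continue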
When a tag is closed, I want to remove any empty Text children if they are at the beginning or end of the Tag's childNodes. This is just some basic clean-up, but it will prevent odd occurrences like
<head>
    <script src="//some-domain.com/script.js"></script>
</head>
being processed and resulting in
<head>
    <script src="//some-domain.com/script.js"> </script>
</head>
(Note the space between the starting and ending <script> tags)
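
The trimming itself is the same operation whether it's applied to a tag's childNodes or to a plain list, so it can be sketched on its own. SketchText and trimEmptyText below are stand-ins invented for this example (they are not the Text class or any function from the markup module), and the sketch returns a new list rather than removing nodes in place:

```python
class SketchText( object ):
    """Stand-in for the post's Text class: just holds a data string"""
    def __init__( self, data ):
        self.data = data

def trimEmptyText( nodes ):
    """Returns a copy of nodes without the whitespace-only SketchText items
    at its beginning and end; anything in the middle is left alone"""
    def isEmptyText( node ):
        return isinstance( node, SketchText ) and not node.data.strip()
    start = 0
    end = len( nodes )
    # Walk inward from each end until something non-empty is found
    while start < end and isEmptyText( nodes[ start ] ):
        start += 1
    while end > start and isEmptyText( nodes[ end - 1 ] ):
        end -= 1
    return nodes[ start:end ]

# Strings stand in for tags here; only the SketchText items matter
children = [
    SketchText( '\n    ' ), 'script-tag', SketchText( ' ' ),
    'link-tag', SketchText( '\n' ),
]
trimmed = trimEmptyText( children )
```

The whitespace-only item between the two tags survives, which is the behavior wanted: only the leading and trailing items are dropped. Checking the bounds before indexing also means an empty list, or a list holding nothing but whitespace, comes back as an empty list instead of raising an IndexError.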

If the current token doesn't represent an end-tag, then it must represent a start-tag, so any attributes need to be processed:

    # Create the actual attributes
    attrDict = {}
    if attributes:
        for keyValuePair in whitespaceSplitter.split( attributes ):
            # Each name="value" item
            try:
                key, value = attributeFinder.match( 
                    keyValuePair.strip() ).groups()
                attrDict[ key ] = value
            except AttributeError:
                # Raised if the match-result has no groups(), so 
                # do nothing
                pass
Then, finally, a new Tag instance can be created. If currentElement exists, then the new Tag instance will be appended to it, otherwise it'll be appended to the results list. That allows the same process to deal with both document (or at least document-like) markup-structure, where the entire markup-body is contained in a single tag, and markup fragments, where there may be a number of nodes, including tags, all at the same level.
    # If an xmlns attribute has been specified, grab that and 
    # attach it to the tag:
    nsURI = attrDict.get( 'xmlns' )
    if nsURI:
        tagNamespace = Namespace.GetNamespaceByURI( nsURI )
    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )
The last remaining node-type that needs to be handled is Text instances, which are very simple:
else:
    # Text
    newNode = Text( token )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
currentIndex += 1
Since the results are a list, and it's possible (probably even likely in the case of documents) for there to be empty Text-items at the beginning and/or end of the results, I want to remove them. I don't want to remove any empty Text-items from anywhere else, though: markup-fragments can legitimately have them, and even documents will too...
# Clear out any starting- or ending-results that are empty text-nodes
# (checking that results isn't empty first, to avoid an IndexError)
while results and isinstance( results[ 0 ], Text ) \
    and not results[ 0 ].data.strip():
    results = results[ 1: ]
while results and isinstance( results[ -1 ], Text ) \
    and not results[ -1 ].data.strip():
    results = results[ :-1 ]
return results

Executing these prototype functions against the XML source with the following code:


tokens = Tokenize( xmlSource )
parsed = NodesFromTokens( tokens )
print parsed
print parsed[ 0 ]
yields these results (line-breaks added for clarity; empty Text-items show up as single spaces):
[<idic.markup.Tag object at 0x7fbcb613f490>]
<html xmlns="http://www.w3.org/1999/xhtml"> 
<!-- Example comment for parsing --> 
<![CDATA[ Example CDATA for parsing ]]> 
<head><title>Page Title</title> 
<script src="//some-domain.com/file.js" /></script> 
<link href="//some-domain.com/file.css" rel="stylesheet" />
</head> <body><div id="main" class="container">
<h1>Page Title</h1></div> 
<div id="login" class="container">
<h1>Page Title: Log-in Required:</h1> 
<form method="post"><div class="form-group">
<label for="username">Name:</label> 
<input id="username" type="text" class="form-control" name="username" />
</div> <div class="form-group">
<label for="userpass">Password:</label> 
<input id="userpass" type="password" class="form-control" name="userpass" />
</div> <div><button type="submit">Log In</button>
</div></form></div></body></html>
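
The bookkeeping that NodesFromTokens does with currentElement can be easier to follow in miniature. This sketch is not the prototype function: SketchTag, sketchTokenize and sketchNodes are stand-ins invented for the example, the tokenizer is a bare re.split, and attributes, namespaces, comments and CDATA are all ignored. It only shows start-tags becoming the current element, end-tags handing control back to the parent, and anything without a parent landing in the results:

```python
import re

class SketchTag( object ):
    """Stand-in for Tag: a name, a parent and a list of children"""
    def __init__( self, tagName ):
        self.tagName = tagName
        self.parentElement = None
        self.childNodes = []

    def appendChild( self, child ):
        if isinstance( child, SketchTag ):
            child.parentElement = self
        self.childNodes.append( child )
        return child

def sketchTokenize( markup ):
    """Splits markup into tag-tokens and text-tokens, dropping empty ones"""
    return [ token for token in re.split( r'(<[^>]+>)', markup ) if token ]

def sketchNodes( tokens ):
    """Builds a list of top-level nodes (SketchTag or str) from tokens"""
    currentElement = None
    results = []
    for index, token in enumerate( tokens ):
        if token[ 0 ] != '<':
            # Text: plain strings stand in for Text instances
            newNode, canHaveChildren = token, False
        elif token[ 1 ] == '/':
            # End-tag: must match the element currently being filled
            endName = token[ 2:-1 ].strip()
            if currentElement is None or endName != currentElement.tagName:
                raise ValueError(
                    'Mismatched end-tag %s at token %d' % ( token, index )
                )
            currentElement = currentElement.parentElement
            continue
        else:
            # Start-tag or unary tag: drop any trailing "/" and attributes
            tagName = token[ 1:-1 ].rstrip( '/' ).split()[ 0 ]
            newNode = SketchTag( tagName )
            canHaveChildren = ( token[ -2 ] != '/' )
        # Attach the new node to the current element, or to the results
        if currentElement is not None:
            currentElement.appendChild( newNode )
        else:
            results.append( newNode )
        # A non-unary start-tag becomes the element that gets filled next
        if canHaveChildren:
            currentElement = newNode
    return results

nodes = sketchNodes( sketchTokenize(
    "This is for a tag's <code>innerHTML</code> property<br />"
    "It contains mixed and unary tags, as well as text!"
) )
```

Run against the markup fragment from earlier, this leaves five top-level nodes: three strings, the code tag (holding its one text child), and the br tag. Using enumerate for the index is a design choice of the sketch: the position stays correct even when a branch ends in continue.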

Defining the MarkupParser Class

I decided to make MarkupParser a class for several reasons:

  • Being able to create separate instances for different markup-source items allows the source, the tokens and the results to persist across any number of parsing-executions.
  • Parsers living in separate instances don't run any risk of cross-contamination, which could happen if the process were a free-standing function. To be fair, avoiding that sort of cross-contamination is probably just a matter of discipline, so it's not really required. At the same time, it just feels... safer, really.
  • Separate parser-instances can be attached to other objects as properties if needed. I have this gut feeling that I'll want to be able to do that down the line, probably at about the point when I'm working out page-components.
Ultimately, given the work that I'd done on the prototype functions, creating MarkupParser involved little more than moving the constants and functions into the class definition (and making them protected), and creating properties for the source, tokens and results:
@describe.InitClass()
class MarkupParser( object ):
    """
Provides an object-class that can be instantiated with a chunk of markup that 
can parse that markup into an IsNode-derived-objects DOM tree."""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    _attributeFinder = re.compile( 
        r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
    )
    _whitespaceSplitter = re.compile( r'\s' )

    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#

    @describe.AttachDocumentation()
    def _GetResults( self ):
        """
Gets the final parsed markup-items generated from the original source"""
        try:
            return self._results
        except AttributeError:
            self._results = self._NodesFromTokens( self.Tokens )
            return self._results

    @describe.AttachDocumentation()
    def _GetSource( self ):
        """
Gets the original source supplied to the instance to be parsed"""
        return self._source

    @describe.AttachDocumentation()
    def _GetTokens( self ):
        """
Gets the sequence of tokens that the instance will use to generate final 
results"""
        try:
            return self._tokens
        except AttributeError:
            self._tokens = self._Tokenize( self.Source )
            return self._tokens

    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#

    @describe.AttachDocumentation()
    @describe.argument( 'value', 
        'the original source markup-text to be parsed by the instance', 
        str, unicode
    )
    @describe.raises( TypeError, 
        'if passed a value that is not a str or unicode'
    )
    def _SetSource( self, value ):
        """
Sets the original source markup-text to be parsed by the instance"""
        if type( value ) not in ( str, unicode ):
            raise TypeError( '%s.Source expects a string or unicode value, '
                'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
                )
            )
        # Clear any tokens and results generated from a previous source
        try:
            del self._tokens
        except AttributeError:
            pass
        try:
            del self._results
        except AttributeError:
            pass
        self._source = value

    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#

    @describe.AttachDocumentation()
    def _DelSource( self ):
        """
"Deletes" the original source markup-text to be parsed by the instance by 
setting it to None"""
        self._source = None

    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#

    Results = describe.makeProperty( _GetResults, None, None, 
        'the results of parsing the supplied source into markup-module objects',
        list
    )
    Source = describe.makeProperty( _GetSource, _SetSource, None, 
        'the original markup-text to be parsed by the instance',
        str, unicode
    )
    Tokens = describe.makeProperty( _GetTokens, None, None, 
        'the sequence of tokens generated from the supplied source',
        list
    )

    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.argument( 'source', 
        'the markup-text value to parse', 
        str, unicode
    )
    def __init__( self, source ):
        """
Instance initializer"""
        # MarkupParser is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # TODO: Explain WHY it's nominally final!                             #
        #---------------------------------------------------------------------#
        if self.__class__ != MarkupParser:
            raise NotImplementedError( 'MarkupParser is '
                'intended to be a nominally-final class, NOT to be extended.' )
        # Call parent initializers, if applicable.
        # Set default instance property-values with _Del... methods as needed.
        self._DelSource()
        # Set instance property values from arguments if applicable.
        self.Source = source
        # Other set-up

# ...

    @describe.AttachDocumentation()
    @describe.argument( 'tokens', 
        'the list of tokens to create nodes from', 
        list
    )
    @describe.raises( TypeError, 
        'if passed a non-list tokens value'
    )
    @describe.raises( ValueError, 
        'if the supplied tokens contain any non-str, non-unicode values'
    )
    def _NodesFromTokens( self, tokens ):
        """
Generates a list of nodes from the supplied tokens"""
        # The function-code from the NodesFromTokens
        # prototype function

# ...

    @describe.AttachDocumentation()
    @describe.argument( 'markupText', 
        'the markup-text to tokenize', 
        str, unicode
    )
    @describe.raises( TypeError, 
        'if passed a markupText value that is not a str or unicode'
    )
    @describe.returns( 'a list of strings, each containing one node-item '
        '(start- or end-tag, text, comment, CDATA-section) from the supplied '
        'markupText' )
    def _Tokenize( self, markupText ):
        """
Returns a sequence of markup-tokens from the supplied markup-text"""

# ...

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append( 'MarkupParser' )
I'll make the entire code for MarkupParser available for download, like I did for Tag in a previous post.
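
The caching behavior of the properties is the part of the class that does the most work for its size: Tokens and Results are generated the first time they're asked for, kept after that, and thrown away when Source changes. Stripped of the describe decorators and of the real parsing, the same pattern looks like the sketch below. SketchParser and its trivial _Tokenize (just str.split) are stand-ins for illustration, not the MarkupParser code, and a plain built-in property replaces describe.makeProperty:

```python
class SketchParser( object ):
    """Stand-in showing lazily-generated, cached property-values"""

    def __init__( self, source ):
        self.tokenizeCalls = 0    # Only here to make the caching visible
        self.Source = source

    def _GetSource( self ):
        return self._source

    def _SetSource( self, value ):
        # Changing the source invalidates anything generated from it
        self.__dict__.pop( '_tokens', None )
        self._source = value

    def _GetTokens( self ):
        try:
            return self._tokens
        except AttributeError:
            # Not generated yet (or invalidated): generate and keep
            self._tokens = self._Tokenize( self.Source )
            return self._tokens

    def _Tokenize( self, markupText ):
        self.tokenizeCalls += 1
        return markupText.split()

    Source = property( _GetSource, _SetSource )
    Tokens = property( _GetTokens )

parser = SketchParser( 'one two three' )
first = parser.Tokens
second = parser.Tokens          # Served from the cached value
parser.Source = 'four five'     # Clears the cached tokens
third = parser.Tokens
```

The second access returns the very same list as the first without calling _Tokenize again; only the change of Source triggers a second tokenizing pass.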

Now that MarkupParser is implemented, the implementations of Tag.cloneNode and Tag.innerHTML can be completed, and their unit-tests reinstated:

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _GetinnerHTML( self ):
    """
Gets a text representation of the markup of all children of the instance"""
    result = ''
    for child in self._childNodes:
        result += child.__str__()
    return result

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the markup to set the instance\'s childNodes to after parsing them',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode value'
)
def _SetinnerHTML( self, value ):
    """
Sets the children of the instance by parsing the supplied markup and 
replacing the instance's children with those parsed items"""
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.innerHTML expects a str or unicode '
            'value, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, 
                type( value ).__name__
            )
        )
    self._DelchildNodes()
    for child in MarkupParser( value ).Results:
        self.appendChild( child )

# ...

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'deep', 
    'indicates whether to make a "deep" copy (True) or '
    'not (False)',
    bool
)
def cloneNode( self, deep=False ):
    """
Returns a copy of the current node, with all of its children cloned as well if 
deep is True"""
    if deep:
        return MarkupParser( str( self ) ).Results[ 0 ]
    else:
        return Tag( self.tagName, self.namespace, **self.attributes )
The getter- and setter-methods for innerHTML seem pretty simple to me:
  • _GetinnerHTML really just renders all of the instance's child-nodes into a string (or unicode, if a string-representation raises any of several unicode-related errors) and returns the entire text.
  • _SetinnerHTML takes the provided text, runs it through a MarkupParser, clears all of the instance's childNodes out, then appends each child from the parsed Results.
A very basic unit-test of innerHTML is complete at this point, but I'm expecting that I'll need to add cases to it as I start using Tag:
def testinnerHTML(self):
    """Unit-tests the innerHTML property of a Tag instance."""
    testObject = Tag( 'body' )
    markupSources = [
        'text only',
        '<!-- a comment -->', 
        '<![CDATA[ a CDATA section ]]>', 
        '<a href="#somewhere">somewhere</a>',
        ( 'text before a tag <code>strong</code> text: '
            '<strong>strong text</strong> unary tag '
            '<br />\ntext after a tag' ),
        ]
    markupSources.append( ' '.join( markupSources ) )
    for expected in markupSources:
        testObject.innerHTML = expected
        actual = testObject.innerHTML
        self.assertEquals( actual, expected, 'A Tag\'s innerHTML, if set '
            'to "%s" should return the value it was set to, but "%s" (%s) '
            'was returned instead.' % ( expected, actual, 
                type( actual ).__name__ )
            )
This test, though it's very simple (too simple, perhaps), passes.

The cloneNode method also leverages MarkupParser. The basic idea is that if a deep clone is needed, then rather than cloning the node and recursively cloning each child-node, with all of the re-parenting that would entail, it simply parses the string output of the node being cloned. Using a recursive approach would probably work, but could get very deep very quickly if the original Tag being cloned had any significant depth to it. Taking the parsing approach felt simpler, and easier to maintain on a long-term basis:

@describe.AttachDocumentation()
@describe.argument( 'deep', 
    'indicates whether to make a "deep" copy (True) or '
    'not (False)',
    bool
)
def cloneNode( self, deep=False ):
    """
Returns a copy of the current node, with all of its children cloned as well if 
deep is True"""
    if deep:
        return MarkupParser( str( self ) ).Results[ 0 ]
    else:
        return Tag( self.tagName, self.namespace, **self.attributes )
I generated a very basic unit-test of cloneNode, which passes:
def testcloneNode(self):
    """Unit-tests the cloneNode method of a Tag instance."""
    testObject = Tag( 'div' )
    row1 = testObject.appendChild( Tag( 'div', className='row', htmlId='row1' ) )
    label1 = row1.appendChild( Tag( 'label', htmlId='label1' ) )
    label1.appendChild( Text( 'Label 1' ) )
    row2 = testObject.appendChild( Tag( 'div', className='row', htmlId='row2' ) )
    label2 = row2.appendChild( Tag( 'label', htmlId='label2' ) )
    label2.appendChild( Text( 'Label 2' ) )
    row3 = testObject.appendChild( Tag( 'div', className='row', htmlId='row3' ) )
    label3 = row3.appendChild( Tag( 'label', htmlId='label3' ) )
    label3.appendChild( Text( 'Label 3' ) )
    clonedNode = testObject.cloneNode( True )
    self.assertEquals( str( clonedNode ).replace( ' ', '' ), 
        str( testObject ).replace( ' ', '' ) )
The replacement of single-space strings in the final assertEquals was put in place because the parsed items may not have the same whitespace as the source string — another variant of the whitespace-removal concerns that I initially had and mentioned earlier. I'm not completely happy with this test, in all honesty, but like the test-method for innerHTML, I'll probably come back to it and add new items as I encounter them.
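
For comparison, here is roughly what the recursive alternative to a deep clone looks like. SketchNode is a hypothetical stand-in with just enough structure to show the re-parenting involved; it is not the Tag class, and its cloneNode is not the parse-based one shown above. The last lines also point at a property that a string-comparison alone doesn't cover: a clone should be a separate tree, so changing it must not change the original.

```python
class SketchNode( object ):
    """Stand-in element: a name, attributes, a parent and children"""
    def __init__( self, tagName, **attributes ):
        self.tagName = tagName
        self.attributes = dict( attributes )
        self.parentElement = None
        self.childNodes = []

    def appendChild( self, child ):
        child.parentElement = self
        self.childNodes.append( child )
        return child

    def cloneNode( self, deep=False ):
        # The shallow part: same name, a copy of the attributes
        clone = SketchNode( self.tagName, **self.attributes )
        if deep:
            # One level of recursion per level of nesting
            for child in self.childNodes:
                clone.appendChild( child.cloneNode( True ) )
        return clone

original = SketchNode( 'div' )
row = original.appendChild( SketchNode( 'div', className='row' ) )
row.appendChild( SketchNode( 'label', htmlId='label1' ) )

deepClone = original.cloneNode( True )
shallowClone = original.cloneNode()

# Changing the clone must leave the original alone
deepClone.childNodes[ 0 ].attributes[ 'className' ] = 'changed'
```

A check along these lines (same structure, different objects, original untouched after the clone is modified) is one candidate for the cases to add to testcloneNode later on.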

That just leaves the unit-testing for MarkupParser itself. At this point, I'm not sure what the various strategies are that will yield useful tests, though I have some ideas that I'm going to pursue in the future. For the time being, though I really don't like skipping them, I'm going to, with a big, old DEFERRED message in the results so that I won't forget to come back to them later:

@unittest.skip( '### DEFERRED: Need to come up with a good, '
    'solid testing strategy still.' )
Part of the decision to defer these hinges on the fact that I'll want to make sure to test MarkupParser against markup-files that are more or less representative of the sort of page-templates that I'm eventually expecting to pass to it. At present, I'm struggling with how to keep track of the expected results for those files: without a reliable expected value to compare the actual value against, the testing process would likely be meaningless...

There's been a sizable chunk of code shown here, and the post is getting pretty long, so this feels like a good point to stop for now. I have yet to work in how the parsing-process will identify documents — really, I have yet to actually define concrete document-classes, so that will be the focus for my next post. I'm pretty sure that'll wrap up markup-generation and -parsing, though (finally).