I Dream In Code: Generating and Parsing Markup in Python [7]

With the Tag class finally implemented (minus the couple of deferred items waiting on MarkupParser), it's time, I think, to work out the conventions for various types of markup documents. The approach that I plan on taking is to define a BaseDocument abstract class that derives from Tag in order to carry the capabilities of Tag through to all documents.

From that point on, it's just a matter of defining concrete document-classes for each of the document-types that I'm expecting to be using:

An HTML5Document;
An XHTMLDocument; and (probably)
An XMLDocument;

I'm not sure that this strategy would work across other languages with any frequency, though a quick check with PHP would seem to indicate that it would work. The following code, at any rate, doen't raise any errors when executed from the command-line:

<?php

abstract class BaseNode
{
}

class Tag extends BaseNode
{
}

abstract class BaseDocument extends Tag
{
}

class HTML5Document extends Tag
{
}

$doc = new HTML5Document();
?>

Before I can actually define those concrete document-classes, though, I need to determine what their members are, and how theier behavior differs from Tag...

What Are the Differences Between a Document and a `Tag`?

There are two main areas where documents differ from tag-elements: their object-members, and how they render. Since the three document-types that I'm concerned with have significantly different members (XML won't have head or body properties like an HTML document does, for example, and there are several other properties that fall into a similar classification), the list of members that need to be implemented is actually very short:

The all property: Basically just a call to Tag.getElementsByTagName( '*' ), returning all of the Tag children of the document;
The contentType property: Returns the MIME-type of the document
The doctype property: I haven't been able to pin down exactly what this does on the browser side, but it definitely relates to the <!DOCTYPE html> declaration in an HTML 5 document. My expectation as I'm writing this is that it will return the DOCTYPE for the instance as it would render.

While there are 200-odd more members in an HTML document, these are the only ones that I think might be relevant on the server side that aren't already going to be members of a BaseDocument just because it derives from Tag. A lot of the remainder were properties that have no meaning or use outside a broswer context (e.g., the readyState property and createEvent method, as well as all the on...* event-methods). Several of the properties are essentially just wrappers around some variation of Tag.getElementsByTagName for specific tags (images and scripts) or some similar mechanism with different criteria (links, possibly), and I'd probably not implement them (at last not in BaseDocument) even if they weren't document-type specific. I may well add those to specific concrete document-classes down the line, just so they're available, but they don't belong in BaseDocument.

On the rendering side, each of the document-types I'm planning to build out has slightly different output. Optional items are in brackets []:

HTML 5

<DOCTYPE html>
<html>
    <!-- the document head, body -->
</html>

XHTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html[ xmlns="http://www.w3.org/1999/xhtml"]>
    <!-- the document head, body -->
</html>

(This is XHTML transitional, but there are also official strict and frameset variants, per the w3.org site)

Since it's technically XML, XHTML (of any flavor) can have processing-instructions and any other pre-document-root elements that XML allows.

XML

<?xml[ version="#.#"][ encoding="XXXX"]?>
[<?xml-processing-instruction(s) attributes="allowed"?>]
[<!DOCTYPE root-element[ PUBLIC "PUBLIC identifier"][ "SYSTEM identifier"]>]
<root-element[ attributes]>
    <!-- child elements and content -->
</root-element>

All of these, I think, can live in BaseDocument — the rules for them are of varying complexity, but they all feel like they're achievable at this level.

Looking at `DOCTYPE`

All three document-types have (or are allowed) some variation of a <!DOCTYPE>. The rules for a DOCTYPE declaration and how it gets rendered are pretty simple:

It starts with <!DOCTYPE
That's followed by the tagName of the root tag of the document
It may have a public identifier (a single-line text-value), in which case that should be rendered and prefixed with PUBLIC
It may also have a system identifier (also a single-line text-value):
- If there is no public identifier, the system identifier should be rendered with SYSTEM as a prefix
- Otherwise, it can just be rendered, with no prefix
It ends with >
It's the last thing rendered in output before the start of the Tag-derived structure and output

There are at least three different ways that a DOCTYPE representation could be implemented that I can think of.

The simplest way is, I think, to just store the applicable public and system identifiers as class-level constants, and render them accordingly in the __str__ and __unicode__ methods of the document-instance. That would work fine for HTML 5 and XHTML document-types, since the public and system identifiers for those document-types shouldn't ever change, really. That also has the advantage (I think) of keeping everything in one class-definition, so there'd be less code to manage. Unfortunately, that starts to fall apart as soon as XML documents enter the picture, unless a distinct document-class is built out for each and every XML document-type. That prospect feels ugly.

A more complicated approach is defining a class to represent a DOCTYPE. So long as an instance of that class has a reference to the document it's associated with, that'd allow it to grab the tagName that it needs at render-time. That would leave only the public- and system-identifier values to add to the __init__, so that they could be passed to the DOCTYPE-representative object during the construction of a document-instance. That doesn't feel horrible, but since it'd require additional arguments, with cryptic values, it feels clumsy. Even if those values were set up as module-level constants (which would help, I think), that's still more stuff that has to be remembered every time a document has to be created.

Still another possibility: A DOCTYPE for any given document-type is almost certainly as distinct as its namespace. If the public- and system-identifier values were attached to each Namespace instance, even if it were done outside the object-construction process, and a document's namespace were required during its construction, then the storage of those identifier-values is in a single place (a Namespace instance), and could be accessed in the __str__ and __unicode__ methods much like they could if they were class constants in the first alternative.

That feels pretty reasonable to me, but I think I'd also want to set up some sort of mechanism that would allow namespaces to be defined outside the actual Python code — possibly by setting one or many configuration-files that would define names and other relevant properties for any number of namespaces. The trade-off there is that any namespaces defined by that sort of configurable set-up probably couldn't be referred to as module-level constants — As things stand right now, I'd defined Namespace-instance constants in the markup module for HTML 5 and XHTML documents both:

# HTML 5 namespace
HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'HTML5Namespace' )

# XHTML namespace
XHTMLNamespace = Namespace(
    'xhtml',
    'http://www.w3.org/1999/xhtml', 
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'XHTMLNamespace' )

Those constants would, I think, have to go away in order to keep access to all available namespaces reasonably consistent.

Or, perhaps not, now that I think on it more. If each application that needs to have one or more namespaces defined actually defines them as constants within that application, then they are accessible as constants within that application's codebase. That might, down the line, require some movement of more general-purpose Namespace definitions into a common location — maybe in markup, maybe elsewhere — but they'd still be accessible the same way that HTML5Namespace and XHTMLNamespace are now.

With all of those options considered, I'm going to take this last approach, I think. It feels reasonable, keeps things relatively well-contained, and doesn't require a huge amount of refactoring of Namespace or a lot of additional code in BaseDocument. As it turns out, a similar approach/solution can be applied to the contentType of BaseDocument — referring to the document's namespace, which in turn has a ContentType property whose value is set during the construction of the instance. The complete changes to Namespace are:

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
def _GetContentType( self ):
    """
Gets the MIME-type associated with documents of the namespace the instance 
represents"""
    return self._contentType

# ...

@describe.AttachDocumentation()
def _GetPublicIdentifier( self ):
    """
Gets the public identifier of the namespace."""
    return self._publicIdentifier

@describe.AttachDocumentation()
def _GetSystemIdentifier( self ):
    """
Gets the system identifier of the namespace."""
    return self._systemIdentifier

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the MIME-Type to set for documents of the namespace the instance '
    'represents',
    str, unicode, None
)
@describe.raises( TypeError, 
    'if passed a value that is not a str, a unicode, or None'
)
@describe.raises( ValueError, 
    'if passed a value that is not a member of %s' % ( 
        sorted( KnownMIMETypes )
    )
)
def _SetContentType( self, value ):
    """
Sets the MIME-Type of the instance"""
    if value != None and type( value ) not in ( str, unicode ):
        raise TypeError( '%s.ContentType expects a str or unicode value '
            'that is one of the known MIME-types on the system, or None, '
            'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value not in KnownMIMETypes:
        raise ValueError( '%s.ContentType expects a str or unicode value '
            'that is one of the known MIME-types on the system, or None, '
            'but was passed "%s" which could not be found' % ( 
                self.__class__.__name__, value )
            )
    self._contentType = value

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the public identifier of the namespace to set for the instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
def _SetPublicIdentifier( self, value ):
    """
Sets the public-identifier value for the instance"""
    if type( value ) not in ( str, unicode ) and value != None:
        raise TypeError( '%s.PublicIdentifier expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.PublicIdentifier expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
    self._publicIdentifier = value

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the system identifier of the namespace to set for the instance',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode type or None'
)
@describe.raises( ValueError, 
    'if passed a value that has multiple lines in it'
)
def _SetSystemIdentifier( self, value ):
    """
Sets the System-identifier value for the instance"""
    if type( value ) not in ( str, unicode ) and value != None:
        raise TypeError( '%s.SystemIdentifier expects a single-line str '
            'or unicode value, or None, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if value:
        if '\n' in value or '\r' in value:
            raise ValueError( '%s.SystemIdentifier expects a single-line '
                'str or unicode value, or None, but was passed "%s" (%s) '
                'which has multiple lines' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    self._systemIdentifier = value

#-----------------------------------#
# Instance property-deleter methods #
#-----------------------------------#

@describe.AttachDocumentation()
def _DelContentType( self ):
    """
"Deletes" the MIME-type associated with documents of the namespace the instance 
represents by setting it to None"""
    self._contentType = None

# ...

@describe.AttachDocumentation()
def _DelPublicIdentifier( self ):
    """
"Deletes" the public identifier of the namespace by setting it to None."""
    self._publicIdentifier = None

@describe.AttachDocumentation()
def _DelSystemIdentifier( self ):
    """
"Deletes" the system identifier of the namespace by setting it to None."""
    self._systemIdentifier = None

#-----------------------------------#
# Instance Properties               #
#-----------------------------------#

ContentType = describe.makeProperty(
    _GetContentType, None, None, 
    'the MIME-type of the content expected for a document of the '
    'namespace the instance represents',
    str, unicode, None
)

# ...

PublicIdentifier = describe.makeProperty(
    _GetPublicIdentifier, None, None, 
    'the public identifier of the namespace',
    str, unicode, None
)
SystemIdentifier = describe.makeProperty(
    _GetSystemIdentifier, None, None, 
    'the system identifier of the namespace',
    str, unicode, None
)

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()

# ...

@describe.argument( 'contentType', 
    'the MIME-type of the content associate with the instance',
    str, unicode, None
)
    @describe.argument( 'publicId', 
        'the public-identifier of the namespace',
        str, unicode, None
    )
@describe.argument( 'systemId', 
    'the system-identifier of the namespace',
    str, unicode, None
)

# ...

def __init__( self, name, namespaceURI, contentType, systemId=None, 
    publicId=None, defaultRenderingModel=renderingModels.Mixed, 
    **tagRenderingModels ):
    """
Instance initializer"""

    # ...

    # Set default instance property-values with _Del... methods as needed.
    self._DelContentType()

    # ...

    self._DelPublicIdentifier()
    self._DelSystemIdentifier()
    # Set instance property values from arguments if applicable.
    self._SetContentType( contentType )

    # ...

    self._SetPublicIdentifier( publicId )
    self._SetSystemIdentifier( systemId )

The HTML5Namespace- and XHTMLNamespace-constants change slightly, to:

#-----------------------------------#
# Default Namespace constants       #
# provided by the module.           #
#-----------------------------------#

# HTML 5 namespace
HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html', 
    'text/html',
    None,
    None,
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'HTML5Namespace' )

# XHTML namespace
XHTMLNamespace = Namespace(
    'xhtml',
    'http://www.w3.org/1999/xhtml', 
    'text/html',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    renderingModels.RequireEndTag,
    br=renderingModels.NoChildren,
    img=renderingModels.NoChildren,
    link=renderingModels.NoChildren,
    )
__all__.append( 'XHTMLNamespace' )

Finally, unit-tests get updated and run:

########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 265
 + Tests ran in ........... 0.17 seconds
Number of errors .......... 0
Number of failures ........ 1
Number of tests skipped ... 107
########################################
FAILURES
#--------------------------------------#
testCodeCoverage (__main__.testmarkupCodeCoverage)
AssertionError: 
    Unit-testing policies require test-cases for all classes 
    and functions in the idic.markup module, but the following 
    have not been defined:
        (testBaseDocument)

And that, I believe, provides everything needed to implement the doctype property in BaseDocument, which could then also be used in its __str__ and __unicode__ methods to render it if/as needed.

Looking at the XML Headers

Apart from their final output, both the initial XML declaration and any XML processing-instruction that I've run across look, structurally, like they could be represented by a Tag-variant: They have a name, and can have attributes. That does not include any inline DTD specifications (see the An Internal DTD Declaration section here for an example of this), but I don't honestly expect that providing an inline DTD is something that will be needed, so I'm not going to worry too much about that, at least for the time being.

As a result, my first thought with regards to implementing those is to generate a Tag subclass, possibly as an inline/nested class in BaseDocument itself, that overrides the __str__ and __unicode__ methods to generate the right output. That would then allow a document-level property (call it XMLDeclaration) to provide the initial XML declaration, and a collection of those tag-types (XMLProcessingInstructions, as an ElementList) to represent any of the XML processing-instructions for a document-instance. That implementation looks like this:


@describe.InitClass()
class BaseDocument( Tag, object ):

    # ...

    #-----------------------------------#
    # Inline class definitions          #
    #-----------------------------------#

    @describe.InitClass()
    class XMLTag( Tag, object ):

        # ...

        #-----------------------------------#
        # Instance Initializer              #
        #-----------------------------------#
        @describe.AttachDocumentation()
        @describe.argument( 'tagName',
            'the tag-name to set in this created instance',
            str, unicode
        )
        @describe.keywordargs( 
            'the attribute names/values to set in the created instance'
        )
        def __init__( self, tagName, **attributes ):
            """
Instance initializer"""
            Tag.__init__( self, tagName, **attributes )

        # ...

        #-----------------------------------#
        # Instance Methods                  #
        #-----------------------------------#

        @describe.AttachDocumentation()
        def __str__( self ):
            """
Returns a string representation of the instance"""
            try:
                result = '<?%s' % ( self.tagName )
                for name in self.attributes:
                    result += ' %s="%s"' % ( name, self.attributes[ name ] )
                result += '?>'
                return result
            except ( UnicodeDecodeError, UnicodeEncodeError, UnicodeError ):
                return __unicode__( self )

        @describe.AttachDocumentation()
        def __unicode__( self ):
            """
Returns a unicode representation of the instance"""
            result = u'<?%s' % ( self.tagName )
            for name in self.attributes:
                result += u' %s="%s"' % ( name, self.attributes[ name ] )
            result += u'?>'
            return result

        # ...

    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    # ...

With that class available, the two BaseDocument properties noted above can be implemented, making them available for use in the rendering processes of BaseDocument.__str__, and BaseDocument.__unicode__. Some helper-methods defined in BaseDocument, to set or add items to those properties, will also need to be created, but they feel pretty simple:

SetXMLVersion( version ):: Sets the "version" attribute of the instance's XMLDeclaration, creating it in the process if necessary
SetXMLEncoding( encoding ):: Sets the "encoding" attribute of the instance's XMLDeclaration, creating it in the process if necessary
CreateXMLInstruction( name, **attributes ):: Creates and adds an XML processing-instruction to the instance's XMLProcessingInstructions collection

The Final Implementation of `BaseDocument`

There's not a whole lot present in BaseDocument, but there are some significant chunks over and above the properties noted earlier. The implementation of the __init__ methodof BaseDocument and the three XML-structure-related methods are pretty simple:

#-----------------------------------#
# Instance Initializer              #
#-----------------------------------#
@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to set in this created instance',
    str, unicode
)
@describe.argument( 'namespace',
    'the namespace that the instance belongs to',
    Namespace
)
@describe.keywordargs( 
    'the attribute names/values to set in the created instance'
)
def __init__( self, tagName, namespace, **attributes ):
    """
Instance initializer"""
    # BaseDocument is intended to be an abstract class,
    # and is NOT intended to be instantiated. Alter at your own risk!
    if self.__class__ == BaseDocument:
        raise NotImplementedError( 'BaseDocument is '
            'intended to be an abstract class, NOT to be instantiated.' )
    # Call parent initializers, if applicable.
    Tag.__init__( self, tagName, namespace, **attributes )
    # Set default instance property-values with _Del... methods as needed.
    self._DelXMLDeclaration()
    self._DelXMLProcessingInstructions()
    # Set instance property values from arguments if applicable.
    # Other set-up

#-----------------------------------#
# Instance Garbage Collection       #
#-----------------------------------#

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to set in this created instance',
    str, unicode
)
@describe.keywordargs( 
    'the attribute names/values to set in the created instance'
)
def CreateXMLInstruction( self, tagName, **attributes ):
    """
Creates an XMLTag instance with the supplied tag-name and attributes and appends 
it to the instance's XML processing-instructions"""
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLProcessingInstructions.append( 
        BaseDocument.XMLTag( tagName, **attributes )
    )

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the encoding value to set in the instance\'s xml declaration', 
    str, unicode
)
@describe.raises( TypeError, 
    'if passed an encoding value that is not a str or unicode'
)
@describe.raises( ValueError, 
    'if passed an encoding value that is not a single word'
)
def SetXMLEncoding( self, value ):
    """
Sets the "encoding" attribute-value in the instance's XMLDeclaration"""
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.SetXMLEncoding expects a single-word str or '
            'unicode value for its encoding, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__ )
            )
    if ' ' in value or '\n' in value or '\t' in value or '\r' in value:
        raise ValueError( '%s.SetXMLEncoding expects a single-word str or '
            'unicode value for its encoding, but was passed "%s" which is '
            'invalid' % ( self.__class__.__name__, value )
        )
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLDeclaration.setAttribute( 'encoding', value )

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the version value to set in the instance\'s xml declaration', 
    str, unicode, float, int, long
)
@describe.raises( ValueError, 
    'if passed a version value that is not a float and cannot be converted '
    'to one'
)
def SetXMLVersion( self, value ):
    """
Sets the "version" attribute-value in the instance's XMLDeclaration"""
    if type( value ) != float:
        try:
            checkValue = float( value )
            if checkValue < 1.0:
                raise ValueError
            value = str( checkValue )
        except:
            raise ValueError( '%s.SetXMLVersion expects a float value '
                'greater than or equal to one, or a text or numeric value '
                'that can be converted to one, but was passed '
                '"%s" (%s)' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    else:
        if value < 1.0:
            raise ValueError( '%s.SetXMLVersion expects a float value '
                'greater than or equal to one, or a text or numeric value '
                'that can be converted to one, but was passed '
                '"%s" (%s)' % ( 
                    self.__class__.__name__, value, type( value ).__name__ )
                )
    if not self.XMLDeclaration:
        self._SetXMLDeclaration( BaseDocument.XMLTag( 'xml' ) )
    self.XMLDeclaration.setAttribute( 'version', str( value ) )

The __str__ and __unicode__ methods arent complex either, though they might seem so atr first glance, but I'm pretty confident that the comments in the code tell the entore story of how they work:

@describe.AttachDocumentation()
def __str__( self ):
    """
Returns a string representation of the instance"""
    # Try rendering the instance as a string:
    try:
        result = ''
        # TODO: Add XML declaration, if applicable 
        if self.XMLDeclaration:
            result += '%s' % self.XMLDeclaration
        # TODO: Add XML processing-instructions, if applicable 
        for instruction in self.XMLProcessingInstructions:
            result += '%s' % instruction
        # TODO: Add DOCTYPE, if applicable
        result += '%s' % self.doctype
        result += '<%s' % ( self.tagName )
        # If the instance has a namespace, render that too
        if self.namespace:
            result += ' xmlns="%s"' % ( self.namespace.namespaceURI )
        # If there are child namespaces that aren't the same as the local 
        # namespace, they need to be included:
        for ns in self.childNamespaces:
            if ns != self.namespace:
                result += ' xmlns:%s="%s"' % ( 
                    self.namespace.Name, self.namespace.namespaceURI
                )
        # Since a document is also a tag, it can have attributes, so render 
        # any present:
        for attr in self.attributes:
            result += '%s="%s"' % ( 
                attr, self.attributes[ attr ]
            )
        # Close the starting tag
        result += '>'
        # Add Tag.childNodes.__str__ to results
        for child in self.childNodes:
            result += '%s' % child
        # Strip the current results just to keep things clean
        result = result.strip()
        # Add the closing tag
        result += '</%s>' % ( self.tagName )
        # And return it
        return result
    # If string-rendering fails because it needs unicode, return the 
    # unicode representation instead.
    except ( UnicodeDecodeError, UnicodeEncodeError, UnicodeError ):
        return __unicode__( self )

@describe.AttachDocumentation()
def __unicode__( self ):
    """
Returns a unicode representation of the instance"""
    result = u''
    # TODO: Add XML declaration, if applicable 
    if self.XMLDeclaration:
        result += u'%s' % self.XMLDeclaration
    # TODO: Add XML processing-instructions, if applicable 
    for instruction in self.XMLProcessingInstructions:
        result += u'%s' % instruction
    # TODO: Add DOCTYPE, if applicable
    result += u'%s' % self.doctype
    result += u'<%s' % ( self.tagName )
    # If the instance has a namespace, render that too
    if self.namespace:
        result += u' xmlns="%s"' % ( self.namespace.namespaceURI )
    # If there are child namespaces that aren't the same as the local 
    # namespace, they need to be included:
    for ns in self.childNamespaces:
        if ns != self.namespace:
            result += u' xmlns:%s="%s"' % ( 
                self.namespace.Name, self.namespace.namespaceURI
            )
    # Since a document is also a tag, it can have attributes, so render 
    # any present:
    for attr in self.attributes:
        result += u'%s="%s"' % ( 
            attr, self.attributes[ attr ]
        )
    # Close the starting tag
    result += u'>'
    # Add Tag.childNodes.__unicode__ to results
    for child in self.childNodes:
        result += u'%s' % child
    # Strip the current results just to keep things clean
    result = result.strip()
    # Add the closing tag
    result += u'</%s>' % ( self.tagName )
    # And return it
    return result

How `BaseDocument` Will Be Used

The next logical step, I think, is to define document-type classes for HTML 5 and XHTML document-types — one document-type for each Namespace constant available in the markup module. The implementation of those is where differentiation between the two HTML dialects starts to take shape, as do the differences between the two of them and any generic XML-derived markup. The HTML-variant implementations will be very simple: Neither will override much (if any) of the functionality of BaseDocument, both may well have some common structures added (like head and body properties that provide direct access to the Tag-instance representing them, for example). Down the line, they'll both likey have support for script- and stylesheet-management attached in some fashion, but that's a topic for a later post.

There's one other difference that I can think of, offhand, between the two HTML variants, maybe: An XHTML document's __init__ might set XML-declaration values (version and encoding) in order to conform to the XML requirements that underlie it. There's also the possibility that XML processing-instructions might be added, though that's not part of a baseline XHTML document. Given that this is the only difference I can identify between the two dialects that isn't already accounted for through the relvant Namespace associated, I'm going to give some thought to how best to proceed on defining concrete HTML-document classes while I work my way through the MarkupParser in my next post.

And that's all for today, I think.

I Dream In Code

Tuesday, May 16, 2017

Generating and Parsing Markup in Python [7]

What Are the Differences Between a Document and a `Tag`?

Looking at `DOCTYPE`

Looking at the XML Headers

The Final Implementation of `BaseDocument`

How `BaseDocument` Will Be Used

No comments:

Post a Comment

Tuesday, May 16, 2017

Generating and Parsing Markup in Python [7]

What Are the Differences Between a Document and a Tag?

Looking at DOCTYPE

Looking at the XML Headers

The Final Implementation of BaseDocument

How BaseDocument Will Be Used

No comments:

Post a Comment

What Are the Differences Between a Document and a `Tag`?

Looking at `DOCTYPE`

The Final Implementation of `BaseDocument`

How `BaseDocument` Will Be Used