The last missing piece in parsing markup is figuring out how to reliably identify
documents in a chunk of markup-text. The basic parsing-process already identifies
them as tags, and even assigns the appropriate namespace if one is provided in the
original markup (in an xmlns attribute). Different document-types, though,
have different output rules once they're being rendered to a client browser. I've
noted this before, but put off actually defining concrete document-classes
until the parsing-process was complete. Now it's time to take a more complete look
at those.
Where Things Stand Right Now
In the original planning for the markup module, I'd specified an
abstract base class (BaseDocument) that would be used as the starting-point
for concrete classes for specific document-types:
- Creating a new document as an object in an application's code, which
might look something like this:
newDocument = HTML5Document( ... )
- Parsing a document from markup-source (obviously, perhaps, since the
MarkupParser class was just completed in the last post); and
developer-side interaction with the framework. I can't think of anything meaningful to discuss about that use-case that won't, hopefully, be obvious from the signature of the applicable __init__ method.
Documents from a MarkupParser
The use-case of the last of those scenarios is the one that I was most concerned
with when I started this series of posts. The relevant goal was to provide designer-side
control over a document-type, simply by changing the document-specification in some
fashion in an XML template-file that they have ready access to, without
requiring access to any application-code. That XML template-file could (and should,
for HTML5 and XHTML documents) still use HTML tag-names, in order to keep things
reasonably familiar.
Truth be told, I spent weeks working through this aspect of document-creation,
pondering a number of variants, trying to determine how they would fit in to the
MarkupParser class, etc., etc. At times, though some of the solutions
I arrived at were reasonably elegant (and others turned out to be horrible kludges),
I either lost sight of that "designer-side control" goal, or the results, while
probably functional, just felt too complicated. It wasn't until I took a
break from the code, took a step back, and started asking myself "what does that
designer-side-control goal actually mean?" that I hit on the approach
that I'm going to implement.
So, what does the "designer-side-control" goal really mean? It's
actually not all that complicated. If the development efforts of a web-application
are broken into "developer-side" and "designer-side" tasks and needs, the
designer side of that equation needs:
- The ability to modify a page-template without having to worry about breaking existing application functionality
  - So, for example, if a page's layout needs to be altered by moving content
    (and, down the line, page-components) around, all that the designer
    should have to modify is the page-template file.
- The ability to add existing functionality into a page-template that it doesn't currently exist in
  - This essentially boils down to being able to add ready-for-use page-components
    to an existing page-template, and have it "just work". I haven't started exploring the idea of page-components just yet, but it's something that I will touch on (at least conceptually) for a while before I work out actual implementations. At present, what I have in mind for page-components looks something like this:
    - Page components will be represented in page templates as tags;
    - Those tags will map back to server-side objects in some fashion;
    - The component-tags will allow in-markup templating of their content;
    - Components may act as a View in a perhaps-formal server-side MVC
      pattern (it's too early to determine that for certain just yet, but
      that's a "should have" goal for them at any rate)
- The ability to specify (and thus change) output document-types without having to deal with the application code itself
  - This specification needs to be as simple as possible, in order to cater to
    the possibilities of:
    - People in a designer role who don't have much experience with whatever
      markup language(s) are in use (which also helps, at some level, to
      future-proof the page-templates and their corresponding processes);
      and/or
    - Tools used to modify page-templates that might mangle markup that is
      too complex, or that doesn't conform to some known standard (a
      "too-corrective" editor, one that "fixes" markup it determines to be
      invalid, breaking the template's markup-requirements in the process).
  - This is where my main hang-up with various implementations came into play, prompting my re-think of the process that led to the implementation here.
- Not having a bunch of "functional code" living in page-templates
  - This ties in with the "existing functionality" item above, and also relates
    to the page-component concept. One of the things that I don't like
    about many of the Python frameworks I've looked at is the inclusion of logical
    control structures in the template markup. Django is one of the more popular
    of those frameworks, and is pretty mature, so I'll use its
    built-in template tags and filters as an example. It provides dozens
    of logical/functional items, a simple example of which is:
        <ul>
        {% for athlete in athlete_list %}
            <li>{{ athlete.name }}</li>
        {% endfor %}
        </ul>
    Now, I'll grant that it's a well-thought-out collection of capabilities, hands down. I'll also grant that it provides "designer-level" presentation-control. The aspects that really bother me about this structure are:
    - That's potentially a lot of additional pseudo-markup to throw at a (possibly) non-technical user, particularly in more complex function-and-design scenarios.
    - A (perhaps) typical "designer" developer is more likely to be using one of the various WYSIWYG editors (Dreamweaver, for example), and the additional template-code may make it difficult, or even impossible, to really work in that WYSIWYG mode. Granted, my recent experience with that kind of scenario has been pretty solidly shaped by the environment I was in (an advertising agency), where the standard tool-chain for designers and creatives included Dreamweaver, so I may be more concerned about this than is really warranted, but even so...
    - At a more functional level, perhaps, it just feels wrong to me to have decision-making happening in the presentation layer. Simple iteration makes some sense, though I'd hope that something on the server side would take care of that, flat out. Barring that, if there's functionality that must happen in the presentation layer, I'd rather it be done with something that already exists, like JavaScript (though I might have issue with that too).
Where I Ultimately Landed
The ultimate question, on the parsing side of things at least, that drove the implementation of document-parsing boiled down to
"How can a designer specify a document type in the markup in order for it to be recognized and used accordingly?"
I examined several possibilities in my initial approaches:
- Using a custom XML processing instruction
- Using the DOCTYPE
- Using the xmlns attribute
- Using a custom attribute on the root tag (initially, I was thinking of a data-* attribute)
too-corrective editors. To be fair, I'm not aware of any current versions of markup-editors that fall into this category, but I distinctly remember an earlier version of Adobe's Dreamweaver causing this sort of problem in the past, and other editors may well cause similar issues even today.
So, the final rule-set and process for parsing page templates ends up as:
- Page templates will follow XML structure- and tag-closure rules; for all practical purposes, they are XML documents, though they can (and will) use and support HTML tag-names and attributes, even those introduced in HTML 5.
- The BaseDocument abstract class will be used to collect all of the functionality common to all the concrete document-type classes.
- There will be discrete concrete classes for both the current and most-recent non-current HTML versions (i.e., HTML 5 and XHTML).
- There will also be a concrete class for "generic" XML documents, not so much because I expect to need one as because I can't rule out the potential need for one.
- Each of those document-classes, derived from BaseDocument, will be responsible for defining several things:
  - Any items/values that should be rendered for an instance of the
    document-type that occur before the standard Tag-rendering process takes over:
    - HTML 5: The <!DOCTYPE> declaration
    - XHTML: The <!DOCTYPE> declaration, and the xmlns attribute of the root <html> tag
    - XML: The XML declaration (<?xml version="1.0" encoding="UTF-8"?>), any XML processing-instructions (<?xml-stylesheet type="text/xsl" href="style.xsl"?>, for example), and a <!DOCTYPE> if one is specified
  - A common class-method (FromTag) that can be passed a Tag (and its children) that will return the same DOM-tree structure as an instance of the class.
  - A root-tag name that can be used to verify the document root during the parsing process.
  - The Namespace instance that relates to the document, if one is applicable.
- Each concrete class will be registered with the MarkupParser class, using a unique and designer-meaningful name (html5 or xhtml, for example), allowing MarkupParser to look up a concrete document-class by that name and either create an instance of the document-class as the top of the DOM-results, or call the document-class' FromTag method when the root tag is closed to convert those results to the appropriate document-type.
- The root-tag of a page template will include a document-type attribute whose value will be one of the names registered with MarkupParser, tying the designer-specified document-type to the registered BaseDocument-derived document-class during parsing.
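The last two rules (registration by a designer-facing name, and lookup through a document-type attribute on the root tag) can be sketched roughly as follows. This is only an illustration written for Python 3: register_document_class, document_for_root and the Sketch... classes are hypothetical stand-ins, not part of the framework.

```python
# Hypothetical stand-ins: none of these names come from the framework itself.
_DOCUMENT_CLASSES = {}


def register_document_class(name):
    """Class decorator that files a document-class under a designer-facing name."""
    def decorator(cls):
        if name in _DOCUMENT_CLASSES:
            raise ValueError('"%s" is already registered' % name)
        _DOCUMENT_CLASSES[name] = cls
        return cls
    return decorator


class SketchDocument(object):
    """Minimal stand-in for a BaseDocument-derived class."""

    def __init__(self, **attributes):
        self.attributes = attributes


@register_document_class('html5')
class SketchHTML5Document(SketchDocument):
    pass


@register_document_class('xhtml')
class SketchXHTMLDocument(SketchDocument):
    pass


def document_for_root(attributes):
    """Builds a document from a root tag's attributes, keyed on document-type."""
    attributes = dict(attributes)  # leave the caller's dict untouched
    type_name = attributes.pop('document-type', None)
    try:
        document_class = _DOCUMENT_CLASSES[type_name]
    except KeyError:
        raise LookupError('Unknown document-type: %r' % (type_name,))
    # The document-type attribute is consumed here, not passed along
    return document_class(**attributes)
```

With something like this in place, a designer changing document-type="html5" to document-type="xhtml" on the root tag would be enough to switch which class gets instantiated, with no application code involved.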
Changes to BaseDocument
Some basic definition for BaseDocument was done a
couple posts back, but with the final decisions above now made, there are some
potentially-significant alterations that need to be made.
Hooks for Concrete Class Properties
The root-tag-name, namespace and document-type name that will be registered with
MarkupParser are values that should be constant for each concrete
subclass of BaseDocument, so I can't think of any reason not
to make them properties of those classes, and set up default values for them in
BaseDocument:
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#
    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = None
    # The name that will be used to register the class with MarkupParser
    _markupParserName = None
    # The root tag for an instance of the document-type, if applicable
    _rootTagName = None
The document-type's associated Namespace feels like it should be required
for instances of HTML5Document and XHTMLDocument, but not
for XMLDocument (since any given XML document could, by definition,
have its own distinct namespace). That's something that I'll have to cover in more
detail further on, once I start implementing XMLDocument, but for now,
it suffices to know that the execution of any checks of <class>._documentNamespace
will have to reside in the applicable derived classes. It wouldn't hurt to have a
common helper-method that BaseDocument does provide, though,
to perform that check:
    @describe.AttachDocumentation()
    @describe.raises( AttributeError,
        'if the instance\'s class does not have a Namespace defined'
    )
    @describe.raises( TypeError,
        'if the instance\'s class defines a namespace that is not an '
        'instance of Namespace'
    )
    def _CheckNamespace( self ):
        """
Checks the instance's class to see if it has a Namespace association defined."""
        # No namespace defined at all
        if self.__class__._documentNamespace is None:
            raise AttributeError( '%s does not have a defined Namespace '
                'relationship. Please set its _documentNamespace class-'
                'attribute to an instance of a Namespace' % (
                    self.__class__.__name__
                )
            )
        # Something is defined, but it's not a Namespace
        if not isinstance( self.__class__._documentNamespace, Namespace ):
            raise TypeError(
                '%s defines a document-namespace relationship in its '
                '_documentNamespace class-attribute, but it is a %s, '
                'not a Namespace instance' % (
                    self.__class__.__name__,
                    type( self.__class__._documentNamespace ).__name__
                )
            )
That method, _CheckNamespace, is then available to any derived class'
instances, and can be used to enforce the namespace requirement during the __init__
process of document-types that must have a namespace.
A similar helper-method, _CheckTagName, which can be used to check
for class-attributes that correctly define a root tag-name in document-types where
one is relevant, would also be useful:
    @describe.AttachDocumentation()
    @describe.raises( AttributeError,
        'if the instance\'s class does not have a root tag-name defined'
    )
    @describe.raises( ValueError,
        'if the instance\'s class defines a root tag-name that is not '
        'valid'
    )
    def _CheckTagName( self ):
        """
Checks the instance's class to see if it has a valid root-tag-name association
defined."""
        try:
            rootTagName = self.__class__._rootTagName
        except AttributeError:
            raise AttributeError( '%s does not have a root-tag-name defined '
                'as a class attribute' % ( self.__class__.__name__ ) )
        # Defined, but empty or None
        if not rootTagName:
            raise AttributeError( '%s does not have a root-tag-name defined '
                'as a class attribute' % ( self.__class__.__name__ ) )
        # Defined, but not usable as a tag-name
        if not self._IsValidTagName( rootTagName ):
            raise ValueError( '%s has a root-tag-name defined ("%s"), but it '
                'is not a valid tag-name' % (
                    self.__class__.__name__, rootTagName
                )
            )
This would be used in a manner similar to _CheckNamespace during initialization
of a document-type instance where it's relevant.
Both of these methods will be called in the __init__ of HTML5Document
and XHTMLDocument, and I'll show that in a while.
The FromTag class-method can be defined as a member of BaseDocument,
so long as it can handle document-type classes that don't specify a root tag-name
or namespace:
    @classmethod
    @describe.AttachDocumentation()
    @describe.argument( 'sourceTag',
        'the Tag instance that contains the markup to be copied into the '
        'resulting document instance',
        Tag
    )
    @describe.returns( 'An instance of the class' )
    @describe.raises( TypeError,
        'if passed a sourceTag value that is not an instance of Tag'
    )
    def FromTag( cls, sourceTag ):
        """
Creates an instance of the class and populates it with the attributes and
children of the provided source-tag"""
        if not isinstance( sourceTag, Tag ):
            # Three placeholders, so three values: class-name, value, type-name
            raise TypeError( '%s.FromTag expects a Tag instance for its '
                'sourceTag argument, but was passed "%s" (%s) instead' %
                ( cls.__name__, sourceTag, type( sourceTag ).__name__ )
            )
        # Use the tag-name from the document class, or from the sourceTag's
        # top-level node if no root tag-name is specified for the class
        if cls._rootTagName:
            result = cls( cls._rootTagName )
        else:
            result = cls( sourceTag.tagName )
        # Add the default namespace, if applicable, or the source-tag's
        # namespace if *that* is applicable
        if cls._documentNamespace:
            result._SetNamespace( cls._documentNamespace )
        elif sourceTag.Namespace:
            result._SetNamespace( sourceTag.Namespace )
        # Copy the attributes over first
        for attr in sourceTag.attributes:
            result.setAttribute( attr, sourceTag.attributes[ attr ] )
        # Then the children (moved, not copied, so sourceTag ends up empty)
        while sourceTag.childNodes:
            result.appendChild(
                sourceTag.removeChild(
                    sourceTag.childNodes[ 0 ]
                )
            )
        # Then return the result
        return result
Even if MarkupParser ends up creating a BaseDocument
instance for its main results, the ability to convert an arbitrary Tag
into one of the provided document-types feels like it might be useful, so having
this defined doesn't feel bad.
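The one behavior of FromTag that is easy to miss is that the children are moved rather than copied, so the source tag is left empty afterwards. A stripped-down sketch makes that visible; SketchNode and SketchRootDocument here are hypothetical stand-ins for Tag and a document-class (namespaces are left out entirely), written for Python 3.

```python
class SketchNode(object):
    """Hypothetical, minimal stand-in for Tag: a name, attributes and children."""

    def __init__(self, tagName, **attributes):
        self.tagName = tagName
        self.attributes = dict(attributes)
        self.childNodes = []
        self.parentNode = None

    def appendChild(self, child):
        child.parentNode = self
        self.childNodes.append(child)
        return child

    def removeChild(self, child):
        self.childNodes.remove(child)
        child.parentNode = None
        return child


class SketchRootDocument(SketchNode):
    """Stand-in document-class with a fixed root tag-name."""
    _rootTagName = 'html'

    @classmethod
    def FromTag(cls, sourceTag):
        if not isinstance(sourceTag, SketchNode):
            raise TypeError('FromTag expects a SketchNode instance')
        # The class' root tag-name wins; fall back to the source's name
        result = cls(cls._rootTagName or sourceTag.tagName)
        result.attributes.update(sourceTag.attributes)
        # Move, rather than copy, each child: the source ends up empty
        while sourceTag.childNodes:
            result.appendChild(sourceTag.removeChild(sourceTag.childNodes[0]))
        return result


source = SketchNode('div', lang='en')
head = source.appendChild(SketchNode('head'))
body = source.appendChild(SketchNode('body'))
document = SketchRootDocument.FromTag(source)
```

After the call, document holds head and body as its children, and source.childNodes is an empty list. If a caller needs to keep the original tag intact, it would have to be copied before being handed to FromTag.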
But... That's not really any closer...
While the FromTag method is, I think, pretty neat in its own right,
that doesn't really get me any closer to parsing whole documents from template-files.
There are a few things that need to happen, and some of the hooks put in play for that
method will help. Ultimately, what needs to happen during parsing ends up being a
pretty simple sub-section of the current _NodesFromTokens method:
- When a node is created from a token, some sort of check needs to be made to see if it should be a document-root node for the namespace of the node.
  - That means that the individual Namespace globals need to be able to keep track of what root tag-names they need to be concerned with;
  - That also means that a given Namespace global needs to know what BaseDocument-derived class should be instantiated when a root-node creation is detected.
Namespace instances (HTML5Namespace,
etc.). That might've been workable, but after putting the class-level document-namespace
hook-attribute in place for FromTag, that led to a circular reference
issue: The document-class needed to know about the namespace-class, and vice versa.
Taking a step back, and thinking about the issue a bit more, I eventually came to
the conclusion that while the Namespace globals did need
to know about their related documents, there was no reason that it had to happen
during object-initialization. As part of that initial pursuit, I'd already created
the Namespace-instance properties needed to store the document-class
value, as well as the root tag:
    # ...
    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    def _GetDocumentClass( self ):
        """
Gets the class to be used as root-document nodes"""
        return self._documentClass
    # ...
    @describe.AttachDocumentation()
    def _GetRootTagName( self ):
        """
Gets the root tag-name of documents of the namespace."""
        return self._rootTagName
    # ...
    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    @describe.argument( 'value',
        'the document-class to be used to create root-document nodes when '
        'parsing documents of the namespace',
        TypeType, None
    )
    @describe.raises( TypeError,
        'if passed a value that is not a class'
    )
    @describe.raises( ValueError,
        'if passed a value that is not a subclass of BaseDocument'
    )
    def _SetDocumentClass( self, value ):
        """
Sets the document-class to be used to create root-document nodes when parsing
documents of the namespace"""
        # Since BaseDocument is an ABCMeta-based class, it's NOT going
        # to be testable with type(value) == types.TypeType...
        if type( value ) != abc.ABCMeta:
            raise TypeError( '%s.DocumentClass expects a class derived '
                'from BaseDocument, but was passed "%s" (%s)' % (
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
        if not issubclass( value, BaseDocument ):
            raise ValueError( '%s.DocumentClass expects a class derived '
                'from BaseDocument, but was passed "%s" (%s)' % (
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
        self._documentClass = value
    # ...
    @describe.AttachDocumentation()
    @describe.argument( 'value',
        'the root tag-name of documents of the namespace to set for the '
        'instance',
        str, unicode
    )
    @describe.raises( TypeError,
        'if passed a value that is not a str or unicode type or None'
    )
    @describe.raises( ValueError,
        'if passed a value that has multiple lines in it'
    )
    @describe.raises( ValueError,
        'if passed a value that is not valid as a tag-name'
    )
    def _SetRootTagName( self, value ):
        """
Sets the root tag-name of documents of the namespace"""
        if type( value ) not in ( str, unicode ) and value is not None:
            raise TypeError( '%s.RootTagName expects a single-line str '
                'or unicode value, or None, but was passed "%s" (%s)' % (
                    self.__class__.__name__, value, type( value ).__name__ )
            )
        if value:
            if '\n' in value or '\r' in value:
                raise ValueError( '%s.RootTagName expects a single-line '
                    'str or unicode value, or None, but was passed "%s" (%s) '
                    'which has multiple lines' % (
                        self.__class__.__name__, value, type( value ).__name__
                    )
                )
            # Strip out everything that looks like a tag-name; anything
            # left over means the value isn't a single valid tag-name
            _validTagNameRE = re.compile( '[_A-Za-z][-_a-zA-Z0-9]*' )
            if _validTagNameRE.sub( '', value ) != '':
                raise ValueError( '%s.RootTagName expects a valid tag-name '
                    'value, but was passed "%s" (%s) which is not valid' % (
                        self.__class__.__name__, value, type( value ).__name__
                    )
                )
        self._rootTagName = value
    # ...
    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    def _DelDocumentClass( self ):
        """
"Deletes" the document-class to be used to create root-document nodes
when parsing documents of the namespace by setting it to None"""
        self._documentClass = None
    # ...
    @describe.AttachDocumentation()
    def _DelRootTagName( self ):
        """
"Deletes" the root tag-name of a document in the namespace by setting it
to None."""
        self._rootTagName = None
    # ...
    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#
    # ...
    DocumentClass = describe.makeProperty(
        _GetDocumentClass, None, None,
        'the document-class to be used to create root-document nodes '
        'when parsing documents of the namespace',
        str, unicode, None
    )
    # ...
    RootTagName = describe.makeProperty(
        _GetRootTagName, None, None,
        'the tag-name of a root tag of a document in the namespace',
        str, unicode, None
    )
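The tag-name test in _SetRootTagName works by stripping every match out of the value and checking that nothing is left. The same rule can be expressed more directly as a whole-string match. The following is only a sketch of that alternative, using the same character classes; it relies on fullmatch, which exists in Python 3.4 and later (not in the Python 2 used for the listings here), and is_valid_tag_name is a hypothetical helper name.

```python
import re

# Same character classes as the pattern in _SetRootTagName; fullmatch()
# requires the entire value to be one tag-name, so no stripping is needed.
VALID_TAG_NAME = re.compile(r'[_A-Za-z][-_A-Za-z0-9]*')


def is_valid_tag_name(value):
    """Returns True if value is a single, well-formed (unprefixed) tag-name."""
    return VALID_TAG_NAME.fullmatch(value) is not None
```

Because the match has to cover the whole string, values with embedded whitespace or line-breaks are rejected by the same check, without a separate multi-line test.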
I'd also already integrated those new properties into Namespace.__init__:
    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    # ...
    def __init__( self, name, namespaceURI, contentType, systemId=None,
        publicId=None, defaultRenderingModel=renderingModels.Mixed,
        **tagRenderingModels ):
        """
Instance initializer"""
        # ...
        # Set default instance property-values with _Del... methods as needed.
        # ...
        self._DelDocumentClass()
        # ...
        self._DelRootTagName()
        # ...
        # Set instance property values from arguments if applicable.
        # ...
        # Other set-up
        self.__class__.RegisterNamespace( self )
I'd also done some exploration in the current MarkupParser._NodesFromTokens
method to assure that the related namespace could be identified. That was not much
more than dropping a print into the code and running a test-parse
with a short but still relatively complex document:
    def _NodesFromTokens( self, tokens ):
        """
Generates a list of nodes from the supplied tokens"""
        #...
            # Create a new tag and deal with it accordingly
            if currentElement:
                # It'll be a child of currentElement
                newNode = Tag( tagName, tagNamespace, **attrDict )
                if not unary:
                    # Can have children, so it takes currentElement's place
                    currentElement = currentElement.appendChild( newNode )
                else:
                    # No children allowed, so just append it
                    currentElement.appendChild( newNode )
            else:
                # It can't be a child of an element, for whatever reason
                # TODO: Check to see if tagNamespace is identified, there's
                #       an appropriate/related class, and if so, create an
                #       instance of that class instead of Tag, maybe?
                if tagNamespace:
                    print '### tagNamespace (from namespace URI)'
                newNode = Tag( tagName, tagNamespace, **attrDict )
                if not unary:
                    # But it can HAVE children, so it takes currentElement's
                    # place.
                    currentElement = newNode
                results.append( newNode )
which demonstrated pretty succinctly that it was at least recognizing the
associated namespace:
###### tagNamespace.Name: xhtml
All that was left, then, was figuring out how to register the document-classes with their associated Namespace-constants...
Modifying Namespace, and the Implications
The only other item, then, that needs to be handled in Namespace
is providing a mechanism to actually make the document-class and root-tag-name
associations. Since there are already property-setters for both, that's a very
simple method:
    @describe.AttachDocumentation()
    @describe.argument( 'documentClass',
        'the BaseDocument-derived class to register with the Namespace instance '
        'as the document-type to create when the root tag is encountered '
        'during parsing',
        TypeType
    )
    def SetDocumentClass( self, documentClass ):
        """
Registers a BaseDocument-derived class with the Namespace instance as its
document-type of record."""
        self._SetDocumentClass( documentClass )
        self._SetRootTagName( documentClass._rootTagName )
Actually making that association is then nothing more than calling SetDocumentClass
on each Namespace instance after both it and its associated document
class have been defined:
HTML5Namespace = Namespace(
    'html5',
    'http://www.w3.org/2015/html',
    # ...
)

@describe.InitClass()
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""
    # ...

__all__.append('HTML5Document')
HTML5Namespace.SetDocumentClass( HTML5Document )
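Stripped of the framework's decorators and validation, the way this ordering sidesteps the circular reference can be sketched as three steps. Everything named Sketch... below, along with creates_document, is a hypothetical stand-in rather than the real Namespace or document classes, and the code is written for Python 3.

```python
class SketchNamespace(object):
    """Hypothetical stand-in for Namespace; starts with no document-class."""

    def __init__(self, name, namespaceURI):
        self.name = name
        self.namespaceURI = namespaceURI
        self.documentClass = None
        self.rootTagName = None

    def SetDocumentClass(self, documentClass):
        # Second half of the two-way link, made after both objects exist
        self.documentClass = documentClass
        self.rootTagName = documentClass._rootTagName


# 1. The namespace is created first, knowing nothing about documents...
SketchHTML5Namespace = SketchNamespace('html5', 'http://www.w3.org/2015/html')


# 2. ...so the document-class can refer to it in its class body...
class SketchHTML5Document(object):
    _documentNamespace = SketchHTML5Namespace
    _rootTagName = 'html'


# 3. ...and the namespace is pointed back at the class afterwards.
SketchHTML5Namespace.SetDocumentClass(SketchHTML5Document)


def creates_document(namespace, tagName):
    """True when tagName is the registered root tag of the namespace."""
    return (
        namespace is not None
        and namespace.documentClass is not None
        and tagName == namespace.rootTagName
    )
```

The only constraint this imposes is ordering: the SetDocumentClass call has to come after both definitions, which is why it sits at the very end of the listing above.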
Implementing and Organizing the Document Classes
It occurred to me while I was doing some light-weight testing of the XMLDocument
class that it might be a better idea to define the markup namespace
as a package, in order to allow new specific document-types and -namespaces
to be added to the framework more easily. As things stand right now, with markup
being a module, the process for adding a new, defined document-type (say
something like DocBook XML) to the available
document-types involves changes to markup.py. The inevitable progression,
as more and more new document-types get added, will result in a potentially unmanageable
growth of the code in markup.py, raising the risk of it getting more
and more brittle as time goes on. That ignores the possible run-time speed/performance
drop that would also be introduced (though, to be fair, I don't expect that would
be all that significant... Still...).
Ideally, as time goes on, it might be better (it'd certainly be neat)
if the process of adding a new document-type to the idic framework
involved nothing more than creating the relevant Namespace and BaseDocument-derived
types in a free-standing module-file, dropping that new file in the appropriate
location, and moving on. While I didn't go quite so far as to make the current document-types
auto-detected, I did break them out into their own modules and
pull them in at the end of the core markup.py file, after converting
that to a package-structure.
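For what it's worth, the auto-detection that was not implemented here could plausibly be built on the standard library's pkgutil and importlib modules. The sketch below is an assumption about how that might look, not something the framework does: import_document_modules is a hypothetical helper, and it presumes each sub-module registers its own Namespace and document-class as a side effect of being imported. It is written for Python 3.

```python
import importlib
import pkgutil


def import_document_modules(package):
    """Imports every direct sub-module of an already-imported package.

    Returns the sorted list of fully-qualified module names that were imported.
    """
    imported = []
    # iter_modules yields (finder, name, is_package) for each sub-module
    # found on the package's search path; the prefix makes names absolute.
    for _finder, name, _is_package in pkgutil.iter_modules(
        package.__path__, package.__name__ + '.'
    ):
        importlib.import_module(name)
        imported.append(name)
    return sorted(imported)
```

Called from the end of a package's __init__.py with the package's own module object, something like this would pick up a newly dropped-in docbook.py without any edits to existing files. The trade-off is that import order is no longer explicit, which matters if one document-type module depends on another.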
With that structure in play, implementation of the two external document-types
got a minor facelift. In order to import the relevant classes from the sub-modules,
when those modules and the main package-header could get called repeatedly, the
Namespace creation and configuration had to ensure that the individual
Namespace instances weren't being overwritten. That looked like this
(using HTML5Namespace as an example):
try:
    # If the namespace already exists, leave it alone
    checkNamespace = Namespace.GetNamespaceByName( 'html5' )
except MarkupError:
    # Otherwise create (and, by doing so, register) it
    Namespace(
        'html5',
        'http://www.w3.org/2015/html',
        'text/html',
        None,
        None,
        renderingModels.RequireEndTag,
        br=renderingModels.NoChildren,
        img=renderingModels.NoChildren,
        link=renderingModels.NoChildren,
        script=renderingModels.RequireEndTag,
    )
HTML5Namespace = Namespace.GetNamespaceByName( 'html5' )
__all__.append( 'HTML5Namespace' )
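The try/except above is a get-or-create idiom, and if it ends up repeated for every namespace it could be folded into a single class-method. The sketch below shows one way that might look. SketchNamespace, SketchRegistryError (standing in for MarkupError) and GetOrCreate are all hypothetical names, and the real Namespace takes more arguments than are shown here.

```python
class SketchRegistryError(Exception):
    """Hypothetical stand-in for MarkupError."""


class SketchNamespace(object):
    """Hypothetical stand-in for Namespace, with a by-name class registry."""
    _registered = {}

    def __init__(self, name, namespaceURI):
        self.name = name
        self.namespaceURI = namespaceURI
        # Creating an instance registers it, as Namespace.__init__ does
        self.__class__._registered[name] = self

    @classmethod
    def GetNamespaceByName(cls, name):
        try:
            return cls._registered[name]
        except KeyError:
            raise SketchRegistryError('No namespace named "%s"' % name)

    @classmethod
    def GetOrCreate(cls, name, namespaceURI):
        """Returns the registered namespace, creating it only if it is missing."""
        try:
            return cls.GetNamespaceByName(name)
        except SketchRegistryError:
            return cls(name, namespaceURI)


first = SketchNamespace.GetOrCreate('html5', 'http://www.w3.org/2015/html')
second = SketchNamespace.GetOrCreate('html5', 'http://www.w3.org/2015/html')
```

Here first and second are the same object, so a module that gets imported more than once cannot replace an instance that a document-class already points at.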
The individual document-classes are (for now) very simple.
HTML5Document
@describe.InitClass()
@describe.attribute( '_documentNamespace',
    'the Namespace-instance that documents of this type should be '
    'associated with'
)
@describe.attribute( '_rootTagName',
    'root tag for an instance of the document-type, if applicable'
)
class HTML5Document( BaseDocument, object ):
    """Represents a markup-document in HTML 5"""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#
    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = HTML5Namespace
    # The root tag for an instance of the document-type, if applicable
    _rootTagName = 'html'
    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#
    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#
    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#
    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#
    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.keywordargs(
        'the attribute names/values to set in the created instance'
    )
    def __init__( self, **attributes ):
        """
Instance initializer"""
        # HTML5Document is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # At least in theory, NO concrete document-class should ever NEED to  #
        # be extended. There's just no use-case that I can come up with that  #
        # wouldn't be better served (in my opinion) by creating a new         #
        # concrete document-class instead...                                  #
        #---------------------------------------------------------------------#
        if self.__class__ != HTML5Document:
            raise NotImplementedError('HTML5Document is '
                'intended to be a nominally-final class, NOT to be extended.')
        # Call BaseDocument check-helper methods:
        BaseDocument._CheckNamespace( self )
        BaseDocument._CheckTagName( self )
        # Call parent initializers, if applicable.
        BaseDocument.__init__( self, **attributes )
        # Set default instance property-values with _Del... methods as needed.
        # Set instance property values from arguments if applicable.
        # Other set-up
    #-----------------------------------#
    # Instance Garbage Collection       #
    #-----------------------------------#
    #-----------------------------------#
    # Instance Methods                  #
    #-----------------------------------#
    #-----------------------------------#
    # Class Methods                     #
    #-----------------------------------#
    #-----------------------------------#
    # Static Class Methods              #
    #-----------------------------------#

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append('HTML5Document')
HTML5Namespace.SetDocumentClass( HTML5Document )
XHTMLDocument
At this point, the only difference between the HTML5Document and
XHTMLDocument classes, apart from their names, is the class attribute
relating to the document-namespace:
    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = XHTMLNamespace
XMLDocument
The XMLDocument document-class doesn't even have definitions
for those class-attributes, inheriting None values from BaseDocument:
    # The Namespace-instance that documents of this type should be associated with
    _documentNamespace = None
    # The root tag for an instance of the document-type, if applicable
    _rootTagName = None
Finally: Parsing Documents
The balance of the efforts in this post are probably going to seem anticlimactic...
The sum total of the relevant changes needed in MarkupParser._NodesFromTokens
is actually pretty trivial (the changes are all in the final else branch, where the earlier version had a TODO comment and a print):
    def _NodesFromTokens( self, tokens ):
        """
Generates a list of nodes from the supplied tokens"""
        # ...
            # Create a new tag and deal with it accordingly
            if currentElement:
                # It'll be a child of currentElement
                newNode = Tag( tagName, tagNamespace, **attrDict )
                if not unary:
                    # Can have children, so it takes currentElement's place
                    currentElement = currentElement.appendChild( newNode )
                else:
                    # No children allowed, so just append it
                    currentElement.appendChild( newNode )
            else:
                # It can't be a child of an element, for whatever reason
                # - If the tag-namespace points to the tag-name being a
                #   document-root tag, then create a document instead
                #   of a tag...
                if tagNamespace and tagName == tagNamespace.RootTagName:
                    # The document-class supplies its own namespace, so the
                    # xmlns attribute (if any) isn't passed along
                    try:
                        del attrDict['xmlns']
                    except KeyError:
                        pass
                    newNode = tagNamespace.DocumentClass( **attrDict )
                else:
                    newNode = Tag( tagName, tagNamespace, **attrDict )
                if not unary:
                    # But it can HAVE children, so it takes currentElement's
                    # place.
                    currentElement = newNode
                results.append( newNode )
With all of these changes in place, I whipped up a quick and dirty test-script
(parse-test.py, available for download at the end of the post) just
to make sure that it was doing what I expected/needed. It yielded the following
results with the XHTML namespace:
mparsed, <idic.markup.xhtml.XHTMLDocument object at 0x7f9ed66f0450>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
[truncated for brevity's sake]
</html>
Source length 1052
Parsed length 889
and nearly-identical results using the HTML5 namespace:
mparsed, <idic.markup.html5.HTML5Document object at 0x7f2143f24450>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/2015/html">
[truncated for brevity's sake]
</html>
Source length 1051
Parsed length 813
Both runs verified that the root node of the parsed document was an instance of the appropriate
BaseDocument-derived class: XHTMLDocument in the first case, and HTML5Document in the second.
This post has gone on for quite a while now, so I won't spend any time writing about the unit-testing that went into it as well. I hope, by now, that the process I go through has been covered in sufficient detail that it's not really necessary (but for those who might've missed it, see my earlier post, walking through it from the ground up). I will mention, though, that even with all of these changes, there weren't all that many new tests that had to be generated:
########################################
Unit-test results
########################################
Tests were successful ..... False
Number of tests run ....... 321
+ Tests ran in ........... 0.53 seconds
Number of errors .......... 0
Number of failures ........ 13
Number of tests skipped ... 131
########################################
It's also (maybe) worth mentioning that the import-structure for
HTML5Document and XHTMLDocument did not allow either of those classes to be discovered
as items in need of tests, a fact that I will look into at a later date. For
now, that simply means that I must also explicitly generate and include unit-test
modules for the new html5.py and xhtml.py modules that
contain those classes.