Almost the entire process and structure for generating markup is complete
now — minus a handful of items that are waiting on the ability to parse
markup-text, and some decisions about how to handle document-creation. Those two
topics are loosely related, at least, so today I'm going to get the parsing capabilities
worked out in the hopes that it will provide a basis for deciding how to implement
concrete document-classes. So: MarkupParser:
Parsing, from 10,000 feet
Parsing markup is an interesting problem (perhaps in the sense of the old Chinese
curse: may you live in interesting times):
- There are rules for how tags can be expressed, but those rules may not be the same for any two tags, at least in certain markup dialects (HTML 5). Those rules help to define what constitutes valid markup, but may not be enough, by themselves, to determine validity.
- Within the boundaries of those rules, markup can be very broad (a single tag can have any number of children), and can also be very deep (it's possible to have a tag within a tag within a tag... down to any depth).
namespace
) to determine whether a closing-tag
is needed (or allowed). I haven't yet thought about how I'm going to deal with determining
how to render unary tags, like <img>, which render differently in HTML 5 and XHTML:
<!-- HTML 5 -->
<img src="">
versus
<!-- XHTML -->
<img src="" />
As things stand right now, both HTML 5 and XHTML renderings of an <img>
look like the XHTML output above. That won't cause any issues in any browsers I'm
aware of right now, but there's no guarantee that it won't in the future. I'm contemplating
setting up an IsXMLDialect property in Namespace, defaulting to False, that would be
re-set to True when an XML declaration is created on a document-instance, but that's
a discussion for later, perhaps. For the time being, I'm content to leave it alone,
rather than thrash through Tag.__str__ and Tag.__unicode__ to fix what isn't currently
a problem.
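A minimal sketch of how that idea might eventually look, assuming a hypothetical render_unary_tag helper and an is_xml_dialect flag standing in for the proposed Namespace.IsXMLDialect property. Neither name exists in the markup module; they're only here to illustrate the two renderings:

```python
def render_unary_tag(tag_name, attributes, is_xml_dialect=False):
    """Renders a childless tag. Hypothetical helper, for illustration only."""
    # Build the attribute text in a stable (sorted) order
    attribute_text = ''.join(
        ' %s="%s"' % (name, value)
        for name, value in sorted(attributes.items())
    )
    # XML dialects need the self-closing slash; HTML 5 does not
    closer = ' />' if is_xml_dialect else '>'
    return '<%s%s%s' % (tag_name, attribute_text, closer)


print(render_unary_tag('img', {'src': ''}))
print(render_unary_tag('img', {'src': ''}, is_xml_dialect=True))
```

The only difference between the two dialects in this sketch is the closing sequence, which is why a single flag on the namespace seems like it would be enough.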
But I digress...
The second item is, arguably, the more significant of the two, given that I'm
not expecting to be parsing markup that isn't XML-compliant for a while. It's significant
because it makes the scope of the parsing problem theoretically infinite.
Even with some fairly realistic expectations (no more than, say, 100 children for
any given tag, and no deeper nesting than, say, 30 levels), it's still a huge
and dynamic set of possibilities to try and come up with a solution around.
And that's just in a document's scope. In order to fully implement Tag.innerHTML,
a parsing process has to be able to contend with mixed node-types from the
supplied text. That is,
myTag.innerHTML = ( 'This is my markup. <strong>' +
'There are many like it, but this one is ' +
'<em>mine</em>!</strong>' )
should yield markup along the lines of:
<myTag>
This is my markup. <strong>There are many like it,
but this one is <em>mine</em></strong>
</myTag>
The results of parsing that initial string aren't a single element with some number
of children (as might be expected when parsing a document or a template-file that's
used to generate a document). Instead, it's a sequence of nodes that would
then have to be appended to myTag's child-nodes after removing any existing
child-nodes.
How I'm Going To Approach This
There are two significant processes that MarkupParser is going to undertake.
In summary, they are:
- Breaking the provided markup-text down into tokens, where each token is one of:
  - An XML declaration;
  - An XML processing-instruction;
  - A DOCTYPE declaration (maybe, I'm still thinking on this);
  - A document;
  - A tag;
  - A CDATA section;
  - A comment; or
  - A text-node.
  The tokens are collected in a list.
- Iterating over the sequence of tokens, while keeping track of both a root element-node (which might be a document, but would certainly be a Tag-derived instance) and a current element-node (a Tag also), and setting aside (for now) any DOCTYPE handling or document-identification:
  - If an XML declaration is encountered, and hasn't already been defined for the root element-node, store it for later use;
  - If an XML processing-instruction is encountered, store it for later use;
  - If a start-tag is encountered, create a corresponding Tag-instance, and:
    - If there is no root element defined, set it to the newly-created Tag-instance;
    - Otherwise, append it to the current element;
    - Set the current element to the newly-created Tag-instance.
  - If an end-tag is encountered:
    - Check its tag-name (and namespace, if applicable) against the tag-name of the current element, raising a MarkupError if they don't match;
    - Otherwise, set the current element to the parent element of the current element, effectively closing the tag by preventing further child-appending to the Tag instance representing it.
  - If a CDATA-, comment- or text-node is encountered, create an instance of CDATA, Comment or Text, respectively, populate it with the applicable text from the token, and append it to the current node.
  - If the root node is being closed/completed, attach any XML declarations and processing-instructions to the (presumed) document.
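The iteration outlined above boils down to walking the tokens while moving a "current element" pointer down on start-tags and back up on end-tags. Here's a rough sketch of just that part, using a hypothetical SketchNode class as a stand-in for Tag, plain strings for text, and ignoring attributes, unary tags and the XML items entirely:

```python
class SketchNode(object):
    """Hypothetical stand-in for Tag; not the real markup-module class."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        if isinstance(child, SketchNode):
            child.parent = self
        self.children.append(child)
        return child


def build_tree(tokens):
    """Builds nodes from simple tokens: '<name>', '</name>' or plain text."""
    results = []      # top-level nodes, in order
    current = None    # the element that new nodes get appended to
    for token in tokens:
        if token.startswith('</'):
            name = token[2:-1]
            if current is None or current.name != name:
                raise ValueError('Mismatched end-tag: %s' % token)
            # Closing a tag just means stepping back up to its parent
            current = current.parent
        elif token.startswith('<'):
            node = SketchNode(token[1:-1])
            if current is None:
                results.append(node)
            else:
                current.append(node)
            current = node
        else:
            # Text is kept as a plain string in this sketch
            if current is None:
                results.append(token)
            else:
                current.append(token)
    return results


nodes = build_tree(
    ['text ', '<strong>', 'bold ', '<em>', 'mine', '</em>', '</strong>']
)
print(len(nodes), nodes[1].name, nodes[1].children[1].name)
```

Because the top-level results are a list rather than a single root, the same loop copes with fragments that start with text as well as with document-like markup.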
Creating the Token Sequence
Creating the sequence of tokens is surprisingly easy. My previous efforts used a
regular-expression-based process (using re.findall) to extract token-items, and it
was mostly functional, but it had a couple of drawbacks:
- It was occasionally prone to errors, requiring revision of the regular-expression definition, which grew increasingly complex and hard to manage;
- It ended up causing an odd requirement that there could be no empty
tags in the source markup, which was at least occasionally frustrating
for the less-technical designer-types who were working on page-templates.
By way of example (using Glyphicons that are available as part of Bootstrap),
the following markup-structure was required in order to prevent throwing
ParsingErrors:
<div> <i class="glyphicon glyphicon-user"><!-- . --></i> A user icon </div>
Compare with what should have been allowed:
<div> <i class="glyphicon glyphicon-user"></i> A user icon </div>
While that wasn't too awkward to deal with, it definitely took some getting used to, and was disruptive until the habit had been formed, so it's something to avoid in the current implementation.
After some experimentation, where I landed was the following function:
def Tokenize( markupText ):
    # Every chunk (apart from any leading text) starts just after a "<"
    tokenChunks = [
        tc for tc in markupText.split( '<' ) if tc
    ]
    result = []
    for token in tokenChunks:
        if '>' in token:
            try:
                # Raises ValueError if there's more than one ">"
                tag, text = token.split( '>' )
                result.append( '<%s>' % tag )
                # Whitespace-only (or empty) text becomes a single space
                if not text.strip():
                    text = ' '
                if text:
                    result.append( text )
            except ValueError:
                result.append( '<%s' % token )
        else:
            if token:
                result.append( token )
    return result
This got pretty close, maybe exactly what's needed for both an XML document-template
and a markup-fragment with a mixture of tags and text, in a mere 20 lines
of code. Those token-sequences have some potentially odd-looking items in their results,
but nothing that seems likely to prevent them from being used in the node-generation
iteration to come.
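For experimenting with the prototype outside the module, here's a condensed port of the same logic. The lowercase tokenize name is mine, and it's a sketch of the behavior described above rather than the final implementation. It also shows the one case where the except branch matters: a chunk with more than one ">" in it.

```python
def tokenize(markup_text):
    """Condensed port of the Tokenize prototype, for experimentation."""
    result = []
    for chunk in markup_text.split('<'):
        if not chunk:
            continue
        if '>' in chunk:
            try:
                tag, text = chunk.split('>')
            except ValueError:
                # More than one '>' in the chunk: pass it through whole
                result.append('<%s' % chunk)
                continue
            result.append('<%s>' % tag)
            # Whitespace-only (or empty) text collapses to one space
            result.append(text if text.strip() else ' ')
        else:
            result.append(chunk)
    return result


print(tokenize('<p>Some <em>text</em></p>'))
print(tokenize('<a title="1 > 0">x</a>'))
```

The second call is the awkward one: the ">" inside the attribute-value means the start-tag and its text come back fused into a single token, which is something to keep in mind if attribute-values with that character ever show up in real templates.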
Running Tokenize Against an XML Document Template
I started with a stripped-down version of the same XML Document Template that
I shared at the end of last month. Basically, I just stripped out any application
or framework items, both namespaces and tags, leaving:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <!--
        Example comment for parsing
    -->
    <![CDATA[
        Example CDATA for parsing
    ]]>
    <head>
        <title>Page Title</title>
        <script src="//some-domain.com/file.js"></script>
        <link rel="stylesheet" href="//some-domain.com/file.css" />
    </head>
    <body>
        <div id="main" class="container">
            <h1>Page Title</h1>
        </div>
        <div id="main" class="container">
            <h1>Page Title: Log-in Required:</h1>
            <form action="" method="post">
                <div class="form-group">
                    <label for="username">Name:</label>
                    <input type="text" id="username" name="username" class="form-control" />
                </div>
                <div class="form-group">
                    <label for="userpass">Password:</label>
                    <input type="password" id="userpass" name="userpass" class="form-control" />
                </div>
                <div>
                    <button type="submit" class="btn btn-default">Log In</button>
                </div>
            </form>
        </div>
    </body>
</html>
Executing the Tokenize prototype-function against that markup (as a
string-constant in a throwaway file) returned the following sequence of tokens:
<?xml version="1.0" encoding="UTF-8"?>
- " " (a single space character)
<html xmlns="http://www.w3.org/1999/xhtml">
- " "
<!--
Example comment for parsing
-->
- " "
<![CDATA[
Example CDATA for parsing
]]>
- " "
<head>
- " "
<title>
- "Page Title"
</title>
- " "
<script src="//some-domain.com/file.js">
- " "
</script>
- " "
<link rel="stylesheet" href="//some-domain.com/file.css" />
- " "
</head>
- " "
<body>
- " "
<div id="main" class="container">
- " "
<h1>
- "Page Title"
</h1>
- " "
</div>
- " "
<div id="main" class="container">
- " "
<h1>
- "Page Title: Log-in Required:"
</h1>
- " "
<form action="" method="post">
- " "
<div class="form-group">
- " "
<label for="username">
- "Name:"
</label>
- " "
<input type="text" id="username" name="username" class="form-control" />
- " "
</div>
- " "
<div class="form-group">
- " "
<label for="userpass">
- "Password:"
</label>
- " "
<input type="password" id="userpass" name="userpass" class="form-control" />
- " "
</div>
- " "
<div>
- " "
<button type="submit" class="btn btn-default">
- "Log In"
</button>
- " "
</div>
- " "
</form>
- " "
</div>
- " "
</body>
- " "
</html>
- " "
The recurring " " items might seem a bit odd, at first, or like they might
be problematic when it comes to the node-generation process, but they aren't all
that surprising, given how Tokenize works. Looking at the first form-group <div>:
<!-- just inside the form -->
<div class="form-group">
    <label for="username">Name:</label>
    <input type="text" id="username" name="username" class="form-control" />
</div><!-- and so on... -->
The initial tokenChunks in Tokenize would acquire the following segments of the
original markup-text:
0. div class="form-group"> (plus the whitespace following until the next tag is encountered)
1. label for="username">Name:
2. /label> (plus whitespace after)
3. input type="text" id="username" name="username" class="form-control" /> (plus whitespace after)
4. /div>
As the token iteration runs through those token-chunks:
0. The tag and text values will be successfully acquired by tag, text = token.split( '>' ), with tag containing everything inside the original < and > from the markup-text, and text containing the line-break and indentation white-space that followed the tag. From that point on, the rest is just clean-up and formatting, with the bulk of text being reduced down to a single space, before both are appended to the final results.
1. The same steps will capture label for="username" and Name: for the tag and text values in the <label ...> line.
2. Follows the same steps as #0, above, with very similar results: A tag-token and an "empty" string-token.
3. The same again.
4. Splitting /div> (whether the comment is in place, or the </div> is the last bit of text in the source-markup) works the same way, yielding a tag of /div and an empty text that becomes a single space. The except block only comes into play when a chunk contains more than one >: the split then returns too many values to unpack, a ValueError is raised, and the chunk is passed through whole with its leading < restored.
What stood out in that token-sequence was the number of single space items that
came back in it. It seemed to me that adding in all of those tokens during the
iteration over them would lead to a potentially-significant amount of useless
additional information in the final rendering, since each of them would eventually
be converted into a Text item storing a single space-character as its content.
At the same time, it's perfectly valid in markup to have single spaces
between tags that also need to be preserved. I suspect that such cases would
mostly show up in content, in situations like:
...this is <code>strong</code> <strong>content</strong> ...
The space between the </code> and <strong> tags is legitimate.
I cannot think of a good (and simple) way to strip out the single-space tokens
that should be removed. Doing so would require a way to identify which
of them were of legitimate use, and which were just rendered down from (probably)
indentation in the markup-source. I set that aside for a bit, to look at the results
from a markup-fragment (below), and after looking over that token-result,
I'm pretty sure that my concerns were a non-issue:
- The whitespace-reduction that happens because of indentation or other contiguous whitespace in the markup-source would be similarly handled by the client browser, so the size and content of those long/contiguous white-space chunks turns out to be irrelevant, so long as their existence is preserved;
- So long as any trailing- or leading-white-space in a text-token isn't stripped down to nothing, any content-resident white-space reduction is fine. Running a quick test against
  ...this is <code>strong</code> <strong>content</strong> ...
  yielded the following tokens:
  - "... this is " (space after the text is preserved)
  - <code>
  - "strong"
  - </code>
  - " " (space between the tags is preserved)
  - <strong>
  - "content"
  - </strong>
  - " ..." (space before the text is preserved)
Running Tokenize Against a Markup Fragment
Between the fragment I checked above and this one:
myTag.innerHTML = ( "This is for a tag's " +
    '<code>innerHTML</code> property<br />' +
    'It contains mixed and unary tags, as well as text!' )
which yielded:
- "This is for a tag's " (note the inclusion of the space at the end of the text-value)
- <code>
- "innerHTML"
- </code>
- " property" (note the inclusion of the space at the beginning of the text-value)
- <br />
- "It contains mixed and unary tags, as well as text!"
I'm fairly confident that the Tokenize function will work just fine.
Converting the Token Sequence into Objects
The basic outline I provided earlier is pretty close to the final process that
I came up with. I didn't specify what was actually being returned, now that I look
at it again, but my original thought/expectation was returning a list of Tag
instances. That won't work, though: since a parsing process needs to be able to parse
markup-fragments, and those fragments may not start with a tag, it needs to return
a list of IsNode instances, including CDATA, Comment, Tag and Text object-types,
rather than just returning a top-level IsElement object. It's quite possible, as
shown, that the markup being parsed will include text-nodes that aren't children of
any element. That's not a major change as far as what it returns, but it involved
some pretty significant differences in how items were handled inside the process.
It also needs the ability to split tag-names from attributes, for regular tags as well as for the various XML items that could be provided. Once the attributes-text has been isolated, it also needs to be able to identify attributes' name/value pairs. Both of these tasks are, I think, best implemented with a couple of regular expressions:
attributeFinder = re.compile(
    r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
)
# Raw string, so that the \s reaches the regex engine untouched
whitespaceSplitter = re.compile( r'\s' )
I suspect I'll need to tweak the attributeFinder, to assure that it'll
grab onto any valid attribute-name. The pattern specified is from memory,
and may not be accurate, but I'll review that before I finalize the markup
module, and tweak it if my recollection is incorrect.
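One possible direction for that tweak, sketched here as an assumption rather than as what will end up in the module: scan the whole attribute-text with finditer instead of splitting it on whitespace first, so that values containing spaces survive. The ATTRIBUTE_PATTERN and parse_attributes names are hypothetical, and the name-pattern is only an approximation of what XML allows in a name.

```python
import re

# Hypothetical pattern: a name, "=", then a single- or double-quoted value
ATTRIBUTE_PATTERN = re.compile(
    r"""([_a-zA-Z:][-_.:0-9A-Za-z]*)\s*=\s*(?:"([^"]*)"|'([^']*)')"""
)


def parse_attributes(attribute_text):
    """Returns a dict of the attribute names and values found in the text."""
    attributes = {}
    for match in ATTRIBUTE_PATTERN.finditer(attribute_text):
        name, double_quoted, single_quoted = match.groups()
        # Only one of the two value-groups participates in any given match
        attributes[name] = (
            double_quoted if double_quoted is not None else single_quoted
        )
    return attributes


print(parse_attributes('type="submit" class="btn btn-default" action=""'))
```

Matching the quoted value with a character-class that excludes the closing quote keeps each match from running on into the next attribute, which a greedy (.*) can do when several attributes share a line.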
The prototype function for the iteration (NodesFromTokens) is fairly lengthy, so I'll break it down in a couple of chunks:
def NodesFromTokens( tokens ):
    currentElement = None
    rootElement = None
    xmlDeclaration = None
    xmlInstructions = []
    currentIndex = 0
    errorTokenOffset = 3
    results = []
    for token in tokens:
        if token[ 0 ] == '<':
            # Some sort of tag, or a document
Within the function, there are a fair few variables that keep track of various items
as the iteration progresses:
- currentElement (None or an IsElement instance): The current element that new nodes will be added to with appendChild.
- rootElement (None or an IsElement instance): The top-level element of the DOM-tree that currentElement belongs to.
- xmlDeclaration (None or BaseDocument.XMLTag): The XML declaration of the rootElement.
- xmlInstructions (list of BaseDocument.XMLTag): The XML processing-instructions of the rootElement, if any.
- currentIndex (int): Keeps track of the current position in the token-sequence of the current token, used mostly for generating reasonably-detailed error-messages if the parsing-process goes awry.
- errorTokenOffset (int): The number of tokens before and after the current token to display when an error is raised.
- results (list of BaseNode): The results to be returned by the function.
            if token[ 0:4 ] == '<!--':
                # Comment: <!-- content -->
                newNode = Comment( token[ 4:-3 ].strip() )
                if currentElement:
                    currentElement.appendChild( newNode )
                else:
                    results.append( newNode )
            elif token[ 0:9 ] == '<![CDATA[':
                # CDATA: <![CDATA[ content ]]>
                newNode = CDATA( token[ 9:-3 ].strip() )
                if currentElement:
                    currentElement.appendChild( newNode )
                else:
                    results.append( newNode )
The XML-items are pretty straightforward, though they are stored for addition to
a document later, since odds are good that the document's root tag hasn't been created
yet. I have yet to work out how documents' root-tags will be identified at this point,
but I'll address that later in this post.
            elif token[ 0:2 ] == '<?':
                # XML item: Could be an XML declaration or a
                # processing-instruction
                innerTag = token[ 2:-2 ]
                try:
                    tagName, attributes = innerTag.split( ' ', 1 )
                except ValueError:
                    tagName = innerTag
                    attributes = ''
                # Generate attributes, if applicable
                attrDict = {}
                if attributes:
                    for keyValuePair in attributes.split( ' ' ):
                        try:
                            key, value = attributeFinder.match( keyValuePair ).groups()
                            attrDict[ key ] = value
                        except AttributeError:
                            # Raised if there was no match (None has no
                            # groups()), so do nothing
                            pass
                if tagName == 'xml':
                    xmlDeclaration = BaseDocument.XMLTag( tagName )
                else:
                    xmlInstructions.append( BaseDocument.XMLTag( tagName ) )
Processing for tags is more complex, though it starts simply enough:
            else:
                # Tag: <[xmlns:]tagName[ attribute=""]*[/]?> | </[xmlns:]tagName>
                innerTag = token[ 1:-1 ].strip()
                # Determine if the tag is unary/self-closing as it appears in
                # the markup-source. This will determine whether or not it gets
                # used as a currentElement value later on.
                unary = ( token[ -2 ] == '/' )
                # Split out the tag-name and its attribute-text
                try:
                    tagName, attributes = whitespaceSplitter.split( innerTag, 1 )
                except ValueError:
                    tagName = innerTag
                    attributes = ''
                # Clean up attributes text
                if attributes and attributes[ -1 ] == '/':
                    attributes = attributes[ 0:-1 ].strip()
                # Clean up tag-name
                if tagName[ -1 ] == '/':
                    tagName = tagName[ 0:-1 ].strip()
                # Determine if there's a namespace attached to the tag, and
                # handle accordingly
                try:
                    xmlns, tagName = tagName.split( ':', 1 )
                    # For an end-tag like </ns:name>, the "/" ends up on the
                    # prefix: move it back to the tag-name so that the
                    # end-tag check can still see it
                    if xmlns[ 0:1 ] == '/':
                        xmlns = xmlns[ 1: ]
                        tagName = '/' + tagName
                except ValueError:
                    xmlns = None
                if xmlns:
                    tagNamespace = Namespace.GetNamespaceByName( xmlns )
                else:
                    tagNamespace = None
By this point, the tag's tagName and namespace have been identified.
The first thing to do is to see if it's an end-tag, though: If it is, the rest of
the processing can be skipped, since it won't have attributes, children, or any other
items that need to be processed. All that really needs to happen is verifying that
the end-tag matches the item stored in currentElement. If it doesn't, that's an
error, and I'll raise a fairly detailed ParsingError accordingly:
                # If the first character of tagName is "/", then it's an end-tag.
                # Check the namespace and the modified tag-name against the current
                # element's equivalents: If they match the current element can be
                # closed, and the loop can continue.
                if tagName[ 0 ] == '/':
                    endName = tagName[1:]
                    if currentElement:
                        if endName == currentElement.tagName and \
                            (
                                # If there is no tagNamespace, it should be OK?
                                not tagNamespace or
                                # Otherwise it needs to match!
                                tagNamespace == currentElement.namespace
                            ):
                            # The end-tag is legit, so we can set currentElement
                            # back to its parentElement. First, though:
                            # - Remove any whitespace-only Text at the
                            #   beginning of the childNodes
                            while currentElement.childNodes and \
                                isinstance(
                                    currentElement.childNodes[ 0 ], Text
                                ) and not currentElement.childNodes[ 0 ].data.strip():
                                currentElement.childNodes[ 0 ].removeSelf()
                            # - Remove any whitespace-only Text at the end of
                            #   the childNodes
                            while currentElement.childNodes and \
                                isinstance(
                                    currentElement.childNodes[ -1 ], Text
                                ) and not currentElement.childNodes[ -1 ].data.strip():
                                currentElement.childNodes[ -1 ].removeSelf()
                            currentElement = currentElement.parentElement
                        else:
                            # Generate a fairly detailed error-message,
                            # including some before-and-after tag-tokens to
                            # make debugging bad markup easier
                            tcStart = max( currentIndex - errorTokenOffset, 0 )
                            tcEnd = min( currentIndex + errorTokenOffset,
                                len( tokens )
                            )
                            tokenContext = tokens[ tcStart:tcEnd ]
                            # Include WHY the failure happened: tag-name or
                            # -namespace issue:
                            errorCauses = []
                            if endName != currentElement.tagName:
                                errorCauses.append(
                                    'tag-name mismatch: %s != %s' % (
                                        endName, currentElement.tagName
                                    )
                                )
                            # Compare Namespace objects (not the prefix-text),
                            # mirroring the check above, and allow for the
                            # current element having no namespace at all
                            if tagNamespace and \
                                tagNamespace != currentElement.namespace:
                                errorCauses.append(
                                    'namespace mismatch: %s != %s' % (
                                        xmlns, getattr(
                                            currentElement.namespace,
                                            'Name', None
                                        )
                                    )
                                )
                            raise ParsingError(
                                'Mismatched end-tag (%s) while parsing '
                                '%s: %s' % (
                                    token, tokenContext,
                                    ' and '.join( errorCauses )
                                )
                            )
                    # If everything worked, the rest can be skipped for this
                    # tag-token, since end-tags don't have attributes, etc.
                    # The continue skips the increment at the bottom of the
                    # loop, so keep currentIndex in step here.
                    currentIndex += 1
                    continue
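The before-and-after context used in that error-message is just a clamped slice of the token list. A small stand-alone sketch of the idea follows; token_context is a hypothetical helper name, not something in the module. It adds one to the upper bound so that the same number of tokens is shown after the current token as before it, since a slice excludes its end index:

```python
def token_context(tokens, index, offset=3):
    """Returns up to `offset` tokens either side of tokens[index]."""
    start = max(index - offset, 0)
    # +1 because the end of a slice is exclusive
    end = min(index + offset + 1, len(tokens))
    return tokens[start:end]


tokens = list('abcdefghij')
print(token_context(tokens, 5, 2))
print(token_context(tokens, 0, 2))
```

Clamping with max and min means the helper behaves the same way at either end of the sequence, where there simply aren't enough neighbors to fill the window.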
When a tag is closed, I want to remove any empty Text children
if they are at the beginning or end of the Tag's childNodes.
This is just some basic clean-up, but it will prevent odd occurrences like
<head>
<script src="//some-domain.com/script.js"></script>
</head>
being processed and resulting in
<head>
<script src="//some-domain.com/script.js"> </script>
</head>
(Note the space between the starting and ending <script> tags.)
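The trimming itself is independent of the node classes, so it can be sketched with plain lists. In this illustration, strings stand in for Text nodes and anything else stands in for other node-types; trim_blank_ends and is_blank_text are hypothetical names:

```python
def is_blank_text(node):
    # In this sketch, text-nodes are plain strings; everything else
    # (tags, comments, and so on) is represented by other objects
    return isinstance(node, str) and not node.strip()


def trim_blank_ends(children):
    """Drops whitespace-only text from both ends, leaving the middle alone."""
    start, end = 0, len(children)
    while start < end and is_blank_text(children[start]):
        start += 1
    while end > start and is_blank_text(children[end - 1]):
        end -= 1
    return children[start:end]


script, link = object(), object()
trimmed = trim_blank_ends([' ', script, ' ', link, ' '])
print(len(trimmed))
```

The blank item between the two non-text children is kept on purpose: only the leading and trailing whitespace is known to be safe to drop.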
If the current token doesn't represent an end-tag, then it must represent a start-tag, so any attributes need to be processed:
                # Create the actual attributes
                attrDict = {}
                if attributes:
                    for keyValuePair in whitespaceSplitter.split( attributes ):
                        # Each name="value" item
                        try:
                            key, value = attributeFinder.match(
                                keyValuePair.strip() ).groups()
                            attrDict[ key ] = value
                        except AttributeError:
                            # Raised if the match-result has no groups(), so
                            # do nothing
                            pass
Then, finally, a new Tag instance can be created. If currentElement
exists, then the new Tag instance will be appended to it, otherwise
it'll be appended to the results list. That allows the same process
to deal with both document (or at least document-like) markup-structure, where the
entire markup-body is contained in a single tag, and markup fragments, where there may be a number of nodes, including tags, all at the same level.
                # If an xmlns attribute has been specified, grab that and
                # attach it to the tag:
                nsURI = attrDict.get( 'xmlns' )
                if nsURI:
                    tagNamespace = Namespace.GetNamespaceByURI( nsURI )
                # Create a new tag and deal with it accordingly
                if currentElement:
                    # It'll be a child of currentElement
                    newNode = Tag( tagName, tagNamespace, **attrDict )
                    if not unary:
                        # Can have children, so it takes currentElement's place
                        currentElement = currentElement.appendChild( newNode )
                    else:
                        # No children allowed, so just append it
                        currentElement.appendChild( newNode )
                else:
                    # It can't be a child of an element, for whatever reason
                    newNode = Tag( tagName, tagNamespace, **attrDict )
                    if not unary:
                        # But it can HAVE children, so it takes currentElement's
                        # place.
                        currentElement = newNode
                    results.append( newNode )
The last remaining node-type that needs to be handled is Text instances,
which are very simple:
        else:
            # Text
            newNode = Text( token )
            if currentElement:
                currentElement.appendChild( newNode )
            else:
                results.append( newNode )
        currentIndex += 1
Since the results are a list, and it's possible (probably even likely
in the case of documents) for there to be empty Text-items at
the beginning and/or end of the results, I want to remove them. I don't
want to remove any empty Text-items from anywhere else, though:
markup-fragments can legitimately have them, and even documents will too...
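    # Clear out any starting- or ending-results that are empty text-nodes.
    # The "results and" guard keeps an all-whitespace source from raising
    # an IndexError once the list has been emptied.
    while results and isinstance( results[ 0 ], Text ) \
        and not results[ 0 ].data.strip():
        results = results[ 1: ]
    while results and isinstance( results[ -1 ], Text ) \
        and not results[ -1 ].data.strip():
        results = results[ :-1 ]
    return results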
Executing these prototype functions against the XML source with the following code:
tokens = Tokenize( xmlSource )
parsed = NodesFromTokens( tokens )
print parsed
print parsed[ 0 ]
yields these results (line-breaks added for clarity, empty
Text
-items
are shown as ):
[<idic.markup.Tag object at 0x7fbcb613f490>]
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- Example comment for parsing -->
<![CDATA[ Example CDATA for parsing ]]>
<head><title>Page Title</title>
<script src="//some-domain.com/file.js" /></script>
<link href="//some-domain.com/file.css" rel="stylesheet" />
</head> <body><div id="main" class="container">
<h1>Page Title</h1></div>
<div id="login" class="container">
<h1>Page Title: Log-in Required:</h1>
<form method="post"><div class="form-group">
<label for="username">Name:</label>
<input id="username" type="text" class="form-control" name="username" />
</div> <div class="form-group">
<label for="userpass">Password:</label>
<input id="userpass" type="password" class="form-control" name="userpass" />
</div> <div><button type="submit">Log In</button>
</div></form></div></body></html>
Defining the MarkupParser Class
I decided to make MarkupParser a class for several reasons:
- Being able to create separate instances for different markup-source items allows the source, the tokens and the results to persist across any number of parsing-executions.
- Parsers living in separate instances don't run any risk of cross-contamination, which could happen if the process were a free-standing function. To be fair, avoiding that sort of cross-contamination is probably just a matter of discipline, so it's not really required. At the same time, it just feels... safer, really.
- Separate parser-instances can be attached to other objects as properties if needed. I have this gut feeling that I'll want to be able to do that down the line, probably at about the point when I'm working out page-components.
Implementing MarkupParser involved little more than moving the constants and
functions into the class definition (and making them protected), and creating
properties for the source, tokens and results:
@describe.InitClass()
class MarkupParser( object ):
    """
Provides an object-class that can be instantiated with a chunk of markup that
can parse that markup into an IsNode-derived-objects DOM tree."""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#
    _attributeFinder = re.compile(
        r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
    )
    _whitespaceSplitter = re.compile( r'\s' )
    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#
    @describe.AttachDocumentation()
    def _GetResults( self ):
        """
Gets the final parsed markup-items generated from the original source"""
        try:
            return self._results
        except AttributeError:
            # Not parsed yet: generate and cache the results
            self._results = self._NodesFromTokens( self.Tokens )
            return self._results
    @describe.AttachDocumentation()
    def _GetSource( self ):
        """
Gets the original source supplied to the instance to be parsed"""
        return self._source
    @describe.AttachDocumentation()
    def _GetTokens( self ):
        """
Gets the sequence of tokens that the instance will use to generate final
results"""
        try:
            return self._tokens
        except AttributeError:
            # Not tokenized yet: generate and cache the tokens
            self._tokens = self._Tokenize( self.Source )
            return self._tokens
    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.argument( 'value',
        'the original source markup-text to be parsed by the instance',
        str, unicode
    )
    @describe.raises( TypeError,
        'if passed a value that is not a str or unicode'
    )
    def _SetSource( self, value ):
        """
Sets the original source markup-text to be parsed by the instance"""
        if type( value ) not in ( str, unicode ):
            raise TypeError( '%s.Source expects a string or unicode value, '
                'but was passed "%s" (%s)' % (
                    self.__class__.__name__, value, type( value ).__name__
                )
            )
        # A new source invalidates any cached tokens and results
        try:
            del self._tokens
        except AttributeError:
            pass
        try:
            del self._results
        except AttributeError:
            pass
        self._source = value
    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#
    @describe.AttachDocumentation()
    def _DelSource( self ):
        """
"Deletes" the original source markup-text to be parsed by the instance by
setting it to None"""
        self._source = None
    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#
    Results = describe.makeProperty( _GetResults, None, None,
        'the results of parsing the supplied source into markup-module objects',
        list
    )
    Source = describe.makeProperty( _GetSource, _SetSource, None,
        'the original markup-text to be parsed by the instance',
        str, unicode
    )
    Tokens = describe.makeProperty( _GetTokens, None, None,
        'the sequence of tokens generated from the supplied source',
        list
    )
    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.argument( 'source',
        'the markup-text value to parse',
        str, unicode
    )
    def __init__( self, source ):
        """
Instance initializer"""
        # MarkupParser is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # TODO: Explain WHY it's nominally final!                             #
        #---------------------------------------------------------------------#
        if self.__class__ != MarkupParser:
            raise NotImplementedError( 'MarkupParser is '
                'intended to be a nominally-final class, NOT to be extended.' )
        # Call parent initializers, if applicable.
        # Set default instance property-values with _Del... methods as needed.
        self._DelSource()
        # Set instance property values from arguments if applicable.
        self.Source = source
        # Other set-up
        # ...
    @describe.AttachDocumentation()
    @describe.argument( 'tokens',
        'the list of tokens to create nodes from',
        list
    )
    @describe.raises( TypeError,
        'if passed a non-list tokens value'
    )
    @describe.raises( ValueError,
        'if the supplied tokens contain any non-str, non-unicode values'
    )
    def _NodesFromTokens( self, tokens ):
        """
Generates a list of nodes from the supplied tokens"""
        # The function-code from the NodesFromTokens
        # prototype function
        # ...
    @describe.AttachDocumentation()
    @describe.argument( 'markupText',
        'the markup-text to tokenize',
        str, unicode
    )
    @describe.raises( TypeError,
        'if passed a markupText value that is not a str or unicode'
    )
    @describe.returns( 'a list of strings, each containing one node-item '
        '(start- or end-tag, text, comment, CDATA-section) from the supplied '
        'markupText' )
    def _Tokenize( self, markupText ):
        """
Returns a sequence of markup-tokens from the supplied markup-text"""
        # ...
#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append( 'MarkupParser' )
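Stripped of the describe decorators, the caching behavior behind Source and Tokens can be sketched with plain Python properties. LazyParser is a hypothetical, simplified stand-in, its tokenizer is a placeholder (a whitespace split, not real markup tokenizing), and the tokenize_calls counter exists only to make the caching visible:

```python
class LazyParser(object):
    """Hypothetical, simplified stand-in for MarkupParser's caching."""
    def __init__(self, source):
        self.tokenize_calls = 0  # only here to make the caching visible
        self.source = source

    @property
    def source(self):
        return self._source

    @source.setter
    def source(self, value):
        if not isinstance(value, str):
            raise TypeError('source expects a string')
        # A new source invalidates anything derived from the old one
        self._tokens = None
        self._source = value

    @property
    def tokens(self):
        if self._tokens is None:
            self.tokenize_calls += 1
            # Placeholder tokenizer: a whitespace split, not markup-aware
            self._tokens = self._source.split()
        return self._tokens


parser = LazyParser('one two')
parser.tokens
parser.tokens
print(parser.tokenize_calls)
parser.source = 'three'
print(parser.tokens)
```

The sketch uses a None sentinel where the class above deletes the cached attribute and catches the AttributeError; both approaches give the same parse-once-per-source behavior.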
I'll make the entire code for MarkupParser available for download,
like I did for Tag in a previous post.
Now that MarkupParser is implemented, the implementations of Tag.cloneNode
and Tag.innerHTML can be completed, and their unit-tests reinstated:
    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    def _GetinnerHTML( self ):
        """
Gets a text representation of the markup of all children of the instance"""
        result = ''
        for child in self._childNodes:
            result += child.__str__()
        return result
    # ...
    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    @describe.argument( 'value',
        'the markup to set the instance\'s childNodes to after parsing them',
        str, unicode
    )
    @describe.raises( TypeError,
        'if passed a value that is not a str or unicode value'
    )
    def _SetinnerHTML( self, value ):
        """
Sets the children of the instance by parsing the supplied markup and
replacing the instance's children with those parsed items"""
        if type( value ) not in ( str, unicode ):
            raise TypeError( '%s.innerHTML expects a str or unicode '
                'value, but was passed "%s" (%s)' % (
                    self.__class__.__name__, value,
                    type( value ).__name__
                )
            )
        self._DelchildNodes()
        for child in MarkupParser( value ).Results:
            self.appendChild( child )
    # ...
    #-----------------------------------#
    # Instance Methods                  #
    #-----------------------------------#
    # ...
    @describe.AttachDocumentation()
    @describe.argument( 'deep',
        'indicates whether to make a "deep" copy (True) or '
        'not (False)',
        bool
    )
    def cloneNode( self, deep=False ):
        """
Returns a copy of the current node, with all of its children cloned as well
if deep is True"""
        if deep:
            # Round-trip through the rendered markup to copy the whole tree
            return MarkupParser( str( self ) ).Results[ 0 ]
        else:
            return Tag( self.tagName, self.namespace, **self.attributes )
The getter- and setter-methods for innerHTML seem pretty simple to me:
- _GetinnerHTML really just renders all of the instance's child-nodes into a string (or unicode, if a string-representation raises any of several unicode-related errors) and returns the entire text.
- _SetinnerHTML takes the provided text, clears all of the instance's childNodes out, runs the text through a MarkupParser, then appends each child from the parsed Results.
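To make the getter/setter pairing concrete, here is a minimal, self-contained sketch of the same pattern. Everything in it is a stand-in: SketchTag, toy_tokenize, STRING_TYPES and the injected parser argument are hypothetical names, not part of the real Tag or MarkupParser code, and the children are plain strings rather than node objects. One deliberate difference from the setter shown above: the sketch parses before it clears, so a parsing error would leave the existing children intact.

```python
import re

try:
    STRING_TYPES = (str, unicode)   # Python 2
except NameError:
    STRING_TYPES = (str,)           # Python 3


def toy_tokenize(markupText):
    """Hypothetical stand-in for MarkupParser: splits text into
    tag-like chunks ('<...>') and the text runs between them."""
    return re.findall(r'<[^>]*>|[^<]+', markupText)


class SketchTag(object):
    """Hypothetical, stripped-down stand-in for Tag."""

    def __init__(self, tagName, parser=toy_tokenize):
        self.tagName = tagName
        self._parser = parser
        self._childNodes = []

    def _GetinnerHTML(self):
        # Render every child and join the results into one string.
        return ''.join(str(child) for child in self._childNodes)

    def _SetinnerHTML(self, value):
        if not isinstance(value, STRING_TYPES):
            raise TypeError(
                '%s.innerHTML expects a string, but was passed %r (%s)'
                % (self.__class__.__name__, value, type(value).__name__)
            )
        # Parse first, so that a parsing error leaves the old children alone.
        children = self._parser(value)
        del self._childNodes[:]
        self._childNodes.extend(children)

    innerHTML = property(_GetinnerHTML, _SetinnerHTML)


body = SketchTag('body')
body.innerHTML = 'text <strong>strong text</strong> more text'
print(body.innerHTML)
```

Because the toy tokenizer keeps every character it sees, setting and then reading innerHTML returns the original text unchanged for well-formed input like this.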
The unit-test for innerHTML is complete at this point, but I'm expecting that I'll need to add cases to it as I start using Tag:
    def testinnerHTML(self):
        """Unit-tests the innerHTML property of a Tag instance."""
        testObject = Tag( 'body' )
        markupSources = [
            'text only',
            '<!-- a comment -->',
            '<![CDATA[ a CDATA section ]]>',
            '<a href="#somewhere">somewhere</a>',
            ( 'text before a tag <code>strong</code> text: '
              '<strong>strong text</strong> unary tag '
              '<br />\ntext after a tag' ),
            ]
        markupSources.append( ' '.join( markupSources ) )
        for expected in markupSources:
            testObject.innerHTML = expected
            actual = testObject.innerHTML
            self.assertEquals( actual, expected, 'A Tag\'s innerHTML, if set '
                'to "%s" should return the value it was set to, but "%s" (%s) '
                'was returned instead.' % ( expected, actual,
                    type( actual ).__name__ )
                )
This test, though it's very simple (too simple, perhaps), passes.
The cloneNode method also leverages MarkupParser. The basic idea is that if a deep clone is needed, it's simpler to parse the string output of the node being cloned than to clone the node, then recursively clone each child-node, with all of the re-parenting that would entail. Using a recursive approach would probably work, but could get very deep very quickly if the original Tag being cloned had any significant depth to it. Taking the parsing approach felt simpler, and easier to maintain on a long-term basis:
    @describe.AttachDocumentation()
    @describe.argument( 'deep',
        'indicates whether to make a "deep" copy (True) or '
        'not (False)',
        bool
    )
    def cloneNode( self, deep=False ):
        """
        Returns a copy of the current node with all of its children cloned as well"""
        if deep:
            return MarkupParser( str( self ) ).Results[ 0 ]
        else:
            return Tag( self.tagName, self.namespace, **self.attributes )
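For comparison, the recursive alternative described above might look something like the following sketch. SketchNode and its members are hypothetical stand-ins (not the real Tag API), kept just large enough to show the clone-and-re-parent step that has to happen at every level. The recursion depth matches the depth of the tree being cloned, which is the concern that made the parsing approach more attractive.

```python
class SketchNode(object):
    """Hypothetical minimal node: a name, attributes, a parent and children."""

    def __init__(self, name, **attributes):
        self.name = name
        self.attributes = dict(attributes)
        self.parentNode = None
        self.childNodes = []

    def appendChild(self, child):
        # Re-parent the child, then return it (mirrors the usage in the tests).
        child.parentNode = self
        self.childNodes.append(child)
        return child

    def cloneNode(self, deep=False):
        # A shallow clone copies the name and attributes only.
        clone = SketchNode(self.name, **self.attributes)
        if deep:
            # One recursive call per level of nesting.
            for child in self.childNodes:
                clone.appendChild(child.cloneNode(True))
        return clone


root = SketchNode('div')
row = root.appendChild(SketchNode('div', className='row'))
row.appendChild(SketchNode('label', htmlId='label1'))

copy = root.cloneNode(True)
print(copy.childNodes[0].childNodes[0].attributes)
```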
I generated a very basic unit-test of cloneNode, which passes:
    def testcloneNode(self):
        """Unit-tests the cloneNode method of a Tag instance."""
        testObject = Tag( 'div' )
        row1 = testObject.appendChild( Tag( 'div', className='row', htmlId='row1' ) )
        label1 = row1.appendChild( Tag( 'label', htmlId='label1' ) )
        label1.appendChild( Text( 'Label 1' ) )
        row2 = testObject.appendChild( Tag( 'div', className='row', htmlId='row2' ) )
        label2 = row2.appendChild( Tag( 'label', htmlId='label2' ) )
        label2.appendChild( Text( 'Label 2' ) )
        row3 = testObject.appendChild( Tag( 'div', className='row', htmlId='row3' ) )
        label3 = row3.appendChild( Tag( 'label', htmlId='label3' ) )
        label3.appendChild( Text( 'Label 3' ) )
        clonedNode = testObject.cloneNode( True )
        self.assertEquals( str( clonedNode ).replace( ' ', '' ),
            str( testObject ).replace( ' ', '' ) )
The replacement of single-space strings in the final assertEquals was put in place because the parsed items may not have the same whitespace as the source string (another variant of the whitespace-removal concerns that I initially had and mentioned earlier). I'm not completely happy with this test, in all honesty, but as with the test-method for innerHTML, I'll probably come back to it and add new items as I encounter them.
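If stripping every space turns out to be too blunt (it would, for instance, treat 'Label 1' and 'Label1' as equal), one possible refinement is to normalize whitespace instead of removing it. The helper below is only a sketch: normalizeMarkup is a hypothetical name, and it assumes that whitespace sitting directly between two tags is never significant, which does not hold for every document (inside a <pre>, for example, or between inline elements).

```python
import re


def normalizeMarkup(markupText):
    """Collapses whitespace so that two renderings of the same markup
    compare equal even if they were indented or wrapped differently."""
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    collapsed = re.sub(r'\s+', ' ', markupText)
    # Drop whitespace that sits directly between two tags.
    collapsed = re.sub(r'>\s+<', '><', collapsed)
    return collapsed.strip()


source = '<div>\n    <label id="label1">Label 1</label>\n</div>'
parsed = '<div><label id="label1">Label 1</label></div>'
print(normalizeMarkup(source) == normalizeMarkup(parsed))
```

In a test-method, both sides of the comparison would be passed through the helper before being handed to the assertion, in place of the two replace calls.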
That just leaves the unit-testing for MarkupParser itself. At this point, I'm not sure which strategies will yield useful tests, though I have some ideas that I'm going to pursue in the future. For the time being, though I really don't like skipping them, I'm going to, with a big, old DEFERRED message in the results so that I won't forget to come back to them later:
@unittest.skip( '### DEFERRED: Need to come up with a good, '
'solid testing strategy still.' )
Part of the decision to defer these hinges on the fact that I'll want to make sure to test MarkupParser against markup-files that are more or less representative of the sort of page-templates that I'm eventually expecting to pass to it. At present, I'm struggling with how to keep track of the expected results for those files: without a reliable expected value to compare the actual value against, the testing process would likely be meaningless...
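One way to sidestep the expected-value problem, at least partially, might be to test for round-trip stability rather than against a stored result: parse a file, render it, then parse and render that output again, and require the two renderings to match. What follows is only a sketch of that idea, not a settled strategy. The parse and render arguments are hypothetical callables (in practice, perhaps a wrapper around MarkupParser( text ).Results and a function that joins the string output of each node), and toyParse and toyRender exist only so that the sketch runs on its own.

```python
def roundTripIsStable(markupText, parse, render):
    """Returns True if rendering the parsed markup, then parsing and
    rendering that output again, yields identical text both times."""
    firstPass = render(parse(markupText))
    secondPass = render(parse(firstPass))
    return firstPass == secondPass


# Toy stand-ins so the sketch runs on its own: "parsing" splits on
# whitespace, and "rendering" joins the pieces with single spaces.
def toyParse(text):
    return text.split()


def toyRender(nodes):
    return ' '.join(nodes)


print(roundTripIsStable('<p>some   text</p>\n<p>more</p>', toyParse, toyRender))
```

A stable round trip shows that the parser and the renderer agree with each other, not that either one is correct, so it would complement hand-checked expected values rather than replace them.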
There's been a sizable chunk of code shown here, and the post is getting pretty long, so this feels like a good point to stop for now. I have yet to work in how the parsing-process will identify documents — really, I have yet to actually define concrete document-classes, so that will be the focus for my next post. I'm pretty sure that'll wrap up markup-generation and -parsing, though (finally).