Thursday, May 18, 2017

Generating and Parsing Markup in Python [8]

Almost the entire process and structure for generating markup is complete now — minus a handful of items that are waiting on the ability to parse markup-text, and some decisions about how to handle document-creation. Those two topics are loosely related, at least, so today I'm going to get the parsing capabilities worked out in the hopes that it will provide a basis for deciding how to implement concrete document-classes. So: MarkupParser:

Parsing, from 10,000 feet

Parsing markup is an interesting problem (perhaps in the sense of the old Chinese curse: may you live in interesting times):

  • There are rules for how tags can be expressed, but those rules may not be the same for any two tags, at least in certain markup dialects (HTML 5). Those rules help to define what constitutes valid markup, but may not be enough, by themselves, to determine validity.
  • Within the boundaries of those rules, markup can be very broad (a single tag can have any number of children), and can also be very deep (it's possible to have a tag within a tag within a tag... down to any depth).
The first of those two considerations is why I've chosen to use page-templates that follow XML rules: the tag-rules are simpler, and don't require keeping track of which tags do and do not require closing tags, or which cannot have closing-tags. Keeping the template-files in XML allows the rendering process (using data from the document's namespace) to determine whether a closing-tag is needed (or allowed). I haven't yet thought about how I'm going to determine how to render unary tags, like <img>, which render differently in HTML 5 and XHTML:
<!-- HTML 5 -->
<img src="">
versus
<!-- XHTML -->
<img src="" />
As things stand right now, both HTML 5 and XHTML renderings of an <img> look like the XHTML output above. That won't cause any issues in any browsers I'm aware of right now, but there's no guarantee that it won't in the future. I'm contemplating setting up an IsXMLDialect property in Namespace, defaulting to False, that would be re-set to True when an XML declaration is created on a document-instance, but that's a discussion for later, perhaps. For the time being, I'm content to leave it alone, rather than thrash through Tag.__str__ and Tag.__unicode__ to fix what isn't currently a problem.
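If that IsXMLDialect flag does materialize, the rendering fork for unary tags would be small. Here's a minimal stand-alone sketch of the idea; the function name, signature, and attribute-handling are my assumptions for illustration, not the markup module's actual API:

```python
def render_unary(tag_name, attributes, is_xml_dialect=False):
    """Render a unary tag (e.g. <img>) for HTML 5 or an XML dialect."""
    # Sort the attributes so the output is deterministic
    attr_text = ''.join(
        ' %s="%s"' % (name, value)
        for name, value in sorted(attributes.items())
    )
    if is_xml_dialect:
        # XHTML/XML requires the self-closing slash
        return '<%s%s />' % (tag_name, attr_text)
    # HTML 5 treats void elements as complete without it
    return '<%s%s>' % (tag_name, attr_text)
```

With that in place, render_unary('img', {'src': ''}) yields the HTML 5 form, and passing is_xml_dialect=True yields the XHTML form shown above.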

But I digress...

The second item is, arguably, the more significant of the two, given that I'm not expecting to be parsing markup that isn't XML-compliant for a while. It's significant because it makes the scope of the parsing problem theoretically infinite — even with some fairly realistic expectations (no more than, say, 100 children for any given tag, and no deeper nesting than, say, 30 levels), it's still a huge and dynamic set of possibilities to try to build a solution around. And that's just in a document's scope — in order to fully implement Tag.innerHTML, a parsing process has to be able to contend with mixed node-types from the supplied text. That is,

myTag.innerHTML = ( 'This is my markup. <strong>' + 
    'There are many like it, but this one is ' + 
    '<em>mine</em>!</strong>' )
should yield markup along the lines of:
<myTag>
    This is my markup. <strong>There are many like it, 
    but this one is <em>mine</em>!</strong>
</myTag>
The results of parsing that initial string aren't a single element with some number of children (as might be expected when parsing a document or a template-file that's used to generate a document). Instead, it's a sequence of nodes that would then have to be appended to myTag's child-nodes after removing any existing child-nodes.

How I'm Going To Approach This

There are two significant processes that MarkupParser is going to undertake. In summary, they are:

  1. Breaking the provided markup-text down into tokens, where each token is one of:
    • An XML declaration
    • An XML Processing-instruction;
    • A DOCTYPE declaration (maybe — I'm still thinking on this);
    • A document;
    • A tag;
    • A CDATA section;
    • A comment; or
    • A text-node.
    I expect the tokens to be a sequence of text-values, probably a list.
  2. Iterating over the sequence of tokens, while keeping track of both a root element-node (which might be a document, but would certainly be a Tag-derived instance) and a current element-node (a Tag also), and setting aside (for now) any DOCTYPE handling or document-identification:
    • If an XML declaration is encountered, and hasn't already been defined for the root element-node, store it for later use;
    • If an XML processing-instruction is encountered, store it for later use;
    • If a start-tag is encountered, create a corresponding Tag-instance, and:
      • If there is no root element defined, set it to the newly-created Tag-instance;
      • Otherwise, append the new Tag-instance to the current element;
      • Then set the current element to the newly-created Tag-instance
    • If an end-tag is encountered:
      • Check its tag-name (and namespace, if applicable) against the tag-name of the current element, raising a MarkupError if they don't match;
      • Otherwise, set the current element to the parent element of the current element, effectively closing the tag by preventing further child-appending to the Tag instance representing it.
    • If a CDATA-, comment- or text-node is encountered, create an instance of CDATA, Comment or Text, respectively, populate it with the applicable text from the token, and append it to the current node.
    • If the root node is being closed/completed, attach any XML declarations and processing-instructions to the (presumed) document.

Creating the Token Sequence

Creating the sequence of tokens is surprisingly easy. My previous efforts used a regular-expression-based process (using re.findall) to extract token-items, and it was mostly functional, but it had a couple of drawbacks:

  • It was occasionally prone to errors, requiring revision of the regular-expression definition, which grew increasingly complex and hard to manage;
  • It ended up causing an odd requirement that there could be no empty tags in the source markup, which was at least occasionally frustrating for the less-technical designer-types who were working on page-templates. By way of example, using Glyphicons (which are available as part of Bootstrap), the following markup-structure was required in order to prevent throwing ParsingErrors:
    <div>
      <i class="glyphicon glyphicon-user"><!-- . --></i>
      A user icon
    </div>
    Compare with what should have been allowed:
    <div>
      <i class="glyphicon glyphicon-user"></i>
      A user icon
    </div>
    While that wasn't too awkward to deal with, it definitely took some getting used to, and was disruptive until the habit had been formed, so it's something to avoid in the current implementation.

After some experimentation, where I landed was the following function:

def Tokenize( markupText ):
    # Split on "<" and discard any empty chunks, so that each chunk 
    # starts with the inside of a tag, or is bare text
    tokenChunks = [ 
        tc for tc in markupText.split( '<' ) if tc
    ]
    result = []
    for token in tokenChunks:
        if '>' in token:
            try:
                # Split into the tag's contents and any trailing text
                tag, text = token.split( '>' )
                result.append( '<%s>' % tag )
                # Reduce whitespace-only text down to a single space
                if not text.strip():
                    text = ' '
                if text:
                    result.append( text )
            except ValueError:
                # The chunk had more than one ">" in it; re-assemble 
                # it as-is
                result.append( '<%s' % token )
        else:
            # Bare text, with no tag in it
            if token:
                result.append( token )
    return result
This got pretty close to (maybe exactly) what's needed for both an XML document-template and a markup-fragment with a mixture of tags and text, in a couple dozen lines of code. Those token-sequences have some potentially odd-looking items in their results, but nothing that seems likely to prevent them from being used in the node-generation iteration to come.
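As a quick check that the old empty-tag restriction is gone, running the prototype against the Glyphicon fragment from earlier behaves as hoped (Tokenize is repeated here so the snippet stands alone):

```python
def Tokenize(markupText):
    # Copy of the prototype function above
    tokenChunks = [tc for tc in markupText.split('<') if tc]
    result = []
    for token in tokenChunks:
        if '>' in token:
            try:
                tag, text = token.split('>')
                result.append('<%s>' % tag)
                if not text.strip():
                    text = ' '
                if text:
                    result.append(text)
            except ValueError:
                result.append('<%s' % token)
        else:
            if token:
                result.append(token)
    return result

tokens = Tokenize('<i class="glyphicon glyphicon-user"></i>\nA user icon')
# The empty <i></i> pair tokenizes cleanly, with the usual single-space
# stand-in between the start- and end-tags:
# ['<i class="glyphicon glyphicon-user">', ' ', '</i>', '\nA user icon']
```

No ParsingError, no comment-hack required.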

Running Tokenize Against an XML Document Template

I started with a stripped-down version of the same XML Document Template that I shared at the end of last month — basically, I just stripped out any application or framework items, both namespaces and tags, leaving:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <!--
    Example comment for parsing
  -->
  <![CDATA[
    Example CDATA for parsing
  ]]>
  <head>
    <title>Page Title</title>
    <script src="//some-domain.com/file.js"></script>
    <link rel="stylesheet" href="//some-domain.com/file.css" />
  </head>
  <body>
    <div id="main" class="container">
      <h1>Page Title</h1>
    </div>
    <div id="login" class="container">
      <h1>Page Title: Log-in Required:</h1>
      <form action="" method="post">
        <div class="form-group">
          <label for="username">Name:</label>
          <input type="text" id="username" name="username" class="form-control" />
        </div>
        <div class="form-group">
          <label for="userpass">Password:</label>
          <input type="password" id="userpass" name="userpass" class="form-control" />
        </div>
        <div>
          <button type="submit" class="btn btn-default">Log In</button>
        </div>
      </form>
    </div>
  </body>
</html>
Executing the Tokenize prototype-function against that markup (as a string-constant in a throwaway file) returned the following sequence of tokens:
  1. <?xml version="1.0" encoding="UTF-8"?>
  2. " " (a single space character)
  3. <html xmlns="http://www.w3.org/1999/xhtml">
  4. " "
  5. <!--
    Example comment for parsing
    -->
  6. " "
  7. <![CDATA[
    Example CDATA for parsing
    ]]>
  8. " "
  9. <head>
  10. " "
  11. <title>
  12. "Page Title"
  13. </title>
  14. " "
  15. <script src="//some-domain.com/file.js">
  16. " "
  17. </script>
  18. " "
  19. <link rel="stylesheet" href="//some-domain.com/file.css" />
  20. " "
  21. </head>
  22. " "
  23. <body>
  24. " "
  25. <div id="main" class="container">
  26. " "
  27. <h1>
  28. "Page Title"
  29. </h1>
  30. " "
  31. </div>
  32. " "
  33. <div id="main" class="container">
  34. " "
  35. <h1>
  36. "Page Title: Log-in Required:"
  37. </h1>
  38. " "
  39. <form action="" method="post">
  40. " "
  41. <div class="form-group">
  42. " "
  43. <label for="username">
  44. "Name:"
  45. </label>
  46. " "
  47. <input type="text" id="username" name="username" class="form-control" />
  48. " "
  49. </div>
  50. " "
  51. <div class="form-group">
  52. " "
  53. <label for="userpass">
  54. "Password:"
  55. </label>
  56. " "
  57. <input type="password" id="userpass" name="userpass" class="form-control" />
  58. " "
  59. </div>
  60. " "
  61. <div>
  62. " "
  63. <button type="submit" class="btn btn-default">
  64. "Log In"
  65. </button>
  66. " "
  67. </div>
  68. " "
  69. </form>
  70. " "
  71. </div>
  72. " "
  73. </body>
  74. " "
  75. </html>
  76. " "

The recurring items might seem a bit odd, at first, or like they might be problematic when it comes to the node-generation process, but they aren't all that surprising, given how Tokenize works. Looking at the first form-group <div>:

<!-- just inside the form -->
<div class="form-group">
  <label for="username">Name:</label>
  <input type="text" id="username" name="username" class="form-control" />
</div><!-- and so on... -->
The initial tokenChunks in Tokenize would acquire the following segments of the original markup-text:
  1. div class="form-group"> (plus the whitespace following until the next tag is encountered)
  2. label for="username">Name:
  3. /label> (plus whitespace after)
  4. input type="text" id="username" name="username" class="form-control" /> (plus whitespace after)
  5. /div>
As the token iteration runs through those token-chunks:
  1. The tag and text values will be successfully acquired by
    tag, text = token.split( '>' )
    with tag containing everything inside the original < and > from the markup-text, and text containing the line-break and indentation white-space that followed the tag.
    From that point on, the rest is just clean-up and formatting, with the bulk of text being reduced down to a single space, before both are appended to the final results.
  2. The same steps will capture label for="username" and Name: for the tag and text values in the <label ...> line.
  3. Follows the same steps as #1, above, with very similar results: a tag-token and a single-space text-token;
  4. The same again; and
  5. The split succeeds here as well, yielding /div and an empty string; the empty string is normalized to a single space before both are appended. (The except branch only fires when a chunk contains more than one > character — for example, text-content with a stray > in it.)
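The chunk-list above is easy to confirm directly; this is just the first statement of Tokenize applied to that snippet:

```python
markup = (
    '<div class="form-group">\n'
    '  <label for="username">Name:</label>\n'
    '  <input type="text" id="username" name="username" class="form-control" />\n'
    '</div>'
)
# The same list-comprehension that opens Tokenize
tokenChunks = [tc for tc in markup.split('<') if tc]
# Five chunks, one per "<" in the source, each starting with the
# inside of the tag that followed it:
# 1. 'div class="form-group">\n  '
# 2. 'label for="username">Name:'
# 3. '/label>\n  '
# 4. 'input type="text" ... class="form-control" />\n'
# 5. '/div>'
```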
The first time I saw this token-output, I was somewhat concerned about the sheer number of the single space items that came back in it. It seemed to me that adding in all of those tokens during the iteration over them would lead to a potentially-significant amount of useless additional information in the final rendering, since each of them would eventually be converted into a Text item storing a single space-character as its content. At the same time, it's perfectly valid in markup to have single spaces between tags that also need to be preserved. I suspect that such cases would mostly show up in content, in situations like:
...this is <code>strong</code> <strong>content</strong> ...
The space between the </code> and <strong> tags is legitimate.

I cannot think of a good (and simple) way to strip out the single-space tokens that should be removed — Doing so would require a way to identify which of them were of legitimate use, and which were just rendered down from (probably) indentation in the markup-source. I set that aside for a bit, to look at the results from a markup-fragment (below), and after looking over that token-result, I'm pretty sure that my concerns were a non-issue:

  • The whitespace-reduction that happens because of indentation or other contiguous whitespace in the markup-source would be similarly handled by the client browser, so the size and content of those long/contiguous white-space chunks turns out to be irrelevant, so long as their existence is preserved;
  • So long as any trailing- or leading-white-space in a text-token isn't stripped down to nothing, any content-resident white-space reduction is fine. Running a quick test against
    ...this is <code>strong</code> <strong>content</strong> ...
    yielded the following tokens:
    1. "... this is " (space after the text is preserved)
    2. <code>
    3. "strong"
    4. </code>
    5. " " (space between the tags is preserved)
    6. <strong>
    7. "content"
    8. </strong>
    9. " ..." (space before the text is preserved)

Running Tokenize Against a Markup Fragment

Between the fragment I checked above and this one:

myTag.innerHTML = ( "This is for a tag's " +
    '<code>innerHTML</code> property<br />' +
    'It contains mixed and unary tags, as well as text!' )
which yielded:
  1. "This is for a tag's " (note the inclusion of the space at the end of the text-value)
  2. <code>
  3. "innerHTML"
  4. </code>
  5. " property" (note the inclusion of the space at the beginning of the text-value)
  6. <br />
  7. "It contains mixed and unary tags, as well as text!"
there's nothing surprising to me, so it feels like that prototype Tokenize function will work just fine.
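Those seven tokens can be sanity-checked by hand from the split-on-< chunks alone:

```python
fragment = (
    "This is for a tag's "
    '<code>innerHTML</code> property<br />'
    'It contains mixed and unary tags, as well as text!'
)
# The raw chunks that Tokenize would iterate over
chunks = [tc for tc in fragment.split('<') if tc]
# 1. "This is for a tag's "   (bare text; no ">" in it)
# 2. 'code>innerHTML'
# 3. '/code> property'
# 4. 'br />It contains mixed and unary tags, as well as text!'
```

Chunks 2-4 each split once on > into a tag-token and its trailing text, which is exactly the seven-token sequence listed above.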

Converting the Token Sequence into Objects

The basic outline I provided earlier is pretty close to the final process that I came up with. I didn't specify what would actually be returned; my original thought was a list of Tag instances. That won't work, though: since the parsing process needs to handle markup-fragments, and those fragments may not start with a tag, it needs to return a list of IsNode instances (including CDATA, Comment, Tag and Text object-types) rather than a single top-level IsElement object — it's quite possible, as shown, that the markup being parsed will include text-nodes that aren't children of any element. Not a major change as far as the return-value goes, but it involved some pretty significant differences in how items were handled inside the process.

It also needs the ability to split tag-names from attributes, for regular tags as well as for the various XML items that could be provided. Once the attributes-text has been isolated, it also needs to be able to identify attributes' name/value pairs. Both of these tasks are, I think, best implemented with a couple of regular expressions:

attributeFinder = re.compile( 
    r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
)
whitespaceSplitter = re.compile( r'\s' )
I suspect I'll need to tweak the attributeFinder, to assure that it'll grab onto any valid attribute-name — the pattern specified is from memory, and may not be accurate, but I'll review that before I finalize the markup module, and tweak it if my recollection is incorrect.
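For what it's worth, the pattern as written does handle the common cases, including quotes of the opposite kind inside a value (the greedy (.*) backtracks to the last quote-character):

```python
import re

attributeFinder = re.compile(
    r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
)

# A plain name="value" pair
m1 = attributeFinder.match('class="form-control"')
# m1.groups() -> ('class', 'form-control')

# Mixed quoting: the greedy group runs out to the last quote-character,
# so an apostrophe inside double-quotes survives intact
m2 = attributeFinder.match("""title="it's fine" """)
# m2.groups() -> ('title', "it's fine")
```

The greediness would bite if multiple name="value" pairs were fed in at once, but the process splits attribute-text on whitespace before matching, so each pair is tested in isolation.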

The prototype function for the iteration (NodesFromTokens) is fairly lengthy, so I'll break it down in a couple of chunks:

def NodesFromTokens( tokens ):
    currentElement = None
    rootElement = None
    xmlDeclaration = None
    xmlInstructions = []
    currentIndex = 0
    errorTokenOffset = 3
    results = []
    for token in tokens:
        if token[ 0 ] == '<':
            # Some sort of tag, or a document
Within the function, there are a fair few variables that keep track of various items as the iteration progresses:
currentElement
None or an IsElement instance: The current element that new nodes will be added to with appendChild.
rootElement
None or an IsElement instance: The top-level element of the DOM-tree that currentElement belongs to.
xmlDeclaration
None or BaseDocument.XMLTag: The XML declaration of the rootElement.
xmlInstructions
list of BaseDocument.XMLTag: The XML processing-instructions of the rootElement, if any.
currentIndex
int: keeps track of the current position in the token-sequence of the current token, used mostly for generating reasonably-detailed error-messages if the parsing-process goes awry.
errorTokenOffset
int: The number of tokens before and after the current token to display when an error is raised.
results
list of BaseNode: The results to be returned by the function.
The balance of the function starts by determining what type of token is being examined. The branches for comments and CDATA sections are almost identical, save for what kinds of objects get created, and what the identification-criteria are:
if token[ 0:4 ] == '<!--':
    # Comment: <!-- content -->
    newNode = Comment( token[ 4:-3 ].strip() )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
elif token[ 0:9 ] == '<![CDATA[':
    # CDATA: <![CDATA[ content ]]>
    newNode = CDATA( token[ 9:-3 ].strip() )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
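The slice-boundaries in those two branches are easy to spot-check in isolation:

```python
# '<!--' is 4 characters; '-->' is 3
comment_token = '<!-- Example comment for parsing -->'
inner_comment = comment_token[4:-3].strip()
# -> 'Example comment for parsing'

# '<![CDATA[' is 9 characters; ']]>' is 3
cdata_token = '<![CDATA[ Example CDATA for parsing ]]>'
inner_cdata = cdata_token[9:-3].strip()
# -> 'Example CDATA for parsing'
```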
The XML-items are pretty straightforward, though they are stored for addition to a document later, since odds are good that the document's root tag hasn't been created yet. I have yet to work out how documents' root-tags will be identified at this point, but I'll address that later in this post.
elif token[ 0:2 ] == '<?':
    # XML item: Could be an XML declaration or a 
    # processing-instruction
    innerTag = token[ 2:-2 ]
    try:
        tagName, attributes = innerTag.split( ' ', 1 )
    except ValueError:
        tagName = innerTag
        attributes = ''
    # Generate attributes, if applicable
    attrDict = {}
    if attributes:
        for keyValuePair in attributes.split( ' ' ):
            try:
                key, value = attributeFinder.match( keyValuePair ).groups()
                attrDict[ key ] = value
            except AttributeError:
                # Raised when the pattern doesn't match (match returns 
                # None, which has no groups())
                pass
    if tagName == 'xml':
        xmlDeclaration = BaseDocument.XMLTag( tagName )
    else:
        xmlInstructions.append( BaseDocument.XMLTag( tagName ) )
Processing for tags is more complex, though it starts simply enough:
else:
    # Tag: <[xmlns:]tagName[ attribute=""]*[/]?> | </[xmlns:]tagName>
    innerTag = token[ 1:-1 ].strip()
    # Determine if the tag is unary/self-closing as it appears in 
    # the markup-source. This will determine whether or not it gets 
    # used as a currentElement value later on.
    unary = ( token[ -2 ] == '/' )
    # Split out the tag-name and its attribute-text
    try:
        tagName, attributes = whitespaceSplitter.split( innerTag, 1 )
    except ValueError:
        tagName = innerTag
        attributes = ''
    # Clean up attributes text
    if attributes and attributes[ -1 ] == '/':
        attributes = attributes[ 0:-1 ].strip()
    # Clean up tag-name
    if tagName[ -1 ] == '/':
        tagName = tagName[ 0:-1 ].strip()
    # Determine if there's a namespace attached to the tag, and 
    # handle accordingly
    try:
        xmlns, tagName = tagName.split( ':', 1 )
    except ValueError:
        xmlns = None
    if xmlns:
        tagNamespace = Namespace.GetNamespaceByName( xmlns )
    else:
        tagNamespace = None
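That colon-split idiom can be exercised on its own; this is a stand-alone check (the helper function name is mine, not the module's):

```python
def split_namespace(tagName):
    """Mirror of the try/except above: returns (xmlns, localName)."""
    try:
        # At most one split, so namespaced names like "a:b" come apart
        xmlns, localName = tagName.split(':', 1)
    except ValueError:
        # No colon present: no namespace-prefix on the tag
        xmlns, localName = None, tagName
    return (xmlns, localName)
```

So split_namespace('xhtml:div') yields a prefix and a local name, while a bare 'div' falls through the ValueError branch with no prefix.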
By this point, the tag's tagName and namespace have been identified. The first thing to do is to see if it's an end-tag, though: If it is, the rest of the processing can be skipped, since it won't have attributes, children, or any other items that need to be processed. All that really needs to happen is verifying that the end-tag matches the item stored in currentElement — if it doesn't, that's an error, and I'll raise a fairly detailed ParsingError accordingly:
    # If the first character of tagName is "/", then it's an end-tag.
    # Check the namespace and the modified tag-name against the current 
    # element's equivalents: If they match the current element can be 
    # closed, and the loop can continue.
    if tagName[ 0 ] == '/':
        endName = tagName[1:]
        if currentElement:
            if endName == currentElement.tagName and \
                (
                    # If there is no tagNamespace, it should be OK?
                    not tagNamespace or 
                    # Otherwise it needs to match!
                    tagNamespace == currentElement.namespace
                ):
                # The end-tag is legit, so we can set currentElement 
                # back to its parentElement. First, though:
                # - Remove any whitespace-only Text at the 
                #   beginning of the childNodes
                while currentElement.childNodes and \
                    isinstance( 
                        currentElement.childNodes[ 0 ], Text
                    ) and not currentElement.childNodes[ 0 ].data.strip():
                    currentElement.childNodes[ 0 ].removeSelf()
                # - Remove any whitespace-only Text at the end of 
                #   the childNodes
                while currentElement.childNodes and \
                    isinstance( 
                        currentElement.childNodes[ -1 ], Text
                    ) and not currentElement.childNodes[ -1 ].data.strip():
                    currentElement.childNodes[ -1 ].removeSelf()
                currentElement = currentElement.parentElement
            else:
                # Generate a fairly detailed error-message, 
                # including some before-and-after tag-tokens to 
                # make debugging bad markup easier
                tcStart = max( currentIndex - errorTokenOffset, 0 )
                tcEnd = min( currentIndex + errorTokenOffset, 
                    len( tokens )
                )
                tokenContext = tokens[ tcStart:tcEnd ]
                # Include WHY the failure happened: tag-name or 
                # -namespace issue:
                errorCauses = []
                if endName != currentElement.tagName:
                    errorCauses.append( 
                        'tag-name mismatch: %s != %s'  % ( 
                            endName, currentElement.tagName
                        )
                    )
                if tagNamespace and tagNamespace != currentElement.namespace:
                    errorCauses.append( 
                        'namespace mismatch: %s != %s' % ( 
                            xmlns, currentElement.namespace
                        )
                    )
                raise ParsingError( 
                    'Mismatched end-tag (%s) while parsing '
                    '%s: %s' % ( 
                        token, tokenContext, 
                        ' and '.join( errorCauses )
                    )
                )
        # If everything worked, the rest can be skipped for this 
        # tag-token, since end-tags don't have attributes, etc.
        continue
When a tag is closed, I want to remove any empty Text children if they are at the beginning or end of the Tag's childNodes. This is just some basic clean-up, but it will prevent odd occurrences like
<head>
    <script src="//some-domain.com/script.js"></script>
</head>
being processed and resulting in
<head>
    <script src="//some-domain.com/script.js"> </script>
</head>
(Note the space between the starting and ending <script> tags)
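The same clean-up can be sketched with plain strings standing in for Text nodes — a stand-alone illustration of the edge-trimming, not the module's actual code:

```python
def trim_edge_whitespace(children):
    """Drop whitespace-only entries from either end; keep interior ones."""
    # Trim from the front...
    while children and isinstance(children[0], str) and not children[0].strip():
        children.pop(0)
    # ...and from the back
    while children and isinstance(children[-1], str) and not children[-1].strip():
        children.pop()
    return children
```

Given ['\n    ', '<script>', '\n'], only '<script>' survives, while an interior single space (the legitimate between-tags case) is left alone.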

If the current token doesn't represent an end-tag, then it must represent a start-tag, so any attributes need to be processed:

    # Create the actual attributes
    attrDict = {}
    if attributes:
        for keyValuePair in whitespaceSplitter.split( attributes ):
            # Each name="value" item
            try:
                key, value = attributeFinder.match( 
                    keyValuePair.strip() ).groups()
                attrDict[ key ] = value
            except AttributeError:
                # Raised if the match-result has no groups(), so 
                # do nothing
                pass
Then, finally, a new Tag instance can be created. If currentElement exists, then the new Tag instance will be appended to it, otherwise it'll be appended to the results list. That allows the same process to deal with both document (or at least document-like) markup-structure, where the entire markup-body is contained in a single tag, and markup fragments, where there may be a number of nodes, including tags, all at the same level.
    # If an xmlns attribute has been specified, grab that and 
    # attach it to the tag:
    nsURI = attrDict.get( 'xmlns' )
    if nsURI:
        tagNamespace = Namespace.GetNamespaceByURI( nsURI )
    # Create a new tag and deal with it accordingly
    if currentElement:
        # It'll be a child of currentElement
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # Can have children, so it takes currentElement's place
            currentElement = currentElement.appendChild( newNode )
        else:
            # No children allowed, so just append it
            currentElement.appendChild( newNode )
    else:
        # It can't be a child of an element, for whatever reason
        newNode = Tag( tagName, tagNamespace, **attrDict )
        if not unary:
            # But it can HAVE children, so it takes currentElement's 
            # place.
            currentElement = newNode
        results.append( newNode )
The last remaining node-type that needs to be handled is Text instances, which are very simple:
else:
    # Text
    newNode = Text( token )
    if currentElement:
        currentElement.appendChild( newNode )
    else:
        results.append( newNode )
currentIndex += 1
Since the results are a list, and it's possible (probably even likely in the case of documents) for there to be empty Text-items at the beginning and/or end of the results, I want to remove them. I don't want to remove any empty Text-items from anywhere else, though: markup-fragments can legitimately have them, and even documents will too...
# Clear out any starting- or ending-results that are empty text-nodes
while results and isinstance( results[ 0 ], Text ) and not results[ 0 ].data.strip():
    results = results[ 1: ]
while results and isinstance( results[ -1 ], Text ) and not results[ -1 ].data.strip():
    results = results[ :-1 ]
return results

Executing these prototype functions against the XML source with the following code:

tokens = Tokenize( xmlSource )
parsed = NodesFromTokens( tokens )
print parsed
print parsed[ 0 ]
yields these results (line-breaks added for clarity; each empty Text-item renders as a single space):
[<idic.markup.Tag object at 0x7fbcb613f490>]
<html xmlns="http://www.w3.org/1999/xhtml"> 
<!-- Example comment for parsing --> 
<![CDATA[ Example CDATA for parsing ]]> 
<head><title>Page Title</title> 
<script src="//some-domain.com/file.js" /></script> 
<link href="//some-domain.com/file.css" rel="stylesheet" />
</head> <body><div id="main" class="container">
<h1>Page Title</h1></div> 
<div id="login" class="container">
<h1>Page Title: Log-in Required:</h1> 
<form method="post"><div class="form-group">
<label for="username">Name:</label> 
<input id="username" type="text" class="form-control" name="username" />
</div> <div class="form-group">
<label for="userpass">Password:</label> 
<input id="userpass" type="password" class="form-control" name="userpass" />
</div> <div><button type="submit">Log In</button>
</div></form></div></body></html>

Defining the MarkupParser Class

I decided to make MarkupParser a class for several reasons:

  • Being able to create separate instances for different markup-source items allows the source, the tokens and the results to persist across any number of parsing-executions.
  • Parsers living in separate instances don't run any risk of cross-contamination, which could happen if the process were a free-standing function. To be fair, avoiding that sort of cross-contamination is probably just a matter of discipline, so it's not really required. At the same time, it just feels... safer, really.
  • Separate parser-instances can be attached to other objects as properties if needed — I have this gut feeling that I'll want to be able to do that down the line, probably at about the point when I'm working out page-components.
Ultimately, given the work that I'd done on the prototype functions, creating MarkupParser involved little more than moving the constants and functions into the class definition (and making them protected), and creating properties for the source, tokens and results:
@describe.InitClass()
class MarkupParser( object ):
    """
Provides an object-class that can be instantiated with a chunk of markup that 
can parse that markup into an IsNode-derived-objects DOM tree."""
    #-----------------------------------#
    # Class attributes (and instance-   #
    # attribute default values)         #
    #-----------------------------------#

    _attributeFinder = re.compile( 
        r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']"""
    )
    _whitespaceSplitter = re.compile( r'\s' )

    #-----------------------------------#
    # Instance property-getter methods  #
    #-----------------------------------#

    @describe.AttachDocumentation()
    def _GetResults( self ):
        """
Gets the final parsed markup-items generated from the original source"""
        try:
            return self._results
        except AttributeError:
            self._results = self._NodesFromTokens( self.Tokens )
            return self._results

    @describe.AttachDocumentation()
    def _GetSource( self ):
        """
Gets the original source supplied to the instance to be parsed"""
        return self._source

    @describe.AttachDocumentation()
    def _GetTokens( self ):
        """
Gets the sequence of tokens that the instance will use to generate final 
results"""
        try:
            return self._tokens
        except AttributeError:
            self._tokens = self._Tokenize( self.Source )
            return self._tokens

    #-----------------------------------#
    # Instance property-setter methods  #
    #-----------------------------------#

    @describe.AttachDocumentation()
    @describe.argument( 'value', 
        'the original source markup-text to be parsed by the instance', 
        str, unicode
    )
    @describe.raises( TypeError, 
        'if passed a value that is not a str or unicode'
    )
    def _SetSource( self, value ):
        """
Sets the original source markup-text to be parsed by the instance"""
        if type( value ) not in ( str, unicode ):
            raise TypeError( '%s.Source expects a string or unicode value, '
                'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
                )
            )
        try:
            del self._tokens
        except AttributeError:
            pass
        try:
            del self._results
        except AttributeError:
            pass
        self._source = value

    #-----------------------------------#
    # Instance property-deleter methods #
    #-----------------------------------#

    @describe.AttachDocumentation()
    def _DelSource( self ):
        """
"Deletes" the original source markup-text to be parsed by the instance by 
setting it to None"""
        self._source = None

    #-----------------------------------#
    # Instance Properties               #
    #-----------------------------------#

    Results = describe.makeProperty( _GetResults, None, None, 
        'the results of parsing the supplied source into markup-module objects',
        list
    )
    Source = describe.makeProperty( _GetSource, _SetSource, None, 
        'the original markup-text to be parsed by the instance',
        str, unicode
    )
    Tokens = describe.makeProperty( _GetTokens, None, None, 
        'the sequence of tokens that the instance will use to generate its '
        'final results',
        list
    )

    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    @describe.argument( 'source', 
        'the markup-text value to parse', 
        str, unicode
    )
    def __init__( self, source ):
        """
Instance initializer"""
        # MarkupParser is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        #---------------------------------------------------------------------#
        # TODO: Explain WHY it's nominally final!                             #
        #---------------------------------------------------------------------#
        if self.__class__ != MarkupParser:
            raise NotImplementedError( 'MarkupParser is '
                'intended to be a nominally-final class, NOT to be extended.' )
        # Call parent initializers, if applicable.
        # Set default instance property-values with _Del... methods as needed.
        self._DelSource()
        # Set instance property values from arguments if applicable.
        self.Source = source
        # Other set-up

# ...

    @describe.AttachDocumentation()
    @describe.argument( 'tokens', 
        'the list of tokens to create nodes from', 
        list
    )
    @describe.raises( TypeError, 
        'if passed a non-list tokens value'
    )
    @describe.raises( ValueError, 
        'if the supplied tokens contain any non-str, non-unicode values'
    )
    def _NodesFromTokens( self, tokens ):
        """
Generates a list of nodes from the supplied tokens"""
        # The function-code from the NodesFromTokens
        # prototype function

# ...

    @describe.AttachDocumentation()
    @describe.argument( 'markupText', 
        'the markup-text to tokenize', 
        str, unicode
    )
    @describe.raises( TypeError, 
        'if passed a markupText value that is not a str or unicode'
    )
    @describe.returns( 'a list of strings, each containing one node-item '
        '(start- or end-tag, text, comment, CDATA-section) from the supplied '
        'markupText' )
    def _Tokenize( self, markupText ):
        """
Returns a sequence of markup-tokens from the supplied markup-text"""

# ...

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append( 'MarkupParser' )
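As an aside, the _attributeFinder pattern uses a greedy (.*) for the attribute value. That's safe when the pattern is applied to individual whitespace-split chunks, but if it were ever run against a whole tag at once, the greedy group would capture across multiple quoted attributes; a non-greedy (.*?) avoids that. A quick sketch of the difference (the sample tag here is purely illustrative):

```python
import re

# The greedy pattern, as defined in MarkupParser
greedy = re.compile( r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*)["']""" )
# A non-greedy variant of the same pattern
lazy = re.compile( r"""([_a-zA-Z][-_0-9A-Za-z]*) *= *["'](.*?)["']""" )

sample = '<a href="#top" title="Back to top">'

# Greedy: the value-group runs all the way to the last quote in the tag
print( greedy.findall( sample ) )
# Non-greedy: each attribute is captured separately
print( lazy.findall( sample ) )
```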
I'll make the entire code for MarkupParser available for download, like I did for Tag in a previous post.
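One pattern worth calling out: Results and Tokens are computed lazily, with the getters catching AttributeError to build and cache their values on first access, and _SetSource deleting both caches so that assigning a new Source forces a re-parse. A stripped-down sketch of that pattern (the class and the tokenizing logic here are illustrative, not the real code):

```python
class LazyParser( object ):
    """Illustrates the cache-on-first-access pattern used by MarkupParser."""
    def __init__( self, source ):
        self.Source = source

    @property
    def Tokens( self ):
        try:
            return self._tokens
        except AttributeError:
            # Computed once, then cached until Source changes
            self._tokens = self._source.split()
            return self._tokens

    @property
    def Source( self ):
        return self._source

    @Source.setter
    def Source( self, value ):
        # Changing the source invalidates the cached derived value
        try:
            del self._tokens
        except AttributeError:
            pass
        self._source = value
```

Because the setter deletes the cached attribute, the next access to Tokens rebuilds it from the new source rather than returning stale results.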

Now that MarkupParser is implemented, the implementations of Tag.cloneNode and Tag.innerHTML can be completed, and their unit-tests reinstated:

#-----------------------------------#
# Instance property-getter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
def _GetinnerHTML( self ):
    """
Gets a text representation of the markup of all children of the instance"""
    result = ''
    for child in self._childNodes:
        result += child.__str__()
    return result

# ...

#-----------------------------------#
# Instance property-setter methods  #
#-----------------------------------#

# ...

@describe.AttachDocumentation()
@describe.argument( 'value', 
    'the markup to set the instance\'s childNodes to after parsing them',
    str, unicode
)
@describe.raises( TypeError, 
    'if passed a value that is not a str or unicode value'
)
def _SetinnerHTML( self, value ):
    """
Sets the children of the instance by parsing the supplied markup and 
replacing the instance's children with those parsed items"""
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.innerHTML expects a str or unicode '
            'value, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, 
                type( value ).__name__
            )
        )
    self._DelchildNodes()
    for child in MarkupParser( value ).Results:
        self.appendChild( child )

# ...

#-----------------------------------#
# Instance Methods                  #
#-----------------------------------#

# ...

The getter- and setter-methods for innerHTML seem pretty simple to me:
  • _GetinnerHTML really just renders all of the instance's child-nodes into a string (or unicode, if a string-representation raises any of several unicode-related errors) and returns the entire text.
  • _SetinnerHTML takes the provided text, runs it through a MarkupParser, clears all of the instance's childNodes out, then appends each child from the parsed Results.
A very basic unit-test of innerHTML is complete at this point, but I'm expecting that I'll need to add cases to it as I start using Tag:
def testinnerHTML(self):
    """Unit-tests the innerHTML property of a Tag instance."""
    testObject = Tag( 'body' )
    markupSources = [
        'text only',
        '<!-- a comment -->', 
        '<![CDATA[ a CDATA section ]]>', 
        '<a href="#somewhere">somewhere</a>',
        ( 'text before a tag <code>strong</code> text: '
            '<strong>strong text</strong> unary tag '
            '<br />\ntext after a tag' ),
        ]
    markupSources.append( ' '.join( markupSources ) )
    for expected in markupSources:
        testObject.innerHTML = expected
        actual = testObject.innerHTML
        self.assertEquals( actual, expected, 'A Tag\'s innerHTML, if set '
            'to "%s" should return the value it was set to, but "%s" (%s) '
            'was returned instead.' % ( expected, actual, 
                type( actual ).__name__ )
            )
This test, though it's very simple (too simple, perhaps), passes.

The cloneNode method also leverages MarkupParser: the basic idea is that, when a deep clone is needed, rather than cloning the node and then recursively cloning each of its child-nodes (with all of the re-parenting that would entail), the method simply parses the string output of the node being cloned. A recursive approach would probably work, but could get very deep very quickly if the original Tag being cloned had any significant depth to it. Taking the parsing approach felt simpler, and easier to maintain on a long-term basis:

@describe.AttachDocumentation()
@describe.argument( 'deep', 
    'indicates whether to make a "deep" copy (True) or '
    'not (False)',
    bool
)
def cloneNode( self, deep=False ):
    """
Returns a copy of the current node with all of its children cloned as well"""
    if deep:
        return MarkupParser( str( self ) ).Results[ 0 ]
    else:
        return Tag( self.tagName, self.namespace, **self.attributes )
I generated a very basic unit-test of cloneNode, which passes:
def testcloneNode(self):
    """Unit-tests the cloneNode method of a Tag instance."""
    testObject = Tag( 'div' )
    row1 = testObject.appendChild( Tag( 'div', className='row', htmlId='row1' ) )
    label1 = row1.appendChild( Tag( 'label', htmlId='label1' ) )
    label1.appendChild( Text( 'Label 1' ) )
    row2 = testObject.appendChild( Tag( 'div', className='row', htmlId='row2' ) )
    label2 = row2.appendChild( Tag( 'label', htmlId='label2' ) )
    label2.appendChild( Text( 'Label 2' ) )
    row3 = testObject.appendChild( Tag( 'div', className='row', htmlId='row3' ) )
    label3 = row3.appendChild( Tag( 'label', htmlId='label3' ) )
    label3.appendChild( Text( 'Label 3' ) )
    clonedNode = testObject.cloneNode( True )
    self.assertEquals( str( clonedNode ).replace( ' ', '' ), 
        str( testObject ).replace( ' ', '' ) )
The replacement of single-space strings in the final assertEquals was put in place because the parsed items may not have the same whitespace as the source string — another variant of the whitespace-removal concerns that I initially had and mentioned earlier. I'm not completely happy with this test, in all honesty, but like the test-method for innerHTML, I'll probably come back to it and add new items as I encounter them.
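If stripping every space turns out to be too blunt (it also erases meaningful spaces inside text-nodes), a gentler comparison would collapse runs of whitespace instead. This is a hypothetical helper, not something in the test-suite yet:

```python
import re

def normalizedMarkup( markup ):
    """Collapses all runs of whitespace to single spaces for comparison."""
    return re.sub( r'\s+', ' ', markup ).strip()

# Two renderings that differ only in insignificant whitespace compare equal:
print( normalizedMarkup( '<div>\n  <label>Label 1</label>\n</div>' ) )
```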

That just leaves the unit-testing for MarkupParser itself. At this point, I'm not sure what the various strategies are that will yield useful tests, though I have some ideas that I'm going to pursue in the future. For the time being, though I really don't like doing it, I'm going to skip them, with a big, old DEFERRED message in the results so that I won't forget to come back to them later:

@unittest.skip( '### DEFERRED: Need to come up with a good, '
    'solid testing strategy still.' )
Part of the decision to defer these hinges on the fact that I'll want to make sure to test MarkupParser against markup-files that are more or less representative of the sort of page-templates that I'm eventually expecting to pass to them. At present, I'm struggling with how to keep track of the expected results for those files — without a reliable expected value to compare the actual value against, the testing process would likely be meaningless...
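Since the page-templates are constrained to follow XML rules, one strategy I may explore is a round-trip check: parse a template, re-serialize it, and compare against the source. The standard library's ElementTree can at least confirm that a template is well-formed XML; here's a sketch against a trivial inline template rather than a real template-file:

```python
import xml.etree.ElementTree as ET

template = '<div class="row"><label>Label 1</label></div>'
# fromstring raises ParseError on malformed XML, so this doubles as a 
# well-formedness check on the template itself
tree = ET.fromstring( template )
rendered = ET.tostring( tree, encoding='unicode' )
print( rendered == template )
```

This wouldn't exercise any of MarkupParser's own logic, but it would provide an independently-verified expected value to compare the parser's output against.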

There's been a sizable chunk of code shown here, and the post is getting pretty long, so this feels like a good point to stop for now. I have yet to work in how the parsing-process will identify documents — really, I have yet to actually define concrete document-classes, so that will be the focus for my next post. I'm pretty sure that'll wrap up markup-generation and -parsing, though (finally).
