Tuesday, May 9, 2017

Generating and Parsing Markup in Python [6]

Though there are fewer methods in the Tag class than properties, there are still a good number of them that are, in my opinion, worth taking a detailed look at, so I'll just dive right in.

Method Implementations

There are three groups of methods that have a common theme between them, and that share a fair amount of logic as a result:

  • Methods relating to manipulating a Tag's childNodes (variations of adding children to and removing children from a Tag, ultimately);
  • Methods for getting various logical groupings or sets of child Tags from a Tag instance; and
  • Methods for working with the attributes of a Tag.
The majority of the other methods of Tag are either already implemented in BaseNode, are very simple implementations (in my opinion), or need to be deferred until some other class in the markup module is implemented.

Methods Relating to Manipulating Child Nodes

There are five different methods that add child nodes to a Tag instance in some fashion, three that involve removing a child, and two that relate to replacing a child (methods that I'm adding are indicated like this):

appendChild:
Adds a child to the instance's childNodes at the end of the collection
insertAfter:
Adds a child to the instance's childNodes after the position of a specified existing child
insertBefore:
Adds a child to the instance's childNodes before the position of a specified existing child
insertChildAt:
Adds a child to the instance's childNodes at a specified index/position in the collection
prependChild:
Adds a child to the instance's childNodes at the beginning of the collection
removeChild:
Removes the specified child from the childNodes collection
removeChildAt:
Removes the child at a specified index/position from the childNodes collection
removeSelf:
Removes the instance from its parent's childNodes collection
replaceChild:
Replaces the specified child in the instance's childNodes collection with a new child
replaceChildAt:
Replaces the child at a specified index/position in the instance's childNodes collection with a new child
Experience with a previous incarnation of this module (the one I am re-creating from the ground up) led me to the conclusion that any method that alters a Tag's childNodes should return a relevant node-value. That is, for example:
# Creating a <table> to display values in a list of dicts:
listOfDicts = [
    { 'name':'row1 - name', 'value':'row1 - value' },
    { 'name':'row2 - name', 'value':'row2 - value' },
    { 'name':'row3 - name', 'value':'row3 - value' },
    { 'name':'row4 - name', 'value':'row4 - value' },
    ]

table = Tag( 'table', border='1' )
thead = table.appendChild( Tag( 'thead' ) )
tbody = table.appendChild( Tag( 'tbody' ) )
tr = thead.appendChild( Tag( 'tr' ) )
for key in sorted( listOfDicts[ 0 ] ):
    th = tr.appendChild( Tag( 'th' ) )
    th.appendChild( Text( key ) )
for row in listOfDicts:
    tr = tbody.appendChild( Tag( 'tr' ) )
    for key in sorted( row ):
        td = tr.appendChild( Tag( 'td' ) )
        td.appendChild( Text( str( row[ key ] ) ) )
generates the following <table> quickly and easily because, in part, each appendChild returns the Tag appended, which can then be used to append other nodes to the new element:
name         | value
-------------+--------------
row1 - name  | row1 - value
row2 - name  | row2 - value
row3 - name  | row3 - value
row4 - name  | row4 - value

It didn't occur to me to check what the JavaScript methods I'm copying did until I wrote this post, but a cursory check indicates that they do the same thing (returning an appended child), so in that respect, at least, I feel this decision is solid. Similarly, anything that removes a child should return the child being removed, which is also what the JavaScript equivalents do. The lone JavaScript replace* method (replaceChild) returns the node being replaced, so I followed that convention in all of the replace* methods of Tag in the interests of consistency.

All of these methods are also responsible for making certain that the nodes being manipulated have their parent updated as part of the process. That is:

  • Any method that adds a child to a Tag ensures that the child does not already have a parent, and then sets the child's parent to the Tag it is being added to; and
  • Any method that removes a child also clears that child's parent (which makes it available to be added to a different Tag if needed).
Since the replace* methods are, essentially, just a removal of an existing child and the addition of a new one, they are responsible for performing both of those tasks if they aren't handled by calling some other methods.
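
Those rules can be sketched in isolation with a stripped-down stand-in. SketchNode and SketchError below are hypothetical names used only for illustration (they are not part of the markup module), the type-checking is left out, and the sketch is written for Python 3:

```python
class SketchError(Exception):
    """Stand-in for MarkupError (hypothetical, illustration only)."""


class SketchNode(object):
    """Stand-in for a Tag that tracks only its parent and children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.childNodes = []

    def appendChild(self, child):
        # A node that already has a parent cannot be added elsewhere
        if child.parent is not None:
            raise SketchError('%s already has a parent' % child.name)
        self.childNodes.append(child)
        child.parent = self
        return child

    def removeChild(self, child):
        if child not in self.childNodes:
            raise SketchError('%s is not a child here' % child.name)
        self.childNodes.remove(child)
        # Clearing the parent frees the node to be added elsewhere
        child.parent = None
        return child

    def replaceChildAt(self, index, newChild):
        # A replacement is a removal plus an addition, so both
        # halves of the parent bookkeeping happen here
        if newChild.parent is not None:
            raise SketchError('%s already has a parent' % newChild.name)
        oldChild = self.childNodes[index]
        self.childNodes[index] = newChild
        oldChild.parent = None
        newChild.parent = self
        return oldChild


first = SketchNode('first')
second = SketchNode('second')
item = first.appendChild(SketchNode('item'))

try:
    second.appendChild(item)    # item still belongs to first
except SketchError:
    moved = False
else:
    moved = True

# Removing the node first makes the move legal, and because both
# calls return the node, the two steps chain together
second.appendChild(first.removeChild(item))
```

The last line is where the return-value convention pays off: the removed node is handed straight to the next appendChild call.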

The carry-through of nodes added to a Tag is shown in the implementation of appendChild, as is the checking of a child's parent before executing the addition to the Tag:

@describe.AttachDocumentation()
@describe.argument( 'child', 
    'the child to append to the instance\'s childNodes', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the specified child is not an instance of BaseNode'
)
@describe.raises( MarkupError, 
    'if the child to be appended is already a child of another element'
)
@describe.returns( 'the child that was appended to the instance\'s '
    'childNodes' )
def appendChild( self, child ):
    if not isinstance( child, BaseNode ):
        raise TypeError( '%s.appendChild expects an instance of BaseNode, '
            'but was passed "%s" (%s)' % ( 
                self.__class__.__name__, child, type( child ).__name__ )
            )
    # Checking for the child's current parent
    if child.parent:
        raise MarkupError( '%s.appendChild cannot append "%s" to "%s" '
            'because the node to be appended is already a child of '
            'another element (%s)' % ( 
                self.__class__.__name__, child, self, child.parent )
            )
    # Append the child
    self._childNodes.append( child )
    # Set the child's new parent
    child._SetParent( self )
    # Return the child
    return child
The return of a removed item is shown in removeChild:
@describe.AttachDocumentation()
@describe.argument( 'child', 
    'the child node to remove', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the specified child is not a BaseNode instance'
)
@describe.raises( MarkupError, 
    'if the specified child is not a child of the instance'
)
@describe.returns( 'the child node removed' )
def removeChild( self, child ):
    """
Removes the specified child from the instance's childNodes"""
    if not isinstance( child, BaseNode ):
        raise TypeError( '%s.removeChild expects a BaseNode-derived object '
            'for its child argument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, child, 
                type( child ).__name__
            )
        )
    try:
        del self._childNodes[ self._childNodes.index( child ) ]
    except ValueError:
        raise MarkupError( '%s.removeChild could not remove %s because it '
            'is not a childNode of %s' % ( self.__class__.__name__, 
                child, self )
            )
    # Clear the removed child's parent so it can be added elsewhere
    child._DelParent()
    return child
Finally, the return of the old child is shown in replaceChildAt (which is called by replaceChild):
@describe.AttachDocumentation()
@describe.argument( 'index', 
    'the position of the child to be replaced in the instance\'s childNodes', 
    int, long
)
@describe.argument( 'newChild', 
    'the new child to replace the specified child with', 
    BaseNode
)
@describe.raises( TypeError, 
    'if the newChild specified is not a BaseNode-derived object'
)
@describe.returns( 'the child being replaced in the instance\'s '
    'childNodes' )
def replaceChildAt( self, index, newChild ):
    """
Replaces the child at the specified index/position in the instance's childNodes 
with a new child"""
    if not isinstance( newChild, BaseNode ):
        raise TypeError( '%s.replaceChildAt expects a BaseNode-derived object '
            'for its newChild argument, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, newChild, 
                type( newChild ).__name__
            )
        )
    oldChild = self._childNodes[ index ]
    self._childNodes[ index ] = newChild
    oldChild._DelParent()
    newChild._SetParent( self )
    return oldChild

The Collected getElement* Methods

JavaScript provides a number of methods that can be used to retrieve zero-to-many tag-elements based on various criteria, and I've added three more to the mix in Tag:

getElementById:
Returns the first Tag child with an id-attribute whose value matches the id supplied
getElementsByAttributeValue:
Returns a list of Tag children that have a specific attribute whose value exactly matches the value supplied
getElementsByClassName:
Returns a list of Tag children that have a class-attribute (using the classList property) containing the value supplied
getElementsByNamespace:
Returns a list of Tag children whose namespaces match the one provided, including children of children that inherit their parent's namespace
getElementsByPath:
Returns a list of Tag children whose DOM-paths relative to the instance match the path specified
getElementsByTagName:
Returns a list of Tag children whose tag-names match the tag-name specified
Many of these methods rely on being able to start with a list of all of an instance's children, so the implementation of getElementsByTagName (which can fulfil that need) is important enough to show in some detail before discussing the remainder.

getElementsByTagName in JavaScript allows * to be provided as a wild-card tag-name. If that wild-card is provided, then the method returns all child tags. That behavior is mirrored in Tag.getElementsByTagName. One difference is worth noting: the JavaScript method returns an empty collection when there are no matches, while Tag.getElementsByTagName returns None in that case.

@describe.AttachDocumentation()
@describe.argument( 'tagName',
    'the tag-name to search for in child elements. Using "*" will return '
    'all children',
    str, unicode
)
@describe.returns( 'list of child IsElement objects whose tag-name matches '
    'the tagName provided, or None' )
@describe.raises( TypeError, 
    'if the tagName provided is not a str or unicode value'
)
def getElementsByTagName( self, tagName ):
    """
Gets all child tags whose name matches the tag-name provided"""
    if type( tagName ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByTagName expects a string or '
            'unicode value for the tag-name it\'s to search for but was '
            'passed "%s" (%s)' % ( 
                self.__class__.__name__, tagName, type( tagName ).__name__ )
            )
    results = []
    for child in self.children:
        if child.tagName == tagName or tagName == '*':
            results.append( child )
        subResults = child.getElementsByTagName( tagName )
        if subResults:
            results += subResults
    if results:
        return results
    return None
getElementsByTagName makes use of recursion, by calling itself again for each child found in the current execution and appending the results of that recursive call to the results at the current level of execution. I suspect that there may be a better way of implementing this method, perhaps using a generator in some fashion, but as of this writing, I simply haven't dug into the idea enough to see if it would be worth pursuing.
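
For comparison, a generator-based variation might look something like the sketch below. It uses a hypothetical stand-in class (SketchTag, with nothing but a tagName and children) and a hypothetical method name, is written for Python 3, and yields matches lazily rather than returning a list or None, so it is not a drop-in replacement for the method above:

```python
class SketchTag(object):
    """Stand-in for Tag: only a tag-name and a list of child tags."""

    def __init__(self, tagName, *children):
        self.tagName = tagName
        self.children = list(children)

    def iterElementsByTagName(self, tagName):
        """Yields matching descendants in document order."""
        for child in self.children:
            if tagName == '*' or child.tagName == tagName:
                yield child
            # Hand off to the child's own generator for deeper matches
            for match in child.iterElementsByTagName(tagName):
                yield match


tree = SketchTag('div',
    SketchTag('p', SketchTag('span')),
    SketchTag('span'),
)
spanCount = len(list(tree.iterElementsByTagName('span')))
allNames = [t.tagName for t in tree.iterElementsByTagName('*')]
```

A caller that only needs the first match can stop iterating early, which is the main thing a generator would buy here. Whether that is worth giving up the list-or-None return value is a separate question.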

With getElementsByTagName available, several of the remaining getElement* methods become fairly simple candidate-filtering problems in their implementation:

getElementsByAttributeValue:
Each result is a candidate that has the specified attribute with the specified value
getElementsByClassName:
Each result is a candidate that has the specified value in its classList
getElementsByNamespace:
Each result is a candidate whose namespace matches the one provided
Each, then, follows the same pattern as getElementsByAttributeValue:
@describe.AttachDocumentation()
@describe.argument( 'name',
    'the name of the attribute whose value is to be checked',
    str, unicode
)
@describe.argument( 'value',
    'the value in the specified attribute that must be matched',
    str, unicode
)
@describe.raises( TypeError, 
    'if the specified name is not a str or unicode value' )
@describe.raises( TypeError, 
    'if the specified value is not a str or unicode value' )
@describe.returns( 'a list of Tag instances matching the '
    'attribute-name/-value criteria' )
def getElementsByAttributeValue( self, name, value ):
    """
Gets the child elements that have the attribute specified containing the value 
specified"""
    if type( name ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByAttributeValue expects a str or '
            'unicode value for its name, but was passed "%s" (%s)' % 
                ( self.__class__.__name__, name, 
                    type( name ).__name__
                )
            )
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByAttributeValue expects a str or '
            'unicode value for its value, but was passed "%s" (%s)' % 
                ( self.__class__.__name__, value, 
                    type( value ).__name__
                )
            )
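    # getElementsByTagName returns None when there are no child tags at all
    candidates = self.getElementsByTagName( '*' ) or []
    results = [ 
            c for c in candidates 
            if c.attributes.get( name ) == value
        ]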
    if results: 
        return results
    return None
The variations of the other two are, ultimately, just in the generation of the results being returned:
# getElementsByClassName
results = [ 
    c for c in ( self.getElementsByTagName( '*' ) or [] ) 
    if className in c.classList 
]
# getElementsByNamespace
results = [ 
    c for c in ( self.getElementsByTagName( '*' ) or [] ) 
    if c.namespace == namespace
]
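
Since these methods differ only in the test applied to each candidate, the shared pattern could be pulled into one helper that accepts a predicate. What follows is a sketch of that idea only, not how Tag is actually written: SketchTag, _allDescendants and _filterDescendants are hypothetical names, the classList here is a plain whitespace split of the class attribute, and the code targets Python 3:

```python
class SketchTag(object):
    """Stand-in for Tag with just enough state to filter on."""

    def __init__(self, tagName, attributes=None, children=None):
        self.tagName = tagName
        self.attributes = dict(attributes or {})
        self.children = list(children or [])

    @property
    def classList(self):
        # Simplified: split the class attribute on whitespace
        return self.attributes.get('class', '').split()

    def _allDescendants(self):
        # Every child tag, then that child's descendants, in order
        results = []
        for child in self.children:
            results.append(child)
            results += child._allDescendants()
        return results

    def _filterDescendants(self, predicate):
        results = [c for c in self._allDescendants() if predicate(c)]
        # Keep the convention of returning None when nothing matches
        return results or None

    def getElementsByClassName(self, className):
        return self._filterDescendants(
            lambda c: className in c.classList)

    def getElementsByAttributeValue(self, name, value):
        return self._filterDescendants(
            lambda c: c.attributes.get(name) == value)


page = SketchTag('body', children=[
    SketchTag('div', {'class': 'row odd', 'id': 'first'}),
    SketchTag('div', {'class': 'row'}, [
        SketchTag('span', {'class': 'odd'}),
    ]),
])
rows = page.getElementsByClassName('row')
odds = page.getElementsByClassName('odd')
missing = page.getElementsByAttributeValue('id', 'nope')
```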

getElementById uses getElementsByAttributeValue as a helper-method, but also provides an optional strict argument (defaulting to False) that allows it to raise a MarkupError if more than one result is found:

# ...
results = self.getElementsByAttributeValue( 'id', value )
if strict and results and len( results ) > 1:
    raise MarkupError( '%s.getElementById, with strict enforcement, '
        'found more than one child with the specified id ("%s")' % ( 
            self.__class__.__name__, value )
        )
if results:
    return results[ 0 ]
return None

The last remaining method in this group, getElementsByPath, may take some explanation. Consider a web-page that has a fair amount of content, including a lot contained in <div> tags. The page also has two <form>s in it, and within one form are a number of rows, constructed with <div>s. As part of the application's requirements, there is a need to apply some CSS classes to every form row <div>, without altering any of the others, and it has to be done, for whatever reason, server-side in the code. The form to be altered can be identified by a specific DOM path relative to the document, as can the <div>s that need to be altered. That path, to each of those <div>s, might look something like div/form/fieldset/div from the <body> of the page (with no leading slash, since the path is always relative to the Tag the method is called on).

That is what getElementsByPath is built to do. Like getElementsByTagName it uses recursion, but in this case it uses it to drill down through the DOM tree, matching tag-names (and allowing the same wild-card capabilities) in order to find all of the children in the right position relative to the Tag that the method was called from. Its implementation is simpler than might be expected, given the complexity of what it's doing:

@describe.AttachDocumentation()
@describe.argument( 'path', 
    'the path to find matching elements for, delimited by "/", and allowing '
    'wild-cards ("*")',
    str, unicode
)
@describe.raises( TypeError, 'if the supplied path is not a str or '
    'unicode value' )
@describe.returns( 'list of elements whose dom-path from the instance '
    'matched the one specified, or None' )
def getElementsByPath( self, path ):
    """
Gets all child tags that can be identified by following matching tag-names 
down the tree"""
    if type( path ) not in ( str, unicode ):
        raise TypeError( '%s.getElementsByPath expects a string or '
            'unicode value for the path it\'s to search for but was '
            'passed "%s" (%s)' % ( 
                self.__class__.__name__, path, type( path ).__name__ )
            )
    results = []
    try:
        tagName, subPath = path.split( '/', 1 )
    except ValueError:
        tagName = path
        subPath = None
    for child in self.children:
        if tagName == '*' or child.tagName == tagName:
            if subPath is not None:
                subPathResults = child.getElementsByPath( subPath )
                if subPathResults:
                    results += subPathResults
            else:
                results.append( child )
    if results:
        return results
    return None
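
To see the drill-down in action, here is a condensed sketch run against a small tree. SketchTag is a hypothetical stand-in with only a tagName and children, the type-checking is omitted, str.partition is swapped in for the split/except handling above, and the code is written for Python 3:

```python
class SketchTag(object):
    """Stand-in for Tag: only a tag-name and a list of child tags."""

    def __init__(self, tagName, *children):
        self.tagName = tagName
        self.children = list(children)

    def getElementsByPath(self, path):
        # Peel off the first path segment; subPath is '' at the end
        tagName, _, subPath = path.partition('/')
        results = []
        for child in self.children:
            if tagName == '*' or child.tagName == tagName:
                if subPath:
                    # More segments to match: keep drilling down
                    results += child.getElementsByPath(subPath) or []
                else:
                    # Last segment matched: this child is a result
                    results.append(child)
        return results or None


body = SketchTag('body',
    # The form whose rows are wanted
    SketchTag('div', SketchTag('form', SketchTag('fieldset',
        SketchTag('div'), SketchTag('div')))),
    # A second form with a different structure
    SketchTag('div', SketchTag('form', SketchTag('div'))),
    # Unrelated content
    SketchTag('div', SketchTag('p')),
)
rows = body.getElementsByPath('div/form/fieldset/div')
wild = body.getElementsByPath('div/*')
nothing = body.getElementsByPath('table/tr')
```

Only the two row <div>s inside the fieldset come back from the first call. The <div> inside the second form is skipped because its path from the <body> is div/form/div.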

Attribute-Related Methods

getAttribute:
Returns the value of the named attribute of the instance, or None if it doesn't exist
hasAttribute:
Checks for the existence of a specific attribute in the instance's attributes collection
hasAttributes:
Checks for the existence of any attributes in the instance's attributes collection
removeAttribute:
Removes the specified attribute from the instance's attributes collection
setAttribute:
Sets the value of an attribute in the instance's attributes collection

Minus the type-checking of the name argument, getAttribute is really nothing more than:

return self.attributes.get( name )

Similarly, hasAttribute is:

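# Compare against None so that an attribute whose value is an
# empty string still counts as existing
if self.attributes.get( name ) is not None:
    return True
return False
and hasAttributes is:
if self.attributes:
    return True
return False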

Setting attributes is a bit more detailed, but only because the name and value of the inbound attribute are both checked before the set actually occurs, and because of the special handling for data-* attributes:

@describe.AttachDocumentation()
@describe.argument( 'name', 
    'the name of the attribute to set', 
    str, unicode
)
@describe.argument( 'value', 
    'the value of the attribute to set', 
    str, unicode, None
)
@describe.raises( TypeError, 
    'if the supplied name is not a str or unicode value'
)
@describe.raises( TypeError, 
    'if the supplied value is not a str or unicode value or None'
)
@describe.raises( ValueError, 
    'if the supplied name is not a valid attribute-name'
)
@describe.raises( ValueError, 
    'if the supplied value is not a valid attribute-value'
)
def setAttribute( self, name, value ):
    if type( name ) not in ( str, unicode ):
        raise TypeError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute-name for the name of the attribute '
            'to be set, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, name, type( name ).__name__
            )
        )
    if not self.attributes.IsValidName( name ):
        raise ValueError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute-name for the name of the attribute '
            'to be set, but was passed "%s" which is not valid' % ( 
                self.__class__.__name__, name
            )
        )
    if value is None:
        try:
            del self._attributes[ name ]
        except KeyError:
            pass
        return
    if type( value ) not in ( str, unicode ):
        raise TypeError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute value for the value of the attribute '
            'to be set, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if not self.attributes.IsValidValue( value ):
        raise ValueError( '%s.setAttribute expects a str or unicode value '
            'that is a valid attribute value for the value of the attribute '
            'to be set, but was passed "%s" (%s)' % ( 
                self.__class__.__name__, value, type( value ).__name__
            )
        )
    if name[0:5] == 'data_':
        name = name.replace( 'data_', 'data-' )
    self._attributes[ name ] = value
Attribute removal, though, is also very simple — barring the type-checking of the name, it's basically just a fail-safe deletion of an item from the attributes collection:
try:
    del self.attributes[ name ]
except KeyError:
    pass
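
The data_ handling in setAttribute is useful wherever attribute names arrive as Python identifiers (keyword arguments, for instance), since those cannot contain hyphens. A minimal sketch of that translation on its own follows. normalizeAttributeName and buildAttributes are hypothetical helpers, not part of the markup module, and this version deliberately rewrites only the leading prefix, whereas the str.replace call in setAttribute would also rewrite any later occurrence of data_ in the name:

```python
def normalizeAttributeName(name):
    """Hypothetical helper: turns a leading 'data_' into 'data-'."""
    if name[0:5] == 'data_':
        # Rewrite the prefix only; the rest of the name is untouched
        return 'data-' + name[5:]
    return name


def buildAttributes(**attributes):
    """Hypothetical helper: normalizes keyword-supplied names."""
    return dict(
        (normalizeAttributeName(key), value)
        for key, value in attributes.items()
    )


attrs = buildAttributes(id='main', data_role='page')
```

Here attrs ends up with the keys id and data-role.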

The balance of Tag's methods are pretty straightforward, and I won't go into any depth on them:

cloneNode:
Returns a copy of the Tag with the option of returning copies of all of its childNodes as well
contains:
Determines if a Tag contains another specified node
hasChildNodes:
Determines if a Tag has any childNodes members
The cloneNode method, like the innerHTML property, has to wait until I've got MarkupParser implemented. My plan for implementing it centers around either creating a new Tag instance for shallow copies, or using the MarkupParser class to generate complete copies of a markup-tree from the __str__ and/or __unicode__ methods of Tag, since nodes in general, and tags in particular, cannot have multiple parents (as noted earlier).
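
Purely to illustrate the shallow/deep distinction, and not the parser-based plan described above, a recursive copy could be sketched like this. SketchTag is a hypothetical stand-in that handles only tag-names, attributes and child tags, the deep argument name is an assumption, and the code is written for Python 3:

```python
class SketchTag(object):
    """Stand-in for Tag: tag-name, attributes, parent and children."""

    def __init__(self, tagName, **attributes):
        self.tagName = tagName
        self.attributes = dict(attributes)
        self.parent = None
        self.childNodes = []

    def appendChild(self, child):
        # Same single-parent rule as the real appendChild
        if child.parent is not None:
            raise ValueError('child already has a parent')
        self.childNodes.append(child)
        child.parent = self
        return child

    def cloneNode(self, deep=False):
        # The copy starts out with no parent and no children
        clone = SketchTag(self.tagName, **self.attributes)
        if deep:
            # Children are copied, never shared, because a node
            # cannot belong to two parents at once
            for child in self.childNodes:
                clone.appendChild(child.cloneNode(True))
        return clone


original = SketchTag('ul', id='menu')
original.appendChild(SketchTag('li'))
shallow = original.cloneNode()
deep = original.cloneNode(deep=True)
```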

In light of how much code there actually is behind the implementation of Tag, and how much of it I didn't do any sort of deep dive into in this post or the last one, I figured I'd share the Tag class code, as well as its unit-test code before I signed off for the day. These are not the complete markup or test_markup modules, so they won't actually execute for lack of various dependencies, but all the code for both (as of this post) is there:

92.9kB
120.1kB
