Thursday, January 19, 2017

Documentation Decorators: Meta-data Storage

The argument documentation-metadata creation process is technically usable at this point, though not as bullet-proof as I'd like yet. It is completely feasible to leave it as it is now, and it'd provide the ability to document function- and method-arguments like so:

# An example of documentation decorators on a function
@describe.argument( 'arg1', 'Description of arg1' )
@describe.argument( 'arg2', 'Description of arg2' )
@describe.argument( 'arg3', 'Description of arg3' )
def MyFunction( arg1, arg2, arg3=None ):
    """Description of function (original docstring)"""
    # TODO: Generate actual implementation here...
    raise NotImplementedError( 'MyFunction is not yet implemented' )

The resulting _documentation.arguments from that decoration is, presently, a standard Python dict, and would look like this:

{
    'arg1': {
        'description': 'Description of arg1',
        'expects': (<type 'object'>,),
        'hasDefault': False,
        'name': 'arg1'
    },
    'arg2': {
        'description': 'Description of arg2',
        'expects': (<type 'object'>,),
        'hasDefault': False,
        'name': 'arg2'
    },
    'arg3': {
        'defaultValue': None,
        'description': 'Description of arg3',
        'expects': (<type 'object'>,),
        'hasDefault': True,
        'name': 'arg3'
    }
}

My one remaining (major?) concern is that the underlying dict is mutable. That is, it's possible to create new elements in it, or overwrite existing ones, with invalid data. It doesn't feel likely that creation or over-writing of that data would happen with both intent and invalid structure, but accidental invalid changes are always possible, and my gut feeling is that they are much more likely to occur. The concern is that if such a change (intentional or accidental) occurs, any resulting errors will not surface until the invalid data is read, by which time the cause of the error is potentially quite a long way from where the error actually surfaces. This potential is, in fact, another variant of the rationales behind my previously-noted "manage/control all public interface entities/members" and "raise errors as close to their ultimate source as possible" principles.
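
To make that concern concrete, here is a minimal sketch, written to run under Python 3 rather than the Python 2 used elsewhere in this post. The describe_argument function is a hypothetical reader invented for the illustration and is not part of the actual code:

```python
# A plain dict accepts any structure, so a malformed entry is stored silently.
arguments = {
    'arg1': {
        'name': 'arg1',
        'description': 'Description of arg1',
        'expects': (object,),
        'hasDefault': False,
    },
}

# An accidental overwrite with the wrong structure: no error is raised here.
arguments['arg1'] = 'Description of arg1'


def describe_argument(arguments, name):
    """Hypothetical reader that formats one argument's metadata."""
    metadata = arguments[name]
    return '%s: %s' % (metadata['name'], metadata['description'])


# The failure only appears when the data is read, which may be a long way
# from the assignment that caused it.
try:
    describe_argument(arguments, 'arg1')
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```

The TypeError is raised inside the reader, and nothing in its traceback points back at the assignment that stored the string.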

So: How should this kind of scenario be handled? The underlying dict data-structure is, by its very nature, intended to be mutable, and capable of using any hashable value as a key and any value, immutable or otherwise, as a value, with no type- or value-checking performed for either. For the purposes that this argument-dictionary is going to be used for, the idea of managing its public interface/members pretty much requires that:

  • Any required keys (name and description) that are not provided should raise a KeyError;
  • Any supplied key that is not recognized should also raise a KeyError;
  • Any value supplied for a recognized key that contains invalid value-types should raise a TypeError;
  • Any key that is not supplied and that has a default value (expects only, at present) should set that default value; and
  • Any attempt to replace an existing key should raise some kind of error (TBD).
I added that last item because I could see at least some potential for duplicated describe.argument calls happening, and I'd rather such an effort raise an error than potentially corrupt the documentation metadata.

Knowing what is desired, the question then becomes how to implement it. There are a couple of approaches, as I see it:

  • Implement a class that acts like a dict, that performs all of the varied type- and value-checks: This feels better from what might be considered a purist OO-design perspective, since it would use composition or aggregation instead of inheritance for the important functionality. The trade-off is that the resulting class becomes slightly harder to work with in certain scenarios (generating a JSON representation of an instance being the item that jumps first to mind, but there may well be others that I haven't thought of).
  • Implement a class that is a dict, that performs all of the varied type- and value-checks: Ultimately, this is nothing more than subclassing the built-in dict type and overriding a few methods. This approach should eliminate most (maybe all) of the harder-to-work-with concerns of the first approach, because the instances will be dict instances. The trade-off is that keeping the original interface of a dict exactly the same may be difficult, or even impossible, though adding new methods and properties is safe.
Both approaches are possible in Python. The first uses the relatively simple expedient of emulating a container type, in this case, a dict. Looking at the collection-emulation methods, there are eight that would probably need to be implemented:
  • __contains__;
  • __delitem__;
  • __getitem__;
  • __iter__;
  • __len__;
  • __missing__;
  • __reversed__; and
  • __setitem__.

The second approach, subclassing a dict, would likely only require overriding the __setitem__ method. That assumption bears some closer examination, which I'll do in a bit.
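
One thing worth checking about that assumption is which inherited methods actually call an overridden __setitem__. The sketch below uses a hypothetical CheckedDict class (not the class from this post) and is written for Python 3. In CPython, the inherited constructor, update and setdefault all write to the underlying storage without going through the override:

```python
class CheckedDict(dict):
    """Hypothetical dict subclass that only accepts str keys in __setitem__."""

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError('keys must be str, got %r' % (key,))
        dict.__setitem__(self, key, value)


checked = CheckedDict()

# Item assignment goes through the override, so the bad key is rejected.
try:
    checked[1] = 'rejected'
    assignment_rejected = False
except TypeError:
    assignment_rejected = True

# update() and setdefault() do not call the overridden __setitem__.
checked.update({2: 'not checked'})
checked.setdefault(3, 'not checked either')

# Neither does the inherited constructor.
built = CheckedDict({4: 'also not checked'})

print(assignment_rejected, sorted(checked), sorted(built))
```

So a subclass that needs to guarantee its checks would also have to override (or disable) those other entry points, or route them through __setitem__ explicitly.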

I expect that either approach would customize the __init__ method, and would have the same helper-methods to check for validity of various keys and values, so those are likely a wash.

In either case, I'd like for the documentation to be serializable to JSON. I can't put my finger on exactly why just yet, it's just a sneaking suspicion that I'll want to be able to use that JSON data-structure somewhere down the line. Since either implementation would contain values (the types in the expects members, in particular), either implementation would also require an explicit serialization method to be created.

So, which of those methods would need to be overridden or implemented for each approach (dict subclass and dict emulation)? Perhaps the simplest dict-emulation approach would be to create an object-structure that stores its data in an internal dict. In that case, most (maybe all) of the emulated methods that would be needed could simply return the results of calling the equivalent method of the internal dict. Wrapping that method, so to speak. I'll examine the override/wrapping needs from the perspective of that sort of wrapping implementation.

The __contains__ method tests membership/containment, when, for example, 'i' in ( 'a', 'e', 'i', 'o', 'u') is executed. In a dict, it checks for membership in the keys of the dict, not in the dict's values.
dict subclass: Would not need to be overridden.
dict emulation: Would wrap the method of the underlying dict.
The __delitem__ method removes a value identified by a supplied key. Since one of the goals of the object is for it to be immutable, both implementations would need to override the method, raising some sort of error to indicate that the instances do not allow member deletion. A rough equivalent would be attempting to delete a member of a tuple, which raises a TypeError (tuple object doesn't support item deletion).
The __getitem__ method is the underlying mechanism for retrieving a member (when dictInstance[ 'key' ] is called, for example).
dict subclass: Would not need to be overridden.
dict emulation: Would wrap the method of the underlying dict.
The __iter__ method returns an iterable instance of the object's data.
dict subclass: Would not need to be overridden.
dict emulation: Would wrap the method of the underlying dict.
The __len__ method returns a numeric value (the length of the members in the collection), and is the underlying process behind code like len( myDict ).
dict subclass: Would not need to be overridden.
dict emulation: Would wrap the method of the underlying dict.
The __missing__ method handles __getitem__ calls when the specified key is not present in the collection.
dict subclass: Would not need to be overridden.
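dict emulation: Nothing to wrap. A plain dict does not define __missing__; it is a hook that dict.__getitem__ calls only on subclasses that define it, so the emulating class would implement it only if that behavior were wanted.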
The __reversed__ method returns an iterable copy of the instance's data, like __iter__, but in reversed order.
dict subclass: Would not need to be overridden.
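dict emulation: Would wrap the method of the underlying dict where one exists. Python 2 dicts do not define __reversed__ (dicts only gained it in Python 3.8), so this one may not be needed at all.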
The __setitem__ method assigns values to specified keys (e.g., myDict[ 'key' ] = 1). Since the data of an instance should be immutable (at least once a member value has been set), both implementations would require some level of override of the method, if only to check for the existence of a key. The same overriding functionality could (should) be used to type- and/or value-check member keys and values when they are being set, but much of that might be able to be handled by implementation of helper methods.
dict subclass: Full override, performing checks and/or data-structure default-value creation along the way, before calling the parent class' __setitem__ method.
dict emulation: Perform various checks and/or data-structure default-value creation before calling the __setitem__ method of the underlying dict.
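
The two overrides that this walk-through singles out for the subclass approach can be sketched in isolation. WriteOnceDict is a hypothetical name, and raising KeyError for a duplicate key is only an assumption, since the error type for that case is still to be decided. The sketch is written for Python 3:

```python
class WriteOnceDict(dict):
    """Hypothetical dict subclass: members can be set once, never deleted."""

    def __setitem__(self, key, value):
        # Refuse to replace an existing member (error type is an assumption).
        if key in self:
            raise KeyError('%r has already been set' % (key,))
        dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        # Mirrors the error a tuple raises for item deletion.
        raise TypeError('%s does not support member deletion'
                        % self.__class__.__name__)


metadata = WriteOnceDict()
metadata['arg1'] = {'name': 'arg1'}

try:
    metadata['arg1'] = {'name': 'replacement'}
    overwrite_error = None
except KeyError as exc:
    overwrite_error = exc

try:
    del metadata['arg1']
    delete_error = None
except TypeError as exc:
    delete_error = exc

# __contains__, __getitem__, __iter__ and __len__ are inherited untouched.
print(len(metadata), 'arg1' in metadata, list(metadata))
```

Everything else (__contains__, __getitem__, __iter__, __len__) comes from dict unchanged, which is the main attraction of the subclass route.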

There's not a whole lot of difference between these two approaches, ultimately. I'm going to go with the dict-subclass approach, though, for various reasons:

  • To keep as close to built-in types as I can manage: A subclassed-dict instance is still a dict object, so I won't need to worry as much about writing special code to determine whether instances are dict-equivalent objects. A simple isinstance( myObject, dict ) will return True. I'm not even sure it's possible for an instance of the wrapping-a-dict class to be identifiable as a dict-equivalent. I don't know that I'll ever really need to be able to make that sort of identification, but if I do, somewhere down the line, I won't have painted myself into a corner.
  • For unit-testing reasons: At some point, I'll be planning to write unit-tests for this class, in all probability. When I do, there would be fewer tests to be written, since there'd be no reason to test the methods that were not overridden. That cuts the number of tests for the class down to two (for __delitem__ and __setitem__) from eight — the implications of my previously-stated thorough testing goal would require unit-tests for all of the wrapping methods of the class. While I don't mind writing unit-tests (much), they are the most tedious part of development, I think, so anything that reduces the need for more is a Good Thing®™ as far as I'm concerned.
  • Another aspect of deriving the class directly from a dict is that other built-in and common libraries' functions will work without any significant effort. As a case in point, it is possible to use pprint.pprint to pretty-print an object derived from something that pprint already knows how to deal with.
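
On the question of whether a wrapping class can be identified as dict-equivalent: the standard library's abstract base classes offer one way, sketched here with a hypothetical WrappedArguments class. This is Python 3 code; in Python 2 the same base class lives at collections.Mapping. An instance passes an isinstance check against Mapping, but not against dict, and json.dumps does not treat it as a dict either:

```python
import json
from collections.abc import Mapping


class WrappedArguments(Mapping):
    """Hypothetical read-only wrapper around an internal dict."""

    def __init__(self, data=None):
        self._data = dict(data or {})

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


wrapped = WrappedArguments({'arg1': {'name': 'arg1'}})

is_mapping = isinstance(wrapped, Mapping)   # recognized as a mapping
is_dict = isinstance(wrapped, dict)         # but not as a dict
has_arg1 = 'arg1' in wrapped                # __contains__ comes from Mapping

# json.dumps only knows how to encode real dict instances.
try:
    json.dumps(wrapped)
    json_error = None
except TypeError as exc:
    json_error = exc

print(is_mapping, is_dict, has_arg1, type(json_error).__name__)
```

Code that checks against Mapping would accept the wrapper, but anything that checks for dict specifically would not, so the subclass approach remains the simpler fit here.
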
So: On to the code!

class argdict( dict ):
    """Provides a dictionary that is specifically purposed for storing 
    argument documentation metadata"""

    # Class attributes (and instance-   #
    # attribute default values)         #
    __allowedKeys = [ 
        'defaultValue', 'description',  'expects', 'hasDefault', 'name'
    ]
    __defaults = {
        'description':'No description provided.',
        'expects':(object, ),
    }
    __optionalKeys = [ 'default', 'description' ]
    __requiredKeys = [ 'expects', 'hasDefault', 'name' ]

    # Instance property-getter methods  #

The various attributes provide data to support controlling what member-names are allowed, required, and optional during the process of setting metadata for an argument. The __defaults attribute provides default values for metadata members. All of these will be used as checks and/or value-population items later on.

The class has no properties, so the next significant block is the __init__ method:

    # Instance Initializer              #
    def __init__( self, *args, **kwargs ):
        """Instance initializer."""
        # argdict is intended to be a nominally-final class
        # and is NOT intended to be extended. Alter at your own risk!
        # Probably just as a reflexive decision, really: I cannot imagine that #
        # there'd be any real *use* for extending something this specific, so #
        # it's final just to encourage thinking about inheritance-depth.      #
        if self.__class__ != argdict:
            raise NotImplementedError( 'argdict is '
                'intended to be a nominally-final class, NOT to be extended.' )
        # Call parent initializers, if applicable.
        # Set default instance property-values with _Del... methods as needed.
        # Set instance property values from arguments if applicable.
        # Type- and (maybe) value-check inbound arguments
        initCalled = False
        if args:
            # A mapping or iterable: list or tuple of (key, value), or a dict
            if len( args ) > 1:
                if isinstance( args, ( tuple, list ) ):
                    # Check structure of the mapping/iterable: 
                    #  ( <str|unicode>,<dict> ), ...
                    badItems = tuple( [ ( key, value ) for key, value in args 
                        if type( key ) not in ( str, unicode ) 
                        or not isinstance( value, dict ) ] )
                    if badItems:
                        raise TypeError( self.__class__.__name__ + ' expects '
                            'mappings of ( <str|unicode>, <dict> ), but was '
                            'passed %d mappings that do not conform: %s' % ( 
                                len( badItems), str( badItems ) ) )
                    # If this point is reached, then the baseline structure is 
                    # valid, the instance can be created as an empty dict, 
                    # and it should be populated with iterative calls to 
                    # __setitem__ to assure that the values supplied are 
                    # valid and have the baseline defaults.
                    dict.__init__( self )
                    initCalled = True
                    for key, value in args:
                        self.__setitem__( key, value )
                else:
                    raise TypeError( self.__class__.__name__ + ' expects '
                        'mappings of ( <str|unicode>, <dict> ), but was '
                        'passed %s' % ( str( args ) ) )
            elif isinstance( args[ 0 ], dict ):
                badItems = [ 
                    key for key in args[ 0 ].keys() 
                    if type( key ) not in ( str, unicode )
                ]
                if badItems:
                    badDict = {}
                    for key in badItems:
                        badDict[ key ] = args[ 0 ][ key ]
                    raise TypeError( self.__class__.__name__ + ' expects '
                        'dictionaries structured as <str|unicode>:<dict>, but '
                        'was passed %d entries that do not conform: %s' % ( 
                            len( badItems), str( badDict ) ) )
                # If this point is reached, then the baseline structure is 
                # valid, the instance can be created as an empty dict, 
                # and it should be populated with iterative calls to 
                # __setitem__ to assure that the values supplied are 
                # valid and have the baseline defaults.
                dict.__init__( self )
                initCalled = True
                self.__assureDefaults( args[ 0 ] )
                if not self.__checkRequirements( args[ 0 ] ):
                    raise ValueError( self.__class__.__name__ + ' argument-'
                        'metadata is only allowed to have certain member-'
                        'names %s, and is required to have certain of those '
                        '%s. The supplied structure %s did not pass these '
                        'checks' % ( str( tuple( self.__allowedKeys ) ), 
                            str( tuple( self.__requiredKeys ) ), 
                            args[ 0 ] ) )
                for key in args[ 0 ]:
                    value = args[ 0 ][ key ]
                    self.__setitem__( key, value )
        if kwargs:
            # A dictionary
            badItems = [ 
                key for key in kwargs.keys() 
                if type( key ) not in ( str, unicode )
            ]
            if badItems:
                badDict = {}
                for key in badItems:
                    badDict[ key ] = kwargs[ key ]
                raise TypeError( self.__class__.__name__ + ' expects '
                    'dictionaries structured as <str|unicode>:<dict>, but was '
                    'passed %d entries that do not conform: %s' % ( 
                        len( badItems), str( badDict ) ) )
            # If this point is reached, then the baseline structure is 
            # valid, the instance can be created as an empty dict, 
            # and it should be populated with iterative calls to 
            # __setitem__ to assure that the values supplied are 
            # valid and have the baseline defaults.
            dict.__init__( self )
            initCalled = True
            self.__assureDefaults( kwargs )
            if not self.__checkRequirements( kwargs ):
                raise ValueError( self.__class__.__name__ + ' argument-'
                    'metadata is only allowed to have certain member-names '
                    '%s, and is required to have certain of those %s. The '
                    'supplied structure %s did not pass these checks' % ( 
                        str( tuple( self.__allowedKeys ) ), 
                        str( tuple( self.__requiredKeys ) ), kwargs ) )
            for key in kwargs:
                value = kwargs[ key ]
                self.__setitem__( key, value )
        # If we reach this point without any errors and initCalled is False, 
        # then it's an empty structure, so all we need to do is call the most 
        # basic dict.__init__:
        if not initCalled:
            dict.__init__( self )

The __init__ of a dict accepts both an argument-list (a mapping or an iterable) and/or keyword arguments (another dict), so argdict.__init__ needs to mirror those expectations in order to be a drop-in replacement for the original dict that was called for. Once the initialization begins, it checks for the iterable/mapping in *args, and if it exists, it handles it based on whether it's a tuple or list (expecting a mapping), or another dict. In each case, it checks for bad items, and if any are detected, raises an error. If no errors are encountered, a generic dict.__init__ is called to perform basic, empty-dict initialization, then the checked values are provided with the required default values and added to the instance's data-set with the __setitem__ method that it provides. As of this post, I haven't tested the portion of the __init__ method that handles incoming mappings, or keyword-arguments, since they aren't in use during normal argument-decoration use-cases. I will (eventually) get to the point of doing formal unit-testing on the entire thing, though, and those (probably broken) code-branches will get fixed then.
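
For reference, these are the call forms of the built-in dict constructor that a drop-in replacement would need to mirror. This is a small Python 3 sketch with illustrative values only. Note that dict accepts at most one positional argument, optionally combined with keyword arguments:

```python
# One positional argument: a mapping...
from_mapping = dict({'arg1': 1})

# ...or an iterable of (key, value) pairs.
from_pairs = dict([('arg1', 1)])

# Keyword arguments only.
from_keywords = dict(arg1=1)

# One positional argument combined with keyword arguments.
combined = dict({'arg1': 1}, arg2=2)

# More than one positional argument is rejected by dict itself.
try:
    dict({'arg1': 1}, {'arg2': 2})
    too_many_error = None
except TypeError as exc:
    too_many_error = exc

print(from_mapping == from_pairs == from_keywords, combined)
```

A strict drop-in replacement would therefore only ever see zero or one positional argument.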

If __init__ is passed a dict in its **kwargs, it performs the same checks and process as are done when a dict pops up in the *args.

    # Instance Methods                  #
    def __assureDefaults( self, argSpec ):
        """Assures that the provided argSpec dictionary is populated with the default 
        values from self.__defaults *if they are not already members of the 
        argSpec*."""
        argSpecKeys = argSpec.keys()
        for key in self.__defaults:
            defaultValue = self.__defaults[ key ]
            if key not in argSpecKeys:
                argSpec[ key ] = defaultValue
        if not argSpec[ 'hasDefault' ]:
            try:
                del argSpec[ 'default' ]
            except KeyError:
                pass

    def __checkRequirements( self, argSpec ):
        """Checks metadata member-names against required, allowed names. 
        Returns True if the argSpec has all required names and only allowed names, 
        False otherwise."""
        argSpecKeys = set( argSpec.keys() )
        requiredKeys = set( self.__requiredKeys )
        allowedKeys = set( self.__allowedKeys )
        meetsRequirements = ( 
            requiredKeys.intersection( argSpecKeys ) == requiredKeys
        )
        noExtraKeys = ( 
            len( argSpecKeys.difference( allowedKeys ) ) == 0 )
        return meetsRequirements and noExtraKeys

The __assureDefaults method does just what it sounds like it should: it assures that the provided argSpec, a dict, has default values (from the __defaults class-attribute). __checkRequirements performs a member-name check on a provided dict, returning True if the required member-names are all present and there are no extraneous member-names, or False if either of those checks fails. The __checkRequirements method leverages Python's set data-type to perform those checks, yielding a simple boolean value for both meetsRequirements and noExtraKeys. That approach should allow for easy expansion if it were to become necessary.
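
As an illustration of the kind of expansion the set-based approach allows, the standalone sketch below reports which names are missing or extraneous instead of returning a single boolean. The function and constant names are hypothetical, the key lists are copied from the class attributes shown earlier, and the code is written for Python 3:

```python
# Key lists copied from the argdict class attributes shown earlier.
REQUIRED_KEYS = frozenset(['expects', 'hasDefault', 'name'])
ALLOWED_KEYS = frozenset(
    ['defaultValue', 'description', 'expects', 'hasDefault', 'name'])


def check_requirements(arg_spec):
    """Returns (missing, extra): required names that are absent from
    arg_spec, and names in arg_spec that are not allowed."""
    keys = set(arg_spec)
    missing = REQUIRED_KEYS - keys   # set difference: required but absent
    extra = keys - ALLOWED_KEYS      # set difference: present but not allowed
    return missing, extra


good = {'name': 'arg1', 'expects': (object,), 'hasDefault': False}
bad = {'name': 'arg1', 'hasDefault': False, 'colour': 'blue'}

good_result = check_requirements(good)
bad_result = check_requirements(bad)

print(good_result)
print(bad_result)
```

Two empty sets mean the structure passes. Anything else can go straight into the error message, which makes the failure easier to track down than a bare False.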

    def ToJSON( self ):
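        """Serializes the instance to a JSON string, converting the "expects" key-values 
        from their native tuple-of-types to a list-of-type-names along the way."""
        # NOTE: Requires "import json" at the top of the module.
        # Copy each metadata dict as well as the outer structure, so that the 
        # conversion below does not alter the instance's own data.
        result = dict( [ ( key, dict( self[ key ] ) ) for key in self ] )
        for key in result:
            if result[ key ].get( 'expects' ):
                result[ key ][ 'expects' ] = [ 
                    item.__name__ if hasattr( item, '__name__' ) else str( item ) 
                    for item in result[ key ][ 'expects' ]
                ]
        return json.dumps( result, sort_keys=True, indent=4 )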

The ToJSON method makes a copy of the current state of the instance, then performs some value-changes on the members that will typically contain values that cannot be converted to JSON. Right now, that's only the expects member, which will contain built-in types, None (potentially), and custom types, classes in particular. It does the conversion by looking for a __name__ attribute on the item in question, which will take care of both built-in and custom types, or, if a name isn't available (because it's a single-value type like None), it uses a string representation of the item. I haven't been able to find any other None-like types or values yet that need that sort of handling, but that doesn't mean that there aren't any (or that one or more won't surface later on).
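
The name-conversion step can be tried on its own. The sketch below is Python 3, and the type_names helper and Custom class are hypothetical names used only for the illustration. It applies the same __name__-or-str rule to a tuple of expected types:

```python
import json


def type_names(expects):
    """Converts a tuple of types (and None-like values) to a list of names,
    using __name__ where available and str() otherwise."""
    return [
        item.__name__ if hasattr(item, '__name__') else str(item)
        for item in expects
    ]


class Custom(object):
    """Stand-in for a user-defined class."""


names = type_names((int, float, Custom, None))

# The converted structure is plain data, so json.dumps can handle it.
serialized = json.dumps(
    {'arg1': {'name': 'arg1', 'expects': names}}, sort_keys=True)

print(names)
print(serialized)
```

None has no __name__ attribute, so it falls through to str() and comes out as the string 'None'.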

    def __delitem__( self, argName ):
        """Override of dict.__delitem__ to prevent the removal of members once they've 
        been set."""
        raise TypeError( self.__class__.__name__ + ' does not support member '
            'deletion.' )

    def __setitem__( self, argName, argMetadata ):
        """Override of dict.__setitem__ that only allows specific keys and 
        type-checks the value supplied before adding it to the instance's members."""
        # Type- and value-check argName
        if type( argName ) not in ( str, unicode ):
            raise TypeError( self.__class__.__name__ + ' cannot accept '
                'dictionary keys that are not text types (str or unicode). '
                '%s is a %s' % ( argName, type( argName ).__name__ ) )
        # Set default values in argMetadata
        self.__assureDefaults( argMetadata )
        # Type- and value-check the keys/values of argMetadata
        # Perform the requirements-check and raise an error if it fails
        if not self.__checkRequirements( argMetadata ):
            raise ValueError( self.__class__.__name__ + ' argument-metadata '
                'items require %s members, and cannot have members other than '
                '%s. The %s value passed is invalid.' % ( 
                    self.__requiredKeys, self.__allowedKeys, argMetadata ) )
        # If everything passed, go ahead and call dict.__setitem__
        dict.__setitem__( self, argName, argMetadata )

The __delitem__ and __setitem__ methods are, I think, pretty straightforward. __delitem__ in particular, since all it does is raise an error if it's called, preventing member deletion as part of the quest for immutability of argdict instances. The __setitem__ method is slightly more complex, performing a few (pretty obvious) type- and value-checks, delegating the creation of default values and structural-requirements checking to the previously-defined helper-methods.

After altering the existing api_documentation._CreateArgumentMetadata method to return an argdict instead of a dict, it seems to work well. Given the following code:

class Ook( object ):
    @describe.argument( 'arg1', 'Ook.Fnord (method) arg1 description', int, long, float )
    @describe.argument( 'arg2', 'Ook.Fnord (method) arg2 description' )
    def Fnord( self, arg1, arg2, *args, **kwargs ):
        """Ook.Fnord (method) original doc-string"""
        return None

    @describe.argument( 'arg1', 'Ook.Bleep (classmethod) arg1 description', int, long, float )
    @describe.argument( 'arg2', 'Ook.Bleep (classmethod) arg2 description' )
    def Bleep( cls, arg1, arg2=None, *args, **kwargs ):
        """Ook.Bleep (classmethod) original doc-string"""
        return None

    @describe.argument( 'arg1', 'Ook.Flup (staticmethod) arg1 description', int, long, float )
    @describe.argument( 'arg2', 'Ook.Flup (staticmethod) arg2 description' )
    def Flup( arg1, arg2, *args, **kwargs ):
        """Ook.Flup (staticmethod) original doc-string"""
        return None

print '-'*80
print Ook.Fnord._documentation
print '-'*80
print Ook.Bleep._documentation
print '-'*80
print Ook.Flup._documentation
print '-'*80

The output is:

Fnord( self, arg1, arg2, *args, **kwargs ) [instancemethod]

Ook.Fnord (method) original doc-string


self .............. (instance, required): The object-instance that the method will bind to for execution.
arg1 .............. (int|long|float, required): Ook.Fnord (method) arg1 description
arg2 .............. (any, required): Ook.Fnord (method) arg2 description

Bleep( cls, arg1, arg2, *args, **kwargs ) [instancemethod]

Ook.Bleep (classmethod) original doc-string


cls ............... (class, required): The class that the method will bind to for executions.
arg1 .............. (int|long|float, required): Ook.Bleep (classmethod) arg1 description
arg2 .............. (any, optional, defaults to None): Ook.Bleep (classmethod) arg2 description

Flup( arg1, arg2, *args, **kwargs ) [instancemethod]

Ook.Flup (staticmethod) original doc-string


arg1 .............. (int|long|float, required): Ook.Flup (staticmethod) arg1 description
arg2 .............. (any, required): Ook.Flup (staticmethod) arg2 description


A couple of weird little tweaky things surfaced in the api_documentation class while I was testing this, but they were easily corrected and those changes are in the current downloadable version. There may be a few small discrepancies between that version and what I noted in my previous post as a result, though.

This seems like a good point to break, since the next documentation-metadata item I'm planning to tackle is the one for argument-lists, and it's going to be at least somewhat different, so there will likely be a fair chunk of discussion beforehand.

