Tuesday, March 7, 2017

Serialization: JSON (for now) [1]

OK. Now that the JavaScript interlude is done, it's time to get back to the Python framework... I've gone back and forth a few times now on what to tackle next, and though I'd really like to start working on some code to handle parsing, creating, and working with markup elements, I feel that I'd be remiss if I didn't start addressing some of the outstanding items in my coding standards. Specifically:

  1. It should be thoroughly tested
Since that involves unit-testing — and I have some specific desires around enforcing testing-standards and code-coverage — I think I'd be better off working through that with a smaller set of classes. I started planning out a set of posts about a DAL next, but that grew too unwieldy to be able to keep the unit-testing follow-up short, sweet and to the point as well. After rooting around a bit in other parts of the framework that I had outlined, I decided to implement a serialization module. It didn't have any concrete classes, but I could create an example of one for demonstration of unit-testing once I got done with the core serialization logic, and there were other... interesting aspects... to play with.

My Goals for Object Serialization

Since a lot of the projects that I'm contemplating are web-applications of some sort, and JSON is popular in that context, I want to be able to define some common functionality that will allow objects' state to be easily serialized into and unserialized from JSON. Python also has its pickle module, which I'll eventually want to incorporate, I suspect, but JSON will do for now.

One thing that I very much dislike about the Python JSON implementations that I've seen thus far is that while the json module's functions do a good job of dealing with built-in Python types, the mechanism for making an arbitrary instance of a class JSON-serializable seems cumbersome, requiring (apparently) the creation of dedicated JSONDecoder and JSONEncoder classes. Over and above that, I can easily imagine the need to be able to JSON-encode a given object in two distinct ways: one for public consumption over the web (making sure to sanitize out any sensitive data), and one for private use (allowing, if not mandating, that sensitive data be present and accounted for). That means that there would be a minimum of two custom classes that would have to be defined for each JSON-serializable class. Some thought would also have to go into determining how to switch between those variants. Finally, that concrete class-requirement would make it difficult to associate as an inner class, which was where my first thoughts led me.

Ugh.
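
To make the complaint concrete, here is a minimal sketch of the encoder-per-variant approach using the standard json.JSONEncoder hook. The Account class, its fields, and both encoder names are hypothetical and exist only for illustration; they are not part of the framework.

```python
import json


class Account(object):
    """Hypothetical class with one sensitive field (illustration only)."""
    def __init__(self, name, password_hash):
        self.name = name
        self.password_hash = password_hash


class PrivateAccountEncoder(json.JSONEncoder):
    """Encodes every field, sensitive or not."""
    def default(self, obj):
        if isinstance(obj, Account):
            return {'name': obj.name, 'password_hash': obj.password_hash}
        # Anything else: let the base class raise its usual TypeError
        return json.JSONEncoder.default(self, obj)


class PublicAccountEncoder(json.JSONEncoder):
    """Encodes only the fields that are safe to publish."""
    def default(self, obj):
        if isinstance(obj, Account):
            return {'name': obj.name}
        return json.JSONEncoder.default(self, obj)


account = Account('alice', 'not-a-real-hash')
# The caller has to remember which encoder to pass every time
private_json = json.dumps(account, cls=PrivateAccountEncoder, sort_keys=True)
public_json = json.dumps(account, cls=PublicAccountEncoder, sort_keys=True)
```

That is two extra classes for one small data class, and nothing ties either encoder to Account except convention.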

Instead, I'm going to define an abstract class that requires a JSON-serializable object to have some required members that provide the serialization and unserialization functionality. My first thought was to require a ToJSON instance-method and a FromJSON class-method. Each of those would require an instance-level GetSerializationDict method and a class-scoped FromDict method at another level of abstraction (an interface that the abstract class inherits from).

Then I got to thinking... Serialization to and from JSON has at least two distinct uses that I can think of off the top of my head:

  • Serializing/unserializing sanitized data for transmission over a network — something that a lot of web-services do; and
  • Serializing/unserializing for local use — Perhaps for writing data to a local file or a record in a database, for instance.
The concern that this brought to mind might be expressed as:
What if an object needs to be serialized for both these cases? A secure variant and a public variant?
After considering some of the options, the approach I'm going to take is to provide a SanitizedJSON property on all classes derived from the abstract class. That will provide a bare-bones, already-minimized and sanitized JSON output for any instance of a derived class.

That means, however, that standard json.dump and json.dumps results have to be considered in the unsanitized data category. In the case of json.dump, since it writes to a file, that's likely not going to be a concern for any of the typical application structures and environments I've been exposed to. Output from json.dumps, though, might well be used to generate JSON data for unsafe purposes, so I'll want to raise some sort of warning when that happens — nothing that prevents it (or stops execution), but that will hopefully alert a developer that there is an available option if it's needed.

All that said, here's what I'm going to define and implement for the serialization module:

  • HasSerializationDict interface: Requires implementation of a GetSerializationDict instance-method and a FromDict class-method.
  • IsJSONSerializable abstract class (extends/implements HasSerializationDict): Implements a SanitizedJSON property that uses the GetSerializationDict method required by HasSerializationDict, and requires a FromJSON class-method.
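
As a rough picture of how those two pieces might fit together, here is a heavily simplified sketch. It is not the final implementation: the abc machinery, the describe decorators, and class registration are all left out, and the Person class, its fields, and the compact-separator choice for SanitizedJSON are assumptions made for the example.

```python
import json


class HasSerializationDict(object):
    """Simplified stand-in for the interface (no abc enforcement here)."""
    def GetSerializationDict(self, sanitize=False):
        raise NotImplementedError()

    @classmethod
    def FromDict(cls, data=None, **properties):
        raise NotImplementedError()


class IsJSONSerializable(HasSerializationDict):
    """Simplified stand-in for the abstract class."""
    @property
    def SanitizedJSON(self):
        # Assumption: "minimized" means compact separators
        return json.dumps(
            self.GetSerializationDict(sanitize=True),
            separators=(',', ':'), sort_keys=True
        )

    @classmethod
    def FromJSON(cls, jsonData):
        raise NotImplementedError()


class Person(IsJSONSerializable):
    """Hypothetical derived class with one sensitive field."""
    def __init__(self, name, ssn=None):
        self.name = name
        self.ssn = ssn

    def GetSerializationDict(self, sanitize=False):
        result = {'name': self.name}
        if not sanitize:
            # Sensitive data only appears in the unsanitized dict
            result['ssn'] = self.ssn
        return result

    @classmethod
    def FromDict(cls, data=None, **properties):
        values = dict(data or {})
        values.update(properties)  # keyword values override data
        return cls(**values)

    @classmethod
    def FromJSON(cls, jsonData):
        return cls.FromDict(json.loads(jsonData))


person = Person('Pat', ssn='000-00-0000')
sanitized = person.SanitizedJSON
restored = Person.FromJSON(sanitized)
```

Here sanitized holds only the name field, and restored is a Person whose ssn is None because the sanitized output never carried it.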

There will be a few other items in the final structure, but those are the main points I think I need to cover for now. Chief among those other points: I still want to allow the standard json-module functions to be callable on IsJSONSerializable-derived objects without raising errors. I've played around with a few ideas to make that workable, and have something that I think will work well. I'll go into that in depth later, after writing out the interface and abstract class, because demonstrating that it works will require a fair chunk of code all by itself.

The First Two Serialization Classes

In previous posts, I started with the class-diagrams and interface specifics, then worked through the implementation. In this case, I'm going to start with the code, since a lot of the exploration I did to arrive at my final solution yielded fully- or near-fully implemented code.

The UnsanitizedJSONWarning Warning

At some point, once I get the json module functions' relationship with IsJSONSerializable worked out, I'll want to raise a warning if json.dumps is being called against an instance of a derived class. Python's warnings module provides the base functionality I need for that, and there are several Warning-derived classes already available. None of them really look like they're quite what I'd like to see when that warning situation occurs, though, so I'm going to create a custom Warning of my own:

#-----------------------------------#
# Defined exceptions.               #
#-----------------------------------#

@describe.InitClass()
class UnsanitizedJSONWarning( Warning ):
    """
A warning to be raised when json.dumps is used to generate 
potentially-unsanitized JSON output"""
    pass

__all__.append( 'UnsanitizedJSONWarning' )

The Warning class is derived from the standard Exception class. My experience has been that although I've created custom Exceptions with some frequency, I've never had to do anything more complex than this — usually if I feel the need to make a custom Exception, it's because I've determined that I need to be able to raise one and catch it elsewhere while allowing the except that catches it to differentiate between different exception-types. In this case, it's mostly because I'd like the warning that I'm going to surface to give some indication of what the warning's actually about. Something like this:

UnsanitizedJSONWarning: [class-name] has 
   "sanitized" JSON available in obj.SanitizedJSON.
This custom Warning will allow that.
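
A small sketch of how a custom Warning category surfaces through the warnings module. The class here is a stand-in without the describe decorator, and warn_unsanitized and Example are hypothetical names used only for the demonstration.

```python
import warnings


class UnsanitizedJSONWarning(Warning):
    """Stand-in for the class above, minus the describe decorator."""


def warn_unsanitized(obj):
    # stacklevel=2 attributes the warning to whoever called this function
    warnings.warn(
        '%s has "sanitized" JSON available in obj.SanitizedJSON.'
        % obj.__class__.__name__,
        UnsanitizedJSONWarning, stacklevel=2
    )


class Example(object):
    pass


# record=True collects warnings in a list instead of writing to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    warn_unsanitized(Example())

first = caught[0]
print('%s: %s' % (first.category.__name__, first.message))
```

The category name is what appears in front of the message, which is the whole point of defining a dedicated class.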

The HasSerializationDict Interface

The first item to go over is the HasSerializationDict interface. The intention around it is to provide some (minimal) requirements for all classes whose instances can be serialized in pretty much any fashion, by requiring that those instances be capable of generating a dictionary representation of their state. Both JSON serialization (now) and pickle-based serialization (some time later, maybe) can use a dict data-structure as a common mid-point for serializing and unserializing objects, so that just seemed a logical starting-point. Since it's an interface (at least nominally), there's not much to the code:

@describe.InitClass()
class HasSerializationDict( object ):
    """
Provides interface requirements and type-identity for objects that are 
required to implement serialization dictionaries: a dict representation of the 
instance used as a process-step for serializing object state"""

    #-----------------------------------#
    # Abstraction through abc.ABCMeta   #
    #-----------------------------------#
    __metaclass__ = abc.ABCMeta

    #-----------------------------------#
    # Static interface attributes (and  #
    # default values?)                  #
    #-----------------------------------#

    #-----------------------------------#
    # Abstract Properties               #
    #-----------------------------------#

    #-----------------------------------#
    # Instance Initializer              #
    #-----------------------------------#
    @describe.AttachDocumentation()
    def __init__( self ):
        """
Instance initializer"""
        # HasSerializationDict is intended to be an interface,
        # and is NOT intended to be instantiated. Alter at your own risk!
        if self.__class__ == HasSerializationDict:
            raise NotImplementedError( 'HasSerializationDict is '
                'intended to be an interface, NOT to be instantiated.' )
        # Call parent initializers, if applicable.
        # Other set-up

    #-----------------------------------#
    # Abstract Instance Methods         #
    #-----------------------------------#

    @abc.abstractmethod
    @describe.AttachDocumentation()
    @describe.argument( 'sanitize',
        'indicates whether the returned dictionary should be "sanitized," '
        'the implementation of which is up to the derived class',
        bool
    )
    def GetSerializationDict( self, sanitize=False ):
        """
Returns a dict representation of the instance."""
        raise NotImplementedError( '%s.GetSerializationDict has not been '
            'implemented as required by HasSerializationDict' % 
            ( self.__class__.__name__ )
        )

    #-----------------------------------#
    # Abstract Class Methods            #
    #-----------------------------------#

    @classmethod
    @describe.AttachDocumentation()
    @describe.argument( 'data',
        'the state-data to be used to create the new instance',
        dict
    )
    @describe.keywordargs( 
        'keyword arguments representation of the state-data to be used to '
        'create the new instance. NOTE: If provided, these override any values '
        'provided in the data argument!'
    )
    @describe.raises( NotImplementedError, 
        'if called by a derived object that has not overridden the nominally-'
        'abstract method' )
    def FromDict( cls, data={}, **properties ):
        """
[Nominally-abstract class-method] Returns an instance of the class whose state-
data has been populated with the values provided in the data and/or properties 
supplied."""
        raise NotImplementedError( '%s.FromDict has not been implemented as '
            'required by HasSerializationDict' % ( cls.__name__ ) )

    #-----------------------------------#
    # Static Class Methods              #
    #-----------------------------------#

#---------------------------------------#
# Append to __all__                     #
#---------------------------------------#
__all__.append( 'HasSerializationDict' )

It's worth noting that the FromDict class-method is not decorated as an abstractmethod. Python 2.7 doesn't support decorating a method as both a classmethod and an abstractmethod. I'm not sure if Python 3.x will or not — I haven't looked — but both methods are built to raise a NotImplementedError in any event, so even if FromDict isn't implemented in a derived class, it will raise that error as soon as it's called.

That's something that should happen during unit-testing, and I'll show how I'm planning to deal with that once I start down the unit-testing code that I mentioned before.
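
On the open question about Python 3.x: from Python 3.3 onward the two decorators can be stacked, as long as classmethod is the outermost one. The sketch below is Python 3 only (abc.ABC needs 3.4 or later), and the class names are hypothetical.

```python
import abc


class HasFromDict(abc.ABC):
    """Hypothetical Python 3 interface with an abstract class-method."""
    @classmethod
    @abc.abstractmethod
    def FromDict(cls, data=None, **properties):
        """Return an instance built from data and/or properties."""


class Incomplete(HasFromDict):
    """Does not override FromDict, so it stays abstract."""


class Complete(HasFromDict):
    def __init__(self, **values):
        self.values = values

    @classmethod
    def FromDict(cls, data=None, **properties):
        values = dict(data or {})
        values.update(properties)  # keyword values override data
        return cls(**values)


try:
    Incomplete()
    instantiated = True
except TypeError:
    # abc refuses to instantiate a class with abstract members left over
    instantiated = False

built = Complete.FromDict({'a': 1}, b=2)
```

Note that the enforcement happens at instantiation. Calling Incomplete.FromDict() directly is not blocked by abc, so raising NotImplementedError in the base method is still a reasonable safety net.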

Integrating the json Module Functions

I'll be honest: I struggled with how to try and make instances of IsJSONSerializable tie in nicely with the json module's dumping- and loading-functions. I poked around a lot of websites, read through a fair number of Stack Overflow articles, and took a lot of fairly long walks to try and get some right-brain creativity to kick in. For a good, long while, the problem looked insurmountable.

Then, while reviewing the doc_metadata posts prior to their publication, I started wondering if I could apply a decorator to those functions. So I tried it, and it worked!

After a fair bit of tinkering with the idea, I ended up with four functions, one each to wrap around one of the json.* functions that I was concerned with. What I'm actually doing with them doesn't feel like a typical Python decorator to me, but (amusingly enough) does feel like an application of the Decorator design pattern. I'll go over each in as much detail as seems relevant...

def wrapjsondump( origfunc ):
    """
Wraps checking for IsJSONSerializable-derived classes around the standard 
json.dump function. Note that this dumps *ALL* fields, so output is *NOT* 
sanitized for over-the-wire transit!"""
    if IsJSONSerializable._decoratedJSON.get( origfunc ):
        return IsJSONSerializable._decoratedJSON[ origfunc ]
    if origfunc != json.dump:
        raise RuntimeError( 'wrapjsondump expects json.dump as the '
            'function to decorate, but was passed %s' % ( origfunc ) )
    def _dump( obj, fp, skipkeys=False, ensure_ascii=True, 
        check_circular=True, allow_nan=True, cls=None, indent=None, 
        separators=None, encoding='utf-8', default=None, sort_keys=False, 
        **kw ):
        if isinstance( obj, IsJSONSerializable ):
            objNS = obj.PythonNamespace
            obj = obj.GetSerializationDict()
            obj[ '__namespace' ] = objNS
        return origfunc( obj, fp, skipkeys, ensure_ascii, check_circular, 
            allow_nan, cls, indent, separators, encoding, default, 
            sort_keys, **kw )
    IsJSONSerializable._decoratedJSON[ origfunc ] = _dump
    return _dump

All of these decorator-functions accept an original function (origfunc) and return a replacement function (_dump in this case). The original function persists inside the replacement function because of the way Python's closures work, leaving it accessible within the scope of the replacement, but able to be overridden outside that scope. Each of the replacement functions was written to use the same signature as the function it replaces.

A step-by-step breakdown of what happens may be useful. I'll use this function as the example, but the process is very similar with the other three:

  • Somewhere in some code, json.dump = wrapjsondump( json.dump ) is called;
    • The decorator checks to see if origfunc has already been decorated by looking up the replacement function in IsJSONSerializable._decoratedJSON. If it has been, the decorator immediately returns that found function.
      I'm not sure that this is doing exactly what I want/need, but until I get a chance to test it more thoroughly than I have at this point, I'm satisfied that it seems to be working.
      The decision to store the look-up in IsJSONSerializable was made based on the realization that it would be available any place that an instance that required the use of the decorated functions would exist — they'd have to be subclasses of IsJSONSerializable, after all.
    • A check is performed to make sure that the decorator is being applied to the appropriate original function.
    • The replacement function is defined (_dump in this case);
    • The replacement function is added to IsJSONSerializable._decoratedJSON, using the origfunc itself as the key.
    • The replacement function is returned.
    That replacement function over-writes json.dump, because the initial call that started all of this was
    json.dump = wrapjsondump( json.dump )
    From that point on, any call to json.dump will instead be handed off to the replacement _dump function.
  • Later, somewhere else, a call to json.dump is made, passing an object to be serialized:
    • Since json.dump has been replaced with _dump by the decorator, _dump is called instead:
      • The supplied object (obj) is checked, to see if it's an instance of IsJSONSerializable:
        • If it is, then a dict is built out, starting with the results of obj.GetSerializationDict(), adding a '__namespace' key to it, and the original object is replaced with the dict
      • The original function (json.dump) is called, passing the (possibly-modified) object, and
      • The results are returned.

In the final analysis, the only reason this approach works is because the original function that is being replaced is still (in fact, only) accessible inside the scope of the function it's being replaced with. If that sounds weird to you, you're not alone. I couldn't come up with any simpler way to explain it, and I'm not sure that I'm qualified to explain why it works without that explanation eventually devolving into mumbling about closures in functions.
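
The closure behavior can be seen in isolation with a stripped-down wrapper. This sketch deliberately leaves the json module untouched and binds the replacement to a local name instead; Point and wrap_dumps are hypothetical names, and forwarding with *args and **kw is a shortcut rather than the explicit signature used above.

```python
import json


class Point(object):
    """Hypothetical class that can describe itself as a dict."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def GetSerializationDict(self, sanitize=False):
        return {'x': self.x, 'y': self.y}


def wrap_dumps(origfunc):
    def _dumps(obj, *args, **kw):
        # origfunc is a free variable here: Python keeps a reference to it
        # in the closure of _dumps, whatever happens to outside names
        if isinstance(obj, Point):
            obj = obj.GetSerializationDict()
        return origfunc(obj, *args, **kw)
    return _dumps


dumps = wrap_dumps(json.dumps)
encoded = dumps(Point(1, 2), sort_keys=True)

# The original function is reachable through the replacement's closure
captured = dumps.__closure__[0].cell_contents
```

After this runs, captured is the very same function object as json.dumps, which is why rebinding the public name to the replacement does not lose the original.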

The wrapjsondumps function is very similar to wrapjsondump, which isn't surprising, I think, since they perform the same basic function, just to different outputs:

def wrapjsondumps( origfunc ):
    """
Wraps checking for IsJSONSerializable-derived classes around the standard 
json.dumps function. Note that this dumps *ALL* fields, so output is *NOT* 
sanitized for over-the-wire transit!"""
    if IsJSONSerializable._decoratedJSON.get( origfunc ):
        return IsJSONSerializable._decoratedJSON[ origfunc ]
    if origfunc != json.dumps:
        raise RuntimeError( 'wrapjsondumps expects json.dumps as the '
            'function to decorate, but was passed %s' % ( origfunc ) )
    def _dumps( obj, skipkeys=False, ensure_ascii=True, 
        check_circular=True, allow_nan=True, cls=None, indent=None, 
        separators=None, encoding='utf-8', default=None, sort_keys=False, 
        **kw ):
        if isinstance( obj, IsJSONSerializable ):
            # TODO: Figure out how to generate an exception-like warning
            #       instead of printing this message
            warnings.warn( '%s is an instance derived from '
                'IsJSONSerializable, and has "sanitized" JSON available '
                'in its SanitizedJSON property.' % 
                    ( obj.__class__.__name__ ), 
                    UnsanitizedJSONWarning, stacklevel=2
                )
            objNS = obj.PythonNamespace
            obj = obj.GetSerializationDict()
            obj[ '__namespace' ] = objNS
        return origfunc( obj, skipkeys, ensure_ascii, check_circular, 
            allow_nan, cls, indent, separators, encoding, default, 
            sort_keys, **kw )
    IsJSONSerializable._decoratedJSON[ origfunc ] = _dumps
    return _dumps

The significant differences are the signature of the replacement function (it has to match the signature of the original json.dumps function, which has no fp argument), and the warning that gets raised if obj is an instance of IsJSONSerializable. That's this chunk of code:


if isinstance( obj, IsJSONSerializable ):
    # TODO: Figure out how to generate an exception-like warning
    #       instead of printing this message
    warnings.warn( '%s is an instance derived from '
        'IsJSONSerializable, and has "sanitized" JSON available '
        'in its SanitizedJSON property.' % 
            ( obj.__class__.__name__ ), 
            UnsanitizedJSONWarning, stacklevel=2
        )
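
If "exception-like" in that TODO means that the warning should be able to stop execution when a caller wants it to, one option is the warnings module's own filter mechanism: the 'error' action turns a matching warning into a raised exception without any change to the code that issues it. A minimal sketch, in which dumps_with_warning is a hypothetical stand-in for the decorated json.dumps:

```python
import warnings


class UnsanitizedJSONWarning(Warning):
    """Stand-in for the class defined earlier."""


def dumps_with_warning(class_name):
    # Hypothetical stand-in: only the warning call matters here
    warnings.warn(
        '%s is an instance derived from IsJSONSerializable, and has '
        '"sanitized" JSON available in its SanitizedJSON property.'
        % class_name,
        UnsanitizedJSONWarning, stacklevel=2
    )
    return '{}'


raised = False
message = None
# catch_warnings restores the previous filters when the block exits
with warnings.catch_warnings():
    warnings.simplefilter('error', UnsanitizedJSONWarning)
    try:
        dumps_with_warning('Example')
    except UnsanitizedJSONWarning as error:
        raised = True
        message = str(error)
```

Because the filter is applied by the caller, the default behavior stays a non-fatal warning, and stricter environments (a test suite, say) can opt in to failures.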

The load-related functions, though they follow the same decorator-pattern as the dump-centric ones, have a very different internal process. In order for a load to be able to create an actual instance of the class it's serialized from, there's got to be some way to make an association. That's what the '__namespace' in the functions above is for, but that may not be enough by itself. The other piece of the puzzle is the idea of registering each JSON-loadable class, and keeping track of those classes so that they can be quickly identified, and their FromJSON methods can be called. Since I haven't detailed IsJSONSerializable yet, there's no context for how that works (it turned into a circular reference), but it does work, at least with the limited testing I've done so far.

def wrapjsonload( origfunc ):
    """
Replaces json.load with a function that hands processing off to a subclass of 
IsJSONSerializable for unserialization of the JSON data into an instance of the 
class when applicable"""
    if origfunc != json.load:
        raise RuntimeError( 'wrapjsonload expects json.load as the '
            'function to decorate, but was passed %s' % ( origfunc ) )
    if IsJSONSerializable._decoratedJSON.get( origfunc ):
        return IsJSONSerializable._decoratedJSON[ origfunc ]
    def _load( *args, **kw ):
        baseDict = origfunc( *args, **kw )
        try:
            objNS = baseDict.get( '__namespace' )
        except AttributeError:
            objNS = None
        if objNS:
            if type( baseDict ) != dict:
                raise ValueError( 'Decorated override of json.load '
                    'expected a dict value to convert to an instance of '
                    'IsJSONSerializable, but the supplied JSON evaluated '
                    'to "%s" (%s)' % ( 
                        baseDict, type( baseDict ).__name__
                    )
                )
            objClass = IsJSONSerializable._registeredLoadables.get( objNS )
            if objClass:
                return objClass.FromDict( baseDict )
            raise RuntimeError( 'decorated override of json.load could not '
                'find a valid object-namespace (%s) to work with: %s' % ( 
                objNS, args[ 0 ] ) )
        return baseDict
    IsJSONSerializable._decoratedJSON[ origfunc ] = _load
    return _load

Here's a walkthrough of what happens when the replacement function for json.load is called:

  • A call to json.load is made, with JSON to be unserialized.
    • Since json.load has been replaced with _load by the decorator, _load is called instead:
      • The original function (preserved, again, within the scope of the replacement function) is called to get a dict.
      • That dict is checked for a '__namespace' key:
        • If the namespace exists, then the dictionary of registered loadable classes (IsJSONSerializable._registeredLoadables) is checked for a match
        • If there is a match, then the found class' FromDict class-method is called, and the results returned
        • If the namespace doesn't have a registered class, then a RuntimeError is raised
      • If there is no '__namespace' key at all, then the value that was initially retrieved is returned instead

This process has added a few things to the IsJSONSerializable interface and implementation requirements:

  • A method for registering IsJSONSerializable classes;
  • Some class-level attributes in IsJSONSerializable for keeping track of registered IsJSONSerializable classes, keyed on their Python namespace;
  • A way to find that Python namespace.
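
Those three requirements can be sketched independently of IsJSONSerializable. Everything below is an assumption made for illustration: the registry is a plain module-level dict rather than a class attribute, registration is done with a hypothetical register_loadable class decorator, and the namespace is taken to be the module name plus the class name, which may not match what PythonNamespace ends up returning.

```python
import json

# Hypothetical stand-in for IsJSONSerializable._registeredLoadables
_registeredLoadables = {}


def namespace_of(cls):
    # Assumption: namespace is "module.ClassName"
    return '%s.%s' % (cls.__module__, cls.__name__)


def register_loadable(cls):
    """Class decorator that records cls under its namespace."""
    _registeredLoadables[namespace_of(cls)] = cls
    return cls


@register_loadable
class Tag(object):
    def __init__(self, name):
        self.name = name

    @classmethod
    def FromDict(cls, data=None, **properties):
        values = dict(data or {})
        values.update(properties)
        values.pop('__namespace', None)  # bookkeeping, not state
        return cls(**values)


def loads(text):
    """Mirrors the _loads flow: parse, look up the namespace, dispatch."""
    baseDict = json.loads(text)
    if not isinstance(baseDict, dict):
        return baseDict
    objNS = baseDict.get('__namespace')
    if not objNS:
        return baseDict
    objClass = _registeredLoadables.get(objNS)
    if objClass is None:
        raise RuntimeError('no class registered for %s' % objNS)
    return objClass.FromDict(baseDict)


tag = loads(json.dumps({'__namespace': namespace_of(Tag), 'name': 'div'}))
plain = loads('{"name": "div"}')
```

With this in place, tag comes back as a Tag instance while plain stays an ordinary dict, since it carries no '__namespace' key.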

The wrapjsonloads function is, apart from the signature of the replacement function itself and what original function it's expecting, identical to wrapjsonload. There's not much about it to comment on, but I'm going to show it in the interests of being thorough:

def wrapjsonloads( origfunc ):
    """
Replaces json.loads with a function that hands processing off to a subclass of 
IsJSONSerializable for unserialization of the JSON data into an instance of the 
class when applicable"""
    if origfunc != json.loads:
        raise RuntimeError( 'wrapjsonloads expects json.loads as the '
            'function to decorate, but was passed %s' % ( origfunc ) )
    if IsJSONSerializable._decoratedJSON.get( origfunc ):
        return IsJSONSerializable._decoratedJSON[ origfunc ]
    def _loads( *args, **kw ):
        baseDict = origfunc( *args, **kw )
        try:
            objNS = baseDict.get( '__namespace' )
        except AttributeError:
            objNS = None
        if objNS:
            if type( baseDict ) != dict:
                raise ValueError( 'Decorated override of json.loads '
                    'expected a dict value to convert to an instance of '
                    'IsJSONSerializable, but the supplied JSON evaluated '
                    'to "%s" (%s)' % ( 
                        baseDict, type( baseDict ).__name__
                    )
                )
            objClass = IsJSONSerializable._registeredLoadables.get( objNS )
            if objClass:
                return objClass.FromDict( baseDict )
            raise RuntimeError( 'decorated override of json.loads could '
                'not find a valid object-namespace (%s) to work with: %s' % 
                ( objNS, args[ 0 ] )
            )
        return baseDict
    IsJSONSerializable._decoratedJSON[ origfunc ] = _loads
    return _loads

This is a bit longer than I'd like already, and this feels like a reasonable break-point, so I'll pick up again in my next post with the IsJSONSerializable abstract class and a fairly detailed look at what needs to be done to build a derived class. I'm also planning on showing an example structure that I'll be able to show in action (if only through the command-line), but I think that'll be long enough to warrant its own post.
