Streams


From: ag@em.net
Subject: The wonderful world of streams gets better :)
Date: September 15, 2000 06:13:13 CDT
To: eric@learn.motion.com, Chris.Teplovs@utoronto.ca, rmclean@oise.utoronto.ca, pjohnson@kf.oise.utoronto.ca, tquinn@oise.utoronto.ca, cgodbout@oise.utoronto.ca, jburtis@oise.utoronto.ca, crawford@goingware.com

Hi Everyone,

This is a warning that streams are about to change, and that I've been spelunking through the client and server to bring them in sync with the revised streams. The changes are radical in one sense, but in the more important sense of how they're actually used the changes are trivial.

NOTE: I have *not* yet checked in my source. That will happen either on Friday, or possibly over the weekend. I do have a wedding to attend (although I'm not hosting this one, thank goodness) with associated events Friday, Saturday and Sunday, so that could (will) get in the way. If anyone is interested I can very easily email the relevant files prior to my check in.

ADDITIONAL NOTE: My check in is likely to be somewhat painful. Although there are almost no *structural* changes associated with the reworking of ZStream, there are fairly copious line by line changes (most often associated with ditching the deprecated ReadLong/WriteUShort style of invocation and providing only ReadInt32/WriteUInt16 etc.)

I'd like to give you some background, explain the basis of the new architecture and why I decided it was necessary to change, and finally cover some details that might be a little confusing but that you don't need to fully understand initially. As a preemptive apology, I don't mean to teach my grandmother to suck eggs (a quaint old English expression) or to kick in an open door (which I just found out is the German [I think] equivalent). I'd just like to be sure that the basic model I'm operating from is clear.


To start with, forget that you know anything about how streams are currently architected in ZooLib, and think instead about some Platonic notion of a stream.

An input stream is an entity from which you can read bytes. At its simplest you can say 'here is a block of memory, read X bytes into it'. The input stream will do so, perhaps after blocking until the requested bytes arrive from whatever source issues them. An input stream has no knowable extent (aka size) and has no notion of a current position -- you simply read bytes from it. The only error that can occur is that at some point the input stream will likely determine that it can no longer provide bytes to you; its finite limit has been reached, and a request for X bytes cannot now and never will in the future be satisfiable. The read request thus has three parameters: the address of the buffer into which the bytes should be placed, a count of how many bytes we would like to read, and the address of a count that will be modified to indicate the number of bytes that were actually read. Usually the count actually read will equal the count requested. If the stream's limit is reached during a call to read X bytes, the count read will be less than X. Every subsequent call thereafter will read nothing at all, and the count read will be zero.
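To make that contract concrete, here is a minimal sketch of the three-parameter read. SketchInput and its Read method are names made up for illustration (they are not the ZooLib classes), and the "source" is just a block of memory so the limit is easy to hit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an input stream; not ZooLib's actual class.
class SketchInput {
public:
    SketchInput(const char* iData, std::size_t iSize)
    : fData(iData), fRemaining(iSize) {}

    // Read up to iCount bytes into iDest; *oCountRead reports how many arrived.
    void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) {
        std::size_t n = iCount < fRemaining ? iCount : fRemaining;
        std::memcpy(iDest, fData, n);
        fData += n;
        fRemaining -= n;
        if (oCountRead) *oCountRead = n;
    }

private:
    const char* fData;
    std::size_t fRemaining;
};

// Walks through the three phases: a full read, a short read at the
// limit, and then nothing at all on every later call.
inline void DemoReadContract() {
    SketchInput in("abcde", 5);
    char buf[4];
    std::size_t got = 0;
    in.Read(buf, 4, &got); assert(got == 4);  // request satisfied in full
    in.Read(buf, 4, &got); assert(got == 1);  // limit reached mid-request
    in.Read(buf, 4, &got); assert(got == 0);  // exhausted from here on
}
```

Note that the caller never asks for a size or a position; the only thing it learns about the limit is the short count.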

The complement to an input stream is (obviously) an output stream. This is an entity that bytes can be written to. Again we pass it the address of a buffer, but this time the buffer contains the bytes we want written. We also pass it a count, indicating how many bytes from the buffer should be written, and we pass the address of a count which will be modified to indicate how many bytes were successfully written. Just as with the input stream, the output stream has no 'position' and no size, and it also might (will) be finite. This is an important point; something that is finite obviously *must* have a size (curved space notwithstanding), but in the case of input and output streams the size is knowable only *in retrospect* -- when we hit the limit we know where it was, but we have *no* way of knowing that limit ahead of time (unless that information is conveyed by some other mechanism -- but any such mechanism is not part of the standard definition of input or output streams.)

For example, a network connection is composed of an input stream and an output stream -- you can see that both streams are finite, in that the connection could get closed at any time, but any concept of a 'size' makes no sense. There is no a priori limit on how much may be written to a network connection, nor of how much may be read from it.

ZooLib standardizes the API for input and output streams by defining the abstract base classes ZStreamInput and ZStreamOutput. These classes each have a single pure virtual method (ReadImp and WriteImp respectively), a handful of optionally overridable virtual methods (to allow for the provision of more efficient CopyFrom/CopyTo/Skip) and a suite of Read/Write methods for standard simple data types -- which handle byte swapping just as we've been used to in the past. Note that these abstract base classes do not maintain any state -- they define *no* instance variables.
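As a rough sketch of that layering: the base class below carries no state, exposes one pure virtual hook, and builds a typed read on top of it. Only the method names ReadImp and ReadInt32 come from this message; the class names, the signatures, the big-endian wire order and the bool return are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical interface; signatures are guesses, not the checked-in code.
class SketchStreamInput {
public:
    // Typed read built on ReadImp. Assumes big-endian bytes on the wire.
    // Returns false if the stream ran dry (error convention is assumed).
    bool ReadInt32(std::int32_t& oValue) {
        unsigned char b[4];
        std::size_t countRead = 0;
        this->ReadImp(b, 4, &countRead);
        if (countRead != 4) return false;
        std::uint32_t u = (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16)
                        | (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
        oValue = static_cast<std::int32_t>(u);
        return true;
    }

protected:
    SketchStreamInput() {}
    ~SketchStreamInput() {}
    // The single hook a concrete stream must provide.
    virtual void ReadImp(void* iDest, std::size_t iCount, std::size_t* oCountRead) = 0;
};

// A concrete stream supplies ReadImp and owns whatever state it needs.
class SketchStreamInput_Memory : public SketchStreamInput {
public:
    SketchStreamInput_Memory(const void* iData, std::size_t iSize)
    : fData(static_cast<const unsigned char*>(iData)), fRemaining(iSize) {}

protected:
    void ReadImp(void* iDest, std::size_t iCount, std::size_t* oCountRead) override {
        std::size_t n = iCount < fRemaining ? iCount : fRemaining;
        std::memcpy(iDest, fData, n);
        fData += n;
        fRemaining -= n;
        *oCountRead = n;
    }

private:
    const unsigned char* fData;
    std::size_t fRemaining;
};

inline void DemoTypedRead() {
    const unsigned char bytes[] = { 0x00, 0x00, 0x01, 0x00 };
    SketchStreamInput_Memory s(bytes, sizeof(bytes));
    std::int32_t v = 0;
    assert(s.ReadInt32(v) && v == 256);
}
```

All the byte-order handling lives in the base class, so a new kind of stream gets the typed reads for free once it implements the one hook.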


The final stream variant we have is what I've been calling an "extent" stream -- this is a stream representing a vector of bytes, which bytes can be read from as if they were the feed for an input stream, and the same bytes can be written to as if they were the sink for an output stream. A read or write occurs at a current position, and that position is moved along as bytes are read or written. Writing beyond the end of the vector *extends* the vector. Attempting to read beyond the end of the vector behaves exactly like a pure input stream which has exhausted its source. Basically, a ZStreamExtent is a file, or something which behaves like a file -- the vector could live purely in RAM, or be just a fixed subset of bytes in a file, or be an entity that's served across a network by a protocol of some description. The two important things to remember are (1) that it *behaves* how we'd expect a file to behave and (2) that we can treat it as an input stream or as an output stream (concurrently or alternately).

ZStreamExtent inherits from both ZStreamInput and from ZStreamOutput -- we are thus free to pass a ZStreamExtent to any method that expects a ZStreamInput or to any method that expects a ZStreamOutput. And because the semantics of those entities have been defined appropriately, the natural ZStreamExtent behavior when treated as an input stream or an output stream will look just like normal input or output stream behavior. This is neat, because code that needs to be fed a stream of data (say an image decompressor) can be passed a ZStreamInput that's fed by (say) a network connection, and it will work. But we can also pass it a ZStreamExtent object backed by a file on disk, and it will *still* work.
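A small sketch of that substitutability, using invented names (SketchInput, SketchOutput, SketchExtent, CountBytes) rather than the real classes: the consumer is written against the input interface only, and a RAM-backed extent stream is handed to it unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical interfaces for illustration only.
struct SketchInput {
    virtual void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) = 0;
protected:
    ~SketchInput() {}
};

struct SketchOutput {
    virtual void Write(const void* iSource, std::size_t iCount, std::size_t* oCountWritten) = 0;
protected:
    ~SketchOutput() {}
};

// An extent stream over a vector in RAM: it has a position and a size.
class SketchExtent : public SketchInput, public SketchOutput {
public:
    SketchExtent() : fPosition(0) {}

    void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) override {
        std::size_t avail = fPosition < fBytes.size() ? fBytes.size() - fPosition : 0;
        std::size_t n = iCount < avail ? iCount : avail;
        if (n) std::memcpy(iDest, &fBytes[fPosition], n);
        fPosition += n;
        if (oCountRead) *oCountRead = n;
    }

    void Write(const void* iSource, std::size_t iCount, std::size_t* oCountWritten) override {
        if (fPosition + iCount > fBytes.size())
            fBytes.resize(fPosition + iCount);  // writing past the end extends
        if (iCount) std::memcpy(&fBytes[fPosition], iSource, iCount);
        fPosition += iCount;
        if (oCountWritten) *oCountWritten = iCount;
    }

    void SetPosition(std::size_t iPosition) { fPosition = iPosition; }
    std::size_t GetSize() const { return fBytes.size(); }

private:
    std::vector<char> fBytes;
    std::size_t fPosition;
};

// Knows nothing about extents; it just drains whatever input it is given.
inline std::size_t CountBytes(SketchInput& iStream) {
    char buf[16];
    std::size_t total = 0;
    for (;;) {
        std::size_t got = 0;
        iStream.Read(buf, sizeof(buf), &got);
        if (got == 0) break;
        total += got;
    }
    return total;
}

inline void DemoExtentAsInput() {
    SketchExtent e;
    e.Write("hello world", 11, nullptr);
    e.SetPosition(0);
    assert(CountBytes(e) == 11);  // consumed purely through the input interface
}
```

Reading past the end of the vector looks, to CountBytes, exactly like an input stream that has run dry.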

This is not really news -- it's been the fundamental basis of operation in unix forever. And we had ZStream and ZStreamNP in ZooLib already, so what's changed?

Three things are different: 1) We now have the correct division between input-only, output-only and random-access/extent streams. The old-style ZStreamNP coupled an input stream and an output stream together *for no very good reason* -- writes to one might or might not have an effect on reads from the other, but there was no fundamental reason why they should or shouldn't, and no predictability in the matter. A ZStream (which derived from ZStreamNP) when treated as a ZStreamNP had an output stream that when written to *could* affect subsequent reads from its associated input stream (if the position was reset in between). On the other hand, a network endpoint had an input stream and an output stream, but those two directions were and are independent of one another (the TCP/IP spec fully allows for half-open connections where data can only flow in one direction.) The two streams of an endpoint are connected only in that they are born together -- they are treated independently thereafter.

2) Reads and writes can now *partially* succeed, and the extent of that partial success is reported. We no longer throw an exception unconditionally when something goes wrong. This is important mainly as a building block kind of thing -- it allows us to glue together multiple input (or output) streams without requiring complex protocols or pre-flighting. It also allows one to probe for the extent of available data rather than absolutely requiring that we know ahead of time how much to expect, which is useful for enabling error-recovery.

3) ZStreamInput, ZStreamOutput and ZStreamExtent are just interface definitions. The old ZStream and ZStreamNP, by contrast, had two jobs. The first was to implement a suite of methods to read/write data in various sizes and formats. The second was to own a streamwriter, which was responsible for doing the actual work after the stream had decoded the higher level requests.
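To illustrate point 2) in the list above, here is a sketch of gluing two input streams together using nothing but the reported partial count. All of the names are invented for the example; the idea is simply that a short read from the first stream is the signal to top up from the second.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical input interface, for illustration only.
struct SketchInput {
    virtual void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) = 0;
protected:
    ~SketchInput() {}
};

class SketchInput_Memory : public SketchInput {
public:
    SketchInput_Memory(const char* iData, std::size_t iSize)
    : fData(iData), fRemaining(iSize) {}

    void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) override {
        std::size_t n = iCount < fRemaining ? iCount : fRemaining;
        std::memcpy(iDest, fData, n);
        fData += n;
        fRemaining -= n;
        if (oCountRead) *oCountRead = n;
    }

private:
    const char* fData;
    std::size_t fRemaining;
};

// Presents two streams as one. No sizes are known or asked for up front.
class SketchInput_Concat : public SketchInput {
public:
    SketchInput_Concat(SketchInput& iFirst, SketchInput& iSecond)
    : fFirst(iFirst), fSecond(iSecond) {}

    void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) override {
        std::size_t got = 0;
        fFirst.Read(iDest, iCount, &got);
        if (got < iCount) {
            // Short read: the first stream hit its limit, so use the second.
            std::size_t more = 0;
            fSecond.Read(static_cast<char*>(iDest) + got, iCount - got, &more);
            got += more;
        }
        if (oCountRead) *oCountRead = got;
    }

private:
    SketchInput& fFirst;
    SketchInput& fSecond;
};

inline void DemoConcat() {
    SketchInput_Memory a("abc", 3), b("de", 2);
    SketchInput_Concat both(a, b);
    char buf[4];
    std::size_t got = 0;
    both.Read(buf, 4, &got);
    assert(got == 4 && std::memcmp(buf, "abcd", 4) == 0);  // spans the seam
    both.Read(buf, 4, &got);
    assert(got == 1 && buf[0] == 'e');                     // combined limit
    both.Read(buf, 4, &got);
    assert(got == 0);
}
```

With an exception-only error model the same glue would need either a size query or a try/catch around every read of the first stream.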

It's actually this third point that caused me to end up reworking the entire structure, and it's this part that is the radical change. To say it again differently: ZStreamWriter is no more.

The biggest problems with ZStreamWriter were twofold. Firstly, it was possible for an entity to get at and hold on to the ZStreamWriter embedded in a stream passed to it, and in doing so to mess around with it when it shouldn't, i.e., at some later date when the original owner wasn't expecting it. Secondly, ZStreamWriter required a whole bunch of mechanism that required a fair amount of extra code to be brought in -- the new streams are far more independent of the rest of ZooLib, requiring only a couple of files, and those only needed for some typedefs (the int32/int16 etc stuff) and the byte swapping API. In addition, the ZStreamWriter mechanism made it clumsy to express things like instantiating a temporary stream to filter some other stream, and the lack of a clear definition of input/output/extent semantics contributed to that. In ZStream_Misc you'll see definitions of streams that just filter other streams, and that are trivial to use and to implement. More will follow.
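For a flavor of what a filtering stream can look like, here is a sketch of an output filter that forwards writes to another stream and counts what got through. It is not taken from ZStream_Misc; SketchOutput, SketchOutput_String and SketchOutput_Count are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical output interface, for illustration only.
struct SketchOutput {
    virtual void Write(const void* iSource, std::size_t iCount, std::size_t* oCountWritten) = 0;
protected:
    ~SketchOutput() {}
};

// A simple sink that appends everything to a string.
class SketchOutput_String : public SketchOutput {
public:
    void Write(const void* iSource, std::size_t iCount, std::size_t* oCountWritten) override {
        fString.append(static_cast<const char*>(iSource), iCount);
        if (oCountWritten) *oCountWritten = iCount;
    }
    const std::string& Get() const { return fString; }

private:
    std::string fString;
};

// The filter: holds a reference to its sink, never owns it.
class SketchOutput_Count : public SketchOutput {
public:
    explicit SketchOutput_Count(SketchOutput& iSink) : fSink(iSink), fCount(0) {}

    void Write(const void* iSource, std::size_t iCount, std::size_t* oCountWritten) override {
        std::size_t written = 0;
        fSink.Write(iSource, iCount, &written);
        fCount += written;  // count only what the sink actually accepted
        if (oCountWritten) *oCountWritten = written;
    }

    std::size_t GetCount() const { return fCount; }

private:
    SketchOutput& fSink;
    std::size_t fCount;
};

inline void DemoCountingFilter() {
    SketchOutput_String sink;
    SketchOutput_Count counter(sink);  // temporary filter, lives on the stack
    counter.Write("hello", 5, nullptr);
    counter.Write(" world", 6, nullptr);
    assert(counter.GetCount() == 11);
    assert(sink.Get() == "hello world");
}
```

The filter is just another stream declared on the stack next to the thing it wraps, so there is nothing for a callee to grab and hold on to.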

The new ZStreamXX interfaces define a non-virtual destructor. They also declare the default constructor, copy constructor, destructor and assignment operator as protected methods. By doing so we enforce the rule that streams cannot be assigned to one another, nor can they be treated polymorphically for lifetime purposes. Thus, a stream should only be declared on the stack or as an instance variable of another object, or be passed by reference to another method. This gives us rigorous control over the lifetime of a stream. The downside to this is that we can't instantiate a stream, pass it to some other entity and have *that* entity take responsibility for the stream's lifetime (without allocating the stream in the heap, which I *do not* want to see anyone doing.) This capability is important, but comprises less than 1% of the work involving streams currently (now that the necessity to do so because of the very nature of ZStream/ZStreamWriter has been removed). To facilitate this very necessary capability we have a parallel set of classes called ZStreamerInput, ZStreamerOutput, ZStreamerExtent and ZStreamerIO. This is the confusing aspect I mentioned in the introduction. Perhaps they should have been called ZStreamInputOwner, ZStreamOutputOwner etc., but that got to be way too wordy in actual usage. (I'm not wedded to this nomenclature and could easily be persuaded to do something different.) Anyway, moving right along ... a ZStreamerInput is an object that is reference counted (that is, it is referred to by objects of type ZRef) and that has the pure virtual method GetStreamInput. ZStreamerOutput is the same, but it has the pure virtual method GetStreamOutput. ZStreamerExtent inherits from both ZStreamerInput and ZStreamerOutput (the *only* time I've needed to make use of a virtual base class in 14 years of C++ work) and thus any concrete ZStreamerExtent must implement GetStreamInput and GetStreamOutput (by virtue of the inheritance from ZStreamerInput and ZStreamerOutput) and of course it must implement the method GetStreamExtent. Finally, we have ZStreamerIO, which does not have a direct parallel in the ZStreamXXX hierarchy, but is nevertheless necessary simply because we do have an entity where there is a single lifetime, but two streams -- network connections.
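The split between a stream and its owning streamer might be sketched like this. Big caveats: std::shared_ptr stands in for ZRef here, every class name is invented, and the Extent/IO variants (and with them the virtual base class) are left out to keep it short. Only the method name GetStreamInput is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// The stream: lifetime operations are protected, destructor is non-virtual.
class SketchStreamInput {
public:
    virtual void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) = 0;
protected:
    SketchStreamInput() {}
    SketchStreamInput(const SketchStreamInput&) {}
    SketchStreamInput& operator=(const SketchStreamInput&) { return *this; }
    ~SketchStreamInput() {}  // cannot be deleted through the interface
};

class SketchStreamInput_Memory : public SketchStreamInput {
public:
    explicit SketchStreamInput_Memory(const std::string& iData) : fData(iData), fPosition(0) {}

    void Read(void* iDest, std::size_t iCount, std::size_t* oCountRead) override {
        std::size_t avail = fData.size() - fPosition;
        std::size_t n = iCount < avail ? iCount : avail;
        std::memcpy(iDest, fData.data() + fPosition, n);
        fPosition += n;
        if (oCountRead) *oCountRead = n;
    }

private:
    std::string fData;
    std::size_t fPosition;
};

// The streamer: a shareable owner that hands out a reference to its stream.
class SketchStreamerInput {
public:
    virtual ~SketchStreamerInput() {}
    virtual SketchStreamInput& GetStreamInput() = 0;
};

class SketchStreamerInput_Memory : public SketchStreamerInput {
public:
    explicit SketchStreamerInput_Memory(const std::string& iData) : fStream(iData) {}
    SketchStreamInput& GetStreamInput() override { return fStream; }

private:
    SketchStreamInput_Memory fStream;  // the stream is an instance variable
};

// Some other entity that takes over responsibility for the lifetime.
class SketchReader {
public:
    explicit SketchReader(std::shared_ptr<SketchStreamerInput> iStreamer)
    : fStreamer(std::move(iStreamer)) {}

    std::string ReadAll() {
        SketchStreamInput& in = fStreamer->GetStreamInput();
        std::string result;
        char buf[8];
        for (;;) {
            std::size_t got = 0;
            in.Read(buf, sizeof(buf), &got);
            if (got == 0) break;
            result.append(buf, got);
        }
        return result;
    }

private:
    std::shared_ptr<SketchStreamerInput> fStreamer;
};
```

The stream itself is never allocated on its own or deleted through its interface; only the streamer that contains it is shared, and the stream dies when the last reference to the streamer goes away.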

In ZStream_Misc you'll see that every ZStreamXX class has a parallel ZStreamerXX class. The actual implementation of the ZStreamer classes is almost always the same, but it would not be correct to push that implementation into the base (although a template might be usable). The actual work is done by the ZStream classes; the option to have a ZStream that can be passed off to some other entity and have a reference to it remain with that other entity is provided by the ZStreamer classes.

When I check in you'll see ZStreamer being used in only a couple of places -- to hand around an endpoint, and to wrap a ZStreamExtent_File inside a ZStreamExtent_PageBuffered (on MacOS).

Well, that about covers things. I could go on for longer, but am beginning to lose my thread of concentration. Do let me know if you have any questions, and definitely if you have any concerns or indeed suggestions.

Regards, A