In the paper Representing Cyclic Structures as Nested Datatypes, the authors dismiss quite quickly the idea of encoding cyclic lists by using a function space to represent the loop in favour of (type-level) de Bruijn levels. I wanted to see how far one can go with the more extensional approach whilst retaining good properties (canonical forms, productivity, etc.). It turns out that the answer is: "quite far, really".
The setup is pretty simple: we want to represent cyclic lists in a purely functional language. A naïve solution would be to turn to codata and use streams built using a fixpoint in Haskell. However, representing the cycle with a material artefact will allow us to analyze it (is it finite?), display it as a string by writing the fixpoint down explicitly, make serialization trivial or transform it by mapping a function, permuting the elements, or adding / removing some of them from the cycle.
Cyclic lists are represented using a mixed-variant datatype. On top of the expected CNil and Cons shared with usual lists, a CRec constructor is provided. It declares a pointer which may be used later on to represent that very node. It should be noted that CRec is Cons-like in order to guarantee that all cycles are productive.
data CList a = CNil | Cons a (CList a) | CRec a (CList a -> CList a)
This definition exploits the function space of the host language to handle the pointer declaration in a sort of Higher Order Abstract Syntax fashion. It makes giving a semantics for it in terms of streams quite easy: one just needs to feed the function stored in CRec with the list itself.
toStream :: CList a -> [a] toStream CNil = [] toStream (Cons x xs) = x : toStream xs toStream xs@(CRec x r) = x : toStream (r xs)
Now, this representation has multiple problems. They are pointed out in Ghani et al. to justify their favouring of a more intensional approach to cycle representation. Most of them relate to the fact that it does not put enough restrictions on the ways one may use the variable provided by the CRec constructor. As a consequence, the encoding lacks canonicity.
Given that the cycle is represented using a function space, it is hard to detect whether the variable has been used at all, or if it has been used whether it was in a proper way or not. So far nothing prevents a list from pretending to be cyclic by introducing a pointer it will never use:
finite = Cons 1 $ CRec 2 $ const CNil
Or introduce multiple pointers only one of which is meant to be declared whilst the other ones are CRec cells misused as Cons ones. This leads to the possibility of different representations of the same object:
fourTwos = Cons 4 $ CRec 2 id arghTwos = CRec 4 $ const $ CRec 2 id
Last but not least, these lists can also fail at being cyclic because a list transformer has been used improperly in the body of the cycle without being rejected by the typechecker:
nats = CRec 0 $ cmap (+1)
But there is more! Even if one is willing to excuse these imperfections, it is hard to define basic functions without unwinding the cycle. It is for instance impossible to map a function across a list without unfoding it. Indeed, using a CRec constructor in the last case would provide us with a CList b element but r expects a CList a one.
cmap :: (a -> b) -> CList a -> CList b cmap f CNil = CNil cmap f (Cons x xs) = Cons (f x) $ cmap f xs cmap f xs@(CRec x r) = Cons (f x) $ cmap f (r xs)
At that point, the picture we have drawn is pretty grim and we can understand why one may want to ditch the extensional approach altogether. But it turns out that there is a solution to all of our problems: the type system!
A few days before being confronted with this issue, I had read the very interesting account of the ST monad by Joseph Abrahamson explaining how the implementation used rank 2 polymorphism to avoid references escaping their environment. It was only natural to recycle that knowledge in our case to avoid all the API abuses evoked in the previous section. The new CList type constructor takes two arguments; the first one is the type of the elements it contains whilst the second one is a phantom type used to enforce that the constructor are only ever used in the way they were meant to.
data CList a b where CNil :: CList a Closed Cons :: a -> CList a b -> CList a b CRec :: a -> (forall b. CList a b -> CList a b) -> CList a Closed
CRec's argument has to be polymorphic in the phantom type which means that it may not use CRec or CNil both of which fix the phantom type to Closed. In effect, it has to be a finite number of Cons followed by the variable provided. In other words, an element of type CList a Closed describes precisely a potentially cyclic list [1]. Hence the following type alias:
data Closed = Closed type List a = CList a Closed
All the broken examples are now rightfully rejected by the type system which detects that we are trying to conflate types which are incompatible when we declare a pointer but forget to use it, when we declare multiple ones, etc. That is already a nice change but we are still facing a major challenge: writing functions such as map without unfolding the cycle. This is what Ghani et al. write on that topic:
"Although cycles are explicit in this representation, almost all functions manipulating cyclic structures will need to unwind the cycles (based on the intuition that Rec means fix). The reason is that, although, we can recognize cycles as structures of the form Rec f, the argument f is a function and we cannot analyze functions well".
This was absolutely true in the situation they were facing. But by exploiting the type system, we have managed to root out an awful lot of unwanted constructs. To the point that we now know for sure that there is at most one CRec constructor in a List a. This implies in particular that we can detect where the pointer declared by a CRec is used. Indeed, if xs is equal to CRec x f then we know for sure that f xs will contain a CRec and that it will be the head constructor of xs showing up where the pointer was used.
This means that we can actually build a cfold function which will not unwind the cycle unless we want it to. Let's start by writing down the most general type for it. The two first arguments corresponds to the usual Cons and CNil cases with the right phantom types sprinkled where necessary. The third one is the one dealing with cycles and it mimics really closely the type of CRec. The whole thing is implemented in terms of an auxiliary function.
cfold :: forall a (b :: * -> *). (forall ph. a -> b ph -> b ph) -> b Closed -> (a -> (forall ph. b ph -> b ph) -> b Closed) -> List a -> b Closed cfold c n r = goCRec
This auxiliary function unfolds the list argument applying the appropriate combinator to the constructor's arguments and the the induction hypothesis. As its name goCRec suggests, it traverses the CRec constructor before handing over the rest of the work to stopCRec.
goCRec :: forall ph. CList a ph -> b ph goCRec CNil = n goCRec (Cons x xs) = c x $ goCRec xs goCRec xs@(CRec x p) = r x $ stopCRec (p xs)
stopCRec is similarly structurally working its way through the list except that it posesses an induction hypothesis ih obtained earlier on by goCRec. When faced with a CRec, we know for sure that we have just followed the pointer back to where it was declared. As a consequence, we return the induction hypothesis rather than unfolding the cycle yet another time.
stopCRec :: CList a Closed -> forall ph. b ph -> b ph stopCRec (Cons x xs) ih = c x $ stopCRec xs ih stopCRec (CRec _ _) ih = ih
Now that we have this induction principle, it is really easy to define a cmap which outputs a list with a structure identical to the one we started with.
cmap :: forall a b. (a -> b) -> List a -> List b cmap f = cfold (Cons . f) CNil (CRec . f)
And indeed, cmap (+1) $ CRec 1 (Cons 2) evaluates down to CRec 2 (Cons 3) rather than the detestable Cons 2 (Cons 3 (Cons 2 (...))) the naïve solution used to produce.
It is also possible to instantiate this very generic scheme down to ones with simpler types. One such instance is the induction principle where the return type does not depend on the phantom type:
cfoldRec :: forall a b. (a -> b -> b) -> b -> (a -> (b -> b) -> b) -> List a -> b
An elegant use case of this cfoldRec is the definition of the Show instance for List a which allows us to display a finite string representing the cyclic structure in its entirety. When we encounter the pointer declaration, we output "rec X. " and keep showing the rest of the list but when the pointer is used, we simply output "X" and stop.
instance Show a => Show (List a) where show = cfoldRec (\ x -> (++) (show x ++ " : ")) "[]" (\ x ih -> "rec X. " ++ show x ++ " : " ++ ih "X")
And, as we expected, show $ CRec 1 (Cons 2) now reduces to the finite string "rec X. 1 : 2 : X". It is also possible to recover a fold-style function unwinding the cycle by tying the knot in the third argument: the induction hypothesis will just be the whole subcomputation itself!
cfold' :: forall a b. (a -> b -> b) -> b -> List a -> b cfold' c n = cfoldRec c n r where r :: a -> (b -> b) -> b r a ih = c a (ih $ r a ih)
This is precisely what we want to do when defining the semantics of these cyclic lists in terms of streams. Hence the really concise definition:
toStream :: List a -> [ a ] toStream = cfold' (:) []
There are still plenty of functions to implement on this representation some of which may raise interesting questions: czip will surely involve computing a least common multiple. And it will enable us to define an extensional equality on these cyclic structures. Exciting perspectives.
GADTs combined with higher rank polymorphism allowed us to bake in invariants which were sufficient for us to write more useful functions than the rather untyped approach led us to believe was possible. The type of cfold lets us decide at any time whether we want to unwind the cycle or not.
However, this approach is limited to cases in which it only makes sense to have one cycle. This is not true anymore when we consider datastructures which are not linear: a cyclic tree may have different cycle in different branches, some of which may be nested. It is then necessary to fall back to a setup managing syntax with binding.