TL;DR: A framework for reasoning about the relationship between present/future beliefs in the presence of self-modification, and about a theory of mind allowing AI to model human experience/utility. Currently just a braindump to be filled in with actual research.

To rephrase a bit from earlier:

An agent must process its sensory information/measurements (the “map”) to infer its actual state in the world (the “territory”). We often frame this as recovering hidden variables, but sensory data is fundamentally different in kind from the universe it betrays: it is not “about” anything until interpreted as such, and could stand independently as a mathematical object.

For a fixed agent `i`, consider two categories.

The first, `MAP_i`, has as objects sensory data, including internal states (“thoughts”) — everything the agent `i` reasons with.

The objects of the second, UNIV, can be interpreted as predictive models over MAP_i, but are more accurately interpreted as “things” (logical propositions, the possible physical states of the universe, etc). The interpretation as predictive models is recovered by fixing a forgetful functor (R_i : UNIV -> MAP_i) mapping physical states to sensor readings. The morphisms of UNIV are more complicated, but should (at least) form a framed bicategory, with vertical arrows as “logical” maps and horizontal arrows as “stochastic” maps. A starting point for a concrete representation is the category Stoch (see e.g. arXiv:1205.1488). The causal structure of UNIV lets objects “roughly factor” as products of smaller (i.e. lower-entropy; see arXiv:1106.1791) objects.
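To make the setting concrete, here is a minimal finite toy model (all names — `WORLD`, `SENSE`, `R_i`, `step` — are illustrative inventions, not from any library): objects of a finite Stoch-flavored UNIV are finite sets of world states, “stochastic” morphisms are row-stochastic matrices, and R_i forgets world structure down to sensor readings.

```python
# Toy finite stand-in for UNIV and the forgetful functor R_i.
# World states and agent i's sensory alphabet (both hypothetical).
WORLD = ["sunny", "rainy", "snowy"]
SENSE = ["bright", "dark"]

# R_i on objects: each world state determines a sensor reading.
R_i = {"sunny": "bright", "rainy": "dark", "snowy": "bright"}

def is_stochastic(matrix, tol=1e-9):
    """A Stoch-style morphism: every row is a probability distribution."""
    return all(abs(sum(row) - 1.0) < tol and all(p >= 0.0 for p in row)
               for row in matrix)

# A stochastic map WORLD -> WORLD, e.g. one step of weather dynamics.
step = [[0.7, 0.2, 0.1],
        [0.3, 0.6, 0.1],
        [0.2, 0.3, 0.5]]

assert is_stochastic(step)
```

Composition of such morphisms is ordinary matrix multiplication, which is how the Kleisli category of the finite-distribution monad (the finite fragment of Stoch) composes.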

The real challenge is inferring reality from observation. Interpreting Occam’s razor as a maximum entropy principle, we can freely generate the “best approximation to reality” from sensory data by taking the maximum-entropy model generating that data. Of course the functor (L_i : MAP_i -> UNIV) is uncomputable, but since it is characterized by a universal property it is still well-defined as a function, so it can be manipulated formally and computable approximations can be found.
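In the finite toy case the maximum-entropy construction is easy to exhibit: with no constraints beyond consistency with the reading, the max-entropy model is simply uniform over the preimage of the (illustrative) forgetful map `R_i`. The names below are again hypothetical stand-ins, not anyone’s actual implementation.

```python
from fractions import Fraction

WORLD = ["sunny", "rainy", "snowy"]
R_i = {"sunny": "bright", "rainy": "dark", "snowy": "bright"}

def L_i(reading):
    """Toy 'free' inference: the maximum-entropy distribution over WORLD
    consistent with a sensor reading. With no further constraints this is
    uniform over the preimage of R_i."""
    preimage = [w for w in WORLD if R_i[w] == reading]
    p = Fraction(1, len(preimage))
    return {w: (p if w in preimage else Fraction(0)) for w in WORLD}

# "bright" is consistent with sunny or snowy; max entropy splits the
# probability mass evenly between them.
model = L_i("bright")
```

Real constraints (expectations, causal structure) would turn this into a genuine max-entropy optimization rather than a uniform preimage, but the universal-property flavor — freely generating the least-committal model over a reading — is already visible.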

The details above remain to be worked out, but the payoff of this approach comes once L_i and R_i form a suitable adjoint pair:

They then yield a monad (R_i . L_i : MAP_i -> MAP_i), given by inferring the world and then restricting back to sensory data. Intuitively, this is the “self-reflection” monad. If we instead let the indices vary freely, we get (R_j . L_i : MAP_i -> MAP_j), the indexed “theory of mind” monad.
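Continuing the finite toy model (all names invented for illustration), the two composites can be written directly: infer a world model from a reading via the max-entropy `L`, then push it forward through either the same agent’s senses (`self_reflect`) or another agent’s (`theory_of_mind`). On readings the composite lands in distributions over readings, i.e. the Kleisli picture of the monad.

```python
WORLD = ["sunny", "rainy", "snowy"]
SENSE_i = {"sunny": "bright", "rainy": "dark", "snowy": "bright"}  # R_i
SENSE_j = {"sunny": "warm",  "rainy": "cold", "snowy": "cold"}     # R_j

def L(reading, R):
    """Max-entropy world model consistent with a reading: uniform on R's preimage."""
    pre = [w for w in WORLD if R[w] == reading]
    return {w: 1.0 / len(pre) for w in pre}

def pushforward(dist, R):
    """Restrict a distribution over WORLD to sensor readings via R."""
    out = {}
    for w, p in dist.items():
        out[R[w]] = out.get(R[w], 0.0) + p
    return out

def self_reflect(reading):
    """R_i . L_i on MAP_i: infer the world, then restrict to i's own senses."""
    return pushforward(L(reading, SENSE_i), SENSE_i)

def theory_of_mind(reading):
    """R_j . L_i : MAP_i -> MAP_j: infer from i's reading, view through j's senses."""
    return pushforward(L(reading, SENSE_i), SENSE_j)
```

Here `self_reflect("bright")` concentrates all mass back on `"bright"` — the agent’s inference is consistent with its own data — while `theory_of_mind("bright")` spreads mass over `"warm"` and `"cold"`, since i’s reading underdetermines what j would sense.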

Now consider the internal logic of this setting, sandwiched between two applications of the monad — i.e. the logic carried out by some agent. Then there are two operations, “reflection” and “reification”, that let us shift down or up a level in the monad stack (see Andrzej Filinski’s work, e.g. “Representing Monads”). Reflection lets us believe we experience what we believe we believe we experience (with side effects in the form of disciplined effort), and reification lets us believe we experience what we actually experience (with side effects in the form of introspective effort). Reflection is analogous to an AI self-modifying to believe/act the way it thinks it should believe/act, and reification is analogous to an AI making its beliefs explicit.

An interesting point here is that the monad is over MAP — you cannot talk about what is “true” abstractly, but only make concrete predictions of experience, implicitly using UNIV as part of the context. This may be a flaw needing correction, though it is more compelling to consider it a solution to inconsistency. By forcing each recursive step to occur between applications of the monad, monadic effects witness each step and maintain coinductive productivity. However, the classical notion of ‘denotation’ is lost in the general case (we cannot simply avoid Gödel’s theorem). This might not be so bad, though: many propose (e.g. J.-Y. Girard) that *interaction* is primary while denotation is incomplete. That is, we may still be able to reason about, and put bounds on, the long-term behavior of a self-modifying AI even if the process never converges.