Lambda Normalization for Souffle Datalog

0
13

Step by step onward to lambda datalog.

It may seem backwards the order in which I’ve approached the features that lambda datalog should have. I’ve talked about adding

  1. Contexts
  2. Metavariables

to Souffle Datalog (an efficient datalog implemented in C ++), but the very first thing we should expect Lambda Datalog to have is lambda terms as objects.

Once you accept allowing trees as objects into datalog, there is nothing particularly weird about having lambda terms at first glance. Lambda terms are just another kind of AST.

However, there are a couple of intrinsic operations we want to perform.

  1. We want to be able to deal with alpha equivalence and beta reduction somehow.
  2. We want to pattern match on lambda terms in a way that respects variable scope.

In the context of Souffle, it seems desirable to always canonize our terms that are going to be stored in a relation as much as possible. Otherwise, Souffle will perform redundant work and search. This was the case for contexts, where we sorted the contexts, for metavariables, where we uniquely canonized our union find info, and it’s important here.

A standard way of making an alpha canonical term is to use de Bruijn indices. There are other options. One of particular interest is co de bruijn which can be seen in the hash cons modulo alpha paper.

A standard way of implementing beta normalization is to use the locally nameless technique. In this technique, whenever you traverse under a lambda binder, you replace it’s corresponding bound de Bruijn variables with a fresh free variable name. This avoids a lot of error prone stickiness related to accidental shadowing of variables and / or de bruijn raising and lowering.

Pattern matching lambda terms properly is a less discussed topic (at least on the blog circuit). It ends up not being so bad. The key insight I think is that you should consider pattern variables to be parametrized by the bound variables they may include like x y F(x). Then you pattern match as usual, you’ll perhaps find a F(x) = x x. Throw a lambda on F and it is easy enough to go through this term and find all x and replace them with the new bound variable. F = z. z z. Now F is a closed term. If you come across a bound variable that is not a parameter to F, fail.

I must say. Something seems off about all this. Close but no cigar.

code of post here

We want to be able to normalize lambda terms. It is possible to do so purely in Souffle, but it is very painful and you produce a lot of cruft. It is easier and better to use C ++ and make functions to do it

Generic Terms

A key idea to be able to implement lambda normalization in the context of souffle is making a generic term type. We need to know the layout and structure of terms in order to be able to write C ++ external functors that perform the appropriate.

Souffle ADTs are stored in the recordTable according to the scheme described here.

.pragma "libraries" "term"
.functor reabstract1(t : term, n : unsigned):term stateful // number because substr takes number only
.functor reabstract2(t : term, n : unsigned, n2: unsigned):term stateful // number because substr takes number only
.functor beta_norm(t:term):term stateful

 // maintain in alphabectical order
.type term = 
  App f : term, x : term // 0
| BVar n : unsigned // 1
| Lam b : term //2 
| Num n : number // 3
| Sym s : symbol // 4
| Var  // 5

Vector Arena

When performing normalization we will be producing intermediate terms. We need to represent them. It may be inefficient to store them in Souffle’s recordTable since we do not want to save these intermediate pieces for later.

I asked online for what people do to represent Algebraic Data Types. It is not so obvious what is the right thing to do in C ++

What construct to use to represent the tree types

  1. Use Tagged unions.
  2. Use std :: variant.
  3. Use subclasses of a base type and visitor pattern

What to use for memory management:

  1. std :: shared_ptr. A stock reference counter implementation. I’m not confident this plays nice with tagged unions.
  2. Manual memory management calling new and delete and stuff. Yikes.
  3. Vector Arena – Use vector indices as pointers and put the structures in

In most situations, I think std :: variant + std :: shared_ptr seem like a good default option. I opted to not do so. I went for tagged unions + vector arena. The reason I did so was that it was the approach in which I felt the most confident I knew what it would do and when it would cleanup memory. It actually makes a lot of sense. I’ve seen the technique before in Egg’s RecExpr implementation https://docs.rs/egg/latest/src/egg/language.rs.html#368-370. When I internally back into souffle, I can just drop the vector. and it will clean itself up. By using Tagged union, I could reuse RamDomain values ​​instead of having to think about std :: variants index type.

I am not sure that my choice is best, but it seems to be working.

struct term;

using recexpr = std::vector<term>;

union uterm

    recexpr::size_type body;                               // Lam
    std::pair<recexpr::size_type, recexpr::size_type> app; // App
    souffle::RamDomain ramdom;                             // Everything else.
;

#define APP 0
#define BVAR 1
#define LAM 2
#define NUM 3
#define SYM 4
#define VAR 5
#define ZFVAR 6

#define NTAGS 6

#define NIL 0

struct term

    souffle::RamDomain tag;
    uterm data;
;

// Helpers to make recexpr objects.
recexpr::size_type mk_bvar(recexpr &expr, souffle::RamDomain i)

    term t = BVAR, 0;
    t.data.ramdom = i;
    expr.push_back
    return expr.size() - 1;


recexpr::size_type mk_fvar(recexpr &expr, souffle::RamDomain i)

    term t = ZFVAR, 0;
    t.data.ramdom = i;
    expr.push_back
    return expr.size() - 1;


recexpr::size_type mk_lam(recexpr &expr, recexpr::size_type i)

    term t = LAM, 0;
    t.data.body = i;
    expr.push_back
    return expr.size() - 1;


recexpr::size_type mk_app(recexpr &expr, recexpr::size_type f, recexpr::size_type x)

    term t = APP, 0;
    t.data.app = std::make_pair(f, x);
    expr.push_back
    return expr.size() - 1;



// Could do better recording of already seen stuff in Map. More memoization? Isn't clear the search is worth the effort.
recexpr::size_type from_souffle(souffle::RecordTable *recordTable, souffle::RamDomain t, recexpr &expr)

    const souffle::RamDomain *myTuple0 = recordTable->unpack(t, 2);
    souffle::RamDomain tag = myTuple0[0];
    term myterm = tag, 0;
    switch (tag)
    
    case APP:
    
        const souffle::RamDomain *f_x = recordTable->unpack(myTuple0[1], 2);
        myterm.data.app.first = from_souffle(recordTable, f_x[0], expr);
        myterm.data.app.second = from_souffle(recordTable, f_x[1], expr);
        break;
    
    case LAM:
    
        myterm.data.body = from_souffle(recordTable, myTuple0[1], expr);
        break;
    
    case BVAR:
    case NUM:
    case SYM:
    case VAR:
    
        myterm.data.ramdom = myTuple0[1];
        break;
    
    default:
        assert(false && "Missed Case in from_souffle");
    
    expr.push_back(myterm); // maybe we could use emplace_back?
    return expr.size() - 1;


souffle::RamDomain to_souffle(souffle::RecordTable *recordTable, const recexpr &expr, recexpr::size_type i)

    term t = expr[i];
    switch (t.tag)
    
    case LAM:
    
        return mk_lam(recordTable, to_souffle(recordTable, expr, t.data.body));
    
    case APP:
    
        recexpr::size_type f = to_souffle(recordTable, expr, t.data.app.first);
        recexpr::size_type x = to_souffle(recordTable, expr, t.data.app.second);
        return mk_app(recordTable, f, x);
    
    case NUM:
    case SYM:
    case VAR:
    case BVAR:
    
        const souffle::RamDomain myTuple[2] = t.tag, t.data.ramdom;
        return recordTable->pack(myTuple, 2);
    
    default:
        assert(false && "Missing case in varopen");
    

Locally Nameless

I had made a previous prototype of pieces of Lambda Datalog in Ocaml, so I translated them somewhat faithfully.

de Bruijn indices label variables by the number of binders you have to cross to get the defining binder. It is an alpha canonical representation.

Locally nameless replaces de Bruijn indices with free variables whenever you traverse inside a binder to do anything.

There are two operations, opening and closing. Opening replaces a de bruijn variable and replaces it Closing takes a free variable and turns it into a de Bruijn variable. This combo is power and sufficient to implement beta normalization correctly.

For fresh variables I used a reference to a mutable counter.

/*

Varopen takes a level and an expression id and replaces bvar(level) with that expression id. 
This is performing substitution basically.
Traversing a lambda binder increases the level

let rec varopen k e (t : ln) =
  match t with
  | FVar _ -> t
  | FVarI _ -> t
  | BVar i -> if Int.(i = k) then e else t
  | App (f, y) -> App (varopen k e f, varopen k e y)
  | Lam body -> Lam  scope = varopen (k + 1) e body.scope 

let instantiate (e : ln) (t : scope) : ln = varopen 0 e t.scope
*/

recexpr::size_type varopen(recexpr &expr, souffle::RamDomain level, recexpr::size_type e, recexpr::size_type i)

    term t = expr[i];
    switch (t.tag)
    
    case BVAR:
    
        if (t.data.ramdom == level)
        
            return e;
        
        else
        
            return i;
        
    
    case LAM:
    
        return mk_lam(expr, varopen(expr, level + 1, e, t.data.body));
    
    case APP:
    
        recexpr::size_type f = varopen(expr, level, e, t.data.app.first);
        recexpr::size_type x = varopen(expr, level, e, t.data.app.second);
        return mk_app(expr, f, x);
    
    case NUM:
    case SYM:
    case VAR:
    case ZFVAR:
        return i;
    default:
        assert(false && "Missing case in varopen");
    


/*
varclose takes a free variable identifier and replaces it with bvar(level).
Traversing a lambda binder increases the level

let rec varclose k x (t : ln) : ln =
  match t with
  | FVar x' -> if String.(x = x') then BVar k else t
  | FVarI _ -> t
  | BVar _ -> t
  | App (f, y) -> App (varclose k x f, varclose k x y)
  | Lam body -> Lam  scope = varclose (k + 1) x body.scope 

let abstract (x : string) (t : ln) : scope =  scope = varclose 0 x t 
*/

recexpr::size_type varclose(recexpr &expr, souffle::RamDomain level, souffle::RamDomain fv, recexpr::size_type i)

    term t = expr[i];
    switch (t.tag)
    
    case ZFVAR:
    
        if (t.data.ramdom == fv)
        
            return mk_bvar(expr, level);
        
        else
        
            return i;
        
    
    case LAM:
    
        return mk_lam(expr, varclose(expr, level + 1, fv, t.data.body));
    
    case APP:
    
        recexpr::size_type f = varclose(expr, level, fv, t.data.app.first);
        recexpr::size_type x = varclose(expr, level, fv, t.data.app.second);
        return mk_app(expr, f, x);
    
    case NUM:
    case SYM:
    case VAR:
    case BVAR:
        return i;
    default:
        assert(false && "Missing case in varclose");
    



/*
beta_norm traverses and looks for shapes app(f,x). It normalizes f and x, if f is of the shape lam(b) then it performs
substitution [x/bvar(0)] b via varopen.
Whenever you go into a lambda, you need to replace that bvar(0) with a fresh variable. This is smart because substitution
messes with de bruijn indices.

let rec norm (t : ln) : ln =
  match t with
  | Lam t' ->
      let v = fresh () in
      let t'' = instantiate (FVarI v) t' in
      Lam (abstract' v (norm t''))
  | App (t, u) -> (
      let t' = norm t in
      let u' = norm u in
      match t' with Lam t'' -> norm (instantiate u' t'') | _ -> App (t', u'))
  | _ -> t
*/

recexpr::size_type beta_norm(recexpr &expr, souffle::RamDomain &fresh, recexpr::size_type i)

    term t = expr[i];
    switch (t.tag)
    
    case LAM:
    
        souffle::RamDomain curfresh = fresh;
        recexpr::size_type b = varopen(expr, 0, mk_fvar(expr, curfresh), t.data.body);
        fresh++;
        return mk_lam(expr, varclose(expr, 0, curfresh, beta_norm(expr, fresh, b)));
    
    case APP:
    
        recexpr::size_type f = beta_norm(expr, fresh, t.data.app.first);
        recexpr::size_type x = beta_norm(expr, fresh, t.data.app.second);
        if (expr[f].tag == LAM)
        
            recexpr::size_type fx = varopen(expr, 0, x, expr[f].data.body);
            return beta_norm(expr, fresh, fx);
        
        else
        
            return mk_app(expr, f, x);
        
    
    case NUM:
    case SYM:
    case VAR:
    case BVAR:
    case ZFVAR:
        return i;
    default:
        assert(false && "Missing case in beta_norm");
    


extern "C"


    souffle::RamDomain beta_norm(
        souffle::SymbolTable *symbolTable, souffle::RecordTable *recordTable, souffle::RamDomain arg)
    
        recexpr expr;
        recexpr::size_type t = from_souffle(recordTable, arg, expr);
        souffle::RamDomain fresh = 0;
        recexpr::size_type t1 = beta_norm(expr, fresh, t);
        return to_souffle(recordTable, expr, t1);
    

I was cowed by Higher Order Unification. Higher Order Pattern matching is a simpler problem and turns out to be fairly simple.

Unification is taking two terms with variables and figuring out if the variables can be instantiated with something to make the terms syntactically equal. Pattern matching is taking a term with variables and one term with no variables and figuring out if the. Pattern matching is one way unification or Unification is two way pattern matching. Unification is a bit harder to understand and describe.

There is also a plethora of unification / pattern matching modulo X, where X may be equalities, assocativity, commutativity, alpha or beta equivalence and others.

Unification of terms with lambdas in them is called higher order unification. It’s a natural thing to consider if you are trying to generalize some things that worked in first order logic (resolution theorem proving, prolog) to higher order logic. It turns out that unification modulo beta reduction is undecidable and a hard problem. Not something you obviously want persay to be a small step in your deduction system.

There is a restriction from full beta reduction that is decidable. This fragment is sometimes called Miller Unification.

Datalog has pattern matching where prolog has unification. So perhaps we need to only consider Miller pattern matching.

It’s actually surprisingly fairly simple.

Miller pattern matching is kind of the minimal of what is necessary for scopes to play nice with pattern variables.

Suppose I had the problem x. F = x. x. What is a good answer for F? x you say? What the hell is x? The thing is that F is actually implicitly bound and it’s implicitly bound outside of the lambda because we’re going to take the solution to F and use it somewhere else, again outside of the lambda. Explicitly we might write exists F, x. F = x. x. F therefore cannot refer to the variable. It would then have escaped its scope.

If instead we have the problem x. F x = x. x we can solve this as F = y. y. How did we do this? Well, really F(x) should be considered as a monolithic thing. Application of pattern variables should be considered distinct from regular application. The things that F is applied to are the bound quantities to which F may refer, smuggled in via a lambda in F itself.

So here is the basic procedure: Traverse down the pattern and term making sure things match as you go along. When you hit a F(x,y,z,...) in the pattern, it will match some term t in the matchee. If the only bound variables t contained are x,y,z,...then F is x y z t. If t contains a bound variable not in F‘s application list, the pattern match fails. Essentially, we need to “reabstract” the bound pieces in the term t and the variables F is applied to in the pattern tell us what pieces we’re allowed to “reabstract”.

We could also consider allowing F to be applied to non bound things like F(42). While we can “reabstract” 42, this is not a great idea because pattern matching becomes nondeterministic again. Consider F(42) = 42 + 42 + 42. This has many solutions. F = x. 42 + 42 + 42, F = x. x + 42 + 42, F = x. 42 + x + 42, F = x. 42 + x + x, F = x. x + x + x and so on. Maaaaybe you could say the first or last are “better” in some sense, but I’m skeptical.

So there you have it. In the Miller fragment, you annotate pattern variables explicitly with the bound variables you are allowing F to reabstract. This can be seen as only applying F to bound variables, but I prefer to thing of pattern application and regular application as distinct.

In the context of souffle, what we need to do is take the ordinary souffle pattern variables and reabstract them to be properly scoped (or fail). Any de Bruijn variable that is poorly scoped needs to be re asbtracted

In usage, reabstraction looks like this:

.decl named_term(name : symbol, t : term)
named_term("t1", $Lam($Lam($App($BVar(0), $BVar(1))))).

test_term("Testing out pattern use case") :- 
   named_term("t1", t), t = $Lam($Lam($App($BVar(0), F))), @reabstract1(F,1) != $Lam($BVar(0)). 

test_term("Testing out pattern use case. Should fail. Abstracting wrong variable. ") :- 
   named_term("t1", t), t = $Lam($Lam($App($BVar(0), F))), NONNIL(@reabstract1(F,0)). 

We are binding F to a poorly scope term. We reabstract it to produce a well scoped closed term. F2 = @reabstract1(F,1) is saying that F = app(F2, bvar(1)) module normalization beta. It only accepts the bruijn indices, so we can see that it is is in the Miller pattern matching fragment. I should probably be checking that the indices given to reabstract1 are distinct just for sanities sake.

The interface is not ideal. We may need a very serious sugaring phase.

Here’s the implementation. It’s very straightforward at the end of the day. We do not even need to translate into the recexpr representation. There is no normalization happening here.

extern "C" 
{
    // Returns nil if failure to reabstract
    // Check for that on souffle side.
    // beta_zero?

    // The args are speicifed by the site at which the pattern occurs.
    souffle::RamDomain reabstract(
        souffle::RecordTable *recordTable,
        souffle::RamDomain term,
        int level,
        const std::vector<souffle::RamDomain> &args)
    
        assert(recordTable && "NULL record table");
        assert(term != 0 && "reabstract of nil");

        const souffle::RamDomain *myTuple0 = recordTable->unpack(term, 2);
        souffle::RamDomain tag = myTuple0[0];
        assert(tag <= 3 && "Unexpected Tag in count_var");
        switch (tag)
        
        case APP:
        
            const souffle::RamDomain *f_x = recordTable->unpack(myTuple0[1], 2);
            const souffle::RamDomain f = reabstract(recordTable, f_x[0], level, args);
            // TODO early fail
            if (f == NIL)
            
                return NIL;
            
            const souffle::RamDomain x = reabstract(recordTable, f_x[1], level, args);
            if (x == NIL)
            
                return NIL;
            
            return mk_app(recordTable, f, x);
        
        case BVAR: // reabstract var in in known
        
            souffle::RamDomain i = myTuple0[1];
            if (i < level)
            
                return term;
            
            auto varind = std::find(args.begin(), args.end(), i - level); // search for bvar minus the extra lambdas we've traversed
            if (varind != args.end())                                     // Found bvar
            
                return mk_bvar(recordTable, level + static_cast<souffle::RamDomain>(varind - args.begin()));
            
            else
            
                // std::cout << "did not find bvar" << i << level << i - level;
                //  Maybe I should just throw an exception to be caught. Eh.
                return NIL;
            
        
        case LAM:
        
            // shift args? No. Reabstract is written in terms of the original position.
            // raise current level.
            const souffle::RamDomain b = reabstract(recordTable, myTuple0[1], level + 1, args);
            if (b == NIL)
            
                return NIL;
            
            return mk_lam(recordTable, b);
        
        case NUM:
        case SYM:
        case VAR:
            return term;
        default:
            assert(false && "Missed Case in reabstract");
        
    

    souffle::RamDomain reabstract1(
        souffle::SymbolTable *symbolTable, souffle::RecordTable *recordTable,
        souffle::RamDomain term,
        souffle::RamDomain x0)
    
        std::vector<souffle::RamDomain> argsx0;
        souffle::RamDomain res = reabstract(recordTable, term, 0, args);
        if (res != NIL)
        
            return mk_lam(recordTable, res);
        
        else
        
            return NIL;
        
    

    souffle::RamDomain reabstract2(
        souffle::SymbolTable *symbolTable, souffle::RecordTable *recordTable,
        souffle::RamDomain term,
        souffle::RamDomain x0, souffle::RamDomain x1)
    
        std::vector<souffle::RamDomain> argsx0, x1;
        souffle::RamDomain res = reabstract(recordTable, term, 0, args);
        if (res != NIL)
        
            return mk_lam(recordTable, mk_lam(recordTable, res));
        
        else
        
            return NIL;
        
    
}

Do not write patterns that can be beta reduced. Nothing goes into the database a beta reduced.

When I throw metavariables back in, do I need higher order unification? Yikes.

I’ve gone way too far without a concrete example. But you basically need all 3 capabilities to do anything. To traverse into lambdas, you want to generate metavariables.

Something is off about traversing these terms in C ++ only to then traverse them a different way in datalog. Hmm. Am I again not respecting the ethos of bottom up enough? Maybe slapping normalization on datalog isn’t the right thing to do at all.

Is what I really want is to directly expose varopen and varclose to datalog? I’ve been wrong every step of the way. I thought I wanted fresh vars, I wanted unification vars. I thought I wanted implication, I wanted contexts.

Do beta normalization and free var normalization in same pass.

It does seem possible to be more type safe from the souffle side. Require nil checking via giving back a wrapped answer. Have two related types closed_term and term to insist that you reabstract and avoid ruining scope.

It’s tempting to have preorder traversed freevars labels. But then how to have them in a context? Just accept the context as ordered? 🙁

Source

LEAVE A REPLY

Please enter your comment!
Please enter your name here