Before we get to the spec, I'd like to outline the goals and assumptions for the algorithm(s) in program 1:
1. The conversions must be correct in the following sense: a. the semantics of the specification must not change in any way, b. no syntax errors may be introduced, c. no syntax errors may be resolved, d. no exported symbols (i.e. names) must change (although probably no names -- even local ones -- will be changed).
2. The conversion from ASCII to Unicode (A->U) must be "pretty" in the sense that it should maintain the original formatting intent -- non-syntax or semantics related -- as much as possible.
3. The A->U conversion will be rare, and so may require some manual re-beautification of the output. This mostly applies to the placement of left comments, as in Jaak Ristioja's example.
4. The U->A conversion need not be "pretty" (though it may be).
5. The algorithms should be stable, in the sense that applying conversions back and forth (without manual intervention) should eventually converge (to two character-by-character stable representations).
6. It is not the goal of the algorithms to preserve exact locations in errors reported by SANY/TLC. That will be addressed in changes to SANY made to support unicode.
7. The program should be as cheap as possible in terms of development effort, and should employ existing code as much as possible, thus possibly leading to:
8. The algorithms may be conservative, in the sense of imposing more constraints than necessary for complying with these goals for the sake of simplicity/cost.
As to the spec, I propose that it will be based on
this specification of alignment from the pretty-printer's code, with the addition that left comments may be moved to become right comments in the contracting A->U conversion (see point 3 above) in order to preserve alignment. Also, I think that the following invariant would arise naturally, that every token's beginning column may only decrease (or stay the same) under the A->U conversion, and may only increase (or stay the same) under the U->A conversion.
Ron