Using TLA+ for data modeling

In the article “How Amazon Web Services uses Formal Methods”, I reported that in addition to using TLA+ to design concurrent and distributed systems, we had also used TLA+ to help design schemas for databases — a.k.a. “data modeling"[1].    Recently, a member of this group asked me for more more details about that second use of TLA+. I’m sending my reply to the whole group in case other people are interested.

Unfortunately I can’t give any concrete examples of applying TLA+ to data modeling, as the specs I wrote are all proprietary. Instead I’ve given more details on the general problem/motivation, and the solutions that have worked for me. The request for details means that this note is fairly long — and still not particularly detailed, sorry.

N.B. My perspective is that of a software engineer who occasionally needs to design a database schema. This is quite a common practice; software engineers do whatever data modeling is necessary to make progress on a project. There do exist other people who are full time "business/data analysts/modelers", but they seem to be quite rare relative to the demand for new database designs. So software engineers fill the gap.

Caveat: I have relatively little data about the use of TLA+ in this area. I’ve only done it a couple of times myself, and I hesitated to include it in the CACM article. I finally decided to include it because at minimum it’s an interesting direction that would benefit from further exploration, and I personally found it useful in practice. However, your mileage may vary.

What problem are we trying to solve?

"Designing a schema for a database” is a broad term for several different activities. The orthodox definitions are approximately as follows (my paraphrase):

    1. Logical modeling     : identifying the ‘business entities’ of interest, the relationships between them, and the 'business constraints' on the database.
    2. Physical modeling   : designing a particular implementation of the logical model, e.g. SQL definitions of tables and indexes, intended to make applications perform well when using the database.
    3. External modeling   : designing various views of the database designed for particular applications (reports, access controls etc).

I doubt that TLA+ can help much with #2 or #3.
I’ve only used TLA+ to help with #1, logical modeling, so that’s what I’ll describe here.

Weaknesses in conventional methods for logical modeling

The conventional way to do logical modeling is to use one of several variants of Entity-Relationship (E-R) modeling, or UML class/association diagrams.
In my experience those methods are hampered by the following issues:

a. (minor) E-R and UML are inefficient to use. The methods are based on diagrams with boxes and lines. Most tools involve a lot of scrolling, clicking and dragging with the mouse.

Issue (a) can be tedious but it’s a relatively minor.
The major issues are:

b. E-R and UML scale poorly to large or complex problems. Diagrams with dozens of boxes and hundreds of lines become almost incomprehensible. People print the diagrams in tiny fonts on huge/many sheets of paper, stick them up on a wall, and worry whenever they look at them.

c. E-R and UML are insufficiently abstract. Each method is strongly biased towards a particular flavor of implementation: Entity-Relationship modeling is biased towards relational databases; UML is biased towards object-oriented databases/software. Those biases are a burden at this layer of abstraction because they distract from the process of understanding and capturing the essence of the business domain. For example, when using the Entity-Relational method the designer is constantly forced to choose between modeling a business concept as an attribute or as a relationship. Each such choice has significant consequences, e.g. attributes cannot participate in relationships, and using too many relationships makes the diagram incomprehensible. At the time when they must make these choices, the designer usually has a woefully incomplete picture of the aggregate consequences, so it is easy to choose the wrong option. Switching to the other option later requires considerable re-work, so the designer is discouraged from doing many experiments/sketches.
         E-R and UML are so prevalent that some people believe that such implementation biases are inherent in logical modeling. That’s not true; such biases are avoidable, as is demonstrated by the existence of good methods that don’t force any such bias (see later).

d. E-R and UML are insufficiently expressive for defining operations on the data.   Both E-R and UML class/association diagrams focus on capturing the structure and constraints of an instantaneous snapshot of the database. I.e. In TLA+ terms they focus heavily on defining (part of) an invariant. For some reason, conventional methods for data modeling seem to barely acknowledge the existence of operations on the database. UML has some support for defining operations (activity diagrams, state machine diagrams), but those features seem to be rarely used in data modeling, and are inexpressive compared to TLA+. In conventional methods, the operations on the database are defined by other people and in other artifacts, e.g. transactions and queries are written by programmers and embedded in application code, scripts, and report generators.
       This focus on the invariant and the downplay of operations is a problem, because the database exists to support the operations and the operations are often complex and subtle. i.e. The operations drive the most demanding requirements on the database schema, so it helps to define the operations and schema together. More on this later.

e. E-R and UML are insufficiently expressive even for defining single-state invariants. They can capture some multiplicity constraints (‘crows feet’ on the diagrams) and uniqueness constraints, but more complex constraints can only be captured as textual notes, often in ambiguous informal language.

f. E-R and UML have poor tool support for finding errors. E.g. If the constraint (invariant) is complex then we need help to find errors in that constraint: is it accidentally too strong to be satisfiable? Is it too weak, so allowing data to be stored that doesn’t make sense to the business? We also need to find errors in the operations on the database.

g. With E-R and UML, the output of the modeling process (diagrams) is often hard for less-technical business owners to understand & validate. This is a huge problem, allowing many semantic errors to go undetected.

h. E-R and UML fails to help with the problem of evolving the schema as requirements change over the lifetime of the system.

How TLA+ can help with logical modeling

I found that while TLA+ is very far from a complete solution, it can help with a subset of the above problems to varying degrees:

1. The biggest benefit of applying TLA+ to data modeling is with (d); defining the operations along with the types, relationships and invariant. This is what I meant when I wrote[1].
      In particular, I’ve found that defining the operations and the schema together gives the following benefits:

           - Accelerates the analysis and design process by resolving disputes about ambiguous business concepts.   The main problem in the analysis and design phase is for a group of people (engineers and business owners/users) to understand the business problem/domain sufficiently well that they can create and agree on the definitions that become the database design. I’ve found that when the group focuses solely on the definitions of types, relationships and constraints then the debates about definitions can last a very long time. I found that a good way to advance or settle many of those debates is to enumerate the set of operations on the database and start defining them informally, e.g. with pseudo-code, or (better) “design by contract" pre-conditions and post-conditions. If more precision is required then the informal pre-conditions and post-conditions can be refined into TLA+ (as a TLA+ action that represents an 'operation' is simply a conjunction of the precondition and the postcondition of the operation).

           - Helps avoid errors. Whenever I’ve designed a schema without simultaneously designing the operations, I’ve found deficiencies in the schema later on — often after the system/product had been launched. Defining the operations doesn’t guarantee you’ll avoid errors, but it really helps.

   ii. TLA+ helps with (a) and (b) as text is better than diagrams for problems of significant size or complexity, particularly once we consider operations on the database. (Imagine something of the complexity of the Wildfire Challenge spec written in a graphical notation.)

   iii. TLA+ helps with (d), staying abstract without bias towards any particular implementation, because TLA+ allows the user to select the level of abstraction. However, designers need help with this (see below).

   iv. TLA+ helps with (e), expressing invariants: TLA+ can easily and precisely express arbitrary invariants, removing the need to use to informal ambiguous notes on diagrams.

   v. TLA+ helps with (f), by finding errors using TLC. Another possibility is Alloy: this is what Alloy was designed to do, via constraint solving. However, some of the tradeoffs that I wrote about in [2] apply in this domain, e.g. Alloy’s primary output (auto-arranged graph) can be overwhelming for models with significant complexity.

However, some problems remain:

Problems with using TLA+ for logical modeling

1. (minor) A lot depends on how you use the language. In this context, TLA+ is ‘just’ an expressive general-purpose formalism for first-order logic. So it’s perfectly possible to write ‘data model’ specs in TLA+ that have bias towards a particular implementation; e.g. relational or object-oriented. It’s also possible to write invariants and operations that are very hard to understand. So we need some principles on how to use TLA+ to model business domains.

2. (nice to have) Using TLA+ in this way adds a new problem: the task of translating the spec into a ‘physical model’, e.g. SQL schema. We’d like automation for this.   (E-R and UML already have automation in this area.)

3. (major) Using TLA+ in this way makes one of the above problems (g) significantly worse: when using TLA+, the output is even harder for less-technical business owners to understand & validate. Given that this problem was huge even without applying TLA+, is it wise to make it worse?

I'm still working on solving the caveats above, particularly #3, while still reaping the benefits of using TLA+. This is a background task for me, so progress is slow. What progress I’ve made is described below. It is far from solved.

Current experiments to solve the full problem

The method I’m currently trying is as follows:

   - For the schema and invariant part, I’m using a method called “fact-based conceptual modeling”[3]. This is an elegant general approach that is available in several flavors. The flavor I’m using has the unfortunate name of "Object Role Modeling v2”, a.k.a “ORMv2’. (The name is unfortunate as the acronym ‘ORM' is also used for a different software technique called 'Object Relational Mapping’ which is entirely unrelated to Object Role Modeling.) See [4] for details about ORMv2.
     Each flavor of fact-based conceptual modeling is supported by one or more modeling languages and associated tools. The modeling language I’m currently investigating is called Constellation Query Language[5], which is supported by an open source tool called ActiveFacts[6].   I’m also looking at the NORMA language and tool[11].

     Fact-based conceptual modeling has some useful properties:

        - Formal semantics; maps directly to first-order logic.

        - Can be read and written via a fairly readable subset of English prose, which really helps less technical people to understand the schema. Languages other than English are also supported.
       For an example, a fact-based model of a well-known non-trivial SQL database design (the AdventureWorks demo database) looks like: [7]

        - Not biased towards any implementation, e.g.: fact-based modeling has no notion of a special category of concepts called ‘attributes’, and the ‘Object’ part of ‘Object Role Modeling’ has nothing to do with object-oriented databases/programming.

        - More expressive than UML and many forms of E-R for defining relationships. E.g. Most flavors of E-R only support binary relationships between entities, but fact-base modeling supports relationships of any arity. This is often useful in practice, but an example would take too long to explain here. Fact-based modeling also supports "facts about facts", as each instance of a relationship can itself participate in further relationships. This is also useful in practice, but again an example would require so much explanation as to be a distraction here; instead see [4] and [5].

        - More expressive than E-R or UML for defining constraints/invariants. Still not fully general, but significantly better.

        - Tools exist for querying, manipulating and translating models. E.g. Models can be machine-translated to SQL schema definitions for OLTP databases, SQL definitions for Data Warehouse (Data Vault), and OO program code. I believe that machine translation to TLA+ or Alloy should not be too difficult.
          For an example, the AdventureWorks model[7] can be machine-translated by the ActiveFacts tool[6] into SQL definitions for an OLTP database [9], and SQL definitions for a Data Warehouse [10].

        - Has a diagram notation (also with formal semantics), for people who want that.
          For example, ORMv2 diagrams for AdventureWorks[8].    Tools exist to translate directly from diagram notation to English prose, and vice-versa to some extent.[7]

   - For defining operations on the data and finding errors, use TLA+. ORMv2 does not yet support operations, and the planned support is verbose and inexpressive compared to TLA+.
     To enabling use of both ORMv2 and TLA+, I plan to write the schema in English prose in ORMv2 and automatically convert it to TLA+. For the conversion, the most promising avenue for this is a set of open source tools called ActiveFacts[4].

    - Finding a good modeling method, language & tools has been relatively easy. The harder part seems to be find a vocabulary of good abstractions of real-world concepts, with which to build the models. E.g. What are good criteria for defining categories of things? How should we model things that appear to change category through their lifecycle, or depending on context? So far the most useful guidance I’ve found comes from a branch of formal philosophy called formal ontology. The best single reference I’ve found is a thesis called "Ontological Foundations for Structural Conceptual Models” [13]. It’s very long and a bit heavy going in places (I’m on my second reading), and parts of it should be challenged. But it contains some useful ideas and most importantly, contains formal definitions. It also proposes some significant changes to UML to fix some of the deficiencies that I described above, and tools exist for this new version of UML (called “OntoUML”)[14]. Perhaps the biggest downside of this work is that it’s not remotely complete as it only discusses the physical world, where as many databases are concerned about other aspects of reality as well, e.g. social structures such as companies, contracts etc. Some of those areas are covered by later papers published by the same author and his team, but those areas are still not as well developed.

Finally, a related aside: In my opinion, the ability to write unambiguous statements in prose is very helpful, and has broader applications to other kinds of specs.   When I write any spec, I start by writing informal prose. I then refine it to make it ‘sufficiently formal’ to have high confidence that I’ve eliminated hand-waving and that I understand it. Sometimes this means going as far as PlusCal or TLA+, sometimes not.   The point is that I think in English and then have to refine and translate those thoughts into more precise statements. I speculate that I would make faster progress if I habitually wrote more formal English prose earlier in the process — ideally when I formulated the original thought. This would require changing habits of mind to avoid the more ambiguous constructs in English, and stay within a sane subset. To that end I’ve looked at a few definitions for “controlled subsets” of English that can be mapped directly to first-order logic, e.g. Common Logic Controlled English[12]. For data modeling, CQL[4] is the best I’ve looked at. This is a new area and improvements are still being made/discovered. The many papers on the development and incremental improvements of ‘verbalizations’ in fact-based modeling[4] shows how hard it can be to find a unambiguous, readable and expressive subset of English.

I hope someone finds some of this helpful.

Cheers,
Chris

[1] Extract from page 9 of http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf

"Most recently we discovered that TLA+ is an excellent tool for data
   modeling, e.g. designing the schema for a relational or ‘No SQL’
   database. We used TLA+ to design a non-trivial schema, with semantic
   invariants over the data that were much richer than standard
   multiplicity constraints and foreign key constraints. We then added
   high-level specifications of some of the main operations on the data,
   which helped us to correct and refine the schema. This suggests that a
   data model might best be viewed as just another level of abstraction
   of the entire system."

[2] "Why Amazon Chose TLA+” : http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html

[3] Fact -based conceptual modeling: http://www.factbasedmodeling.org/home.aspx
          For more depth see [4]

[4] Object Role Modeling aka ‘ORMv2’:
    a. website with links to good books, papers, and free tools: http://orm.net
    b. video of a talk by the main designer of ORMv2: http://cs.helsinki.fi/group/edoc2011/videos/edoc2011_1.html

[5] Constellation Query Language (CQL): a readable but formal language for defining ORMv2 models and querying them: https://www.infinuendo.com/introduction-to-cql/

[6] The free ActiveFacts tool for parsing, querying and transforming CQL: https://github.com/cjheath/activefacts

[7] The standard ‘AdventureWorks' database design expressed as a fact-based conceptual model in CQL (ORMv2):
         https://github.com/infinuendo/AdventureWorks/blob/master/CQL/AdventureWorks.cql

[8] ORMv2 diagrams for the AdventureWorks database design:
         https://github.com/infinuendo/AdventureWorks/tree/master/ORM2/images

[9] A machine-generated SQL OLTP schema for AdventureWorks, generated from [5] by the ActiveFacts tool:
          https://github.com/infinuendo/AdventureWorks/blob/master/OLTP/AdventureWorks.sql

[10] A machine-generated SQL Data Warehouse (DataVault) schema for AdventureWorks, generated from [5] by the ActiveFacts tool:
           https://github.com/infinuendo/AdventureWorks/blob/master/DataVault/AdventureWorks.dv.sql
:
[11] The free NORMA tool for editing ORMv2 diagrams, editing or outputting models as English prose, and generating SQL
          https://en.wikipedia.org/wiki/NORMA_%28software_modeling_tool%29
    Also, in video [4b], at 34 minutes in, there is a demo of the NORMA tool.

[12] Common Logic Controlled English
        Intro http://www.jfsowa.com/clce/specs.htm   and an update http://www.jfsowa.com/clce/clce07.htm

[13] Ontological Foundations for Structural Conceptual Models http://www.inf.ufes.br/~gguizzardi/OFSCM.pdf

[14] OntoUML https://en.wikipedia.org/wiki/OntoUML