
AI and the Law
Generative AI has made a lot of news recently, both good and bad, about its application to the legal domain. Like justice, generative AI is blind, but not in a good way. By this we mean that the AI is not aware of the real-world context in which it is generating its responses. So it occasionally generates what researchers call hallucinations — assertions of “facts” that are non-existent. This becomes a particular problem for the legal profession when the AI cites case law that doesn’t exist. We find this puzzling because everything else in the brief it generates is spot on. How can it be so consistently good, yet occasionally so bad? To understand this, we need to look at how the present state of generative AI evolved from natural, human intelligence, and what this means for its reliable application to the law.
Natural Intelligence: Thinking, Fast and Slow
In 2011, Nobel laureate Daniel Kahneman published a best-selling book, Thinking, Fast and Slow, in which he distinguished between two types of human reasoning. One is fast, intuitive, and confident; the other is slow, logical, evaluative, and nuanced in its conclusions. Although one is fast and the other slow, these are only accidental, rather than essential, aspects of the two behaviors. Even worse, for lack of a better way to name the distinction, he dubbed them ‘System 1’ and ‘System 2’. For our purposes, we will refer to them by a more essential difference: intuitive versus self-reflective. In the first method, you are not aware of your own thought processes. In the second, you are a conscious participant in your own thinking, weighing alternatives, evaluating outcomes.
Most of the time, most people employ the intuitive method to reach conclusions. We’re busy! Our brains are conditioned to a set of memes and narratives that we have absorbed, mostly through language, from our peers and the social tribes and cultures to which we belong. When presented with a new situation that requires a judgment, we unconsciously assimilate the situation to the meme or narrative in our mental store that most closely matches. It’s a judgment about similarity. What is this situation most like? Once matched, we can just as quickly enumerate conclusions by analogy from the controlling narrative. There is typically no conscious thought involved. The conclusions just occur to us. We would be hard-pressed to explain our reasoning. It’s just something that everybody knows!
The intuitive method is very efficient, and works quite well as long as you are mostly in the company of your culture or tribe. In those contexts, everybody does know. But it is less useful if you are outside of your tribal context. People with different memes and narratives won’t find your conclusions compelling. You will have to justify them, starting from evidence that all parties can agree to, and proceeding through reasoning steps that don’t depend on unshared narratives. Your initial intuitive conclusion is still your starting point, but now you must suspend that judgment and reflect on how you might have arrived at it more deliberately. You are forced to view your reasoning from someone else’s point of view. Certain professions require you to use the self-reflective method. You can’t publish a paper as a scientist by declaring “I just know it to be true!” You won’t get very far as a lawyer, standing before a judge, pleading “It’s just obvious, your honor!”
Formal Intelligence
This distinction between two kinds of natural intelligence will help us to better understand what’s been happening in artificial intelligence lately. Setting current manifestations aside for the moment, we could say, more generally, that artificial intelligence is some form of human-made artifact that we can use to reason like intelligent humans do naturally. The artifact is what makes the intelligence artificial, distinguishing it from the naturally occurring kind. We make this same distinction for the entire "built environment": houses, trains, tools, languages, governments, books — every artifact of culture that humans have introduced that did not exist naturally before there were people on the planet.
From this more general point of view, the first artificial intelligence to arrive on the scene was formal intelligence — systems of mathematics and logic that allow humans to reason symbolically, much better than the average human can naturally. By following the rules of these artificial systems, a select set of practitioners are able to produce reasoned outcomes that exceed the native abilities of any human by orders of magnitude. And by implementing these systems as computer programs, almost any human can now achieve these unnatural results with a few keystrokes.
It is no accident that these systems of formal intelligence were modeled on humans’ slow, self-reflective style of reasoning. Indeed, the whole point of mathematics and logic is to present reasoning in an external, objective, sharable format that anyone can use to justify conclusions, regardless of one’s prior point of view.
Generate and Test
By the time the term ‘artificial intelligence’ was coined (by John McCarthy in 1955), we had already entered the computer age, so some of the first uses of this new technology were to automate the formal methods previously confined to pencil and paper (or chalkboards). Because these formal systems were precisely defined and deterministic in nature, they were a natural fit for mechanization by digital computers. But the automation of formal, self-reflective reasoning underscored that this type of reasoning is primarily suited to justification or verification. Such systems could decide whether a given conjecture is true, but couldn’t generate the candidate conjectures themselves. So we had only succeeded in automating half of human intelligence. One still had to use intuitive, similarity-based reasoning to come up with candidate solutions to problems before submitting them to the automated verifiers.
One of the first attempts to formalize the hypothesis-creation side of the equation was the generate and test method. First, one formally defines the total problem space of a solution — the syntactic forms of possible solutions that the verification side is able to process. These could be types of sentences, or numerical expressions, or sequences of actions — whatever the checker side of the partnership can accept as input. Then one implements a mechanical generator that systematically explores the problem space, enumerating all of the possible candidate solutions and sending each in turn to the checker until one of them is verified. This formal problem-space concept allowed the hypothesis-creation process to be defined deterministically. We don’t have to account for human intuition or creativity. If the problem space is finite, a solution will eventually be found, if one exists at all.
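To make the pattern concrete, here is a minimal sketch in Python, using a toy factoring problem of our own choosing rather than any historical system. The checker is fast and deterministic; the generator simply walks the entire problem space.

    from itertools import product

    def check(candidate, target):
        # The "test" side: a fast, deterministic verifier.
        x, y = candidate
        return x * y == target

    def generate_and_test(target, limit):
        # The "generate" side: systematically enumerate the problem space.
        for candidate in product(range(2, limit), repeat=2):
            if check(candidate, target):
                return candidate   # first verified candidate wins
        return None                # space exhausted; no solution exists

    print(generate_and_test(391, 100))   # finds (17, 23), but only after scanning much of the space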
One of the great ironies of formalizing human intelligence as a partnership between generate and test is that it effectively reverses the speed attributes of Kahneman’s System 1 and System 2. The checkers are relatively fast, but the generators can be exponentially slower because many problem spaces are huge. It becomes impractical to explore even a small portion of the space in any feasible time frame. A classic example of this infeasibility is the Monkeys and Shakespeare thought experiment. A large collection of monkeys banging away randomly at typewriters will eventually produce the collected works of Shakespeare. But eventually can be a very long time. A recent mathematics paper calculates that the time necessary would exceed the lifetime of the universe [1].
Much of the research in formal AI over the last 70 years or so has been focused on speeding up the automated generation process — finding ways of exploring large search spaces “intelligently” by serving up candidates with higher probabilities of verification first, and learning from failed verifications to prune portions of the remaining space.
Large Language Models
A huge paradigm shift occurred on the generator side in the 2017-2018 timeframe, when neural network researchers turned their attention to generative pre-training of large language models. The models were taught the statistical distribution of words in a very large corpus of books, papers and articles by learning to predict which word was most likely to come next given an initial sequence of words. Learning these distributions enabled the models to generate plausible-sounding continuations of natural language inputs, very much like a human. Training such a model on the collected works of Shakespeare, for instance, would allow the model to finish Shakespeare’s sentences, as it were. Given an initial sequence of actual Shakespeare, the model could keep generating text in the style of Shakespeare. So much for the monkeys. The enormous search space of all English word combinations could now be explored by generating, at each step, the most probable next word of credible-sounding Shakespearean prose.
If the model were instructed, at generation time, to always choose the word with the highest probability of coming next, the result (after a long time and a lot of splicing) would approximate the actual collected works of Shakespeare. But these models can also be operated by injecting a degree of randomness into the selection of the next word, choosing not (necessarily) the most probable token, but another one with a sufficiently high (but not the highest) probability. This emulates human creativity. The model would then produce “Shakespeare-inspired” prose — novel works in the style of Shakespeare. This is, after all, how human authors learn creative writing: by reading and absorbing the cadences and nuances and plots and characters in the works of accomplished writers. They eventually produce their own “novel” prose by varying the patterns they have learned. This is the fast, intuitive kind of thinking at work. The budding author is not consciously aware of where their inspiration comes from. “It just occurs to me!”
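Here is a minimal Python sketch of the two generation regimes just described. The vocabulary and probabilities are invented for illustration; a real model chooses among tens of thousands of tokens at every step.

    import random

    # Hypothetical next-token distribution after the prompt "To be, or not to"
    next_token_probs = {"be": 0.90, "live": 0.05, "go": 0.03, "sleep": 0.02}

    def greedy(probs):
        # Always pick the single most probable token: faithful, "canonical" text.
        return max(probs, key=probs.get)

    def sample(probs, temperature=1.0):
        # Inject randomness: re-weight the probabilities, then draw from them.
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        return random.choices(list(probs), weights=weights, k=1)[0]

    print(greedy(next_token_probs))                    # always "be"
    print(sample(next_token_probs, temperature=1.5))   # usually "be", occasionally a creative variation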
Intuitive Polymaths
Just as artificial neural networks emulate the natural neural networks in human brains, LLMs emulate the fast, intuitive style of human thinking. Humans match novel situations to the narratives they have been “trained” on, unconsciously selecting the one that is most similar. From this learned pattern, they just as unconsciously generate their response as a continuation of the narrative. LLMs have a much more granular space of similarity, based on how likely fragments of words (called tokens) are to associate with each other in indefinitely long sequences of words from the texts they’ve been trained on. This allows any sequence of text to be represented as a numerical vector in a huge vector space of similarity. The memes and narratives the LLMs have learned through “reading” constitute locations in this vector space. At generation time, the input sequence of words (the prompt) is located as a starting point in the space — what amounts to the most similar controlling learned narrative.
The difference between human and machine intuition, however, is that human intuition is inherently parochial, because the learning context for any one individual is a small fragment of what’s out there to be learned. Academics and professionals, such as professors, doctors and lawyers, get exposed to a much deeper set of memes and narratives, but that depth makes them specialists rather than generalists. They are erudite in certain subject matters, but ordinary in everything else. True polymaths, such as Leonardo da Vinci, who study everything, and are thus able to connect and combine disparate narratives from different disciplines, are rare.
LLM intuition, on the other hand, is limited only by the size and variety of the training texts. Researchers discovered early on that as you increase the volume of training texts, LLM erudition increases not just linearly, but sometimes in great step functions. So the training sets kept getting larger with each new model. They are now approaching trillions of tokens (the first L in LLM). As a result, we now have artificial polymaths among us, able to hold forth on everything ever learned by any human that has been committed to writing and published on the Internet.
Hallucinations
We have nothing in human history to compare this to, so we have struggled to understand this kind of intelligence. It holds forth eloquently and confidently on any subject it is asked about, but occasionally asserts things so obviously wrong that any human can detect them. This is why researchers have dubbed these outputs hallucinations, rather than errors. They aren’t mistakes of inference; they are surprising statements of “fact” that aren’t facts. This is particularly important for legal applications. LLMs have been known to confidently cite case law that doesn’t exist in support of otherwise impeccably argued court briefs. Lawyers have been lulled into skipping the verification of these cases because the brief sounds so authoritative.
But when you consider how generative LLMs work, hallucinations are not surprising. The models do not memorize facts for later reference; they integrate the relations among the tokens that make up the facts into the overall similarity space. When asked to generate a brief, the controlling narrative is arguments plus supporting case law. When it gets to the case law citations, the model chooses the words most likely to come next, given the words it has already generated. There is no notion of true or false. The fictitious cases are not arbitrary; they are variations on actual cases, drawn from their constituent words, put together in an order that would make a compelling supporting case.
What we should marvel at is how often stringing the most probable words together does produce facts. It speaks to the overwhelming prevalence of actual facts and well-reasoned explanations in the natural language corpuses that LLMs are trained on. The fringe cases and the conspiracy theories, which are in the corpuses, are washed out by the higher frequency of facts and good arguments.
Chains of Thought
There are spot solutions to the hallucination problem, such as Retrieval Augmented Generation (RAG), where material retrieved from vetted sources is used to ground the generated “facts” and check them. But researchers know that there is a more general problem here. The intuitive nature of LLM generation, in which the model is essentially unaware of its own reasoning process, means it is running blind. This is fine for determining that ‘4’ is the most likely token to follow ‘2 + 2 = ’, but not so fine for determining what comes after ‘278 * 495 = ’. Humans solve the first problem intuitively and correctly, without any conscious thought. But we have to use self-aware, stepwise reasoning to solve the second kind of problem. It’s not simply a matter of which token comes next.
In trying to simulate this kind of stepwise reasoning in generative LLMs, researchers discovered that if you make the steps of the reasoning part of the original prompt — “first do this, then do this, finally give the answer” — you force the LLM to generate the intermediate steps, and then to use them as the context for generating each succeeding step. This amounts to asking for the generation of a reasoning sequence that ends in a conclusion, rather than just a conclusion. The LLM is still not aware of how its generation relates to the real world, but the Chain of Thought (CoT) prompt embeds real-world problem-solving constraints into the generation process. Now the words that come next conform to the progression of problem-specific narratives.
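For example, here is a hypothetical chain-of-thought prompt (our own wording, not a quotation from the research literature) for the multiplication problem above. The worked example forces the model to emit intermediate steps, each of which then becomes context for the next.

    # A hypothetical chain-of-thought prompt for long multiplication.
    cot_prompt = """\
    Question: What is 278 * 495?
    Work step by step.
    Step 1: 278 * 500 = 139000
    Step 2: 278 * 5 = 1390
    Step 3: 139000 - 1390 = 137610
    Answer: 137610

    Question: What is 312 * 204?
    Work step by step.
    """
    # Conditioned on the worked example, the model now generates its own
    # Step 1 / Step 2 / Step 3 lines before committing to an answer.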
This was a game changer, and allowed the LLMs to get into the deliberative reasoning business by simulating it via generation. The models became progressively more competent at solving mathematical and logical problems. It was still not generate and test, because there was still nothing internal to the model to test its final generations against. But by baking what amount to verification scenarios into the initial prompt via CoT, the reasoning itself becomes more explicit and can be examined, and improved, by the humans issuing the prompts.
Generate and (Generate a) Test
In June of 2024, when Anthropic released Claude 3.5 Sonnet, our testing showed that it was now competent enough in propositional logic to support the automatic translation of natural language texts to propositional formulas. It rarely got translations wrong, but occasional errors still crept in, owing to the inherent non-determinism of the generation process. When these mistakes were challenged in subsequent prompts, the model recognized the errors and knew exactly what to do to fix them. So it was able to recognize its own mistakes, but only after they were generated and sent back by humans. Why couldn’t this recognition and correction be done by the LLMs themselves, we wondered?
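To illustrate what we mean by such a translation, here is a toy clause and one plausible propositional rendering of it. The clause and variable names are invented for this example; they are not drawn from our production tests.

    # A toy rules clause and one plausible propositional translation.
    clause = ("An applicant is eligible if the applicant is a resident "
              "and is either employed or enrolled as a student.")

    # Propositional variables the LLM is asked to identify and define:
    #   E = eligible, R = resident, W = employed, S = enrolled as a student
    translation = "E <-> (R & (W | S))"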
In September of 2024, when OpenAI released the preview versions of its o1 models, we had our answer. It can be. The o1 models combine the initial chain of thought generation with subsequent chain of thought “verification” generations by multiple instances of the model operating on the first completion — essentially generating, then generating a test. Both sides of the process are still non-deterministic, and thus subject to random errors, but like the checkers of checkers in formal logic, the multiple, independent verifications quickly drive the probability of surviving errors toward zero.
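OpenAI has not published the internals of the o1 models, so the following Python sketch is only a schematic rendering of the generate-then-generate-a-test pattern as we have described it. The llm() function here is hypothetical, standing in for any call that returns a fresh completion.

    def generate_and_generate_a_test(problem, llm, n_checks=3):
        # First pass: an intuitive, chain-of-thought generation.
        candidate = llm(f"Reason step by step and solve: {problem}")
        # Then several independent "verification" generations over that completion.
        for _ in range(n_checks):
            verdict = llm(f"Problem: {problem}\n"
                          f"Proposed solution: {candidate}\n"
                          "Check each step. Reply VALID, or list the errors.")
            if "VALID" not in verdict:
                # Feed the criticism back in and regenerate.
                candidate = llm(f"Solve: {problem}\n"
                                f"Previous attempt: {candidate}\n"
                                f"Reviewer comments: {verdict}\n"
                                "Produce a corrected solution.")
        return candidate

Each verification pass is itself non-deterministic, but stacking several of them drives the probability of an undetected error down, which is the point made above.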
And true to Kahneman’s original distinction, this self-aware, deliberative reasoning is slower than the one-shot, intuitive generations of previous LLMs. The o1-mini model averages about 7 seconds of “thinking” to produce a translation of a rules clause, whereas the previous GPT-4o models return a result almost immediately. But the o1 translations are virtually always correct. We’ll take it.
Where LogicMaps Fit In
As we’ve noted before, many kinds of legal documents, such as motions, briefs, court findings, and judicial opinions, are inherently informal. They cannot escape their informal framing in natural language. So they must be evaluated and adjudicated in natural language as well. The reasoning involved cannot be formalized, so the conclusions will always fall short of proof, and the certainty that proof provides. Here is where the recently emerging intelligence of generative AI natively fits in. The LLMs are trained on the same informal arguments and narratives that human lawyers and judges are trained on, except that the LLMs are trained on vastly more of this material than any one lawyer or judge. So their work product is pretty good. Deciding how good is a matter of reading and evaluating the documents they produce. It really doesn’t matter whether the document was produced by a machine or a human. The document itself demonstrates the legal value (or lack of it).
Non-professionals are at a disadvantage here, because they can’t rigorously vet the documents themselves. When they sign the documents, or have them submitted by counsel on their behalf in court, they have to trust that their lawyers have done the vetting. So ordinary consumers would not be advised to use a chatBot as an attorney. But good legal professionals don’t sign or vouch for such documents without reading and evaluating them first. So the chatBot can be treated like a junior associate or a paralegal: it saves you time, but you are ultimately responsible for approving the work product.
But the legal profession is unique in the history of human affairs for having such a large collection of natural language rule documents — laws, regulations, contracts — which are amenable to formal interpretation. They can escape their informal framing in natural language and be adjudicated with deductive certainty using formal AI. Generative AI cannot be relied on to do the formal adjudication itself because of its inherent non-determinism; you would never have certainty. And formal AI, because of its inherent determinism, lacks the linguistic intuition to lift rules out of their natural language framing. Put the two together, and each can exercise its native strength.
The best way to see how LogicMaps fits into this picture is to imagine using just the chatBot to do what LogicMaps does. You could ask for a summary of the rules text. LLMs are very good at this. The summary is apt to make the scope of the rules a little clearer, but in summarizing, the chatBot will leave out details you might need to make actual decisions. And this is just the first step in a multi-step use case. You could ask whether your case scenario is covered by, or violates, the rules. But you must first describe your case in enough detail. What to include? What not to include? What’s relevant? When the chatBot returns an opinion, how do you verify it? Did what you included in, or left out of, the scenario affect the decision? If the answer is negative, what could you have done differently to achieve a positive decision? If you have any experience with chatBots, you will know that some additional upfront work is required to form a successful prompt that minimizes misunderstanding between you and the AI.
LogicMaps does all of this for you in one shot. You copy/paste in the rules text, and go. The maps that come back embody the answers to every question you could possibly ask about the consequences of the rules, including all of the basic facts that are relevant. In many ways, this is a niche application of formal methods that just happens to hit a very large sweet spot in legal analysis. The maps themselves (derived from binary decision diagrams) represent an optimal decision procedure for the kind of logic that the rules encode. If the rules contain a non-trivial number of ‘or’s, for instance, the total set of consequences implied by the rules can be many times larger than the number of basic propositions. But the size of the diagram, which represents them all, is proportional to just the number of propositions. The many individual combinations of propositions are encoded by the paths. So what you get is compact, exhaustive, and actionable, for many different use cases.
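To make the size claim concrete, here is a small Python sketch using a toy rule set of our own: three ‘or’ clauses over six propositions already admit 27 distinct satisfying combinations (3 to the power of the number of clauses), while a decision diagram for the same formula, under a suitable variable ordering, needs only on the order of two decision nodes per clause.

    from itertools import product

    # Toy rule set: (a1 or b1) and (a2 or b2) and (a3 or b3) -- six propositions.
    n = 3
    props = [f"a{i}" for i in range(1, n + 1)] + [f"b{i}" for i in range(1, n + 1)]

    def satisfies(assignment):
        return all(assignment[f"a{i}"] or assignment[f"b{i}"] for i in range(1, n + 1))

    models = [combo for combo in product([True, False], repeat=len(props))
              if satisfies(dict(zip(props, combo)))]

    print(len(props), "propositions,", len(models), "satisfying combinations")
    # -> 6 propositions, 27 satisfying combinations; with 10 clauses it would be
    #    59049 combinations, while the diagram grows only in proportion to the propositions.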
Best of all, all of the logical reasoning has been done on the formal methods side, not by the LLM, so it has the force of proof. This allows us to use the LLM in the safest possible context. There is no opportunity for hallucination because the LLM is being asked just to translate what’s already in a given natural language text into a formal language. No new facts. No derivations. Just say it again, Sam.
The certainty of the LogicMaps, of course, is qualified by the certainty of the translation. We can say, with certainty, that the LogicMaps correctly express the consequences of the translated rule. We cannot say with certainty that the translation correctly expresses the meaning of the original rule. In this regard, though, we are no worse off than the legal profession generally. When natural language rules are so logically ambiguous that multiple interpretations are possible, it falls to judges to choose one. At least we have a legally erudite polymath doing the choosing for us.
[1] Stephen Woodcock and Jay Falletta, “A numerical evaluation of the Finite Monkeys Theorem,” Franklin Open, Volume 9, December 2024, 100171.