Less Coding, More Prompt Engineering!

Software developers have taken note of the efficacy of large language models such as ChatGPT at automating particular development tasks. Depending on the language in use and the task at hand, an LLM can provide a concise answer that circumvents the shortcomings of documentation that is either incomplete or tries to "boil the ocean."

LLMs work best at generating code when the problem is well-defined and the language is fairly static and consistent. For instance, I have had a great deal of success when asking for code to perform basic operations on Amazon DynamoDB with version 3 of the AWS SDK for Node.js. It was a bit more difficult to have it write infrastructure as code for AWS using the CDK framework in Python 3, but it still produced decent enough code that I at least knew where to start looking in the lengthy documentation. However, asking for native mobile code such as Kotlin for Android or Swift for iOS yields some rather incomprehensible and outdated results.

Besides ChatGPT, there is Google's PaLM API. Version 2 was introduced back in May at Google I/O 2023 in Mountain View, California. One of the compelling features supporting PaLM usage is Google's MakerSuite tool, which provides a graphical user interface for experimenting with prompt engineering and tweaking inference parameters such as model temperature.

An Attempt at a Timeless Challenge

Regular expressions, used for matching particular patterns within strings to aid in processes such as data extraction and validation (think of validating email, phone number, and date formats), are written in much the same way across many different programming languages. As such, they are highly adaptable to lots of situations, and any commercial-grade LLM should already have seen a large enough body of them in training to have formed an opinion on their structure.

However, there are many advanced regular expression concepts that may not come up in common usage, such as positive/negative lookahead/lookbehind, which assert that something does or does not come before or after something else without explicitly matching it, and backreferences, which reuse a previous match to assert whether it should or should not come up again. In addition, there are certain aspects of the data that need to be accounted for when writing a regex; for instance, IP addresses consist of four numbers between 0 and 255 only, and dates should only consist of valid months and valid days. Finally, there are cases where it is desirable for a regular expression to match a string case-insensitively, or to match all instances inside the string. These types of "flags" manifest themselves differently depending on the language: Perl expects them as single letters after the closing delimiter, whereas Python expects them to be provided as arguments outside of the regular expression string.
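As a concrete illustration of that last point, here is a minimal sketch of the Python side of the comparison (where Perl would write the flags inline, as in /the/gi):

```python
import re

text = "The theme of the day: THE weather."

# Perl would spell this /\bthe\b/gi -- flags as letters after the closing slash.
# Python instead passes flags as arguments outside the pattern string, and
# findall already returns all instances (Perl's "g" behavior).
matches = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)
print(matches)  # ['The', 'the', 'THE'] -- "theme" is excluded by the word boundaries
```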

This is where prompt engineering saves the day.

LLMs Like Plenty of Context

When interacting with a large language model, know that it has a certain amount of memory devoted to the context of the conversation. This is called its context length. The cool thing about Google MakerSuite is that it shows you exactly how much of the context length (measured in tokens) your prompt engineering has used up. This way, you can optimize exactly how many examples you can provide the model before injecting the user's actual prompt (and then the model's response) into the context. In some cases (particularly when generating long code), it is useful to spend about half of the context length on positive and negative examples so the model can best study and mimic the desired output format. That is, if the context length is 10,000 tokens (a token is roughly equivalent to a word), then you can write about 5,000 tokens' worth of positive and negative examples to seed the desired output.
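If you want a rough offline sense of that budget before pasting anything into MakerSuite, a crude character-based heuristic is enough (the 4-characters-per-token ratio and the 10,000-token limit are assumptions; the real count depends on the model's tokenizer, which MakerSuite reports exactly):

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # The actual count depends on the model's tokenizer.
    return max(1, len(text) // 4)

CONTEXT_LENGTH = 10_000               # assumed context limit for the model
EXAMPLE_BUDGET = CONTEXT_LENGTH // 2  # reserve about half for few-shot examples

examples = [
    "input: validate an email address\noutput: ...",
    "input: validate a US phone number\noutput: ...",
]
used = sum(rough_token_count(e) for e in examples)
print(f"{used} of {EXAMPLE_BUDGET} example tokens used")
```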

However, regular expressions are very short (especially compared to entire program functions), and especially if you have the expectation that the model should produce a viable regular expression with minimal rounds of further discussion, then you could spend closer to 100% of your context length feeding examples to the model. This is where you can really go to town implementing your own expectation of exactly how a regular expression gets formed given your expectation of programming language and degree of care toward advanced concepts.

Here is a toy example of prompt engineering Google PaLM 2 to generate regular expressions. Not even 5% of the context length is used. Some of the test results are impressive, but others are laughable, particularly the "name" result, which assumes I'm named "Denison" or "jenison". These can change drastically depending on the model temperature (here it is 1, to promote creativity and fluctuation in the answers). The quality would likely improve if I were to be more specific in the original prompt (zero-shot prompt engineering) or if I were to provide worked examples along with it (few-shot prompt engineering).

Why should you care so much about prompt engineering? As seen above, the quality can be inferior if not enough detail is provided. Also, chances are the system prompt behind ChatGPT encourages it to be far too permissive, affirming things that are just patently false. Here is an example, given a fairly short and simple prompt, where it fails to comprehend a pattern that exists right in front of it:

This example, from ChatGPT 3.5, fails quite spectacularly at providing the correct answer to the prompt. For what it's worth, a slightly more complex prompt (but the same regex and number) on ChatGPT 4 yielded the correct answer.

Two Heads Are Better Than One

Pulling a page from the playbook of generative AI strategies that were popular last decade, it is worth considering an approach where one instance of an LLM generates a number of positive and negative test cases in conjunction with the user's initial prompt. Of course, this too requires careful prompt engineering so that the LLM thinks of a variety of relevant and unique cases that will fully exercise the regular expression. Then, once the "generator" model instance produces the regular expression, it can be tested against the cases produced by the "discriminator" model. Enter LangChain, a library that allows for advanced interactions within and between LLMs.

Unfortunately, I have not seen LangChain allow the same level of sophistication in interacting with PaLM as you can get with OpenAI's ChatGPT or with models you may have fine-tuned locally on top of LLaMA 2 and the like. However, as a coder, it is good enough for me that LangChain at least offers an abstraction of the PaLM 2 API so that I can instantiate models with different prompts and different temperatures. To this end, I can run the "discriminator" once to generate the test cases, then run the "generator" to come up with the regular expression. Depending on what happens, I can do several things. If all the test cases pass, I can output the regular expression to the user for use in their software. But if it fails, I can:

  • Increase the temperature, try again, and hope the model yields a satisfactory result. (Sometimes, with enough tries, it develops something rather sophisticated.)
  • Try to find the exact location in the regex and in the test case where the regex failed, prompt the model, and ask it to try again. If it yields the same result, then I can adjust the temperature upward.
  • If the temperature goes above 1 (where lower-likelihood tokens begin to be sampled noticeably more often in its predictions of what should be written next), stop and assert that no regex could be developed, then show what the generator thought would work and which test cases thwarted it. It could be that a test case is illegitimate!
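That escalation strategy can be sketched as a simple loop. Here `generate_regex` is a hypothetical stand-in for whatever LLM wrapper produces the pattern (LangChain's PaLM abstraction in my case), and the cases come from the discriminator:

```python
import re

def run_tests(pattern: str, cases: list[tuple[str, bool]]) -> bool:
    """True if the regex matches exactly the cases it is supposed to match."""
    compiled = re.compile(pattern)
    return all(bool(compiled.fullmatch(s)) == expected for s, expected in cases)

def find_viable_regex(generate_regex, cases, temperature=0.7, step=0.1, max_temp=1.0):
    """Raise the temperature after each failure; give up past max_temp."""
    while temperature <= max_temp + 1e-9:  # tolerate float accumulation
        pattern = generate_regex(temperature)
        if run_tests(pattern, cases):
            return pattern
        temperature += step
    return None  # no regex could be developed; surface the failing cases to the user
```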
# named `request` rather than `input` to avoid shadowing Python's built-in
request = 'find a number. The number may or may not have a sign, a decimal, and an exponent.'

prompt1 = f"""As an expert in parsing regular expressions, you are familiar with many constructs such as zero-length assertions, positive lookahead, and capturing groups. I am trying to write a regular expression to {request}. I would like you to come up with as many detailed and comprehensive test cases as you can (at least 10 would be preferred, but any more are welcome), providing test strings that follow each nuance of the described format, in order to exercise the regular expression I have written to ensure it is suitable for all values that may be written in the described format. Provide each test string on its own line."""

# note: literal braces must be doubled ({{ }}) inside an f-string
prompt2 = f"""input: I need to make a regular expression to capture the third word in the following string: /one-two-three-four/ Each word is separated by hyphens.
output: r".*?-.*?-(.*?)-"
input: I would like a regular expression that can detect all instances of `the` in the input string.
output: r"the"g
input: Make me a regex that can detect if the word "is" is preceded by the word "the".
output: r"(?<=the )is"
input: Design a regex that validates a date in MM/DD/YYYY format. The month and day can be one or two digits, and the year can be two or four digits.
output: r"(\d)(\d)?\/(\d)(\d)?\/(\d)(\d)(\d)?(\d)?"
input: I need a regex that validates there are exactly 12 digits in a number.
output: r"^\d{{12}}$"
input: I need a regular expression to validate a UUID.
output: r"^[0-9a-f]{{8}}-[0-9a-f]{{4}}-[0-5][0-9a-f]{{3}}-[089ab][0-9a-f]{{3}}-[0-9a-f]{{12}}$"i

input: I need a regular expression that can {request}
output:"""

Here is a portion of the code inside the for loop that checks the regex against all possible test cases, and also the output from the code running in the terminal. This looks like a perfect regex to capture a number written in scientific notation.
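That check can be sketched roughly as follows (the regex and the test cases shown here are my own stand-ins for what the generator and discriminator produced):

```python
import re

# stand-ins for the generator's regex and the discriminator's test cases
regex = r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?"
test_cases = ["3.14", "-2e10", "+0.5E-3", "42"]

compiled = re.compile(regex)
failures = [case for case in test_cases if not compiled.fullmatch(case)]
if failures:
    print(f"regex failed on: {failures}")
else:
    print("all test cases passed")
```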

Nevertheless, it would be interesting to expand upon this "generative adversarial network" (GAN) of sorts to see whether it could someday be generalized to develop useful unit test cases for entire functions in a particular language: functions that are substantial and involve a fair degree of bespoke business logic, not just something you could glean from a college-level textbook or Stack Overflow with enough searching.

Are We Being Too Narrow-Minded?

One interesting thing to do is to take what you're doing and turn it on its head. If we are using LLMs to crank out regexes to perform pattern matching for the sake of data extraction or validation, then couldn't we just use the LLM to perform those operations itself? Imagine the complexity of the data we could scrape for if we were not bound by literal constructs, but could instead parse or validate the presence of a concept or abstract structure. We could, for instance, use LLMs to grade elementary English homework where students are asked to write a story with a beginning, a middle, and an end. (Of course, there could be instances where the LLM would be grading its own work!) Any concept with a structure that can be parsed into a finite number of basic components could be validated by an LLM in much the same way as a regex validates the presence of specific characters.
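As a sketch of what such a "conceptual regex" might look like, here is one way to assemble the validation prompt (the wording, the `build_structure_prompt` helper, and the PASS/FAIL convention are entirely my own invention, not any particular API):

```python
def build_structure_prompt(text: str, components: list[str]) -> str:
    """Ask an LLM to validate the presence of abstract components,
    much as a regex validates the presence of specific characters."""
    checklist = "\n".join(f"- {c}" for c in components)
    return (
        "Does the following story contain each of these components?\n"
        f"{checklist}\n"
        "Answer PASS if all are present, otherwise FAIL and name the missing ones.\n\n"
        f"Story:\n{text}"
    )

prompt = build_structure_prompt(
    "Once upon a time, a fox found a grape, and at last it gave up.",
    ["a beginning", "a middle", "an end"],
)
```

The resulting string would then be sent to the model of your choice, with the PASS/FAIL token playing the role of the regex engine's match/no-match result.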

I can imagine this working well in certain scenarios, particularly where the LLM can summarize what is going on. That is a strong suit of LLMs. However, I am skeptical that they have the power to parse lengthy and complex data structures, especially in a recursive manner. That may make them harder to use (for now) for tasks such as validating long JSON documents against a schema until we can train them how to smartly purge stale or irrelevant data from their context, while retaining the important bits that help them understand what to look for.

The latter example comes to us from episodes of Mister Rogers' Neighborhood going back to 1968, conveying the same meaning as the popular kids' song with much more sophisticated language.

