Ah, the simple days of late 2022. We really only had one Large Language Model (LLM) tool to choose from, and that was ChatGPT. Behind the scenes it was powered by GPT-3.5, and the world marveled at all the cool new things it could do.
In early 2023, OpenAI released GPT-4 with substantially more accuracy and clarity. You could use it with ChatGPT, but only with a paid subscription and with some usage limitations. Microsoft, by virtue of its partnership with OpenAI, then incorporated GPT-4 into Bing, offering more people a way to use GPT-4 in their Edge browser and elsewhere.
In the ensuing months, other companies announced and released their own competing LLMs and tools to the public: Google Bard, Anthropic's Claude, Perplexity, and many others. Google has just announced another new suite of tools it is calling Gemini.
So which LLM is best for I&R work? As with many things in life, the answer is "It depends." Working on fairly large sets of I&R data requires an LLM that can process thousands of records relatively quickly, with good accuracy and at reasonable cost.
First, to work on large amounts of information, like your resource database or logs of past interactions with your help seekers, an Application Programming Interface (API) is needed. An API makes all the skills of the LLM available programmatically, across vast numbers of records. Over the past few weeks I've been doing just that with resource databases provided to me by 2-1-1s, often processing thousands of records every hour through APIs as I fine-tune my work with LLMs and I&R data.
As of today only a few of the LLMs offer access via an API, including GPT-3.5 and GPT-4, and recently Google's Gemini Pro. But others are sure to follow in the near future.
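To make the API idea concrete, here is a minimal sketch of what batch-processing records through an LLM looks like. The `call_llm` function is a hypothetical stand-in for whatever client library your provider offers (it just uppercases text here so the example runs on its own); the record fields are made up for illustration.

```python
# Hypothetical sketch: sending each record in a resource database
# through an LLM API call, one prompt per record.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would send the prompt to the
    # provider's API endpoint and return the model's text response.
    return prompt.upper()  # dummy transformation so the sketch is runnable

def rewrite_descriptions(records):
    """Send each record's description to the LLM and collect the results."""
    results = []
    for record in records:
        prompt = ("Rewrite this program description per our style guide:\n"
                  + record["description"])
        results.append({"id": record["id"], "rewritten": call_llm(prompt)})
    return results

sample = [{"id": 1, "description": "food pantry open tuesdays"}]
out = rewrite_descriptions(sample)
```

The point is simply that a loop like this can churn through thousands of records per hour, which no chat window can do.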
Next, the LLM has to produce useful, consistent and high-quality results. We can tolerate the occasional odd output, but generally we need it to be reliable. I am measuring an LLM's performance on a task (for example, rewriting a Program Description to better comply with a Style Guide) by looking for an improvement in 95% of cases, which is 19 out of every 20 records.
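That 95% bar is easy to check mechanically. This is a small sketch of the idea; the True/False flags are illustrative sample data, not real evaluation results.

```python
# Sketch of the 95% improvement threshold: given a list of per-record
# outcomes (True = the LLM improved the record), did enough improve?

def meets_threshold(outcomes, threshold=0.95):
    """Return True if at least `threshold` of the outcomes are improvements."""
    return sum(outcomes) / len(outcomes) >= threshold

# 19 of 20 records improved -> passes the bar
passed = meets_threshold([True] * 19 + [False])
# 18 of 20 records improved -> falls short
failed = meets_threshold([True] * 18 + [False] * 2)
```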
But it's a balance between quality, speed and cost. GPT-4 is better than GPT-3.5, but it's also roughly 10 times more expensive and 3 times slower, based on my testing. Like most things, you can get any one or two of "good, fast and cheap" but rarely all three.
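To see what those ratios mean in practice, here is a back-of-envelope comparison. The baseline cost and speed figures are made-up placeholders, not published prices; only the 10x cost and 3x latency multipliers come from my testing above.

```python
# Hypothetical back-of-envelope: what 10x cost and 3x latency mean
# for a batch of 5,000 records. Baseline numbers are placeholders.

BASELINE_COST_PER_1K_RECORDS = 1.00   # dollars, hypothetical
BASELINE_SECONDS_PER_RECORD = 2.0     # hypothetical

records = 5000

cheap_cost = BASELINE_COST_PER_1K_RECORDS * records / 1000
cheap_hours = BASELINE_SECONDS_PER_RECORD * records / 3600

strong_cost = cheap_cost * 10   # ~10x more expensive
strong_hours = cheap_hours * 3  # ~3x slower

print(f"GPT-3.5-class: ${cheap_cost:.2f}, {cheap_hours:.1f} hours")
print(f"GPT-4-class:   ${strong_cost:.2f}, {strong_hours:.1f} hours")
```

Even with invented baseline numbers, the shape of the tradeoff is clear: the stronger model's advantage has to be worth an order of magnitude more money and several times the wait.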
I am also focusing on ensuring that the results are safe and reliable. Our society is making a big leap into a world where AI is going to do more of our work. We have to trust AND verify.
My approach is to test the various LLMs for each different task I'm focused on, with large quantities of records, and evaluate specifically what each one does well. In one example, GPT-3.5 does a good job for most of the tasks but fails to properly identify some "false negatives." For those instances, I double-check its work with GPT-4 and either validate that GPT-3.5 was right, or get an improved version. In this way I can blend different LLMs to get the best combination of quality, speed and cost.
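The blended approach above can be sketched as a simple cascade: run the cheap model first, then route only the doubtful results to the stronger model for a second opinion. Here `cheap_model` and `strong_model` are hypothetical stand-ins for real API calls, and the confidence values are illustrative.

```python
# Sketch of blending LLMs: a cheap first pass, with low-confidence
# results escalated to a stronger (slower, pricier) model.

def cheap_model(record):
    # Placeholder for the cheaper LLM (e.g. a GPT-3.5-class model).
    return {"label": record.get("guess", "ok"),
            "confidence": record.get("conf", 0.9)}

def strong_model(record):
    # Placeholder for the stronger LLM (e.g. a GPT-4-class model).
    return {"label": "verified", "confidence": 0.99}

def classify(record, recheck_below=0.8):
    """Use the cheap model; double-check low-confidence results."""
    result = cheap_model(record)
    if result["confidence"] < recheck_below:
        result = strong_model(record)  # second opinion from the stronger LLM
    return result

confident = classify({"conf": 0.95})   # cheap model's answer stands
doubtful = classify({"conf": 0.50})    # escalated to the strong model
```

Because only a minority of records get escalated, most of the batch runs at the cheap model's cost and speed while the hard cases still get the stronger model's judgment.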
Over time, I intend to incorporate other LLMs into the mix as their APIs become available. In addition to determining which specific tasks each one is good at, I also plan to use the LLMs to start evaluating each other's work.
Getting great results depends on working with the data and experimenting with the different LLMs to figure out the best way to tackle the problem at hand, and then staying current with an AI industry that is moving very quickly.
Stay tuned for more results in the coming weeks and months.