AI is the Future of Interfaces
Interface is a fascinating topic. After reading David Kaye’s Substack post touching on this idea, I decided to delve into what I think the future of interfaces looks like with AI.
Distilled down, an interface is the translation layer between two entities for conveying an idea. You have an idea of what you want in your head; you communicate it to another entity, and that entity executes it.
Framed that way, interface is communication. The evolution of communication starts with one-to-one conversation: assuming two people speak the same language and are extremely good at describing their visions, the requestor can ask the executor for a deliverable and get a result reasonably close to what they asked for. But in reality, no one is perfect at describing what they want, and no one is perfect at interpreting what they hear. So the end result is always a product of iteration - show me what you have, and I will describe how it differs from the idea in my head. Eventually we get close enough.
Unfortunately, you likely can’t scale your perfect communicators and your perfect executors. So we begin to build other means of interface, which allow the communicator to get closer to what they intend to communicate by using tools on rails to deliver the vision as faithfully as possible. In history, maybe that’s the printing press, or the ability to mass-produce drawings to describe an idea, or the use of data and charts to inform and persuade people about the vision.
But what if the executor is a machine? Unless you’re writing in binary, every programmer is already using an interface to communicate with these executor machines. Every modern programming language is just an attempt to translate the desires of the communicator (the programmer) to the executor (the machine) as efficiently as possible. This requires not only knowledge of the programming language, but often a deeper understanding of math and other logical disciplines than most people have. We learned in the DOS days that command-line interfaces do not work for most people - this was a big part of why Windows became so popular.
The establishment of relatively standard interfaces allowed communicators to communicate - on very specific rails. If you wanted to do something, you needed to know how to make the interface do it. If I want to open a photo in Photoshop and do color correction, I can do that myself, rather than communicating to a photo expert to do it for me. If the interface did not allow it, or you couldn’t figure out how to make it happen - you couldn’t do it.
The history of interfaces is one where we start with experts, then evolve to usability by novices. This is one of the most exciting innovations the AI revolution is bringing us. Now, a machine can understand us the same way that one-to-one expert could back in the day. We can just communicate naturally and get a result that is close to what we wanted.
So far this has mostly taken the form of chatbots. That’s a natural way for it to start, and it’s already a huge leap forward when backed by large language models. But what’s next? Is this really the best way to communicate with a machine? An AI model is really just guessing what it is that you want. The closer your request aligns with the data its neural network was trained on, the easier it will likely be to get the result you want. Currently AI is still a relatively poor communicator - it can produce great results, but like any machine interface it takes learning the tips and tricks to make it work the way you want, rather than being as natural an interface as it could be.
The next step, I think, will come in two parts: the ability to curb hallucinations (those moments when an AI model simply invents a result out of nowhere - anyone who has asked ChatGPT for specific information only to have the machine flat out lie to them will understand this), and the ability to go beyond text as input.
People are already working on the first part, using RAG (retrieval-augmented generation, a technique that grounds an AI’s answers in retrieved source documents rather than letting it generate unsupported facts) and other similar experiments. I’m actually more interested in the second problem.
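To make that first part a little more concrete, here is a minimal sketch of the RAG idea - retrieve relevant passages first, then pin the model’s answer to them. Everything in it (the documents, the toy keyword-overlap scoring, the prompt wording) is invented for illustration; real systems use embedding similarity over a vector index rather than word counting.

```python
# Minimal sketch of retrieval-augmented generation (illustrative only):
# retrieve the most relevant source passages, then instruct the model to
# answer ONLY from those passages.

from collections import Counter

# Toy corpus standing in for a real document store.
DOCUMENTS = [
    "GPT-4o accepts text, images, and audio as input.",
    "Total Communication accepts any form of communication the child can produce.",
    "Blender exposes a Python API (bpy) for manipulating 3D scenes.",
]

def score(query: str, doc: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    # Return the k passages that overlap most with the query.
    return sorted(DOCUMENTS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(f"- {s}" for s in retrieve(query))
    # Pinning the model to the retrieved sources is what limits its
    # freedom to invent facts.
    return (
        "Answer using ONLY the sources below. If they don't contain the answer, "
        "say you don't know.\n\nSources:\n" + context + f"\n\nQuestion: {query}"
    )

if __name__ == "__main__":
    # The assembled prompt would then be sent to whatever LLM you're using.
    print(build_prompt("What inputs does GPT-4o accept?"))
```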
Today OpenAI announced GPT-4o, or “Omni” - a huge step toward that second part. Among its capabilities: it can read a user’s desktop, interpret a photo, and pick up intonation in speech. This is key to solving the next step beyond chat text as the sole input.
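To give a feel for what “beyond text as input” looks like from the developer side today, here is a hedged sketch of a single request that mixes text and an image, written against the OpenAI Python SDK. The model name, the question, and the screenshot URL are placeholders, and the exact API surface may change, so treat this as an illustration rather than a recipe.

```python
# Hedged sketch: one user message that combines text with an image,
# sent through the OpenAI Python SDK's chat interface.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here's my screen. Why does everything look blue?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},  # placeholder
            ],
        }
    ],
)

print(response.choices[0].message.content)
```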
My oldest daughter is non-verbal. During years of speech therapy, one of the things the therapists often encourage is something called Total Communication. You accept communication from the child in any form they can do it. It could be from communication software on an iPad. It could be sign language. It could be certain vocalizations. You don’t limit them to just one method - because the end goal is communication. The end goal is not using an iPad, or signing, or at least in our case, verbal speech.
I think this also makes sense for how typically developing people communicate. When I’m describing a feature to my teams, I don’t just tell them what I want. I draw a picture. I write user stories. I provide research. I identify questions to be answered before we can move forward. My body language communicates. My whiteboard scribbles communicate. My expressions communicate. Anyone who has tried to have a detailed breakdown of features over Slack knows that, as useful as that tool is, it limits both parties to only one channel of communication when we as humans typically absorb several.
So I think that is where the next step for AI lies. It must read our intentions better. We must be able to give the model inputs that are non-textual, and combine them: drawings; screen selection to say this part is what I mean, not that one (which Photoshop has already started doing). It should learn my expressions - am I confused? Dismissive? Am I getting what you’re telling me? Does the tone of my voice indicate that we’re on the wrong track?
AI must become intuitive when it comes to communicating with humans. What AI has the potential to do is give non-technical stakeholders the ability to build complex systems using the same communication skills they already use with other humans.
One example I’ve been diving into lately is using AI to direct the creation of 3D scenes. For those of us who have learned to use tools like Blender, Maya, Unity or Unreal Engine, we might remember how frustrating navigating a 3D environment on a 2D surface is. Some companies solve that problem by putting the user in a 3D environment themselves, which is great, but I think that’s a higher-friction path than is necessary. If I can just say:
“Move that block above the table. Align it to the bowl in the middle, about a half meter up.”
“I want this to animate when someone clicks this button. Move it two meters left. Make it speed up and slow down on the way over.”
“Change the lighting to be warmer. Everything kind of looks blue now.”
These are conversations that, if you were a PM or game designer talking to a 3D artist, would be simple to have, and that artist would understand you perfectly. But if you tried to do these things yourself in 3D software, and you didn’t understand alignment across multiple planes, or how easing works in animation, or how to adjust lighting temperature - you’d be frustrated, and possibly unable to complete the task.
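For contrast, here is a hedged sketch of what roughly those same directions look like when you have to speak the machine’s language directly, written against Blender’s Python API (bpy). The object names, frame range, and color values are assumptions about a hypothetical scene, not a working project.

```python
# Sketch of the scene edits behind those three sentences, using Blender's
# Python API (bpy). Object names ("Block", "Bowl", "Key Light") are assumed;
# run inside Blender's scripting workspace.

import bpy

block = bpy.data.objects["Block"]      # assumed object names
bowl = bpy.data.objects["Bowl"]
light = bpy.data.objects["Key Light"]

# "Move that block above the table. Align it to the bowl in the middle,
#  about a half meter up."
block.location.x = bowl.location.x
block.location.y = bowl.location.y
block.location.z = bowl.location.z + 0.5

# "Move it two meters left. Make it speed up and slow down on the way over."
scene = bpy.context.scene
block.keyframe_insert(data_path="location", frame=scene.frame_start)
block.location.x -= 2.0
block.keyframe_insert(data_path="location", frame=scene.frame_start + 48)
# Ease-in/ease-out is an interpolation setting on the keyframes themselves.
for fcurve in block.animation_data.action.fcurves:
    for kp in fcurve.keyframe_points:
        kp.interpolation = 'BEZIER'
        kp.easing = 'EASE_IN_OUT'

# "Change the lighting to be warmer."
light.data.color = (1.0, 0.85, 0.7)  # shift the lamp color toward orange
```

The point isn’t that this code is especially hard to write - it’s that every line encodes a concept (world coordinates, keyframes, interpolation and easing, color temperature) that the person giving the direction never needed to know.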
If I could combine an AI that understands those directions with the ability to point at elements, to draw an example sketch, to show my disappointment on my face when the system gets it wrong, I think we’ll get as close to speaking the same language as possible.
The less we need translation, the more invisible and frictionless machine usage will be. The ability to communicate as a human is an unmatched interface.