“One day every major city in America will have a telephone.” Alexander Graham Bell
Transformers: More Than Meets the Eye
Human beings can be forgiven for sometimes not grasping the full impact of the technologies we develop. Occasionally, we miss the forest for the trees. This explains both Alexander Graham Bell’s statement on his own invention and perhaps also Berkshire Hathaway’s Charlie Munger recently dismissing AI in his interview with CNBC’s Becky Quick, saying that “Artificial intelligence is not going to cure cancer.” Actually, it just might, and more interestingly, it’s the technology underlying the now everything-everywhere-all-at-once ChatGPT that may help us do so.
To be sure, ChatGPT itself is an amazingly compelling application. The latest iteration, GPT-4, provides eye-watering performance versus humans on academic and professional exams; its statistical understanding of language input and statistical generation of language output are demonstrably impressive.
Fig 1. GPT performance on academic and professional exams (OpenAI 2023)
In a similar vein, earlier work leveraging cognitive psychology by the Max Planck Institute for Biological Cybernetics had found that, despite other limitations, “much of GPT-3’s behavior is impressive: it solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multi-armed bandit task, and shows signatures of model-based reinforcement learning” (Binz and Schulz, 2023).
While GPT’s chat functionality is sure to have broad impact in consumer-facing applications – doing a great job of mimicking human language generation – what’s being lost in the current conversation is the broad impact of ChatGPT’s underlying technology: specifically, the “T” in “GPT” and its potential to disrupt business applications across a wide range of industries. To borrow a line from the comic book The Transformers, there’s more than meets the eye in transformer-based neural network applications; they can do far more than generate consumer chat.
Attention IS All You Need
The seminal work that led to ChatGPT was principally done by researchers at Google, resulting in the paper “Attention Is All You Need” (Vaswani et al., 2017). Essentially, the authors solved a key complexity in interpreting human language, specifically that natural languages encode meaning both through words themselves and also through the positions of words within sentences. We understand specific words not only by their meaning but also by how that meaning is modified by the position of other words in the sentence. Language is a function of both word meaning (space) and word position (distance/time).
For example, let’s consider the sentences, “Time flies like an arrow. Fruit flies like a banana.” It’s clear from the context of each full sentence that in the first, “flies” is a verb and “like” is a preposition. In the second, “flies” is a noun, while “like” is a verb. The other words in each sentence signal to us how to understand “flies” and “like”. Or consider the sentence, “The chicken did not cross the road because it was too wide”. Does the word “it” refer to the chicken or the road? We humans are good at disentangling such sequences, whereas computers long found them challenging. Throw in syntactic differences when translating from one natural language to another – English’s “the white house” being rearranged to Spanish’s “la casa blanca” – and the problem ramifies in complexity.
Vaswani and his colleagues solved the natural language interpretation and generation challenges above through a machine learning architecture they christened the transformer. This is the “T” in GPT. The key capability of this transformer architecture was to take a sequence of words (inputs) and statistically interpret each word of the input (in parallel with the others), not only through the meaning of the word, but also through that word’s relationship to every other word in the sentence. The underlying mechanism to extract meaning – understanding the meaning of every word in context – was a statistical mechanism known as “attention.” Attention is the heart of the transformer, helping applications both to understand the input sequence and to generate the output sequence. And attention-based transformers, it turns out, are quite broadly applicable in modalities beyond language.
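For readers who want to see the idea in miniature, here is a rough sketch of the scaled dot-product attention described in Vaswani et al. (2017), written in plain Python. It is an illustration under loud assumptions, not the paper's implementation: the two-dimensional word vectors in `toy_embeddings` are invented for this example, the function names are hypothetical, and the learned query/key/value projections, multiple heads, and positional encodings that a real transformer uses are all left out. What it does show is the core move: every word is scored against every other word, the scores become weights that sum to one, and each word is re-expressed as a weighted blend of its neighbors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified scaled dot-product self-attention.

    Assumption: each input vector serves directly as query, key and value.
    A real transformer first applies learned projections to produce them.
    """
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        # Score this word against every word in the sequence (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all word vectors according to the attention weights.
        blended = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        all_weights.append(weights)
        outputs.append(blended)
    return all_weights, outputs


# Hypothetical 2-dimensional "embeddings", invented purely for illustration.
toy_embeddings = {
    "time": [1.0, 0.0],
    "flies": [0.7, 0.7],
    "like": [0.0, 1.0],
}
tokens = ["time", "flies", "like"]
weights, outputs = self_attention([toy_embeddings[t] for t in tokens])

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sequence. Because all rows are computed independently, the work can be done in parallel, which is the property the paragraph above alludes to.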
It’s “T” Time
The public discourse to date surrounding ChatGPT has focused almost solely on the natural language that it so effectively generates for consumers in response to natural language prompts. But is natural language the only place where we see a sequence of data elements whose semantics are based on both meaning (space) and position (distance/time)? The answer is emphatically no. Put simply, ChatGPT has siblings in many industrial applications, and this is where disruptive AI opportunities lie for companies today. Let’s take a look at a few examples.
Biology, it turns out, is also a function of meaning and position. Proteins are the large, complex molecules that provide the building blocks of biological function, and are composed of long, linear sequences of amino acids. These amino acids are not randomly arranged molecules: positionality matters. Hence, proteins have a “language syntax” based on their amino acid sequence. Analogous to using a transformer to translate English to Spanish, can we use a transformer in the application area of de novo drug design? That is, can we take a target protein’s amino acid sequence as input and “translate” it into novel molecules as output, each predicted to bind that target? Yes.
Transformers have been successfully used in many such applications within the drug design process (Rothchild et al. 2021, Grechishnikova 2021, Monteiro et al. 2022, Maziarka et al. 2021, Bepler & Berger 2021). The breakthrough we’ll witness in healthcare will not just be generative chat as healthcare user interface. It will be the impact of transformers on the science underlying healthcare itself.
Transformers have been used for real-time electrocardiogram heartbeat classification (Hu et al. 2021) in wearable device applications, and for translating lung cancer gene expression data into lung cancer subtype predictions (Khan & Lee 2021). There’s also BEHRT (Li et al. 2020) and Med-BERT (Rasmy et al. 2021), both of which apply transformers to electronic health records (EHR) and are capable of simultaneously predicting the likelihood of multiple health conditions in a patient’s future visits. The future of healthcare technology? Transformers.
Where else might we see sequences of data where both meaning and position matter? Robotics. Position matters in physical tasks, whether performed by humans or robots. When baking from a recipe (add ingredients, mix, bake) or changing a flat tire (jack up the car, remove flat tire, install new tire), position matters: tasks must be correctly sequenced. How might a robot interpret and sequence tasks? Google’s PaLM-E (Driess et al. 2023) is built with the ever-absorbent transformer, as is RT-1 (Brohan et al. 2022), a “robotics transformer for real-world control at scale”.
The list of industrial applications for transformers appears endless because Big Data promises an endless supply of applications where long-sequenced data encodes positional meaning. Transformers have been used to accurately predict the failure of industrial equipment based on the fusion of sensor data (Zhang et al. 2022). Transformers have also been used to forecast electricity loads (L’Heureux et al. 2022), model physical systems (Geneva & Zabaras 2021), predict stock movement (Zhang et al. 2022), and even generate competition-level code (Li et al. 2022). In this last example, Google DeepMind’s AlphaCode ranked among the top 54% of contestants in coding competitions against human programmers.
ChatGPT and its language brethren will doubtless find application in a range of verticalized, language-based use cases in the business world, whether in office automation, programming, the legal industry, or healthcare. But we also need to look deeper at the true innovation that the underlying transformer technology brings, enabling chat as well as a host of other business applications. Transformers give companies a whole new way of capturing the meaning in their data.
Perhaps we’ll one day look back at the transformational moment in technology that 2017’s transformer breakthrough brought us. There’s a reason why the 2021 research, “Pretrained Transformers As Universal Computation Engines” (Lu et al. 2021), chose the terminology “Universal Computation Engines.” (Technologists and non-technologists alike are strongly encouraged to read this paper, with particular attention to the “frozen” aspect described. Compellingly, the researchers found that “language-pretrained transformers can obtain strong performance on a variety of non-language tasks”.)
And Of Course, AI’s Habitual Downsides
Artificial intelligence, unfortunately, resists the simplistic Manichean classification of good or bad. It’s generally both good and bad, all at the same time. For every positive impact of AI, a negative one exists as well. We’re familiar, for example, with AI hallucination, in which a model confidently produces output that is simply false. In a consumer application such as ChatGPT, this effect might be either amusing or disquieting but will likely have little impact. In an industrial application, the effects of hallucinating AI could be catastrophic (Nassi et al. 2020).
AI is a product of its training data, striving to deliver statistical consistency based on that training data. Consequently, if the input training data is biased, so is the output. Consider the findings in the research “Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases” (Steed & Caliskan 2021; the paper’s title says it all). Or the research “Robots Enact Malignant Stereotypes” (Hundt et al. 2022), which showed “robots acting out toxic stereotypes with respect to gender, race, and scientifically-discredited physiognomy, at scale.”
Further, AI has always been vulnerable to adversarial attack on the data itself (Chou et al. 2022, Finlayson et al. 2018). Under consumer chat, the attack vector now expands to the brand new category of malicious “prompt engineering.” We also need to consider the climate impact of energy-greedy neural network technologies (Strubell et al. 2019) as they become ever more ubiquitous. Cost/benefit tradeoffs must be made with regard to carbon footprints, with the cost calculation requiring some high-fidelity means of measurement.
As AI technologies become more ubiquitous – and transformers may be so protean as to ensure this universality – we create the risk of homogenization. Human populations produce the data we use to train our AI; that AI is then applied to human populations at large, helping condition our behavior (homogenizing it to the norm), which in turn produces more data that’s fed back into the system, in perpetuity. Heterogeneity and individualism get steadily smoothed out, and our behavior and beliefs converge asymptotically on a homogenized norm (the Netflix Top 10 effect). The more ubiquitous these data-driven technologies become, the more rapidly we converge on homogeneity.
Lastly, what happens when something like generative chat gets integrated with something like Neuralink? Perhaps we will find that to be the ultimate definition of the term “artificial intelligence.”
So, who’s going to win the day in the brand-new landscape of transformer AI? In commoditized consumer applications such as chat, it will likely be the same companies that won the last round of consumer applications: Google, Microsoft, Amazon and Facebook. These companies will win the current battle for the consumer for the same reason they won the last one: size. Billions of users a day are already conditioned to visit Google / Microsoft / Amazon / Facebook sites, where they’ll now find themselves further beguiled by transformer-enabled generative chat.
In addition, large language models are computationally expensive, both in training and in deployment. The large server farms of Google / Microsoft / Amazon / Facebook will be a necessity. And ultimately, generative chat is optimized by the application of multi-modal prompts: that is, chat that’s prompted not only by the text input (“write an email to my friend inviting her on a hike”), but also by everything else that the hosting company may know about my context (what’s already on my calendar for the weekend, which park has historically had the fewest visitors during that open slot on my calendar, what the weather is supposed to be, etc.). Only the Big Data giants possess this sort of multi-dimensional / multi-modal prompt data. Perhaps unsurprisingly and/or dismayingly, we can expect our supposed new day to be Groundhog Day.
On the corporate side, however, the contest remains wide open. We can anticipate verticalized generative chat applications to be deployed by businesses in all industries. We should also understand that, whether in drug design or robotics, transformers are now revolutionizing how we can interpret and act on large-scale industrial data. Competitive advantage will be seized by those companies that can most quickly and effectively bring these transformer-based models into production use.
Our physical world is a function of space and time (positionality!). Our experiences are defined by these two factors, and natural language – the sequenced data of human communication – encodes the reality of space and time. By solving the problem of natural language understanding and generation, transformers also generalize the means for AI to solve a host of other problems in the physical world that also depend on data’s meaning and positionality. The advent of transformers may not be a Wright Flyer moment, but we may indeed be witnessing AI’s jet engine moment. Companies in all industries had best get on board.