<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Homo Technologicus: Practicum AI]]></title><description><![CDATA[The posts in this section delve into the details of AI for non-technical audiences.]]></description><link>https://danmaxwell.substack.com/s/practicum-ai</link><image><url>https://substackcdn.com/image/fetch/$s_!oIqq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d6aa7c-efda-487d-841c-1f154694fdc9_880x880.png</url><title>Homo Technologicus: Practicum AI</title><link>https://danmaxwell.substack.com/s/practicum-ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 10:50:29 GMT</lastBuildDate><atom:link href="https://danmaxwell.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Dan Maxwell]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[danmaxwell@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[danmaxwell@substack.com]]></itunes:email><itunes:name><![CDATA[Dan Maxwell]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dan Maxwell]]></itunes:author><googleplay:owner><![CDATA[danmaxwell@substack.com]]></googleplay:owner><googleplay:email><![CDATA[danmaxwell@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dan Maxwell]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Anatomy of a Neural Network (Part I)]]></title><description><![CDATA[Up to this point, I&#8217;ve mentioned neural networks multiple times in my previous posts.]]></description><link>https://danmaxwell.substack.com/p/anatomy-of-a-neural-network-part</link><guid 
isPermaLink="false">https://danmaxwell.substack.com/p/anatomy-of-a-neural-network-part</guid><dc:creator><![CDATA[Dan Maxwell]]></dc:creator><pubDate>Tue, 15 Oct 2024 18:38:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150258532/37b9e2fe8d40d5e2661e4ee6c0c269d5.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Up to this point, I&#8217;ve mentioned neural networks multiple times in my previous posts.  I thought it might be helpful if I spent some time explaining their construction and operation.  This will be the first post in a series on this topic.   Please let me know if anything is unclear.  </p>]]></content:encoded></item><item><title><![CDATA[Algorithm or Model]]></title><description><![CDATA[Although Substack provides a transcription, here&#8217;s the original with my formatting.]]></description><link>https://danmaxwell.substack.com/p/algorithm-or-model</link><guid isPermaLink="false">https://danmaxwell.substack.com/p/algorithm-or-model</guid><dc:creator><![CDATA[Dan Maxwell]]></dc:creator><pubDate>Tue, 14 May 2024 14:55:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144624165/81f6fa0dcfa284ca859dd413da3d5be6.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Although Substack provides a transcription, here&#8217;s the original with my formatting.</p><p>Hello and welcome to another episode of Practicum AI.&nbsp; I&#8217;m Dan Maxwell.&nbsp; In this short presentation, I&#8217;m going to talk briefly about the difference between algorithms and models.</p><div><hr></div><p>Writers are often imprecise and fuzzy when using technical vocabulary.&nbsp; Consider this sentence from a recent article in a peer-reviewed journal. 
&#8220;According to machine learning, the right algorithms can help a computer self-improve as it acquires more experience.&#8221;&nbsp; In this example, the author links the word &#8220;algorithm&#8221; to the phrase &#8220;machine learning&#8221; &#8211; the thought being that algorithms power machine learning.&nbsp; Unfortunately, this is technically imprecise.&nbsp;</p><p>In computer science, an&nbsp;<strong>algorithm</strong>&nbsp;is different from a&nbsp;<strong>model</strong>. Models lie at the heart of machine learning, not algorithms.</p><p>Here&#8217;s how I might rewrite this sentence in a more precise way.&nbsp;&nbsp;&#8220;In machine learning, a good model quickly learns the patterns found in a specific dataset.&#8221;&nbsp; Note that I replaced &#8220;algorithms&#8221; with &#8220;model&#8221; and &#8220;self-improve&#8221; with &#8220;learn.&#8221; But why did I do that?&nbsp; Let&#8217;s begin with a closer look at algorithms.</p><div><hr></div><p>Algorithms have a long and distinguished history in mathematics and computer science.&nbsp; An algorithm is like a recipe.&nbsp; It specifies the steps to achieve a task (making muffins in this case) as well as the required ingredients.</p><div><hr></div><p>Programmers often write pseudocode to help them understand the logic of a new algorithm.&nbsp; Here, we see the steps for making a cup of tea.&nbsp; In pseudocode, a programmer concisely states&nbsp;<strong>what</strong>&nbsp;each program step will do.&nbsp; They then specify the <strong>how</strong> of each step by writing code in C++, Python, or some other language.</p><div><hr></div><p>Algorithms lie at the heart of traditional programming. As noted earlier, an algorithm is like a recipe.&nbsp; At each step, it needs ingredients (data) and rules (underlying process logic) to arrive at an answer.&nbsp;</p><p>The image shown here has been the backbone of development since the beginning of computer science. 
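To make this concrete, here is a minimal Python sketch of that traditional pattern &#8211; hand-written rules acting on data to produce an answer. The example and its thresholds are my own illustration, not something from the slides.

```python
# Traditional programming: human-authored rules act on data to produce an answer.
# Hypothetical illustration -- the domain, thresholds, and labels are mine.

def water_state(temp_c: float) -> str:
    """Apply fixed rules (the 'recipe') to one piece of data: a temperature in Celsius."""
    if temp_c <= 0:
        return "SOLID"    # at or below freezing
    elif temp_c < 100:
        return "LIQUID"   # between freezing and boiling
    else:
        return "GAS"      # at or above boiling

print(water_state(25))    # prints "LIQUID"
```

Every rule here was written by a person who already knew it; the program discovers nothing on its own.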
But it&#8217;s limited.&nbsp;This approach works well when we know <strong>all</strong> the rules and can implement them in code.&nbsp; But what happens when we cannot know all the rules?&nbsp; That&#8217;s when complexity quickly overwhelms us.</p><p>Let&#8217;s consider a concrete example&#8230;</p><div><hr></div><p>Consider a case where we want to create a program to predict a person&#8217;s activity, given their speed as calculated by a fitness monitor.&nbsp; In this example, a programmer has already implemented an algorithm in code.</p><ul><li><p>Our first program is simple.&nbsp; If the measured speed is less than 4 miles per hour, the program predicts &#8216;WALKING&#8217; as the activity.</p></li><li><p>Our second program, however, is a bit more complex.&nbsp; In this case &#8211; if the person&#8217;s speed is 4 miles per hour or greater &#8211; the program predicts &#8216;RUNNING&#8217; as the activity.</p></li><li><p>Our third program is even more complex.&nbsp; If a person&#8217;s speed is 12 miles per hour or greater, the program predicts &#8216;BIKING&#8217;.&nbsp; If the speed is at least 4 miles per hour but less than 12, it predicts &#8216;RUNNING&#8217;.&nbsp; And finally, if the speed is less than 4 miles per hour, it predicts &#8216;WALKING&#8217;.&nbsp;&nbsp;</p></li><li><p>But what if we want to extend this even further to predict &#8216;GOLFING&#8217; as the activity?&nbsp; Now, we&#8217;re stuck.&nbsp; How do we model that in code?&nbsp; While golfing, a person might walk a bit, stop, do some activity, walk a bit more, stop, and so on.&nbsp; Our algorithm quickly turns into a nightmare.&nbsp; Clearly, our ability to detect this activity using traditional rules has hit a wall.&nbsp; Is there a better way?&nbsp; Well, yes there is.&nbsp; Enter machine learning&#8230;</p></li></ul><div><hr></div><p>Let&#8217;s&nbsp;take another look at our traditional programming diagram.&nbsp; Here, the rules implemented in the code act on the data to 
give us answers or predictions.&nbsp; In our activity detection example, the data is the speed at which a person is moving.&nbsp; Using that speed, we&nbsp;then designed an algorithm or set of rules to detect their&nbsp;activity, whether walking, biking, or running. However, we hit a wall with golfing because we couldn&#8217;t figure out the rules for that activity.</p><p>But what would happen if we flipped the axes on this diagram?&nbsp; Rather than figuring out the&nbsp;<em>rules</em>, what if we fed the&nbsp;<em>answers</em>&nbsp;and the&nbsp;<em>data&nbsp;</em>to a model that then discovered those rules? That, in a nutshell, is machine learning.</p><p>So, what are the implications of all this?&nbsp;Well &#8211;&nbsp;with machine learning&nbsp;&#8211; we&nbsp;don&#8217;t try to figure out the algorithm or rules.&nbsp; Instead, we first collect a lot of data and label it.&nbsp; We then let the model figure out the rules that make one piece of data match a specific label and another match a different one.</p><div><hr></div><p>So, how does machine learning solve the complex activity detection problem?&nbsp; Well, the machine learning strategy goes like this.&nbsp; We first collect data from various sensors worn by our research participants.&nbsp; Those sensors might collect heart rate, location, speed, and perspiration.&nbsp;&nbsp;And&nbsp;if we collect this data while our participants are doing various activities, we end up with a dataset that allows the&nbsp;<strong>model</strong>&nbsp;to &#8220;see&#8221; what walking looks like, what running looks like, and so on.&nbsp;Now,&nbsp;our job has changed from designing algorithms to creating machine learning&nbsp;<strong>models</strong>&nbsp;that can match&nbsp;the data to the labels.&nbsp;</p><div><hr></div><p>Well, what is a model?&nbsp; From a computational perspective, a model is simply a mathematical/statistical representation of reality.&nbsp; Note:&nbsp;a model is not reality.&nbsp; The adage that the 
&#8220;map is not the territory&#8221; also holds true for machine learning models.&nbsp; George E.P. Box once said, &#8220;All models are wrong, but some are useful.&#8221;&nbsp; Etch that thought into your mind!&nbsp;</p><p>Whereas&nbsp;<strong>algorithms</strong>&nbsp;execute a predefined sequence of steps to complete a task, machine learning&nbsp;<strong>models</strong>&nbsp;make&nbsp;<strong>predictions</strong>. Note: these are two very different things.&nbsp; Neural networks, for example, do not take an algorithmic approach to learning.&nbsp; Instead, they&nbsp;are made&nbsp;of artificial neurons stacked in layers.&nbsp; Imagine a layered cake.&nbsp; During training, the model adjusts the weights between the layers &#8211; the frosting in-between &#8211; in response to the feedback it receives from its predictions.&nbsp;This&nbsp;is a messy and largely unstructured process, a far cry from the&nbsp;methodical&nbsp;step-by-step approach of the algorithm.</p><p>Now you know why I rewrote the sentence from our first slide.&nbsp; In future podcasts, I will talk about neural network anatomy.</p><div><hr></div><p>Thank you for watching this presentation!&nbsp; If you have any questions or if anything was unclear, please leave me a comment.</p>]]></content:encoded></item><item><title><![CDATA[To Predict a Word]]></title><description><![CDATA[Although Substack provides a transcription, here&#8217;s the original with my formatting.]]></description><link>https://danmaxwell.substack.com/p/to-predict-a-word</link><guid isPermaLink="false">https://danmaxwell.substack.com/p/to-predict-a-word</guid><dc:creator><![CDATA[Dan Maxwell]]></dc:creator><pubDate>Tue, 30 Apr 2024 20:04:23 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/143870633/f03c5983e9ea1c2b3511dc4fdbd8a4e6.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Although Substack provides a transcription, here&#8217;s the original with my formatting. 
</p><p>Hello and welcome to another episode of Practicum AI.&nbsp; I&#8217;m Dan Maxwell.&nbsp; In this short presentation, I&#8217;m going to talk briefly about word prediction.</p><div><hr></div><p>Consider this famous sequence of words, spoken by the American President Abraham Lincoln at the start of his Gettysburg Address. Given this sequence, what word comes next?&nbsp; Or given this one, what&#8217;s the missing word?&nbsp; Stated simply, this is all that AI language models do.&nbsp; They predict the next word, given a specific context.&nbsp; This simple task lies at the heart of all generative AI text systems, including the most advanced large language models &#8211; ChatGPT and Llama being but two examples.&nbsp; When you prompt ChatGPT, it uses your sequence of words to generate a new sequence, one word at a time.</p><div><hr></div><p>Before transformers made their debut in 2017, the most popular word prediction tools were n-gram language models and recurrent neural networks or RNNs.&nbsp; But both had significant limitations.&nbsp; I will first talk about n-gram language models, followed by RNNs, and conclude with a brief introduction to transformers.</p><div><hr></div><p>N-gram language models were the first and simplest approach to word prediction.&nbsp; Consider the sentence: &#8220;A cat sat in the hat.&#8221;&nbsp; So, how might an n-gram language model generate this?&nbsp; In this example, our initial sequence has just two words: &#8220;A&#8221; and &#8220;cat&#8221;.&nbsp; Here&#8217;s how an n-gram model predicts the next word:</p><p>1. First &#8211; as this is a bigram model, it takes the initial two words and searches the model&#8217;s document dataset for sentences where these two appear together.&nbsp;</p><p>2. In this example, our search retrieved sentences where &#8220;sat&#8221; was the next word and others where &#8220;napped&#8221; came next.&nbsp; As we can see, the word &#8220;sat&#8221; was more prevalent, occurring 11 times in 68% of 
the sentences.&nbsp; On the other hand, the word &#8220;napped&#8221; was found in just 5 sentences &#8211; 32% of the time.</p><p>3. Because &#8220;sat&#8221; has the highest probability of appearing after these two words, the model selects it as the next word in the sequence.&nbsp; It then takes the next two words &#8211; &#8220;cat sat&#8221; &#8211; and repeats the same process until the entire sentence has been generated.&nbsp;</p><div><hr></div><p>N-gram language models are limited.&nbsp; They assume that the probability of the next word in a sequence depends only on a <strong>fixed-size</strong> window of previous words (Wikipedia).&nbsp; The problem with this assumption is that the model might not find any matches, especially for larger n-grams of 6, 7, or more words.&nbsp; Consider this five-word 5-gram &#8211; &#8220;cat pawed at the moving &#8230;&#8221;.&nbsp; Clearly, the chances of the model finding this exact sequence of words in a large document dataset are minimal.&nbsp; Sophisticated versions of n-gram models use a variety of statistical techniques to predict the next word when no matching sentences are found.&nbsp; Even so, this remains a limiting factor.&nbsp;</p><div><hr></div><p>Recurrent neural networks or RNNs made next-word prediction and translation more precise.&nbsp; With RNNs, all the information about an input is represented in a single piece of state memory, or context vector.&nbsp; Thus, RNNs must compress everything they need to know about a word sequence into the available space.&nbsp; This limits the size of the input sequence.&nbsp; And no matter how large we make state memory, some word sequence inevitably exceeds it, and vital information is lost.</p><p>A second problem is that RNNs must be trained and used one word at a time. 
This can be a slow way to work, especially with large datasets.</p><div><hr></div><p>To gain a better understanding of the state memory limitation, let&#8217;s review the basics of RNN architecture.&nbsp;&nbsp;</p><p>RNN models contain a feedback loop that allows information to move from one step to another.&nbsp; As such, they&#8217;re ideal for modeling sequential data like text. As shown here, an RNN receives some input (a word or character), feeds it through the network, and outputs a vector called the hidden state. At the same time, the model feeds some information back to itself via the feedback loop, which it can then use in the next step.</p><p>To the right of the equal sign, the RNN process is unrolled. During each iteration, the RNN cell passes information about its state to the next operation in the sequence. This allows the cell to retain information from previous steps and use it for its output predictions.</p><div><hr></div><p>The RNN architecture made early machine translation systems possible.&nbsp; RNNs usually translate text by linking an encoder to a decoder.&nbsp; This architecture works well when both the input and output sequences are of fixed length.&nbsp; Here, a short English sentence of three words plus an exclamation point is translated into German.&nbsp; The encoder ingests each sentence element sequentially while maintaining its state along the way.&nbsp; The encoder&#8217;s last hidden state &#8211; a numerical representation of the entire sentence &#8211; is then passed to the decoder.&nbsp; The decoder, in turn, generates the German equivalents from top to bottom.</p><p>This architecture is simple and elegant, but it has one big weakness.&nbsp; The encoder&#8217;s final hidden state is an information bottleneck.&nbsp; That is, it must represent the meaning of the entire input sequence in a compressed form.&nbsp; With long sequences, this creates a challenge.&nbsp; The information at the start of the sequence might be lost in 
the process of compressing everything into a single (fixed) representation.&nbsp;</p><p>When that happens, the decoder may not have enough information to do its job well.</p><div><hr></div><p>Alright, let&#8217;s simulate the bottleneck problem.&nbsp; This animation shows how an RNN translation model works. Each word is processed separately, with a single hidden state passed between words.&nbsp; The encoder&#8217;s final hidden state is then handed off to the decoder, which generates the German equivalent.</p><div><hr></div><p>The transformer architecture represented a significant advance over RNNs.&nbsp; Here we see a transformer executing another translation task.&nbsp; This time from English to French.&nbsp; Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), a Transformer encoder reads the entire sequence of words at once.&nbsp; It is therefore considered bidirectional.&nbsp; Or more precisely, we say that it&#8217;s non-directional.&nbsp; This property allows the model to learn the context of a word in relation to <strong>all</strong> the words around it.&nbsp; In other words, the transformer context window is larger than that of n-gram language models or RNNs.&nbsp; And this is a significant advantage.&nbsp; In future presentations, I&#8217;ll talk about transformers in-depth.&nbsp; But this is enough for now.</p><div><hr></div><p>And before I end this short presentation, here&#8217;s a family tree of the most prominent transformer models.&nbsp; Keep in mind that this list is not complete, as development in this space is dynamic and ongoing.&nbsp;</p>]]></content:encoded></item></channel></rss>