
LLaMas

07 Apr 2023


This week was the first time I ran an LLM on my own machine! The first thing I asked was for the model to write an article about SwiftUI.

Here is the output:

```
sampling parameters: temp = 0.960000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

Below is an instruction that describes a task. Write a response that appropriately completes the request.
> Write an short introduction for an article about SwiftUI.
SwiftUI is Apple's latest programming framework for creating intuitive and interactive user interfaces for iOS, macOS, and tvOS applications. This article will take an in-depth look at SwiftUI, including its features, benefits, and usage. We'll also explore some practical tips for creating great user experiences with SwiftUI.
```

Pretty handy, I must admit! Although it ran at a meager pace of 2.6 tokens/min, or roughly 0.043 tokens/sec, which is about 500 times slower than what the model reportedly manages on an M1 Mac — likely down to the old CPU and the model barely fitting in memory. For reference, this was running on a 2015 ThinkPad X250 with a 5th-generation Intel i5 and 8 GB of RAM, on Fedora 37. Oh, and as a side note, you can see how excited I was in my misspelling of the phrase "Write an short..." :D
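As an aside, the sampling parameters in that log (temp, top_k, top_p, repeat_penalty) control how the model picks each next token from the probability distribution it predicts. Here is a minimal sketch of what temperature plus top-k/top-p filtering boil down to — my own illustration of the concepts, not llama.cpp's actual sampling code (which, among other things, also applies the repeat penalty):

```cpp
// Toy next-token sampler: temperature scaling, then top-k and
// top-p (nucleus) filtering, then a weighted random draw.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample_token(std::vector<float> logits, float temp, int top_k, float top_p) {
    // Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    for (float &l : logits) l /= temp;

    // Softmax (shifted by the max logit for numerical stability).
    float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_l);
        sum += probs[i];
    }
    for (float &p : probs) p /= sum;

    // Sort token ids by probability, descending.
    std::vector<int> ids(probs.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    // Keep at most top_k tokens, then cut further once the cumulative
    // probability mass exceeds top_p (nucleus sampling).
    size_t keep = std::min(ids.size(), static_cast<size_t>(top_k));
    float cum = 0.0f;
    for (size_t i = 0; i < keep; ++i) {
        cum += probs[ids[i]];
        if (cum >= top_p) { keep = i + 1; break; }
    }

    // Draw one of the surviving tokens, proportionally to its probability.
    std::vector<float> kept(keep);
    for (size_t i = 0; i < keep; ++i) kept[i] = probs[ids[i]];
    std::mt19937 rng{std::random_device{}()};
    std::discrete_distribution<int> dist(kept.begin(), kept.end());
    return ids[dist(rng)];
}

int main() {
    std::vector<float> logits = {2.0f, 1.0f, 0.5f, -1.0f}; // toy 4-token vocabulary
    int tok = sample_token(logits, /*temp=*/0.96f, /*top_k=*/10000, /*top_p=*/0.95f);
    std::printf("sampled token id: %d\n", tok);
}
```

With top_k = 10000 and top_p = 0.95, as in the log above, the top-k cutoff is effectively a no-op and the nucleus cutoff does the real filtering.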

What is this, and how is this possible?

On March 10th, Georgi Gerganov announced llama.cpp, his port of Meta's LLaMA model in C++, which was, according to the README, "hacked in an evening." I saw it on Hacker News, but didn't dig in too much since it seemed focused on the new M1- and M2-powered Macs and I didn't have much time to play with it. Some weeks later, I saw some new features added to it, so I decided to finally give it a try. The reason I could run the 7B model at all is that the implementation uses 4-bit quantization, roughly quartering the memory requirements of the original model, whose parameters are 16-bit floats (two bytes each).
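The back-of-the-envelope arithmetic shows why this matters on a machine like mine. A quick sketch of the math as I understand it — approximate figures that ignore activations, the context cache, and the small per-block scale factors the quantization format stores alongside the weights:

```cpp
// Rough memory footprint of the 7B model's weights at fp16 vs. 4-bit.
#include <cstdio>

int main() {
    const double params = 7e9;                       // ~7 billion weights
    const double GiB    = 1024.0 * 1024.0 * 1024.0;

    const double fp16_bytes = params * 2.0;          // 16 bits per weight
    const double q4_bytes   = params * 0.5;          // 4 bits per weight

    std::printf("fp16 weights:  %.1f GiB\n", fp16_bytes / GiB); // ~13.0 GiB
    std::printf("4-bit weights: %.1f GiB\n", q4_bytes / GiB);   // ~3.3 GiB
    return 0;
}
```

At fp16, the weights alone wouldn't even fit in my 8 GB of RAM; at 4 bits per weight they do, with some room to spare.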

The process

First things first, I cloned the llama.cpp repository and then followed the steps from the README. However, I quickly ran into a problem: I didn't know where to download the models from. So I read a bit further and saw the line:

"The LLaMA models are officially distributed by Facebook and will never be provided through this repository."

"OK, clear enough", I thought. Lets try and ask Meta for the models. There was a link to a form for requesting the models, so I followed it and submitted my request. The fact that the form asked for previous research should have been hint enough that a curious CS student like myself wasn't going to get his hands on it anytime soon. That didn't stop me from trying; however, after not getting a reply from them after a few days, I grew impatient and decided to search for the models elsewhere. It turns out you can still find the links in the repository because they were previously included in the README, but later deleted. I dug through the README.md git history to try and find the download links, but none of them ended up successfully downloading. So I googled some more and found some models on HuggingFace. I downloaded the 7B ggml alpaca model and tried to use it, but running the alpaca script spat out an error saying the model file was too old. After a bit of digging and checking out some older versions of the repository, it finally worked!

What does this mean?

While I don't think it's really usable on my current hardware, and the 7B model isn't that impressive once you've tried GPT-3.5 or newer OpenAI models, the fact that such a large language model can run at all on a 2015 laptop without a dedicated graphics card is truly astonishing. And when you consider that a 65B-parameter model can run comfortably on the new M2-powered MacBooks, this is all really exciting news. It also takes some of the power away from OpenAI, Google and the like. Currently, they control their models completely. They can force updates, leaving you unable to use the old version. They alone get to decide what the limits of the models will be, what they can and cannot do (in spite of some people's best efforts at jailbreaking), as well as which directions to develop them in. While training such models from scratch still requires dedicated hardware not readily available to most people, this is a giant step towards democratizing LLM technology. And although full training remains prohibitively expensive for ordinary people, fine-tuning is much more accessible, so many specialized models can now be built.


And that's it for this article! If you were expecting some analysis of the model's traits, perhaps compared to other models like ChatGPT or Bard, I'm afraid I'll have to disappoint you. It really is dreadfully slow on my hardware, so any serious experimentation would take a (far too) long while. I nonetheless remain impressed that it's possible to run it at all!