GGML

GGML is a machine learning library written in C. The “GG” refers to the initials of its author, Georgi Gerganov. Here’s its GitHub.

How to use GGML

> cd ggml/examples
> mkdir my-model-arch
> echo 'add_subdirectory(my-model-arch)' >> CMakeLists.txt
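
Note that add_subdirectory expects my-model-arch to contain its own CMakeLists.txt defining an executable target, plus a source file. A minimal main.c could look like the sketch below (this is not code from the repo; the 16 MB pool size is an arbitrary placeholder):

// my-model-arch/main.c -- minimal sketch to check that the new example builds and links
#include "ggml.h"

#include <stdio.h>

int main(void) {
    // ggml allocates one fixed-size memory pool up front; every tensor lives inside it.
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024, // 16 MB, arbitrary for this sketch
        .mem_buffer = NULL,             // NULL -> ggml allocates the pool itself
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);
    if (ctx == NULL) {
        fprintf(stderr, "ggml_init failed\n");
        return 1;
    }

    printf("pool in use right after init: %zu bytes\n", ggml_used_mem(ctx));

    ggml_free(ctx); // releases the whole pool in one call
    return 0;
}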

ggml.c walkthrough

It may be helpful to walk through the original code as it was on the day of release: code pointer.

ggml_init – This function returns a pointer to a ggml_context, which in turn holds a pointer to the memory buffer in which all tensors are allocated. In particular, it:

  • precomputes lookup tables to save work at runtime. For example, it precomputes a table of Sigmoid Linear Unit (SiLU) values.
  • allocates a memory pool in which all tensors will be stored. The pool is reachable through the ggml_context via its mem_buffer field. Allocating one buffer at the start of the program makes deallocation at the end of the program trivial.
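
As a rough illustration of the memory pool (a sketch only; the graph API has changed across ggml versions, so ggml_new_graph and ggml_graph_compute_with_ctx here assume a reasonably recent version rather than the day-of-release code), every tensor and every intermediate result of a compute graph is carved out of that single mem_buffer, and one ggml_free releases all of it:

#include "ggml.h"

#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Operands and the result tensor are all allocated inside mem_buffer.
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * c = ggml_add(ctx, a, b); // records the op; no math happens yet

    for (int i = 0; i < 4; ++i) {
        ggml_set_f32_1d(a, i, (float) i);
        ggml_set_f32_1d(b, i, 10.0f);
    }

    // Build the compute graph that ends at c and run it on the CPU.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, 1 /* n_threads */);

    for (int i = 0; i < 4; ++i) {
        printf("c[%d] = %.1f\n", i, ggml_get_f32_1d(c, i));
    }
    printf("pool used: %zu of %zu bytes\n", ggml_used_mem(ctx), params.mem_size);

    ggml_free(ctx); // frees every tensor above at once
    return 0;
}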

How Georgi hacked together llama.cpp in one evening

Reference: https://github.com/ggerganov/llama.cpp/issues/33

Here is a short summary of the implementation (a.k.a. "hacking") process if anyone is interested - might be useful for porting other models:

* Started out with the GPT-J example from the ggml repo
* Used the 4-bit branch of ggml since it has initial quantization support that we want
* The LLaMA model has a very similar architecture to GPT-J. It uses the same positional encoding (RoPE) and a similar activation function (SiLU instead of GELU). The main differences are:
    * no bias tensors
    * some new normalization layers
    * extra tensor in the feed-forward part (a ggml sketch of this block follows the list below)
    * a slightly different order of the operations
    * seems the context size is not fixed? (if I understand the code correctly)
* All these are trivial changes that can be applied to the GPT-J example just by looking at the original Python LLaMA code
* Modified the Python conversion script to read the .pth file of 7B model and dump it to ggml format as usual
* The tokenizer was obviously more complex and problematic, but I made a quick hack to at least support it partially
* This was enough to get LLaMA-7B running. Later, the rest of the models became supported after figuring out how to merge the original model parts, thanks to some references from the community
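
For context on the feed-forward difference: the "extra tensor" is the gate projection of LLaMA's SwiGLU-style MLP. Below is a rough sketch of the two blocks expressed in ggml ops; the function and tensor names (fc_in/fc_out, w1/w2/w3) are illustrative, biases are omitted, and this is not the actual llama.cpp code:

#include "ggml.h"

// GPT-J-style MLP: two weight matrices and a GELU,
//   out = fc_out * gelu(fc_in * x)        (bias adds omitted)
struct ggml_tensor * gptj_ffn(struct ggml_context * ctx,
                              struct ggml_tensor * fc_in,
                              struct ggml_tensor * fc_out,
                              struct ggml_tensor * x) {
    struct ggml_tensor * cur = ggml_mul_mat(ctx, fc_in, x);
    cur = ggml_gelu(ctx, cur);
    return ggml_mul_mat(ctx, fc_out, cur);
}

// LLaMA-style MLP: three weight matrices, SiLU gate, no biases,
//   out = w2 * (silu(w1 * x) * (w3 * x))   (elementwise product of the two branches)
struct ggml_tensor * llama_ffn(struct ggml_context * ctx,
                               struct ggml_tensor * w1,
                               struct ggml_tensor * w2,
                               struct ggml_tensor * w3,
                               struct ggml_tensor * x) {
    struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w1, x));
    struct ggml_tensor * up   = ggml_mul_mat(ctx, w3, x);
    return ggml_mul_mat(ctx, w2, ggml_mul(ctx, gate, up));
}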

Here is the LLaMA WIP branch in the ggml repo that I then migrated to become llama.cpp:
https://github.com/ggerganov/ggml/tree/llama

Through this process, there was no need to even run the original Python code. The downside is that I haven't had the chance to compare the outputs at different stages of the inference, so I have doubts about the correctness of this implementation. However, looking at the generated outputs, I guess it has to be correct.

Resources

  • https://rentry.org/local_LLM_guide