The FAMLX repository is an LLM playground built on MLX, a great framework for tinkering with small LLMs. The philosophy of the project is simple: keep things as straightforward as possible, minimize abstractions and dependencies, and stay true to the awesome projects that introduced or inspired the methods showcased here.
FAMLX focuses purely on local inference - no training, adapters are disabled, and quantization is untested. Because abstraction is kept to a minimum, it is reasonably easy to follow every step of the auto-regressive process that turns a prompt into words. The whole point is to make hacking easy for all kinds of experiments. It's all about learning by doing, breaking things, and (sometimes) making them work again.
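To make that process concrete, here is a minimal sketch of an autoregressive decoding loop. The `model` and `tokenizer` objects are hypothetical stand-ins, not FAMLX's actual API, and a real implementation would also use a KV cache rather than re-running the full context:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(model, tokenizer, prompt, max_tokens=64, temperature=0.8):
    # `model` and `tokenizer` are hypothetical stand-ins: model(tokens)
    # returns next-token logits; tokenizer maps text <-> token ids.
    tokens = tokenizer.encode(prompt)             # prompt -> token ids
    for _ in range(max_tokens):
        logits = np.asarray(model(tokens), dtype=np.float64)
        probs = softmax(logits / temperature)     # logits -> distribution
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens.append(next_token)                 # grow the context by one
        if next_token == tokenizer.eos_token_id:  # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)               # token ids -> text
```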
FAMLX can run models of any size built on the Llama, Qwen2 or Gemma2 architectures, using PreTrainedTokenizers (see Transformers below) of the llama, qwen2, gemma2 or gpt2 types. This should cover the vast majority of the current open-source state-of-the-art, including all the Llama 3.1, Llama 3.2 and Qwen2.5 models (Qwen2.5-Coder among them). Note, however, that large models may not run efficiently (or at all) on typical MacBooks. VRAM will be your constraint: as a rule of thumb, budget 2GB of VRAM per 1B parameters in bf/fp16, and choose your model accordingly. You can use quantized models to reduce VRAM requirements.
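The rule of thumb follows from bf16/fp16 weights taking 2 bytes per parameter; a quick back-of-the-envelope check (weights only, before activation and KV-cache overhead):

```python
def estimate_vram_gb(params_billion, bits_per_weight=16):
    # Weights only: activations and the KV cache add overhead on top.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (0.5, 3, 8, 70):
    print(f"{size}B params: ~{estimate_vram_gb(size):.1f} GB in bf/fp16, "
          f"~{estimate_vram_gb(size, bits_per_weight=4):.1f} GB at 4-bit")
```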
Sources and Inspirations
This project leverages the work of amazing open-source projects, which explains the different coding styles across the files in the repository. To the extent possible, we tried not to delete the content of the original files when removing unnecessary features or simplifying functions or objects. Instead, the unnecessary parts were commented out with '###', so these comments can be distinguished from those in the original projects. This approach should make it easier to re-implement the removed features at a later stage if necessary.
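For instance, a file edited under this convention might look like the following (a purely hypothetical snippet, just to illustrate the marker):

```python
def greet(name):
    # A comment kept verbatim from the upstream project.
    message = f"Hello, {name}!"
    ### message = message.upper()  # unnecessary feature, disabled with '###'
    return message
```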
- The MLX community is very active and quickly adapts new models to the mlx-lm library. Most of the files at the root of the FAMLX repository and the files in the models directory are based on the MLX examples.
- Transformers is a library developed by HuggingFace which provides tools to easily download, run and train models. The AI ecosystem relies heavily on Transformers, which is open-source, and mlx-lm depends on it to use pretrained tokenizers from HuggingFace. However, we did not want to pull in the entire transformers library when we only needed a handful of its features. We lifted the necessary parts (now saved in the tokenisation directory), which was an excellent exercise for taking a deep dive into tokenizers - an essential yet often overlooked aspect of AI. FAMLX keeps a dependency on huggingface-hub to fetch models, config files and tokenizers stored on HuggingFace (a download one-liner is sketched after this list).
- Entropix is a project that focuses on adaptive sampling to improve output quality during inference. We believe it is an interesting area of research, so FAMLX offers the possibility to use different samplers, including the Entropix sampler. Note that we only implemented the original Entropix project; our implementation does not reflect the latest updates on its research branch. Our implementation of Entropix is heavily inspired by the entropix-local and entropix-smollm repositories.
Adaptive samplers rely on attention metrics, so adding this feature required substantial changes to the MLX model examples in order to extract those metrics. Again, this was an excellent exercise for better understanding transformers and attention.
Using the Entropix sampler is optional, and the improvement over the default sampler is not always obvious. At the very least, it gives interesting insights into the distinctive attention signatures of different model families. A simplified sketch of the underlying idea is shown below.
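To give a flavor of what adaptive sampling means here, the sketch below computes the entropy and varentropy of the next-token distribution and picks a decoding move accordingly, in the spirit of the original Entropix. The thresholds and branches are illustrative assumptions, not FAMLX's actual sampler, which also takes attention metrics into account:

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def adaptive_sample(logits, low=0.1, high=3.0, temperature=0.7):
    # Illustrative thresholds and branches only.
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    p = np.exp(logp)
    entropy = -(p * logp).sum()                     # model uncertainty (nats)
    varentropy = (p * (logp + entropy) ** 2).sum()  # spread of the surprisal
    if entropy < low:                               # confident -> greedy pick
        return int(np.argmax(logp))
    if entropy > high and varentropy > high:        # very unsure -> explore more
        temperature *= 1.5
    probs = np.exp(log_softmax(logp / temperature))
    return int(np.random.choice(len(probs), p=probs))

# Demo with random logits standing in for a model's output.
rng = np.random.default_rng(0)
print(adaptive_sample(rng.normal(size=32)))
```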
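As for the huggingface-hub dependency mentioned in the Transformers item above, fetching raw tokenizer files from the Hub to poke at them is a one-liner (the repo id and filename below are just examples):

```python
from huggingface_hub import hf_hub_download

# The repo id is only an example; any llama/qwen2/gemma2/gpt2-style
# repo on the Hub exposes similar tokenizer files.
path = hf_hub_download(repo_id="Qwen/Qwen2.5-0.5B-Instruct",
                       filename="tokenizer_config.json")
print(path)  # local path inside the HuggingFace cache
```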
Feel free to contribute on GitHub.
License: see the information in the GitHub repository.