Fast Llama 2 on CPUs with Sparse Fine-Tuning and DeepSparse

Key Takeaways:

- We expanded our Sparse Fine-Tuning research results to include Llama 2. The results include 60% sparsity with INT8 quantization and no drop in accuracy.
- DeepSparse now supports accelerated inference of sparse-quantized Llama 2 models, with inference speeds 6-8x faster than the baseline at 60-80% sparsity.
- We used some interesting algorithmic techniques in order…
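
For context on what serving such a model looks like, DeepSparse exposes a `TextGeneration` pipeline for LLM inference. Below is a minimal sketch; the SparseZoo stub, prompt, and generation settings are illustrative assumptions, not details taken from the article:

```python
# Sketch: running a sparse-quantized Llama 2 checkpoint on CPU with
# DeepSparse's TextGeneration pipeline (pip install deepsparse[llm]).
from deepsparse import TextGeneration

# Hypothetical SparseZoo stub for a 60%-sparse, INT8-quantized Llama 2 7B;
# check SparseZoo for the actual identifier of the article's checkpoint.
stub = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"

pipeline = TextGeneration(model=stub)

prompt = "Explain what weight sparsity means for CPU inference."
output = pipeline(prompt, max_new_tokens=64)
print(output.generations[0].text)
```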

Read in full here:

This thread was posted by one of our members via one of our news source trackers.