Optimization of triangular and banded matrix operations using 2d-packed layouts