If you are already familiar with SIMD, you may have heard of it being used in supercomputers, linear algebra for machine learning applications, video processing, or image codecs such as JPEG XL. But if SIMD operations only involve independent elements, how can we use them for sorting, which involves re-arranging adjacent array elements?

Imagine we have some special way to sort, for instance, 256-element arrays. Then the Quicksort algorithm for sorting a larger array consists of partitioning it into two sub-arrays: those less than a "pivot" value (ideally the median), and all others; then recursing until a sub-array is at most 256 elements large, and using our special method for sorting those. Partitioning accounts for most of the CPU time, so if we can speed it up using SIMD, we have a fast sort.

Happily, modern instruction sets (Arm SVE, RISC-V V, x86 AVX-512) include a special instruction suitable for partitioning. Given a separate input of yes/no values (whether an element is less than the pivot), this "compress-store" instruction stores to consecutive memory only the elements whose corresponding input is "yes". We can then logically negate the yes/no values and apply the instruction again to write the elements to the other partition.

But what about other instruction sets such as AVX2 that don't have compress-store? Previous work has shown how to emulate this instruction using permute instructions. This strategy has been used in an AVX-512-specific Quicksort.