Ramiro Polla
|
87052c0933
|
swscale/x86: add sse4 and avx2 {lum,chr}ConvertRange16
chrRangeFromJpeg16_1920_c: 3153.9
chrRangeFromJpeg16_1920_sse4: 1770.0 (1.78x)
chrRangeFromJpeg16_1920_avx2: 891.5 (3.54x)
chrRangeToJpeg16_1920_c: 3165.0
chrRangeToJpeg16_1920_sse4: 1953.2 (1.62x)
chrRangeToJpeg16_1920_avx2: 973.0 (3.25x)
lumRangeFromJpeg16_1920_c: 1298.5
lumRangeFromJpeg16_1920_sse4: 886.5 (1.46x)
lumRangeFromJpeg16_1920_avx2: 447.7 (2.90x)
lumRangeToJpeg16_1920_c: 1905.0
lumRangeToJpeg16_1920_sse4: 993.0 (1.92x)
lumRangeToJpeg16_1920_avx2: 498.9 (3.82x)
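The ratios in parentheses appear to be the C reference timing divided by the SIMD timing. A minimal sketch of that arithmetic, assuming this interpretation (the helper name `speedup` is hypothetical and not part of swscale or checkasm):

```c
#include <assert.h>

/* Hypothetical helper: speedup of an optimized function relative to
 * the C reference, given both timings in the same unit. */
static double speedup(double ref_time, double opt_time)
{
    assert(opt_time > 0.0);
    return ref_time / opt_time;
}
```

For example, `speedup(3153.9, 891.5)` is about 3.54, which matches the chrRangeFromJpeg16_1920_avx2 line.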
|
2024-12-05 21:10:29 +01:00 |
|
Ramiro Polla
|
be108ebcf4
|
swscale/x86/range_convert: update sse2 and avx2 range_convert functions to new API
chrRangeFromJpeg8_1920_c: 2127.4 (1.00x)
chrRangeFromJpeg8_1920_sse2: 816.0 (2.61x) 813.5 (2.62x)
chrRangeFromJpeg8_1920_avx2: 408.9 (5.20x) 405.4 (5.25x)
chrRangeToJpeg8_1920_c: 3166.9 (1.00x)
chrRangeToJpeg8_1920_sse2: 815.0 (3.89x) 815.0 (3.89x)
chrRangeToJpeg8_1920_avx2: 404.5 (7.83x) 405.5 (7.81x)
lumRangeFromJpeg8_1920_c: 1263.0 (1.00x)
lumRangeFromJpeg8_1920_sse2: 411.0 (3.07x) 413.2 (3.06x)
lumRangeFromJpeg8_1920_avx2: 200.5 (6.30x) 201.9 (6.26x)
lumRangeToJpeg8_1920_c: 1886.8 (1.00x)
lumRangeToJpeg8_1920_sse2: 412.0 (4.58x) 408.9 (4.61x)
lumRangeToJpeg8_1920_avx2: 208.5 (9.05x) 205.7 (9.17x)
|
2024-12-05 21:10:29 +01:00 |
|
Ramiro Polla
|
2d1358a84d
|
swscale/range_convert: saturate output instead of limiting input
For bit depths <= 14, the result is saturated to 15 bits.
For bit depths > 14, the result is saturated to 19 bits.
x86_64:
chrRangeFromJpeg8_1920_c: 2126.5 2127.4 (1.00x)
chrRangeFromJpeg16_1920_c: 2331.4 2325.2 (1.00x)
chrRangeToJpeg8_1920_c: 3163.0 3166.9 (1.00x)
chrRangeToJpeg16_1920_c: 3163.7 2152.4 (1.47x)
lumRangeFromJpeg8_1920_c: 1262.2 1263.0 (1.00x)
lumRangeFromJpeg16_1920_c: 1079.5 1080.5 (1.00x)
lumRangeToJpeg8_1920_c: 1860.5 1886.8 (0.99x)
lumRangeToJpeg16_1920_c: 1910.2 1077.0 (1.77x)
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28836.2 28835.2 (1.00x)
chrRangeFromJpeg16_1920_c: 28840.1 28839.8 (1.00x)
chrRangeToJpeg8_1920_c: 44196.2 23074.7 (1.92x)
chrRangeToJpeg16_1920_c: 36527.3 17318.9 (2.11x)
lumRangeFromJpeg8_1920_c: 15388.5 15389.7 (1.00x)
lumRangeFromJpeg16_1920_c: 15389.3 15388.2 (1.00x)
lumRangeToJpeg8_1920_c: 23069.7 19227.8 (1.20x)
lumRangeToJpeg16_1920_c: 19227.8 15387.0 (1.25x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6334.7 6324.4 (1.00x)
chrRangeFromJpeg16_1920_c: 6336.0 6339.9 (1.00x)
chrRangeToJpeg8_1920_c: 11474.5 9656.0 (1.19x)
chrRangeToJpeg16_1920_c: 9640.5 6340.4 (1.52x)
lumRangeFromJpeg8_1920_c: 4453.2 4422.0 (1.01x)
lumRangeFromJpeg16_1920_c: 4414.2 4420.9 (1.00x)
lumRangeToJpeg8_1920_c: 6645.0 5949.1 (1.12x)
lumRangeToJpeg16_1920_c: 6005.2 4446.8 (1.35x)
NOTE: All SIMD optimizations for range_convert have been disabled,
except for x86, which already had the same behaviour.
They will be re-enabled once they are fixed for each architecture.
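The saturation rule stated at the top of this message can be sketched as follows. This is an illustration only, not the swscale code: the helper names are hypothetical, and clamping to an unsigned range with a lower bound of 0 is an assumption, since the message only gives the bit widths.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v to [0, 2^bits - 1]. The lower bound of 0 is an assumption. */
static int clip_unsigned_bits(int64_t v, int bits)
{
    assert(bits > 0 && bits < 31);
    const int64_t max = (INT64_C(1) << bits) - 1;
    if (v < 0)
        return 0;
    if (v > max)
        return (int)max;
    return (int)v;
}

/* Hypothetical helper: width the converted result is saturated to,
 * following the rule in the commit message (<= 14 -> 15, > 14 -> 19). */
static int saturation_bits(int bit_depth)
{
    return bit_depth <= 14 ? 15 : 19;
}

/* Saturate an already range-converted value for the given bit depth. */
static int saturate_result(int64_t v, int bit_depth)
{
    return clip_unsigned_bits(v, saturation_bits(bit_depth));
}
```

With these definitions, `saturate_result(40000, 8)` gives 32767, while `saturate_result(40000, 16)` leaves the value at 40000 because it fits in 19 bits.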
|
2024-12-05 21:10:29 +01:00 |
|
James Almer
|
fcf72966a5
|
swscale/x86/range_convert: add missing AVX2 preprocessor wrapper
Fixes compilation with old yasm.
Signed-off-by: James Almer <jamrial@gmail.com>
|
2024-06-16 10:09:38 -03:00 |
|
James Almer
|
8a4c9d6bd3
|
swscale/x86/range_convert: reduce amount of xmm regs clobbered in luma functions
Signed-off-by: James Almer <jamrial@gmail.com>
|
2024-06-15 21:02:06 -03:00 |
|
Ramiro Polla
|
f6859cade3
|
swscale/x86: add sse2 and avx2 {lum,chr}ConvertRange
chrRangeFromJpeg_8_c: 22.3
chrRangeFromJpeg_8_sse2: 13.3
chrRangeFromJpeg_8_avx2: 13.3
chrRangeFromJpeg_24_c: 72.8
chrRangeFromJpeg_24_sse2: 22.3
chrRangeFromJpeg_24_avx2: 17.5
chrRangeFromJpeg_128_c: 345.5
chrRangeFromJpeg_128_sse2: 106.0
chrRangeFromJpeg_128_avx2: 57.8
chrRangeFromJpeg_144_c: 380.5
chrRangeFromJpeg_144_sse2: 118.5
chrRangeFromJpeg_144_avx2: 62.3
chrRangeFromJpeg_256_c: 646.3
chrRangeFromJpeg_256_sse2: 218.8
chrRangeFromJpeg_256_avx2: 109.0
chrRangeFromJpeg_512_c: 1461.5
chrRangeFromJpeg_512_sse2: 426.5
chrRangeFromJpeg_512_avx2: 211.5
chrRangeToJpeg_8_c: 37.8
chrRangeToJpeg_8_sse2: 10.5
chrRangeToJpeg_8_avx2: 14.0
chrRangeToJpeg_24_c: 114.3
chrRangeToJpeg_24_sse2: 23.5
chrRangeToJpeg_24_avx2: 16.3
chrRangeToJpeg_128_c: 633.5
chrRangeToJpeg_128_sse2: 107.5
chrRangeToJpeg_128_avx2: 55.0
chrRangeToJpeg_144_c: 758.3
chrRangeToJpeg_144_sse2: 132.0
chrRangeToJpeg_144_avx2: 64.5
chrRangeToJpeg_256_c: 1345.0
chrRangeToJpeg_256_sse2: 218.0
chrRangeToJpeg_256_avx2: 105.3
chrRangeToJpeg_512_c: 2524.0
chrRangeToJpeg_512_sse2: 417.0
chrRangeToJpeg_512_avx2: 218.8
lumRangeFromJpeg_8_c: 11.8
lumRangeFromJpeg_8_sse2: 11.0
lumRangeFromJpeg_8_avx2: 10.3
lumRangeFromJpeg_24_c: 38.5
lumRangeFromJpeg_24_sse2: 15.5
lumRangeFromJpeg_24_avx2: 12.5
lumRangeFromJpeg_128_c: 232.3
lumRangeFromJpeg_128_sse2: 60.0
lumRangeFromJpeg_128_avx2: 26.8
lumRangeFromJpeg_144_c: 259.5
lumRangeFromJpeg_144_sse2: 65.3
lumRangeFromJpeg_144_avx2: 29.0
lumRangeFromJpeg_256_c: 464.5
lumRangeFromJpeg_256_sse2: 107.5
lumRangeFromJpeg_256_avx2: 54.0
lumRangeFromJpeg_512_c: 897.5
lumRangeFromJpeg_512_sse2: 224.5
lumRangeFromJpeg_512_avx2: 109.8
lumRangeToJpeg_8_c: 17.8
lumRangeToJpeg_8_sse2: 11.0
lumRangeToJpeg_8_avx2: 11.8
lumRangeToJpeg_24_c: 56.3
lumRangeToJpeg_24_sse2: 11.0
lumRangeToJpeg_24_avx2: 12.5
lumRangeToJpeg_128_c: 333.8
lumRangeToJpeg_128_sse2: 53.3
lumRangeToJpeg_128_avx2: 26.5
lumRangeToJpeg_144_c: 375.5
lumRangeToJpeg_144_sse2: 60.8
lumRangeToJpeg_144_avx2: 29.0
lumRangeToJpeg_256_c: 652.0
lumRangeToJpeg_256_sse2: 109.5
lumRangeToJpeg_256_avx2: 53.5
lumRangeToJpeg_512_c: 1284.3
lumRangeToJpeg_512_sse2: 218.0
lumRangeToJpeg_512_avx2: 108.3
|
2024-06-16 00:35:51 +02:00 |
|