
Vectorization doesn't always speed things up because of SIMD math.

Sometimes it speeds things up because it forces you to overlap independent dependency chains.

New post:
https://johnnysswlab.com/exposing-more-parallelism-is-the-hidden-reason-why-some-vectorized-loops-are-faster-not-vectorization-per-se/
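A minimal sketch of the idea, not code from the post: the function names `sum_single_chain` and `sum_four_chains` are made up for illustration. The first loop has one dependency chain, so each addition waits for the previous one. The second splits the work into four independent chains that the CPU can overlap, without any SIMD instructions.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
double sum_single_chain(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i];
    }
    return sum;
}

// Four accumulators: four independent dependency chains that can be
// in flight at the same time. This is still scalar code.
double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the numbers in a different order, so floating-point results can differ slightly in rounding. That is also why compilers do not apply this transformation on their own unless flags such as `-ffast-math` allow it.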

🚀 New on Johnny’s Software Lab: Handling floating-point errors in C++ without killing performance!
NaNs, infinities, sticky bits, and traps — what works and what’s a trap 🪤.

#Cpp #Performance #FloatingPoint #NaN #Infinity

📬 Mailing list is live!
New articles + workshop dates (AVX, NEON) straight to your inbox.

👉 Go to johnnysswlab.com
➡️ Enter your email in the box on the right

Still thinking #Java is slow? A deep dive into Java vs C++ performance shows where its strengths and weaknesses lie.

https://johnnysswlab.com/deep-dive-in-java-vs-c-performance/

#java #javaperformance #garbagecollector #jvm

9 Things Every Fresh Graduate Should Know About Performance

https://johnnysswlab.com/9-things-every-fresh-graduate-should-know-about-software-performance/

We investigate vector functions, more specifically, how to make your vector function available to the compiler's autovectorizer!

#vectorfunctions #simd #openmp #omp

https://johnnysswlab.com/the-messy-reality-of-simd-vector-functions/
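For a rough idea of what this looks like with OpenMP, here is a small sketch. It is not taken from the post, and `scale_and_offset` and `apply` are hypothetical names. `#pragma omp declare simd` asks the compiler to generate vector variants of a scalar function, and `#pragma omp simd` asks it to vectorize the calling loop.

```cpp
#include <cstddef>

// Ask the compiler to emit SIMD variants of this scalar function
// in addition to the ordinary scalar version.
#pragma omp declare simd
double scale_and_offset(double x) {
    return 2.0 * x + 1.0;
}

// A loop that calls the function; with OpenMP SIMD support enabled,
// the autovectorizer may call a vector variant here.
void apply(const double* in, double* out, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_and_offset(in[i]);
    }
}
```

With GCC or Clang the pragmas take effect when compiling with `-fopenmp` or `-fopenmp-simd`; without those flags they are ignored and the code runs as plain scalar code. Whether a vector variant is actually used depends on the compiler and target, so it is worth checking the generated assembly.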

Does it matter if we are compiling with optimizations off (O0) or optimizations on (O3) if the problem is memory bound? Let’s find out…

#optimizations #performance #instructionlevelparallelism #ilp #compiler #gcc #memorybound

https://johnnysswlab.com/an-optimizing-compiler-doesnt-help-much-with-long-instruction-dependencies/
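To make "long instruction dependencies" concrete, here is an illustrative sketch (an assumption for this feed, not the benchmark from the post; `Node`, `sum_list` and `sum_array` are hypothetical names). In the linked-list loop, the address of the next load is only known once the previous load finishes, so the loads form one long chain. In the array loop, all addresses are known up front.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Pointer chasing: each iteration must wait for the previous load of
// `next` before it can even start its own load.
long sum_list(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}

// Array traversal: the addresses are independent of the loaded data,
// so many loads can be in flight at once.
long sum_array(const std::vector<int>& v) {
    long total = 0;
    for (int x : v) {
        total += x;
    }
    return total;
}
```

Both functions compute the same result; the difference is only in how much work the CPU can overlap, which the compiler cannot change by optimizing the loop body.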

Last chance to register for the AVX vectorization workshop!

More info: https://johnnysswlab.com/avx-neon-vectorization-workshop/
Register: info@johnnysswlab.com

A post on how to grow data buffers without memcpy:

https://johnnysswlab.com/growing-buffers-to-avoid-copying-data/

Your program doesn’t run fast enough? You need someone to talk to about your software’s performance? You or your team want to learn to write faster software? Whatever it is, we can help you. Check out the consulting page for more info.

https://johnnysswlab.com/consulting/

Another AVX workshop, this time 4 half-days.

Registration or inquiry: info@johnnysswlab.com

More info: https://johnnysswlab.com/avx-neon-vectorization-workshop/

We debug a performance problem by simulating the CPU with #llvm-mca, part of @llvmorg.

Have a sneak peek at what the CPU does while it is executing your program!

https://johnnysswlab.com/performance-debugging-with-llvm-mca-simulating-the-cpu/

Register now or express interest: info@johnnysswlab.com
More info: https://johnnysswlab.com/avx-neon-vectorization-workshop/

You can find the full interactive example here. Note: when the example opens, you will need to switch the active thread from `perf` to `ffmpeg` in the top center part of the screen and also pick `Left Heavy` in the top left part.

https://www.speedscope.app/#profileURL=https%3A%2F%2Fraw.githubusercontent.com%2Fibogosavljevic%2Fjohnysswlab%2Fmaster%2F2021-03-speedscope%2Fspeedscope-ffmpeg.txt

In the above picture, the horizontal bars are functions, and the wider the bar, the more time is spent there.

Stacks represent the call hierarchy: functions that call other functions stack on top of each other.

It's very easy to see which functions take the most time.

Are you familiar with flamegraphs? Flamegraphs are a great way to visualize where your program is spending time or other resources.

Here is a screenshot of a flamegraph taken from ffmpeg converting a movie from one format to another.

https://johnnysswlab.com/avx-neon-vectorization-workshop/

Everything you need to speed up your memory-intensive program:

https://johnnysswlab.com/memory-subsystem-optimizations/

After more than 1 year of development, we are proud to announce:

𝑶𝒖𝒓 𝑭𝒊𝒓𝒔𝒕 𝑽𝒆𝒄𝒕𝒐𝒓𝒊𝒛𝒂𝒕𝒊𝒐𝒏 𝑾𝒐𝒓𝒌𝒔𝒉𝒐𝒑

The vectorization workshop is a 2-day event that teaches you how to make your software faster by using NEON vectorization extensions for ARM CPUs and AVX vectorization extensions for Intel and AMD CPUs.

We start lightly and move on to explain more difficult concepts needed for efficient vectorization.

More info and registration here:
https://johnnysswlab.com/avx-neon-vectorization-workshop/