It’s been a very busy few weeks working on two big new things for TimelineFX: a compute shader and a new motion randomness algorithm that uses simplex noise.
Particle Compute Shader
I really wanted to get this out of my system, as I’ve always wanted to see if I could make TimelineFX effects run in a compute shader for maximum speed and to free up the CPU. This would be a big plus for anyone wanting to incorporate the TimelineFX library into their games and other projects.
It took a lot of head scratching as to how I’d make it work but in the end I think I’ve come up with a decent solution. Here are some of the problems and solutions that I ran into:
How to manage sub effects
One of the features of TimelineFX is the ability to have sub effects, where particles can emit other particles. I also wondered whether emitters should run on the GPU as well and how that might work. Of course it just isn’t feasible to have the GPU manage everything. One thing to realise about shaders is that they’re massively parallel, and it takes some wrapping your head around the fact that this is not like the more sequential style of the CPU. So the logical solution here was to simply make a rule: only bottom-level particles that have no sub effects attached to them are uploaded to the GPU to run there. The vast majority of the particles that need to be processed will be at the bottom of the hierarchy, so the rule works well.
All emitters stay on the CPU and spawn particles there. Then each frame all new particles are uploaded to the GPU particle buffer to be added to the compute shader update routine.
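As a rough illustration of that rule (not the actual TimelineFX code), the per-frame routing of newly spawned particles could look something like the sketch below. The `Particle` fields and the `RouteNewParticles` name are hypothetical and only there to make the idea concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record; the real TimelineFX struct is different.
struct Particle {
    int id;
    bool has_sub_effects;  // true if this particle emits other particles
};

// Split this frame's newly spawned particles into a batch to upload to the
// GPU buffer and a list that stays on the CPU.
inline void RouteNewParticles(const std::vector<Particle>& spawned,
                              std::vector<Particle>& gpu_upload,
                              std::vector<Particle>& cpu_particles) {
    const std::size_t before = gpu_upload.size() + cpu_particles.size();
    for (const Particle& p : spawned) {
        if (p.has_sub_effects) {
            cpu_particles.push_back(p);  // needs to spawn children, so CPU only
        } else {
            gpu_upload.push_back(p);     // bottom of the hierarchy, GPU can run it
        }
    }
    // Every spawned particle must end up in exactly one of the two lists.
    assert(gpu_upload.size() + cpu_particles.size() == before + spawned.size());
}
```

The point of the split is that the decision is made once at spawn time on the CPU, so the shader never has to ask what kind of particle it is looking at.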
How should expiring particles be handled?
On the CPU side of things I use double buffering. This means that each frame all particles are updated in one buffer, and those that haven’t expired yet are moved to the next buffer. The buffers are swapped each frame until the particles expire. This works well, especially as the ordering of particles is important for alpha blended particles. But what about on the GPU? This method doesn’t work well at all. As mentioned, the GPU operates on things in parallel, so you can’t just “push_back” to the next buffer because all the threads will be trying to do that at the same time. You could use atomics to increment an index, but then you lose the advantage of the parallelism.
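For anyone who hasn’t seen the CPU-side pattern before, here is a minimal sketch of double buffering. It is a simplified stand-in, not the library’s implementation, and the `Particle` and `DoubleBuffer` names are assumptions made for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical minimal particle: only the fields needed to decide expiry.
struct Particle {
    float age;
    float max_age;
};

struct DoubleBuffer {
    std::vector<Particle> current;  // particles alive at the start of the frame
    std::vector<Particle> next;     // survivors are appended here, in order

    // Age every particle and keep only the ones that have not expired.
    void Update(float dt) {
        assert(dt >= 0.f);
        next.clear();
        for (Particle p : current) {
            p.age += dt;
            if (p.age < p.max_age) {
                next.push_back(p);  // sequential push keeps the draw order stable
            }
        }
        std::swap(current, next);   // survivors become the input for next frame
    }
};
```

The sequential `push_back` is exactly the part that has no cheap equivalent when thousands of GPU threads run the loop body at once.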
The solution? A ring buffer. It’s funny because in early versions of the TimelineFX rewrite I initially used ring buffers, but I ran into a problem when single particles were used (particles that don’t expire). They would clog up the buffer and get stuck at the front of the queue, which is why I opted for the double buffering approach (having said this, I’d like to revisit ring buffers and maybe handle single particles in a separate buffer). That problem still exists with the compute shader as well, but it’s much simpler to handle: simply don’t send single particles to the GPU and let the CPU update them instead. Again, the vast majority of particles are not single particles, so it’s not an issue to let the CPU handle them.
The way the ring buffer works is quite simple: you have a start index, the current length of the section of the buffer being used, and the max index. As you add new particles, if you hit the end of the buffer it loops back to index 0. As particles expire, the start index is bumped up. There was another problem to deal with though: how does the compute shader know how to bump the start index? Well, it doesn’t, not without adding a bunch of conditions into the shader, and if statements and branching in general are bad for performance, so it made sense to let the CPU do it instead.
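The bookkeeping described above can be sketched in a few lines. This only tracks indices (the particle data itself would live in the GPU buffer), and the `RingBuffer`, `Push` and `Bump` names are made up for the example rather than taken from the library.

```cpp
#include <cassert>
#include <cstdint>

// Index bookkeeping for a ring buffer of particles.
struct RingBuffer {
    std::uint32_t start = 0;     // slot of the oldest particle still in use
    std::uint32_t length = 0;    // number of slots currently in use
    std::uint32_t capacity = 0;  // total number of slots (max index + 1)

    // Map an offset from the front to a physical slot, wrapping past the end.
    std::uint32_t SlotAt(std::uint32_t offset) const {
        assert(capacity > 0);
        return (start + offset) % capacity;
    }

    // Reserve the next slot for a new particle; returns false if the buffer is full.
    bool Push(std::uint32_t& slot_out) {
        if (length == capacity) return false;
        slot_out = SlotAt(length);
        ++length;
        return true;
    }

    // Release `count` expired particles from the front of the buffer.
    void Bump(std::uint32_t count) {
        assert(count <= length);
        if (count == 0) return;
        start = SlotAt(count);
        length -= count;
    }
};
```

With a capacity of 4, four pushes fill slots 0 to 3; after bumping the start by 2, the next push wraps around and reuses slot 0.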
So how does that work considering the particle buffer is on the GPU? Well, we simply sample the front of the ring buffer and then have the CPU loop through the sampled particles until it finds a non-expired particle, and it stops there, bumping the start index up to that point. The amount that you have to sample is proportional to the number of new particles you are adding; if you don’t sample enough, the ring buffer will end up growing until it runs out of space. On a simple test I did with a million particles updating each frame, it needed to sample about 5k particles (that’s with 5k new particles each frame), which it managed without issue. I’ve managed to get the particle struct down to 64 bytes, so that would be about 312KB that it needs to copy from the GPU each frame. Having written that though, why don’t I just store the age/max_age of the particle in a separate buffer? Then I’d only have to download 40KB each frame for 5k particles. Will have to look into that!
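Here is a small sketch of that CPU-side scan, assuming the sampled data has already been copied back from the GPU into a plain array. The `ParticleAge` struct and function names are hypothetical, and the readback helper just reproduces the arithmetic from the paragraph above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slimmed-down record: two floats, so 8 bytes per particle.
struct ParticleAge {
    float age;
    float max_age;
};

// Count expired particles at the front of the sample, stopping at the first live one.
inline std::size_t CountExpiredAtFront(const std::vector<ParticleAge>& sampled) {
    std::size_t expired = 0;
    while (expired < sampled.size() &&
           sampled[expired].age >= sampled[expired].max_age) {
        ++expired;
    }
    return expired;
}

// Move the ring buffer start index forward by `expired` slots, wrapping at the end.
inline std::uint32_t AdvanceStart(std::uint32_t start, std::size_t expired,
                                  std::uint32_t capacity) {
    assert(capacity > 0);
    return static_cast<std::uint32_t>((start + expired) % capacity);
}

// Bytes copied back from the GPU per frame for a given sample size.
inline std::size_t ReadbackBytes(std::size_t sample_count,
                                 std::size_t bytes_per_particle) {
    return sample_count * bytes_per_particle;
}
```

Sampling 5,000 full 64-byte particles is 320,000 bytes (roughly 312KB), while 5,000 age/max_age pairs at 8 bytes each is 40,000 bytes, which is where the 40KB figure comes from.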
If you’re wondering about particles that expire somewhere in the middle of the buffer, they are still processed, but the sprite is set to 0 size so that they’re not rendered. This is the best compromise in my view.
GPUs don’t like branching (if statements)
If you look at the ControlParticle function in the TimelineFX library, this is basically the function that needed to be translated to the compute shader, and it contains a few if statements. Eliminating these in the compute shader has been pretty straightforward, basically by using negators. For example, if a particle should remain relative to the position of the emitter then it should be transformed into position each frame. I put an if/else on the CPU for this, but in the compute shader I do the transform regardless and use the relative flag to cancel out the result when it isn’t wanted.
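To show the general idea of a negator, here is a CPU-side sketch written in C++ rather than shader code. It is not the real ControlParticle logic: the transform is reduced to a simple translation by the emitter position, and `ApplyRelative` and `SpriteSize` are hypothetical names. The second function applies the same trick to expired particles, which are still processed but end up with a zero-size sprite.

```cpp
#include <cassert>

struct Vec2 {
    float x;
    float y;
};

// Always compute the emitter offset, then let the flag (0 or 1) cancel it out.
// relative == 1: the particle follows the emitter; relative == 0: it is left alone.
inline Vec2 ApplyRelative(Vec2 local, Vec2 emitter, float relative) {
    assert(relative == 0.f || relative == 1.f);
    return Vec2{local.x + emitter.x * relative,
                local.y + emitter.y * relative};
}

// Expired particles keep flowing through the update, but the comparison result
// (converted to 0 or 1) scales their sprite down to nothing.
inline float SpriteSize(float size, float age, float max_age) {
    const float alive = static_cast<float>(age < max_age);
    return size * alive;
}
```

Every thread does the same multiply whether or not the result is used, which is the trade being made: a little wasted arithmetic in exchange for no divergent branches.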
The other main condition was with the motion randomness. This is what made me look into an alternative solution, and I ended up implementing simplex noise (I should have done it a lot sooner!). Simplex noise produces far better results for random movement, making it look a lot more natural and organic. There is a cost in speed, but I think it’s more than worth it. The old motion randomness would count a few frames and change course when the counter was hit, but I didn’t want this sometimes-do-sometimes-don’t situation in the compute shader due to the branching. Having said that, I do still have the “if noise” condition in there, as noise is an expensive function and for particles that don’t use it, it’s probably faster to have the condition than not.
Results?
I think I’ll be optimising a lot going forward, but for starters I’m very pleased that particles on the GPU look identical to CPU updated particles, which is obviously important. Secondly, they do run extremely fast. I’m on a mediocre card (GTX 1660) and it updated 1 million particles at about 150fps without simplex noise, or about 250k particles at 150fps with simplex noise. This is with smallish particles, so fill rate wouldn’t be so much of an issue. What I really need to do though is start profiling in NVIDIA Nsight to see where the bottlenecks are, as I’m sure I can improve performance more.
Editor Updates
As mentioned, the big addition in the editor is the replacement of Motion Randomness with a simplex noise algorithm. With this there are 6 new attributes to control noise:
Base Noise Offset
The noise algorithm is passed the X and Y coordinates of the particle and it passes back a noise value between -1 and 1. If you think of the x and y coords as plotting an area on a graph, then the offset simply shifts where on the graph you plot. A good idea here is to put the emitter on a loop and slowly change the offset over time. See the Motion Randomness effects in the LibraryExamples that come with the alpha version to see what I mean. Having said this, it’s very likely that I’ll add a property that will do the same thing in the future, as this creates some great results.
Variation Noise Offset
You can also vary the offset using this attribute. Higher values here will make the movement much less uniform and more random.
Variation Noise Resolution
I was at odds as to where to put this but for now decided on variation, though I think I’ll probably move it to base now that I think about it. This basically changes the resolution of the x,y lookup, so higher numbers will mean wider arcs in the movement of each particle.
Overtime Velocity Turbulance
This will use the noise value to decide the speed of the particle.
Overtime Direction Turbulance
This will use the noise value to decide the direction of the particle.
Overtime Noise Resolution
This will scale the value that you put in Noise Resolution so that you can change the amount of resolution over the lifetime of the particle.
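To give a feel for how these attributes might fit together, here is a very rough sketch. It is a guess at the shape of the lookup, not the editor’s actual maths: `StandInNoise` is a simple smooth function used in place of real simplex noise, dividing by the resolution is an assumption based on higher values giving wider arcs, and `NoiseLookup`, `SampleNoise` and `ApplyTurbulence` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for simplex noise: a smooth 2D function that stays within [-1, 1].
// Real simplex noise looks far more organic; this only keeps the example short.
inline float StandInNoise(float x, float y) {
    return std::sin(x) * std::cos(y);
}

// Hypothetical bundle of the lookup attributes described above.
struct NoiseLookup {
    float offset;      // shifts where on the noise "graph" the particle samples
    float resolution;  // larger values stretch the noise out (assumed behaviour)
};

// Sample noise at the particle position, scaled by resolution and shifted by offset.
inline float SampleNoise(float px, float py, const NoiseLookup& n) {
    assert(n.resolution > 0.f);
    return StandInNoise(px / n.resolution + n.offset,
                        py / n.resolution + n.offset);
}

// Nudge a base value (speed or direction) by the noise value times a turbulence amount.
inline float ApplyTurbulence(float base, float noise, float amount) {
    return base + noise * amount;
}
```

Slowly changing `offset` over the life of a looping emitter moves the whole sampling window, which is one way to picture the trick mentioned under Base Noise Offset.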
I’ve been playing around a lot with these new attributes and have been impressed by the different effects that can be created. I think it will open up a lot of new possibilities in the future. I’m already thinking about the option to use noise to decide where a particle will spawn in an area.
As usual here is the full list of latest changes:
* Removed the effect list in a child window for now.
* Fixed a Vulkan validation issue relating to the animation tab.
* Fixed issue with effect libraries that contained imported effects.
* Improved how velocity makes particles align when they have Align selected in properties.
* Added new Emission angle type on the properties tab. This will make the particle align on emission only and not stay aligned as the particle changes direction thereafter.
* Put emitter size onto the emitter tab for easier access. You can still change the emitter size overtime using the graphs under attributes.
* Moved option to only play the selected emitter onto the Settings menu.
* Setting that changes the updates per second now works correctly.
* When loop length has a value on the properties tab it will now make the sine/square wave generator set its width to that length.
* Global zoom attribute now works a lot better and has been renamed to Overal Scale.
* Replaced the current motion randomness algorithm with a simplex noise algorithm. This means there are additional attributes available: Noise Offset, Noise Resolution, Velocity Turbulance and Direction Turbulance.
* Option to sync refresh rate now properly applies on start up according to whatever setting was saved last.