MPI programming is rarely done the way I think it should be done. The tools provided by the compiler vendors, and all the instructions I've seen on the Internet, favor the simple divide-and-conquer approach, where an approximately equal amount of work is apportioned to each core or node. I consider this a failure of our educational system, so I've given a couple of lectures over the years on how to do it a little smarter. This method might not work for all problems, but then maybe those problems could be re-thought into a form that can make use of this standard software-engineering approach. For seismic data processing, the applications we wrote that use my code here are 3.5 to 5 times faster than the competition (according to Total), and they run on tens of thousands of cores processing around 1 PB of data. The initial requirement I was given for one application was to take a tiny dataset of 5 TB and process it on a 100-core machine in 1 week; the execution completed in less than 3 days.
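The alternative to a static equal split is dynamic scheduling: workers pull the next piece of work as soon as they finish the last, so a slow node never stalls the rest. The MPI code itself isn't shown here; this is only a minimal Python sketch of the idea using a thread-safe queue in place of MPI messaging, and all the names in it are mine, not from the original applications.

```python
import queue
import threading

def dynamic_schedule(tasks, n_workers, process):
    """Manager/worker scheduling: each worker pulls the next task as
    soon as it finishes the previous one, so faster workers simply do
    more tasks -- unlike a static equal split, where everyone waits
    for the slowest node to finish its fixed share."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results = queue.Queue()

    def worker():
        while True:
            try:
                t = work.get_nowait()   # pull the next available task
            except queue.Empty:
                return                  # no work left: this worker is done
            results.put(process(t))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [results.get() for _ in range(results.qsize())]
```

In real MPI code the queue lives on a manager rank and the pull is a request/reply message pair, but the load-balancing behavior is the same.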
Matrix Inversion
This method for inverting a matrix will work for any matrix, regardless of size or shape. The drawback is the amount of memory required, but there are ways to modify it to handle large sparse matrices. An early C++ version I wrote in 1992 used a template: you supplied the definitions of 0, 1, subtraction, and division for the datatype, and it even worked with complex datatypes (like differential equations).
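The generic-datatype idea can be sketched in a few lines of Python. This is a plain Gauss-Jordan elimination over whatever element type you pass in, with the zero and one elements supplied by the caller; it is my illustration of the template approach, not the original code, and for brevity it also uses multiplication (the original required only 0, 1, subtraction, and division).

```python
def invert(m, zero=0, one=1):
    """Gauss-Jordan inversion of a square matrix whose elements can be
    any datatype supporting subtraction, multiplication, division, and
    comparison against the supplied zero -- e.g. float, Fraction, or a
    user-defined field-like type."""
    n = len(m)
    # Augment with the identity: each row becomes [A_row | I_row].
    a = [list(row) + [one if i == j else zero for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Swap a row with a nonzero pivot into the pivot position.
        pivot = next(r for r in range(col, n) if a[r][col] != zero)
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]      # scale pivot row so pivot is 1
        for r in range(n):
            if r != col and a[r][col] != zero:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]             # right half is now the inverse
```

Passing `fractions.Fraction` elements gives exact rational results, the same way the C++ template could be instantiated with any type that defined the required operations.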
Limited Size Binary Tree
In the mid '90s I needed a binary tree for an embedded device that would throw away the oldest data and keep the newest, should a problem arise with the transmission of the data to the collection facility. This particular device collected electricity usage for smart household metering systems, communicating over the power lines. After the project was abandoned, someone had a question about it; the facility was opened after 7 years and the device was found to be still running, still collecting data and waiting for the transmission to work (the antenna for transmitting the data back to the facility had been stolen). Here's a simple binary tree: BTree.h BTree.c, and here's a Python version of the fixed-size one (the original was in C++).
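The core of the fixed-size idea fits in a short sketch. Assuming records are keyed by something monotonically increasing (a timestamp or sequence number), "oldest" is simply the smallest key, which in a binary search tree is the leftmost node. This is my own minimal reconstruction of the behavior, not the original C++ or Python code.

```python
class BoundedTree:
    """Binary search tree with a fixed capacity: once full, inserting
    a newer record first evicts the record with the smallest (oldest)
    key.  Nodes are stored as [key, value, left, right] lists."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.root = None
        self.size = 0

    def insert(self, key, value):
        if self.size == self.capacity:
            self._pop_oldest()                 # make room: drop oldest
        self.root = self._insert(self.root, key, value)
        self.size += 1

    def _insert(self, node, key, value):
        if node is None:
            return [key, value, None, None]
        if key < node[0]:
            node[2] = self._insert(node[2], key, value)
        else:
            node[3] = self._insert(node[3], key, value)
        return node

    def _pop_oldest(self):
        # The leftmost node holds the smallest key, i.e. the oldest data.
        parent, node = None, self.root
        while node[2] is not None:
            parent, node = node, node[2]
        if parent is None:
            self.root = node[3]                # oldest was the root itself
        else:
            parent[2] = node[3]                # splice out the leftmost node
        self.size -= 1

    def oldest(self):
        node = self.root
        while node[2] is not None:
            node = node[2]
        return node[0]
```

On the actual device the payoff was that a stalled uplink could never exhaust memory: the tree stayed at capacity and simply aged out the oldest readings.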

I've made many more little algorithm implementations over my career, but they would just clutter these pages if I got around to re-implementing them. Back in '85, for instance, while working at the Air Force Global Weather Central, I beat the Cray-supplied quicksort for selecting the best input data for an Optimal Interpolation routine in a weather analysis program, by a factor of about 4x. I removed the Householder matrix inversion routine from a 3D refraction statics program and coded in an OpenMP-parallelized modified Jacobi relaxation routine that gave a speedup of about 200x on a test dataset. And once I wrote a program in C that, when compiled and run, would output its own source code (a quine). Compile the program, run it, and pipe the output to a file; that file can be compiled and run, and its output will match its input, without referencing any external files (including its own source). I was challenged by a student on that one, and my solution kept track of which generation it was, randomly generated variable and function names each run, and toggled between logic states as well. That one took all day to create, and although the original source code is long gone, here's a recent version that keeps track of its iteration number. Compile the code with gcc repeat.c && ./a.out | tee repeat.c, and then "rinse and repeat".
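The C version isn't reproduced here, but the generation-counting trick can be sketched in three lines of Python: the program stores a template of its own source as a string, then prints that template filled in with an incremented counter and with the string itself (via `%r`). The output is a complete program whose counter is one higher.

```python
# A minimal generation-counting quine sketch (not the original repeat.c).
# Each run prints its own source with g incremented by one.
g = 0
s = 'g = %d\ns = %r\nprint(s %% (g + 1, s))'
print(s % (g + 1, s))
```

Run it with `python3 quine.py | tee quine.py` and repeat: each pass regenerates the file with the next generation number, the same "rinse and repeat" loop as the C version.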

The games can also be thought of as algorithm implementations. Of course, I never think of them that way when I'm playing them.