Not so "power" shell


So it's back again. Like a phoenix that rises from the ashes (Logical volume idiocy) this blog can never die. Like the hydra with its nine heads (The nine times I completely hosed this server) it shall never fail.

For those not keeping up at home, I managed to nuke this blog and most of my Proxmox single node server with it. Thankfully I have backups, but alas the archive for this very site was corrupted... Testing backups is too hard for me apparently.
So for everyone that managed to catch a glimpse of posts 1 and 2, I am eternally sorry. For everyone else, consider yourselves lucky.

Considering I made a really dumb series of decisions that led to me obliterating some 15 virtual machines and their associated logical volumes, I figured I'd make post 3 about another time I fucked up. Since I spend so much time ranting about other people it's only fair.

Do you need some stamps?

Oh look, another post about email, yay...
Look, you'd be forgiven for thinking all I do is manage email all day (all I do is manage email all day), but that's not true.
I also sometimes send emails.

In all seriousness, my place of work has a relatively complex mail setup. Not stupidly complex, but also not the average "email in -> Exchange Online -> email out" setup either. Due to the nature of a large organisation with highly varying requirements per department, a one-size-fits-all approach rarely works, and because of the evolution of mail systems and technology we are inevitably left with workarounds and "fixes".

Below is a "flow chart" that gives a very high level overview of the internal mail flow at work. I'm probably not using the right symbols, but the bottle of merlot that is helping me write doesn't know better, so go easy on it.

There is so much more to it but I don't want you to have an aneurysm.

As you might have guessed from the flow chart, a component of the internal email system is called "Postoffice". This is a classic holdover from ye olden days of email when large institutions would have some sort of complex and arcane mail system made of 30000 moving parts, often collectively referred to as Postoffice. At my org we still refer to the remnants of this system as Postoffice, but its actual functionality is a far cry (Thankfully) from the older systems sharing the same name. Postoffice is simply a collection of on-premises email relays, 5 in fact, 3 of which receive mail and make sure the sender is authorised, and 2 of which provide some mangling capabilities (That's the official postmaster's term, it's true, just go ask your mail man).
Since most email communication now occurs through services like Microsoft 365 Exchange Online and Google Workspace, you'd be right to question why on-premises mail relays still exist.

The fact of the matter is, devices and software are dumb.

So many "modern" pieces of software or "up-to-date" devices (Printers!?!?!?!) cannot do what the tech industry calls "modern auth": they simply want to perform standard SMTP authentication, maybe encrypted if you're lucky, and call it a day. Some particularly vexing devices can't even do authentication properly and simply want to barf unencrypted and unauthenticated SMTP traffic on port 25 at whatever will listen.
This all amounts to these horrifically designed pieces of software and hardware needing some assistance in reaching the bin (sorry, the mailboxes) of end users.

Getting my fix

To allow Postoffice to route mail to the correct place, it needs to know about all the addresses it should accept mail from and deliver to.
As per the flow chart above, if mail hits Postoffice with an external destination address, it simply forwards it on to the Mimecast mail gateway.
If mail hits Postoffice with an org address one of two things could happen:

  1. If the destination address is known (It's in the address map table) then Postoffice forwards the mail to the mapped address.
  2. If the destination address is not known but is an org address, Postoffice rejects the mail and discards it.

The recipient maps table looks something like this:

some-dl@org.tld      some-dl@org.onmicrosoft.com

As you can see, any mail addressed to some-dl@org.tld that lands on Postoffice would be sent on to some-dl@org.onmicrosoft.com. ".onmicrosoft.com" domains are added automatically when objects are created in Exchange Online and provide a convenient way to make sure mail lands in the Exchange Online tenant. The reason the org.tld address can't be used as the destination is because Postoffice knows org.tld as its own domain and would try to process the mail itself. We need something to forward it to.

That's all super great but screw adding all known addresses to the database by hand. What other methods could we use?

Back when the relays ran Sendmail, the best way to handle the address mappings was a flat file delivered by the integration team, who generated it from an MSSQL database table. That table was populated via a web UI by staff who needed Postoffice to deliver to a specific address.
Sure that works, but it's still got an element of human interaction, and honestly, delivering flat files to hosts over SCP and then mangling them and reloading services is very 2005.
Can we do better?

Of course we can: enter Postfix (That website reeks of Open Source software dev design), a powerful Mail Transfer Agent that features lots of extensibility and functionality. Postfix can do anything, I heard it even makes a really good cup of tea!
So the plan to eliminate all the human interaction and ferrying of files around was to have Postfix read the database mappings directly. Postfix can't actually talk to MSSQL, but it can talk to PostgreSQL, and PostgreSQL has the FOREIGN table type. Essentially we can map the MSSQL table columns (The ones we care about) onto PostgreSQL columns and let Postfix query them as if they were a local table.
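
To give a rough idea of what that plumbing looks like (every server, database, table and column name below is a placeholder I've made up for illustration, not our actual setup), the PostgreSQL side uses a foreign data wrapper such as tds_fdw to expose the MSSQL table:

-- Sketch only: expose an MSSQL mapping table to PostgreSQL via tds_fdw.
-- All names here are hypothetical.
CREATE EXTENSION tds_fdw;

CREATE SERVER mssql_maps FOREIGN DATA WRAPPER tds_fdw
    OPTIONS (servername 'mssql.example.internal', port '1433', database 'MailMaps');

CREATE USER MAPPING FOR postfix SERVER mssql_maps
    OPTIONS (username 'maps_reader', password 'not-this-one');

CREATE FOREIGN TABLE recipient_map (source text, destination text)
    SERVER mssql_maps
    OPTIONS (table_name 'dbo.RecipientMap');

Postfix then reads that table through one of its pgsql: lookup maps, something along these lines (again, hypothetical names):

# /etc/postfix/pgsql-recipient-map.cf (sketch)
hosts = pg.example.internal
user = postfix
password = not-this-one-either
dbname = mailmaps
query = SELECT destination FROM recipient_map WHERE source = '%s'

Wire that into main.cf with something like "virtual_alias_maps = pgsql:/etc/postfix/pgsql-recipient-map.cf", and you can sanity check a mapping from the command line with: postmap -q "some-dl@org.tld" pgsql:/etc/postfix/pgsql-recipient-map.cf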

How do we populate the database? Well, here's where the shit hit the fan... I mean, where the fun begins!

Postman Pat pathetic

Originally I had thought to use a Python script to read the Microsoft Graph API and get a list of mailboxes and their ".onmicrosoft.com" aliases. Unfortunately the Graph API isn't complete and can't return enough info for this use case; maybe in the future.

The only tool that could query Exchange Online and provide all the information I needed was Powershell. That's fine, I'll use that.
Being a Linux admin and nerd, I decided to use a Linux box with cron to schedule it to run at 1AM daily. Thank you Microsoft for porting Powershell to Linux, I can run this little sync script on a VM with 1 core and 1GB of RAM. The Windows equivalent would have been around 2 cores and 4GB of RAM to run a single script on a schedule... yeahhhhh no.
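
For what it's worth, the scheduling side is nothing fancier than a crontab entry along these lines (the script path and log file are made-up examples, not the real ones):

# Hypothetical crontab entry: run the sync at 01:00 every day
0 1 * * * /usr/bin/pwsh -NoProfile -File /opt/postoffice/Sync-RecipientMap.ps1 >> /var/log/postoffice-sync.log 2>&1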

So the script ran as it should for over a year with no issues. I even checked up on it from time to time just to make sure it didn't do anything weird. The process looked like this (there's a rough sketch of it after the list):

  1. The script connects to Exchange Online.
  2. It gets all known distribution group and shared mailbox addresses and their aliases.
  3. It stores the details in a hash.
  4. It disconnects from Exchange Online.
  5. It connects to MSSQL (Using some horrifically insecure OpenSSL settings, thanks Windows Server 2008).
  6. It reads all the data from the table and builds a hash.
  7. It compares the hashes.
  8. If an address is missing from the MSSQL table that appears in Exchange Online, it adds it to the table.
  9. If an address that is in the MSSQL table doesn't appear in Exchange Online, it removes it from the table.
  10. It disconnects from MSSQL.
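
Roughly speaking, that flow looks like the sketch below. This is from memory, not the production script: the cmdlets (ExchangeOnlineManagement and SqlServer modules) are real, but the table name, column names, auth details and server names are all placeholders.

# Sketch of the sync flow; everything org-specific is a made-up placeholder.
Import-Module ExchangeOnlineManagement
Import-Module SqlServer

# 1. Connect to Exchange Online (app-only auth shown as an example)
Connect-ExchangeOnline -AppId $appId -CertificateThumbprint $thumb -Organization 'org.onmicrosoft.com'

# 2-3. Collect distribution groups and shared mailboxes, keep each org.tld
#      address and its .onmicrosoft.com alias in a hash
$exoMap = @{}
$recipients = @(Get-DistributionGroup -ResultSize Unlimited) +
              @(Get-Mailbox -RecipientTypeDetails SharedMailbox -ResultSize Unlimited)
foreach ($r in $recipients) {
    $alias = ($r.EmailAddresses |
        Where-Object { $_ -like 'smtp:*@org.onmicrosoft.com' } |
        Select-Object -First 1) -replace '^smtp:', ''
    if ($alias) { $exoMap[$r.PrimarySmtpAddress.ToString()] = $alias }
}

# 4. Done with Exchange Online
Disconnect-ExchangeOnline -Confirm:$false

# 5-6. Read the current mappings out of the MSSQL table
$db = @{ ServerInstance = 'mssql.example.internal'; Database = 'MailMaps' }
$dbMap = @{}
Invoke-Sqlcmd @db -Query 'SELECT source, destination FROM RecipientMap' |
    ForEach-Object { $dbMap[$_.source] = $_.destination }

# 7-9. Compare the two hashes and reconcile the table
foreach ($addr in $exoMap.Keys) {
    if (-not $dbMap.ContainsKey($addr)) {
        Invoke-Sqlcmd @db -Query "INSERT INTO RecipientMap (source, destination) VALUES ('$addr', '$($exoMap[$addr])')"
    }
}
foreach ($addr in $dbMap.Keys) {
    if (-not $exoMap.ContainsKey($addr)) {
        Invoke-Sqlcmd @db -Query "DELETE FROM RecipientMap WHERE source = '$addr'"
    }
}
# 10. Invoke-Sqlcmd opens and closes its own connections, so there's no
#     explicit disconnect step here.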

All was well.


One day a ticket comes in with priority: urgent and impact: high.

Our systems send reports overnight including failures and issues to the following distribution group: dl-something@org.tld. It is critical we receive these reports as they allow us to fix issues with software integrations quickly. Last night no reports came through, is there a problem with the distribution group?

At first that looked like an Exchange Online issue. After all, distribution groups are an Exchange thing. So a bunch of people started checking settings and mailboxes but could find no issue. It was only when one person realised the system that sent the reports was on-prem that the penny dropped... This was my script failing.
Not just failing with "Oh we haven't got anything to do", failing with "I'm going to destroy this database if it's the last thing I do HE-MAN!!"

Quickly jumping onto the server to check the logs and confirm what went wrong, I was met with thousands of lines of "Yeah I can't connect to Exchange bro" or words to that effect. If I had written the script properly, it would have errored out and gracefully exited. However, I did not write it properly and assumed that a failure to connect to Exchange would cause the script to bomb out. See, I had only ever connected to Exchange Online interactively from a Powershell prompt, which of course blows up on a failure condition.

What this script did was exactly what it was designed to do: it couldn't connect to Exchange Online, so it had an empty hash table to compare the database against.

Oh look, no addresses in Exchange, better remove EVERYTHING.

Mail started to arrive in the middle of the night and Postfix said "I don't know where you are trying to send that, in the bin" to literally hundreds of emails from on-prem software and devices.

Running the script manually put all the addresses back in, which fixed the issue fairly quickly. I then removed the cron job, got the Powershell master at our org to add some more stringent error handling, tested it manually a few times (forcing it to fail), and added the cron job back in.
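
I obviously can't share his exact fix, but the guard rails amount to something like the sketch below: not the real code, just the general shape of it, with placeholder names.

# Sketch of the kind of error handling the fixed script needed.
try {
    Connect-ExchangeOnline -AppId $appId -CertificateThumbprint $thumb `
        -Organization 'org.onmicrosoft.com' -ErrorAction Stop
}
catch {
    Write-Error "Could not connect to Exchange Online: $_"
    exit 1   # bail out before the database is ever touched
}

# ...build $exoMap as before...

# Never reconcile against an empty result set: an empty hash here means
# "something went wrong", not "delete everything".
if ($exoMap.Count -eq 0) {
    Write-Error 'Exchange Online returned no recipients, refusing to touch the table.'
    exit 1
}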

It has run flawlessly ever since, no thanks to me.


P.S. I had to give an incident report to the change advisory board regarding this outage. I gave a bit of background on why these relays still exist and how they interact with various systems. I then said "So the root cause was my crappy scripting." I at least got some laughs.