ZizzyDizzyMC
Administrator
Site Developer
Hello,
This is a difficult post to write. Unlike the temporary placeholder page for the service outage, which I goofed off and generated with ChatGPT, this isn't AI slop.
By now, most of you have realized that something went very wrong in the morning hours of Thursday the 7th: site images stopped loading, and several functions stopped working before the site went down entirely.
Per policy, I am providing a detailed post-mortem covering what happened, the solutions we've used to address it, and the process changes I am making to correct this and help prevent it in the future.
- In early 2021, I migrated Ponybooru to a new host system.
- The system was secured shortly after setup by disabling password authentication, rate limiting SSH requests, and disabling unneeded default accounts.
- Tuesday the 4th, at approximately 6PM Eastern, a mishap at the DC caused a power loss that turned off all systems.
- At approximately 9PM, I realize the system came back up without an IP address.
- Around 9:10PM Eastern, I log in and set an IP in netplan, and the system is back up.
- Frustrated that cloud-init isn't assigning the system's IP address as expected, I spend 6 hours, until approximately 3AM Wednesday, working on cloud-init.
- Cloud-init appeared to be non-functional: IP addresses were not being assigned on reboot.
- I put the project away for another time and went to bed.
- Unknown to me at the time, cloud-init WAS working, and it had erased the hardened sshd_config.
- This re-enabled password authentication.
- Also unknown to me, cloud-init re-enabled the system's default user and password.
- Thursday morning, I traveled to Whinny City Pony Con (WCPC), intending to help with AV setup and pre-registration, which I successfully did.
- Due to various issues and having forgotten essential supplies, I get depressed and decide to go to sleep around 9PM and start Friday fresh.
- Friday at 2:31AM EST, I receive an alert from one of the two IDS watch groups I subscribe to as part of Z+ incident remediation, which help keep Z+ services clean and safe for the greater internet.
- The alert states that Ponybooru's host is being used for SSH brute-force attacks.
- Per Z+ policy, I log in immediately, at 2:40AM EST, and find that Ponybooru has been compromised.
- Realizing the issue, I immediately stop the VM.
- Per Z+ incident response, The Pony Archive (TPA) was also halted and air-gapped to prevent any potential lateral movement.
- Per Z+ incident response, all Pony Client VPS services were also temporarily halted and air-gapped.
- At 3:40AM, I contact a remediation advisor for next steps.
- I am advised to restrict network access for all VMs, halting all Z+ services on the host node where Ponybooru resides.
- I take a 2-hour nap, and at approximately 5AM Friday morning I wake up and begin remediation and additional hardening of critical Z+ infrastructure.
- With that completed by 7AM, I proceed to remediate, verify, and bring Pony Client VPS services back online for customers.
- At approximately 9:30AM, I bring most clients and non-critical Z+ infrastructure back online.
- I leave WCPC, arrive home at 3:30PM EST, and begin investigating how the Ponybooru server was breached.
- At approximately 7PM EST, I determine the root cause to be cloud-init. I found no evidence of ransomware, data exfiltration, or encryption.
- The infected machine's root filesystem was found to be mostly removed and the external disks unmounted. Data on the external disks was unmodified.
- At 9PM, I begin remediation using weeks-old backups and a fresh backup that had been cut off by the power outage.
- Luckily, the cut-off backup contained recent items, which could be paired with the weeks-old backup of the main VM.
- Remediation was carried out by erasing the infected VM and replacing it with a known-good VM free of the cloud-init security holes.
- The data was combined into a near-real-time restore in which only the last few minutes before the site failure were lost.
- The data was re-indexed to merge the datasets with the weeks-old VM backup.
- Ponybooru was brought back online in three trial periods between 1AM and 12 noon on Saturday the 8th.
- During this time, a short series of unit tests and scans was performed externally and internally to verify both the integrity of the data and the security of the system.
- As part of the remediation, all keys on the host system were rotated at approximately 9PM. This requires all Ponybooru users to perform a password reset.
As part of the incident response, I am required to inform you that while there does not appear to have been any attempt at data exfiltration during the event, users' usernames, email addresses, last known IPs, and browser fingerprints may be at risk.
Browser fingerprints are not reversible by an adversary and reveal nothing about a user's actual browser.
I would also like to emphasize that the system was compromised entirely because of the absurdly weak SSH configuration left behind by the cloud-init mistake. This was not an error in Philomena.
Z+ is taking the following steps to help prevent this in the future:
- SSH access is disabled by default at the host-level firewall until needed.
- Non-standard SSH ports are now mandatory on Z+ Pony Hosting services. All services using the standard SSH port must be updated by July 1st.
- SSH access is to be granted on an as-needed basis and will be restricted via firewall to a single IP or an IP range no larger than a /19.
- TOTP is being looked into as 2FA for critical systems.
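The /19 limit on SSH allowlists is easy to check mechanically before a firewall rule is added. Below is a minimal Python sketch using the standard-library ipaddress module. It is a hypothetical illustration, not actual Z+ tooling: the function name is made up, and it assumes allowlist entries are IPv4 only.

```python
import ipaddress

# Policy bound from the post: a single IP or a range no larger than a /19.
# Assumption: this sketch only accepts IPv4 allowlist entries.
BROADEST_ALLOWED_PREFIX = 19


def ssh_allowlist_entry_ok(entry: str) -> bool:
    """Return True if `entry` is an IPv4 address or CIDR range no broader than /19."""
    try:
        # strict=True rejects ranges with host bits set, e.g. "10.0.0.5/19".
        net = ipaddress.ip_network(entry, strict=True)
    except ValueError:
        # Covers malformed addresses and bad netmasks alike.
        return False
    if net.version != 4:
        return False
    # A longer prefix means a smaller range; a bare address parses as a /32.
    return net.prefixlen >= BROADEST_ALLOWED_PREFIX


if __name__ == "__main__":
    for candidate in ["203.0.113.7", "10.20.32.0/19", "10.0.0.0/8", "10.0.0.5/19"]:
        print(candidate, ssh_allowlist_entry_ok(candidate))
```

A /19 covers 8,192 addresses. Anything with a shorter prefix, such as a /8, is rejected. A range written with host bits set is also rejected instead of being silently widened.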
I would also like to note that, per security practice, Z+ services already had a standard policy of disabling root login and password authentication by default.
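The root cause was a hardened sshd_config being silently replaced, so a small check that re-reads the SSH settings after every reboot or cloud-init run could have flagged the regression. The Python sketch below is hypothetical, not a Z+ tool. It reads sshd_config text and reports whether PasswordAuthentication and PermitRootLogin are set to "no". Its assumptions: it only checks the global section (it stops at the first Match block), it does not follow Include directives, and it applies sshd's rule that the first value obtained for a keyword wins.

```python
import re

# Accepts both "Keyword value" and "Keyword=value" forms.
_LINE = re.compile(r"(\S+?)\s*(?:=\s*|\s+)(.*)")


def effective_settings(config_text: str) -> dict:
    """Map lower-cased keyword -> lower-cased value for the global section."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        m = _LINE.fullmatch(line)
        keyword = (m.group(1) if m else line).lower()
        value = (m.group(2).strip().strip('"') if m else "").lower()
        if keyword == "match":
            break  # assumption: conditional Match blocks are out of scope
        settings.setdefault(keyword, value)  # first obtained value wins
    return settings


def audit_sshd_config(config_text: str) -> list:
    """Return a list of policy violations (empty if the config passes)."""
    s = effective_settings(config_text)
    problems = []
    # PasswordAuthentication defaults to "yes" when unset, so a missing line is flagged too.
    if s.get("passwordauthentication", "yes") != "no":
        problems.append("PasswordAuthentication is not set to 'no'")
    # PermitRootLogin defaults to "prohibit-password" on current OpenSSH; this policy wants "no".
    if s.get("permitrootlogin", "prohibit-password") != "no":
        problems.append("PermitRootLogin is not set to 'no'")
    return problems


if __name__ == "__main__":
    sample = "PermitRootLogin no\nPasswordAuthentication yes\n"
    print(audit_sshd_config(sample))
```

On a live host, the output of `sshd -T`, which prints the effective configuration after includes are resolved, is a more reliable input than the raw file.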
I’m sorry.
Thank you for your time,
ZizzyDizzyMC