
ZizzyDizzyMC

Administrator
Site Developer
TL;DR: We're in a new home for a while until we get much better hardware.

So what happened?

I was planning to migrate most of our infrastructure to a nicely rebuilt server that would host the disks and keep everything running with minimal issues.

What I got was 3 (5 if you count the ones that died twice) systems and a migraine.


As most of you will recall, we started having some issues with random 503 pages, internal errors, etc. I slowly migrated the site over to one of the nodes, which we'll call Node-1.

Node-1 started showing issues about 2 weeks after that migration was completed.

Node-0 was the machine Ponybooru was on; it was slowly en route to me in the mail, after I had spent 3 weeks migrating everything off of it, including clients and other pony sites.

Node-2 was a system I built for personal backups; it had an issue in the past, which was resolved.

Node-3 is a purpose-built system I built for hosting HDDs.

Issues in order:

Ponybooru having issues
Build Node-3
Migrate Node-0 to Node-3, except for Ponybooru, which needed to be migrated to Node-1 due to the sheer size of PB.
Node-1 starts having issues: shutting off, network adapters disconnecting, 503 errors.
I begin migrating some small datasets from Node-2 to Node-1 so I can use Node-2's HDDs in a future project (Node-4).
Node-0 makes it to my DC; I begin assessing the damage and what needs to be done.
Node-1 tips over and crashes; the array is gone overnight and I have to rebuild it. This is when PB became unusable, since there wasn't enough disk speed left to load even static web pages.
Node-1's boot disk fails (1 of 2, RAID 1).
Node-2 isn't fully migrated yet.
Node-6 crashes; I divert time and effort to fixing this, as it hosts clients, clients that pay so that this service may remain free.
Node-0 begins getting rebuilt: 40GbE networking, 2 new boot/VM storage drives, and an external SAS controller added.
Node-1 goes out completely; PB has to be shut down along with TPA and PV to avoid data loss. If these steps had not been taken, you wouldn't be reading this.
Node-0 is installed, preliminarily tested, and put into service.
Node-3's contents begin moving over to Node-0.
Node-1 is brought back online and a scrub of the data is initiated to verify data integrity and confirm nothing was lost (see the sketch after this timeline). All other services on Node-1 are disabled.
Node-3's contents are entirely moved over to Node-0.
Node-3 is unplugged from the disk shelf unit; the disk shelf is then attached to Node-0.
Node-1's PB VM is backed up to Node-0.
Node-1 data processing continues.
Node-0 disk shelf work begins, creating a new dedicated array for PB.
Node-0 begins restoring the PB VM to the new array hosted off of the disk shelf, separate from the backup array built into Node-0.
Node-1 data processing continues, and finishes successfully.
Node-1's disks are removed and added to the disk shelf on Node-0.
Node-0 loads the disks and begins importing the array; this completes quickly.
Node-0 restores the VMs from Node-1 that were backed up to Node-0.
Node-0 finishes the restore of the PB VM.
Networking configuration is done for the newly rebuilt Node-0.
Node-6 obliterates itself, ruining my morning.
Node-6 finishes repairing itself; some damage was done, but it'll survive, and backups are made. A VM hosted for a client got compromised and was sucking up all of the system's resources. (This doesn't affect us; the VM was nuked from orbit, and it had already been discontinued from use anyway.)
Node-0 finishing touches: array verification, status checks, networking config, and hardening config are done.
Everything comes back online one thing at a time, and you're here!
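
For the curious: the post doesn't name the storage stack, but the "scrub, pull the disks, import the array on the new box" steps above map closely onto ZFS tooling, so here's a hypothetical sketch under that assumption. The pool name and the use of Python as a thin wrapper are mine, not from the original setup.

```python
# Hypothetical sketch only -- the storage stack isn't named in this post.
# Assuming a ZFS pool and a made-up pool name, this is roughly the
# verify-on-the-old-node, import-on-the-new-node flow described above.
import subprocess

POOL = "pb-tank"  # hypothetical pool name

def zpool(*args: str) -> str:
    """Run a zpool subcommand and return its output, raising if it fails."""
    result = subprocess.run(["zpool", *args], check=True, capture_output=True, text=True)
    return result.stdout

# On Node-1: kick off a background scrub so every block is re-read and
# checksummed before the disks get pulled.
zpool("scrub", POOL)
print(zpool("status", POOL))   # shows scrub progress and any errors found

# On Node-0, after the disks are reseated in the disk shelf:
print(zpool("import"))          # lists pools visible on the newly attached disks
zpool("import", "-f", POOL)     # '-f' only if the pool wasn't cleanly exported
```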

Holy shit, I love backups, but only having one copy of a backup at certain points through this process was not great.
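
Not part of the original process, but as a tiny illustration of that lesson: the usual rule of thumb is to keep at least two independent copies (ideally three, with one off-site) and to verify the copies actually match before retiring the source. A hypothetical Python sketch, with made-up paths:

```python
# Illustration only; the paths and filenames are made up.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 so large backup images don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

primary = Path("/backups/pb-vm-backup.img")        # hypothetical first copy
secondary = Path("/mnt/offsite/pb-vm-backup.img")  # hypothetical second copy

if sha256(primary) != sha256(secondary):
    raise SystemExit("Backup copies do not match -- do not retire the source yet.")
print("Two matching copies confirmed.")
```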

As an end result we have some spare storage and room to grow, and everything I run also gets ever so slightly cheaper to run every month, because I'm powering off Node-3, Node-2, and Node-1 in favor of the rebuilt Node-0. Node-1 will be sold, Node-2 will be made into an emergency backup system, and Node-3 will also likely be sold.

Life's been really shit lately; hopefully this sparks some better waters ahead.