Category Archives: SysAdmin

Job Titles Galore

I have been out of school for 8 years now and still trip myself up when people ask me what I do. My answer is usually some variation of “I run computers for an investment firm.” That is true, but it suffers from a massive lie of omission.

“Computers” is an understatement. I imagine most people think of computers as desktops. I used to do desktop management, transitioned to server management, then spent a couple of years as the sole HPC administrator before transitioning to my new role as CFEngine ninja.

“Investment firm” is also an understatement. I work for Two Sigma Investments and we do a lot more than manage investments.

This post is not about either of those understatements though. It was meant to be a short meandering through the various job titles I have held. So join me for a brief nostalgic sojourn.

Junior Systems Administrator
Mostly Perl scripting and hardware builds

Jack of all Trades Systems Administrator
Desktop/phone/server support, Windows admin, Linux admin, networking, data center management, pretty much EVERYTHING

Senior Systems Administrator
This role focused more on Linux administration and a lot of debugging.

Global Systems Infrastructure Manager
I managed systems infrastructures in 4+ countries.

High Performance Computing Systems Administrator
I grew a 100-server compute farm into a 700+ server farm, managing multiple generations of hardware. I even brought in our first GPU servers. I managed a multimillion-dollar budget.

Senior Linux Architect/Engineer
I worked with internal customers delivering a solution to meet their needs. I did a lot of hardware evaluation.

?
These days I am not sure what to call myself. All of the above still applies, but I spend the bulk of my time knee-deep in CFEngine and managing the internal sysadmin infrastructure (repository servers, CFEngine hubs, etc.).

Here are some options for my current title:

Senior CFEngine Ninja
Grand CFEngine Poobah

How to hire a Systems Administrator

The job market is bullish for System Administrators and their ilk. I see new job postings daily, and recruiters have been contacting me through LinkedIn and my blog for a while. Unfortunately, most of the job postings are horrible. Just flat out bad.

How to write a BAD sysadmin job posting:

1. Include every acronym known to IT

I have seen postings that include every acronym possible: DNS, DHCP, FTP, EGREP, OSPF, LMNOP. This is not helpful. If the position requires a more capable sysadmin, just say you are looking for a jack-of-all-trades. We are out there. I started out doing everything. Some people prefer variety in their work.

Sysadmin resumes exacerbate the problem. In an effort to land every possible interview, resumes include whatever might catch the eye of HR or a recruiter.

2. Unrealistic expectations

You are not going to find someone who is an expert web designer and an Apache Tomcat ace. Sysadmins are great dabblers; we love to try out new technologies. So there is a good chance that if an applicant has some experience it is cursory, and if they don’t, they can probably teach themselves in short order.

Don’t expect a 10-year veteran to do helpdesk duty. Completely unrealistic.

3. No salary range or ridiculous salary ranges

Unless you are Google, Facebook, or a NYC Investment firm, I need to know you are going to pay me commensurate with my abilities. The aforementioned get a pass on street cred. I am not going to apply for your job unless I know the pay will match or exceed my current position.

You get what you pay for. If you think you are going to hire the next Ninja Sysadmin paying hourly, you need to reset your expectations. This is an issue that arises because “everyone’s nephew knows computers.” The devaluation of our profession due to amateurs is real. No one goes to amateur doctors, so don’t let a hack touch your production network.

There are more, but that is all I have time for today boys and girls. I will try to dig up some examples from craigslist or listservs.

ITIL Foundation Training

Last week I took 2 days to take ITIL Foundation training through Simplilearn. I am going to review the training company and the subject matter separately.

Simplilearn

There are a lot of companies willing to provide training for any certification acronym you can think up. I could find no useful reviews of whose training was better. I knew I wanted to take a course in person. I have done a number of MOOCs over the past 2 years and am burnt out on self-directed learning. I tried to study for the LPIC certification by myself using an online training course, but it was horrible and I never felt confident enough to take the certification exam.

I signed up for training in September but received a call from an Indian gentleman telling me he had to cancel the in-person training in Pittsburgh for September due to lack of students. He offered to refund a portion of my fee so I could travel to Philadelphia to receive the training. That wasn’t going to work for numerous reasons. He then offered to refund the difference if I would take the online training; again, no. He assured me that the training would happen in October.

The course was scheduled for 8am-5pm Thursday and Friday, with the exam at the end. On the preceding Monday I received confirmation that the training was happening. The class was me and one other student. The trainer had traveled in from Chicago. The course was supposed to take 18 hours over two days. I think we really spent 10 hours working on material. We started late, took frequent breaks, and finished early both days.

The training is singularly focused on preparing you to pass the exam. It is aimed at certification, NOT education. I am not an ITIL Foundation expert by any means. I know the process and vocabulary, but we didn’t do anything that goes on in most classrooms. We did no case studies, had no discussion, and didn’t apply anything we were learning.

Subject Matter / Course Content

ITIL is a lot of things. It is a framework and a list of best practices for IT service management. It is also confusing. The pot of gold at the end of the rainbow is that it is useful. Its flexibility, and therefore almost universal applicability to our profession, means it can help out just about everyone. System Administration is a relatively new profession, but it’s not so new that the wheel hasn’t been invented, re-invented, and re-re-invented. Some really smart and focused people have thought long and hard about how to deliver value through IT. I am smart enough to know when to defer to the experts. I would recommend at least the Foundation training for every sysadmin regardless of experience. The key though, like all learning, is to have an open mind.

Other Thoughts

As soon as I receive my exam results I am going to pursue the next steps in ITIL certification. Thankfully there is no shortage of content or training. I am thinking of focusing on the Operations aspect, but almost every stage appeals to me for how it could better my work and help my company.

Downtime & DNS Registrars

I wanted to take a second to explain why martingehrke.com has been down for about a week. I was forced to change my IP address on the hosting server. I also host my own DNS and backup DNS. I have been a customer of 1&1 Internet for about a decade now. They recently changed their UI, and it was not letting me change the name of my nameservers. Every registrar requires you to create NS records if you want to host your own DNS instead of using theirs. Well, 1&1 makes you create subdomains for each nameserver instead of just A records. Their horrible UI was not letting me accomplish what I wanted.

So…

Goodbye 1&1, hello Namecheap. I initiated the domain transfer last week after just a day of fighting 1&1’s UI. Most domain registrars will transfer your domain quickly. Not 1&1. They decided to make me wait the full 5-day period.

TL;DR I stopped using 1&1 for domain registration because they replaced something that worked with broken-ness.

Total Cost of Ownership in the System Administration World

In the system administration world, the total cost of ownership (TCO) is the price you pay for equipment from before its purchase to after its demise. There are always hidden costs that you cannot anticipate, but through diligent evaluation and selection you can limit these unforeseen costs and minimize your TCO.

My main focus for this post is going to be vendor selection, specifically vendor plurality.

Unlike Google or Facebook, most Systems Administrators buy their server equipment from the following vendors: HP, IBM, Dell, or a re-seller of Supermicro. Each vendor offers different product lines, and while one product line may be more reliable than another, let’s assume that as a whole they are all equally reliable, within some standard deviation. Why then would you ever buy equipment from more than one vendor? Price. IT and System Administration are overhead; they are the cost of doing business (ignoring IaaS, PaaS, & SaaS). A cost-conscious company will want to get the best price on hardware.

Why You Should Support More Than One Vendor

Depending on your computing needs you could be spending millions on hardware, and one company’s margin could be tens of thousands of dollars. It is good practice to always get competing quotes from different vendors (not just different re-sellers). That way you keep the vendors honest (or at least more honest).

You can save a significant amount of money this way, which you should obviously make sure your bosses know.

But what hidden costs arise?

Why You Should Support as Few Vendors as Possible

Every vendor you support requires your time as a system administrator. To name a few considerations:

  • different RAID controllers with different syntaxes
  • different OOBM (IMM, iDRAC, iLO) with various nuances
  • support from vendors can differ widely in process and competence
  • automated hardware monitoring is different. While most vendors support IPMI, there are always little differences
  • all new vendor equipment needs evaluation and a familiarization period

What am I to do?

There is no optimal solution, only what works for you. I recommend limiting the number of supported vendors to the absolute minimum while still keeping a second around to make sure the other vendor’s pricing is honest and competitive.

Each vendor adds to a server’s TCO. Therefore, limiting the number of supported vendors minimizes every server’s TCO.
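
To make the trade-off concrete, here is a minimal sketch in Python. Every number in it is made up for illustration, and the model itself (a flat sysadmin overhead per supported vendor, added to the hardware spend) is a simplifying assumption rather than anything measured.

```python
def fleet_tco(server_count, price_per_server, vendor_count, overhead_per_vendor):
    """Fleet TCO: hardware spend plus a flat support overhead per vendor."""
    hardware = server_count * price_per_server
    overhead = vendor_count * overhead_per_vendor
    return hardware + overhead


# Hypothetical scenarios: competing vendors push the unit price down,
# but each extra vendor costs sysadmin time (evaluation, tooling, support).
SERVERS = 200
scenarios = {
    "one vendor": fleet_tco(SERVERS, 5000, 1, 30000),
    "two vendors": fleet_tco(SERVERS, 4800, 2, 30000),
    "three vendors": fleet_tco(SERVERS, 4750, 3, 30000),
}

for name, total in scenarios.items():
    print(f"{name}: ${total:,} total, ${total / SERVERS:,.0f} per server")
```

With these invented inputs the second vendor pays for itself and the third does not. Different prices or overheads will move that break-even point, which is why the right number of vendors depends on your own environment.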

What I learned from Hurricane Sandy as a System Administrator

I just experienced my first real Business Continuity Plan (BCP) event. In the past my company has simulated BCP events to test our response and capabilities. The simulated events were always minor, like a fiber cut. Here are a few observations and lessons I learned while providing support as a Systems Administrator.

1. Your one stupid decision will be brought to light. Two major data center providers had major issues because of one poor decision. Both had located their generator fuel pumps in the basement. Generators in NYC/Manhattan are typically located on the roof due to the cost of real estate. Pumps are used to transport the fuel from the street level up 15-50+ stories. Basements are the first places to flood, which makes your fuel pumps useless and eventually leads to your generators running out of fuel.

Articles to read:
Flooded NY data centers survive Sandy on generator power, fuel deliveries | Ars Technica

New York Data Centers Battle Back from Storm Damage » Data Center Knowledge

Hurricane Sandy Topples New York Data Center, Gawker, Gizmodo | Wired.com

2. Manpower is important but tough to guarantee. You need the right people in the right places to keep things running or to fix things that break. But those same people have families and responsibilities. Key people may be busy dealing with more important matters. Business is important but life is paramount.

3. Make sure everyone has remote access and uses it periodically. After Sandy, employees needed to work from home because our offices still did not have power. That morning 10% of the company opened tickets requesting help with remote access. This could have been avoided if we had encouraged employees to work from home periodically, thereby ensuring their remote access actually worked.

4. Have more than two of every important piece of infrastructure, or have more than one backup. A BCP event causes havoc, and in that chaos it will turn your redundant services into single points of failure (SPOF). You might have had two VPN servers before you lost power to an office, but afterwards you have a significant SPOF.

5. Be diligent in your BCP testing and preparation. Test those generators, test them again, and test them under real load a third time. You might have designed a system to have redundancy, but you need to be sure they built what you designed. I know many fiber paths that were designed to be redundant but collapse together at some points (usually the last X feet). Be strict and follow up.
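
Lesson 4 lends itself to a quick paper exercise: list where each critical service runs, knock out a site, and see what is left with fewer than two instances. The sketch below does that in Python. The service and site names are hypothetical and the inventory format is an assumption, not a description of any real tooling.

```python
def single_points_of_failure(services, lost_sites):
    """Return services left with fewer than two instances after losing sites.

    services maps a service name to the list of sites hosting an instance.
    """
    at_risk = {}
    for name, sites in services.items():
        survivors = [site for site in sites if site not in lost_sites]
        if len(survivors) < 2:
            at_risk[name] = survivors
    return at_risk


# Hypothetical inventory: two VPN servers looks redundant on a normal day.
inventory = {
    "vpn": ["office-nyc", "dc-nj"],
    "dns": ["office-nyc", "dc-nj", "dc-chicago"],
}

# Simulate losing power to the office.
report = single_points_of_failure(inventory, {"office-nyc"})
for service, survivors in report.items():
    print(f"{service}: only {len(survivors)} instance(s) left: {survivors}")
```

In this made-up inventory, losing the office leaves the VPN on a single server while DNS still has two, which is the "more than two" argument in miniature.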

I am on the Board of Directors for the League of Professional System Administrators

This past week I started my two-year tenure on the Board of Directors for the League of Professional System Administrators (LOPSA). My reasons behind running for the board can be found in an earlier post.

I volunteered to become the new Local Chapter Committee Chairman. In this role I will have the opportunity to shape and direct LOPSA’s continuing effort to expand and support local chapters throughout the country. We currently have eight local chapters with a couple more in the up-and-coming phase. I have assembled a great committee with a number of volunteers who are gracious enough to donate time and effort.  This will take up most of my time.

I have also been put in charge of planning and coordinating LOPSA-Live, a periodic IRC members forum. During a LOPSA-Live the Board makes announcements and updates the community on LOPSA’s goals and initiatives. This is usually followed with a Q&A. I am working toward having these every other month. This is one of the main ways the Board communicates with LOPSA’s membership.

I believe we have a great set of people on the Board who will work to strengthen LOPSA and further its goals.

The Dreaded Work Slump

I’ve been an athlete all my life, and to an athlete the idea of a slump is accepted. No one told me that it happens in the non-athletic world as well. In an athletic slump a basketball player can’t hit free throws or a baseball player’s batting average drops; you are performing below your average. In the business world this translates to malaise and general dissatisfaction with your daily work. Personally this means regular tasks drag on and seem more difficult, projects stall, and my attention wanders.

Lifehacker.com just did a short piece on getting out of a slump: “How Can I Overcome a Work Slump?”

I wanted to add a couple things that have helped me get out of slumps.

Change your scenery. If you usually work in the office and have the opportunity to work from home, take it. Work out of your home office for a week. Changing your surroundings can help you refocus and keep your attention from straying.

Do something physical. A lot of us sit at our computers all day without much in terms of breaks or physical activity. Use this chance to get back into the habit of working out. Take a 20 minute break from work to walk around or hit the gym. Cardiovascular workouts like running, biking, or erging give your brain and body a chance to focus on something physically demanding.

As a systems administrator my job does entail a decent amount of monotony. I sometimes have to do the same thing multiple times on different servers to fix similar issues. This can get boring, but it can’t be ignored. It follows that professional system administrators are particularly susceptible to slumps. Realizing this and identifying that you are in fact in a slump are the first steps to getting out of one.

Ignoring the fact that this does happen won’t keep it from happening. Willful ignorance isn’t a solution.

Ubuntu 12.04 LTS, LVM, KVM, and Fun

Occasionally my job affords me the opportunity to do some really cool stuff. For a myriad of reasons I am building myself a Linux KVM host using Ubuntu 12.04 LTS. I hurried through the install prompts and didn’t pay enough attention to the partitioning. As a result, I ended up with 96GB of swap space and a 40GB root partition.

My goal was to have no swap, a small root, and a large partition for the KVM guests. I disabled swap and removed the swap logical volume. I then rebooted into the always handy Recovery Is Possible (RIP) rescue system using PXE, resized the root partition and root logical volume, crossed my fingers, and rebooted. Voila! It all came back up perfectly configured for my uses.
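
For anyone in the same spot, the sequence above can be sketched roughly as follows. This is not a record of the exact commands that were run: the volume group name, logical volume names, and target size are all hypothetical, and the ext filesystem on root is an assumption. The Python below only builds and prints the commands; it does not execute anything.

```python
import shlex


def shrink_root_plan(vg, swap_lv, root_lv, new_root_size):
    """Build (but do not run) the commands to drop swap and shrink root.

    All names and sizes are supplied by the caller; none are detected.
    """
    swap_dev = f"/dev/{vg}/{swap_lv}"
    root_dev = f"/dev/{vg}/{root_lv}"
    return [
        # On the running system: stop using swap, then remove its volume.
        # (The swap entry in /etc/fstab also has to be removed by hand.)
        ["swapoff", swap_dev],
        ["lvremove", swap_dev],
        # From the rescue environment, with root unmounted:
        ["e2fsck", "-f", root_dev],                   # check before resizing
        ["resize2fs", root_dev, new_root_size],       # shrink the filesystem first
        ["lvreduce", "-L", new_root_size, root_dev],  # then shrink the volume
    ]


# Hypothetical names and size, for illustration only.
plan = shrink_root_plan("vg0", "swap_1", "root", "20G")
for command in plan:
    print(shlex.join(command))
```

The order matters: the filesystem has to be shrunk before the logical volume underneath it, otherwise the volume is cut out from under live data. Take a backup before trying any of this on a machine you care about.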

Now on to creating KVM guests…