Researching & Learning About Zookeeper: A Guide

I’ve started working with Zookeeper, and since then I’ve put together this blog post. Its aim is to provide a structured approach to learning Zookeeper and researching the elements that make its features tick. Along the way I have a few call outs to people who have provided excellent talks, material, or other contributions to learning about Zookeeper. With that, let’s get started.

Zookeeper is a coordination system built on ideas from consensus algorithms. At its core it is a replicated key-value store that keeps all of its data in memory, which suits read-heavy workloads. Those qualities add up to a system that is highly consistent and intended to give distributed systems access to data that won’t be lost.

Start Learning

The starting point should be a complete read of the Apache Zookeeper Project Home Page.

At this point I took an administrator’s angle on determining what I needed to know and do next. I knew that my situation would meet the basic reliability assumptions around Zookeeper: first, that only a minority of servers in a deployment would fail or become inaccessible at any particular time from a crash, partition, or related issue; and second, that deployed machines would have correctly operating clocks, storage, and network components that perform consistently.

I had also made the assumption that I would need 2 x F + 1 machines in order to keep a majority quorum and maintain data consistency, where F is the number of failed or inaccessible machines to tolerate. This meant that if I wanted to tolerate 2 failures, I’d need at least 5 machines; for failures of up to 3 machines, that would be 7 machines. Pretty easy, just a little simple math.
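
A throwaway bash one-liner to sanity-check that arithmetic (the formula is straight from the paragraph above):

for F in 1 2 3; do echo "tolerate $F failure(s): $(( 2 * F + 1 )) servers"; done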

The other thing I was curious about, especially on a single machine, was Zookeeper’s overall overhead. Would it come into contention with the services that are already running? Would it be OK to put Zookeeper on the machines that run the micro-services Zookeeper is providing information to? Well, Zookeeper does indeed contend with other applications for CPU, network, memory, and storage. For this reason I have to balance the deployment of Zookeeper against the other applications; since my server loads may not be super high, I’d be able to run Zookeeper on some of the servers that have other services deployed. But YMMV depending on the services you’ve got deployed.

While I was thinking through how I’d build out the architecture for my implementation of Zookeeper I came upon a very important note in the documentation,

“ZooKeeper’s transaction log must be on a dedicated device. (A dedicated partition is not enough.) ZooKeeper writes the log sequentially, without seeking. Sharing your log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.

Do not put ZooKeeper in a situation that can cause a swap. In order for ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to ZooKeeper. For more on this, see Things to Avoid below.”
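
To make that guidance concrete, here’s a minimal sketch of what it could look like in practice; the transaction-log mount point and heap size are assumptions for illustration, not values from the docs:

# zoo.cfg: keep the transaction log on its own dedicated device (hypothetical mount point)
dataDir=/var/zookeeper
dataLogDir=/mnt/zookeeper-txnlog

# conf/java.env: cap the JVM heap below the machine's real memory so it never swaps (assumed size)
export JVMFLAGS="-Xmx1g"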

After reading up on the following documentation it seemed like a good time to do a test deployment:

BEGIN BUG DESCRIPTION: 1ST DOCKER ATTEMPT

NOTE: If you just want to get to the Zookeeper installation & setup and skip this issue, GOTO here.

My first go was to pull down a clean Ubuntu Docker image, prep it as a container, and then start installing the necessary parts of Zookeeper. These steps consisted of the following. I made a video for it (see toward the bottom of this entry) so you can actually see the flow, and I also wrote out the bash commands I’m typing below, so you can pick whichever you prefer.

docker-machine start fusion-fire

Docker Machine starts the virtual machine on OS-X that runs the Docker daemon, which I’ve named fusion-fire, thus the command above. After that I pulled down an Ubuntu image, started a container from the image, and connected to the container, all set for installation.

docker pull ubuntu
docker run -it ubuntu

To install the prerequisites for the Zookeeper server (Java first) I then issued the following.

sudo apt-get update
sudo apt-get -y install default-jdk

While this was executing I ran into a situation where the Java Development Kit install was hanging on getting the certificates put into place.

I began looking into this problem and found that currently on Ubuntu 14.04, running sudo apt-get update and then running the install will trigger the bug. There are other postings and issues related to it as well; just Google it for more references. So what I did at that point to work around the issue was the following.

First I forcefully killed the docker container by just restarting the whole docker VM.

docker-machine stop fusion-fire
docker-machine start fusion-fire

Once that stopped and the virtual machine had started again, I tried the install once more, this time with the JRE instead of the full JDK.

sudo apt-get -y install default-jre

apt-get attempted to recover, but the install still kept getting stuck on registering the certificates, so I gave up on this avenue for now. Hopefully a future Docker and Linux kernel release fixes the problem. Instead I went out and just spooled up some AWS instances; I’ll update this blog entry with a “Part II: Zookeeper on Docker, Fixed” when the Java + Linux kernel + Docker issue is remedied. Until then, here’s the installation process on the AWS instances.

END BUG DESCRIPTION

AWS Instance Zookeeper Installation

Once this was set up I started 5 nano instances for Zookeeper (nano, since it’s just a test example for learning) and then logged into all of them using broadcast input with iTerm 2. From there each instance had the following commands executed.

sudo apt-get update
sudo apt-get install -y default-jdk
cd /opt/
sudo mkdir zookeeper
cd zookeeper/
sudo wget http://mirror.tcpdiag.net/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
sudo tar -zxvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6/conf/
sudo nano zoo.cfg

NOTE: Nano is the text editor I used above for “sudo nano zoo.cfg”. If you don’t have it available just install it with “sudo apt-get install nano”.

In that zoo.cfg I entered the following. For the IPs I actually used the AWS private IPs of the instances in the config file example below. On each server.N line the first port (2888) is the one followers use to connect to the leader, and the second (3888) is used for leader election.

tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=172.31.19.66:2888:3888
server.2=172.31.19.67:2888:3888
server.3=172.31.19.68:2888:3888
server.4=172.31.19.69:2888:3888
server.5=172.31.19.70:2888:3888

Now I started the service using the zkServer.sh script file.

sudo /opt/zookeeper/zookeeper-3.4.6/bin/zkServer.sh start-foreground

When it booted up I ran into an error about the missing myid file, so I added that file, with a sequential number for the id, in the /var/zookeeper directory.

sudo nano /var/zookeeper/myid

In each of the files I added a number, 1 through 5 respectively, for the id of each server, and saved the files. Upon attempting to start the Zookeeper service again I finally got to see the various nodes in the ensemble find each other and start working. Which, I gotta admit, was a pretty damn cool feeling.
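
For reference, a quick way to write the myid file from the shell; this is just a sketch of the same step (use 2 through 5 on the other machines):

sudo mkdir -p /var/zookeeper
echo "1" | sudo tee /var/zookeeper/myid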

After all that fussing it seemed good to note some of the switches for zkServer.sh, especially since they’re hard to find in the documentation (which is itself a bit hard to use).

start
start-foreground (super useful for debugging)
stop
restart
status
upgrade
print-cmd

Once this is done, restart the service, but this time instead of using the start-foreground command just use the start command; that will start the service and return the shell to you so you can issue commands or whatnot. An easy way to test out Zookeeper now that it is running is to use the Zookeeper CLI. This is the zkCli.sh shell script (or the zkcli.bat file if you’re running it on Windows – which I’d strongly suggest NOT doing).
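
As a quick smoke test, here’s roughly what a zkCli.sh session could look like against one of the nodes; the first line launches the CLI from the shell, the rest are typed at the resulting zk prompt, and the znode name is just an arbitrary example:

/opt/zookeeper/zookeeper-3.4.6/bin/zkCli.sh -server 127.0.0.1:2181
create /healthcheck hello
ls /
get /healthcheck
delete /healthcheck
quit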

Ok, that’s it for this entry. More to come in the near future. Cheers!

Excellent Additional References

All That Tech… SITREP: Elastic Meetup & Quote Center Updates

I started working with the Quote Center (QC) back in November, and wrote about it in “After 816 Days I’m Taking a Job!” Now that I’m a few months into the effort, it’s sitrep time. Sitrep, btw, is military speak for situational report.

The three core priorities I have at Quote Center in my role are: Community Contributions, Site Reliability, and Talent Recon.

Community Contributions (& Organizing)

Some of the progress I’ve made is direct and immediate involvement with some really interesting groups here in Portland. The first, which seemed a prime option, is the Elastic User Group.

Some of the QC team and I traveled up late last year to check out the Elasticon Tour stop in Seattle. It was an educational experience where I got some of my first introductions to Elasticsearch, as well as to Beats, a product Elastic had just released. I was fairly impressed by what I saw, and several other things aligned perfectly for follow-up community involvement after that.

I’ve since kept in touch with the Elastic Team and started coordinating the Elastic User Group in Portland (Join the group on Meetup for future meetings & content). In March the group will be hosting a great meetup from Ward & Jason…

Kafka, Java, Ruby, React, and Elasticsearch w/ Ward Cunningham and Jason Clark

Monday, Mar 28, 2016, 6:30 PM

Lucky Lab
915 SE Hawthorne Blvd. Portland, OR

New Relic receives tons of metrics. Large customers can report thousands of uniquely named metrics per minute, but they want to search and chart them in nearly realtime. We’ve turned to Elasticsearch on this problem, tuning it for this write-heavy workload. With small, frequently duplicated documents, it’s been an interesting challenge to optimize …

So be sure to RSVP for that meetup as it’s looking to be a really interesting presentation.

The second group I’ve stepped up to help out with is the Docker Meetup here in Portland. The first meetup we have planned at this time is from Casey West.

How Platforms Work

Wednesday, Mar 16, 2016, 6:30 PM

New Relic
111 Southwest 5th Avenue #2800 Portland, OR

Platforms: either you have one or you’re building one. Over the years I’ve observed six high-level characteristics common to production environments which are operationally mature. This talk will explain in detail the six capabilities in an operationally mature production environment. I will also demo these capabilities live. Working in Internet i…

Site Reliability

One of the other priorities I’ve been focusing on is standard site reliability: everything from automation to continuous integration and deployment. I’ve been making progress, albeit slowly; at this stage it’s going from zero to something, and building a site reliability practice takes time. I’ve achieved a few good milestones, however, which will help build toward the next steps of the progress.

We’ve started to slowly streamline and change our practices around Rackspace and AWS usage. This is a very good thing as we move toward a faster-paced continuous integration process for our various projects. At this time it’s a wide mixture of .NET solutions that we’re moving toward .NET Core, and at the same time there are some Node.js and other project stacks that we’re adding to our build server process.

TeamCity

Our build server at this time is shaping up to be TeamCity. We have some build processes running in Jenkins, but those are being moved off and onto a TeamCity server for a number of reasons. I’m going to outline those reasons, and I’m happy to hear arguments for other, better options, so feel free to throw a tweet at me or leave a comment or three.

  1. JetBrains has a pretty solid and reliable product in TeamCity. It is cohesive in building the core types of applications we would prospectively have: Java, .NET, Node.js, C/C++, and a few others. That makes it easy to get all projects onto one type of build server.
  2. TeamCity has intelligence about what is and isn’t available for Java & .NET, enabling various package management and other capabilities without extensive scripting or extra coding needed. There are numerous plugins to help with these capabilities also.
  3. TeamCity has fairly solid, quick, and informative support.

Those are my top reasons at this point. There’s another reason I didn’t feel should be enumerated, because it’s a feeling versus something I’ve confirmed: the Jenkins community honestly feels a bit haphazard and disconnected. Maybe I’m just not asking or haven’t found the right forums to read, but I’ve found it a frustrating experience to deal with a Jenkins server and to find information and help on getting a disparate and wide-ranging set of tech stacks building on it. TeamCity has always just been easy, and getting continuous integration going the easiest way possible is very appealing.

Monitoring

We use a number of resources for monitoring our systems. New Relic is one of them, and they’re great; however, it’s a bit tough when things are locked down inside a closed (physically closed) network. How does one monitor those systems and the respective network? Well, you get Nagios or something of the sort installed and running.

I installed it, but Nagios left me with another one of those dirty feelings, like I’d just spilled a bunch of sour milk everywhere. I went about cleaning up the Nagios mess I’d made and, upon attending the aforementioned Elasticon Tour stop in Seattle, decided to give Beats a try. After a solid couple weeks of testing and confirming that the various pieces would work well for our specific needs, I went about deploying Beats among our systems.

So far, albeit only a few weeks into using Beats (and still learning how to actually make reports in Kibana), it appears to have been a good decision: dramatically more cohesive and not spastically splintered all over the place like Nagios. I’m already looking into adding additional Beats beyond the known three: Topbeat, Packetbeat, and Filebeat. There are a number of other beats specific to our needs that we could add, and those would make good open source projects. Stay tuned; I’ll talk about them in this space and get a release out as soon as we lay down a single line of code for them.

Talent Recon

Currently, nothing to report, but more to come in the space of talent recon.

Docker Tips n’ Tricks for Devs – #0004 Using VMware Fusion w/ Docker

In this article I’m going to cover a few steps in getting started with VMware and Docker instead of the default VirtualBox and Docker. The basic prerequisites for this are:

  • VMware Fusion >= v8.x
  • Docker Toolbox w/ Docker
    $ docker --version
    Docker version 1.9.0, build 76d6bc9

To start with, one of the things I didn’t find super intuitive was figuring out what the boot2docker URL should be. I attempted to create the virtual machine several times with what I thought it should be, and then realized, to my dismay, that it defaults to what it generally needs to be anyway. A serious case of RTFM.

Once the prereqs are met, just run the following command.

docker-machine create nameOfTheVirtualMachine --driver vmwarefusion

You’ll see the creation results display to the terminal then. They should look something like this.

Running pre-create checks...
Creating machine...
(fusion-fire) Creating SSH key...
(fusion-fire) Creating VM...
(fusion-fire) Starting fusion-fire...
(fusion-fire) Waiting for VM to come online...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Detecting the provisioner...
Provisioning with boot2docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect Docker to this machine, run: docker-machine env fusion-fire

After that, just run that last command from the creation results.

docker-machine env fusion-fire

That will print another set of results, shown below, along with one more command to run.

export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.244.159:2376"
export DOCKER_CERT_PATH="/Users/adron/.docker/machine/machines/fusion-fire"
export DOCKER_MACHINE_NAME="fusion-fire"
# Run this command to configure your shell:
# eval "$(docker-machine env fusion-fire)"

Execute the command.

eval "$(docker-machine env fusion-fire)"

If you take a look at the Virtual Machine Library in VMware now you should see your machine; pop the actual VM open and you should see that standard boot2docker screen with wide-open root access to the virtual machine.
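
You can also confirm things from the terminal with a couple of quick checks (using the machine name from above):

docker-machine ls
docker-machine ssh fusion-fire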

At this point I wanted to take the virtual machine for a little spin. I issued the following commands to pull an Elasticsearch image, run a container from it, and then get a bash prompt in that running container.

docker pull elasticsearch
docker run -it elasticsearch

At this point I saw the log output displaying, so I killed the run with Ctrl+C and got a list including the just-exited container so I could grab its container ID to restart it.

docker ps -a

Then with everything in place I started the container back up and logged into the container instance. (Note that docker exec needs the container to be running, hence starting it first.)

docker start 257e98847bbb
docker exec -it 257e98847bbb bash

Which then shows…

root@257e98847bbb:/#

I’m in. At this point I could work with the container however I’d need to.

Aside

Elastic Search, Beats, & Learning in Portland

Recently I attended the Elasticon Tour stop in Seattle with @thebigscaryguy & the Elastic team we have at the Home Depot Quote Center. I have always dug Elasticsearch a bit, and after the tour stop & some research I was very interested in Beats too. After that we decided we’d help kick-start the Portland Elastic Meetup again. So I reached out, as usual, via Twitter first for a pulse on interest.

I got some immediate interest in a few tweets from @WardCunningham, @tweetcaco, and others that made me think, “yup, definitely something worth pursuing!”

Next Steps…

Next I’m aiming to get a few speakers lined up. This is where I’d love your help. What would you like to hear about? What would you like to learn? Want to learn more about Elastic, Beats, some other Lucene-related or distributed-style system, or something inside or even outside of Elastic’s product offerings? Let me know what you’d like to hear about and the team will make it happen one way or another!

Not Totally Done With 2015, But…

I was sitting hacking away on my ride into work (I ride the bus/train so I can get a solid hour of work in before I ever get to the office). I was dorking around with a bash script I’d recently written that was taking the place of a file watcher.

At the same time I was toying around with keywords and such on a watcher that watches twitter, and stumbled onto @poornima’s tweet.

At first I just read the article, but then I thought it would be a good idea to actually write up what I’d accomplished for this year too. Especially being that 2015 has been an exceptional and very different year for me.

First, however, a little context on why 2015 was an exceptional year. 2013 ended in a ridiculous way. I’d just finished 2+ years working as a tech evangelist, product manager, and coder for AppFog, Tier 3, and then Basho, while writing with the Orchestrate.io team. However, I’d gotten ridiculously burned out by the end of 2013, so I literally took time to sit on my porch, have a beer, and watch the trees blow in the breeze.

To recover from my burnout I kicked off 2014 by co-founding a startup with a friend, Aaron. We proved out a lot of things, for one that 40-hour work weeks are utter bullshit. The whole startup work-yourself-to-the-bone thing is also bullshit. We accomplished a ton and built a system with barely any capital whatsoever. I was impressed, and when I showed others they were impressed too.

Then at the end of that, I realized I hadn’t resolved my burnout; it had just abated while I lived the startup life. So I started 2015 off by not doing anything. That’s right, I didn’t do a damn thing. Many in the industry thought I was still working, and I still dropped into a meetup here or there, but otherwise I was pretty sick of the whole tech industry. At least, I felt I was sick of it.

By February I had landed a gig helping Strongloop put together its curriculum and training material. I even went and delivered some of the training. It was good, but it wasn’t really the exact thing I wanted to do either (it didn’t help that it was all in the suburbs, and that’s another whole point of burnout for me – I’m done with suburbia).

As I wrapped that up I put together some more training material and worked on a few side jobs. I realized something at this time: I’d removed so many of my expenses that I was doing fine even without working much at all. I learned that I didn’t need the “American Life” with a noose of debt and other nonsense around my neck. I was, in essence, debt free, mortgage free, loan free, and I didn’t owe anybody a thing.

So April rolled around and .NET Fringe happened. That was interesting because I started getting interested in technology again in a huge way. Oddly, not particularly in .NET, but in languages and systems in general. I started digging back into things that made me curious, things I wanted to implement.

Again, not working but just learning. Then I started working on a contract at CDK Global, helped some interns put together a hackathon team, and generally just enjoyed being in the field and getting things done. Without any tie downs or nooses anywhere to be seen.

Then I realized, “Holy shit Adron, you’d gotten burned out because of things – often just daily nonsense – that were tying you down, and when you’re free and don’t have to live at the whim of managers, loans, money, and others, you’re happy…”

At that moment I realized the real accomplishment of 2015: I learned what I need to stay happy in this industry. That’s what I’m doing today at a strategic level, staying happy with my work. On a tactical level I’ve been slowly working toward that, and the fruits of that work will be self-evident in 2016.

I hope 2015 kicked ass for you. Enjoy Star Wars, and see you all in 2016!

Elasticon Tour 2015 in Seattle

Today the Elasticon Tour swung into Seattle for its tour stop. Some of the Home Depot Quote Center team and I headed up via the Geek Train for the event. We arrived the night before so we could get up and actually be awake and ready for it.

Just to note, a good clean place to stay that isn’t overpriced like most of Seattle is the Pioneer Square Hotel – usually about $110-120 a night. If you’re in town for a conference, sometimes it’s even worth skipping the “preferred hotels” and staying there. But I digress…

When the team and I walked in we waited a little bit for registration to get started. We stood around and chatted with some of our other cohort. Once the registration did open, we strolled into the main public space and started checking out some demos.

StreamSets

The first thing I noticed among the demos is something that’s catching a lot of attention: a partner of Elastic’s called StreamSets.

From what I could figure out just watching the demo, StreamSets is an ingest engine. That’s simple enough to determine just by taking a look at their site, but being able to watch the demo also enlightened me to the way the IDE-style interface (the thing in the dark pictures above) worked.

The IDE provided ways to connect to ingestion data with minimal schema and actually start flowing that data through the engine. One of the key things that caught my attention was the tie-in with Kafka and Hadoop, with ingest and egress of data to and from sources ranging from AWS S3 to things like Elastic’s engine and the various other sources I’ll be working with in the coming months.

For more information about StreamSets here are a few other solid articles:

…and connect to keep up with what StreamSets is doing:

…and install instructions:

…and most importantly, the code:

Beats (Not the Dumb Lousy Headphones)

Recently I installed Nagios, as I will be doing a lot of systems monitoring, management, and general devops-style work in the coming weeks to build out solid site reliability. Nagios will theoretically do a lot of the things I need it to do, but then I stumbled into the recently released Beats by Elastic (not by Dre; see the links in the title above).

I won’t even try to explain Beats, because it is super straightforward. I do suggest checking out the site if you’re even slightly interested, but if you just want the quick lowdown, here’s a quote that basically summarizes the tool.

“Beats is the platform for building lightweight, open source data shippers for many types of operational data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana. Whether you’re interested in log files, infrastructure metrics, network packets, or any other type of data, Beats serves as the foundation for keeping a beat on your data.”

So there ya go, something that collects a ton – if not almost all – of the data that I need to manage and monitor the infrastructure, platforms, network, and more that I’m responsible for. I’m currently diving in, but here are a few key good bits about Beats that I’m excited to check out.

#1 – PacketBeat

This is the realtime network packet analyzer that integrates with Elasticsearch and provides the analytics you’d expect. It gives a level of visibility between all the networked servers and such that will prospectively give me insight into where our series of tubes is getting clogged up. I’m looking forward to seeing our requests mapped up with our responses!  ;)

#2 – FileBeat

This is a log data shipper based on the Logstash Forwarder. At least it was at one point; it appears to be less and less based on it now. This beat monitors log directories for log files, tails the files, and forwards them to Logstash. This completes another important part of what I need to systematically monitor within our systems.
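
For a sense of what that looks like in practice, here’s a rough sketch of a minimal Filebeat configuration; the log paths and the Logstash host are placeholder assumptions, and the exact YAML layout may differ between Beats releases:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/*.log
      input_type: log
output:
  logstash:
    hosts: ["logstash.example.internal:5044"]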

Random fascinating observations:

  • Did I mention Beats is written in Go? Furthering Derek’s tweet from 2012!  ;)

  • Beats has a cool logo, and the design of the tooling is actually solid, as if someone cared about how one would interact with the tools. I’ll see how this holds up as I put together a sample implementation with Beats & the various data collectors.

More References & Reading Material for Beats:

That’s it for the highlights so far. If anything else catches my eye this evening at the Elasticon Tour, I’ll get started rambling about it too!

Nagios and Ubuntu 64-bit 14.04 LTS Setup & Configuration

1st – The Virtual Machine

First I created a virtual machine for use with VMware Fusion on OS-X. Once I got a nice clean Ubuntu 14.04 image set up, I installed SSH on it so I could manage it as if it were a headless (i.e. no monitor attached) machine (instructions).

In addition to installing openssh, those steps also include build-essential, make, and gcc, along with instructions for installing VMware Tools; don’t worry about the VMware Tools part, though. Those instructions are cumbersome and in parts just wrong, so skip that. The virtual machine is up and running with ssh and a good C compiler at this point, so we’re all set.

2nd – The LAMP Stack

sudo apt-get install apache2

Once that’s installed, the default page will be available on the server, so navigate over to 192.168.x.x and view the page to ensure it is up and running.
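
If you’d rather check from the terminal, a quick header request does the same job (substitute your VM’s actual address):

curl -I http://192.168.x.x/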


Next install the MySQL server and the PHP5 MySQL module.

sudo apt-get install mysql-server php5-mysql

During this installation you will be prompted for the mysql root account password. It is advisable to set one.
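
The prompts described next come from MySQL’s post-install hardening script, which appears to be the step being run here; the command itself isn’t shown above, so take this as an assumption:

sudo mysql_secure_installation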

Then you will be asked to enter the password (the one you just set about 2 seconds ago) for the MySQL root account. Next, it will ask if you want to change that password. Select ‘n’ so as not to set yet another password for the root account, since you’ve already created one just a few seconds before.

For the rest of the questions, you should simply hit the enter key at each prompt to accept the default values. This will remove some sample users and databases, disable remote root logins, and reload the privilege rules so that MySQL immediately respects the changes we have made.

Next up is to install PHP. No grumbling, just install PHP.

sudo apt-get install php5 libapache2-mod-php5 php5-mcrypt

Next let’s open up dir.conf and change a small section to adjust which files Apache serves first when a directory is requested. Here’s what the edit should look like.

Open up the file to edit. (In vi, hit the ‘i’ key to insert or edit. To save, hit escape and then ‘:w’; to exit vi after saving, hit escape and then ‘:q’. To force an exit without saving, hit escape and then ‘:q!’.)

sudo vi /etc/apache2/mods-enabled/dir.conf

This is what the file will likely look like once opened.

<IfModule mod_dir.c>
DirectoryIndex index.html index.cgi index.pl index.php index.xhtml index.htm
</IfModule>

Move the index.php file to the beginning of the DirectoryIndex list.

<IfModule mod_dir.c>
DirectoryIndex index.php index.html index.cgi index.pl index.xhtml index.htm
</IfModule>

Now restart apache so the changes will take effect.

sudo service apache2 restart

Next let’s set up public key authentication. On your local box complete the following.

ssh-keygen

If you don’t enter a passphrase, you will be able to use the private key for authentication without entering one. If you do enter one, you’ll need both it and the private key to log in. Securing your keys with passphrases is more secure, but either way this is more secure than basic password authentication. For this particular situation, I’m skipping the passphrase.

What gets generated is id_rsa, the private key, and id_rsa.pub, the public key. They’re put in the local user’s .ssh directory.

At this point copy the public key to the remote server. On OS-X grab the easy to use ssh-copy-id script with this command.

brew install ssh-copy-id

or

curl -L https://raw.githubusercontent.com/beautifulcode/ssh-copy-id-for-OSX/master/install.sh | sh

Then use the script to copy the ssh key to the server.

ssh-copy-id adron@192.168.x.x

That should give you the ability to log into the machine without a password every time. Give it a try.
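
For example, using the same user and address as above:

ssh adron@192.168.x.x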

Ok, so now on to the meat of this entry, Nagios itself.

Nagios Installation

Create a user and group that will be used to run the Nagios process.

sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios

Install these other essentials.

sudo apt-get install libgd2-xpm-dev openssl libssl-dev xinetd apache2-utils unzip

Download the source and extract it, then change into the directory.

curl -L -O https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.1.1.tar.gz
tar xvf nagios-*.tar.gz
cd nagios-*

Next run the command to configure Nagios with the appropriate user and group.

./configure --with-nagios-group=nagios --with-command-group=nagcmd

When the configuration is done you’ll see a display like this.

Creating sample config files in sample-config/ ...

*** Configuration summary for nagios 4.1.1 08-19-2015 ***:

General Options:
-------------------------
Nagios executable: nagios
Nagios user/group: nagios,nagios
Command user/group: nagios,nagcmd
Event Broker: yes
Install ${prefix}: /usr/local/nagios
Install ${includedir}: /usr/local/nagios/include/nagios
Lock file: ${prefix}/var/nagios.lock
Check result directory: ${prefix}/var/spool/checkresults
Init directory: /etc/init.d
Apache conf.d directory: /etc/httpd/conf.d
Mail program: /bin/mail
Host OS: linux-gnu
IOBroker Method: epoll

Web Interface Options:
------------------------
HTML URL: http://localhost/nagios/
CGI URL: http://localhost/nagios/cgi-bin/
Traceroute (used by WAP):

Review the options above for accuracy. If they look okay,
type 'make all' to compile the main program and CGIs.

Now run the following make commands. First run make all as shown.

make all

Once that runs the following will be displayed upon success. I’ve included it here as there are a few useful commands in it.

*** Compile finished ***

If the main program and CGIs compiled without any errors, you
can continue with installing Nagios as follows (type 'make'
without any arguments for a list of all possible options):

make install
- This installs the main program, CGIs, and HTML files

make install-init
- This installs the init script in /etc/init.d

make install-commandmode
- This installs and configures permissions on the
directory for holding the external command file

make install-config
- This installs *SAMPLE* config files in /usr/local/nagios/etc
You'll have to modify these sample files before you can
use Nagios. Read the HTML documentation for more info
on doing this. Pay particular attention to the docs on
object configuration files, as they determine what/how
things get monitored!

make install-webconf
- This installs the Apache config file for the Nagios
web interface

make install-exfoliation
- This installs the Exfoliation theme for the Nagios
web interface

make install-classicui
- This installs the classic theme for the Nagios
web interface

*** Support Notes *******************************************

If you have questions about configuring or running Nagios,
please make sure that you:

- Look at the sample config files
- Read the documentation on the Nagios Library at:
https://library.nagios.com

before you post a question to one of the mailing lists.
Also make sure to include pertinent information that could
help others help you. This might include:

- What version of Nagios you are using
- What version of the plugins you are using
- Relevant snippets from your config files
- Relevant error messages from the Nagios log file

For more information on obtaining support for Nagios, visit:

https://support.nagios.com

*************************************************************

Enjoy.

After that successfully finishes, execute the following.

sudo make install
sudo make install-commandmode
sudo make install-init
sudo make install-config
sudo /usr/bin/install -c -m 644 sample-config/httpd.conf /etc/apache2/sites-available/nagios.conf

Now some tinkering to add the web server user, www-data, to the nagcmd group.

sudo usermod -G nagcmd www-data

Now for some Nagios plugins. You can find the plugins listed for download at http://nagios-plugins.org/download/. The following is based on the 2.1.1 release of the plugins.

Change back to the user’s home directory on the server, then download, extract, and change into the newly extracted directory.

cd ~
curl -L -O http://nagios-plugins.org/download/nagios-plugins-2.1.1.tar.gz
tar xvf nagios-plugins-*.tar.gz
cd nagios-plugins-*
./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl

Now for some ole compilation magic.

make
sudo make install

Now pretty much the same thing for NRPE. Look here to ensure that 2.15 is the latest version.

cd ~
curl -L -O http://downloads.sourceforge.net/project/nagios/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz
tar xvf nrpe-*.tar.gz
cd nrpe-*

Then configure the NRPE bits.

./configure --enable-command-args --with-nagios-user=nagios --with-nagios-group=nagios --with-ssl=/usr/bin/openssl --with-ssl-lib=/usr/lib/x86_64-linux-gnu

Then get to making it all.

make all
sudo make install
sudo make install-xinetd
sudo make install-daemon-config

Then a little file editing.

sudo vi /etc/xinetd.d/nrpe

Edit the only_from line in the file to include the following, where 192.x.x.x is the IP of the Nagios server.

only_from = 127.0.0.1 192.x.x.x

Save the file, and restart the xinetd service.

sudo service xinetd restart

Now begins the Nagios Server configuration. Edit the Nagios configuration file.

sudo vi /usr/local/nagios/etc/nagios.cfg

Find this line and uncomment it.

#cfg_dir=/usr/local/nagios/etc/servers

Save it and exit.

Next create the directory that will hold the configuration files for the servers you want to monitor.

sudo mkdir /usr/local/nagios/etc/servers

Next configure the contacts config file.

sudo vi /usr/local/nagios/etc/objects/contacts.cfg

Find this line and set the email address to one you’ll be using.

email adronsemail@compositecode.com

Now add a Nagios service definition for the check_nrpe command.

sudo vi /usr/local/nagios/etc/objects/commands.cfg

Add this to the end of the file.

define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Save and exit the file.
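
To give a sense of where this is headed, here’s a rough sketch of the kind of host and service definition that could eventually live in /usr/local/nagios/etc/servers/ and use that check_nrpe command; the host name, address, and check are illustrative assumptions, and the actual monitored-server setup comes later:

define host{
        use                     linux-server
        host_name               web-01
        address                 192.x.x.x
}

define service{
        use                     generic-service
        host_name               web-01
        service_description     Current Load
        check_command           check_nrpe!check_load
}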

Now a few last touches for configuration in Apache. We’ll want the Apache rewrite and cgi modules enabled.

sudo a2enmod rewrite
sudo a2enmod cgi

Now create an admin user; we’ll call them ‘nagiosadmin’.

sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Create a symbolic link of nagios.conf to the sites-enabled directory and then start the Nagios server and restart apache2.

sudo ln -s /etc/apache2/sites-available/nagios.conf /etc/apache2/sites-enabled/
sudo service nagios start
sudo service apache2 restart

Enable Nagios to start on server boot (because, ya know, that’s what this server is going to be used for).

sudo ln -s /etc/init.d/nagios /etc/rcS.d/S99nagios

Now navigate to the server (the /nagios/ URL shown in the configure output above) and you’ll be prompted to log in to the web user interface.


Now begins the process of setting up servers you want to monitor… stay tuned, more to come.