Wintersmith: Creating Documentation

I set out a few days ago to put together a documentation site. I had a few criteria for this site:

  1. A static site that I could push to GitHub to use with their GitHub Pages feature.
  2. The static site is generated from markdown.
  3. It just works. It’s easy to get it into a workflow without breaking the tool or breaking a solid workflow.

That was it, what I’d consider some pretty straightforward criteria. However, it wasn’t that easy, until it was. Here are a few of the issues I ran through on the way to getting a solid tool and a solid workflow working together. Beware, however, if you have fickle reading eyes: the following is a rant about what does and does not work.

[rant on]

Middleman, Broken Ruby and Broken Gems

I have a MacBook Pro Retina 15″. The machine runs OS X Mavericks. I’ve had zero issues with this OS. It comes with Ruby 2 and some version of gems. My first attempt was to take a stab with Middleman, the same static site builder used by many companies including Basho. Even though I ran into problems, which I detailed in “Basho – First Week Coding & Research Adventures…” and “Un-breaking OS-X Mountain Lion”, eventually Middleman mostly worked.

Well, I didn’t get to a working app very fast. Immediately Ruby 2 had issues and the gemsets puked Middleman everywhere. I then ran into some confusing permissions errors. About 15 minutes into this process of troubleshooting Middleman I had flashbacks of the first few days at Basho and thought, “this is bullshit, something has to work better than this catastrofuck of software version conflicts“. So I dropped Middleman dead.

Assemble, Assemble, Assemble…    ??!?#@$%! WTF!

I attempted Assemble next on the Node.js stack. It looked to have a lot of promise. It uses Grunt.js and a bunch of other tools to manage a static-site-generating, Bootstrap-using stack. The more I looked at it, however, the busier it seemed. Busy as in “I’m going to do more than three things so I’ll maybe do none of them right“.

Reading about Assemble, I turned to another hacker slinging some code at the bar I sat at. She looked at the project and asked, “What’s it supposed to do exactly? I get that it’s a framework of tools, but it doesn’t exactly lay out what it is supposed to be doing besides arbitrarily managing some parts of the stack.” That seemed reasonable to me.

Before I just tossed assemble.io onto the trash heap of options, I wanted to ask at least one more person. So the next day I asked my good friend and super genius Troy Howard. It was a short verdict, “drop that shit”.

That was enough for me, assemble was officially dead for this project.

Slate, This Seems Slick But…

I then took a stab at Slate. Orchestrate.io had just created some excellent documentation using the Slate solution. So I dove into this, getting a test site up and running rapidly. It seemed like a mostly viable solution until I started running into issues with how and where I wanted things displayed for the code samples and other material. It appeared that if I were going to use Slate, I’d be using it almost exactly as is. I might borrow pieces of it in the future, even the layout to some degree, but for now I wanted something else into which I could incorporate my own themes as needed. Alas, even though I was super happy with Slate, it just wasn’t a great fit for now.

Where The Hell Are My Options, Jekyll?

At this point I was getting a little frustrated. I then went to a tried and true solution in Jekyll. Jekyll is a pretty solid solution, with some bugs and oddball issues but nothing major. I started working with it and even began transitioning a Jekyll project into my theme. Hacking a Jekyll blog into a reasonable documentation solution seemed like the way to go.

But then I got a wild urge to see if there was anything else in Node.js land that I was missing. I really didn’t want to sling a Ruby project if I didn’t have to. I’d rather keep all the stacks around JavaScript for this particular set of projects. There’s no reason to diverge when I’m just dealing with such simple, straightforward web projects. I’ll diverge when something truly validates diverging, like doing some real math with a real functional language or something. Trading Node.js on one single project for a pseudo Ruby project, just for static site generation, didn’t seem appealing. So I started looking around one more time.

Made in -34°C

Yup, -34 Celsius. That’s about as cold as it gets. Click for the full size chart!

The next solution I tried was Wintersmith. This solution appeared to have everything that I’d been looking for feature-wise. It was a Node.js project, it generated static content, it could generate blogs but other things too, it was simple, had plugins, was straightforward and more. I was a little paranoid after the solutions I’d fought my way through earlier, so I went to the only place that would ensure I’d have a solution I could be confident in. I went straight to the source!

I’ll admit I took a peek at the package.json file before going headlong into the source. A quick perusal of the dependencies list looked ok.

  "dependencies": {
    "marked": "~0.3.0",
    "coffee-script": "~1.6.3",
    "async": "~0.2.9",
    "highlight.js": "~8.0.0",
    "jade": "~1.1.5",
    "ncp": "~0.5.0",
    "rimraf": "~2.2.6",
    "winston": "~0.7.2",
    "colors": "~0.6.2",
    "optimist": "~0.6.0",
    "minimatch": "~0.2.14",
    "mime": "~1.2.11",
    "js-yaml": "~3.0.1",
    "mkdirp": "~0.3.5",
    "chokidar": "~0.8.1",
    "server-destroy": "~1.0.0",
    "npm": "~1.3.24",
    "slugg": "~0.1.2"
  },
  "devDependencies": {
    "shelljs": "0.1.x"
  }

I immediately took note of a few things. The first was that there was actually a breakout of dev dependencies versus actual project dependencies. That’s a good first sign. The second thing: I went through the list and checked the various library dependencies, and there were a few that I’ve played around with before and trusted; highlight.js, coffee-script, async, js-yaml and npm were all cool by me. It didn’t seem too crazy out of whack. With that I went forth into the code with zero expectations…

The first file I dug into was the config.coffee file, which pointed out a few things I’d possibly want to tweak a little later, such as the port number and other settings the Wintersmith server uses when running the preview server.

class Config
  ### The configuration object ###

  @defaults =
    # path to the directory containing content's to be scanned
    contents: './contents'
    # list of glob patterns to ignore
    ignore: []
    # context variables, passed to views/templates
    locals: {}
    # list of modules/files to load as plugins
    plugins: []
    # modules/files loaded and added to locals, name: module
    require: {}
    # path to the directory containing the templates
    templates: './templates'
    # directory to load custom views from
    views: null
    # built product goes here
    output: './build'
    # base url that site lives on, e.g. '/blog/'
    baseUrl: '/'
    # preview server settings
    hostname: null # INADDR_ANY
    port: 8080
    # options prefixed with _ are undocumented and should generally not be modified
    _fileLimit: 40 # max files to keep open at once
    _restartOnConfChange: true # restart preview server on config change

The second code file that looked interesting was the renderer.coffee file.

fs = require 'fs'
util = require 'util'
async = require 'async'
path = require 'path'
mkdirp = require 'mkdirp'
{Stream} = require 'stream'

{ContentTree} = require './content'
{pump, extend} = require './utils'

if not setImmediate?
  setImmediate = process.nextTick

renderView = (env, content, locals, contents, templates, callback) ->
  setImmediate ->
    # add env and contents to view locals
    _locals = {env, contents}
    extend _locals, locals

    # lookup view function if needed
    view = content.view
    if typeof view is 'string'
      name = view
      view = env.views[view]
      if not view?
        callback new Error "content '#{ content.filename }' specifies unknown view '#{ name }'"
        return

    # run view
    view.call content, env, _locals, contents, templates, (error, result) ->
      error.message = "#{ content.filename }: #{ error.message }" if error?
      callback error, result

render = (env, outputDir, contents, templates, locals, callback) ->
  ### Render *contents* and *templates* using environment *env* to *outputDir*.
      The output directory will be created if it does not exist. ###

  env.logger.info "rendering tree:\n#{ ContentTree.inspect(contents, 1) }\n"
  env.logger.verbose "render output directory: #{ outputDir }"

  renderPlugin = (content, callback) ->
    ### render *content* plugin, calls *callback* with true if a file is written; otherwise false. ###
    renderView env, content, locals, contents, templates, (error, result) ->
      if error
        callback error
      else if result instanceof Stream or result instanceof Buffer
        destination = path.join outputDir, content.filename
        env.logger.verbose "writing content #{ content.url } to #{ destination }"
        mkdirp.sync path.dirname destination
        writeStream = fs.createWriteStream destination
        if result instanceof Stream
          pump result, writeStream, callback
        else
          writeStream.end result, callback
      else
        env.logger.verbose "skipping #{ content.url }"
        callback()

  items = ContentTree.flatten contents
  async.forEachLimit items, env.config._fileLimit, renderPlugin, callback

module.exports = {render, renderView}

Fairly straightforward code. It puts together the rendered content, and I noted a few key things. There was a solid parameter order that was repeated: env, content, locals, contents, templates, callback. From this it looked like local variables were set statically based on configuration rather than resolved dynamically. This could bite me, but with this quick glance, at least I knew where and what was happening with the order of generation.

I then did a scan of the templates.coffee and a few other code files. Having gotten a fair idea of where and what was being done, I went looking for a quick start. Things looked pretty good, so I crossed my fingers and my rant ends here…

[/rant off]

So now that rant mode is over, here’s what I did to make Wintersmith my documentation solution. Most of this is in a state of flux as I automate and put more into the project to simplify the workflow.

Here’s how I got started super fast.

Step #1 Get Wintersmith running.

npm install wintersmith -g

Note that you’ll need to install it globally (thus the -g) and may need to prepend sudo to that command.

The next thing that I did was create a directory that I’d use to build the statically generated contents. This material I’d put into a git repository on GitHub (namely the deconstructed gh-pages repo). I’ll call this generically the root directory.

mkdir rootDirectory

After that I navigated into the rootDirectory and created a new Wintersmith Application.

wintersmith new myAppName

That now gives me a directory structure like this:

  • rootDirectory
    • myAppName

Now that I have this, the app content, markdown, views and related templates are in myAppName. To view the app, I changed directories into myAppName and ran wintersmith preview like this:

wintersmith preview

Opening up a browser I can navigate to http://localhost:8080 and see the fully rendered site. To publish the site, however, one needs to run wintersmith build, and there’s one problem: I want the site to publish to the rootDirectory, where the application content currently sits. To do this I have to edit the config.json file, adding a setting just above the locals settings shown below…

{
  "locals": {
    "url": "http://localhost:8080",
    "name": "The Wintersmith's blog",
    "owner": "Someone",
    "description": "Ramblings of an immor(t)al demigod"
  }
}

I added an output key/value property to the file as shown. It merely takes the build results and shifts them back a directory so they end up in the rootDirectory.

{
  "output": "../",
  "locals": {
    "url": "http://docs.deconstructed.io",
    "name": "Deconstructed Docs",
    "owner": "Adron Hall",
    "description": "This site provides the documentation around the Deconstructed API Services."
  },
  "plugins": [
    "./plugins/paginator.coffee"
  ],
  "require": {
    "moment": "moment",
    "_": "underscore",
    "typogr": "typogr"
  },
  "jade": {
    "pretty": true
  },
  "markdown": {
    "smartLists": true,
    "smartypants": true
  },
  "paginator": {
    "perPage": 6
  }
}

I also changed the perPage setting to 6, just so I could get a little more content on the main page eventually. There are also the changes for the domain name and a few other parameters that I’ll catch up on in the next blog entry.
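With the output path pointed back a directory, publishing is just a build followed by committing the root directory. Here’s a minimal sketch of that loop, assuming rootDirectory is already a git repository and that GitHub Pages is serving the gh-pages branch (adjust the branch to whatever your repo actually uses):

cd myAppName
wintersmith build          # writes the generated site to ../ per the output setting
cd ..
git add -A
git commit -m "Rebuild documentation site"
git push origin gh-pages   # or master, depending on which branch GitHub Pages serves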

Summary

In my next blog entry I’ll cover a quick how-to on setting up the CNAME in GitHub Pages to get the static Wintersmith site up at a subdomain/domain name. I’ll also dive into setup with AWS Route 53, which generically applies to setting a gh-pages site up with any DNS provider. So subscribe and I’ll have that post in the next 1-2 days.

Mapping Domain Names with name.com, Elastic Beanstalk, Elastic Load Balancer and AWS Route 53

I finally wrapped up my name server and DNS mapping needs with Name.com, Route 53 and Elastic Beanstalk. Since this was a little confusing I thought a short write up was in order. Thanks to Evan @evandbrown for helping out!

The first thing needed is a delegation set of name servers for your DNS and name server provider. These can be found by creating a hosted zone. To do this, open up the AWS Management Console and navigate into the Route 53 management area. The Route 53 icon is under the Compute & Networking section of the management console.

Beanstalk, Route 53 – Click for full size image

Upon navigating to the Route 53 console area, click on the Create Hosted Zone button.

Create Hosted Zone – Click for full size image

When the zone is created, the delegation set can be found under the Hosted Zone Details. This delegation set now needs to be set up as the name servers with whoever, in this case Name.com, is the domain provider.

Delegation Set – Click for full size image.

Open up the management console for the name server administration at the domain provider and add the name servers from the delegation set. Upon adding them, the list should look something like this.

Name servers list built from the delegation set of the hosted zone. Click for full size image.

Once the name servers are set up, they will need time to propagate. This could take a good solid chunk of time, likely somewhere in the hours range, and don’t be surprised if it takes a little more than a day.

While the propagation starts, navigate back to the AWS Management Console and open up the EC2 section of the console. On the right hand side of the Resources list there is a Load Balancers section. Click it.

Load Balancers – Click for full size image.

In this section there is a listing of all load balancers that have been created manually or by Elastic Beanstalk.

Load Balancers – Click for full size image.

Make note of the Load Balancer Name for selection in Route 53. This is what Route 53 needs in order to point an alias at the right target for incoming traffic to that particular Elastic Beanstalk application. In the image above there are 4 load balancers listed; the best way to prevent confusion is to take note of the load balancer name at the time of creation, but otherwise this listing is the easiest way to find them.

Record Set – Click for full size image

Now go back to the hosted zone to set it up with the appropriate information. Create a new record with the appropriate name, in this case admin.deconstructed.io (no, it isn’t live yet, I just set it up to test it out), and point it to an alias target. Just leave the Type set to A – IPv4 address and click the radio control so that Alias is set to Yes. In the alias target select the appropriate load balancer for the Elastic Beanstalk (or whatever it points to) application.
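For anyone who would rather script this step than click through the console, the same alias record can be created with the AWS CLI using a change batch. This is only a sketch with hypothetical values; the hosted zone ID, the load balancer’s DNS name and the load balancer’s own hosted zone ID all come from your account:

aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "admin.deconstructed.io",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z35EXAMPLE",
        "DNSName": "awseb-myapp-123456789.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": false
      }
    }
  }]
}'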

That’s it, give it a few hours (or a day) and eventually the domain or subdomain will be pointed appropriately at the Elastic Beanstalk load balanced application.

Learning About Docker

Over the next dozen or so days I’ll be ramping up on Docker, where my gaps are and where the project itself is going. I’ve been using it on and off and will have more technical content, but today I wanted to write a short piece about what, where, who and how Docker came to be.

As an open source engine, Docker automates the deployment of lightweight, portable, resilient and self-sufficient containers that run primarily on Linux. Docker containers are used to contain a payload, encapsulate it and consistently run it on a server.

This server can be virtual, on AWS or OpenStack, in clusters, public instances or private, bare-metal servers or wherever one can get an operating system to run. I’d bet it would show up on an Arduino cluster one of these days.  ;)

Use cases for Docker include taking the packaging and deployment of applications and automating it into a simple container bundle. Another is building lightweight, PaaS-style environments that scale up and down extremely fast. Another is automating testing and continuous integration and deployment, because we all want that. Another big use case is simply building resilient, scalable applications that can then be deployed to Docker containers and scaled up and down rapidly.
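For a tiny taste of what that looks like in practice, the canonical first steps with the Docker CLI are to pull a base image and run an isolated process inside a container:

# fetch a base image from the public image index
docker pull ubuntu
# start a container from that image and open a shell inside it
docker run -i -t ubuntu /bin/bash
# from another terminal, list the running containers
docker ps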

A Little History

The creators of Docker formed a company called dotCloud that provided PaaS services. On October 29th, 2013, however, they changed the name from dotCloud to Docker Inc to emphasize the focus change from the dotCloud PaaS technology to the core of dotCloud, Docker itself. As Docker became the core of a vibrant ecosystem, the founders of dotCloud chose to focus on this exciting new technology to help guide and deliver on an ever more robust core.

Docker Ecosystem from the Docker Blog. Hope they don’t mind I linked it, it shows the solid lifecycle of the ecosystem. (Click to go view the blog entry that was posted with the image)

The Docker community has been super active, with a dramatic number of contributors, well over 220 now, most of whom don’t work for Docker, and they’ve made a significant percentage of the commits to the code base. As far as the repo goes, it has been downloaded over 100,000 times, yup, over a hundred. thousand. times!!! It’s container tech, I’m still impressed just by this fact! On GitHub the repo has thousands of starred observers and over 15,000 people are using Docker. One other interesting fact is the slice of languages, with a very prominent usage of Go.

Docker Language Breakout on Github

Overall the Docker project has exploded in popularity in a way I haven’t seen since Node.js set the coder world on fire! It’s continuing to gain steam in how and in which ways people deploy and manage their applications – arguably more effectively in many ways.

Portland Docker Meetup. Click image for link to the meetup page.

The community is growing accordingly too, not just via a simple push by Docker/dotCloud itself, but actively through grassroots efforts. One has even sprung up in Portland: the Portland Docker Meetup.

So Docker, Getting Operational

The Loading Bay

One of the best ways to describe Docker (which the Docker team often uses, hat tip for the analogy!) and containers in general is to use a physical parallel. One of the best examples is that of the shipping and freight industry.

Manually Guiding Freight, To Hand Unload Later.

Before containers, ships, trains, trucks and buggies (ya know, the kind horses pulled) were all loaded by hand. There wasn’t any standardization around the movement of goods except for a few, often frustrating, tools like wooden barrels for liquids, bags for grains and other assorted things. They didn’t mix well and were often stored in a way that caused regular damage to goods. This era is a good parallel to hosting applications on full hypervisor virtual machines or physical machines with one operating system. The operating system is kind of the holding bay or ship, with all the freight crammed inside haphazardly.

Shipping Yards, All of a Sudden Organized!

When containers were introduced, like the shiny blue one shown here, everything began a revolutionary change.

A Flawlessly Rendered Container

The manpower required dropped dramatically, injuries dropped, and shipping became more modular since the containers fit together easily. To put it simply, shipping was revolutionized through this invention, and in the meantime we’ve all benefitted in some way from this change. This can be paralleled to the change in container technology shifting the way we deploy and host applications.

Next post, coming up in just a few hours: “Docker, Containers Simplified!”

Getting Distributed – BOOM! The Top 3 Course Selections

A few months ago I posted a poll to ask what courses I should put together next. I just wrapped up and am putting the final edits and finishing touches on a Pluralsight course on distributed databases, focusing on Riak. On the poll, the top three courses by a decent percentage of votes included the following:

  1. Node.js Distributed Systems – Bringing the Node.js Nodes together for Distributed Nodes of Availability and Compute @ 12.14% of the vote.
    1. A Quick Intro to Node.js
    2. Introduction to Relevant Distributed Patterns
    3. How Does Node.js Fit Into the Distribution
    4. Working With Distributed Systems (AKA Avoiding a Big Ball of Mud)
    5. Build a Demo
  2. Distributed Systems Programming with JavaScript @ 10.4% of the vote.
    1. Patterns for Distributed Programming
    2. …and I’m figuring the other sections out still for this one…  got ideas? It needs to encompass the client side as well as the non-client code side of things. So it’s sort of like the above course, but I’m focusing more on the periphery of what one deals with when dealing with developing on and around distributed systems as well as distributed systems themselves.
  3. Vagrant OS-X, Windows and Linux – how to build, manage and ship machines to use for development and recreation of production environments.
    1. Vagrant, What is it?
    2. OS-X, Linux and Windows
    3. Using Vagrant Machines
    4. Building Vagrant Dev Machines
    5. Vagrant the Universe!

Now I might flip this list, but either way they’re all going to be super cool. So stay tuned and I’ll be working these up into courses. So far the sub-bullets above are the basics of the curriculum I intend to put forward. Am I missing anything? Would you like to see anything specifically? Leave a comment and I’ll be sure to get everything as packed in there as possible!!

Riak Developer Guidance

The “Client Round Robin Anti-Pattern”

One of the features that is often available in Riak client software (including the CorrugatedIron .NET client, the riak-js client and others) is the ability to send requests to the Riak cluster through a round robin style approach. What this means is that the IP of each node within the Riak cluster is entered into a config file for the client. The client then goes through that list to send off requests to read, write or delete data in the database.

The client being responsible for, and knowledgeable about, the data tier of the application architecture is an immediate red flag! The concept of SoC (Separation of Concerns) dictates that

“SoC is a principle for separating a computer program into distinct sections, such that each section addresses a separate concern.”

Having the client provide a network tier layer to round robin communication with the database leaves us in a scenario that should be separated into individual concerns. Below is some basic guidance on eliminating this SoC issue, with a configuration sketch after the list.

  • Client ONLY sends and receives communication: The client, especially in the situation with a distributed system like Riak, should only be dealing with sending and receiving information from the cluster or a facade that provides an interface for that cluster.
  • Another layer should deal with the network communication and the division of nodes and node communication. Ideally, in the case of Riak, and most distributed systems, this should be dealt with at the network device layer (router).
  • The network device (router) layer would ideally be able to have (through software, likely) a way to automate the failure, inclusion or exclusion of nodes within the cluster system. If a node goes down, the network device should handle the immediate cessation of communication with that node from all clients, routing the communication accordingly to an active node.
  • The node itself needs to maintain a continual information state available to the network. Ideally the network state would identify any addition or removal of a node and, if possible, the immediate failure of a node. Of course it isn’t always possible to be informed of a failure, but the first line of defense should start within the cluster itself among the nodes.
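To make the separation concrete, here’s a rough sketch of what this looks like from the client’s side. The riak-js calls follow the client’s documented getClient/save style, but treat the exact options as an assumption on my part (check the client’s docs), and the host names and IPs here are hypothetical. The point is simply that the client knows about one endpoint, a facade or load balancer, rather than the whole cluster topology.

// Anti-pattern: the client owns the cluster topology and round robins across it.
// (Hypothetical node IPs, listed only to illustrate the smell.)
var nodes = ['10.0.1.1', '10.0.1.2', '10.0.1.3', '10.0.1.4', '10.0.1.5'];

// Separated concern: the client only knows a single endpoint, a load balancer,
// proxy or router that owns node selection, failure detection and recovery.
var riak = require('riak-js').getClient({
  host: 'riak.internal.example.com', // hypothetical facade / load balancer address
  port: 8098
});

riak.save('users', 'adron', { name: 'Adron Hall' }, function (err) {
  if (err) console.error('write failed:', err);
});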

The Anti-Pattern

Having the client handle all of these parts of the functional architecture leads to a number of problems, not merely that the guidance of the SoC concept is broken. With the client attempting to track and be aware of the individual nodes in the cluster, the client is saddled with a huge responsibility.

Take for instance the riak-js client. If a node goes down the client will need to be aware of which node has gone down. For a few seconds (yes, you have to wait entire seconds at this level) the node will be gone and the client won’t know it is down. The client would just have to reasonably wait. When the communication times out, the client would then have the responsibility of marking that particular node as down. At this point the client must track which node it is in some type of data repository local to the client. The client must also set a time or some other way to identify when the node comes back up. Several questions start to come up, such as:

  • Does the client do an arbitrary test to determine when the node comes back up?
  • When the node comes back up is it considered alive or damaged?
  • How would the client manage the IP (or identifier) of the node that has gone down?
  • How long would the client store that the node is down?

The list of questions can get long pretty quick, thus the bad karma of not following a good practice around separating your concerns appropriately! One has to be careful, a god class might be right around the corner otherwise! That’s it for this quick journey into some distributed database usage guidelines. Until next time, happy data sciencing.  ;)

OSCON : Conversations, Deployments, Architecture, Docker and the Future?

I wrote about my first day of OSCON in “OSCON : Day 1, Windows Just Doesn’t Do Cloud Foundry… but, there’s a fix for that…“. The rest of the week was most excellent. I caught up with friends and past coworkers. I heard about people working on some amazing new projects. Some of this I will try to write up in the coming days, as I’m sure some of it will be making the tech news (if not the regular people news too).

Conversations

I had some great conversations about the direction of enterprise and PaaS uptake. It’s great to hear that there is finally some movement in that space. As one would expect, however, there is still a lot of distance for the enterprise to catch up on, but they’ll get there – or fall apart in the meantime.

There were also tons of conversations about the Indiegogo Ubuntu Edge mobile device. The device is great looking and sounds like a solid idea. The questions arise from the fact that they’re working to make this a purely crowdfunded project. This wouldn’t be a concern if they were trying to just get a few million in capital, but they’re aiming for $32 million! Overall though, with 128 GB of storage, dual LTE antennas for Europe and the US, a top tier screen in quality and design, a metal body and multiple other features, this phone is ahead of anything out there. I hope it’s successful, but I must admit my own hesitance. What’s your take on the device?

Deployments

Over the course of the conference I talked to and worked with a number of other individuals playing around with Cloud Foundry and also OpenShift. The primary aspect that we worked on was strategies around deployment of these PaaS Technologies.

We also worked with Iron Foundry to extend Cloud Foundry to support .NET. Whether you love .NET or hate it, wherever you fall in that spectrum, it still has an absolutely huge user base. Primarily because .NET spent the last decade and a few years going head to head against Java in the enterprise, and we all know the enterprise is slow to shift anything. So for now and the foreseeable future .NET is an extremely large part of the development world. Having it work in your PaaS is fundamental to gaining significant enterprise share. Cloud Foundry is the only open source, internally usable PaaS on the market today. There are closed source options available, but that obviously doesn’t come up at OSCON.

While at OSCON, I also got to discuss architecture and deployment of Riak with a number of people. The usage of Riak continues to grow and the environments, use cases and tooling that people are using Riak with and for is always an interesting space for me. I also got to discuss deployment of Cassandra and even some Neo4j, Redis and Riak side by side deployments. People have used an interesting mix of NoSQL solutions out there to pull their respective data together for their needs.

Among all these deployments, conversations regularly returned to a known topic of mine: cloud computing and who is capable of what, where and when. AWS is still an easy leader in cloud computing, not just in customers but in technology. This also brought up the concerns and apathy that some have around OpenStack (hat tip to Ben Kepes for the write up) working more homogeneously with AWS. Whatever the case might be, the path for OpenStack needs to be clarified regularly. I imagine the next movement is going to be away from being too concerned with infrastructure and toward increased concern with the portability and development of applications.

Another growing topic of discussion was building applications for, on and with Windows Azure. Microsoft has actually become dramatically more involved in open source in an honest and more integrity based way. I’m honestly amazed at how far they’ve come from the declaration years ago that “open source is a cancer” and the all too famous “linux is communism“. Whatever that was supposed to mean, they didn’t seem to get it back then. Now, however, they regularly contribute to open source projects on CodePlex but also GitHub and other places. Microsoft even contributed to the Linux kernel a few months ago.

That leads me to the next topic that came up a number of times…

Architecture

There’s been a lot of discussion about architecture around PaaS, containers (more on that in a moment), distributed systems in general and distributed databases. As I wrote about recently in “Architectural PaaS Cracks or Crack PaaS”, the world of distributed systems and distributed databases has more than a few issues when working together in a PaaS environment. This brought up the discussion about what solutions exist today, solutions I look forward to writing about and building in the coming months.

The most immediate solution to scalable data sources is still to run your operational data sources, such as Neo4j, Redis, Riak or another database, autonomously but residing close to your PaaS system. The current public PaaS providers do exactly this, and in some cases extend that to offer the databases and data sources as services through add-ons. These are currently great solutions, but they require time, effort and custom development work when set up internally.

This leads me to the last topic…

The Story of a Container – Docker

Well, not just Docker, but containers in general and Docker specifically. First some context about what a container is.

Container – In this particular context I’m writing about a container, or more specifically a runtime container, that isolates resources for applications or services. Containers are common in PaaS technologies to help isolate the specific services or applications when they’re on a single physical machine or instance. Of the respective PaaS systems that came up at OSCON, dotCloud (from the same team that created Docker) uses Docker, Cloud Foundry has Warden, and OpenShift has gears and Red Hat Enterprise Linux OS specific containers.

I’ve studied Warden a little in the past while I was working with AppFog and Tier 3 around Cloud Foundry. Warden is a great piece of technology. However, the star at OSCON was clearly Docker. I jumped into a number of conversations around Docker. These conversations would then turn toward containers becoming the key to PaaS tooling and systems growth and increasing capabilities. That leads me back to my previous blog entry “Architectural PaaS Cracks or Crack PaaS” and one of the key solutions to the data tier issue.

Containers, A Solution for Scaling the Data Tier

One of the issues that comes up when trying to scale any distributed database in a PaaS Environment is how to provide multi-tenancy without spooling up new instances for each and every single installation of a node within that distributed database. Here’s an example diagram of the requirements behind a scalable distributed database.

Masterless, Distributed Cluster of Nodes

In a default configuration you’d want each node to be running on a physical machine or dedicated virtual instance. This is for performance reasons as well as reasons for load balancing, security, data integrity and a host of others. This is the natural beginning state of a highly available distributed database or distributed system.

Trying to deploy something like this into a PaaS environment is tricky. Take into account that there is no such thing in application or service speak as an instance, and especially not anything such as a physical server. The real division between process and resources is containers. These containers are what actually need to run the distributed system node. This becomes possible if a distributed system node can be deployed to and executed from within a container.

Enter Docker

After reviewing Docker, the capabilities around it and the requirements of a distributed database, it looks like an ideal marriage of the two technologies. Docker already has Redis and other database technologies running on it. The container technology around Docker looks like an ideal fit to extend distributed systems to run autonomously of a single physical machine or a single instance per node. This would enable nodes to be deployed as resources are available, providing a more seamless and PaaS style deployment for systems like Cassandra, Riak and related distributed systems. Could this be the next evolution of affordable distributed systems, containers to the rescue?
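As a rough sketch of what that could look like on a single host (the image name and ports here are hypothetical, only to show the shape of the idea), each database node gets its own container and its own port mapping rather than its own dedicated instance:

# three hypothetical database node containers sharing one host
docker run -d --name db-node-1 -p 8081:8098 myorg/distributed-db
docker run -d --name db-node-2 -p 8082:8098 myorg/distributed-db
docker run -d --name db-node-3 -p 8083:8098 myorg/distributed-db
# verify all three node containers are running
docker ps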

I’ll be reporting back on my progress, this could be cool!

Stay tuned for a write up on Docker in the near future. For more information now check out http://www.docker.io.

Consistent Hashing – Learning About Distributed Databases :: Issue 002

One of the core tools in the belt of the distributed database is consistent hashing. In Riak this is especially true, as it stands at the core of a Riak cluster. Hashing, using a hash function, is an algorithm that maps data of variable length to data of a fixed length. In other words, odd things like the names of things get mapped to integers. Consistent hashing is a special kind of hashing that provides the pattern for mapping keys and all related functionality around a cluster ring in Riak.

Consistent hashing was originally devised by David Karger, a professor of computer science at MIT (Massachusetts Institute of Technology). He’s also known for Karger’s Algorithm, a Monte Carlo method that computes the minimum cut in a connected graph (graph theory related stuff). Along with these developments he’s been part of many other efforts and contributed to computer science in many ways.

Remapping, Mapping and Keeping Distributed (& Available)

One key property of a consistent hash is that it minimizes the number of keys that must be remapped when the set of buckets changes. With a regular hash, when the number of slots changes, nearly the entire key space must be remapped.

Consistent hashing is based around mapping each object to a point on a circle. The system maps each storage bucket to pseudo-randomly distributed points on the edge of this circle.

The system finds where to place the object by hashing its key to a point on the edge of the circle, then walks the circle until it falls into the first bucket it finds. This results in each bucket containing the resources that fall between its point and the next bucket point.

When a bucket disappears for any reason, the objects pseudo-randomly mapped to it get remapped to different buckets. When a bucket appears, such as becoming available again or being added, a similar process occurs.
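To make the circle walk concrete, here’s a minimal consistent-hash ring sketch in Node.js. This is purely my own illustration of the technique, not Riak’s implementation (Riak hashes keys onto a fixed 160-bit ring divided into partitions owned by vnodes):

var crypto = require('crypto');

// Map any string to an integer point on the circle using the first 4 bytes of an MD5 digest.
function hashToPoint(value) {
  var digest = crypto.createHash('md5').update(String(value)).digest();
  return digest.readUInt32BE(0);
}

// Precompute and sort each bucket's (node's) point on the circle.
function Ring(buckets) {
  this.points = buckets.map(function (name) {
    return { point: hashToPoint(name), bucket: name };
  }).sort(function (a, b) { return a.point - b.point; });
}

// Walk clockwise from the key's point; the first bucket point at or past it owns the key.
Ring.prototype.lookup = function (key) {
  var p = hashToPoint(key);
  for (var i = 0; i < this.points.length; i++) {
    if (this.points[i].point >= p) return this.points[i].bucket;
  }
  // Wrapped all the way around the circle; the first bucket owns it.
  return this.points[0].bucket;
};

var ring = new Ring(['node-a', 'node-b', 'node-c']);
console.log(ring.lookup('some-key')); // prints whichever node owns that slice of the circle

Removing 'node-b' from the bucket list only remaps the keys that hashed into node-b’s slice of the circle; everything else stays put, which is the whole point of the technique.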

The Basho Docs describe in brief that,

Consistent hashing is a technique used to limit the reshuffling of keys when a hash-table data structure is rebalanced (when slots are added or removed). Riak uses consistent hashing to organize its data storage and replication. Specifically, the vnodes in the Riak Ring responsible for storing each object are determined using the consistent hashing technique.

NOTES: This is not a single blog entry topic by any means. This is merely a cursory look at consistent hashing. In this entry I aimed to provide a basic description and coverage of the actions around consistent hashing. For more information and to dive even deeper into consistent hashing, I’ve included a few links that have extensive information on the topic: