Configuring an environment for different tools without any risk of breaking my work computer sounds like a dream. Bonus points if I can use those configurations to practice firing up a cluster of virtual machines. In comes Vagrant.

Vagrant wraps around virtualization providers (VirtualBox in vanilla Vagrant, though it can manage other providers as well), so the entire setup of a virtual machine can be done through a script rather than by hand. This lets you version control the machine configuration and get precisely repeatable environments for yourself and anyone else you share your box with.

Let's start from scratch and build up to a virtual machine running Spark that can be created using just a couple of keystrokes. I'm using Ubuntu 14.04, though everything here should just work on Mac as well. Every code block starting with $ denotes a command to be run in your terminal.

Initial Setup

Vagrant is a script-flavored wrapper for virtual machines. Its out-of-the-box provider is VirtualBox, so VirtualBox needs to be installed before Vagrant.

$ sudo apt-get install virtualbox
$ sudo apt-get install vagrant

Let's download a pre-made box to use as our starting point. There are plenty available online. We'll be using the vanilla Ubuntu Server 12.04 LTS box. Add the box to Vagrant with the vagrant box add <name> <url_to_download_from> command.

$ vagrant box add precise32

Use vagrant box list to print the list of available boxes. Verify you now have at least precise32 (virtualbox) in that list.

It's now time to initialize a Vagrant project. Start a new directory to play in, cd into that directory, and run $ vagrant init <boxname>.

$ vagrant init precise32

Your project directory is now initialized with a file named Vagrantfile. Its location tells Vagrant what to consider the root directory of your project, and it is the version-controllable script that you will use to share your to-be box. We'll come back to this file later to customize our box but, for now, tell Vagrant to fire up the vm.

$ vagrant up
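
While the machine boots, it's worth a peek at what vagrant init wrote. Stripped of its comments, the generated Vagrantfile should boil down to something like the sketch below. The exact wrapper varies by Vagrant version, so treat this as an approximation rather than a copy of your file.

```ruby
# Approximate contents of a freshly initialized Vagrantfile, comments removed.
# "2" is the configuration format version, not the Vagrant release number.
Vagrant.configure("2") do |config|
  # The box this project's vm is built from; matches the name given to `vagrant init`.
  config.vm.box = "precise32"
end
```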

Congratz! You've just fired up your first vagrant-based virtual machine. Now what?

Manually Install Spark

My main motivation for working with Vagrant is to make a playground for setting up and running different big data tools without breaking my primary environment (many boxes were hurt in the writing of this post). The next tool I want to test drive is Spark, so Spark is what I'll be installing into this vm.

Let's log into your new vm. Still in your project's root directory, run

$ vagrant ssh

You'll now be logged in as user vagrant on your precise32 box.

We'll be using a pre-built version of Spark, so getting it running is fairly straightforward. First we'll need to install a JDK.

$ sudo apt-get -y update
$ sudo apt-get -y install openjdk-7-jdk

Next, we need to download the pre-built Spark and extract the archive.

$ wget
$ tar -zxf spark-1.1.0-bin-hadoop1.tgz

Test the Installation

Let's move into the newly unzipped spark directory.

$ cd spark-1.1.0-bin-hadoop1

Spark is pre-built and should just run. Let's run the trivial example and make sure it's working. Fire up the Spark-based Scala shell.

$ ./bin/spark-shell

The following two lines, entered into the newly launched spark-shell, will open one of Spark's files and count the number of lines in it.

scala> val textFile = sc.textFile("")
scala> textFile.count()

You will have seen lots of output print while the shell opened and after each command. Make sure the last line printed is res0: Long = 126. If so, it worked! Nice job. I'll dive more into Spark in a future post, but for now let's get back to Vagrant. Exit the spark-shell, exit out of your vm, and make sure you're back in the root directory of your Vagrant project.

Script-izing Our Installation

We're using Vagrant to let us script-ify and version control our virtual environment creation, so let's take everything we did manually up top and put it into Vagrantfile. Open your project's Vagrantfile in your favorite text editor and we'll automate everything we did by hand earlier.

Your Vagrantfile will already have the line config.vm.box = "precise32", telling Vagrant to use the precise32 box. Before, we had to make sure we had a copy of that box locally. Alternatively, we can give Vagrant a URL where it can find the box and download it if it's not available locally. We do that by adding the following line below the config.vm.box = "precise32" line.

config.vm.box_url = ""

Next, we need to tell Vagrant to install the JDK and to download and unpack Spark. We did this through four commands above. To tell Vagrant to run a shell command, use the form config.vm.provision :shell, inline: "<command>". So, we'll want to add the following four lines next:

config.vm.provision :shell, inline: "sudo apt-get -y update"
config.vm.provision :shell, inline: "sudo apt-get -y install openjdk-7-jdk"
config.vm.provision :shell, inline: "wget"
config.vm.provision :shell, inline: "tar -zxf spark-1.1.0-bin-hadoop1.tgz"

Your Vagrantfile now contains the box settings plus the four provisioning lines. Let's test it.
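
Assembled, the pieces above might look roughly like this sketch. The two constants are hypothetical names introduced only to keep the lines short, and their values are placeholders to swap for the box and Spark download URLs you used earlier.

```ruby
# Sketch of the finished Vagrantfile. BOX_URL and SPARK_URL are illustrative
# names, not Vagrant settings; replace the placeholder values before running.
BOX_URL   = "<url_to_precise32_box>"
SPARK_URL = "<url_to_spark-1.1.0-bin-hadoop1.tgz>"

Vagrant.configure("2") do |config|
  # Box to build from, plus where to fetch it if it is not installed locally.
  config.vm.box     = "precise32"
  config.vm.box_url = BOX_URL

  # Shell provisioners run in the order they are listed.
  config.vm.provision :shell, inline: "sudo apt-get -y update"
  config.vm.provision :shell, inline: "sudo apt-get -y install openjdk-7-jdk"
  config.vm.provision :shell, inline: "wget #{SPARK_URL}"
  config.vm.provision :shell, inline: "tar -zxf spark-1.1.0-bin-hadoop1.tgz"
end
```

Provisioning runs when vagrant up first creates the machine, and vagrant provision re-runs it on a machine that already exists.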

Test Drive

We're again back in your Vagrant project's root directory. Clean out the old vm.

$ vagrant destroy

Let's also get rid of the box we manually downloaded to make sure Vagrant can download it automatically.

$ vagrant box remove precise32

Now fire up the vm from our new and shiny Vagrantfile. This is going to take some time to re-download the box and Spark.

$ vagrant up

Now ssh into the vm like before and make sure Spark still works. Congratz! You have Spark running in a virtual environment that you can do your worst to with no real repercussions.

Next steps

There is plenty of great documentation, as well as very useful comments included in the original Vagrantfile. I'll be using this to practice configuring systems. What will you use it for?
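
The generated comments typically include a provider-specific example. As a small illustration, and assuming the VirtualBox provider, a sketch like the following raises the vm's memory, which may help if spark-shell feels cramped. The 1024 figure is an arbitrary example value.

```ruby
Vagrant.configure("2") do |config|
  config.vm.box = "precise32"

  # Settings inside this block apply only when the vm runs under VirtualBox.
  config.vm.provider :virtualbox do |vb|
    # Passes arguments through to `VBoxManage modifyvm`; Vagrant replaces :id
    # with the id of this vm. 1024 (MB) is an example, not a recommendation.
    vb.customize ["modifyvm", :id, "--memory", "1024"]
  end
end
```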

Allen Grimm
