Thursday, July 23, 2015

Optimizing the Condor Workflow

The way I have my program set up...
  1. I call a script that writes out a CSV for each bird containing all the predictor variables and the response
  2. The next script reads that CSV to fit the model.
This saves R memory (and helps make everything run in parallel). Today, I'm trying to make Condor work a little better for us, focusing on Step 1. Since Condor's structure for automating multiple jobs is built on folders, I made a folder (in this case empty) for each species' numeric code. I successfully used --parg=unique to pass the folder name as an argument. For this step, the required memory specs were about the same as yesterday.
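The folder setup for step 1 can be sketched like this (the species codes below are made-up examples, and "perspp" is the folder name I use later in this post):

```shell
# One (empty) job folder per species numeric code, collected under perspp/.
# The codes here are invented examples, not real species codes.
mkdir -p perspp
for code in 02010 02040 03100; do
  mkdir -p "perspp/$code"
done
ls perspp
```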

As far as I can tell, the way the DAG command is set up, it's pretty good at finding things recursively in folders. The "shared" folder is set apart for use by the program. It seems to go through the expected folder structure and figure out where everything you reference is, presumably by keeping track of the paths. So unless something is passed as an argument to your script, it should be able to find it easily. In my case, I was able to copy all my folders with the data to be incorporated, etc. into the shared folder, and the paths to the right places were found. Practically speaking, that means you can take "setwd" out of your scripts, along with any references to hard-coded file locations: Condor knows your starting location when the job runs.

As you can imagine, we are beginners, and there is a lot these applications can do. Thankfully, we have a good, communicative person to help us get started. For example, I learned how to string together dependent workflows. For the example I've been describing, this means I can automate step 1 leading into step 2 by making my own DAG. Follow the basic instructions for making a DAG for step 1, then put your shared folder for step 2 in step 1's output folder. Make a DAG for that step like you normally would (this time, the input folder is the output of the last), and then you can splice them together (see the SPLICE section of that link). Here is the example given to me...

SPLICE job1 mydag.dag DIR step1_out
SPLICE job2 mydag.dag DIR step2_out
PARENT job1 CHILD job2
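Taken together, writing out that outer DAG could look like this ("combined.dag" is a name I've invented here; each step's mydag.dag is the one mkdag generated in its own folder):

```shell
# Write the outer DAG that splices the two steps together.
# "combined.dag" is an invented name; the inner mydag.dag files
# are the ones mkdag generated in step1_out and step2_out.
cat > combined.dag <<'EOF'
SPLICE job1 mydag.dag DIR step1_out
SPLICE job2 mydag.dag DIR step2_out
PARENT job1 CHILD job2
EOF
cat combined.dag
```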

To move everything later...
for subdir in perspp/*; do cp --parents "$subdir/bird.csv" input; done
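For anyone unfamiliar with GNU cp's --parents flag: it recreates the full source path under the destination directory. A self-contained sketch (the folder names are invented for the demo):

```shell
# --parents copies perspp/02010/bird.csv to input/perspp/02010/bird.csv,
# creating the intermediate directories along the way (GNU coreutils only).
# Folder names here are examples.
mkdir -p perspp/02010 input
echo "demo" > perspp/02010/bird.csv
cp --parents perspp/02010/bird.csv input
find input -name bird.csv   # -> input/perspp/02010/bird.csv
```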

Wednesday, July 22, 2015

Butterflies - last week July

This week, I'm moving and getting ready for a conference, so I'll miss the last "wave." But, hopefully I'll see a few skippers before I leave.
  • Dorcas copper
  • marine blue - very rare!
  • Reakirt's blue - rare
  • Atlantis fritillary
  • green comma
  • Peck's skipper
  • tawny-edged skipper
  • crossline skipper

Tuesday, July 21, 2015

Bee Walk

A good friend imparted some of her bee knowledge on our walk to lunch! We passed a rec field in front of the dorm where we were eating, and she pointed a few out.
brown-belted bumblebee
After lunch, we checked out the little nearby garden of native flowers, and found a few more species.
common eastern bumblebee
As the name might imply, we saw several of these, but that was the best shot I got. As we were leaving, she spotted one that is more up my alley...
Peck's skipper
Then, we made our way to the Allen centennial garden. We wound around to the rock garden where I found this gem...
tule bluet
Then, we went to the English garden. This bee was so huge it's hard to believe I didn't snap a better pic...
black-and-gold bumblebee
All in all, a great day learning some new insects with friends! Can't wait to do it again sometime soon.

Monday, July 20, 2015


The other rabbit hole! :) We're learning how to use HTC (high-throughput computing) for our research. The basics: HTCondor networks computers across campus (and beyond) so you can use them in their downtime. We had a nice meeting with a rep who helped us get set up, but there's still a lot to figure out. First, you'll want to run a few test cases, so that's what this post is about.

Step 1. Log in to the node you were assigned. My desktop at the office is Windows, so I use WinSCP for a lot of Linux server interactions. UNIX/Linux machines often use the scp utility to move files between machines, and WinSCP is a nice GUI implementation that can be installed on Windows. If Windows computer --> Linux server is also your setup, you'll also need to install PuTTY and point WinSCP to where it's installed. I logged into the submit node I was assigned (as the website notes, if you applied in 2015, you were probably assigned one) and was able to easily move my submit file to the node.

Step 2. (For R) We're a little spoiled because CHTC has helped us out with all this and has built support for common (but not ubiquitous) software. For example, R (and whatever packages you may need) may not be on the computers you want to use, so you have to "package them up" and send them along with your task. I'm using R-INLA, which is technically its own program. I downloaded the latest tar.gz, named "INLA," and tried to pass it with --rlibs like any other package. The encouraging part was that it told me it was missing a required package, which to me meant it at least "got that far." The package it wanted was sp, so I downloaded that package source too. As the webpage that explains how to compile R for ChtcRun notes, order matters, so I bundled the R tar with sp first in the list of --rlibs.
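The ordering requirement just means dependencies come before the packages that need them; illustratively (the file names below are placeholders, not the exact tarballs I used):

```text
--rlibs=sp.tar.gz,INLA.tar.gz
```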

Step 3. Make a workflow for your R analyses. Again, CHTC has you covered. As I've mentioned before, I have an input file for each bird for my analyses (maybe I'll streamline that part sometime later, but for now I'll just give it input files). You need a file structure representing each of your different cases. So, I made a folder called "BBS" in ChtcRun, and moved all my input files there. Then, I made a folder for each.
find . -maxdepth 1 -name "*.csv" -exec sh -c 'NEWDIR=$(basename "$1" .csv); mkdir -p "$NEWDIR"; mv "$1" "$NEWDIR"' _ {} \;
It looks like I can then maybe specify these files using --parg=unique or something, but for this first run, I think it's easier to just rename the files now that they're in unique folders.
for subdir in *; do mv "$subdir/$subdir.csv" "$subdir/bird.csv"; done
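The two commands above could also be collapsed into a single pass that makes the folder and renames the file in one step; a self-contained sketch (the CSV names are stand-ins for the real per-bird files):

```shell
# Demo: one folder per CSV, with each file renamed to bird.csv as it moves.
# The file names below are invented stand-ins for the real per-bird CSVs.
mkdir -p demo && cd demo
touch 02010.csv 02040.csv
for f in *.csv; do
  d=$(basename "$f" .csv)   # folder named after the file
  mkdir -p "$d"
  mv "$f" "$d/bird.csv"     # standardize the name in the same step
done
ls */bird.csv
cd ..
```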
In your shared folder, put your R script and the R tar you made in Step 2. I imagine our call to mkdag will look something like this...
 ./mkdag --cmdtorun=compute-index.R --data=BBS --outputdir=model_output --pattern=model.Rdata --parg=bird.csv --type=R --version=sl5-R-3.1.0  
Step 4. Submit the DAG to Condor. The instructions are at the bottom of the last link, and when the DAG file is generated, it gives you instructions there too. You can check the status of your jobs with...
 condor_q $USER  
So, I ran a few test cases, and found that for the null models (maybe some of the smallest) I needed...
  • 500 MB disk
  • 10 MB memory
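In plain HTCondor submit-file terms (a sketch only; the templates ChtcRun generates may phrase these differently), those requests would look like:

```text
request_disk   = 500MB
request_memory = 10MB
```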

Sunday, July 19, 2015


Now comes the potentially dangerous rabbit hole ;)
female bluet?
I finally got a photo of the little damselfly (I think it's the same species, at least) that's been flitting around behind my apartment. I've also seen others nearby that may be juveniles. This is the first one I'm trying to learn (besides the jewelwings I grew up around).

Saturday, July 18, 2015


Outside my apartment, there were the typical gnats and green bottle flies when I left (neither of which I have any knowledge to ID to species!). I walked Lakeshore Path and the Arboretum today. I had a hard time getting a photo of a harvester today, but for good reason! As I was "stalking" it to get a photo, it beat me to the punch and landed on my hand :) I even walked around with it for a little bit before it spooked...only to land on my dress!

Tuesday, July 14, 2015

Wisconsin Butterflies - 3rd Week July

All species except Karner blue (perhaps obviously!) found in Dane Co. Good luck!
  • coral hairstreak - hosted by cherry and plum, nectars on butterfly weed
  • striped hairstreak
  • Karner blue
  • Aphrodite fritillary
  • regal fritillary
  • Appalachian brown (I think I may have caught a glimpse of this spp. this week already)
  • northern broken-dash
  • broad-winged skipper
  • black dash
  • dun skipper