PhantomJS in a Bash Loop

February 7, 2016

The problem I recently tackled was quite simple: I have a list of URLs (plus some additional information) in a tab-separated file (.tsv) and want to run a PhantomJS script once per line, passing some of that line's fields as parameters.

At first it seems obvious to just open the file and use while read -r -a line. That reads the next line of the file, splits it on tabs, and exposes the fields as an array in the body of the loop. Works like a charm! However, as soon as I added my PhantomJS call, the loop stopped after the first iteration. After hours of testing different ways to read the file and trying to figure out what was happening, it seemed clear that there was simply no line left to read once Phantom was done. (A likely explanation is that phantomjs inherits the loop's standard input and consumes the rest of the file itself.)

Just to show how messy it ended up:

# header of tsv file
#  0     1       5 
# url  id ... title   ...
COUNTER=0
{
  #skip header
  read
  # set separator to TAB
  IFS=$'\t'
  # read file line by line
  while read -r -a row
  do
    let COUNTER=COUNTER+1
    hash=`echo -n "${row[0]}" | md5sum | awk '{ print $1 }'`
    phantomjs phantomscript.js "${row[0]}" "${row[1]}" "${hash}" "${row[5]}"
    echo "(${COUNTER} => ${hash}) ${row[0]}"
    # after that, nothing else happens, script stops
    # (obviously there's one more attempt to read, but the result is empty)
  done
} < "${FILE}"

I’m not too subtle when it comes to this kind of stuff, so frustration kicked in. I had already started moving the file parsing and looping into my phantom script, which would have meant lots of extra work, when I decided to give it one more shot with a different approach, this time with no concern for elegance or good practice.

The idea was: instead of reading a file line by line, take a for loop and use sed to read one specific line at a time. The following example worked as expected:

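# count the lines; the redirect keeps the file name out of wc's output
NUM_LINES=$(wc -l < "${FILE}")
# start at 2 to skip the header; use <= so the last line is processed too
for (( COUNTER=2; COUNTER<=${NUM_LINES}; COUNTER++ ))
do
    # print only line number COUNTER, then quit
    LINE=$(sed "${COUNTER}q;d" "${FILE}")
    # split the line on tabs into the array "row"
    IFS=$'\t' read -r -a row <<< "${LINE}"
    hash=$(echo -n "${row[0]}" | md5sum | awk '{ print $1 }')

    phantomjs phantomscript.js "${row[0]}" "${row[1]}" "${hash}" "${row[5]}"
    echo "(${COUNTER} => ${hash})"
done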

The key components are fairly simple. First, you need the range of the for loop (starting at 2 to skip the header). Second, fetch the current line with sed "NUMq;d" FILE, where NUM is the number of the line you want (counting from one, not zero): d deletes every other line without printing it, and NUMq quits at line NUM, which prints that line first. Finally, to parse the line into an array, you can still use read -r -a as before, this time fed from a here-string (<<<).
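For comparison, here is a minimal sketch of how the original while loop might be kept, assuming the loop stopped because the child process swallowed the rest of the file from standard input. The file is read on descriptor 3 instead of stdin, and the child gets /dev/null as its input (either measure alone should be enough). The function greedy_child is a hypothetical stand-in for the phantomjs call, and the sample data is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep the while-read loop, but read the file on descriptor 3
# so a stdin-hungry child cannot swallow the remaining lines.

# Hypothetical stand-in for "phantomjs phantomscript.js ...":
# it drains its stdin, as the real program is assumed to do.
greedy_child() { cat > /dev/null; }

# Made-up sample file: one header line plus three data lines
FILE=$(mktemp)
printf 'url\tid\nhttp://a.example\t1\nhttp://b.example\t2\nhttp://c.example\t3\n' > "${FILE}"

PROCESSED=0
{
  # skip the header
  read -r -u 3 _header
  # IFS is set for this read only, so it does not leak into the rest of the script
  while IFS=$'\t' read -r -u 3 -a row
  do
    # the child reads from /dev/null, never from the file
    greedy_child "${row[0]}" "${row[1]}" < /dev/null
    PROCESSED=$((PROCESSED + 1))
    echo "(${PROCESSED}) ${row[0]}"
  done
} 3< "${FILE}"

rm -f "${FILE}"
```

If the assumption holds, the real call would only need the same < /dev/null redirect appended to the phantomjs line.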

One tip along the way: if you ever want to force-stop the script with CTRL+C, you will notice that it carries on with the next line anyway (apparently because PhantomJS handles the signal itself). A simple fix is to set up a trap in your shell script with trap "exit" INT.
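A small sketch of where such a trap could go, placed before the loop starts. The function slow_job is a hypothetical stand-in for the phantomjs call, and the message and the exit status 130 (the conventional 128 + 2 for SIGINT) are assumptions, not part of the original script.

```shell
#!/usr/bin/env bash
# Leave the whole script on CTRL+C instead of moving on to the next line.
# 130 = 128 + 2 (SIGINT), the conventional status for "interrupted".
trap 'echo "Interrupted, stopping." >&2; exit 130' INT

# Hypothetical stand-in for the phantomjs call
slow_job() { sleep 0.1; }

for url in http://a.example http://b.example
do
  slow_job "${url}"
  echo "done: ${url}"
done
```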

One more: what if there was an error processing a page? You probably want to keep track of those cases so you can look at them later. Simply call phantom.exit(1) (or any other non-zero exit code) in your phantom script and, within your loop, right after the phantomjs call, add this line: if [[ $? -ne 0 ]] ; then echo "${LINE}" >> "${FAILED_FILE}" ; fi.
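Put together, the failure log could look roughly like the sketch below. It tests the command directly with if ! instead of inspecting $? afterwards, which is equivalent here. The function fake_phantom is a hypothetical stand-in that fails for any URL containing "bad", and the three sample lines are made up.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for phantomjs: returns non-zero for URLs containing "bad"
fake_phantom() { [[ "$1" != *bad* ]]; }

# Lines that could not be processed are collected here
FAILED_FILE=$(mktemp)

for LINE in $'http://ok.example\t1' $'http://bad.example\t2' $'http://fine.example\t3'
do
  IFS=$'\t' read -r -a row <<< "${LINE}"
  # a non-zero exit status means the page failed: keep the whole line for later
  if ! fake_phantom "${row[0]}" "${row[1]}" < /dev/null; then
    printf '%s\n' "${LINE}" >> "${FAILED_FILE}"
  fi
done

# arithmetic expansion strips any padding wc may add to the number
FAILED_COUNT=$(( $(wc -l < "${FAILED_FILE}") ))
echo "failed: ${FAILED_COUNT}"
```

Because the failed lines are stored unchanged, the resulting file could later be fed back into the same loop for a retry.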

I hope that helps someone somewhere out there.

Thanks for reading!