PACER receipt scraping

Some preliminary results from my attempt to build a wrapper API around PACER:

Just before you view a document in PACER, you’re greeted by a receipt page which cheerfully shows you just how much you’re being overcharged for the document in question. The table has the following format (once intermediate cluttering nonsense is excised):

<table>
  <tr>Pacer Service Center</tr>
  <tr>Transaction Receipt</tr>
  <tr></tr><tr></tr>†
  <tr>[current time*]</tr>
  <tr>[Pacer login and client code]</tr>
  <tr>[Description and case number**]</tr>
  <tr>
    <th>Billable Pages:</th>
      <td>[# of billable pages]</td>
    <th>Cost:</th>
      <td>[cost, in $]</td>
  </tr>
  <tr></tr><tr></tr>
</table>

† Yes, they really do use two empty rows as spacers.
* This is the current time, (fortunately) in the standard "%a %b %d %H:%M:%S %Y" format. It seems to be in the time zone of the court.
** This row is entirely useless (for our purposes, anyway), as presumably if you’ve reached the receipt page you know what case you’re in and what document you just selected.
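
As a minimal sketch, the timestamp row can be parsed with the standard library using the format string quoted above. The function name and the sample timestamp are made up for illustration, and the weekday and month names are assumed to be in English (strptime’s %a and %b depend on the locale).

```python
from datetime import datetime

# Format string quoted in the footnote above.
RECEIPT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_receipt_time(text):
    """Parse the receipt's timestamp row into a naive datetime.

    The result carries no time zone; since the time appears to be in
    the court's local zone, the caller would have to attach that itself.
    """
    return datetime.strptime(text.strip(), RECEIPT_TIME_FORMAT)

# Hypothetical sample value, not copied from a real receipt.
stamp = parse_receipt_time("Mon Mar 04 14:05:09 2013")
print(stamp.isoformat())  # 2013-03-04T14:05:09
```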

There’s really only one useful bit of information in here, and that’s the number of pages (from which the cost can be derived). Note, though, that it really is billable pages: the rare documents that are free (that is, orders that are so marked) will have 0 here, and documents longer than the 30-page cap will have 30 here. (NB: in these cases PACER adds extra rows to this table. I haven’t yet figured this part out, so the code below probably doesn’t work in such cases.) Update: after some investigation, it turns out that for free opinions the billable pages field still lists the total number of pages. The added row is at the bottom of the table, so it should not affect any of the code below.
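
To make “the cost can be derived” concrete, here is a rough sketch that assumes the $0.10 per page rate and the 30-page cap mentioned in this post. The constant and function names are my own, and the rate is hard-coded only for illustration.

```python
from decimal import Decimal

# Assumptions taken from this post: $0.10 per page, capped at 30 pages.
RATE_PER_PAGE = Decimal("0.10")
PAGE_CAP = 30

def billable_pages(total_pages):
    """Billable page count for a document with the given total length."""
    return min(total_pages, PAGE_CAP)

def expected_cost(billable):
    """Cost in dollars for a given number of billable pages."""
    return billable * RATE_PER_PAGE

print(expected_cost(3))                   # 0.30
print(expected_cost(billable_pages(45)))  # 3.00
```

Using Decimal rather than float keeps the cents exact, which matters if the derived figure is later compared against the cost shown on the receipt.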

I use BeautifulSoup (Python; MIT license) for HTML scraping. This is the code that extracts the number of (billable) pages:

int(soup.table.find_all('tr')[7].find_all('td')[0].text)

And here’s the cost (which is currently just $0.10 per page, but hey, this could change in the future; Decimal comes from Python’s decimal module):

Decimal(soup.table.find_all('tr')[7].find_all('td')[1].text)
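
The two one-liners can be wrapped into a single helper, roughly as sketched below. The sample markup is hypothetical and only mirrors the row layout described above (I wrapped the text in td cells to keep it well formed), and stripping a leading “$” from the cost is a defensive guess, since the exact cell text may differ.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

# Hypothetical receipt, shaped after the table layout described above.
SAMPLE_RECEIPT = """
<table>
  <tr><td>Pacer Service Center</td></tr>
  <tr><td>Transaction Receipt</td></tr>
  <tr></tr><tr></tr>
  <tr><td>Mon Mar 04 14:05:09 2013</td></tr>
  <tr><td>login / client code</td></tr>
  <tr><td>Description and case number</td></tr>
  <tr><th>Billable Pages:</th><td>3</td><th>Cost:</th><td>0.30</td></tr>
  <tr></tr><tr></tr>
</table>
"""

def parse_receipt(html):
    """Return (billable_pages, cost) from a receipt page."""
    soup = BeautifulSoup(html, "html.parser")
    # Row 7 is the "Billable Pages / Cost" row; its td cells hold the values.
    cells = soup.table.find_all("tr")[7].find_all("td")
    pages = int(cells[0].text)
    # Assumption: tolerate a leading dollar sign if one is present.
    cost = Decimal(cells[1].text.strip().lstrip("$"))
    return pages, cost

pages, cost = parse_receipt(SAMPLE_RECEIPT)
print(pages, cost)  # 3 0.30
```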

Update for documents with attachments

Some documents (a lot, actually) come with attachments. In this case, putting the document # into qryDocument.pl (like we do) doesn’t give a receipt page; it gives a “Document Selection Menu”.

[Image: PACER Document Selection Menu]

The link to the actual main document is given by (BeautifulSoup code):

soup.table.find_all('tr')[0].find_all('td')[0].a['href']

Then you get to a receipt page which can be processed as described above.

(The PACER website has a JavaScript onClick handler on this link. I have no idea what it does, and copy-pasting just that link into the address bar seems to work, so…)

The number of pages (in the main document) is given by

int(soup.table.find_all('tr')[0].find_all('td')[1].text.split(" ")[0])

For some reason this page also tells you the size of the PDF, something I don’t think the usual receipt page has. Not sure what you’d need it for, but this is how you get it:

soup.table.find_all('tr')[0].find_all('td')[2].text
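
Put together, the three lookups for the main document might look like the sketch below. The sample markup, the href, and the size string are invented for illustration; only the cell positions come from the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical first row of a Document Selection Menu.
SAMPLE_MENU = """
<table>
  <tr>
    <td><a href="/doc1/0001">5</a></td>
    <td>12 pages</td>
    <td>1.2 MB</td>
  </tr>
</table>
"""

def parse_main_document(html):
    """Return link, page count, and size text for the main document."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.table.find_all("tr")[0].find_all("td")
    return {
        "link": cells[0].a["href"],
        # The cell reads like "12 pages"; keep only the leading number.
        "pages": int(cells[1].text.split()[0]),
        # Size is left as text, since its exact format is not known here.
        "size": cells[2].text.strip(),
    }

print(parse_main_document(SAMPLE_MENU))
```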

The attachments themselves should start at soup.table.find_all('tr')[3]. So if an attachment is in tr number X, its info should be as follows:

Link:

soup.table.find_all('tr')[X].find_all('td')[0].a['href']

Number, although you should know this already:

soup.table.find_all('tr')[X].find_all('td')[0].a.text

Title (which in my experience is usually just the very unhelpful “Exhibit”):

soup.table.find_all('tr')[X].find_all('td')[1].text

Page count (not sure whether this is billable or total; probably total):

int(soup.table.find_all('tr')[X].find_all('td')[2].text.split(" ")[0])

Size:

soup.table.find_all('tr')[X].find_all('td')[3].text

You know you’ve reached the end of the attachments when you see a tr with an hr in it. (Alternatively, checking whether the row’s .text in BeautifulSoup is empty seems to work too.)
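
Here is a rough sketch of the whole attachment walk: start at row 3, read the four cells of each row, and stop at the first row containing an hr. The sample markup (including what sits in rows 1 and 2) is invented; only the row and cell positions follow the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical menu: main document in row 0, attachments from row 3 on.
SAMPLE_MENU = """
<table>
  <tr><td><a href="/doc1/0001">5</a></td><td>12 pages</td><td>1.2 MB</td></tr>
  <tr><td></td></tr>
  <tr><td>Attachments</td></tr>
  <tr><td><a href="/doc1/0002">1</a></td><td>Exhibit</td><td>3 pages</td><td>210.5 KB</td></tr>
  <tr><td><a href="/doc1/0003">2</a></td><td>Exhibit</td><td>40 pages</td><td>2.0 MB</td></tr>
  <tr><td colspan="4"><hr></td></tr>
  <tr><td>footer</td></tr>
</table>
"""

def parse_attachments(html):
    """Return one dict per attachment listed in the selection menu."""
    soup = BeautifulSoup(html, "html.parser")
    attachments = []
    for row in soup.table.find_all("tr")[3:]:
        # A row containing an hr marks the end of the attachment list.
        if row.find("hr") is not None:
            break
        cells = row.find_all("td")
        attachments.append({
            "link": cells[0].a["href"],
            "number": cells[0].a.text.strip(),
            "title": cells[1].text.strip(),
            "pages": int(cells[2].text.split()[0]),
            "size": cells[3].text.strip(),
        })
    return attachments

for attachment in parse_attachments(SAMPLE_MENU):
    print(attachment["number"], attachment["title"], attachment["pages"])
```

Stopping on the hr rather than on empty text is just the choice made in this sketch; either check described above should end the loop at the same row.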