November 4, 2017

Scraping BNcollege.com

My friend at school runs our college’s textbook library: they accept donated textbooks and lend them out for free to students who can’t afford their own. With a mission like that, you can imagine how irritating it was for her to discover that the school’s bookstore “doesn’t have a spreadsheet of next term’s textbooks”, despite a textbook search feature listed prominently on the bookstore website. Sounds like the bookstore just doesn’t like competition, and would like to make life as difficult as possible for my friend.

She asked me to write a scraper for the textbook search engine — if they’re not willing to hand over a list, maybe we can just take the list from them? It turns out many existing textbook scraping tools stopped working last year, since Barnes & Noble implemented a bunch of JavaScript to deter scraping with simple cURL requests. They also have some IP blocking techniques going on. They really don’t want people accessing textbook lists. (And remember, they get these lists from professors for free!)

I’m frankly pretty angry at Barnes & Noble. My friends are trying to give textbooks to people who can’t afford them, and the bookstore is intentionally getting in the way. So I’m releasing my working code as open source, and offering free installation help, support, and updates to anybody who needs it — just shoot me an email.

Circumventing the blocking techniques was a three-step process. First, we use chromedriver to run a real copy of Chrome, which can execute the JavaScript that BNcollege uses to make sure you’re not scripting. Second, every couple of requests, we restart Chrome with a new, fake user agent. This both clears the cookies and circumvents the user-agent-based blocking system they have in place. Third, and unfortunately this step is not automated, you will need a new IP address every 100 or so textbook requests to get around the IP blocks. If you’re on a school network, you can probably get a new one by simply changing your MAC address. On macOS, it looks like:

ifconfig en0
# look for the address after 'ether' in the above output,
# change a single hex digit, and use the result in place of
# the example MAC address in the following command:
sudo ifconfig en0 ether 11:22:33:44:55:66

The best part about getting a new IP address from the school is that Barnes & Noble can’t block your scraping IPs without blocking the school’s entire IP range, which isn’t really an option if they want students to use their site.
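
For anyone curious what the first two steps look like in code, here is a minimal sketch in Python. It is not the released scraper itself. It assumes the Selenium bindings and chromedriver are installed, and the names `make_chrome`, `RotatingBrowser`, and `restart_every`, the sample user-agent strings, and the restart interval of 3 are all made up for illustration.

```python
import itertools

# Illustrative user-agent strings (assumptions); swap in any realistic ones.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
]


def make_chrome(user_agent):
    """Start a real Chrome via chromedriver, reporting the given user agent."""
    # Imported lazily so the rotation logic below can be tried without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(f"--user-agent={user_agent}")
    return webdriver.Chrome(options=options)


class RotatingBrowser:
    """Fetch pages, restarting the browser every `restart_every` requests.

    Each restart starts a fresh browser (so no cookies carry over) and
    moves on to the next user agent in the list.
    """

    def __init__(self, make_driver=make_chrome, user_agents=USER_AGENTS,
                 restart_every=3):
        self._make_driver = make_driver
        self._agents = itertools.cycle(user_agents)
        self._restart_every = restart_every
        self._driver = None
        self._count = 0

    def fetch(self, url):
        """Load `url` in the current browser and return the rendered HTML."""
        if self._driver is None or self._count >= self._restart_every:
            self.close()
            self._driver = self._make_driver(next(self._agents))
            self._count = 0
        self._driver.get(url)
        self._count += 1
        return self._driver.page_source

    def close(self):
        """Quit the current browser, if one is running."""
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

You would call `fetch()` once per textbook search URL and `close()` when done. Passing the driver factory in as an argument is just a convenience, so the restart logic can be exercised with a stand-in driver before pointing it at a real browser.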