July 13, 2015

Scraping a Website with Elixir

When I’m learning a new language, like I’m doing now with Elixir, I like to practice by doing everyday scripting tasks using the new language. In this case, there was a video tutorial site that obnoxiously didn’t have a speed up function on the player. I usually like playing those kinds of videos at 2x speed, so I built a scraper in Elixir to download all the .mp4 files. With the source files, playing them faster in VLC is trivial.

Note that I’m relatively new to Elixir, so this probably isn’t the cleanest code, but it’s at the very least functional!1 It’s quite a simple project, but when I was writing it, I couldn’t find many resources on the topic. Hopefully it can save a few minutes for somebody else.

Prerequisites

To start out, you’ll want to make sure you have Elixir and Mix installed. On OS X, this is a simple brew update; brew install elixir, but you can find instructions for every platform on the Elixir website. You’ll also need selenium-server installed, which on OS X is again brew install selenium-server-standalone, but you’ll have to look up instructions for your specific platform if you’re running something else.
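
For reference, the OS X setup boils down to the following (assuming Homebrew is already installed; on other platforms the package names will differ):

brew update
brew install elixir
brew install selenium-server-standalone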

Starting the Project

The standard way to create an Elixir project is a simple command in mix, Elixir’s build tool. Just run

mix new scraper
cd scraper

and it’ll spit out a new project named “Scraper” for us to add our code to.
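
If you’re curious, the generated project looks roughly like this (the exact files vary a little between Elixir versions):

scraper/
├── README.md
├── config/
│   └── config.exs
├── lib/
│   └── scraper.ex
├── mix.exs
└── test/
    ├── scraper_test.exs
    └── test_helper.exs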

We’re going to need a few libraries to create this project. Mix can automatically download and install them for us! Just change the deps definition in the mix.exs file to read:

defp deps do
  [
    {:httpoison, "~> 0.5"},
    {:floki, "~> 0.3"},
    {:hound, "~> 0.6.0"}
  ]
end

Then, in the root directory of your project, run mix deps.get to fetch and install them2.

There’s one more step, which is to set up Mix to automatically start the applications for the libraries we’ve installed. You can do this by changing the application definition in our mix config to read:

def application do
  [applications: [:logger, :httpoison, :hound]]
end

httpoison is the HTTP requests library that we’ll use to download files. hound is the library we’re using to connect to a Selenium server, so we can load web pages with Javascript running in a real live browser. I opted for this rather than just fetching the pages with httpoison, because the video site in question loaded the videos with Javascript, making the raw HTML useless to scrape on its own. You’ll find that running an actual automated browser like Selenium will often result in much better scraping results.
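
One hedged aside: Hound expects a Selenium server running on its default port. If your setup differs, Hound’s README describes a config/config.exs entry along these lines (the exact keys are worth double-checking against the version you’ve installed; the values shown are the usual defaults):

# config/config.exs
use Mix.Config

# Tell Hound which webdriver to talk to, and where (assumption: defaults shown)
config :hound, driver: "selenium", host: "http://localhost", port: 4444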

Logging in to the Site

Time to start writing our scraper’s code! You’ll notice that mix created a file for you called lib/scraper.ex. This is where we’ll be writing our code. Open it, and let’s start writing.

The video site that I was scraping was login-only, so my first order of business was to write a function that started a new Selenium session, navigated to the site’s login page, and logged me in with my credentials. Depending on how your site is laid out, and whether it requires logins or not, you’ll probably need to put different things here. I highly suggest taking a look at the Hound documentation!

Anyways, this site had a login field that had a #user_login ID, and a password field with a #user_pass ID, so my code looked like this:

defmodule Scraper do
  use Hound.Helpers

  def start do
    IO.puts "starting"
    Hound.start_session
    navigate_to "https://example.com/login"
    find_element(:id, "user_login") |> fill_field("rlord")
    el = find_element(:id, "user_pass")
    fill_field(el, "passwordblahblahblah")
    submit_element(el)
  end
end

Note how fill_field and submit_element both operate on the same element. This is because once the password was filled in, I wanted to send the submit command right to that field, automatically submitting the associated form. Much easier than finding the form element itself and submitting that. Also note the use Hound.Helpers. This runs a macro in Hound that loads a bunch of the helper functions you see in start, like navigate_to and find_element, so that we don’t need to write the fully qualified function name with Hound.Helpers.Blah... to call it.
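
For the curious, without that use line you’d have to spell out the full module paths yourself. Something like the sketch below, though I’m going from memory on how Hound namespaces its helpers, so double-check the module names against the docs for your installed version:

# Equivalent login steps with fully qualified names (assumption: Hound's
# helpers live under Hound.Helpers.Navigation/Page/Element)
Hound.Helpers.Navigation.navigate_to("https://example.com/login")
el = Hound.Helpers.Page.find_element(:id, "user_pass")
Hound.Helpers.Element.fill_field(el, "passwordblahblahblah")
Hound.Helpers.Element.submit_element(el)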

Anyways, now that we’ve written a login script, it’s time to test it out! First, in a different terminal window, run selenium-server so that our application has a Selenium process to connect to. Then, just run iex -S mix, and Elixir will compile and run our application and its dependencies, and plop us into an interactive REPL. Once it’s loaded, all we need to do is run our function with Scraper.start:

iex(1)> Scraper.start
starting
nil
iex(2)>

You should see a Selenium test browser fire up. If you get complaints about an element not being found, it’s probably just because the example URL example.com/login doesn’t actually exist. Try to find another website’s login page, and replace the IDs of the username and password fields with ones from your page to make the script work!

Keep the iex running, and let’s jump back to our app’s code.

Downloading Files

The next step is actually downloading a video from a video URL. We’ll create a new function to do this:

def download(src, output_filename) do
  IO.puts "Downloading #{src} -> #{output_filename}"
  body = HTTPoison.get!(src).body
  File.write!(output_filename, body)
  IO.puts "Done Downloading #{src} -> #{output_filename}"
end

This function simply downloads a file at a specified source URL to a destination filename. We can test it out pretty easily. Rather than starting up a whole new instance of Elixir, just go back to your still-running iex, and type

iex(2)> r Scraper
lib/scraper.ex:1: warning: redefining module Scraper
{:reloaded, Scraper, [Scraper]}
iex(3)>

The r Scraper command reloads the Scraper module from its source file automatically! This is one of my favorite features of Elixir. We don’t need to start a new iex or even specify a source filename for Scraper. It just works. Elixir even maintains its connection to Selenium, so unless we accidentally quit our terminal or the Selenium window, there’s no need to re-run Scraper.start. Anyways, let’s test that new function on a simple HTML file:

iex(3)> Scraper.download("http://example.com", "test.html")
Downloading http://example.com -> test.html
Done Downloading http://example.com -> test.html
:ok
iex(4)>

Just like that, Elixir downloads and writes an HTML file to disk. We can also easily pass it a video file’s URL, and it’d download that instead. Let’s do that.
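
One aside before we move on: HTTPoison.get! buffers the entire response body in memory before we write it to disk, which can sting on large videos. HTTPoison also supports asynchronous, chunked responses via a stream_to option. Here’s a rough sketch of a streaming variant; I haven’t verified the struct names and options against the exact 0.5-era API above, so treat this as a starting point rather than gospel:

def download_streaming(src, output_filename) do
  file = File.open!(output_filename, [:write])
  # stream_to: tells HTTPoison to deliver the response to our mailbox in pieces
  HTTPoison.get!(src, [], stream_to: self())
  write_chunks(file)
  File.close(file)
end

defp write_chunks(file) do
  receive do
    %HTTPoison.AsyncChunk{chunk: chunk} ->
      IO.binwrite(file, chunk)
      write_chunks(file)
    %HTTPoison.AsyncEnd{} ->
      :ok
    # Status and header messages arrive first; we don't need them here
    %HTTPoison.AsyncStatus{} -> write_chunks(file)
    %HTTPoison.AsyncHeaders{} -> write_chunks(file)
  end
end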

Scraping a Site for a Video URL

Just a script that downloads files from URLs is somewhat useless. We want to be able to give it the URL of a page containing the embedded video, and have it automatically find the video source URL and download it. In the case of my video site, after inspecting the source, I discovered that the site kept the video’s URL in the src attribute of a <source> tag, and that there weren’t any other <source> tags on the page. Although Hound has some HTML parsing and searching features, I had trouble getting them to work, so I just used another HTML parsing library, Floki, and passed the DOM of the page to it.

def get_video(url, output_filename) do
  navigate_to url
  page_source() |> Floki.find("source") |> Floki.attribute("src") |> hd |> download(output_filename)
end

Note that Hound’s page_source() function is different from just grabbing the HTML with HTTPoison. It is the logged-in source, since we first ran the start function to log us in. Also, it is the current DOM, after any Javascript manipulations have touched the page, which is crucial, since for the site I was scraping, the <source> tag was added with Javascript. One more detail: Floki.attribute returns a list of matching attribute values, so we grab the first (and only) one with hd before passing it to download.

You’ve probably already seen the |> operator, since every Elixir programmer loves it and uses it for everything, but if you haven’t, it’s just a cool way to pipe the output of one function into the first argument of another. It’s equivalent to saying

  download(hd(Floki.attribute(Floki.find(page_source(), "source"), "src")), output_filename)

but much, much easier to look at and understand. Let’s test out our new function by once again reloading the module, and then running the script. Don’t forget to rerun Scraper.start if you accidentally closed the Selenium browser or restarted iex.

iex(4)> r Scraper
lib/scraper.ex:1: warning: redefining module Scraper
{:reloaded, Scraper, [Scraper]}
iex(5)> Scraper.get_video("http://example.com/video", "test.mp4")
Downloading https://embed-ssl.wistia.com/deliveries/f015928adefadef/file.mp4 -> test.mp4
Done Downloading https://embed-ssl.wistia.com/deliveries/f015928adefadef/file.mp4 -> test.mp4
:ok
iex(6)>

Adding More Processes

Depending on the size of the video, this command could take a while to run. This is kinda obnoxious — it’d be nice to run the download function in the background, so that we can run other Elixir commands, or perhaps start other downloads. To do this, we’ll use the spawn function. Change your get_video function to read:

def get_video(url, output_filename) do
  navigate_to url
  video_src = page_source() |> Floki.find("source") |> Floki.attribute("src") |> hd
  spawn(Scraper, :download, [video_src, output_filename])
end

This spawn function says “call the download function on Scraper, and pass it the arguments video_src and output_filename.” If we recompile and run get_video again, you’ll note that the function returns almost immediately, before actually downloading the video:

iex(6)> r Scraper
lib/scraper.ex:1: warning: redefining module Scraper
{:reloaded, Scraper, [Scraper]}
iex(7)> Scraper.get_video("http://example.com/video", "test.mp4")
Downloading https://embed-ssl.wistia.com/deliveries/f015928adefadef/file.mp4 -> test.mp4
#PID<0.196.0>
iex(8)>

We can run other commands at this point, and eventually, we’ll see this message3 appear, letting us know our download is done:

Done Downloading https://embed-ssl.wistia.com/deliveries/f015928adefadef/file.mp4 -> test.mp4
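
As an aside, a bare spawn works fine here, but Elixir’s Task module offers a slightly more structured take on the same fire-and-forget pattern. A minimal sketch, dropped into the get_video above:

def get_video(url, output_filename) do
  navigate_to url
  video_src = page_source() |> Floki.find("source") |> Floki.attribute("src") |> hd
  # Task.start is the Task-flavored equivalent of spawn/3 for background work
  Task.start(Scraper, :download, [video_src, output_filename])
end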

From here, you should be set to write your own simple scraping applications in Elixir. Be sure to check out the Hound documentation for more Selenium commands that you might need. If you’re looking for inspiration, perhaps a function that works through a list of video page URLs and downloads each one with the above script might be fun. Or, you could add error handling to get_video for when it doesn’t find a <source>. Or write a scraper for a completely different kind of site. The possibilities are endless!
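
To give one hedged example of that error-handling idea: since Floki.attribute returns a list that’s simply empty when nothing matched, you can pattern match on the result instead of calling hd and crashing:

def get_video(url, output_filename) do
  navigate_to url
  case page_source() |> Floki.find("source") |> Floki.attribute("src") do
    [video_src | _] ->
      # Found at least one <source>; download the first in the background
      spawn(Scraper, :download, [video_src, output_filename])
    [] ->
      IO.puts "No <source> tag found at #{url}"
      :error
  end
end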


  1. In the “this code runs without crashing” sense. Hound seems not-at-all functional in the “functional programming” sense, although I suppose that’s a result of Selenium being very stateful? 

  2. Some users are reporting “Could not compile dep hackney” bugs when installing dependencies. I haven’t confirmed it, but apparently these instructions can help fix it. 

  3. In my limited exploration so far, my second favorite feature of Elixir is how, when a different process prints something, iex prints the new content above the current prompt, rather than printing it after the iex(8)> and messing up whatever you typed. So snazzy!