Scraping SEC Rulemaking with Scrapy

Intro

In the course of studying rulemaking, I've done a lot of complex scraping. Regulatory agencies put out incredible amounts of data, but it's hard to integrate into a useful form. Until now, I've mostly done ad-hoc scraping with Python tools like Requests and BeautifulSoup, which largely gets the job done. However, there are better tools out there, including Scrapy, which I'm starting to get a handle on. In this series of posts, I'll show how I use it to scrape the SEC's rulemaking data.

The first page we want to scrape is 

https://www.sec.gov/rules/rulemaking-index.shtml

Within the page there is a table that is a textbook example of untidy data (and so the "fun" begins; search for Hadley Wickham's writings on tidy data if you don't know what I'm talking about). There's a ton of great information here, though, and assembling this index ourselves would take an enormous amount of time.

It also contains links to other documents that are even more interesting to a political scientist, such as comments and meeting logs. Eventually we'll want those too, but let's start small and grab the directory of rules published by the SEC since 2008.

Nitty Gritty

I'm gonna skip past all the setup involved in installing Scrapy; that's well covered in the official tutorial.
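For the record, though, bootstrapping a project looks something like this (sec_rules is just a project name I made up; genspider writes out a spider skeleton for you):

pip install scrapy
scrapy startproject sec_rules
cd sec_rules
scrapy genspider sec_rulemaking www.sec.gov

As a first cut, here's what our spider looks like: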

import scrapy


class SecRulemakingSpider(scrapy.Spider):
    name = 'sec_rulemaking'
    # allowed_domains takes bare domains, not full URLs
    allowed_domains = ['www.sec.gov']
    start_urls = ['https://www.sec.gov/rules/rulemaking-index.shtml']

    def parse(self, response):
        # skip the first two <li> entries, which aren't docket rows
        for docket in response.xpath("//*[@id='rulemaking-index']/li")[2:]:
            yield {
                'last-action': docket.xpath(
                    "div[contains(@class,'last-action')]/text()"
                    ).extract_first(),
                'file-number': docket.xpath(
                    "div[contains(@class,'file-number')]/text()"
                    ).extract_first(),
                'rule-title': docket.xpath(
                    "div[contains(@class,'rulemaking')]/span[@class='name']/text()"
                    ).extract_first(),
                'divisions': docket.xpath(
                    "div[contains(@class,'rulemaking')]/span[@class='division']/text()"
                    ).extract_first(),
            }
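To take it for a spin, something like this from inside the project directory does the trick (rules.jl is just a filename I picked; the .jl extension tells Scrapy's feed exporter to write one JSON object per line):

scrapy crawl sec_rulemaking -o rules.jl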


Pretty straightforward-looking, and it produces some cool results (here's a taste):

{"divisions": "Corporation Finance", "last-action": "3/31/17", "rule-title": "Titles I and III of the JOBS Act", "file-number": "S7-09-16"}
{"divisions": "Corporation Finance", "last-action": "08/23/16", "rule-title": "Modernization of Property Disclosures for Mining Registrants", "file-number": "S7-10-16"}

The problem is that the final column is the messy one: it contains multiple regulatory actions per docket, plus links to other pages we'll eventually want to do something with (see Part 2). For now, the simplest thing is to define another function that parses each action, and the easiest place to prototype it is Scrapy's interactive shell.
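If you haven't tried it, the shell fetches a page and drops you into a Python session with the response object already built:

scrapy shell https://www.sec.gov/rules/rulemaking-index.shtml

From there I do something like the following.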

# pick a juicy one
docket = response.xpath("//*[@id='rulemaking-index']/li")[5]
actions = docket.xpath("div[contains(@class,'actions')]/a")
action = actions[0]
print(action.extract())

This yields something like:

<a class="final-rule" href="/rules/final/finalarchive/finalarchive2008.shtml#34-57711">
            <span class="type">Final Rule</span>
            <span class="title">Disclosure of Divestment by Registered Investment Companies in Accordance With Sudan Accountability and Divestment Act of 2007</span>
            <span class="date">4/24/08</span>
            <span class="release">34-57711</span>
        </a>

That href is gonna be interesting eventually, but for now we can just parse the text content:

def parse_action(action):
    return {
        'type': action.xpath("span[@class='type']/text()").extract_first(),
        'title': action.xpath("span[@class='title']/text()").extract_first(),
        'date': action.xpath("span[@class='date']/text()").extract_first(),
        'release': action.xpath("span[@class='release']/text()").extract_first(),
    }

And calling parse_action(action) yields

{'date': u'4/24/08',
 'release': u'34-57711',
 'title': u'Disclosure of Divestment by Registered Investment Companies in Accordance With Sudan Accountability and Divestment Act of 2007',
 'type': u'Final Rule'}

Pretty sweet. Let's try running our updated spider.

# -*- coding: utf-8 -*-
import scrapy


class SecRulemakingSpider(scrapy.Spider):
    name = 'sec_rulemaking'
    allowed_domains = ['www.sec.gov']
    start_urls = ['https://www.sec.gov/rules/rulemaking-index.shtml']

    def parse_action(self, action):
        # each action is an <a> tag with labeled <span> children
        return {
            'type': action.xpath("span[@class='type']/text()").extract_first(),
            'title': action.xpath("span[@class='title']/text()").extract_first(),
            'date': action.xpath("span[@class='date']/text()").extract_first(),
            'release': action.xpath("span[@class='release']/text()").extract_first(),
        }

    def parse(self, response):
        # skip the first two <li> entries, which aren't docket rows
        for docket in response.xpath("//*[@id='rulemaking-index']/li")[2:]:
            yield {
                'last-action': docket.xpath(
                    "div[contains(@class,'last-action')]/text()"
                    ).extract_first(),
                'file-number': docket.xpath(
                    "div[contains(@class,'file-number')]/text()"
                    ).extract_first(),
                'rule-title': docket.xpath(
                    "div[contains(@class,'rulemaking')]/span[@class='name']/text()"
                    ).extract_first(),
                'divisions': docket.xpath(
                    "div[contains(@class,'rulemaking')]/span[@class='division']/text()"
                    ).extract_first(),
                'actions': [self.parse_action(a) for a in
                            docket.xpath("div[contains(@class,'actions')]/a")],
            }

And the results look something like this:

{"divisions": "Corporation Finance", "last-action": "01/13/16", "rule-title": "Amendments to Forms S-1 and F-1", "actions": [{"date": "01/13/16", "release": "33-10003", "type": "Interim Final Temporary Rule", "title": "Simplification of Disclosure Requirements for Emerging Growth Companies and Forward Incorporation by Reference on Form S-1 for Smaller Reporting Companies"}], "file-number": "S7-01-16"}
{"divisions": "Corporation Finance", "last-action": "3/31/17", "rule-title": "Titles I and III of the JOBS Act", "actions": [{"date": "03/31/17", "release": "33-10332", "type": "Final Rule", "title": "Inflation Adjustments and Other Technical Amendments under Titles I and III of the JOBS Act"}, {"date": "06/01/16", "release": "34-77969", "type": "Interim Final Temporary Rule", "title": "Form 10-K Summary"}], "file-number": "S7-09-16"}

While there's a bit of an adjustment period to Scrapy relative to how I used to scrape, this is actually pretty nice. You spend more time thinking about the layout of the page and less time building the infrastructure for requesting and parsing content. I'm guessing it only gets better; find out in my next installment.
