Scraping SEC Rulemaking (Pt. 2)

In a prior post I showed how to scrape the rulemaking index from the SEC. Now we want to show how to get the comments and related material as well, again using Scrapy.

Having done a little more research on how to develop in this environment, it seems you don't develop in Scrapy with the interactive style you'd use for R or IPython. It's more like writing C: you write the whole script and then check whether it works. There is, however, an interactive environment worth exploring (see the documentation for the Scrapy shell and "inspect_response").
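For instance, a throwaway spider like the one below drops you into an interactive shell with the response in scope, which makes it much easier to experiment with selectors. This is a minimal sketch; the class and spider name are just placeholders.

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug'  # placeholder name
    start_urls = ['https://www.sec.gov/rules/final.shtml']

    def parse(self, response):
        # pauses the crawl and opens an interactive shell with `response` in scope
        inspect_response(response, self)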

Another point: rather than constructing one "master parser" that combs through the site in a single pass, Scrapy seems to work better with multiple spiders, each collecting data of a uniform shape. A simple rule of thumb: each table of output we want gets its own spider. While this results in some redundant requests, the simplicity it buys during development probably makes up for the lost efficiency.

So what's our goal now? One thought was to follow the links from the index scraped in the previous post. The problem is that the index doesn't consistently link to the pages where the dockets live. (D'oh!) While that somewhat defeats the point of an index, the rulemaking index is still useful insofar as it organizes rules into related matters. So instead we'll find all the dockets we can and match the two datasets up later using the release number.
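The matching step itself should be routine once both datasets exist; here's a rough sketch of the idea in pandas (the DataFrames below are toy stand-ins, not the real scraped tables):

import pandas as pd

# toy stand-ins: a row from the part-1 index and a docket scraped in this post
index_df = pd.DataFrame({'release_no': ['33-8176'],
                         'title': ['Conditions for Use of Non-GAAP Financial Measures']})
dockets_df = pd.DataFrame({'release_no': ['33-8176'],
                           'docket_link': ['/rules/proposed/s74302.shtml']})

# the release number is the join key between the two datasets
merged = index_df.merge(dockets_df, on='release_no', how='left')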

So to get started, let’s see how we can use the CrawlSpider class to find the set of pages we want to scrape:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#%%
class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]

    def parse_item(self,response):
        yield {'url' : response.url}

What's different about the crawl spider? Note that this spider takes some "start_urls" plus a set of rules that define which links to follow and which pages to parse. The parse_item callback is applied to every page reached through the links the rules extract, and it gives us results like

{"url": "https://www.sec.gov/rules/final/finalarchive/finalarchive2013.shtml"}
{"url": "https://www.sec.gov/rules/final.shtml"}

This is great: in just a few lines we have a parser that loads exactly the pages we want. Now let's fill in parse_item to get a better idea of what's on each page. Let's follow an example: if you check out https://www.sec.gov/rules/final/finalarchive/finalarchive2014.shtml, you will see lots of table rows like this one

<tr><td valign="top">
<a name="ia-3984"></a><a href="/rules/final/2014/ia-3984.pdf">IA-3984</a></td>
<td valign="top" nowrap="">Dec. 17, 2014</td>
<td valign="top"><b class="blue">Temporary Rule Regarding Principal Trades With Certain Advisory Clients</b><br>
<b><i>File No.:</i></b> S7-23-07<br>
<b><i>Effective Date:</i></b> The amendments in this document are effective December 30, 2014 and the expiration date for 17 CFR 275.206(3)-3T is extended to December 31, 2016.<br>
<strong><em>See also:</em></strong> Interim Final Temporary Rule Release No. 
<a href="/rules/final/2007/ia-2653.pdf">IA-2653</a>; Proposed Rule Release Nos. <a href="/rules/proposed/2014/ia-3893.pdf">IA-3893</a>, <a href="/rules/proposed/2012/ia-3483.pdf">IA-3483</a>, and <a href="/rules/proposed/2010/ia-3118.pdf">IA-3118</a>; Final Rule Release Nos.  <a href="/rules/final/2012/ia-3522.pdf">IA-3522</a>, 
<a href="/rules/final/2010/ia-3128.pdf">IA-3128</a>, 
<a href="/rules/final/2009/ia-2965.pdf">IA-2965</a>,
 <a href="/rules/final/2009/ia-2965a.pdf">IA-2965a</a>; 
<a href="/info/smallbus/secg/206-3-3-t-secg.htm">Small Entity Compliance Guide</a><br>
<i>Federal Register</i> (79 FR 76880): <a href="https://www.federalregister.gov/articles/2014/12/23/2014-29975/temporary-rule-regarding-principal-trades-with-certain-advisory-clients">HTML</a> | <a href="http://www.gpo.gov/fdsys/pkg/FR-2014-12-23/pdf/2014-29975.pdf">PDF</a> | <a href="http://www.gpo.gov/fdsys/pkg/FR-2014-12-23/html/2014-29975.htm">text</a> | <a href="https://www.federalregister.gov/articles/xml/201/429/975.xml">XML</a>
</td></tr>

So each row in this table has lots of links and metadata we can collect. Because different years of the archive use different templates, we have to be careful about how we identify the right rows on the page. The following XPath seems to work well across them:

//tr[count(td)=3 and not(descendant::table)]

For now let’s not deal with parsing the data, but rather collecting it.
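Before wiring that into the spider, you can sanity-check it against a single page in the Scrapy shell (a quick sketch; the exact count will vary by page):

scrapy shell 'https://www.sec.gov/rules/final/finalarchive/finalarchive2014.shtml'
>>> rows = response.xpath("//tr[count(td)=3 and not(descendant::table)]")
>>> len(rows)   # one selector per listing row, plus any header rows

Wired into the spider, it looks like this: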

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#%%

class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]

    def parse_item(self,response):
        listings = response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
        yield {'url' : response.url,
            'listings' : [l.extract() for l in listings]}

And this seems to work like a charm!

[{"url": "https://www.sec.gov/rules/concept/conceptarchive/conceptarch2004.shtml", "listings": ["<tr valign=\"top\">\r\n<td nowrap><b><i>Release No.</i></b></td><td valign=\"top\"><b><i>Date</i></b></td><td valign=\"top\"><b><i>Details</i></b></td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/34-50700.htm\">34-50700</a></td>\r\n<td nowrap>  Nov. 18, 2004</td>\r\n<td><b class=\"blue\">Concept Release Concerning Self-Regulation</b>\r\n<br><b><i>Comments due:</i></b> Mar. 8, 2005\r\n<br><b><i>File No.: </i></b>\u00a0 S7-40-04\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s74004.shtml\">are available</a>\r\nfor this notice</i><br> \r\n\r\n<a href=\"/rules/concept/34-50700.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/33-8497.htm\">33-8497</a></td>\r\n<td nowrap>  Sep. 27, 2004</td>\r\n<td><b class=\"blue\">Enhancing Commission Filings Through the Use of Tagged Data</b>\r\n<br><b><i>Comments due:</i></b> Comments should be received on or before November 15, 2004.\r\n<br><b><i>File No.: </i></b>\u00a0 S7-36-04\r\n<br><b><i>Other Release Nos.: </i></b>\u00a0 34-50454; 35-27895; 39-2429; IC-26623\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s73604.shtml\">are available</a>\r\nfor this notice</i><br>\r\n\r\n<a href=\"/rules/concept/33-8497.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/33-8398.htm\">33-8398</a></td>\r\n<td nowrap>  Mar. 11, 2004</td>\r\n<td><b class=\"blue\">Securities Transactions Settlement</b>\r\n<br><b><i>Comments due:</i></b> June 16, 2004\r\n<br><b><i>File No.: </i></b>\u00a0 S7-13-04<br>\r\n<i>Comments received  \r\n<a href=\"/rules/concept/s71304.shtml\">are available</a>\r\nfor this notice</i><br>\r\n\r\n<a href=\"/rules/concept/33-8393.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/34-49175.htm\">34-49175</a></td>\r\n<td nowrap>Feb. 3, 2004</td>\r\n<td><b class=\"blue\">Competitive Developments in the Options Markets</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 April 9, 2004\r\n<br><b><i>File No.:</i></b>\u00a0 S7-07-04\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s70704.shtml\">are available</a>\r\nfor this notice</i> <br>\r\n\r\n<a href=\"/rules/concept/34-49175.pdf\">Federal Register PDF</a>\r\n\r\n\r\n</td></tr>"]}
{"url": "https://www.sec.gov/rules/concept/conceptarchive/conceptarch2007.shtml", "listings": ["<tr valign=\"top\">\r\n<td nowrap><b><i>Release No.</i></b></td><td valign=\"top\"><b><i>Date</i></b></td><td valign=\"top\"><b><i>Details</i></b></td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8870.pdf\">33-8870</a></td>\r\n<td nowrap>Dec. 12, 2007</td>\r\n<td><b class=\"blue\">Concept Release on Possible Revisions to the Disclosure Requirements Relating to Oil and Gas Reserves</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 February 19, 2008\r\n<br><b><i>File No.:</i></b>\u00a0 S7-29-07\r\n<br><b><i>Other Release No.:</i></b>\u00a0 34-56945\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-29-07/s72907.shtml\">are available</a>\r\nfor this notice</i>\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8870fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br>\r\n\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8860.pdf\">33-8860</a></td>\r\n<td nowrap>Nov. 16, 2007</td>\r\n<td><b class=\"blue\">Mechanisms to Access Disclosures Relating to Business Activities in or With Countries Designated as State Sponsors of Terrorism</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 January 22, 2008\r\n<br><b><i>File No.:</i></b>\u00a0 S7-27-07\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-27-07/s72707.shtml\">are available</a>\r\nfor this notice</i>\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8860fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br>\r\n\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8831a.pdf\">33-8831A</a></td>\r\n<td nowrap>Sep. 13, 2007</td>\r\n<td><b class=\"blue\">Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards</b> (Correcting Amendment)\r\n<br><b><i>File No.:</i></b>\u00a0 S7-20-07\r\n<br><b><i>Other Release Nos.:</i></b>\u00a0 34-56217A; IC-27924A\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8831afr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br><b><i>See also:</i></b>\u00a0 \r\n<a href=\"/rules/concept/2007/33-8831.pdf\">Release No. 33-8831</a>\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8831.pdf\">33-8831</a></td>\r\n<td nowrap>Aug. 7, 2007</td>\r\n<td><b class=\"blue\">Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards</b> (Corrected)\r\n<br><b><i>Comments due:</i></b>\u00a0 November 13, 2007\r\n<br><b><i>File No.:</i></b>\u00a0 S7-20-07\r\n<br><b><i>Other Release Nos.:</i></b>\u00a0 34-56217; IC-27924\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8831fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-20-07/s72007.shtml\">are available</a>\r\nfor this notice</i>\r\n<br><b><i>See also:</i></b>\u00a0 \r\n<a href=\"/rules/concept/2007/33-8831a.pdf\">Release No. 33-8831A</a></td></tr>"]

Now we basically have the data we want, but the form still isn't great. One possibility is to save the output like this and hand the unparsed rows off to another program, and on some forums people suggest doing exactly that. According to the Scrapy documentation, however, what you should do is define subclasses of Item and use ItemLoaders. What's the difference? As the documentation puts it, "Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container." So first let's think about the data we want to end up with. Rendering the HTML above, we expect entries like

Release No.   Date            Details

33-8870       Dec. 12, 2007   Concept Release on Possible Revisions to the Disclosure Requirements Relating to Oil and Gas Reserves
                              Comments due: February 19, 2008
                              File No.: S7-29-07
                              Other Release No.: 34-56945
                              Comments received are available for this notice
                              Federal Register version

33-8831       Aug. 7, 2007    Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards (Corrected)
                              Comments due: November 13, 2007
                              File No.: S7-20-07
                              Other Release Nos.: 34-56217; IC-27924
                              Federal Register version
                              Comments received are available for this notice
                              See also: Release No. 33-8831A

So what fields do we want to get out of this? Here's a list drawn from these examples and from poking around the other pages.

  • Release Number (good for matching to the data in part 1)
  • The address the release number references
  • Release Date
  • Title
  • Comment Due Date
  • File No. (formally this is the docket identifier)
  • Related releases
  • Comment Location
  • Federal Register Notice
  • Effective Date

The code for the container doesn't look much different from this list: see the class definition for "Entry" below. The harder part is defining the ItemLoader. Here's some starter code.

def check_listing(listing):
    # returns True for data rows and False for header rows
    # (the "Release No. | Date | Details" row), which we drop via filter() below
    toplevel_text = map(unicode.lower,
        listing.xpath("descendant::*//text()").extract())
    return 'release no' not in toplevel_text[0] or \
        'date' not in toplevel_text[1] or \
        'details' not in toplevel_text[2]

class Entry(scrapy.Item):
     release_no = scrapy.Field()
     release_link = scrapy.Field()
     release_date = scrapy.Field()
     release_title = scrapy.Field()
     comment_deadline = scrapy.Field()
     file_no = scrapy.Field()
     related_releases = scrapy.Field()
     docket_link = scrapy.Field()
     federal_register_notice = scrapy.Field()
     effective_date = scrapy.Field()
     listed_on = scrapy.Field()

class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]


    def parse_item(self,response):
        from scrapy.loader import ItemLoader
        #filter out the header rows
        listings = filter(check_listing,response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
            )
        for listing in listings:
            loader = ItemLoader(item=Entry(),selector=listing)
            loader.add_value('listed_on',response.url)
            loader.add_xpath('release_no','td[1]/a/@name')
            loader.add_xpath('release_link','td[1]/a/@href')
            loader.add_xpath('release_date',"td[2]/text()")
            #details loader, scoped to the third <td>
            dl = loader.nested_xpath('td[3]')
            dl.add_xpath('release_title','b[1]/text()')
            yield loader.load_item()        

Note that what we've changed in parse_item is the code that handles what I'm calling "listings" (the table rows): each one now goes through an ItemLoader. This produces some nice results!

{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 25, 2011"], "release_no": ["33-9177"], "release_link": ["/rules/proposed/2011/33-9177.pdf"]}
{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 14, 2011"], "release_no": ["34-63727"], "release_link": ["/rules/proposed/2011/34-63727.pdf"]}
{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 6, 2011"], "release_no": ["34-63652"], "release_link": ["/rules/proposed/2011/34-63652.pdf"]}

I'll post the finished code at the end of the article; for the moment, let's assume that we're able to produce output like this:

{"release_link": ["/rules/final/33-8176.htm"], "see_also": [" ", "<a href=\"/rules/final/33-8216.htm\">Final Rule Rel. No. 33-8216</a>", ";\r\n", "<br>", "<a href=\"/rules/proposed/33-8145.htm\">Proposed Rule Rel. No. 33-8145</a>", " and ", "<a href=\"/rules/proposed/s74302.shtml\">comments</a>", "\r\n"], "release_date": ["Jan. 22, 2003"], "release_title": ["Conditions for Use of Non-GAAP Financial Measures "], "listed_on": ["https://www.sec.gov/rules/final/finalarchive/finalarchive2003.shtml"], "docket_link": ["/rules/proposed/s74302.shtml"], "file_no": [" S7-43-02"], "effective_date": [" March 28, 2003"], "release_no": ["33-8176"]},
{"release_link": ["/rules/final/ic-25888.htm"], "see_also": [" ", "<a href=\"/rules/proposed/ic-25557.htm\">Proposed Rule Rel. No. IC-25557</a>", " and ", "<a href=\"/rules/proposed/s71302.shtml\">comments</a>", "\r\n"], "release_date": ["Jan. 14, 2003"], "release_title": ["Transactions of Investment Companies with Portfolio and Subadviser Affiliates"], "listed_on": ["https://www.sec.gov/rules/final/finalarchive/finalarchive2003.shtml"], "docket_link": ["/rules/proposed/s71302.shtml"], "file_no": [" S7-13-02"], "effective_date": [" February 24, 2003"], "release_no": ["IC-25888"]}

Then we need some way of validating that we are not missing things. To do that, we write some simple code.

import pandas as pd
import json
import numpy as np
#%%

with open("Scrapy/SEC/first_run.json") as fp:
    j = json.load(fp)

# records with no docket link at all; spot-check a random sample by hand
absent_dockets = [r for r in j if 'docket_link' not in r]
np.random.seed(1)
checks = np.random.choice(absent_dockets,10)

You can go through "checks" by hand. The most important thing we want is the association between releases and dockets, since that's what will link to the richer data we scraped from the index, and that association is basically fine. So this is the validation I've done for now; as we do more work with the data, we will note when things are missing and try to do better. Missing data is inevitable on sites that are not consistently maintained, such as the SEC's.
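Since pandas is already imported, another quick check is to load the items into a DataFrame and look at the share of records that have each field (a rough sketch, continuing from the snippet above):

df = pd.DataFrame(j)
df.notnull().mean()   # fraction of records with each field populated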

While it makes sense to clean up the input a bit in the Scrapy app, for such an inconsistently organized set of data it's always going to be hard to get everything pivoted into a nice tabular form. I wrote a separate script for that which I won't show here; it also flags edge cases that would benefit from an undergraduate RA's eye, to make sure nothing weird is going on with the scraper.
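Just to give a flavor of that clean-up step (a sketch of the general idea, not the actual script): most fields come out of the loader as one-element lists, so a first pass is to collapse those into scalars and leave genuinely multi-valued fields such as see_also alone.

def flatten(record):
    # collapse one-element lists to scalars; keep multi-valued fields as lists
    return {k: (v[0] if isinstance(v, list) and len(v) == 1 else v)
            for k, v in record.items()}

flat = pd.DataFrame([flatten(r) for r in j])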

Here’s the final parser.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.shell import inspect_response
from natty import DateParser
import logging
#%%
#%%helpers

def check_listing(listing):
    toplevel_text = map(unicode.lower,
        listing.xpath("descendant::*//text()").extract())
    return 'release no' not in toplevel_text[0] or \
        'date' not in toplevel_text[1] or \
        'details' not in toplevel_text[2]

def parse_date(x):
    try:
        r=DateParser(x)
        return map(lambda x: x.date().isoformat(),r.result())
    except:
        logging.warning("FAILED ON:" + x)
        return x

class Entry(scrapy.Item):
    from scrapy.loader.processors import TakeFirst, MapCompose, Join,Compose
    release_no = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=TakeFirst()
    ) #
    release_link = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join()
    ) #
    release_date = scrapy.Field(
        input_processor=MapCompose(parse_date),
        output_processor=Join()) 
    release_title = scrapy.Field(output_processor=Join()) 
    comment_deadline = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join()
    ) #
    file_no = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join()
    ) #
    see_also = scrapy.Field() #this one is hard to make sense of...
    related_releases = scrapy.Field() #also hard 
    docket_link = scrapy.Field() #
    federal_register_notice = scrapy.Field() 
    effective_date = scrapy.Field(input_processor=MapCompose(parse_date)) #
    listed_on = scrapy.Field() #



class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]



    def parse_item(self,response):
        from scrapy.loader import ItemLoader
        #filter out the header rows
        listings = filter(check_listing,response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
            )
        for listing in listings:
            #inspect_response(response,self)
            loader = ItemLoader(item=Entry(),selector=listing)
            loader.add_value('listed_on',response.url)
            loader.add_xpath('release_no','td[1]/*/text()')
            loader.add_xpath('release_link','td[1]/a/@href')
            loader.add_xpath('release_date',"td[2]/text()")
            #details = listing.xpath('td[3]')[0]
            #print details.extract()
            dl = loader.nested_xpath('td[3]')
            dl.add_xpath('release_title','b[1]/text()')
            dl.add_xpath('effective_date',
                """b[contains(i/text(),'Effective Date')]/
                        following-sibling::text()[1]""")
            dl.add_xpath('file_no',
            """b[contains(i/text(),'File No')]/
                    following-sibling::text()[1]""")
            dl.add_xpath("see_also",
                "b[contains(i/text(),'See also')]/following-sibling::node()")      
            dl.add_xpath('docket_link',
                """a[contains(translate(text(),
                    'ABCDEFGHIJKABCDEFGHIJKLMNOPQRSTUVWXYZ',
                    'abcdefghijklmnopqrstuvwxyz'),
                    'comments')]/@href
                """)
            dl.add_xpath('docket_link',
                "descendant::*[contains(text(),'are available')]/@href")
            out = loader.load_item()
            print out
            yield out
