Scraping SEC Rulemaking (Pt. 4)

In previous posts I’ve shown how to scrape a rulemaking index from the SEC, the archive of all rulemaking dockets, and the dockets associated with each rulemaking. The final piece in this series is downloading the rule texts themselves and associating each one with its docket. As we’ll see, this is one of the easier parts, because it is actually a scrape of the Federal Register rather than the SEC, and the Federal Register site has a really solid API.

What we want to do here is leverage the unique identifiers we have from the rule archives. The two principal identifiers are the ‘file numbers’ and the release numbers. They look like so:

  release_no  file_no   release_title
0 BHCA-3      NaN       Proposed Revisions to Prohibitions and Restric…
1 34-71194    S7-15-11  Removal of Certain References to Credit Rating…
2 33-9692     NaN       Adoption of Updated EDGAR Filer Manual
3 34-73639A   S7-01-13  Regulation Systems Compliance and Integrity; C…
4 33-10075A   NaN       Technical Correction: Changes to Exchange Act …

The file number won’t uniquely identify any particular regulatory action, because the proposed and final versions of a regulation share the same file number. So we’ll focus on scraping by release number, though checking that the file numbers line up will be a good sanity check.
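To see why the file number alone is not enough, one can group release numbers by file number. This is a minimal sketch in plain Python; the rows below are made-up placeholders, not real SEC releases, and `releases_by_file` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (release_no, file_no) pairs; None stands in for a missing file number
rows = [
    ("34-00001", "S7-00-01"),  # e.g. a proposed rule
    ("34-00002", "S7-00-01"),  # e.g. the final rule filed under the same number
    ("33-00003", None),
]


def releases_by_file(rows):
    """Map each file number to the list of release numbers filed under it."""
    grouped = defaultdict(list)
    for release_no, file_no in rows:
        if file_no is not None:
            grouped[file_no].append(release_no)
    return dict(grouped)


# File numbers that point at more than one release cannot serve as a unique key
shared = {f: r for f, r in releases_by_file(rows).items() if len(r) > 1}
print(shared)
```

Any file number that shows up in `shared` covers several releases, which is why the release number is the safer search key.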

Let’s try searching for a release using the Federal Register API. The first one returns nothing even in the web interface, so we’ll focus on the second one. If one goes here, one can use the interface to get the desired API call for release no. 34-71194:

http://www.federalregister.gov/api/v1/documents.json?per_page=1000&order=relevance&conditions%5Bterm%5D=34-71194&conditions%5Bagencies%5D%5B%5D=securities-and-exchange-commission
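The same call can be assembled programmatically rather than copied from the web interface. A minimal sketch using only the standard library; `term_search_url` and `API_ROOT` are names I made up for illustration, and the parameters are simply the ones visible in the URL above (served over https here).

```python
from urllib.parse import urlencode

API_ROOT = "https://www.federalregister.gov/api/v1/documents.json"


def term_search_url(term, agency="securities-and-exchange-commission"):
    """Build a full-text search URL for one term, restricted to one agency."""
    params = [
        ("per_page", 1000),
        ("order", "relevance"),
        ("conditions[term]", term),
        ("conditions[agencies][]", agency),
    ]
    # urlencode percent-encodes the square brackets as %5B and %5D
    return API_ROOT + "?" + urlencode(params)


print(term_search_url("34-71194"))
```

Passing the parameters as a list of pairs keeps them in a fixed order and allows repeated keys such as `conditions[agencies][]`.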

And the response would be like so:

```json
{
    "count": 1,
    "description": "Documents matching '34-71194' and from Securities and Exchange Commission",
    "total_pages": 1,
    "results": [
        {
            "title": "Removal of Certain References to Credit Ratings Under the Securities Exchange Act of 1934",
            "type": "Rule",
            "abstract": "The Securities and Exchange Commission (the “Commission'') is adopting amendments that remove references to credit ratings in certain rules and one form under the Securities Exchange Act of 1934 (the “Exchange Act'') relating to broker-dealer financial responsibility and confirmations of securities transactions. This action implements a provision of the Dodd-Frank Wall Street Reform and Consumer Protection Act (the “Dodd-Frank Act'').",
            "document_number": "2013-31426",
            "html_url": "https://www.federalregister.gov/documents/2014/01/08/2013-31426/removal-of-certain-references-to-credit-ratings-under-the-securities-exchange-act-of-1934",
            "pdf_url": "https://www.gpo.gov/fdsys/pkg/FR-2014-01-08/pdf/2013-31426.pdf",
            "public_inspection_pdf_url": "https://s3.amazonaws.com/public-inspection.federalregister.gov/2013-31426.pdf?1389102867",
            "publication_date": "2014-01-08",
            "agencies": [
                {
                    "raw_name": "SECURITIES AND EXCHANGE COMMISSION",
                    "name": "Securities and Exchange Commission",
                    "id": 466,
                    "url": "https://www.federalregister.gov/agencies/securities-and-exchange-commission",
                    "json_url": "https://www.federalregister.gov/api/v1/agencies/466.json",
                    "parent_id": null,
                    "slug": "securities-and-exchange-commission"
                }
            ],
            "excerpts": " … 1934, Exchange Act Release No. <span class=\"match\">34</span>-<span class=\"match\">71194</span> (Dec. 27, 2013), at http … "
        }
    ]
}
```

So let's start out with the following spider:

```python
import json
from urllib.parse import parse_qs, urlparse

import pandas as pd
import scrapy


class FederalRegisterSpider(scrapy.Spider):
    name = 'fedregister'

    # Full-text search for one release number, restricted to SEC documents
    api_call = ('https://www.federalregister.gov/api/v1/documents.json?per_page=1000'
                '&conditions[term]={term}'
                '&conditions[agencies][]=securities-and-exchange-commission')

    def start_requests(self):
        # Release numbers come from the rulemaking directory scraped earlier
        self.src = pd.read_csv('sec_rulemaking_directory_scraped.csv')
        queue = self.src.release_no.dropna()
        for term in queue:
            call = self.api_call.format(term=term)
            yield scrapy.Request(url=call, callback=self.parse)

    def parse(self, response):
        reply = json.loads(response.text)
        # Recover the search term from the query string of the request URL
        query = parse_qs(urlparse(response.url).query)
        yield {'term': query['conditions[term]'][0],
               'replies': reply['count']}
```

Sometimes this yields too many replies, because we’re doing a full-text search and a release might be cited in another document. For example, searching ‘34-42266’ returns 3 documents, but only one of them is what we want. To narrow the search, let’s skip the full-text search and instead match only on the docket numbers, using conditions[docket_id] in place of conditions[term]. Doing this, we get only one result:

{
    "count": 1,
    "description": "Documents from Securities and Exchange Commission and filed under agency docket 34-42266",
    "total_pages": 1,
    "results": [
        {
            "abstract": "The Securities and Exchange Commission is adopting new rules and amendments to its current rules to require that companies' independent auditors review the companies' financial information prior to the companies filing their Quarterly Reports on Form 10-Q or Form 10-QSB with the Commission, and to require that companies include in their proxy statements certain disclosures about their audit committees and reports from their audit committees containing certain disclosures. The rules are designed to improve disclosure related to the functioning of corporate audit committees and to enhance the reliability and credibility of financial statements of public companies.",
            "action": "Final rule.",
            "agencies": [
                {
                    "raw_name": "SECURITIES AND EXCHANGE COMMISSION",
                    "name": "Securities and Exchange Commission",
                    "id": 466,
                    "url": "https://www.federalregister.gov/agencies/securities-and-exchange-commission",
                    "json_url": "https://www.federalregister.gov/api/v1/agencies/466.json",
                    "parent_id": null,
                    "slug": "securities-and-exchange-commission"
                }
            ],
            "agency_names": [
                "Securities and Exchange Commission"
            ],
            "body_html_url": "https://www.federalregister.gov/documents/full_text/html/1999/12/30/99-33849.html",
            "cfr_references": [
                {
                    "title": 17,
                    "part": 210,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 228,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 229,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 240,
                    "chapter": null,
                    "citation_url": null
                }
            ],
            "citation": "64 FR 73389",
            "comment_url": null,
            "comments_close_on": null,
            "correction_of": null,
            "corrections": [],
            "dates": null,
            "docket_id": "Release No. 34-42266",
            "docket_ids": [
                "Release No. 34-42266",
                "File No. S7-22-99"
            ],
            "document_number": "99-33849",
            "effective_on": null,
            "end_page": 73403,
            "excerpts": "The Securities and Exchange Commission is adopting new rules and amendments to its current rules to require that companies' independent auditors review the companies' financial information prior to the companies filing their Quarterly Reports on Form...",
            "executive_order_notes": null,
            "executive_order_number": null,
            "full_text_xml_url": null,
            "html_url": "https://www.federalregister.gov/documents/1999/12/30/99-33849/audit-committee-disclosure",
            "images": {},
            "json_url": "https://www.federalregister.gov/api/v1/documents/99-33849.json",
            "mods_url": "https://www.gpo.gov/fdsys/granule/FR-1999-12-30/99-33849/mods.xml",
            "page_length": 15,
            "pdf_url": "https://www.gpo.gov/fdsys/pkg/FR-1999-12-30/pdf/99-33849.pdf",
            "president": {
                "name": "William J. Clinton",
                "identifier": "william-j-clinton"
            },
            "public_inspection_pdf_url": null,
            "publication_date": "1999-12-30",
            "raw_text_url": "https://www.federalregister.gov/documents/full_text/text/1999/12/30/99-33849.txt",
            "regulation_id_number_info": {
                "3235-AH83": {
                    "xml_url": "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=199910&RIN=3235-AH83&operation=OPERATION_EXPORT_XML",
                    "issue": "199910",
                    "title": "Audit Committee Disclosure",
                    "priority_category": "Substantive, Nonsignificant",
                    "html_url": "https://www.federalregister.gov/regulations/3235-AH83/audit-committee-disclosure"
                }
            },
            "regulation_id_numbers": [
                "3235-AH83"
            ],
            "regulations_dot_gov_info": {},
            "regulations_dot_gov_url": null,
            "significant": false,
            "signing_date": null,
            "start_page": 73389,
            "subtype": null,
            "title": "Audit Committee Disclosure",
            "toc_doc": null,
            "toc_subject": null,
            "topics": [],
            "type": "Rule",
            "volume": 64
        }
    ]
}
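The docket_ids list in this response also makes it possible to check that a result really belongs to the release and file number we searched for. A minimal sketch, assuming the ids always end in the bare number as in the sample above (other formats may well exist); `docket_check` is a hypothetical helper, not part of any API.

```python
def docket_check(doc, release_no, file_no=None):
    """Report whether a result's docket_ids mention our release and file numbers."""
    tokens = set()
    for docket_id in doc.get("docket_ids") or []:
        # Split "Release No. 34-42266" into words; also tolerate ; and , separators
        tokens.update(docket_id.replace(";", " ").replace(",", " ").split())
    return {
        "release_ok": release_no in tokens,
        # None means there was no file number to compare against
        "file_ok": None if file_no is None else file_no in tokens,
    }


doc = {"docket_ids": ["Release No. 34-42266", "File No. S7-22-99"]}
print(docket_check(doc, "34-42266", "S7-22-99"))
# prints {'release_ok': True, 'file_ok': True}
```

Comparing whole tokens rather than substrings keeps a release such as 34-4226 from matching 34-42266 by accident.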

This gives us a great deal of interesting metadata on the rule, as well as links to the full text. The key entry here is the document_number, a per-document identifier assigned by the Federal Register. Once you have that, it’s trivial to get back all of this information, such as the OIRA entry, the publication date, and the regulation’s text. Given that storage ain’t free, we won’t do much more than collect that item, the publication date, and a few select pieces of metadata.
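Trimming a result down to the handful of fields worth storing could look like the sketch below. The choice of fields in `KEEP` is my own guess at "a few select pieces", and the URL pattern in `document_json_url` is taken from the json_url value in the response above; both helper names are hypothetical.

```python
# Fields to keep from each result; an illustrative selection, not a fixed list
KEEP = ("document_number", "publication_date", "title", "type", "citation", "docket_ids")


def slim(doc):
    """Keep only the selected fields, using None for any that are absent."""
    return {key: doc.get(key) for key in KEEP}


def document_json_url(document_number):
    """URL of the full record for one document, following the json_url pattern."""
    return "https://www.federalregister.gov/api/v1/documents/{}.json".format(document_number)


# A cut-down copy of the result shown above
result = {
    "document_number": "99-33849",
    "publication_date": "1999-12-30",
    "title": "Audit Committee Disclosure",
    "type": "Rule",
    "citation": "64 FR 73389",
    "page_length": 15,
}
record = slim(result)
print(record)
print(document_json_url(record["document_number"]))
```

Because the document number is enough to rebuild the full record URL, everything dropped here can be fetched again later if needed.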

Before we get ahead of ourselves, let’s see what we might be missing. One thing that we didn’t anticipate is ‘technical corrections’. For example, searching 34-50870 returns three results. Two of them, C4-27934 and C4-28655, are short bits of text explaining a typo. For this reason, we will amend the API call to return a bit more information, notably the correction_of and corrections fields, which will help us figure out which rules were corrected at a later date. Ultimately, the resulting data is already good enough to begin pivoting into a useful form.

import json
from urllib.parse import parse_qs, urlencode, urlparse

import pandas as pd
import scrapy


class FederalRegisterSpider(scrapy.Spider):
    name = 'fedregister'

    def start_requests(self):
        # Release numbers come from the rulemaking directory scraped earlier
        self.src = pd.read_csv('sec_rulemaking_directory_scraped.csv')
        queue = self.src.release_no.dropna()
        self.api_call = 'https://www.federalregister.gov/api/v1/documents.json?'
        self.api_params = [
            # conditions shared by every request
            ('per_page', 1000),
            ('conditions[agencies][]', 'securities-and-exchange-commission')] + \
            [('fields[]', i) for i in
             ['abstract', 'agencies', 'citation', 'comment_url', 'correction_of', 'corrections',
              'docket_ids', 'document_number', 'end_page', 'full_text_xml_url', 'html_url', 'publication_date',
              'regulation_id_number_info', 'regulation_id_numbers',
              'regulations_dot_gov_info',
              'regulations_dot_gov_url',
              'significant',
              'start_page',
              'title',
              'type']]
        for term in queue:
            # match on the agency docket field instead of the full text
            params = urlencode(self.api_params + [('conditions[docket_id]', term)])
            url = self.api_call + params
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        reply = json.loads(response.text)
        # recover the search term from the query string of the request URL
        query = parse_qs(urlparse(response.url).query)
        additions = {'term': query['conditions[docket_id]'][0],
                     'request': response.url}
        # test whether any documents were returned
        if reply['count'] != 0:
            # emit one item per matching document
            for result in reply['results']:
                out = dict(result)
                out.update(additions)
                out.update({'results': reply['count'],
                            'flagged': False})
                yield out
        else:
            # keep a row for terms with no hits so they can be followed up later
            additions.update({'results': reply['count'], 'flagged': False})
            yield additions
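Once the spider has run, the technical corrections can be separated from the rules themselves. The sketch below is a heuristic under two assumptions of mine: that correction_of is non-empty for a correction, and that correction document numbers start with "C", as the two examples above do. The sample items are illustrative, and "ORIGINAL-RULE" is a placeholder rather than a real document number.

```python
def is_correction(item):
    """Heuristic: a non-empty correction_of, or a document number starting with 'C'."""
    if item.get("correction_of"):
        return True
    return str(item.get("document_number") or "").startswith("C")


def split_corrections(items):
    """Return (rules, corrections) from a list of scraped items."""
    rules, corrections = [], []
    for item in items:
        (corrections if is_correction(item) else rules).append(item)
    return rules, corrections


# Illustrative items shaped like the spider's output for one search term
items = [
    {"term": "34-50870", "document_number": "ORIGINAL-RULE", "correction_of": None},
    {"term": "34-50870", "document_number": "C4-27934", "correction_of": None},
    {"term": "34-50870", "document_number": "C4-28655", "correction_of": None},
]
rules, corrections = split_corrections(items)
print(len(rules), len(corrections))
```

Items emitted for terms with no hits carry no document_number, so they fall through to the rules list and can be filtered separately on their results count.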
