
Scraping SEC Rulemaking (Pt. 5, Press Releases)

One of the big difficulties in conducting research on rulemaking is obtaining information about outcomes. Various collaborators of mine have introduced ideas about how to do that to advance the scholarly research agenda. One idea that I used extensively in my dissertation was to use equity markets to obtain concise, single-dimensional summaries of regulatory impacts. To do any kind of event study, you need to know when a regulation was announced to a high degree of certainty. Typically this has been done using Bloomberg (good) or the publication date in the Federal Register (really bad! In one case I saw, this was two months off from the actual announcement date!). For my papers, I've used press releases that I date via RSS feeds. The main hitch with this method is that RSS feeds are ephemeral, so you need to hope that a service like Feedly or Google Reader was tracking the feed. If that doesn't work, your last best hope is that the Internet Archive was tracking it, though you'll generally fail to get a comprehensive record if that's your only recourse.

So to start getting the press release times, we go to the RSS feed.

Feedly Link

Feedly has 945 results associated with that location, running from February 11, 2015 to today. The Wayback Machine does not have any archives before December 23, 2014, so the SEC likely had a different RSS feed before that, which hopefully was still functional for some time afterward AND stored on Feedly.

The link here:

http://www.sec.gov/rss/news/press.xml

When parsed via Feedly, this yields 1000 results, indicating that our results are being clipped. They run from today back to 2014-12-04. Using the continuation key to get the next batch,

http://feedly.com/v3/streams/contents?streamId=feed/http://www.sec.gov/rss/news/press.xml&count=1000&continuation=14a166d4eed:128a3c7:b669a9fb

We get 602 more results, indicating that's all there is. These go back to 2012-11-08. That's pretty good! Google Reader went offline July 1, 2013, so in theory we might get more by wading through its archives. But before we do that, let's make sure there wasn't another RSS feed in use before that time. Indeed, https://web.archive.org/web/20120626021816/http://www.sec.gov:80/rss/news/press.xml goes back to 2006. So probably our best hope at this point is the Google Reader archive. To get that, you download the index first, which is a huge 12 GB compressed file, and then grep it for the feed you want.
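As an aside, the continuation paging above is easy to script outside of scrapy. Here's a rough sketch of the paging loop using requests (assuming the stream endpoint is publicly readable without an API key, which it was for this feed; the output file name is just a placeholder):

import requests
from jsonlines import Writer
import codecs

BASE = 'http://feedly.com/v3/streams/contents'
params = {'streamId': 'feed/http://www.sec.gov/rss/news/press.xml', 'count': 1000}
items = []
while True:
    reply = requests.get(BASE, params=params).json()
    items.extend(reply.get('items', []))
    #Feedly hands back a 'continuation' token as long as there are more pages
    if 'continuation' not in reply:
        break
    params['continuation'] = reply['continuation']

with codecs.open('feedly_archived.jsonl', 'w+', encoding='utf-8') as fp:
    Writer(fp).write_all(items)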

It turns out that we're in luck and there is some archived RSS data from Google Reader. Searching through this file is pretty easy: just use grep -e 'www.sec.gov' BIGFILENAME.cdx > search_results.fw; the result will be a space-delimited file.

And once that big search is done, you can view the results like so

import pandas as pd

d = pd.read_csv('search_results.fw',delimiter=' ',header=None)
#N b a m s k r M S V g
d.columns = ['massaged_url',
    'date',
    'original_url',
    'mime_original',
    'response_code',
    'new_style_checksum',
    'redirect',
    'meta_tags',
    'S',
    'compressed_arc_offset',
    'file_name']

relevant_feeds = d[d.massaged_url.str.contains('www.sec.gov/rss/news/press.xml')]

It turns out that the backup is in a file called archiveteam_greader_20130619095946/greader_20130619095946.megawarc.warc.gz, which is 25 GB and also easily downloaded.

Now these WARC files are confusingly documented, and I don't really understand how to use the tools the Internet Archive provides to access them. But it turns out that just reading the file byte by byte, using the offsets given in the index, works well enough. Below is some code that extracts all the SEC press releases.

import pandas as pd
import StringIO
import gzip
import json
import warc
from jsonlines import Writer
import codecs
d = pd.read_csv('GoogleReader/archiveteam-googlereader201306-indexes.cdx/search_results.fw',delimiter=' ',header=None)
#N b a m s k r M S V g
d.columns = ['massaged_url',
    'date',
    'original_url',
    'mime_original',
    'response_code',
    'new_style_checksum',
    'redirect',
    'meta_tags',
    'S',
    'compressed_arc_offset',
    'file_name']

relevant_feeds = d[d.massaged_url.str.contains('www.sec.gov/rss/news/press.xml')]

file_name = 'GoogleReader/archiveteam_greader_20130619095946/greader_20130619095946.megawarc.warc.gz'

all_items = []
for idx,row in relevant_feeds.iterrows():
    with open(file_name, 'rb') as fp:
        #move ahead in the file by the offset specified in the index
        fp.seek(row['compressed_arc_offset'])
        #read the compressed record; its length is the 'S' column of the index
        raw = fp.read(row['S'])
    text = gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()
    w = warc.WARCFile(fileobj=StringIO.StringIO(text))
    record = w.read_record()
    payload = record.payload.read()
    j=json.loads(payload.split('\r\n\r\n')[1])
    all_items = all_items + j['items']

with codecs.open('google_reader_archived.jsonl',"w+",encoding='utf-8') as fp:
    writer = Writer(fp)
    writer.write_all(all_items)    

And the files that result should look quite similar to what we obtain from Feedly.
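Since both sources end up as lists of Feedly-style items, a rough sketch of combining them might look like this. The Google Reader file name is the one written above; the Feedly file name is the placeholder from the earlier sketch, and it's worth verifying that both item formats really do carry an 'alternate' link before trusting the dedup.

import jsonlines

def load_items(path):
    #read a .jsonl file of feed items back into a list of dicts
    with jsonlines.open(path) as reader:
        return list(reader)

reader_items = load_items('google_reader_archived.jsonl')
feedly_items = load_items('feedly_archived.jsonl')

def item_link(item):
    #use the article's alternate link as the dedup key, falling back to originId
    alternates = item.get('alternate', [])
    return alternates[0]['href'] if alternates else item.get('originId')

merged = {}
for item in reader_items + feedly_items:
    merged.setdefault(item_link(item), item)
print len(merged)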

Now that we can assign very precise dates and times to the appearance of a rule, we only need to acquire the remaining press releases. The following code does that job.

import scrapy
from urllib import urlencode
from scrapy.shell import inspect_response
import json

class PressReleaseSpider(scrapy.Spider):
    name = 'press_release_spider'

    def make_feedly_url(self):
        return self.feedly + self.rss + urlencode(self.params)

    def start_requests(self):
        self.feedly = 'http://feedly.com/v3/streams/contents?streamId=feed/' 
        self.rss = 'http://www.sec.gov/rss/news/press.xml&'
        self.params = {'count' : 1000}
        url = self.make_feedly_url()
        self.log(url)
        yield scrapy.Request(url=url,callback=self.parse)


    def parse(self,response):
        #inspect_response(self,response)
        if 'feedly.com' in response.url:
            j = json.loads(response.text)
            if len(j['items']) == 1000:
                self.params['continuation'] = j['continuation']
                yield scrapy.Request(url=self.make_feedly_url(),callback=self.parse)
            for item in j['items']:
                item['kind'] = 'rss'
                yield item
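If this spider lives in a standalone file, something like scrapy runspider press_release_spider.py -o press_releases.jl should run it and dump the items to a JSON lines file; the file names here are just illustrative.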

Scraping SEC Rulemaking (Pt. 4)

In previous posts I’ve shown how to scrape a rulemaking index from the SEC, the archive of all rulemaking dockets, and the dockets associated with each rulemaking. The final piece in this series will be downloading the rule texts themselves and associating them with their docket. As we’ll see, this part is one of the easier ones as this is actually a scrape of the Federal Register, not the SEC, and the Federal Register page has a really solid API.

Now what we want to do in this instance is to leverage the unique identifiers we have from the rule archives. The two principal unique identifiers are the ‘file numbers’ and the release numbers. They look like so

release_no file_no release_title
0 BHCA-3 NaN Proposed Revisions to Prohibitions and Restric…
1 34-71194 S7-15-11 Removal of Certain References to Credit Rating…
2 33-9692 NaN Adoption of Updated EDGAR Filer Manual
3 34-73639A S7-01-13 Regulation Systems Compliance and Integrity; C…
4 33-10075A NaN Technical Correction: Changes to Exchange Act …

The file number won't uniquely identify any particular regulatory action, because the proposed and final regulation will have the same file number. So we'll focus on scraping by release number, though it will be a good check to make sure they line up.
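As a quick sanity check on that claim, one can count duplicates in the directory scraped in Part 2 (a rough sketch, assuming the CSV and column names shown above):

import pandas as pd

src = pd.read_csv('sec_rulemaking_directory_scraped.csv')
#release numbers should be (nearly) unique across the directory
print src.release_no.dropna().duplicated().sum()
#file numbers repeat, since proposed and final releases share a docket
print src.file_no.dropna().duplicated().sum()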

Let's try searching for a release using the Federal Register API. The first release in the table doesn't return a hit even through the web interface, so we'll focus on the second one. If one goes here, one can use the interface to build the desired API call for release no. 34-71194.

http://www.federalregister.gov/api/v1/documents.json?per_page=1000&order=relevance&conditions%5Bterm%5D=34-71194&conditions%5Bagencies%5D%5B%5D=securities-and-exchange-commission

And the response would be like so:

{
"count": 1,
"description": "Documents matching '34-71194' and from Securities and Exchange Commission",
"total_pages": 1,
"results": [
{
"title": "Removal of Certain References to Credit Ratings Under the Securities Exchange Act of 1934",
"type": "Rule",
"abstract": "The Securities and Exchange Commission (the “Commission'') is adopting amendments that remove references to credit ratings in certain rules and one form under the Securities Exchange Act of 1934 (the “Exchange Act'') relating to broker-dealer financial responsibility and confirmations of securities transactions. This action implements a provision of the Dodd-Frank Wall Street Reform and Consumer Protection Act (the “Dodd-Frank Act'').",
"document_number": "2013-31426",
"html_url": "https://www.federalregister.gov/documents/2014/01/08/2013-31426/removal-of-certain-references-to-credit-ratings-under-the-securities-exchange-act-of-1934",
"pdf_url": "https://www.gpo.gov/fdsys/pkg/FR-2014-01-08/pdf/2013-31426.pdf",
"public_inspection_pdf_url": "https://s3.amazonaws.com/public-inspection.federalregister.gov/2013-31426.pdf?1389102867",
"publication_date": "2014-01-08",
"agencies": [
{
"raw_name": "SECURITIES AND EXCHANGE COMMISSION",
"name": "Securities and Exchange Commission",
"id": 466,
"url": "https://www.federalregister.gov/agencies/securities-and-exchange-commission",
"json_url": "https://www.federalregister.gov/api/v1/agencies/466.json",
"parent_id": null,
"slug": "securities-and-exchange-commission"
}
],
"excerpts": " … 1934, Exchange Act Release No. <span class="match">34</span>-<span class="match">71194</span> (Dec. 27, 2013), at http … "
}
“`

So let's start out with the following parser

import scrapy

class FederalRegsiterSpider(scrapy.Spider):
    name = 'fedregister'

    def start_requests(self):
        import pandas as pd
        self.src = pd.read_csv('sec_rulemaking_directory_scraped.csv')
        queue = self.src.release_no.dropna()
        self.api_call = 'https://www.federalregister.gov/api/v1/documents.json?per_page=1000' + \
            '&conditions[term]={term}' + \
            '&conditions[agencies][]=securities-and-exchange-commission'
        for term in queue:
            call = self.api_call.format(term=term)
            yield scrapy.Request(url=call,callback=self.parse)

    def parse(self,response):
        import json
        from urlparse import parse_qs
        reply = json.loads(response.text)
        yield {'term' : parse_qs(response.url)['conditions[term]'],
            'replies' : reply['count']}

Sometimes this yields too many replies, for example because we're doing a full-text search and a release might be cited in another document. For example, searching '34-42266' returns 3 documents, but only one of them is what we want. To narrow things down, let's not do a full-text search but instead only look in the docket numbers. Doing this, we get only one result.
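Concretely, the docket-constrained call looks roughly like this (mirroring the conditions[docket_id] parameter used in the amended spider below):

http://www.federalregister.gov/api/v1/documents.json?per_page=1000&conditions%5Bdocket_id%5D=34-42266&conditions%5Bagencies%5D%5B%5D=securities-and-exchange-commission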

{
    "count": 1,
    "description": "Documents from Securities and Exchange Commission and filed under agency docket 34-42266",
    "total_pages": 1,
    "results": [
        {
            "abstract": "The Securities and Exchange Commission is adopting new rules and amendments to its current rules to require that companies' independent auditors review the companies' financial information prior to the companies filing their Quarterly Reports on Form 10-Q or Form 10-QSB with the Commission, and to require that companies include in their proxy statements certain disclosures about their audit committees and reports from their audit committees containing certain disclosures. The rules are designed to improve disclosure related to the functioning of corporate audit committees and to enhance the reliability and credibility of financial statements of public companies.",
            "action": "Final rule.",
            "agencies": [
                {
                    "raw_name": "SECURITIES AND EXCHANGE COMMISSION",
                    "name": "Securities and Exchange Commission",
                    "id": 466,
                    "url": "https://www.federalregister.gov/agencies/securities-and-exchange-commission",
                    "json_url": "https://www.federalregister.gov/api/v1/agencies/466.json",
                    "parent_id": null,
                    "slug": "securities-and-exchange-commission"
                }
            ],
            "agency_names": [
                "Securities and Exchange Commission"
            ],
            "body_html_url": "https://www.federalregister.gov/documents/full_text/html/1999/12/30/99-33849.html",
            "cfr_references": [
                {
                    "title": 17,
                    "part": 210,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 228,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 229,
                    "chapter": null,
                    "citation_url": null
                },
                {
                    "title": 17,
                    "part": 240,
                    "chapter": null,
                    "citation_url": null
                }
            ],
            "citation": "64 FR 73389",
            "comment_url": null,
            "comments_close_on": null,
            "correction_of": null,
            "corrections": [],
            "dates": null,
            "docket_id": "Release No. 34-42266",
            "docket_ids": [
                "Release No. 34-42266",
                "File No. S7-22-99"
            ],
            "document_number": "99-33849",
            "effective_on": null,
            "end_page": 73403,
            "excerpts": "The Securities and Exchange Commission is adopting new rules and amendments to its current rules to require that companies' independent auditors review the companies' financial information prior to the companies filing their Quarterly Reports on Form...",
            "executive_order_notes": null,
            "executive_order_number": null,
            "full_text_xml_url": null,
            "html_url": "https://www.federalregister.gov/documents/1999/12/30/99-33849/audit-committee-disclosure",
            "images": {},
            "json_url": "https://www.federalregister.gov/api/v1/documents/99-33849.json",
            "mods_url": "https://www.gpo.gov/fdsys/granule/FR-1999-12-30/99-33849/mods.xml",
            "page_length": 15,
            "pdf_url": "https://www.gpo.gov/fdsys/pkg/FR-1999-12-30/pdf/99-33849.pdf",
            "president": {
                "name": "William J. Clinton",
                "identifier": "william-j-clinton"
            },
            "public_inspection_pdf_url": null,
            "publication_date": "1999-12-30",
            "raw_text_url": "https://www.federalregister.gov/documents/full_text/text/1999/12/30/99-33849.txt",
            "regulation_id_number_info": {
                "3235-AH83": {
                    "xml_url": "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=199910&RIN=3235-AH83&operation=OPERATION_EXPORT_XML",
                    "issue": "199910",
                    "title": "Audit Committee Disclosure",
                    "priority_category": "Substantive, Nonsignificant",
                    "html_url": "https://www.federalregister.gov/regulations/3235-AH83/audit-committee-disclosure"
                }
            },
            "regulation_id_numbers": [
                "3235-AH83"
            ],
            "regulations_dot_gov_info": {},
            "regulations_dot_gov_url": null,
            "significant": false,
            "signing_date": null,
            "start_page": 73389,
            "subtype": null,
            "title": "Audit Committee Disclosure",
            "toc_doc": null,
            "toc_subject": null,
            "topics": [],
            "type": "Rule",
            "volume": 64
        }
    ]
}

This gives us a great deal of interesting metadata on the rule, as well as the full text. The key entry here is the document_number, which is a per-document identifier assigned by the Federal Register. Once you have that, it's trivial to get back all of this information, for example from OIRA, about the publication date, and the regulation's text. Given that storage ain't free, we won't do much more than collect that identifier, the publication date, and a few select pieces of metadata.
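For instance, here is a minimal sketch (using requests rather than scrapy, just for illustration) of pulling a single document back by its document_number through the per-document endpoint; the json_url field in the response above follows exactly this pattern.

import requests

#document_number taken from the search response above
doc_number = '2013-31426'
url = 'https://www.federalregister.gov/api/v1/documents/{0}.json'.format(doc_number)
doc = requests.get(url).json()
#a couple of the fields we care about; the full response carries much more
print doc['publication_date'], doc['html_url']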

Before we get ahead of ourselves, let's see what we might be missing. One thing we didn't anticipate is 'technical corrections'. For example, searching 34-50870 returns three results. Two of them, C4-27934 and C4-28655, are short bits of text explaining a typo. For this reason, we will amend the API call to return a bit more information that will help us figure out which rules were corrected at a later date. Ultimately, the resulting data is already good enough to begin pivoting into a useful form.

import scrapy
from urllib import urlencode
import pandas as pd
import json
from urlparse import parse_qs
from scrapy.shell import inspect_response


class FederalRegsiterSpider(scrapy.Spider):
    name = 'fedregister'

    def start_requests(self):
        self.src= pd.read_csv('sec_rulemaking_directory_scraped.csv')
        queue = self.src.release_no.dropna()
        self.api_call='https://www.federalregister.gov/api/v1/documents.json?'
        self.api_params = [
            #conditions
            ('per_page' , 1000),
            ('conditions[agencies][]', 'securities-and-exchange-commission')] + \
            [('fields[]',i) for i in 
            ['abstract','agencies','citation','comment_url','correction_of','corrections',
            'docket_ids','document_number','end_page','full_text_xml_url','html_url','publication_date',
            'regulation_id_number_info','regulation_id_numbers',
            'regulations_dot_gov_info',
            'regulations_dot_gov_url',
            'significant',
            'start_page',
            'title',
            'type']]
        for term in queue:
            params = urlencode(self.api_params + [('conditions[docket_id]',term)])
            url = self.api_call + params
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self,response):
        reply = json.loads(response.text)
        additions = {'term' : parse_qs(response.url)['conditions[docket_id]'][0],
                    'request' : response.url}
        #test if the results column is not empty
        if reply['count'] != 0:
            #parse the results
            for result in reply['results']:
                out = dict(result)
                out.update(additions)
                out.update({'results' : reply['count'],
                            'flagged' : False})
                yield out
        else: 
            additions.update({'results' : reply['count'],'flagged' : False})
            yield additions

Scraping SEC Rulemaking (Pt. 3)

In the previous post I showed how I scraped the rulemaking archives of the SEC to find, among other things, all the rulemaking dockets available. The next thing we want to do is to acquire the comments and whatever else might be of interest in the rulemaking dockets. Here’s the code I start with:

import scrapy
import pandas as pd
from urlparse import urljoin

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    def parse(self,response):
        yield {'url' : response.url}

I don't use the CrawlSpider class because I already have the set of starting URLs and don't need all the Rule machinery. Here's an example of what a docket page looks like:

https://www.sec.gov/rules/proposed/s71205.shtml

This is a good first cut. The dockets do get more complicated, with multiple types of data, as we'll see, but for now this will do. The name of the rule is at the top of the page along with some other identifying information. For now we can ignore this, as we already have that kind of information associated with the URL (see the last post). As a general rule, it makes sense to scrape only as much data as you actually want to clean and organize.

Thus let’s think about what we want the items to look like. Basically the rows seem to have three pieces of information.

  • Date
  • Description
  • PDF attachment.

The description suggests richer data, for example the name of the commenter, their title, organization, and location. There are also "memorandums" of meetings mixed in with these documents. This seems to me to be one of those cases where parsing the metadata is best done outside of the scrapy context, in another script. So we'll make a class just to hold these items. Similarly, we will likely want to extract the OCRed text from the PDFs, but that seems like a task best left for another cleanup script.

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()

With a clear idea of what we want, we can now focus on how the page is organized. Chrome's developer tools are good for this, since you can inspect elements and also copy out the xpath.

So the nearest analogue of our items is the "table row", which would normally be indicated by <tr> tags. The problem is that in the 1990s and early 2000s developers often liked to control the widths of pages by making the whole page a big table, so selecting by <tr> yields a mess. In the rulemaking scraper we were able to get around this by noting that a table row needed to have 3 cells for us to be interested in it. Here, no such luck, because we want two-column rows and the whole page is set up as a two-column table!

So the thing to note is that the table we want is a sibling that follows the header. A quick xpath command in the console, $x("//h1"), shows that there is a unique h1 header on this page. So we can try $x("//h1/following-sibling::*") to see what siblings there are. It turns out two tables result: one is the footer ("contact | employment | links … ") and the other is the table we want. And the footer is also a two-column table (d'oh!). Fortunately one of the footer's cells has a class ("footer"), so we can simply kick that table out.

$x("//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]")

This returns 1 table, which is the one we want. It says, "give me all the tables that are siblings below h1, except those that have a descendant cell classed as a footer." We then want all the table rows below this table that have two cells (append '//tr[count(td)=2]'). Doing the obvious stuff after that, we obtain the following script:

import scrapy
import pandas as pd
from urlparse import urljoin
from scrapy.loader import ItemLoader

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()
    listed_on = scrapy.Field()

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            l.add_xpath('attachment_web_location','td/a/@href')
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out

And it produces output like

[ {'attachment_web_location': [u'/rules/proposed/s71205/eurocomp031805.pdf'],
'date': [u'Mar. 18, 2005'],
'description': [u'Alexandre Tessier, Directeur G\xe9n\xe9ral, AFEP, Association Franc\xe7aise des Entreprises Priv\xe9es; Dr. Hellmut Longin, Pr\xe4sident, Aktienforum; \r\nProf. R\xfcdiger von Rosen, Managing Director, Deutsches Aktieninstitut; John Pierce, Chief Executive, The Quoted Companies Alliance; Mrs. Angeliki Petroulaki, General Manager, The Seiset Association; Panayotis G. Dracos, President and CEO, UCL/ASE; Evelyne Deloirie, Secr\xe9taire G\xe9n\xe9ral, MiddleNext; Digby Jones, Director General, CBI; Dr. Ludolf von Wartenbreg, Director General, BDI; Ebba Linds\xf6, Director General, Confederation of Swedish Enterprise; Pieris Theodorou, Chairman, SYDEK'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']},
{'attachment_web_location': [u'/rules/proposed/s71205/cleary020904.pdf'],
'date': [u'Feb. 9, 2004'],
'description': [u'Edward F. Greene, Cleary, Gottlieb, Steen & Hamilton'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']},
{'attachment_web_location': [u'/rules/proposed/s71205/eurocompanies020904.pdf'],
'date': [u'Feb. 9, 2004'],
'description': [u'\r\nAlain Joly, Pr\xe9sident, EALIC; Alexandre Tessier, Directeur G\xe9n\xe9ral, AFEP (Association Fran\xe7aise des Enterprises Priv\xe9es; Prof. R\xfcdiger von Rosen, Managing Director, Deutsches Aktieninstitut; Baron Vandamme, Administrateur, ABSC; Dr. Hellmut Longin, Pr\xe4sident, Aktienforum; Dr. Ludolf von Wartenberg, Director General, BDI; Edouard de Royere, Pr\xe9sident, ANSA; Mrs. Angeliki Petroulaki, General Manager of The Seiset Association; Ebba Lindso, Director General, Confederation of Swedish Enterprise; Tarmo Korpela, Deputy Director General, Confederation of Finnish Industry and Empoyers; Jones Digby, Director General, CBI'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']}]

What about downloading the files? Scrapy apparently has an easy time doing this, using some prewritten item pipelines. An item pipeline gets applied to an item after it has been scraped, and it is the place where you would put validation, HTML cleaning, duplicate checking, and database storage (oh, so that's where it goes!). And it so happens that which pipelines get applied is simply a setting that you amend.

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    #to download files
    custom_settings = {
        'ITEM_PIPELINES' : {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE' : 'Downloads',
        'FILES_RESULT_FIELD' : 'attachment_file_location',
        'FILES_URLS_FIELD' : 'attachment_web_location'

    }

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            web_loc = row.xpath('td/a/@href').extract()
            l.add_value('attachment_web_location',map(response.urljoin,web_loc))
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out

This code looks as though it's running well and doesn't hit many edge cases. Nevertheless, let's start thinking about producing some improved output. Here's what might get fed into the pipeline:

{u'attachment_file_location': [{u'checksum': u'a462a3fcd7c45775b6fcf545f5ee626f',
   u'path': u'full/01c89e34bf2978c41eeabd9fb7c83c2edf86da66.pdf',
   u'url': u'https://www.sec.gov/comments/s7-15-11/s71511-12.pdf'}],
 u'attachment_web_location': [u'https://www.sec.gov/comments/s7-15-11/s71511-12.pdf'],
 u'date': [u'Jan. 24, 2014'],
 u'description': [u'Sean C. Davy, Managing Director, Corporate Credit Markets Division, Securities Industry and Financial Markets Association'],
 u'listed_on': [u'https://www.sec.gov/comments/s7-15-11/s71511.shtml']}

Let's make a pipeline that produces a clean(ish) tabular CSV file. I will have this one write to a file separately; it won't change the items at all. To write this, I first load the JSON produced by a first run into a pandas dataframe, then write the process_item function to make changes that mirror what I do to that dataframe. Here's what I do:

  • Drop items where the description is missing.
  • Focus on the ones where we have more than one entry per line
    — Drop from these, entries that are just whitespace
    — Drop those that begin with 'Letter Type' or 'Attachment'
    — Join the text
  • Based on the text decide whether it’s a meeting, phone call, or what
  • Flatten the file locations by eliminating some unnecessary data.

Now it actually took a fair amount of time to optimize this script, and a lot of examination of the rule-based parsing. The end result seems to produce excellent results, however, so it was well worth the effort. Here's the final version.

import scrapy
import pandas as pd
import numpy as np
from urlparse import urljoin
from scrapy.loader import ItemLoader
import regex
import natty

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()
    listed_on = scrapy.Field()
    kind = scrapy.Field()
    flagged = scrapy.Field()

class TabularOutputPipeline(object):
    """
    Produce a csv file that is a flat table from scraped data.
    Column headers include

    attachment_file_location, attachment_web_location,
    date, description, listed_on, kind, flagged
    """
    def open_spider(self,spider):
        #clobber whatever file is there and write a header
        self.file = open('lobbying_sec_rulemaking.csv','w')
        #the line we wish to write
        self.header= 'attachment_file_location,attachment_web_location,' + \
            'date,description,listed_on,kind,flagged\n'
        #we'll need to add the columnns to the items
        #to ensure that the csv aligns when missing
        self.columns = self.header.strip().split(',')
        self.file.write(self.header)

    def close_spider(self,spider):
        self.file.close()

    def process_item(self,item,spider): 
        #---- Description Section
        if 'description' not in item.keys():
            item['flagged'] = True
            return item #do nothing with bad ones, just note that we will drop them.
        x = item['description']    
        if x != x:
            item['flagged'] = True
            return item
        x = filter(lambda x: len(x)!=0,map(unicode.strip,x))
        x = filter(lambda x: not regex.match('^Letter Type|^Attachment',x,regex.IGNORECASE),x)
        item['description'] = " ".join(x)
        # ----  Parse the description to determine which are meetings.
        if regex.search('Memorandum',item['description'],regex.IGNORECASE):
             if regex.search('call|phone|telephon|teleconference',
                 item['description'],regex.IGNORECASE):
                 item['kind'] = 'Phone call'
             elif regex.search('meeting|discussion|consultation|presentation|meeating',
                 item['description'],
                 regex.IGNORECASE):
                 item['kind'] = 'Meeting'
             elif regex.search('e-mail|email',
                 item['description'],
                 regex.IGNORECASE):
                 item['kind'] = 'Email'
             else:
                 item['kind'] = 'Memo'
        else:
            item['kind'] = 'Comment'
        #----- parse dates
        if 'date' in item.keys():
             raw = item['date'][0]
             if raw=='undated' or len(raw.strip())==0:
                 item['date'] =np.nan
                 item['flagged'] = True
             else:
                 parsed = natty.DateParser(raw).result()[0]
                 item['date'] = parsed.date().isoformat()
        #--- clean information to help track down
        item["listed_on"] = item["listed_on"][0]
        item["attachment_web_location"] = item["attachment_web_location"][0]
        item["attachment_file_location"] = item["attachment_file_location"][0]['path']
        if 'flagged' not in item.keys():
            item['flagged'] = False
        out = pd.DataFrame([item],columns=self.columns)
        out.to_csv(self.file,mode='a',
            header=False,index=False,encoding='utf8')
        return item

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    #to download files
    custom_settings = {
        'ITEM_PIPELINES' : {'scrapy.pipelines.files.FilesPipeline': 1,
                            'SEC.spiders.sec_comments.TabularOutputPipeline': 800},
        'FILES_STORE' : 'Downloads',
        'FILES_RESULT_FIELD' : 'attachment_file_location',
        'FILES_URLS_FIELD' : 'attachment_web_location'

    }

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            web_loc = row.xpath('td/a/@href').extract()
            l.add_value('attachment_web_location',map(response.urljoin,web_loc))
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out

One brief note before proceeding. I've put everything in a single giant script, but this is not the way projects are supposed to be organized in scrapy. When you create a project in scrapy, it makes scripts called 'settings.py', 'pipelines.py', 'middlewares.py', and 'items.py', as well as a folder called spiders. So the three classes my code defines should really live in three separate places, which probably makes the code more maintainable and easier to reuse. It also suggests that much of what I've put in the process_item function of the TabularOutputPipeline really should go in a custom ItemLoader.
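For concreteness, here is roughly the layout that scrapy startproject generates (using the SEC project name implied by the pipeline path in custom_settings above; the mapping comments are mine):

SEC/
    scrapy.cfg
    SEC/
        __init__.py
        items.py          # e.g. DocketItem
        middlewares.py
        pipelines.py      # e.g. TabularOutputPipeline
        settings.py
        spiders/
            __init__.py
            sec_comments.py   # e.g. CommentSpider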

Scraping SEC Rulemaking (Pt. 2)

In a prior post I showed how to scrape the rulemaking index from the SEC. Now we want to show how to get the comments and related material as well, again using Scrapy.

Having done a little more research on how to develop in this environment, it seems that you don't do development in scrapy using the interactive programming style you would use for R or IPython. It's more like writing in C, where you need to write the whole script and then check whether it works… although there is an interactive environment one can use (check the documentation for the shell and "inspect_response"), which might be worth exploring.

Another point is that instead of constructing one "master parser" that combs through the site once, scrapy works better with multiple spiders, each searching for data that is uniform. Simple rule of thumb: each table of output that we want should get its own spider. While this will result in some redundant requests, the greater simplicity in development probably makes up for the inefficiency.

So what's our goal now? One thought was to follow the links from the index introduced before. The problem with that is that the index doesn't consistently link to the pages where the dockets are. (D'oh!) While this somewhat defeats the point of an index, the rulemaking index is still useful insofar as it organizes rules into related matters. So instead we'll find all the dockets we can and match the two datasets up later using the release number.

So to get started, let’s see how we can use the CrawlSpider class to find the set of pages we want to scrape:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#%%
class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]

    def parse_item(self,response):
        yield {'url' : response.url}

What's different about the crawl spider? Note that this spider takes some "start_urls" along with rules that define which pages should be parsed and followed. The parse_item callback is applied to every link the rules find. And it gives us results like:

{"url": "https://www.sec.gov/rules/final/finalarchive/finalarchive2013.shtml"}
{"url": "https://www.sec.gov/rules/final.shtml"}

Which is great: clearly this is a clever parser that, in just a few lines, will load the pages we want. Now we want to flesh out parse_item to give us a better idea of what's on each page. So let's follow an example. If one checks out https://www.sec.gov/rules/final/finalarchive/finalarchive2014.shtml, one will see that there are lots of table rows like so:

<tr><td valign="top">
<a name="ia-3984"></a><a href="/rules/final/2014/ia-3984.pdf">IA-3984</a></td>
<td valign="top" nowrap="">Dec. 17, 2014</td>
<td valign="top"><b class="blue">Temporary Rule Regarding Principal Trades With Certain Advisory Clients</b><br>
<b><i>File No.:</i></b> S7-23-07<br>
<b><i>Effective Date:</i></b> The amendments in this document are effective December 30, 2014 and the expiration date for 17 CFR 275.206(3)-3T is extended to December 31, 2016.<br>
<strong><em>See also:</em></strong> Interim Final Temporary Rule Release No. 
<a href="/rules/final/2007/ia-2653.pdf">IA-2653</a>; Proposed Rule Release Nos. <a href="/rules/proposed/2014/ia-3893.pdf">IA-3893</a>, <a href="/rules/proposed/2012/ia-3483.pdf">IA-3483</a>, and <a href="/rules/proposed/2010/ia-3118.pdf">IA-3118</a>; Final Rule Release Nos.  <a href="/rules/final/2012/ia-3522.pdf">IA-3522</a>, 
<a href="/rules/final/2010/ia-3128.pdf">IA-3128</a>, 
<a href="/rules/final/2009/ia-2965.pdf">IA-2965</a>,
 <a href="/rules/final/2009/ia-2965a.pdf">IA-2965a</a>; 
<a href="/info/smallbus/secg/206-3-3-t-secg.htm">Small Entity Compliance Guide</a><br>
<i>Federal Register</i> (79 FR 76880): <a href="https://www.federalregister.gov/articles/2014/12/23/2014-29975/temporary-rule-regarding-principal-trades-with-certain-advisory-clients">HTML</a> | <a href="http://www.gpo.gov/fdsys/pkg/FR-2014-12-23/pdf/2014-29975.pdf">PDF</a> | <a href="http://www.gpo.gov/fdsys/pkg/FR-2014-12-23/html/2014-29975.htm">text</a> | <a href="https://www.federalregister.gov/articles/xml/201/429/975.xml">XML</a>
</td></tr>

So each row in this table has lots of links and also meta data we can collect. Because different years of the archive use different templates, we have to be careful about how we identify the right data on the page. The following xpath seems to work well

//tr[count(td)=3 and not(descendant::table)]

For now let’s not deal with parsing the data, but rather collecting it.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#%%

class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]

    def parse_item(self,response):
        listings = response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
        yield {'url' : response.url,
            'listings' : [l.extract() for l in listings]}

And this seems to work like a charm!

[{"url": "https://www.sec.gov/rules/concept/conceptarchive/conceptarch2004.shtml", "listings": ["<tr valign=\"top\">\r\n<td nowrap><b><i>Release No.</i></b></td><td valign=\"top\"><b><i>Date</i></b></td><td valign=\"top\"><b><i>Details</i></b></td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/34-50700.htm\">34-50700</a></td>\r\n<td nowrap>  Nov. 18, 2004</td>\r\n<td><b class=\"blue\">Concept Release Concerning Self-Regulation</b>\r\n<br><b><i>Comments due:</i></b> Mar. 8, 2005\r\n<br><b><i>File No.: </i></b>\u00a0 S7-40-04\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s74004.shtml\">are available</a>\r\nfor this notice</i><br> \r\n\r\n<a href=\"/rules/concept/34-50700.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/33-8497.htm\">33-8497</a></td>\r\n<td nowrap>  Sep. 27, 2004</td>\r\n<td><b class=\"blue\">Enhancing Commission Filings Through the Use of Tagged Data</b>\r\n<br><b><i>Comments due:</i></b> Comments should be received on or before November 15, 2004.\r\n<br><b><i>File No.: </i></b>\u00a0 S7-36-04\r\n<br><b><i>Other Release Nos.: </i></b>\u00a0 34-50454; 35-27895; 39-2429; IC-26623\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s73604.shtml\">are available</a>\r\nfor this notice</i><br>\r\n\r\n<a href=\"/rules/concept/33-8497.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/33-8398.htm\">33-8398</a></td>\r\n<td nowrap>  Mar. 11, 2004</td>\r\n<td><b class=\"blue\">Securities Transactions Settlement</b>\r\n<br><b><i>Comments due:</i></b> June 16, 2004\r\n<br><b><i>File No.: </i></b>\u00a0 S7-13-04<br>\r\n<i>Comments received  \r\n<a href=\"/rules/concept/s71304.shtml\">are available</a>\r\nfor this notice</i><br>\r\n\r\n<a href=\"/rules/concept/33-8393.pdf\">Federal Register PDF</a>\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/34-49175.htm\">34-49175</a></td>\r\n<td nowrap>Feb. 3, 2004</td>\r\n<td><b class=\"blue\">Competitive Developments in the Options Markets</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 April 9, 2004\r\n<br><b><i>File No.:</i></b>\u00a0 S7-07-04\r\n<br><i>Comments received  \r\n<a href=\"/rules/concept/s70704.shtml\">are available</a>\r\nfor this notice</i> <br>\r\n\r\n<a href=\"/rules/concept/34-49175.pdf\">Federal Register PDF</a>\r\n\r\n\r\n</td></tr>"]}
{"url": "https://www.sec.gov/rules/concept/conceptarchive/conceptarch2007.shtml", "listings": ["<tr valign=\"top\">\r\n<td nowrap><b><i>Release No.</i></b></td><td valign=\"top\"><b><i>Date</i></b></td><td valign=\"top\"><b><i>Details</i></b></td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8870.pdf\">33-8870</a></td>\r\n<td nowrap>Dec. 12, 2007</td>\r\n<td><b class=\"blue\">Concept Release on Possible Revisions to the Disclosure Requirements Relating to Oil and Gas Reserves</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 February 19, 2008\r\n<br><b><i>File No.:</i></b>\u00a0 S7-29-07\r\n<br><b><i>Other Release No.:</i></b>\u00a0 34-56945\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-29-07/s72907.shtml\">are available</a>\r\nfor this notice</i>\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8870fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br>\r\n\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8860.pdf\">33-8860</a></td>\r\n<td nowrap>Nov. 16, 2007</td>\r\n<td><b class=\"blue\">Mechanisms to Access Disclosures Relating to Business Activities in or With Countries Designated as State Sponsors of Terrorism</b>\r\n<br><b><i>Comments due:</i></b>\u00a0 January 22, 2008\r\n<br><b><i>File No.:</i></b>\u00a0 S7-27-07\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-27-07/s72707.shtml\">are available</a>\r\nfor this notice</i>\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8860fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br>\r\n\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8831a.pdf\">33-8831A</a></td>\r\n<td nowrap>Sep. 13, 2007</td>\r\n<td><b class=\"blue\">Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards</b> (Correcting Amendment)\r\n<br><b><i>File No.:</i></b>\u00a0 S7-20-07\r\n<br><b><i>Other Release Nos.:</i></b>\u00a0 34-56217A; IC-27924A\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8831afr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br><b><i>See also:</i></b>\u00a0 \r\n<a href=\"/rules/concept/2007/33-8831.pdf\">Release No. 33-8831</a>\r\n\r\n</td></tr>", "<tr valign=\"top\">\r\n<td>\r\n<a href=\"/rules/concept/2007/33-8831.pdf\">33-8831</a></td>\r\n<td nowrap>Aug. 7, 2007</td>\r\n<td><b class=\"blue\">Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards</b> (Corrected)\r\n<br><b><i>Comments due:</i></b>\u00a0 November 13, 2007\r\n<br><b><i>File No.:</i></b>\u00a0 S7-20-07\r\n<br><b><i>Other Release Nos.:</i></b>\u00a0 34-56217; IC-27924\r\n<br>\r\n<a href=\"/rules/concept/2007/33-8831fr.pdf\"><i>Federal Register</i> version</a>\r\n\r\n<br><i>Comments received \r\n<a href=\"/comments/s7-20-07/s72007.shtml\">are available</a>\r\nfor this notice</i>\r\n<br><b><i>See also:</i></b>\u00a0 \r\n<a href=\"/rules/concept/2007/33-8831a.pdf\">Release No. 33-8831A</a></td></tr>"]

Now we basically have the data we want; however, the form is still not that great. One possibility is to save the output like this and hand off the unparsed data to another program, and on some forums people suggest you might do just that. According to the Scrapy documentation, however, what you should do is define subclasses of Item and ItemLoader. What's the difference? "Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container," as the documentation puts it. So first let's think about the data we want to end up with. Rendering the HTML above, we expect entries like:

Release No.   Date           Details

33-8870       Dec. 12, 2007  Concept Release on Possible Revisions to the Disclosure Requirements Relating to Oil and Gas Reserves
                             Comments due: February 19, 2008
                             File No.: S7-29-07
                             Other Release No.: 34-56945
                             Comments received are available for this notice
                             Federal Register version

33-8831       Aug. 7, 2007   Concept Release On Allowing U.S. Issuers To Prepare Financial Statements In Accordance With International Financial Reporting Standards (Corrected)
                             Comments due: November 13, 2007
                             File No.: S7-20-07
                             Other Release Nos.: 34-56217; IC-27924
                             Federal Register version
                             Comments received are available for this notice
                             See also: Release No. 33-8831A

So what fields do we want to get out of this? Here’s a list you can see from these examples or by poking around the other pages.

  • Release Number (good for matching to the data in part 1)
  • The address the release number references
  • ReleaseDate
  • Title
  • Comment Due Date
  • File No. (formally this is the docket identifier)
  • Related releases
  • Comment Location
  • Federal Register Notice
  • Effective Date

The code for making a container does not look much different from this list; see the class definition for "Entry" below. The harder part is defining the ItemLoader. Here's some starter code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def check_listing(listing):
    toplevel_text = map(unicode.lower,
        listing.xpath("descendant::*//text()").extract())
    return 'release no' not in toplevel_text[0] or \
        'date' not in toplevel_text[1] or \
        'details' not in toplevel_text[2]

class Entry(scrapy.Item):
     release_no = scrapy.Field()
     release_link = scrapy.Field()
     release_date = scrapy.Field()
     release_title = scrapy.Field()
     comment_deadline = scrapy.Field()
     file_no = scrapy.Field()
     related_releases = scrapy.Field()
     docket_link = scrapy.Field()
     federal_register_notice = scrapy.Field()
     effective_date = scrapy.Field()
     listed_on = scrapy.Field()

class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]


    def parse_item(self,response):
        from scrapy.loader import ItemLoader
        #filter out the header rows
        listings = filter(check_listing,response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
            )
        for listing in listings:
            loader = ItemLoader(item=Entry(),selector=listing)
            loader.add_xpath('release_no','td[1]/a/@name')
            loader.add_xpath('release_link','td[1]/a/@href')
            loader.add_xpath('release_date',"td[2]/text()")
            #details loader
            dl = loader.nested_xpath('/td[3]')
            dl.add_xpath('release_title','b[1]/text()')
            yield loader.load_item()        

Note that what we’ve changed in the parse_item is the code that parses what I’m calling “listings” or table rows. This produces some nice results!

{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 25, 2011"], "release_no": ["33-9177"], "release_link": ["/rules/proposed/2011/33-9177.pdf"]}
{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 14, 2011"], "release_no": ["34-63727"], "release_link": ["/rules/proposed/2011/34-63727.pdf"]}
{"listed_on": ["https://www.sec.gov/rules/proposed/proposedarchive/proposed2011.shtml"], "release_date": ["Jan. 6, 2011"], "release_no": ["34-63652"], "release_link": ["/rules/proposed/2011/34-63652.pdf"]}

Now, I'll post the finished code at the end of the article; for the moment, let's assume that we're able to produce output like the following.

{"release_link": ["/rules/final/33-8176.htm"], "see_also": [" ", "<a href=\"/rules/final/33-8216.htm\">Final Rule Rel. No. 33-8216</a>", ";\r\n", "<br>", "<a href=\"/rules/proposed/33-8145.htm\">Proposed Rule Rel. No. 33-8145</a>", " and ", "<a href=\"/rules/proposed/s74302.shtml\">comments</a>", "\r\n"], "release_date": ["Jan. 22, 2003"], "release_title": ["Conditions for Use of Non-GAAP Financial Measures "], "listed_on": ["https://www.sec.gov/rules/final/finalarchive/finalarchive2003.shtml"], "docket_link": ["/rules/proposed/s74302.shtml"], "file_no": [" S7-43-02"], "effective_date": [" March 28, 2003"], "release_no": ["33-8176"]},
{"release_link": ["/rules/final/ic-25888.htm"], "see_also": [" ", "<a href=\"/rules/proposed/ic-25557.htm\">Proposed Rule Rel. No. IC-25557</a>", " and ", "<a href=\"/rules/proposed/s71302.shtml\">comments</a>", "\r\n"], "release_date": ["Jan. 14, 2003"], "release_title": ["Transactions of Investment Companies with Portfolio and Subadviser Affiliates"], "listed_on": ["https://www.sec.gov/rules/final/finalarchive/finalarchive2003.shtml"], "docket_link": ["/rules/proposed/s71302.shtml"], "file_no": [" S7-13-02"], "effective_date": [" February 24, 2003"], "release_no": ["IC-25888"]}

Then we need some way of validating that we are not missing things. To do that, we write some simple code.

import pandas as pd
import json
import numpy as np
#%%

with open("Scrapy/SEC/first_run.json") as fp:
    j = json.load(fp)

absent_dockets = [r for r in j if not r.has_key('docket_link')]
np.random.seed(1)
checks = np.random.choice(absent_dockets,10)

You can check the "checks" by hand. The most important thing we want is the association between releases and dockets, since that's what links to the richer data we scraped from the index, and this is basically fine. So this is the validation I've done for now. As we do more work with the data, we will note when things are missing and try to do better. Missing data is inevitable on sites that are not consistently maintained, such as the SEC's.

While it makes sense to clean up the input a bit in the scrapy app, for such a weirdly organized set of data it's always gonna be hard to get everything pivoted into a nice tabular form. I wrote a separate script for doing that, which I won't show; it also identifies edge cases that would benefit from an undergraduate RA's eye, to make sure nothing weird is going on with the scraper.

Here’s the final parser.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.shell import inspect_response
from natty import DateParser
import logging
#%%
#%%helpers

def check_listing(listing):
    toplevel_text = map(unicode.lower,
        listing.xpath("descendant::*//text()").extract())
    return 'release no' not in toplevel_text[0] or \
        'date' not in toplevel_text[1] or \
        'details' not in toplevel_text[2]

def parse_date(x):
    try:
        r=DateParser(x)
        return map(lambda x: x.date().isoformat(),r.result())
    except:
        logging.warning("FAILED ON:" + x)
        return x

class Entry(scrapy.Item):
    from scrapy.loader.processors import TakeFirst, MapCompose, Join,Compose
    release_no = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=TakeFirst()
    ) #
    release_link = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join()
    ) #
    release_date = scrapy.Field(
        input_processor=MapCompose(parse_date),
        output_processor=Join()) 
    release_title = scrapy.Field(output_processor=Join()) 
    comment_deadline = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join()
    ) #
    file_no = scrapy.Field(
                input_processor=MapCompose(unicode.strip),
                        output_processor=Join()
                         ) #
    see_also = scrapy.Field() #this one is hard to make sense of...
    related_releases = scrapy.Field() #also hard 
    docket_link = scrapy.Field() #
    federal_register_notice = scrapy.Field() 
    effective_date = scrapy.Field(input_processor=MapCompose(parse_date)) #
    listed_on = scrapy.Field() #



class SecRulemakingSpider(CrawlSpider):
    name = 'rule_archives'
    allowed_domains = ['www.sec.gov']
    start_urls = [
            'https://www.sec.gov/rules/proposed.shtml',
            'https://www.sec.gov/rules/final.shtml',
            'https://www.sec.gov/rules/interim-final-temp.shtml',
            'https://www.sec.gov/rules/concept.shtml'
            ]

    rules = [Rule(LinkExtractor(restrict_xpaths="//p[@id='archive-links']"),
        callback="parse_item",
        follow=True
        )]



    def parse_item(self,response):
        from scrapy.loader import ItemLoader
        #filter out the header rows
        listings = filter(check_listing,response.xpath(
            "//tr[count(td)=3 and not(descendant::table)]")
            )
        for listing in listings:
            #inspect_response(response,self)
            loader = ItemLoader(item=Entry(),selector=listing)
            loader.add_value('listed_on',response.url)
            loader.add_xpath('release_no','td[1]/*/text()')
            loader.add_xpath('release_link','td[1]/a/@href')
            loader.add_xpath('release_date',"td[2]/text()")
            #details = listing.xpath('td[3]')[0]
            #print details.extract()
            dl = loader.nested_xpath('td[3]')
            dl.add_xpath('release_title','b[1]/text()')
            dl.add_xpath('effective_date',
                """b[contains(i/text(),'Effective Date')]/
                        following-sibling::text()[1]""")
            dl.add_xpath('file_no',
            """b[contains(i/text(),'File No')]/
                    following-sibling::text()[1]""")
            dl.add_xpath("see_also",
                "b[contains(i/text(),'See also')]/following-sibling::node()")      
            # case-insensitive match on links whose text mentions 'comments'
            dl.add_xpath('docket_link',
                """a[contains(translate(text(),
                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                    'abcdefghijklmnopqrstuvwxyz'),
                    'comments')]/@href
                """)
            dl.add_xpath('docket_link',
                "descendant::*[contains(text(),'are available')]/@href")
            out = loader.load_item()
            print out
            yield out
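A couple of notes on how this hangs together: the single Rule, with its LinkExtractor restricted to the archive-links paragraph, is what walks the spider from the four index pages out to the archive page for each year, and the ItemLoader processors (MapCompose(unicode.strip), Join, TakeFirst) handle most of the whitespace clean-up so the XPaths can stay simple. The Scrapy/SEC/first_run.json file validated above is just this spider’s output, produced with something like scrapy crawl rule_archives -o first_run.json from inside the project.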

Scraping SEC Rulemaking with Scrapy

Intro

In the course of studying rulemaking, I’ve done a lot of complex scraping. Regulatory agencies put out incredible amounts of data, but it is hard to integrate into a useful form. Mostly I’ve done ad hoc scraping using Python tools like Requests and BeautifulSoup, which largely gets the job done. However, there are apparently better tools out there, including Scrapy, which I’m starting to get a handle on. In this series of posts, I’ll show how I use this tool to scrape the SEC’s rulemaking data.

The first page we want to scrape is 

https://www.sec.gov/rules/rulemaking-index.shtml

Within the page there is a table that is a good example of untidy data (and so the “fun” begins; search for Hadley Wickham’s writing on tidy data if you don’t know what I’m talking about). There is a ton of great information here, though, and putting this directory together ourselves would take an enormous amount of time.

It also contains links to various other documents with things that are even more interesting for a political scientist, for example comments and meeting logs. Eventually we’ll want to get those, but let’s start small and get the directory of rules published by the SEC since 2008.

Nitty Gritty

I’m gonna skip past all the stuff about installing Scrapy; that’s well covered in the official tutorial. As a first cut, here’s what our spider looks like:

import scrapy
class SecRulemakingSpider(scrapy.Spider):
    name = 'sec_rulemaking'
    allowed_domains = ['www.sec.gov']  # domains only, not full URLs
    start_urls = ['https://www.sec.gov/rules/rulemaking-index.shtml']


    def parse(self, response):
        for docket in response.xpath("//*[@id='rulemaking-index']/li")[2:]:
            yield {
            'last-action' : docket.xpath(
                    "div[contains(@class,'last-action')]/text()"
                                       ).extract_first(),
            'file-number': docket.xpath(
                    "div[contains(@class,'file-number')]/text()"
                                       ).extract_first(),
            'rule-title' : 
                docket.xpath(
                "div[contains(@class,'rulemaking')]/span[@class='name']/text()"
                ).extract_first(),
            'divisions' : docket.xpath(
                "div[contains(@class,'rulemaking')]/span[@class='division']/text()"
                ).extract_first()
            }


Pretty straightforward-looking, and it produces some cool results (here’s a taste):

{"divisions": "Corporation Finance", "last-action": "3/31/17", "rule-title": "Titles I and III of the JOBS Act", "file-number": "S7-09-16"}
{"divisions": "Corporation Finance", "last-action": "08/23/16", "rule-title": "Modernization of Property Disclosures for Mining Registrants", "file-number": "S7-10-16"}
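A quick aside on output: the usual way to dump a spider like this to JSON is scrapy crawl sec_rulemaking -o results.json from inside the project, but if you would rather keep everything in one plain Python file, a sketch along these lines also works (results.json is a placeholder name, and the FEED_* settings reflect the Scrapy version I happen to be on):

# a sketch: save the spider class above into a plain .py file and append this
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'FEED_FORMAT': 'json',        # serialize scraped items as a JSON array
    'FEED_URI': 'results.json',   # placeholder output path
})
process.crawl(SecRulemakingSpider)  # the spider class defined above
process.start()                     # blocks until the crawl finishes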

The problem is that the final column is the messy one. It contains multiple regulatory actions per docket, as well as links to other pages that we might eventually want to do something with (see Part 2). For now, the simplest thing to do is to define another function to analyze each action. In the interactive terminal (scrapy shell is handy for this) I do something like the following:

#pick a juicy one
docket =  response.xpath("//*[@id='rulemaking-index']/li")[5]
actions = docket.xpath("""div[contains(@class,'actions')]/a""")
action=actions[0]
print action.extract()

Yields something like

<a class="final-rule" href="/rules/final/finalarchive/finalarchive2008.shtml#34-57711">
            <span class="type">Final Rule</span>
            <span class="title">Disclosure of Divestment by Registered Investment Companies in Accordance With Sudan Accountability and Divestment Act of 2007</span>
            <span class="date">4/24/08</span>
            <span class="release">34-57711</span>
        </a>

That href is gonna be interesting eventually, but for now we can just try to parse the content:

def parse_action(action):
    return {
        'type': action.xpath("span[@class='type']/text()").extract_first(),
        'title': action.xpath("span[@class='title']/text()").extract_first(),
        'date': action.xpath("span[@class='date']/text()").extract_first(),
        'release': action.xpath("span[@class='release']/text()").extract_first(),
    }

And calling parse_action(action) yields:

{'date': u'4/24/08',
 'release': u'34-57711',
 'title': u'Disclosure of Divestment by Registered Investment Companies in Accordance With Sudan Accountability and Divestment Act of 2007',
 'type': u'Final Rule'}

Pretty Sweet. Let’s try running our updated script.

# -*- coding: utf-8 -*-
import scrapy


class SecRulemakingSpider(scrapy.Spider):
    name = 'sec_rulemaking'
    allowed_domains = ['www.sec.gov']  # domains only, not full URLs
    start_urls = ['https://www.sec.gov/rules/rulemaking-index.shtml']

    def parse_action(self,action):
        return {
        'type' : action.xpath("span[@class='type']/text()").extract_first(),
        'title' : action.xpath("span[@class='title']/text()").extract_first(),
        'date' : action.xpath("span[@class='date']/text()").extract_first(),
        'release' : action.xpath("span[@class='release']/text()").extract_first()}


    def parse(self, response):
        for docket in response.xpath("//*[@id='rulemaking-index']/li")[2:]:
            yield {
            'last-action' : docket.xpath(
                    "div[contains(@class,'last-action')]/text()"
                                       ).extract_first(),
            'file-number': docket.xpath(
                    "div[contains(@class,'file-number')]/text()"
                                       ).extract_first(),
            'rule-title' : 
                docket.xpath(
                "div[contains(@class,'rulemaking')]/span[@class='name']/text()"
                ).extract_first(),
            'divisions' : docket.xpath(
                "div[contains(@class,'rulemaking')]/span[@class='division']/text()"
                ).extract_first(),
            'actions' :  map(self.parse_action,docket.xpath("""
                    div[contains(@class,'actions')]/a
                    """))

            }

And the results would be something like this:

{"divisions": "Corporation Finance", "last-action": "01/13/16", "rule-title": "Amendments to Forms S-1 and F-1", "actions": [{"date": "01/13/16", "release": "33-10003", "type": "Interim Final Temporary Rule", "title": "Simplification of Disclosure Requirements for Emerging Growth Companies and Forward Incorporation by Reference on Form S-1 for Smaller Reporting Companies"}], "file-number": "S7-01-16"}
{"divisions": "Corporation Finance", "last-action": "3/31/17", "rule-title": "Titles I and III of the JOBS Act", "actions": [{"date": "03/31/17", "release": "33-10332", "type": "Final Rule", "title": "Inflation Adjustments and Other Technical Amendments under Titles I and III of the JOBS Act"}, {"date": "06/01/16", "release": "34-77969", "type": "Interim Final Temporary Rule", "title": "Form 10-K Summary"}], "file-number": "S7-09-16"}

While I think there’s a bit of an adjustment period to using Scrapy relative to the old way I was doing scraping, this is pretty nice actually. You spend more time thinking about the layout of the page and less time setting up the infrastructure for requesting and parsing content. I’m guessing it only gets better; find out in my next installment.