Scraping SEC Rulemaking (Pt. 5, Press Releases)

One of the big difficulties in conducting research on rulemaking is obtaining information about outcomes. Various collaborators of mine have introduced ideas for doing that in order to advance the scholarly research agenda. One idea that I used extensively in my dissertation was to use equity markets to obtain concise, single-dimensional summaries of regulatory impacts. To do any kind of event study, you need to know when a regulation was announced to a high degree of certainty. Typically this has been done using Bloomberg (good) or the publication date in the Federal Register (really bad! In one case I saw, this was two months off from the real date!). For my papers, I've used press releases that I date using RSS feeds. The main hitch with this method is that RSS feeds are ephemeral, so you have to hope that a service like Feedly or Google Reader was tracking the feed. If that doesn't work, your last best hope is that the Internet Archive was tracking it, though you'll generally fail to get a comprehensive record if that's your only recourse.

So, to start getting the press release times, we go to the RSS feed via the Feedly link.

Feedly has 945 results associated with that location, running from February 11, 2015 to today. The Wayback Machine does not have any archives before December 23, 2014, so the SEC likely had a different RSS feed before that, which hopefully was still functional for some time afterward and stored on Feedly.

The link here:

http://www.sec.gov/rss/news/press.xml

When parsed through Feedly, it yields 1000 results, indicating that our results are being clipped. They run from today back to 2014-12-04. Using the continuation key to get the next batch,

http://feedly.com/v3/streams/contents?streamId=feed/http://www.sec.gov/rss/news/press.xml&count=1000&continuation=14a166d4eed:128a3c7:b669a9fb

We get 602 more results, indicating that's all there is. These results go back to 2012-11-08. That's pretty good! Google Reader went offline on July 1, 2013, so in theory we might get more by wading through its archives. But before we do that, let's make sure there wasn't another RSS feed in use before that time. Indeed, https://web.archive.org/web/20120626021816/http://www.sec.gov:80/rss/news/press.xml goes back to 2006. So our best hope at this point is probably the Google Reader archive. To get that, you download the index first, which is a huge 12 GB compressed file, and then grep it for the feed you want.
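Before moving on to Google Reader, it's worth noting that the Feedly paging above is easy to script. Here's a minimal sketch using requests (not part of my original workflow, which uses the Scrapy spider at the end of this post); it assumes the unauthenticated feedly.com endpoint still answers the way it did for the URLs above, and it just keeps passing the continuation key back until a page comes back with fewer than 1000 items.

import requests

# hypothetical standalone version of the Feedly paging described above
base = 'http://feedly.com/v3/streams/contents'
stream_id = 'feed/http://www.sec.gov/rss/news/press.xml'

items = []
params = {'streamId': stream_id, 'count': 1000}
while True:
    page = requests.get(base, params=params).json()
    items.extend(page.get('items', []))
    # a full page of 1000 items means the results were clipped; follow the continuation key
    if len(page.get('items', [])) < 1000 or 'continuation' not in page:
        break
    params['continuation'] = page['continuation']

print('retrieved %d items' % len(items))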

It turns out that we're in luck and there is some archived RSS data from Google Reader. Searching through this file is pretty easy: you just run grep -e 'www.sec.gov' BIGFILENAME.cdx > search_results.fw, and the result will be a space-delimited file.

And once that big search is done, you can view the results like so:

import pandas as pd

# the grep output is a space-delimited CDX index; the column letters below are the standard CDX legend
d = pd.read_csv('search_results.fw',delimiter=' ',header=None)
#N b a m s k r M S V g
d.columns = ['massaged_url',
    'date',
    'original_url',
    'mime_original',
    'response_code',
    'new_style_checksum',
    'redirect',
    'meta_tags',
    'S',                       # compressed record size in bytes
    'compressed_arc_offset',   # byte offset of the record in the warc file
    'file_name']

relevant_feeds = d[d.massaged_url.str.contains('www.sec.gov/rss/news/press.xml')]

It turns out that the backup is in a file called archiveteam_greader_20130619095946/greader_20130619095946.megawarc.warc.gz, which is 25 GB and also easily downloaded.
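In case it's useful, that download could be scripted too. This is just a sketch: I'm assuming the path above splits into an archive.org item identifier and a file name, in which case the usual https://archive.org/download/<item>/<file> URL pattern applies; check the item page before relying on it.

import requests

# hypothetical download, assuming <item>/<file> matches the path given above
url = ('https://archive.org/download/'
       'archiveteam_greader_20130619095946/greader_20130619095946.megawarc.warc.gz')

r = requests.get(url, stream=True)
r.raise_for_status()
with open('greader_20130619095946.megawarc.warc.gz', 'wb') as fp:
    # stream to disk in 1 MB chunks -- the file is about 25 GB
    for chunk in r.iter_content(chunk_size=1 << 20):
        fp.write(chunk)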

Now, these WARC files are confusingly documented, and I don't really understand how to use the tools the Internet Archive provides to access them. But it turns out that just reading the file byte by byte, using the information given in the index, works well enough. Below is some code that extracts all the SEC press releases.

import pandas as pd
import StringIO
import gzip
import json
import warc
from jsonlines import Writer
import codecs

# load the grep'd CDX index; the column letters below are the standard CDX legend
d = pd.read_csv('GoogleReader/archiveteam-googlereader201306-indexes.cdx/search_results.fw',delimiter=' ',header=None)
#N b a m s k r M S V g
d.columns = ['massaged_url',
    'date',
    'original_url',
    'mime_original',
    'response_code',
    'new_style_checksum',
    'redirect',
    'meta_tags',
    'S',                       # compressed record size in bytes
    'compressed_arc_offset',   # byte offset of the record in the warc file
    'file_name']

relevant_feeds = d[d.massaged_url.str.contains('www.sec.gov/rss/news/press.xml')]

file_name = 'GoogleReader/archiveteam_greader_20130619095946/greader_20130619095946.megawarc.warc.gz'

all_items = []
for idx,row in relevant_feeds.iterrows():
    with open(file_name,'rb') as fp:
        #move ahead in the file by the offset specified in the index
        fp.seek(row['compressed_arc_offset'])
        #read in the record's compressed bytes
        raw = fp.read(row['S'])
    #decompress the record and parse it as a WARC record
    text = gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()
    w = warc.WARCFile(fileobj=StringIO.StringIO(text))
    record = w.read_record()
    payload = record.payload.read()
    #the payload is an HTTP response; the JSON body follows the blank line after the headers
    j = json.loads(payload.split('\r\n\r\n',1)[1])
    all_items = all_items + j['items']

with codecs.open('google_reader_archived.jsonl',"w+",encoding='utf-8') as fp:
    writer = Writer(fp)
    writer.write_all(all_items)

And the resulting file should look quite similar to what we obtain from Feedly.
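To turn those items into the precise timestamps we're after, something like the following works (a sketch, not from my original pipeline). It assumes each item carries title and published fields, with published in epoch milliseconds as Feedly reports it; if the Google Reader records turn out to store seconds, switch the unit.

import jsonlines
import pandas as pd

# load the archived items and convert the epoch timestamps to datetimes
with jsonlines.open('google_reader_archived.jsonl') as reader:
    items = list(reader)

df = pd.DataFrame(items)
# 'published' is assumed to be epoch milliseconds (Feedly's convention);
# use unit='s' instead if these records store seconds
df['published_dt'] = pd.to_datetime(df['published'], unit='ms')
print(df[['title', 'published_dt']].sort_values('published_dt').head())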

Now that we can assign very precise dates and times to the appearance of a rule, we only need to acquire the remaining press releases. The following code does that job.

import scrapy
from urllib import urlencode
from scrapy.shell import inspect_response
import json

class PressReleaseSpider(scrapy.Spider):
    name = 'press_release_spider'

    def make_feedly_url(self):
        return self.feedly + self.rss + urlencode(self.params)

    def start_requests(self):
        self.feedly = 'http://feedly.com/v3/streams/contents?streamId=feed/' 
        self.rss = 'http://www.sec.gov/rss/news/press.xml&'
        self.params = {'count' : 1000}
        url = self.make_feedly_url()
        self.log(url)
        yield scrapy.Request(url=url,callback=self.parse)


    def parse(self,response):
        #inspect_response(self,response)
        if 'feedly.com' in response.url:
            j = json.loads(response.text)
            #a full page of 1000 items means the results were clipped,
            #so request the next page using the continuation key
            if len(j['items']) == 1000:
                self.params['continuation'] = j['continuation']
                yield scrapy.Request(url=self.make_feedly_url(),callback=self.parse)
            for item in j['items']:
                item['kind'] = 'rss'
                yield item
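To actually run the spider and dump what it yields to disk, something like the following should do it (a sketch; the output filename is mine, and the feed settings use the older FEED_FORMAT/FEED_URI style):

from scrapy.crawler import CrawlerProcess

# run the spider and write every yielded item to a JSON-lines file
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'jsonlines',
    'FEED_URI': 'feedly_press_releases.jsonl',
})
process.crawl(PressReleaseSpider)
process.start()  # blocks until the crawl finishes

Equivalently, you can save the spider to a file and run it with scrapy runspider, passing -o to set the output file.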