Scraping SEC Rulemaking (Pt. 3)

In the previous post I showed how I scraped the SEC’s rulemaking archives to find, among other things, all the available rulemaking dockets. The next step is to acquire the comments and whatever else might be of interest in those dockets. Here’s the code I start with:

import scrapy
import pandas as pd
from urlparse import urljoin

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    def parse(self,response):
        yield {'url' : response.url}

I don’t use the CrawlSpider class because I already have the full set of starting URLs and don’t need its machinery of link-following rules. An example of what a docket page looks like is here:

https://www.sec.gov/rules/proposed/s71205.shtml

This is a good first cut. Docket pages do get more complicated, with multiple types of data, as we’ll see; for now this will do. The name of the rule is at the top, along with some other identifying information. We can ignore this, as we already have that kind of information associated with the URL (see the last post). As a general rule, it makes sense to scrape only as much data as you actually want to clean and organize.

So let’s think about what we want the items to look like. Each row seems to have three pieces of information:

  • Date
  • Description
  • PDF attachment.

The description contains richer data, for example the name of the commenter, their title, organization, and location. There are also “memorandums” of meetings mixed in with these documents. This seems to me to be one of those cases where parsing the metadata is best done outside of the scrapy context, in another script, so we’ll make a class just to grab these items as-is. Similarly, we will likely want to extract the OCRed text from the PDFs, but that too seems like a task best left for a later cleanup script.

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()

With a clear idea of what we want, we can now focus on how the page is organized. The Chrome developer tools are good for this, since you can inspect elements and also copy out an XPath.

So the nearest analogue of our items is the “table row”, which would normally be indicated by <tr> tags. The problem is that in the 1990s and early 2000s developers often liked to control the width of pages by making the whole page one big table, so selecting by <tr> yields a mess. In the rulemaking scraper we got around this by noting that a table row needed to have three cells for us to be interested in it. Here we have no such luck, because we want two-column rows and the whole page is set up as a two-column table!

So the thing to note is that the table we want is a sibling that follows the header. A quick XPath command in the console, $x("//h1"), shows that there is a unique h1 on this page. So we can try $x("//h1/following-sibling::*") to see what siblings there are. It turns out that two tables result: one is the footer (“contact | employment | links …”) and the other is the table we want. And the footer is also a two-column table (d’oh!). Fortunately one of the footer’s cells has a class (“footer”), so we can simply kick that table out.

$x("//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]")

This returns one table, which is the one we want. It says, “give me all the tables that are siblings following the h1, except those that have a descendant td classed as ‘footer’.” We then want all the table rows under that table with exactly two cells, which we get by appending ‘//tr[count(td)=2]’. Doing the obvious stuff after that, we obtain the script below.
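First, though, it’s easy to sanity-check the selector in scrapy shell. A quick sketch (using the s71205 docket from above; the exact row count will depend on the docket):

$ scrapy shell https://www.sec.gov/rules/proposed/s71205.shtml
>>> sel = "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]"
>>> len(response.xpath(sel))                          # 1: the comments table
>>> len(response.xpath(sel + "//tr[count(td)=2]"))    # one result per comment row

With the selector confirmed, here is the spider: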

import scrapy
import pandas as pd
from urlparse import urljoin
from scrapy.loader import ItemLoader

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()
    listed_on = scrapy.Field()

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            l.add_xpath('attachment_web_location','td/a/@href')
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out
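As an aside, you don’t need a full scrapy project to try this out; runspider will run a standalone spider file. Assuming the code above is saved as sec_comments.py (the file name is my own):

scrapy runspider sec_comments.py -o docket_items.json

The -o flag dumps the scraped items to a feed file whose format is inferred from the extension, JSON in this case.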

And it produces output like this:

[ {'attachment_web_location': [u'/rules/proposed/s71205/eurocomp031805.pdf'],
'date': [u'Mar. 18, 2005'],
'description': [u'Alexandre Tessier, Directeur G\xe9n\xe9ral, AFEP, Association Franc\xe7aise des Entreprises Priv\xe9es; Dr. Hellmut Longin, Pr\xe4sident, Aktienforum; \r\nProf. R\xfcdiger von Rosen, Managing Director, Deutsches Aktieninstitut; John Pierce, Chief Executive, The Quoted Companies Alliance; Mrs. Angeliki Petroulaki, General Manager, The Seiset Association; Panayotis G. Dracos, President and CEO, UCL/ASE; Evelyne Deloirie, Secr\xe9taire G\xe9n\xe9ral, MiddleNext; Digby Jones, Director General, CBI; Dr. Ludolf von Wartenbreg, Director General, BDI; Ebba Linds\xf6, Director General, Confederation of Swedish Enterprise; Pieris Theodorou, Chairman, SYDEK'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']},
{'attachment_web_location': [u'/rules/proposed/s71205/cleary020904.pdf'],
'date': [u'Feb. 9, 2004'],
'description': [u'Edward F. Greene, Cleary, Gottlieb, Steen & Hamilton'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']},
{'attachment_web_location': [u'/rules/proposed/s71205/eurocompanies020904.pdf'],
'date': [u'Feb. 9, 2004'],
'description': [u'\r\nAlain Joly, Pr\xe9sident, EALIC; Alexandre Tessier, Directeur G\xe9n\xe9ral, AFEP (Association Fran\xe7aise des Enterprises Priv\xe9es; Prof. R\xfcdiger von Rosen, Managing Director, Deutsches Aktieninstitut; Baron Vandamme, Administrateur, ABSC; Dr. Hellmut Longin, Pr\xe4sident, Aktienforum; Dr. Ludolf von Wartenberg, Director General, BDI; Edouard de Royere, Pr\xe9sident, ANSA; Mrs. Angeliki Petroulaki, General Manager of The Seiset Association; Ebba Lindso, Director General, Confederation of Swedish Enterprise; Tarmo Korpela, Deputy Director General, Confederation of Finnish Industry and Empoyers; Jones Digby, Director General, CBI'],
'listed_on': ['https://www.sec.gov/rules/proposed/s71205.shtml']}]

What about downloading the files? Scrapy has an easy time doing this using some prewritten item pipelines. An item pipeline gets applied to each item after it has been scraped; this is the place where you would put validation, HTML cleaning, duplicate checking, and database storage (oh, so that’s where it goes!). And which pipelines get applied is simply a setting that you amend.

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    #to download files
    custom_settings = {
        'ITEM_PIPELINES' : {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE' : 'Downloads',
        'FILES_RESULT_FIELD' : 'attachment_file_location',
        'FILES_URLS_FIELD' : 'attachment_web_location'

    }

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            web_loc = row.xpath('td/a/@href').extract()
            l.add_value('attachment_web_location',map(response.urljoin,web_loc))
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out

This code seems to run well and doesn’t hit many edge cases. Nevertheless, let’s start thinking about producing some improved output. Here’s what might get fed into the pipeline:

{u'attachment_file_location': [{u'checksum': u'a462a3fcd7c45775b6fcf545f5ee626f',
   u'path': u'full/01c89e34bf2978c41eeabd9fb7c83c2edf86da66.pdf',
   u'url': u'https://www.sec.gov/comments/s7-15-11/s71511-12.pdf'}],
 u'attachment_web_location': [u'https://www.sec.gov/comments/s7-15-11/s71511-12.pdf'],
 u'date': [u'Jan. 24, 2014'],
 u'description': [u'Sean C. Davy, Managing Director, Corporate Credit Markets Division, Securities Industry and Financial Markets Association'],
 u'listed_on': [u'https://www.sec.gov/comments/s7-15-11/s71511.shtml']}
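For orientation: a pipeline is just a class with a process_item method, plus optional open_spider and close_spider hooks. A minimal sketch of the duplicate-checking use mentioned above, dropping items whose attachment we have already seen (the class name is mine, and it isn’t part of the scraper below):

from scrapy.exceptions import DropItem

class DedupeAttachmentsPipeline(object):
    """Drop items whose attachment URLs have already been seen."""

    def open_spider(self, spider):
        self.seen = set()

    def process_item(self, item, spider):
        urls = tuple(item.get('attachment_web_location', []))
        if urls in self.seen:
            raise DropItem("duplicate attachment: %s" % (urls,))
        self.seen.add(urls)
        return item

Enabling it would just mean adding another entry to ITEM_PIPELINES. The tabular pipeline I actually want follows the same shape.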

Let’s make a pipeline that produces a clean(ish) tabular csv file. I will have this one write to a file of its own; it won’t change the items at all. To work out the cleaning, I load the JSON produced from a first run into a pandas dataframe, then write the process_item function to mirror what I end up doing to the dataframe (a sketch of that prototyping follows the list). Here’s what I do:

  • Drop items where the description is missing.
  • Focus on the ones with more than one description entry per row:
    — Drop entries that are just whitespace
    — Drop those that begin ‘Letter Type’ or ‘Attachment’
    — Join the remaining text
  • Based on the text, decide whether it’s a meeting, a phone call, or something else.
  • Flatten the file locations by eliminating some unnecessary data.
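Here is roughly what that prototyping looked like on the dataframe side; a sketch, assuming the first run was exported to a (hypothetical) first_run.json feed:

import pandas as pd

#load the items from a first run of the spider
df = pd.read_json('first_run.json')

#drop items with no description at all
df = df.dropna(subset=['description'])

#each description is a list of fragments: strip whitespace, drop empty
#pieces and the 'Letter Type'/'Attachment' boilerplate, then join
def clean_description(pieces):
    pieces = [p.strip() for p in pieces]
    pieces = [p for p in pieces
              if p and not p.startswith(('Letter Type', 'Attachment'))]
    return " ".join(pieces)

df['description'] = df['description'].map(clean_description)

#crude classification: memoranda are calls/meetings/emails, the rest comments
def classify(desc):
    d = desc.lower()
    if 'memorandum' not in d:
        return 'Comment'
    if any(w in d for w in ('call', 'phone', 'telephon')):
        return 'Phone call'
    if any(w in d for w in ('meeting', 'discussion', 'consultation')):
        return 'Meeting'
    if 'mail' in d:
        return 'Email'
    return 'Memo'

df['kind'] = df['description'].map(classify)

The process_item function below does the same thing, only item by item.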

Now, it actually took a fair amount of time to optimize this script, and a lot of fiddling with the rule-based parsing. The end result seems excellent, however, so it was well worth the effort. Here’s the final script:

import scrapy
import numpy as np
import pandas as pd
from urlparse import urljoin
from scrapy.loader import ItemLoader
import regex
import natty

class DocketItem(scrapy.Item):
    date = scrapy.Field()
    description = scrapy.Field()
    attachment_web_location = scrapy.Field()
    attachment_file_location = scrapy.Field()
    listed_on = scrapy.Field()
    kind = scrapy.Field()
    flagged = scrapy.Field()

class TabularOutputPipeline(object):
    """
    Produce a csv file that is a flat table from the scraped data.
    Column headers:

    attachment_file_location, attachment_web_location, date,
    description, listed_on, kind, flagged
    """
    def open_spider(self,spider):
        #clobber whatever file is there and write a header
        self.file = open('lobbying_sec_rulemaking.csv','w')
        #the line we wish to write
        self.header= 'attachment_file_location,attachment_web_location,' + \
            'date,description,listed_on,kind,flagged\n'
        #we'll need to add the columnns to the items
        #to ensure that the csv aligns when missing
        self.columns = self.header.strip().split(',')
        self.file.write(self.header)

    def close_spider(self,spider):
        self.file.close()

    def process_item(self,item,spider): 
        #---- Description Section
        if 'description' not in item.keys():
            item['flagged'] = True
            return item #do nothing with bad ones, just note that we will drop them.
        x = item['description']
        #x != x only holds for NaN, i.e. a missing value
        if x != x:
            item['flagged'] = True
            return item
        x = filter(lambda x: len(x)!=0,map(unicode.strip,x))
        x = filter(lambda x: not regex.match('^Letter Type|^Attachment',x,regex.IGNORECASE),x)
        item['description'] = " ".join(x)
        # ----  Parse the description to determine which are meetings.
        if regex.search('Memorandum',item['description'],regex.IGNORECASE):
            if regex.search('call|phone|telephon|teleconference',
                item['description'],regex.IGNORECASE):
                item['kind'] = 'Phone call'
            elif regex.search('meeting|discussion|consultation|presentation|meeating',
                item['description'],
                regex.IGNORECASE):
                item['kind'] = 'Meeting'
            elif regex.search('e-mail|email',
                item['description'],
                regex.IGNORECASE):
                item['kind'] = 'Email'
            else:
                item['kind'] = 'Memo'
        else:
            item['kind'] = 'Comment'
        #----- parse dates
        if 'date' in item.keys():
            raw = item['date'][0]
            if raw=='undated' or len(raw.strip())==0:
                item['date'] = np.nan
                item['flagged'] = True
            else:
                parsed = natty.DateParser(raw).result()[0]
                item['date'] = parsed.date().isoformat()
        #--- clean information to help track down
        item["listed_on"] = item["listed_on"][0]
        item["attachment_web_location"] = item["attachment_web_location"][0]
        item["attachment_file_location"] = item["attachment_file_location"][0]['path']
        if 'flagged' not in item.keys():
            item['flagged'] = False
        out = pd.DataFrame([item],columns=self.columns)
        out.to_csv(self.file,mode='a',
            header=False,index=False,encoding='utf8')
        return item

class CommentSpider(scrapy.Spider):
    name = 'comment_spider'
    allowed_domains = ['www.sec.gov']

    #make start_urls by first loading in our previously scraped data
    src = pd.read_csv("sec_rulemaking_directory_scraped.csv")
    #dockets not always available, so drop
    #also might be more than one link per entry, so split-melt-drop
    links = src.docket_link_cleaned.dropna().\
        str.split("|",expand=True).\
        melt(value_name='links').dropna().links

    start_urls = map(lambda x: urljoin('https://www.sec.gov/',x),links)

    #to download files
    custom_settings = {
        'ITEM_PIPELINES' : {'scrapy.pipelines.files.FilesPipeline': 1,
                            'SEC.spiders.sec_comments.TabularOutputPipeline': 800},
        'FILES_STORE' : 'Downloads',
        'FILES_RESULT_FIELD' : 'attachment_file_location',
        'FILES_URLS_FIELD' : 'attachment_web_location'

    }

    def parse(self,response):
        rows  = response.xpath(
        "//h1/following-sibling::table[not(.//td[contains(@class,'footer')])]//tr[count(td)=2]"
            )
        for row in rows:
            l = ItemLoader(item=DocketItem(),selector=row)
            l.add_xpath('date','td[1]/text()')
            l.add_xpath('description','td/a/text()')
            web_loc = row.xpath('td/a/@href').extract()
            l.add_value('attachment_web_location',map(response.urljoin,web_loc))
            l.add_value('listed_on',response.url)
            out = l.load_item()
            yield out
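One loose end from earlier: pulling text out of the downloaded PDFs really is a separate cleanup job. A minimal sketch, assuming poppler’s pdftotext is installed and the files sit in the Downloads/full directory that the FilesPipeline creates (scanned documents would still need real OCR):

import os
import subprocess

pdf_dir = "Downloads/full"
for name in os.listdir(pdf_dir):
    if not name.lower().endswith(".pdf"):
        continue
    src = os.path.join(pdf_dir, name)
    #pdftotext extracts the text layer and writes it alongside the pdf
    subprocess.call(["pdftotext", src, src[:-4] + ".txt"])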

One brief note before proceeding. I’ve put everything in a single giant script, but this is not how projects are supposed to be organized in scrapy. When you create a project, scrapy makes scripts called ‘settings.py’, ‘pipelines.py’, ‘middlewares.py’, and ‘items.py’, as well as a folder called spiders. So the three classes my code defines should really live in three separate places, which probably makes the code more maintainable and easier to reuse. It also suggests that much of what I’ve put in the process_item function of the TabularOutputPipeline really belongs in a custom ItemLoader.
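For reference, scrapy startproject SEC lays things out roughly like this (per the scrapy docs):

SEC/
    scrapy.cfg
    SEC/
        __init__.py
        items.py          # DocketItem would go here
        middlewares.py
        pipelines.py      # TabularOutputPipeline would go here
        settings.py       # ITEM_PIPELINES, FILES_STORE, etc.
        spiders/
            __init__.py
            sec_comments.py   # the CommentSpider itself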
