Using text similarity estimation algorithms to find ALEC model bills introduced in state legislatures

When I first got involved in city and state politics in college, one of the first things that surprised me was how ideas turn into bills. While the process for bills to become laws is straightforward and has been taught to us in civics classes since elementary school (I'm sure you remember Schoolhouse Rock), I'd been under the assumption that lawmakers or their staff write all the bills they introduce themselves. But when I was looking for someone in West Virginia to support legislation to improve protections for rental tenants, the first delegate I talked to about introducing the bill said, "This sounds really great. Send me over a draft and I'll take a look at it."

With state legislators often having small staffs and limited resources, the appeal of having prewritten drafts of bills makes sense. It saves them time, and often the people or organization approaching the lawmaker have much more in-depth knowledge on the issue than the lawmaker themselves.

In this post, I'm going to look at model bills that are written by a political organization called ALEC. If you haven't heard of them, you're certainly not alone. The New York Times describes them as a "stealth business lobbyist."

Here are a few more short summaries of what ALEC is, from different viewpoints:


[ALEC is] a nonprofit organization of conservative state legislators and private sector representatives that drafts and shares model state-level legislation for distribution among state governments in the United States.

ALEC's website:

[ALEC] provides a constructive forum for state legislators and private sector leaders to discuss and exchange practical, state-level public policy issues. The potential solutions discussed at ALEC focus on free markets, limited government and constitutional division of powers between the federal and state governments.

And Sourcewatch:

ALEC is a corporate bill mill. It is not just a lobby or a front group; it is much more powerful than that. Through ALEC, corporations hand state legislators their wishlists to benefit their bottom line. Corporations fund almost all of ALEC's operations. They pay for a seat on ALEC task forces where corporate lobbyists and special interest reps vote with elected officials to approve “model” bills

For a fascinating deep dive into ALEC and how it works, check out this paper by a doctoral candidate at Harvard that is based on leaked internal documents from ALEC in 1995.

With model legislation in hand, bills based on ALEC's recommendations are often introduced (sometimes verbatim) by lawmakers who support ALEC's suggested policies. It works well for both parties: the lawmakers don't need to spend time researching and writing these types of bills, and ALEC gets bills that support its beliefs introduced.

Additionally, one of the first steps in putting together a potential bill is to look at how the laws in other states approach the topic, so a law in one state can have a cascading effect on other states, which fits in nicely with what ALEC is trying to accomplish.

In the interest of disclosure, I'll mention that while I'm not a fan of the majority of ALEC's policies, I do think that their strategy has been both savvy and effective at passing legislation nationwide in support of their views. The point of this post is simply to show how code can be used to detect ALEC bills that have been introduced.

If you don't care about the process, you can jump straight to the results for the West Virginia legislature from 2009 to 2016.

Scraping ALEC

The first thing we'll need to do is get all the model bills that ALEC provides. Fortunately, ALEC provides a list of all model legislation on their website. Unfortunately, the only format it is available in is HTML, so we'll need to scrape it and convert it into a format that is easy to analyze.

For scraping with Ruby, the most commonly used library is Nokogiri. Here's a good overview of how to use it if you aren't familiar with it already. Nokogiri can be installed with gem install nokogiri.

The strategy for scraping the model legislation is straightforward: first we'll scrape the index page to get a list of the URLs for each piece, then we'll load each individual page and grab the important content: the title, URL, content (both raw HTML and stripped), and tags. The script then saves all this data as JSON so it will be easy to work with later.
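As a rough illustration of the final step, the record for one model bill could be assembled and serialized along these lines. This is only a sketch using Ruby's standard library: the key names, the sample values, the output file name, and the regex-based tag stripping are all assumptions for illustration (a real scraper would more likely let Nokogiri extract the text).

```ruby
require 'json'

# Hypothetical record for one scraped model bill. The key names and the
# sample values below are placeholders, not output from a real scrape.
def build_record(title:, url:, html:, tags:)
  {
    title: title,
    url: url,
    # Last non-empty path segment of the URL.
    slug: url.split('/').reject(&:empty?).last,
    # Raw markup exactly as scraped.
    html: html,
    # Crude tag stripping: replace tags with spaces, then collapse spaces.
    text: html.gsub(/<[^>]+>/, ' ').squeeze(' ').strip,
    tags: tags
  }
end

record = build_record(
  title: 'Example Model Act',
  url: 'https://example.org/model-policy/example-model-act/',
  html: '<p>Section 1. <strong>Short title.</strong></p>',
  tags: ['example']
)

json = JSON.pretty_generate([record])
# File.write('alec-bills.json', json) # file name is an assumption

# Reading the data back later is a single call.
restored = JSON.parse(json)
```

Keeping both the raw HTML and the stripped text means the text extraction can be redone later without scraping the site again.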

Below is an example script that can be used to get the results:

Scraping state legislature sites for bills

Next up we need to get the bills we want to compare with the ALEC model legislation for similarity. ALEC is quite popular in the West Virginia legislature — I remember reading at one point that WV has one of the highest membership rates in the country — so we'll use WV bills for exploring. It's also where I cut my teeth with state politics, so I personally find the results fascinating.

Not much of a shocker, but West Virginia doesn't provide bills in an easy-to-parse format like CSV or JSON, so we'll need to scrape again to get the data we need. The strategy for scraping the West Virginia legislature site is similar to how we did it with ALEC's site above: find an index page, loop through each bill listed, and save it so we can compare against the ALEC model bills later. This index page lists bills introduced into both the House and Senate.

Below is a script that can be used to grab the bills introduced in a certain year. The year to scrape should be passed as a parameter, e.g. ruby scraper.rb 2015. The script will take some time to chug through and save everything.

The markup on the West Virginia legislature site is atrocious, so you may find you need to adjust the script above depending on the year to match the HTML.

Finding similarity

Now that we have both the ALEC model legislation and West Virginia bills saved, we need a way to compare the two and measure their similarity. There are numerous ways to do this, so the choice depends a lot on how the results will be used and on experimenting with various techniques. For content that will be queried often, Elasticsearch and Lucene both have scoring algorithms built in. For a one-off analysis that is easy to customize, other algorithms like SimHash can be used, which is what we'll be playing with here.

SimHash got its start at Google as a way to prevent duplicate content from appearing in search results. There are plenty of articles online that do a better job than I can of explaining the details behind how it works, but the short summary is that text is converted into a hash, which is essentially a fingerprint of the content. Unlike a typical hash, similar text produces similar fingerprints, so a fingerprint can be compared against the hashes of other text to determine similarity. Instead of a 2,000-word text document, we'll end up with a number that looks something like 114294594088531729867882921390921571.
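To make the fingerprint idea a bit more concrete, here is a minimal sketch of the general SimHash technique in plain Ruby. It is not the implementation used by any particular gem. It assumes whitespace-separated lowercase tokens, gives every token a weight of 1, and uses MD5 as the per-token hash; the method name and the 64-bit default are illustrative choices.

```ruby
require 'digest'

# Minimal SimHash sketch (assumptions: whitespace tokens, weight 1 per token,
# MD5 as the token hash). The method name is hypothetical.
def simhash_sketch(text, bits: 64)
  counts = Array.new(bits, 0)
  mask = (1 << bits) - 1

  text.downcase.split.each do |token|
    # Keep only the lowest `bits` bits of the 128-bit MD5 digest.
    token_hash = Digest::MD5.hexdigest(token).to_i(16) & mask
    bits.times do |i|
      # Vote +1 where the token's hash has a 1 bit, -1 where it has a 0 bit.
      counts[i] += token_hash[i] == 1 ? 1 : -1
    end
  end

  # Each fingerprint bit is 1 when the votes for that position are positive.
  counts.each_with_index.reduce(0) do |fingerprint, (count, i)|
    count.positive? ? fingerprint | (1 << i) : fingerprint
  end
end

fingerprint = simhash_sketch('an act relating to welfare system integrity')
```

Because every token votes on every bit, changing a handful of words usually flips only a few bits of the fingerprint, which is what makes two fingerprints comparable.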

For converting text into a SimHash, we can use the bookmate/simhash gem, which has helpfully wrapped up the algorithm into an easy-to-use Ruby library. The gem also has some useful options built in, like stripping stop words and the ability to decide how we want to split the string.

We'll want to use the call below, which splits the text into words, removes any stop words, and sets the hash size to 512 bits.

text.simhash(split_by: / /, stop_words: true, hashbits: 512)

Here's an example script that will loop through the bills and create a simhash of each:

Now let's generate simhashes for each of the ALEC model bills by running ruby simhasher.rb alec. This will save the results to a JSON file called alec-simhashes.json.

The result for each model bill will look something like this:

{
  "slug": "workers-compensation-fraud-warning-act",
  "title": "Workers’ Compensation Fraud Warning Act",
  "simhash": 114294594088531729867882921390921571
}

Now that we have simhashes for the model ALEC bills, we need to compare them against the introduced bills to find similarity. A script to do that is below:

The script is mostly self-explanatory — loop through the bills, generate a simhash for the bill, and compare it against each simhash for the ALEC model legislation. To compare each, we will use an algorithm called Hamming Distance, which Wikipedia describes as:

[...] the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
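The definition above can be checked directly on short strings. The sketch below is a literal translation of it for equal-length strings; the method name is illustrative.

```ruby
# Hamming distance between two equal-length strings: count the positions
# where the corresponding characters differ.
def string_hamming_distance(a, b)
  raise ArgumentError, 'strings must be the same length' unless a.length == b.length

  a.chars.zip(b.chars).count { |x, y| x != y }
end

string_hamming_distance('karolin', 'kathrin') # => 3 (r/t, o/h, l/r differ)
string_hamming_distance('10110', '11100')     # => 2 (second and fourth positions differ)
```

The second call is the case that matters for simhashes: when the strings are binary digits, the distance is just the number of bits that differ.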

The Hamming Distance formula written in Ruby looks like this:

def hamming_distance(a, b)
  # XOR leaves a 1 bit wherever a and b differ; count those bits.
  (a ^ b).to_s(2).count('1')
end

It looks quite simple thanks to the beauty of Ruby, but what is actually happening is fairly complex. In this method we use a bitwise XOR operator for comparison, convert the result to a string in base 2, then count the number of 1 characters to find a score.
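A small worked example may help here. The sketch below applies the same method to tiny values and then ranks a couple of model bills against one introduced bill. The slugs and 8-bit fingerprints are made up for readability; real simhashes are far longer.

```ruby
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end

# Worked example on two 4-bit values:
#   0b1011 ^ 0b1101 = 0b0110, which contains two 1 bits.
hamming_distance(0b1011, 0b1101) # => 2

# Hypothetical model bills with shortened, made-up fingerprints.
models = [
  { slug: 'example-act-b', simhash: 0b01001101 },
  { slug: 'example-act-a', simhash: 0b10110010 }
]
bill_simhash = 0b10110110

# Score every model bill against the introduced bill, lowest distance first.
ranked = models
  .map { |m| { slug: m[:slug], score: hamming_distance(bill_simhash, m[:simhash]) } }
  .sort_by { |m| m[:score] }

ranked.first # => {:slug=>"example-act-a", :score=>1}
```

Here example-act-a differs from the bill in a single bit while example-act-b differs in seven, so the first is by far the closer match.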

After the script has finished running, we can see the results.


Below are the results for similarity between bills introduced in the West Virginia Legislature from 2009 to 2016 and currently listed ALEC model legislation. The lower the score, the higher the similarity.

For example, in 2015, HB2729 "Relating to welfare system integrity" was very similar to the nearly identically named ALEC model bill "Welfare System Integrity Act". As you scroll through the results, you'll notice that a lot of the names closely mirror each other.

Keep in mind too that similarity doesn't guarantee that a bill was modeled on an ALEC model bill. It only indicates that it may have been.

I'm aiming to do a few other states when I get some free time. If you end up using the scripts to find ALEC model bills introduced in other states, please let me know at @tylerpearson!

Also, if you find this interesting, be sure to check out some of my other posts.