Testwiki:WikiProject Brazilian Laws/Scripts

{{Wikidata:WikiProject Brazilian Laws/Tabs|Scripts}}

Introduction

There are two main official platforms for accessing Brazilian legislation: the LexML Project and the Palácio do Planalto website. Both sites provide a search interface through which one can query items using a limited set of parameters. LexML also provides an API and a controlled dictionary for its metadata. Neither platform offers an option to download its contents in bulk. All the metadata about legislation and other official acts in Brazil, including their full text, is not subject to copyright. [1]

The schema crosswalk of these entities shows which metadata each website provides; neither of them contains all the metadata available. So, as an activity of this project, both websites were scraped using Python scripts to fetch as much information as possible about the legislation they hold. Later on, we compiled all this information into a spreadsheet, wikidatified it and uploaded it to Wikidata.

Below we illustrate the kind of metadata present on both websites, using Template:Q as an example. Next, we go into the details of the scripts used in each step of the scraping process.

Illustration

As an example, this law has the following metadata table on its record page at LexML:

Localidade: Brasil
Autoridade: Federal
Título: Lei nº 13.709, de 14 de Agosto de 2018
Data: 14/08/2018
Apelido: Lei Geral de Proteção de Dados Pessoais (LGPDP)
Apelido: LEI-13709-2018-08-14, LEI GERAL DE PROTEÇÃO DE DADOS
Ementa: Dispõe sobre a proteção de dados pessoais e altera a Lei nº 12.965, de 23 de abril de 2014 (Marco Civil da Internet).
Nome Uniforme: urn:lex:br:federal:lei:2018-08-14;13709
Mais detalhes: Senado Federal
Mais detalhes: Câmara dos Deputados
Projeto de Origem: [ Projeto de Lei (CD) nº 4060/2012 > Projeto de Lei da Câmara nº 53/2018 > Veto nº 33/2018 : Lei nº 13.709 de 14/08/2018 ]

And on its record page on the Palácio do Planalto website:

Data de assinatura: 14 de Agosto de 2018
Ementa: DISPÕE SOBRE A PROTEÇÃO DE DADOS PESSOAIS E ALTERA A LEI Nº 12.965, DE 23 DE ABRIL DE 2014 (MARCO CIVIL DA INTERNET). Vigência

Veto Parcial

Situação: Não consta revogação expressa
Chefe de Governo: MICHEL TEMER
Origem: Executivo
Data de Publicação: 15 de Agosto de 2018
Fonte: D.O.U de 15/08/2018, pág. nº 59
Link: Texto integral
Referenda: ---
Alteração: MPV 869, DE 27/12/2018: ALTERA ARTS. 3º, 4º, 5º, 11, 20, 26, 27, 29; ACRESCE ARTS. 55-A, 55-B, 55-C, 55-D, 55-E, 55-F, 55-G, 55-H, 55-I, 55-J, 55-K. 58-A, 58-B

LEI 13.853, DE 08/07/2019: ALTERA A EMENTA, ART. 1º, 3º, 4º, 5º, 7º, 11, 18, 23, 26, 27, 29, 41, 52, 55-A, 55-B, 55-C, 55-D, 55-E, 55-F, 55-G, 55-H, 55-I, 55-J, 55-K, 55-L, 58-A, 58-B E 65. REVOGA §§ 1º E 2º DO ART. 7º

MPV 959, DE 29/04/2020: ALTERA ART. 65 Vigência

LEI 14.010, DE 10/06/2020: ACRESCE INCISO I-A AO ART. 65

Correlação:
Veto: Mensagem de veto: MSG 451, DE 14/08/2018 - DOU DE 15/08/2018, P. 75: VETO PARCIAL - PARTE VETADA: INCISO II DO § 1º DO ART. 26; ART. 28; INCISOS VII, VIII E IX DO ART. 52; ARTS. 55 AO 59.
Assunto: CRITERIOS , TRATAMENTO , PROTEÇÃO , SEGURANÇA , SIGILO , DADOS PESSOAIS , PESSOA FISICA , PESSOA JURIDICA . ALTERAÇÃO , MARCO REGULATORIO , INTERNET , DIREITOS , USUARIO , ARMAZENAGEM , DADOS PESSOAIS , REGISTRO , EXCLUSÃO .
Classificação de direito: DIREITOS E GARANTIAS FUNDAMENTAIS .
Observação: ---

LexML API scraper

This scraper extracts the information retrieved from the LexML API and writes it to a .txt file with fields separated by *. It can also generate a lexicon file, containing all the Template:P listed under each legislation item and how frequently they are used as keywords. This script was created to run on Python 3.8.5.

Importing libraries

First, some useful libraries are imported: lxml, re, requests, collections and time. Each one has an important role in accessing the API and processing its results. Note that this is not the only method available to scrape API results; it was chosen merely because of the programmers' familiarity with it.

from lxml import etree
import re
import requests
import collections
import time

# Namespace
ns = {
    'srw_dc': 'info:srw/schema/1/dc-schema',
    'dc'    : 'http://purl.org/dc/elements/1.1/',
    'srw'   : 'http://www.loc.gov/zing/srw/',
    'xsi'   : 'http://www.w3.org/2001/XMLSchema'
}

Defining classes

The script makes use of two classes to store the scraped information: Law and Lexicon. Classes create new types of objects and bundle data and functions (or methods), which lets us work more swiftly and keep the data we are working with better organized. We chose the variable names returned by the API as the Law class attribute names, as a way of cutting a few code steps when storing the information.

class Law:
    def __init__(self):  
        self.tipoDocumento = ""
        self.date = ""
        self.urn = ""
        self.localidade = ""
        self.autoridade = ""
        self.title = ""
        self.description = ""
        self.identifier = ""
        self.subject = []
        
    def print_self(self):
        return (self.tipoDocumento + "*" + self.date + "*" + self.urn + \
               "*" + self.localidade + "*" + self.autoridade + "*" + self.title + \
               "*" + self.description + "*" + self.identifier + "*" + "*".join(self.subject) + "\n")

class Lexicon:
    def __init__(self, word, count):
        self.word = word
        self.count = count

The main function

The main function is responsible for running the other functions of the script. The first step is to define a set of parameters:

  • attributes declares which parameters we want to scrape from the API;
  • search_term declares the words used to query the items in the API; here we use %22federal+decreto.lei%22 to gather federal decree-laws and %22lei+federal%22 for federal laws;
  • file_name is the name of the file to which the legislation metadata will be written;
  • n is the number of times the API will be requested: n = (total of results / 500) + 1 (see the sketch after the main code block below for a way to derive it automatically);
  • lexicon_flag is a boolean value that determines whether we want to create a separate file for the lexicon.

Using these parameters, successive requests are made to the API, each one returning no more than 500 results at a time, in XML format. Each result is passed to the get_values function and the resulting Law object is stored in a laws list. This process continues until all results have been retrieved; the list may then have its lexicon extracted and written to a file. Finally, the results are written to the file declared in file_name.

attributes = ("tipoDocumento", "date", "urn", "localidade", "autoridade", "title", "description", "identifier", "subject")
search_term = "%22federal+decreto.lei%22"
base_url = "https://www.lexml.gov.br/busca/SRU?operation=searchRetrieve&query=urn+=" + search_term + "&maximumRecords=500&startRecord="

file_name = "saida"
n = 50 
lexicon_flag = True

if __name__ == "__main__":
    laws = []
    for i in range(0, n):
        url = base_url + str(i * 500 + 1)
        req = requests.request('GET', url)
        tree = etree.fromstring(req.content)

        # x stands for each entry in <srw_dc:dc>
        for x in tree.findall(".//srw_dc:dc", namespaces=ns):
            new_law = get_values(x)
            laws.append(new_law)
            
        # Being polite
        time.sleep(2)
            
    if (lexicon_flag):
        lexicon = get_lexicon(laws)
        print_lexicon(lexicon, file_name)

    print_scraped_info(laws, file_name)
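
As an optional aside, n can also be estimated automatically instead of being set by hand. The sketch below is not part of the original script: it assumes the SRU response carries the standard srw:numberOfRecords element (this should be confirmed against an actual LexML response) and reuses base_url, ns, requests and etree from the code above.

import math

# Optional sketch: estimate n from the total reported by the SRU API.
# Assumes the response includes the standard <srw:numberOfRecords> element;
# verify this against a real LexML response before relying on it.
def estimate_n(page_size=500):
    req = requests.request('GET', base_url + "1")
    tree = etree.fromstring(req.content)
    total = tree.findtext(".//srw:numberOfRecords", namespaces=ns)
    if total is None:
        return None  # element not found; set n by hand instead
    return math.ceil(int(total) / page_size)

# Example: n = estimate_n() or 50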

Unraveling the API result

The function get_values receives an API result entry, hereafter called item x, extracts all its attributes and stores them in a new Law object. Some formatting is necessary here because of the values and tags returned. For instance, all values are formatted to remove newline characters (\n) and all tags have excess text removed (everything before }). The subject tag is handled separately, as there may be several of them; each subject also has its string cleaned up (removal of dots and unnecessary commas) before we ensure there are no duplicates among them.

def get_values(x):
    new_law = Law()
    for i in x.iter():
        tag = i.tag
        # Remove the namespace prefix, e.g. "{http://purl.org/dc/elements/1.1/}title" -> "title"
        if "}" in tag:
            tag = tag.split("}")[1]
        if tag in attributes:
            if tag != "subject":
                if i.text:
                    i.text = i.text.replace("\n", "")
                setattr(new_law, tag, i.text)
            else:
                # Split the subject string on " , " or " . " and drop duplicates
                subjects = [s.strip() for s in re.split(r'\s[,.]\s', i.text)]
                unique_subjects = list(set(subjects))
                for j in range(0, len(unique_subjects)):
                    if " ." in unique_subjects[j]:
                        unique_subjects[j] = unique_subjects[j][:-2]
                setattr(new_law, tag, unique_subjects)
    return new_law

Writing the results

The list of Law objects created from the API results is then written to a .txt file. Each tag of each Law object is separated by a * character, so a row in this file has the format Type_of_document*date*urn*locality*authority*title*description*identifier*subject_1*...*subject_n. Again, instead of a .txt file, one could write a CSV or any other format, provided the right library is used.

def print_scraped_info(laws, file_name):
    file = open(file_name + ".txt", "w")  
    file.write("tipo de documento*data*urn*localidade*autoridade*título*descricao*identifier*assuntos->\n")
    for i in laws:
        file.write(i.print_self())
    file.close()
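
As mentioned above, the same data can be written in other formats given the right library. Below is a minimal sketch of a CSV alternative using Python's standard csv module; the helper name print_scraped_info_csv is hypothetical and mirrors the column layout of print_scraped_info.

import csv

# Sketch of the CSV alternative mentioned above (hypothetical helper).
# Each Law object becomes one row; subjects are appended as extra columns.
def print_scraped_info_csv(laws, file_name):
    with open(file_name + ".csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["tipo de documento", "data", "urn", "localidade", "autoridade",
                         "título", "descricao", "identifier", "assuntos"])
        for law in laws:
            writer.writerow([law.tipoDocumento, law.date, law.urn, law.localidade,
                             law.autoridade, law.title, law.description,
                             law.identifier] + law.subject)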

Generating a lexicon file

If you're interested in generating a file with all Template:P listed in the legislation you've searched, along with how often each subject was mentioned, set lexicon_flag = True.

The get_lexicon() function receives a list of Law objects, hereafter called laws, and returns a list of the unique terms of the lexicon. To count the frequency of each term, we only consider how many different laws mention it, as some laws list the same term more than once. The lexicon output file is also separated by the * character and is sorted in descending order of number of uses.

def get_lexicon(laws):
    lex = []
    for l in laws:
        for s in l.subject:
            lex.append(s)
    lexicon = collections.Counter(lex)
    lexicon = sorted(lexicon.items(), key = lambda lex: lex[1], reverse = True)
    return lexicon

def print_lexicon(lexicon, file_name):
    file_lex = open(file_name + "_lexicon.txt", "w")
    for key, value in lexicon:
        file_lex.write(str(value) + "*" + key + "\n")
    file_lex.close()

Merging lexicon files

If you've generated more than one lexicon file, you might want to merge them to see the frequency of all terms across those search queries. To do so, call merge_lexicon_files with a list of file names. The output will be saved to a new file, merged_lexicon_files.txt.

Only use this when you are certain there are no intersecting laws between the files (for instance, when you're searching for two different types of laws, as we did): this code won't check for duplicates.

def merge_lexicon_files(list_of_files):
    lex = {}
    for file_name in list_of_files:
        f = open(file_name)
        for line in f.readlines():
            # Each line has the format <count>*<term>
            value, key = line.split("*")
            key = key.strip()
            lex[key] = lex.get(key, 0) + int(value)
        f.close()

    # Sort by total count, in descending order
    lex = sorted(lex.items(), key=lambda l: l[1], reverse=True)

    file = open("merged_lexicon_files.txt", "w")
    for k, v in lex:
        file.write(str(v) + "*" + k + "\n")
    file.close()
    return lex

lex = merge_lexicon_files(["decretos-lei-complete_lexicon.txt", "leis-complete_lexicon.txt"])

If you've generated several lexicon files and only need a deduplicated list of their lines, you can also use the following command line on Unix systems (note that, unlike merge_lexicon_files, it does not sum the counts of a term that appears in more than one file):

sort file1.txt file2.txt | uniq > output_file

Palácio do Planalto scrapers

The Palácio do Planalto website was significantly harder to scrape. Unlike LexML, it offers no machine-readable way of accessing the entirety of its database. Moreover, it implements features that make automated access harder. Therefore, the scripts to scrape the information took far longer to run than the previous one, as we were not able to use the same libraries and had to resort to other tools.

Difficulties faced and our way around them

  1. When trying to scrape the website, we discovered that it detects automated accesses and blocks them by requesting that a CAPTCHA be filled;
  2. Although the URL for a legislation record uses a query string, each URL includes a hash, which makes it impossible to build the URLs for a request automatically;
  3. Sometimes the names of the legislation items are not the same as on the LexML website, and the site doesn't list any unique identifier (like a URN, for example).

To these problems, we found the respective solutions:

  1. With the requests library ruled out, we tried the Selenium framework, a known approach for situations like this. But the CAPTCHA feature can still detect a Selenium operation. To overcome this, we resorted to a Python library called undetected_chromedriver, which, as the name says, keeps Selenium undetected, so the website does not flag the operation as automated. We also used a few time.sleep() calls throughout the code; this was necessary because, due to internet oscillations or website response times, the time needed to load a page varies. The script still crashed a few times, roughly once every 1 000 pages visited.
  2. We developed two scripts: legislacao_scrapper, which browses the search results and gathers the URLs to all laws and decree-laws, and presidencia_scraper, which scrapes the content of each legislation's page.
  3. We matched the laws scraped in both databases by comparing their number and date. (There are instances of two different laws of the same type sharing the same number.) The remaining cases we matched manually.

legislacao_scrapper

This script visits the presidency's website and parses it, page by page, scraping the links to the record page and full text page of each law and decree-law listed on it.

The script is a Jupyter Notebook that runs on Python 3.8.5. The complete source code can be found on GitHub. It uses the undetected_chromedriver and time libraries. undetected_chromedriver is a patch for Selenium's Chromedriver, which allows us to bypass the website's automated access recognition. We use the time library to make the processes wait for a little while until the page is fully loaded.

Importing libraries and defining options

import undetected_chromedriver as uc
import time

When visiting the Palácio do Planalto website in Chrome, you usually receive a warning stating that the website's security certificate is not trusted. To avoid having to click through a few links before reaching the legislation's record page itself, we add an ignore-certificate-errors flag to the options of our undetected_chromedriver.

options = uc.ChromeOptions()
options.binary_location = "<path to Chrome app>"
options.add_argument("ignore-certificate-errors")

Running the webdriver

The next step is to initiate the driver and go to the homepage of Palácio do Planalto's legislation website.

driver = uc.Chrome(chrome_options = options)
driver.get("https://legislacao.presidencia.gov.br/")

We chose to keep the next few steps manual. This allows greater flexibility when selecting the filters, especially when the code crashes. It is also easier for users to pick the filter options they are interested in manually, rather than programming lines of code to do so.

The actions we took are specific to the Palácio do Planalto website; other sites will probably require different actions.

  • Click on "Pesquisa Avançada" (Advanced Search);
  • Pick the options to filter the legislations you are interested in. Currently, on the Palácio do Planalto website, you can query them by their type, current status, date, President in Office, person who signed it, it's Referenda Ministerial, where it originated from and which official publication it was published in;
  • Click on "Pesquisar" (Search);
  • The content of the page will be loaded, showing the legislation from the most recent to the oldest, 10 per page.

Scraping the website

With the manual actions taken above, the first page of results will be loaded. The page shows the total number of results for that selection, displayed in groups of 10 per page. As we loop to load the next pages, we need to define n, which follows the equation n = (total of results / 10) + 1.

The scraping action itself starts here. First, we try to find the div elements with the class "w-100", which contain all the elements we are interested in. If the page hasn't loaded yet, the driver waits 2 seconds before trying again.

n = 2880
for i in range(n):
    try:
        div_tags = driver.find_elements_by_class_name("w-100")

We ignore the first two and last two elements with the "w-100" class, as they are part of the page itself and not tags we are interested in. We then locate the <h4> tag, which contains the name of the law (and might require some cleaning up).

        for div_tag in div_tags[2:-2]:
            try:
                name = div_tag.find_element_by_tag_name("h4").text

We then locate the next two links contained within a <ul> tag. The first one refers to the record sheet of the law, and the second one refers to the page with its full text. The output is then printed to the console, or to a file.

                # Get the links to the record sheet of the law, as well as to the full text of it
                ul = div_tag.find_element_by_tag_name("ul").find_elements_by_tag_name("a")
                ficha = ul[0].get_attribute("href")
                text = ul[1].get_attribute("href")
                print("%s\t%s\t%s"%(name,ficha,text))
            except:
                pass

We then locate the "next page" button (it is the last button with the class "page-link"), and click it. We wait an arbitrary amount of time for the page to load, and then repeat the same process for the next pages.

        driver.find_elements_by_class_name("page-link")[-1].click()
        
        time.sleep(1)
    # If it fails to load the page, wait a few seconds and then try again for the next page.
    except:
        time.sleep(2)

If you are scraping the Palácio do Planalto website, we advise scraping a few thousand pages at a time. This can be done by setting different values for n, or by filtering the laws a couple of decades at a time. Defining the arbitrary wait times is a matter of trial and error.

Troubleshooting

On occasion, the driver would load a blank white screen. This might happen for several reasons: too many requests to the website, a fluctuation in the internet connection, among others. To troubleshoot this issue, we suggest the following steps:

  1. Stop the script and close the driver window;
  2. Restart the script and rerun it up to the manual selection stage;
  3. Check the date printed in the last line of the file. On the driver's page, change the "De" ("From") and "Até" ("To") dates on the "Pesquisa Avançada" panel to 01/01/1500 and to a day after the last date printed, respectively, and click on "Pesquisar";
  4. Run the next section of the script.

presidencia_scraper

This script opens a .txt file generated by legislacao_scrapper, visits the URLs listed in it one by one, and scrapes their content. The URLs are stored in the format https://legislacao.presidencia.gov.br/atos/?tipo=<type>&numero=<number>&ano=<year>&data=<date in DD/MM/YYYY format>&ato=<hash>. The script is a Jupyter Notebook that runs on Python 3.8.5. The complete source code can be found on GitHub.
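
As a side illustration (not part of the scraper itself), the non-hash fields of such a URL can be read back with Python's standard urllib.parse; the URL below is a made-up example that follows the format above, with the ato hash left as a placeholder.

from urllib.parse import urlparse, parse_qs

# Illustration only: extracting the query-string fields from a record URL.
# The URL is a hypothetical example following the format described above.
example_url = ("https://legislacao.presidencia.gov.br/atos/"
               "?tipo=LEI&numero=13709&ano=2018&data=14/08/2018&ato=<hash>")
params = parse_qs(urlparse(example_url).query)
print(params["tipo"][0], params["numero"][0], params["ano"][0], params["data"][0])
# LEI 13709 2018 14/08/2018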

Importing libraries and defining options

The script uses the time, re, bs4, undetected_chromedriver and selenium libraries. undetected_chromedriver is a patch for Selenium's Chromedriver that allows us to bypass the website's automated access detection. Using the regular Chromedriver might work if only a few accesses are made; however, even with randomized waiting timers, the website would eventually request that a CAPTCHA be filled. We never had that problem again after switching to undetected_chromedriver. Here we use Selenium's wait functionality to show how it can be used. In the previous script this functionality was not implemented because, although the elements were not loaded, they were already present, empty, in the source code of the pages, which made Selenium's wait unfeasible there.

import time
import re
from bs4 import BeautifulSoup
import undetected_chromedriver as uc
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

We use the same options as before, for the same reason: the warning that the website's security certificate is not trusted.

options = uc.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = uc.Chrome(chrome_options = options)

Defining classes and attributes

The script uses a Law_presidencia class to store the data it scrapes.

class Law_presidencia:
    def __init__(self):  
        self.titulo = ""
        self.data_assinatura = ""
        self.ementa = ""
        self.situacao = ""
        self.chefe_governo = ""
        self.origem = ""
        self.data_publicacao = ""
        self.fonte = ""
        self.fonte_link = ""
        self.texto_integral = ""
        self.referenda = ""
        self.alteracao = ""
        self.correlacao = ""
        self.veto = ""
        self.assunto = ""
        self.classificacao_direito = ""
        self.obs = ""
        
    def print_self(self):
        return(self.titulo + "*" + self.data_assinatura + "*" + self.ementa + "*" + self.situacao + "*" \
              + self.chefe_governo + "*" + self.origem + "*" + self.data_publicacao + "*" \
              + self.fonte + "*" + self.fonte_link + "*" + self.texto_integral + "*" \
              + self.referenda + "*" + self.alteracao + "*" + self.correlacao + "*" \
              + self.veto + "*" + self.assunto + "*" + self.classificacao_direito + "*" \
              + self.obs + "\n")

The items we are scraping are organized as shown below. If you wish to see an item's complete source code, you can check an example here. The name of the attribute we're interested in is always stored within an <h2> tag. The value of that attribute is always located in the <div> right next to it.

<li class="list-group-item border-0 p-0">
	<div class="row">
		<div class="col-sm-2 label p-2">
			<h2>Data de Publicação</h2>						<!-- attribute name-->
		</div>
		<div class="col-sm  bg-conteudo p-2 text-justify">  
			30 de Março de 2021	                            <!-- attribute value -->
		</div>
	</div>
</li>
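
As a minimal sketch (run against the snippet above rather than the live page), this is how the <h2>/value pairing can be read with BeautifulSoup's find_next, which is what the scraping function below relies on.

from bs4 import BeautifulSoup

# Sketch only: parsing the example <li> above to show the <h2>/value pairing.
html = """
<li class="list-group-item border-0 p-0">
  <div class="row">
    <div class="col-sm-2 label p-2"><h2>Data de Publicação</h2></div>
    <div class="col-sm bg-conteudo p-2 text-justify">30 de Março de 2021</div>
  </div>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")
print(h2.text.strip())               # attribute name: "Data de Publicação"
print(h2.find_next().text.strip())   # attribute value: "30 de Março de 2021"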

We use a dictionary to map how each attribute's name is written on the website to how it is referred to in the Law_presidencia class.

dictionary = {
    "Data de assinatura:":       "data_assinatura",
    "Ementa:":                   "ementa",
    "Situação:":                 "situacao",
    "Chefe de Governo:":         "chefe_governo",
    "Origem:":                   "origem",
    "Data de Publicação:":       "data_publicacao",
    "Referenda:":                "referenda",
    "Alteração:":                "alteracao",
    "Correlação:":               "correlacao", 
    "Veto:":                     "veto", 
    "Assunto:":                  "assunto", 
    "Classificação de direito:": "classificacao_direito", 
    "Observação:":               "obs"
}

Scraping a webpage

The scraping action itself takes place in the scrape_page_presidencia function. It receives a url to a law's record page and returns a Law_presidencia item with all the attributes of interest that it found on the page. As the function itself is quite long, we will break it down chunk by chunk.

To start, it accesses the received url. To check whether the page has fully loaded, the code waits for up to 10 seconds until the element with id barra-brasil is detected on the page before proceeding (a TimeoutException is raised if it never appears).

def scrape_page_presidencia(url):
    driver.get(url)
    element_present = EC.presence_of_element_located((By.ID, 'barra-brasil'))
    WebDriverWait(driver, 10).until(element_present)

A new Law_presidencia object is created. The page's source code is parsed by the BeautifulSoup library. We then locate all the attribute names on the page by looking up all the <h2> tags; this information is saved in a tags variable. The title of the law is identified by looking up the sole <h1> in the page's source code.

    current_law = Law_presidencia()
    soup = BeautifulSoup(driver.page_source, "html.parser")

    tags = soup.findAll('h2')
    
    time.sleep(0.5)
    
    current_law.titulo = soup.find('h1').text.strip()

We then loop over each tag contained in tags. To find the value of each attribute, we use tag.find_next(). We clean up the value, and then take different approaches depending on what the tag_text (the attribute's name) is.

    for tag in tags:
        tag_text = tag.text.strip()
        value = tag.find_next().text.strip()
        value = re.sub(' +', ' ', value)
        value = value.replace('\n', '').replace('\r', '').capitalize()

If tag_text is "Link:", that means it is a link to the law's full text. We save the href of that link tag to the texto_integral (full text) attribute of the current_law.

Else, if it's "Fonte:", that means it's a reference to the official publication of that law. We save that plain text to the fonte (source) attribute of the current_law. Then, we check if there's a link within that text. If there is one, we save the href of that link tag to the fonte_link (source link) attribute of the current_law. If there isn't one, we set the fonte_link value to "".

Else, we check if tag_text is contained within our dictionary. When an attribute on Palácio do Planalto has no value, it's listed as: "<Attribute name>: ---". Hence, if the value we identify is "---", we change that to "", an empty string. We then set the current_law's value for the attribute in question to value.

Once we've run through all the listed tags, we return the current_law item.

        if tag_text == "Link:":
            current_law.texto_integral = tag.find_next().find('a')['href']
        elif tag_text == "Fonte:":
            current_law.fonte = value
            try:
                current_law.fonte_link = tag.find_next().find('a')['href']
            except:
                current_law.fonte_link = ""
        
        elif tag_text in dictionary:
            if(value == "---"):
                value = ""
            setattr(current_law, dictionary[tag_text], value)
    return current_law

The main function

The main part of the script is shown below. We open the file generated by the legislacao_scrapper script and read it line by line. We then parse those lines, one at a time, splitting them at ;. We take the second element of the split list, which is the url to the law's record page. We then scrape it, and add the Law_presidencia item returned to a laws list.

if __name__ == "__main__":

    file = open("links_leis_federais_planalto.csv", "r")

    lines = file.readlines()[11600:]

    i = 0
    for line in lines:
        url = line.split(";")[1]
        laws.append(scrape_page_presidencia(url))
        i += 1

We then print the output to a .txt file, separated by *.

    saida = open("saida.txt", "w")
    saida.write("nome*data assinatura*ementa*situacao*chefe_governo*origem*data_publicacao" +\
                "*fonte*fonte_link*texto integral*referenda*alteracao*correlacao*veto" +\
                "*assunto*classificacao direito*observacao\n")
    for l in laws:
        saida.write(l.titulo + "*" + l.data_assinatura + "*" + l.ementa + "*" + l.situacao + "*" \
                    + l.chefe_governo + "*" + l.origem + "*" + l.data_publicacao + "*" \
                    + l.fonte + "*" + l.fonte_link + "*" + l.texto_integral + "*" \
                    + l.referenda + "*" + l.alteracao + "*" + l.correlacao + "*" \
                    + l.veto + "*" + l.assunto + "*" + l.classificacao_direito + "*" \
                    + l.obs + "\n")
    saida.close()

Uniting the results from both scrapers

LexML and Palácio do Planalto's legislation website don't share a common identifier. As such, the results were matched by comparing the titles of the items in each system; more specifically, we compared the date and number in their titles. This match was made by concatenating these values in a DD-MM-YYYY;number format in Google Sheets and comparing the results using the VLOOKUP formula.

However, this didn't always work, due to a few discrepancies between the titles of the legislation items in the two systems. In such cases, as they are a minority of the whole batch, we had no option but to match them manually. In the end, there were 74 federal laws and 171 decree-laws whose date and/or number in Palácio do Planalto differed from what is listed in the LexML platform. If you are scraping the laws of your own country, or any kind of data, from more than one system, it is important to keep in mind that this is a problem you might run into.
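
For reference, here is a minimal sketch of the same matching step done in Python rather than Google Sheets. The helper names are hypothetical, and the sketch assumes the date (in DD-MM-YYYY format) and number have already been extracted from each title in both datasets.

# Hypothetical helpers, not part of the original workflow: match records from
# both systems on a "DD-MM-YYYY;number" key, the same key used in Google Sheets.
def build_index(records):
    # records: iterable of dicts with 'date' (DD-MM-YYYY) and 'number' keys
    return {f"{r['date']};{r['number']}": r for r in records}

def match_records(lexml_records, planalto_records):
    planalto_index = build_index(planalto_records)
    matched, unmatched = [], []
    for r in lexml_records:
        key = f"{r['date']};{r['number']}"
        if key in planalto_index:
            matched.append((r, planalto_index[key]))
        else:
            unmatched.append(r)  # left for manual matching
    return matched, unmatched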

Types of Brazilian laws within Wikidata

The Brazilian laws already existing in Wikidata can be found through the following SPARQL query.

Types of laws, by number of instances

Template:SPARQL