---
title: "Introduction to getLattes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{introduction_getLattes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

<!--
Opções no chunk

eval        incluir o resultado do código executado, pode ser logico ou numérico
echo        mostrar o código 
warning     mostrar mensagens de aviso
error       mostrar mensagens de erro
message     mostrar mensagens
tidy        mostrar ajustar o código ao display, ignora identação
comment     '##' ou qq símbolo, para os resultados dos códigos serem antecedidos por ##
include     se falso, executa mas não inclui o chunk no relatório
-->

The [Lattes](http://lattes.cnpq.br/) platform has been hosting curricula of Brazilian researchers since the late 1990s, containing more than 5 million curricula. The data from the Lattes curricula can be downloaded to `XML` format, the complexity of this reading process motivated the development of the `getLattes` package, which imports the information from the `XML` files to a list in the `R` software and then tabulates the Lattes data to a `data.frame`.

The main information contained in `XML` files, and imported via `getLattes`, are:

- Research Area `getAreasAtuacao()`  
- Published Papers `getArtigosPublicados()`  
- Accepted Papers `getArtigosAceitos()`  
- Profissional Links `getAtuacoesProfissionais()`  
- Ph.D. Examination Board's `getBancasDoutorado()`  
- Undergraduate Examination Board's `getBancasGraduacao()`  
- Master Examination Board's `getBancasMestrado()`  
- Books Chapters `getCapitulosLivros()`  
- General Data `getDadosGerais()`  
- Profissional Address `getEnderecoProfissional()`  
- Events and Congresses `getEventosCongressos()`  
- Profissional Formation (Ph.D. Thesis) `getFormacaoDoutorado()`  
- Profissional Formation (Master Thesis) `getFormacaoMestrado()`  
- Profissional Formation (Undergraduation) `getFormacaoGraduacao()`  
- Languages `getIdiomas()`  
- Research Lines `getLinhaPesquisa()`  
- Published Books `getLivrosPublicados()`  
- Event's Organization `getOrganizacaoEvento()`
- Academic Advisory (Ph.D. Thesis) `getOrientacoesDoutorado()`  
- Academic Advisory (Master Thesis) `getOrientacoesMestrado()`  
- Academic Advisory (Post Doctorate) `getOrientacoesPosDoutorado()`  
- Other Technical Productions `getOutrasProducoesTecnicas()`  
- Participation in Projects `getParticipacaoProjeto()`  
- Technical Production `getProducaoTecnica()`  
- Patents `getPatentes()`  
- Patents `getTrabalhosEmEventos()`  
- Personal Lattes 16 digits identification `getId()`  

From the functionalities presented in this package, the main challenge to work with the Lattes curriculum data is now to download the data, as there are Captchas. To download a lot of curricula I suggest the use of [Captchas Negated by Python reQuests - CNPQ](https://github.com/josefson/CNPQ). The second barrier to be overcome is the management and processing of a large volume of data, the whole Lattes platform in `XML` files totals over 200 GB. In this tutorial we will focus on the `getLattes` package features, being the reader responsible for download and manage the files.

Follow an example of how to search and download data from the [Lattes](http://lattes.cnpq.br/) website.

![](http://roneyfraga.com/volume/keep_it/lattes_busca_curriculo.gif)

## getLattesWeb

Alternative for no-coders:

- link 1 [https://roneyfraga.shinyapps.io/getlattesweb/](https://roneyfraga.shinyapps.io/getlattesweb/)
- link 2 [http://roneyfraga.com/shiny/getLattesWeb/](http://roneyfraga.com/getLattesWeb/)

![](http://roneyfraga.com/volume/keep_it/getLattesWeb_exemplo.gif)

## Installation

To install the newest released version of getLattes from [github](https://CRAN.R-project.org).

```{r eval=F}
# install and load devtools from CRAN
# install.packages("devtools")
library(devtools)

# install and load getLattes
devtools::install_github("roneyfraga/getLattes")
```

Stable version from [CRAN](https://cran.r-project.org/).

```{r eval=F, include=T}
install.packages('getLattes')
```

Load `getLattes`.

```{r eval=T, warning=FALSE, message=FALSE}
library(getLattes)

# support packages
library(xml2)
library(dplyr)
library(tibble)
library(purrr)
```

## Single curriculum

### Import 

Using the `get*` functions to import data from a single curriculum is straightforward. The curriculum need to be imported into `R` by the `read_xml()` function from the `xml2` package. 

```{r eval=T, include=T}
# find the file in system
zip_xml <- system.file('extdata/4984859173592703.zip', package = 'getLattes')

curriculo <- xml2::read_xml(zip_xml)
```

### `get` functions

```{r eval=F}
getDadosGerais(curriculo)
getAreasAtuacao(curriculo)
getArtigosPublicados(curriculo)
getAtuacoesProfissionais(curriculo)
getBancasDoutorado(curriculo)
getBancasGraduacao(curriculo)
getBancasMestrado(curriculo)
getCapitulosLivros(curriculo)
getDadosGerais(curriculo)
getEnderecoProfissional(curriculo)
getEventosCongressos(curriculo)
getFormacaoDoutorado(curriculo)
getFormacaoGraduacao(curriculo)
getFormacaoMestrado(curriculo)
getIdiomas(curriculo)
getLinhaPesquisa(curriculo)
getLivrosPublicados(curriculo)
getOrganizacaoEventos(curriculo)
getOrientacoesDoutorado(curriculo)
getOrientacoesMestrado(curriculo)
getOrientacoesPosDoutorado(curriculo)
getOutrasProducoesTecnicas(curriculo)
getParticipacaoProjeto(curriculo)
getPatentes()
getProducaoTecnica(curriculo)
getTrabalhosEmEventos()
getId(curriculo)
```

## Several curricula 

### Import

To import data from two or more curricula it is easier to use `list.files()`, a native R function, or `dir_ls()` from `fs` package. As `xml2::read_xml()` allow to read a `xml` curriculum inside a `zip` files.

```{r eval=T, warning=FALSE, message=FALSE}
# find the files in system
zips_xmls <- c(system.file('extdata/4984859173592703.zip', package = 'getLattes'),
               system.file('extdata/3051627641386529.zip', package = 'getLattes'))
```

Import the listed curricula to R memory as `xml2::read_xml` object.

```{r eval=T, warning=FALSE, message=FALSE}
curriculos <- lapply(zips_xmls, read_xml)
```

The `lapply()` function is a well-known and widely used alternative in the `R` world. However, it does not natively handle errors, which makes the `map` function from the `purrr` package an excellent alternative. 

Adding an extra layer of complexity, I will use pipe `|>`. Programming using the pipe operator `|>` allows faster coding and clearer syntax.

```{r eval=T, warning=FALSE, message=FALSE}
curriculos <- 
    purrr::map(zips_xmls, safely(read_xml)) |> 
    purrr::map(pluck, 'result') 
```

### `get` functions

To read data from only one curriculum any function `get` can be executed singly, but to import data from two or more curricula is easier to use `get*` functions with `lapply()` or `map()`.

```{r eval=T, warning=FALSE, message=FALSE}
dados_gerais <- 
    purrr::map(curriculos, safely(getDadosGerais)) |>
    purrr::map(pluck, 'result') 

dados_gerais
```

Import general data from 2 curricula. The output is a list of data frames, converted by a unique data frame with `bind_rows()`.

```{r eval=T, warning=FALSE, message=FALSE}

dados_gerais <- 
    purrr::map(curriculos, safely(getDadosGerais)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

glimpse(dados_gerais)
```

It is worth remembering that all variable names obtained by `get*` functions are the transcription of the field names in the `XML` file, the `-` being replaced with `_` and the capital letters replaced with lower case letters.

## Publications

```{r eval=T, warning=FALSE, message=FALSE}
artigos_publicados <- 
    purrr::map(curriculos, safely(getArtigosPublicados)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

artigos_publicados |>
    dplyr::arrange(desc(ano_do_artigo)) |>
    dplyr::select(titulo_do_artigo, ano_do_artigo, titulo_do_periodico_ou_revista) 

livros_publicados <- 
    purrr::map(curriculos, safely(getLivrosPublicados)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

capitulos_livros <- 
    purrr::map(curriculos, safely(getCapitulosLivros)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 
```

## Grouping data

To group the data key variable is `id`, which is a unique 16 digit code. 

```{r eval=T, warning=FALSE, message=FALSE}

artigos_publicados2 <- 
    dplyr::group_by(artigos_publicados, id) |>
    dplyr::tally(name = 'artigos') 

artigos_publicados2

livros_publicados2 <- 
    dplyr::group_by(livros_publicados, id) |>
    dplyr::tally(name = 'livros') 

livros_publicados2

capitulos_livros2 <- 
    dplyr::group_by(capitulos_livros, id) |>
    dplyr::tally(name = 'capitulos') 

capitulos_livros2
```

## Merge data

to join the data from different tables the recommended variable is `id`, which is a unique 16 digit code. 

```{r eval=T, warning=FALSE, message=FALSE}

artigos_publicados2 |>
    dplyr::left_join(livros_publicados2) |>
    dplyr::left_join(capitulos_livros2)
```

Add information from a different tables.

```{r eval=T, warning=FALSE, message=FALSE}

artigos_publicados2 |>
    dplyr::left_join(livros_publicados2) |>
    dplyr::left_join(capitulos_livros2) |>
    dplyr::left_join(dados_gerais |> dplyr::select(id, nome_completo)) |>
    dplyr::select(nome_completo, artigos, livros, capitulos) 
```

## Export to RIS format

```{r eval=F, echo=T, warning=FALSE, message=FALSE}

writePublicationsRis(artigos_publicados, 
                     filename = '~/Desktop/artigos_nome_citacao.ris', 
                     citationName = T, 
                     append = F, 
                     tableLattes = 'ArtigosPublicados')

# full author name, ex: Antonio Marcio Buainain
writePublicationsRis(artigos_publicados, 
                     filename = '~/Desktop/artigos_nome_completo.ris', 
                     citationName = F, 
                     append = F,
                     tableLattes = 'ArtigosPublicados')

writePublicationsRis(livros_publicados, 
               filename = '~/Desktop/livros.ris', 
               append = F, 
               citationName = T, 
               tableLattes = 'Livros')

writePublicationsRis(capitulos_livros, 
                     filename = '~/Desktop/capitulos_livros.ris', 
                     append = T,
                     citationName = F, 
                     tableLattes = 'CapitulosLivros')

```